System Observability

System Observability: The Different Demands

A considerable wave of innovation has spawned several megatrends that affect how we manage and view our network infrastructure, and with it the need for observability. In essence, we have seen the decomposition of everything: from one to many. Instead of a monolith where everything is generally housed internally, many services and dependencies in multiple locations must now be managed and operated. These megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different system observability tools and practices.

There has also been a shift in the point of control. As we move towards new technologies, many of these loosely coupled services, and the infrastructure your services rely on, are not directly under your control. The edge of control has been pushed outward, creating different types of network and security perimeters. These perimeters now sit closer to the workload than to a central security stack, so the workloads themselves must be concerned with security. For a more detailed explanation of these changes that drive the need for good observability, and how they may affect you, see the full 2-hour course I did for Pluralsight on DevOps operational strategies: DevOps: Operational Strategies.

 

System Observability Design

Diagram: System Observability Design.

 

  • How This Affects Failures

The major issue I have seen with my clients is that application failures are no longer predictable: dynamic systems can fail in very creative ways, challenging existing monitoring solutions and, more importantly, the practices that support them. We have many partial failures that are not just unexpected but have never been seen before. For example, if you recall, we have the network hero.

 

  • The Network Hero

The network hero is someone who knows every part of the network and has seen every failure at least once before. That role is no longer enough in today’s world; you need proper observability. When I was working as an engineer, we would have plenty of failures, but more than likely we had seen them before, and there was a system in place to fix the error. Today’s environment is much different. We can no longer rely on simply seeing up or down, setting static thresholds, and then alerting on those thresholds. A key point to note at this stage is that none of these thresholds consider the customer’s perspective. If your POD is running at 80% CPU, does that mean the customer is unhappy? When looking to monitor, you should look from your customer’s perspective at what matters to them. Content delivery networks (CDNs) were among the first to recognize this and measure what matters most to the customer.

 

The Different Demands

These new, complex distributed systems place very different demands on your infrastructure and on the people who manage it. For example, several things can go wrong with a particular microservice:

    • The microservice could be running under high resource utilization and therefore be slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice itself could be fine, but slow-running database queries could be holding up its responses.

So we have a lot of partial failures; a quick sketch of the first of these failure modes follows below.
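The sketch below illustrates that timeout case in Python: a downstream call that times out while the rest of the request still succeeds. The service URL, the 500 ms timeout, and the use of the requests library are assumptions made purely for illustration.

```python
# A minimal sketch of a partial failure: the downstream call times out,
# but the caller degrades gracefully instead of failing the whole request.
# The URL and timeout are illustrative assumptions.
import requests


def fetch_recommendations(user_id: str) -> list:
    try:
        resp = requests.get(
            f"https://recommendations.internal/users/{user_id}",  # hypothetical service
            timeout=0.5,  # fail fast rather than stacking latency upstream
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # Partial failure: the page still renders, just without recommendations.
        return []
```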

 

Therefore: We Can No Longer Predict

The big shift we see with software platforms is that they are evolving much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies, with dedicated platform teams, along with good system observability. We really can’t predict much anymore, which puts the brakes on some traditional monitoring approaches, especially metrics-based monitoring. I’m not saying these monitoring tools don’t do what you want them to do, but they work in siloed environments and lack connectivity. So we have monitoring tools working in silos in different parts of the organization, more than likely managed by different people, trying to monitor a very dispersed application with multiple components and services in various places.

 

Diagram: Prometheus Monitoring: Link to YouTube video.

 

  • Relying On Known Failures: Metric-Based Approach

A metrics-based monitoring approach relies on having encountered known, predictable failure modes in the past. Someone decides in advance which thresholds count as abnormal, monitoring detects when the system goes over or under those preset thresholds, and we set alerts and hope they are actionable. This is only useful for variants of predictable failure modes. Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred, but they don’t let you dig into the source of problems, slice and dice the data, or see correlations between errors. In a complex system, this approach makes it harder to get to the root cause in a reasonable timeframe.

With traditional metrics systems, you had to define custom metrics, and these were always defined upfront. With this approach, we can’t ask new questions about problems after the fact; the questions have to be decided in advance. We then set performance thresholds, pronounce them “good” or “bad,” and check and re-check those thresholds, tweaking them over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. We want to be always looking and always observing instead of waiting for a problem, such as a threshold being reached, before acting.
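To make the limitation concrete, here is a minimal sketch of that static-threshold style of check. It is purely illustrative: the metric fetch and the alerting hook are hypothetical stand-ins rather than a real monitoring API, and the 80% threshold is an assumed, pre-defined value.

```python
# A minimal sketch of static-threshold alerting. fetch_cpu_percent and
# notify_oncall are hypothetical stand-ins for a metrics backend and an
# alerting hook; the threshold is decided upfront and rarely revisited.
from typing import Callable

CPU_ALERT_THRESHOLD = 80.0  # percent


def check_cpu(host: str,
              fetch_cpu_percent: Callable[[str], float],
              notify_oncall: Callable[[str], None]) -> None:
    """Alert when CPU crosses a fixed line, regardless of user impact."""
    cpu = fetch_cpu_percent(host)
    if cpu > CPU_ALERT_THRESHOLD:
        # The alert says a number was crossed; it says nothing about whether
        # any customer request was actually slow or failed.
        notify_oncall(f"{host} CPU at {cpu:.1f}% (threshold {CPU_ALERT_THRESHOLD}%)")
```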

System Observability Analysis

Diagram: System Observability Analysis.

 

  • Metrics: Lack of Connective Event

Metrics do not retain the connective events. As a result, you cannot ask new questions of the existing dataset. Traditional system metrics can miss unexpected failure modes in complex distributed systems, and the condition detected via system metrics might be unrelated to what is actually happening. For example, an abnormal number of running threads on one component might indicate that garbage collection is in progress, or it might indicate that slow response times are imminent in an upstream service.

 

  • User Experience: Static Thresholds

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in different ways, using different components, providing experiences that can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over. Alerts should be few and triggered only by symptoms that directly impact user experience, not because a threshold was reached.

 

  • The Challenge: Static Thresholds Can’t Reliably Reflect User Experience

If you are using static thresholds, they can’t reliably indicate issues with user experience. Alerts should be set up to detect failures that impact user experience, and traditional monitoring falls short here. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience. Modern systems change shape dynamically under different workloads, so static thresholds can’t reflect impacts on user experience; they lack context and are too coarse.
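By contrast, a symptom-focused check measures what the user actually experiences. The sketch below is a rough illustration under assumed values: the Request shape, the 300 ms latency target, and the 0.1% tolerance are examples, not prescriptions.

```python
# A minimal sketch of a symptom-based check: alert on the share of slow or
# failed requests (what the user feels) rather than on a fixed CPU number.
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    status_code: int


def bad_event_ratio(requests: List[Request], latency_target_ms: float = 300.0) -> float:
    """Fraction of requests that were slow or returned a server error."""
    if not requests:
        return 0.0
    bad = sum(1 for r in requests
              if r.latency_ms > latency_target_ms or r.status_code >= 500)
    return bad / len(requests)


def should_alert(requests: List[Request], tolerance: float = 0.001) -> bool:
    """Page someone only when user-visible symptoms exceed the objective."""
    return bad_event_ratio(requests) > tolerance
```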

 

The Need For System Observability

System observability is a practice. Rather than just focusing on a tool that does logging, metrics, or alerting, observability is all about how you approach problems, and for this you need to look at your culture. You could say that observability is a cultural practice that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past. Nowadays we need a different viewpoint, and we generally want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers, physical or virtual, the network, and how data looks both in transit and at rest. What level of observation do you need to know that everything is performing as it should? And what should you be looking at to get this level of detail?

Monitoring is knowing the data points and the entities we are gathering from; observability is what happens when you put all of that data together. Monitoring is the act of collecting data, and observability is bringing it together into a single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is getting the data into the systems.

 

The 3 Pillars of Observability

We have three pillars of system observability: metrics, traces, and logging. It is an oversimplification to define or view observability as just having these pillars, but you do need them in place; observability is all about how you connect the dots between them. If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in a service request’s execution, so it doesn’t matter if services have complex dependencies. You could say that the complexity of dynamic systems is abstracted away by distributed tracing.
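As a rough illustration of connecting those dots, the sketch below tags a log line and a latency measurement with the same request identifier that a trace would carry, so the three signals can be joined later. The handler name and the in-memory stand-in for a metrics backend are assumptions made for the example.

```python
# A minimal sketch of correlating pillars: a log line and a latency metric
# share the request_id a trace would carry. The list stands in for a real
# metrics backend, and handle_checkout is a hypothetical request handler.
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(levelname)s %(request_id)s %(message)s")
log = logging.getLogger("checkout")

latency_by_request = []  # (request_id, elapsed_ms) pairs


def handle_checkout() -> None:
    request_id = str(uuid.uuid4())  # in practice this would be the trace id
    start = time.perf_counter()
    log.info("checkout started", extra={"request_id": request_id})
    # ... business logic would run here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    latency_by_request.append((request_id, elapsed_ms))
    log.info("checkout finished in %.1f ms", elapsed_ms,
             extra={"request_id": request_id})
```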

 

Distributed Tracing

Diagram: Distributed Tracing: Link to YouTube video.

 

  • Use Case: Challenges Without Tracing

For example, latency can stack up if a downstream database service experiences performance bottlenecks, so end-to-end latency is high. By the time that latency is detected three or four layers upstream, it can be incredibly difficult to identify which component of the system is the root of the problem, because the same latency is now being seen in dozens of other services.

 

  • Distributed Tracing: A Winning Formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Distributed tracing shows the relationships between the various services and components in such a system and helps you understand their interdependencies. Unless those relationships are clearly understood, they can obscure problems and make them particularly difficult to debug.

 


System Reliability in an Unpredictable World

There have been considerable shifts in our technology landscape that have caused us to examine how we operate and run our systems and networks. The introduction of various cloud platforms, their services, and containers, along with the complexity of managing distributed systems, has revealed large gaps in our current practices and technologies, not to mention flaws in the operational practices around them. All of this has driven a welcome wave of innovation in system reliability. Yet some of the technologies and tools used to manage these systems have not kept pace; many have stayed relatively static in our dynamic environment. So we have static tools used in a dynamic environment, which causes friction.

The big shift we see with software platforms is that they are evolving much quicker than the products and paradigms we use to monitor them. We need to consider new practices and technologies, with dedicated platform teams, to enable a new era of system reliability.

 

    • Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they capture and examine signals in isolation. They work in siloed environments, much like developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” common in modern distributed systems, which often leads to service disruptions. So you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today have little sense of predictability, certainly not enough to rely on static thresholds, alerts, and old monitoring tools. If something is static, it can be automated, and we do have static events, such as a Kubernetes POD failing or reaching a limit, after which its ReplicaSet brings up a replacement pod on another node, as long as constraints such as Kubernetes labels and node selectors are satisfied. However, this is only a small piece of the failure puzzle in a distributed environment. Today we have what are known as partial failures, and systems that fail in very creative ways.
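As a rough illustration of that kind of static, automatable state, the sketch below uses the official Kubernetes Python client to show which node each pod matching a label landed on. The default namespace and the app=web label are assumptions for the example; any real selector would do.

```python
# A minimal sketch, assuming the `kubernetes` Python client is installed and a
# kubeconfig is available. The namespace and label selector are illustrative.
from kubernetes import client, config


def show_pod_placement(namespace: str = "default", selector: str = "app=web") -> None:
    """Print where pods matching a label selector were scheduled and their phase."""
    config.load_kube_config()  # use config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    for pod in pods.items:
        print(f"{pod.metadata.name}: node={pod.spec.node_name}, phase={pod.status.phase}")


if __name__ == "__main__":
    show_pod_placement()
```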

 

    • System Reliability: Creative Ways to Fail

So we know that some of these failures are easily predicted and acted upon. For example, if a Kubernetes node reaches a certain utilization, we can automatically reschedule PODs onto a different node to stay within our known scale limits. Predictable failures can be automated, and not just in Kubernetes but with any infrastructure; an Ansible script is useful when we have predictable events. However, we have much more to deal with than POD scaling; we have many partial failures and complicated failure modes known as black holes.

 

    • Today’s World of Partial Failures

Microservices applications are distributed and susceptible to many external factors. In the traditional monolithic application style, on the other hand, all the functions reside in the same process. It was either switched on or off, and not much happened in between. If there was a failure in the process, the application as a whole failed. The results were binary, usually either up or down, and with some basic monitoring in place this was easy to detect; failures were predictable. There was no such thing as a partial failure, and a major benefit of monoliths is precisely that you don’t have partial failures.

However, in a cloud-native world, where we have taken the old monolith and broken it into a microservices-based application, a request made from a client can go through multiple hops of microservices, and we can have several problems to deal with. There is a lack of connectivity between the different domains. There will be many monitoring tools and knowledge tied to each domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is the key metric to care about.

 


Diagram: Chaos Engineering – How to start a project: Link to YouTube video.

 

Today You Have No Way to Predict

So the new, modern, complex distributed systems place very different demands on your infrastructure, considerably different from the simple three-tier application where everything was generally housed in one location. We really can’t predict much anymore, which puts the brakes on some traditional monitoring approaches. When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcome strategy.

 

A Quick Note on Black Holes: Strange Failure Modes

In a distributed system, many things can happen. A service or region can disappear entirely, or vanish for a few seconds or milliseconds and then reappear. We describe this as going into a black hole: anything that enters it disappears. These strange failure modes are unexpected and surprising; there is certainly nothing predictable about them. So what happens when your banking transactions are in the black hole? What if your bank balance is displayed incorrectly, or you make a transfer to an external account and it does not show up? I did a demo on this in my training course, where I examined the effects of black holes on system reliability using a sample application called Bank of Anthos, in the course DevOps: Operational Strategies.

 

Highlighting Site Reliability Engineering (SRE) and Observability

The practices of Site Reliability Engineering (SRE) and observability are what is needed to manage these kinds of unpredictable and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing its practices. Usually, about 20% of your issues cause 80% of your problems, so you need to be proactive, fix those issues upfront, and get ahead of the curve to stop incidents from occurring. In practice, this shift often happens in the wake of a massive incident, which acts as a teachable moment and gives people a reason to listen to a chaos engineering project.

 


Diagram: Site Reliability Engineering and Observability: Link to YouTube video.

 

  • New Tools and Technologies: Distributed Tracing

We have new tools such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so we can figure out where the time has been spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our systems and producing those traces.
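As a rough sketch of what that instrumentation looks like with the OpenTelemetry Python SDK, the example below creates a parent span for a request and a child span for a downstream call, exporting to the console purely for illustration; the span names and attribute are assumptions.

```python
# A minimal OpenTelemetry sketch: a parent span for a request and a child span
# for the downstream call. Requires opentelemetry-api and opentelemetry-sdk;
# the span names and attribute values are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle-checkout") as span:
    span.set_attribute("order.items", 3)
    with tracer.start_as_current_span("query-inventory"):
        pass  # the downstream call's time shows up as a child span
```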

 

  • SLA, SLI, SLO, and Error Budgets

We don’t just want to know that something has happened and then react to the event; that is not looking at things from the customer’s perspective. We need to understand whether we are meeting our SLAs by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) assist with these measurements, but they do more than that: they offer a tool for achieving better reliability and form the base of the Reliability Stack.
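As a small worked example of how an SLO translates into an error budget, the sketch below assumes a 99.9% availability target over a 30-day window; the request counts are illustrative numbers, not measurements.

```python
# A minimal worked example: turning an availability SLO into an error budget.
# The target, window size, and request counts are illustrative assumptions.
SLO_TARGET = 0.999            # 99.9% of requests in the window should succeed
WINDOW_REQUESTS = 10_000_000  # requests served over the 30-day window
BAD_REQUESTS_SO_FAR = 4_200   # failed or too-slow requests observed so far

error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS  # 10,000 allowed bad requests
remaining = error_budget - BAD_REQUESTS_SO_FAR     # budget left to "spend"
burn = BAD_REQUESTS_SO_FAR / error_budget          # share of the budget consumed

print(f"Error budget: {error_budget:,.0f} bad requests")
print(f"Consumed:     {burn:.1%}   Remaining: {remaining:,.0f}")
```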