
Observability vs Monitoring

To understand the difference between observability and monitoring, we first need to discuss the role of monitoring. Monitoring is the evaluation that helps identify the most valuable and efficient use of resources. So the big question I put to you is what to monitor. This is the first step in preparing a monitoring strategy, and there are a few questions you can ask yourself to understand whether monitoring by itself is enough or whether you need to move to an observability platform. Firstly, consider what you should be monitoring, why you should be monitoring it, and how you should monitor it. When you know this, you can move on to the different tools and platforms available. Some of these tools will be open source and others commercial. When evaluating these tools, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos break agility in every form of technology.

 


Diagram: Observability vs Monitoring

 

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environments, and this will be done with several tools. This will let you know what is affecting your application performance and infrastructure. As a good starting point, there are four golden signals to look out for: latency, saturation, traffic, and errors. These are Google’s Four Golden Signals, the four most important metrics to keep track of:

      1. Latency: how long it takes to serve a request.
      2. Traffic: the number of requests being made.
      3. Errors: the rate of failing requests.
      4. Saturation: how utilized the service is.

Now that we have a guide on what to monitor, let us apply this to Kubernetes. For example, for a frontend web service that is part of a tiered application, we would be looking at the following (a small instrumentation sketch follows the list):

      1. How many requests is the front end processing at a particular point in time?
      2. How many 500 errors are users of the service receiving?
      3. Is the service being overutilized by requests?
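
To make this concrete, here is a minimal sketch of what exposing the four golden signals for such a frontend might look like, using the Python prometheus_client library. The metric names, the /checkout path, and the simulated request handler are illustrative assumptions, not part of the original text.

```python
# Minimal sketch: exposing the four golden signals for a frontend service
# with the Python prometheus_client library. Metric names and the
# /checkout endpoint are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "frontend_request_duration_seconds", "Latency: time taken to serve a request"
)
REQUEST_TOTAL = Counter(
    "frontend_requests_total", "Traffic: number of requests made", ["path"]
)
REQUEST_ERRORS = Counter(
    "frontend_request_errors_total", "Errors: count of failing requests", ["code"]
)
IN_FLIGHT = Gauge(
    "frontend_in_flight_requests", "Saturation: requests currently being handled"
)


def handle_request(path: str) -> int:
    """Simulate serving one request while recording all four signals."""
    REQUEST_TOTAL.labels(path=path).inc()
    IN_FLIGHT.inc()
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))            # pretend work
        status = 500 if random.random() < 0.05 else 200  # pretend failures
    IN_FLIGHT.dec()
    if status >= 500:
        REQUEST_ERRORS.labels(code=str(status)).inc()
    return status


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

With counters and histograms like these in place, the three questions above become queries against the scraped metrics rather than guesswork.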

So we already know that monitoring is a form of evaluation to help identify the most valuable and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. So within this, we have metrics, logs, and alerts. Each has a different role and purpose.

 

    • Monitoring: The Role of Metrics

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values instead of unstructured text such as documents and web pages. Metric data is typically also time series, where values or measures are recorded over a period of time. Examples of such metrics are available bandwidth and latency. It is important to understand baseline values. Without a baseline, you will not know if something is happening out of the norm. What are the usual baseline values of the different metrics for bandwidth and latency? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? This may change over the different days of the week and months. If, during normal operations, you notice a rise in these values, this would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Keep in mind that these values should not be gathered as a once-off; they should be gathered over time to give you a good understanding of your application and its underlying infrastructure.
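
To make the baseline idea concrete, here is a hypothetical sketch in plain Python that keeps a rolling baseline for a latency metric and flags samples that drift well outside it. The window size and the three-standard-deviation rule are illustrative choices, not recommendations from the article.

```python
# Hypothetical sketch: flag latency samples that drift outside a rolling
# baseline. The window size and the 3-standard-deviation rule are
# illustrative assumptions.
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    def __init__(self, window: int = 60):
        self.samples = deque(maxlen=window)  # recent "normal" observations

    def observe(self, value: float) -> bool:
        """Return True if `value` looks abnormal against the baseline."""
        abnormal = False
        if len(self.samples) >= 10:                       # need some history first
            baseline, spread = mean(self.samples), stdev(self.samples)
            abnormal = abs(value - baseline) > 3 * spread
        if not abnormal:
            self.samples.append(value)                    # only learn from normal data
        return abnormal


detector = BaselineDetector(window=60)
for latency_ms in [22, 25, 24, 23, 26, 25, 24, 23, 25, 24, 95]:
    if detector.observe(latency_ms):
        print(f"latency {latency_ms} ms is outside the usual baseline")
```

The point is not the specific statistics but that the baseline is learned from data gathered over time, which is exactly why once-off measurements are not enough.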

 

    • Monitoring: The Role of Logs

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about events, which is important for troubleshooting or discovering the root cause of those events. Logs have a lot more detail than metrics, so you will need some way to parse them or use a log shipper. A typical log shipper takes these logs from standard out in a Docker container and ships them to a backend for processing. Fluentd and Logstash each have their pros and cons and can be used here to group the logs and send them to a backend such as the ELK stack (Elasticsearch, Logstash, and Kibana). When using this approach, you can enrich the logs before sending them to the backend. For example, you can add GeoIP information, which adds richer detail to the logs and helps you troubleshoot.
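
A common application-side pattern behind this is to emit structured (JSON) logs to standard out so the shipper can collect, enrich, and forward them. Below is a minimal Python sketch of that idea; the service name, field names, and the GeoIP-style enrichment field are illustrative assumptions.

```python
# Minimal sketch: emit structured JSON logs to stdout so a log shipper
# (Fluentd, Logstash, ...) can collect, enrich (e.g. GeoIP lookup on
# client_ip), and forward them to a backend such as Elasticsearch.
# Field names are illustrative assumptions.
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "service": "frontend",
            "message": record.getMessage(),
        }
        # Extra fields give the shipper something to enrich downstream.
        entry.update(getattr(record, "extra_fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)   # containers log to stdout
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("frontend")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "request failed",
    extra={"extra_fields": {"client_ip": "203.0.113.7", "status": 500}},
)
```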

 

    • Monitoring: The Role of Alerting

Then we have alerting, and it is best to balance how you monitor and what you alert on. Alerting is never perfect, and it will take time to get the right alerting strategy in place. It is not a simple day-one installation; it requires much effort and cross-team collaboration. Alerting on too much will cause alert fatigue, and we are all too familiar with the problems alert fatigue can bring and the tensions it creates between departments. To minimize this, you need to consider Service Level Objectives (SLOs) for alerts. SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they are the foundation of a reliability stack. You should also consider alert thresholds and evaluation windows: if these are too tight or too short, you will get a lot of false positives.
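
To illustrate the SLO idea, here is a hypothetical sketch that turns an availability SLO into an error budget and a burn-rate check, one common way to decide whether an alert should fire at all. The 99.9% target and the burn-rate threshold are assumptions for the example, not prescriptions.

```python
# Hypothetical sketch: alert on how fast an availability SLO's error
# budget is being burned, rather than on every failed request. The 99.9%
# target and the 14.4 burn-rate threshold are illustrative assumptions.
SLO_TARGET = 0.999                     # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET          # 0.1% of requests may fail


def burn_rate(total_requests: int, failed_requests: int) -> float:
    """How many times faster than 'allowed' the budget is being consumed."""
    if total_requests == 0:
        return 0.0
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / ERROR_BUDGET


def should_alert(total_requests: int, failed_requests: int,
                 threshold: float = 14.4) -> bool:
    """Fire only when the budget is burning much faster than sustainable."""
    return burn_rate(total_requests, failed_requests) >= threshold


# Last hour: 120,000 requests with 2,400 failures is a 2% error rate,
# i.e. the 0.1% budget is burning 20x too fast, so the alert fires.
print(should_alert(total_requests=120_000, failed_requests=2_400))  # True
```

An approach like this ties alerts to what users actually experience, which is what keeps alert volume, and therefore alert fatigue, down.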

Even with all of this in place, monitoring is not enough. Due to the sheer complexity of today’s landscape, you need to think differently about the tools you use and how you use the intelligence and data you receive from them to resolve issues before they become incidents. In that sense, monitoring by itself is not enough. The monitoring tool is just a tool that probably does not cross technical domains, and different groups of users will administer each tool without a holistic view. The tools alone can take you only halfway through the journey. What also needs to be addressed is the culture and the traditional way of working in silos; a siloed environment can undermine the monitoring strategy you want to implement. This is where an observability platform comes in.

 


Diagram: Observability vs Monitoring. Link to YouTube video.

 

Observability vs Monitoring

So when it comes to observability vs monitoring, we know that monitoring can detect problems and tell you if a system is down. When your system is up, monitoring doesn’t care; it only cares when there is a problem. The problem has to happen before monitoring takes action, so it is very reactive. An observability platform, on the other hand, is a more proactive practice. It is about what your systems and services are doing and how they are doing it. Observability improves your insight into how complex systems are working and lets you quickly get to the root cause of any problem, known or unknown. Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without having to predict the failure in advance. This is a proactive approach.

 

The Pillars of Observability

This is achieved by utilizing a combination of logs, metrics, and traces, so we need data collection, storage, and analysis across these domains, while also being able to alert on what matters most. Let’s say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app. The observability platform pulls the context from different sources of information, such as logs, metrics, events, and traces, into one central context. Distributed tracing adds a lot of value here. When everything is placed into one context, you can easily switch between the necessary views to troubleshoot the root cause. A key component of any observability system is the ability to view these telemetry sources through a single pane of glass.
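
As a small illustration of the tracing pillar, here is a sketch of parent and child spans using the OpenTelemetry Python SDK. The service, span, and attribute names are illustrative assumptions, and the console exporter stands in for a real tracing backend such as Jaeger, Zipkin, or an OTLP collector.

```python
# Minimal sketch: parent/child spans with the OpenTelemetry Python SDK,
# printed to the console. In a real deployment the exporter would send
# spans to a tracing backend. Names and attributes are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("frontend")

# One request produces a trace: the parent span covers the whole request,
# child spans cover the downstream calls, and all share one trace ID,
# which is the shared context that lets you pivot between views.
with tracer.start_as_current_span("GET /checkout") as request_span:
    request_span.set_attribute("http.status_code", 200)
    with tracer.start_as_current_span("call payment-service"):
        pass  # downstream HTTP/gRPC call would go here
    with tracer.start_as_current_span("query orders-db"):
        pass  # database query would go here
```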


Diagram: Distributed Tracing in Microservices

 

Monitoring: Known Unknowns / Observability: Unknown Unknowns

Monitoring automatically reports whether known failure conditions are occurring or are about to occur. In other words, monitoring systems are optimized for reporting on unknown conditions of known failure modes. This is referred to as known unknowns. In contrast, observability is centered around discovering if, and why, previously unknown failure modes may be occurring: in other words, discovering unknown unknowns.

The monitoring-based approach of using metrics and dashboards is an investigative practice that leads with the experience and intuition of humans to detect and make sense of system issues. This works for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in very unpredictable ways. With modern applications, the complexity and scale of the underlying systems quickly make that approach untenable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex, without having to react to a hunch or possess the intimate system knowledge needed to generate one.

 

Observability and Controllability

Diagram: Distributed Tracing Explained: Link to YouTube video.

 

  • Monitoring vs Observability: Working Together?

Monitoring best helps engineers understand infrastructure concerns, while observability best helps engineers understand software concerns, so observability and monitoring can work together. Infrastructure does not change too often, and when it fails, it fails in more predictable ways, so we can use monitoring here. This is in contrast to software system states, which change daily and are not predictable; observability fits that purpose. The conditions that affect infrastructure health change infrequently and are relatively easy to predict, and we have several well-established practices for handling them, such as capacity planning and automatic remediation (e.g., auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues. Due to its relatively predictable and slowly changing nature, the aggregated-metrics approach to monitoring and alerting works well for infrastructure problems, so a metrics-based system fits here. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached.

Software is different. Here we need access to high-cardinality fields, which may include a user ID or a shopping cart ID. Code that is well instrumented for observability allows you to answer complex questions that are easy to miss when examining only aggregate performance.
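
To show what instrumenting for high cardinality can look like, here is a hypothetical sketch that emits one wide, structured event per request with fields such as user_id and cart_id attached. The field names and the stdout destination are illustrative assumptions; the point is that any dimension captured here can later be used to slice performance.

```python
# Hypothetical sketch: emit one wide, structured event per request with
# high-cardinality fields (user_id, cart_id, ...) so questions such as
# "which users saw slow checkouts yesterday?" can be answered later.
# Field names and the stdout destination are illustrative assumptions.
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def request_event(**fields):
    """Collect context for one request and emit a single event at the end."""
    event = {"timestamp": time.time(), **fields}
    start = time.perf_counter()
    try:
        yield event                       # handlers add fields as they learn them
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        json.dump(event, sys.stdout)
        sys.stdout.write("\n")


# Any field, however high its cardinality, can be attached to the event.
with request_event(route="/checkout", user_id="u-48121", cart_id="c-99307") as ev:
    ev["items_in_cart"] = 3
    ev["status_code"] = 200
```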