Observability and Controllability

Observability and Controllability: Issues with Metrics

What Is a Metric: Good for the Known

So when it comes to observability and controllability, one needs to understand the limitations of the metric. In reality, a metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are cheap, disposable, and have a predictable storage footprint. A metric is a numerical representation of system state over a recorded time interval and can tell you whether a particular resource is over- or under-utilized at a particular moment in time. For example, CPU utilization might be at 75% right now.

There are many tools to gather metrics, such as Prometheus, and several techniques for gathering them, such as the PUSH and PULL approaches. There are pros and cons to each method, but Prometheus and its PULL approach are prevalent in the market. However, if you are looking for full observability and controllability, keep in mind that Prometheus lives solely in the world of metrics-based monitoring.
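As a rough illustration of the PULL model, here is a minimal Python sketch using the prometheus_client library: it exposes a labelled CPU-utilization gauge on an HTTP endpoint that a Prometheus server could scrape. The metric name, label names, values, and port are illustrative assumptions, not something taken from a real exporter.

```python
# Minimal sketch of a Prometheus PULL-style exporter (assumed setup, not a real one).
# prometheus_client exposes an HTTP /metrics endpoint that a Prometheus server
# scrapes on its own schedule.
import random
import time

from prometheus_client import Gauge, start_http_server

# A metric is just a number plus optional labels (tags) for grouping and searching.
# "node_cpu_utilisation_percent" and the "host"/"region" labels are hypothetical names.
cpu_utilisation = Gauge(
    "node_cpu_utilisation_percent",
    "CPU utilisation of a host as a percentage",
    ["host", "region"],
)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus would be configured to pull from :8000/metrics
    while True:
        # A real exporter would read an actual system counter here;
        # a random value stands in for the current CPU reading.
        cpu_utilisation.labels(host="web-01", region="eu-west-1").set(random.uniform(20, 90))
        time.sleep(5)
```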

 

Metrics: Resource Utilization Only

So metrics are useful for telling us about resource utilization. Within a Kubernetes environment, these metrics are used for auto-healing and auto-scheduling. When it comes to metrics, monitoring performs several functions. First, it collects, aggregates, and analyzes metrics to sift through known patterns that indicate troubling trends. The key point here is that it sifts through known patterns. Then, based on a known event, metrics trigger alerts that notify you when further investigation is needed. Finally, on top of all of this, we have dashboards that display the metric trends, adapted for visual consumption.
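To make the "known pattern" point concrete, here is a minimal sketch of the kind of rule a metrics-based monitor evaluates: a fixed condition, declared in advance, that fires when a collected value crosses it. The metric name, threshold, and sample values are all hypothetical.

```python
# Sketch of metrics-based alerting: the condition must be known and declared up front.
from dataclasses import dataclass
from statistics import mean

@dataclass
class AlertRule:
    metric: str       # e.g. "cpu_utilisation_percent" (hypothetical name)
    threshold: float  # value above which the rule fires
    window: int       # number of recent samples to aggregate over

def evaluate(rule: AlertRule, samples: list[float]) -> bool:
    """Fire only when the averaged recent samples cross the predeclared threshold."""
    recent = samples[-rule.window:]
    return len(recent) == rule.window and mean(recent) > rule.threshold

# A pattern we have seen before: sustained CPU above 90% for five samples.
rule = AlertRule(metric="cpu_utilisation_percent", threshold=90.0, window=5)
cpu_samples = [70, 88, 92, 95, 93, 96, 97]

if evaluate(rule, cpu_samples):
    print(f"ALERT: {rule.metric} above {rule.threshold}% - investigate")  # notify on a known event
```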

These monitoring systems work well for identifying previously encountered, known failures, but they don’t help as much with the unknown. Unknown failures are the norm these days, with distributed systems and complex system interactions. Metrics are good for dashboards, but there won’t be a predefined dashboard for unknowns, as you can’t track something you don’t know about. Using metrics and dashboards like this is a very reactive approach. Yet it’s an approach that has been widely accepted as the norm. Monitoring is a reactive approach best suited for detecting known problems and previously identified patterns.

So within a microservices environment, metrics can tell you whether a microservice is healthy or unhealthy. Still, a metric will have a hard time telling you if a microservice’s function takes a long time to complete, or if there is an intermittent problem with an upstream or downstream dependency. We need different tools to gather this type of information. The issue with metrics is that they only look at individual microservices with a given set of attributes, so they don’t give you a holistic view of the entire problem. The application stack now exists in numerous locations and location types; we need a holistic viewpoint, and a metric does not give us this. Metrics are used to track simplistic system states that might indicate a service is running poorly, or that may be a leading indicator or an early warning signal. However, while those measures are easy to collect, they don’t turn out to be useful measures for triggering alerts.
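A distributed trace is one such tool: it records how long each step of a request takes across service boundaries, which a per-service metric cannot show. The sketch below assumes an OpenTelemetry Python SDK is already configured and exporting somewhere; the service name, span names, and dependency URLs are illustrative.

```python
# Sketch of using trace spans to see where time goes across dependencies.
# Assumes the OpenTelemetry SDK has been set up elsewhere (provider + exporter).
import requests
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def place_order(order_id: str) -> None:
    # The parent span covers the whole operation...
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)

        # ...and child spans time each dependency call, so a slow upstream or
        # downstream service shows up as a long segment in the trace.
        with tracer.start_as_current_span("inventory.reserve"):
            requests.post("http://inventory/reserve", json={"order": order_id}, timeout=2)

        with tracer.start_as_current_span("payment.charge"):
            requests.post("http://payments/charge", json={"order": order_id}, timeout=2)
```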

 


Diagram: The Three Pillars of Observability: Metrics, Traces and Logs

 

Issues With Dashboards: Useful Only for a Few Metrics

These metrics are gathered and stored in time-series databases, and we have several dashboards to display them. When these dashboards were first built, there weren’t many system metrics to worry about. You could have gotten away with 20 or so dashboards, but that was about it. As a result, it was easy to see the critical data anyone should know about for any given service. Moreover, those systems were pretty simple and did not have many moving parts. This is in contrast to modern services, which typically collect so many metrics that it’s impossible to fit them all into the same dashboard.

Issues with Aggregate Metrics

So we need to find ways to fit all the metrics into a few dashboards. Here the metrics are often pre-aggregated and averaged. The issue is that the aggregate values no longer provide meaningful visibility, even when we have filters and drill-downs. Therefore, we need to predeclare conditions that describe what we think we are going to see in the future. This is where we fall back on instinct and past experience and rely on gut feeling. Remember the network and software hero? You should try to avoid aggregation and averaging within the metrics store. Percentiles, on the other hand, offer a richer view. Keep in mind, however, that they require the raw data.
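Here is a quick worked example of why averaging hides what a percentile exposes, and why percentiles need the raw samples rather than a pre-aggregated value. The latency numbers are made up for illustration.

```python
# Averages over pre-aggregated data smooth away the tail;
# percentiles need the raw samples to expose it.
from statistics import mean, quantiles

# Hypothetical raw request latencies in milliseconds:
# 99% of requests are fast, 1% hit a slow dependency.
latencies_ms = [15.0] * 990 + [2500.0] * 10

cuts = quantiles(latencies_ms, n=100)           # needs the raw data, not an average
print(f"average: {mean(latencies_ms):.1f} ms")  # ~40 ms  - looks healthy
print(f"p50:     {cuts[49]:.1f} ms")            # 15 ms   - the typical request
print(f"p99:     {cuts[98]:.1f} ms")            # ~2475 ms - the tail users actually feel
```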

 

Highlighting Observability: Any Question

Observability and controllability tools take an entirely different approach. They strive for exploratory ways of finding problems. Essentially, those operating observability systems don’t sit back and wait for an alert or for something to happen. Instead, they are always actively looking, asking the observability system ad-hoc questions. Observability tools should gather rich telemetry for every possible event, capture the full context of every request, and be able to store and query it. In addition, these new observability tools are specifically designed to query against high-cardinality data. High cardinality allows you to interrogate your event data in any arbitrary way you see fit. Now you can ask any question about your system and inspect its corresponding state.
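To illustrate the "ask any question" idea, the sketch below treats each request as a wide, structured event with high-cardinality fields (user ID, build ID, and so on, all hypothetical names and values) and then slices the data by a question nobody predeclared a dashboard for.

```python
# Sketch of wide, high-cardinality events: one rich record per request,
# queryable in arbitrary ways after the fact. Field names and values are hypothetical.
events = [
    {"service": "checkout", "route": "/pay", "status": 504, "duration_ms": 3100,
     "user_id": "u-10482", "build_id": "2024-06-03.7", "region": "eu-west-1"},
    {"service": "checkout", "route": "/pay", "status": 200, "duration_ms": 95,
     "user_id": "u-99130", "build_id": "2024-06-03.7", "region": "us-east-1"},
    {"service": "checkout", "route": "/cart", "status": 200, "duration_ms": 40,
     "user_id": "u-10482", "build_id": "2024-06-02.9", "region": "eu-west-1"},
    # ...in practice, one such event per request, stored and indexed by the platform.
]

# An ad-hoc question nobody predeclared:
# "Which build and region do the slow, failing /pay requests come from?"
slow_failures = [
    (e["build_id"], e["region"], e["user_id"])
    for e in events
    if e["route"] == "/pay" and e["status"] >= 500 and e["duration_ms"] > 1000
]
print(slow_failures)  # [('2024-06-03.7', 'eu-west-1', 'u-10482')]
```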

 

Key Observability and Controllability Considerations

No Predictions in Advance

Due to the nature of modern software systems, you want the ability to understand the inner state of any service without anticipating or predicting it in advance. For this, we need to gather valuable telemetry and use new tools and technological capabilities to interrogate that data once it has been collected. Telemetry needs to be constantly gathered, in flexible ways, so we can debug issues without predicting how failures may occur.

The conditions that affect infrastructure health change infrequently, so it is relatively easy to monitor the infrastructure. In addition, we have several well-established practices for prediction, such as capacity planning, and the ability to remediate automatically (e.g., auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues.
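As an aside, that auto-scaling remediation is itself a known-condition, metrics-driven loop. Conceptually it resembles the replica calculation the Kubernetes Horizontal Pod Autoscaler documents, sketched here in plain Python with illustrative numbers rather than a real cluster.

```python
# Conceptual sketch of metrics-driven auto-scaling, in the spirit of the
# Kubernetes HPA's documented replica calculation:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
    """Scale out or in so average utilisation moves toward the target."""
    return max(1, math.ceil(current_replicas * current_cpu_pct / target_cpu_pct))

# Known condition, known remediation: utilisation is a predictable signal to act on.
print(desired_replicas(current_replicas=4, current_cpu_pct=90, target_cpu_pct=60))  # -> 6
```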

Due to its relatively predictable and slowly changing nature, the aggregated-metrics approach monitors and alerts well on infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of the underlying systems are being reached. So you could say that metrics-based systems work well for infrastructure problems that don’t change too much, but fall dramatically short in the world of complex distributed systems. For these types of systems, you should opt for an observability and controllability platform. Check out my short YouTube video on the differences between monitoring and observability.
