Observability vs Monitoring

 

Ensuring the performance, reliability, and availability of complex systems is crucial in software development and operations. To achieve this, two critical practices come into play: observability and monitoring. While these terms are often used interchangeably, they represent distinct approaches to understanding and managing system behavior. In this blog post, we will delve into observability and monitoring, exploring their differences, benefits, and how they work together to provide valuable insights into system performance.

Observability is a concept that originated in control theory but has now found its way into the realm of software systems. It refers to the ability to understand what is happening inside a system based on its external outputs. In other words, it is the degree to which we can measure the internal state of a system by analyzing its external behavior.

 

Highlights: Observability vs Monitoring

  • The Role of Monitoring

To understand the difference between observability and monitoring, we first need to discuss the role of monitoring. Monitoring is a form of evaluation that helps identify the most practical and efficient use of resources. So the big question is what to monitor. Answering it is the first step in preparing a monitoring strategy.

There are a few questions you can ask yourself to decide whether monitoring is enough or whether you need to move to an observability platform. First, consider what you should be monitoring, why you should be monitoring it, and how to monitor it.

  • Options: Open source or commercial

Knowing this lets you move into the different tools and platforms available. Some of these tools will be open source, and others commercial. When evaluating these tools, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos are breaking agility in every form of technology.

 

For background information, you may find the following posts helpful:

  1. Microservices Observability
  2. Auto Scaling Observability
  3. Network Visibility
  4. WAN Monitoring
  5. Distributed Systems Observability
  6. Prometheus Monitoring
  7. Correlate Disparate Data Points
  8. Segment Routing

 



Monitoring vs Observability

Key Observability vs Monitoring Discussion points:


  • The difference between Monitoring vs Observability. 

  • Google's four Golden signals.

  • The role of metrics, logs and alerts.

  • The need for Observability.

  • Observability and Monitoring working together.

 

  • A key point: Video on Observability vs. Monitoring

In the following video, we start by discussing how our approach to monitoring needs to adapt to current megatrends, such as the rise of microservices. Failures are unknown and unpredictable, so a predefined monitoring dashboard will have difficulty keeping up with the rate of change and with unknown failure modes. For this, we should look to practice observability for software and monitoring for infrastructure.

 


 

Back to Basics with Observability vs Monitoring

Monitoring and distributed systems

By utilizing distributed architectures, the cloud native ecosystem allows organizations to build scalable, resilient, and novel software architectures. But the ever-changing nature of distributed systems means that previous approaches to monitoring can no longer keep up. The introduction of containers made the cloud flexible and empowered distributed systems.

Nevertheless, the ever-changing nature of these systems can cause them to fail in many ways. Distributed systems are inherently complex, and, as systems theorist Richard Cook notes, “Complex systems are intrinsically hazardous systems.”

Cloud-native systems require a new approach to monitoring, one that is open-source compatible, scalable, reliable, and able to control massive data growth. But cloud-native monitoring can’t exist in a vacuum: it needs to be part of a broader observability strategy.

 

Diagram: Observability vs monitoring.

Key Features of Observability:

1. High-dimensional data collection: Observability involves collecting a wide variety of data from different system layers, including metrics, logs, traces, and events. This comprehensive data collection provides a holistic view of the system’s behavior.

2. Distributed tracing: Observability allows tracing requests as they flow through a distributed system, enabling engineers to understand the entire path and identify performance bottlenecks or errors.

3. Contextual understanding: Observability emphasizes capturing contextual information alongside the data, enabling teams to correlate events and understand the impact of changes or incidents.

Benefits of Observability:

1. Faster troubleshooting: By providing detailed insights into system behavior, observability helps teams quickly identify and resolve issues, minimizing downtime and improving system reliability.

2. Proactive monitoring: Observability allows teams to detect potential problems before they become critical issues, enabling proactive measures to prevent service disruptions.

3. Improved collaboration: With observability, different teams, such as developers, operations, and support, can have a shared understanding of the system’s behavior, leading to improved collaboration and faster incident response.

Monitoring:

On the other hand, monitoring focuses on collecting and analyzing metrics to assess the health and performance of a system. It involves setting up predefined thresholds or rules and generating alerts based on specific conditions.

Key Features of Monitoring:

1. Metric-driven analysis: Monitoring relies on predefined metrics collected and analyzed to measure system performance, such as CPU usage, memory consumption, response time, or error rates.

2. Alerting and notifications: Monitoring systems generate alerts and notifications when predefined thresholds or rules are violated, enabling teams to take immediate action.

3. Historical analysis: Monitoring systems provide historical data, allowing teams to analyze trends, identify patterns, and make informed decisions based on past performance.

Benefits of Monitoring:

1. Performance optimization: Monitoring helps identify performance bottlenecks and inefficiencies within a system, enabling teams to optimize resources and improve overall system performance.

2. Capacity planning: By monitoring resource utilization and workload patterns, teams can accurately plan for future growth, ensuring sufficient resources are available to meet demand.

3. Compliance and SLA enforcement: Monitoring systems help organizations meet compliance requirements and enforce service level agreements (SLAs) by tracking and reporting on key metrics.

Observability and Monitoring: A Unified Approach:

While observability and monitoring differ in their approaches and focus, they are not mutually exclusive. When used together, they complement each other and provide a more comprehensive understanding of system behavior.

Observability enables teams to gain deep insights into system behavior, understand complex interactions, and troubleshoot issues effectively. Conversely, monitoring provides a systematic approach to tracking predefined metrics, generating alerts, and ensuring the system meets performance requirements.

Organizations can create a robust system monitoring and management strategy by combining observability and monitoring. This integrated approach empowers teams to quickly detect, diagnose, and resolve issues, improving system reliability, performance, and customer satisfaction.

 

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environment, which will be done with several tools. This will let you know what is affecting your application performance and infrastructure. A good starting point is Google’s Four Golden Signals, the four most important metrics to keep track of:

      1. Latency: How long it takes to serve a request
      2. Traffic: The number of requests being made.
      3. Errors: The rate of failing requests. 
      4. Saturation: How utilized the service is.

So now we have some guidance on what to monitor. Let us apply this to Kubernetes. For a frontend web service that is part of a tiered application, for example, we would be looking at the following:

      1. How many requests is the front end processing at a particular point in time?
      2. How many 500 errors are users of the service receiving?
      3. Do the requests overutilize the service?
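
To make the four signals concrete, here is a minimal Python sketch that summarizes one observation window for a single service. It is only an illustration: the `Request` record, the `golden_signals` helper, and the idea of measuring saturation as in-flight requests against a limit are assumptions made for this example, not something a particular monitoring tool prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned to the user


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = [r.duration_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),   # latency
        "traffic_rps": len(requests) / window_s,       # traffic
        "error_rate": server_errors / len(requests),   # errors
        "saturation": in_flight / max_in_flight,       # saturation (assumed proxy)
    }


# Made-up sample data for a two-second window.
sample = [Request(120, 200), Request(95, 200), Request(480, 500), Request(110, 200)]
signals = golden_signals(sample, window_s=2, in_flight=30, max_in_flight=40)
print(signals)
```

In a real Kubernetes setup these numbers would come from your metrics pipeline rather than an in-memory list, but the questions being answered are the same three listed above.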

So we already know that monitoring is a form of evaluation to help identify the most practical and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. So within this, we have metrics, logs, and alerts. Each has a different role and purpose.

 

Monitoring: The role of metrics

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values rather than unstructured text such as documents and web pages. Metric data is also typically time series, where values or measures are recorded over time.

An example of such metrics would be available bandwidth and latency. It is essential to understand baseline values. Without a baseline, you will not know if something is happening outside the norm.

What are the average baseline values for bandwidth and latency metrics? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? The answers may change across the days of the week and the months of the year.

If you notice a rise in these values during normal operations, this would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Remember that these values should not be gathered as a one-off but gathered over time to understand your application and its underlying infrastructure better.
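
As a rough sketch of what "knowing your baseline" can look like in code, the Python below builds a baseline from historical latency samples and flags values that stray too far from it. The three-standard-deviation tolerance and the sample numbers are assumptions chosen for illustration, not a recommended setting.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (average, spread) for a series of historical measurements."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, tolerance=3.0):
    """Flag a value more than `tolerance` standard deviations from the average."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


# Made-up latency history in milliseconds, gathered over time.
latency_history_ms = [20, 22, 19, 21, 23, 20, 22, 21]
baseline = build_baseline(latency_history_ms)

print(is_abnormal(22, baseline))  # within normal fluctuation
print(is_abnormal(45, baseline))  # well outside the baseline
```

Because normal values differ between weekdays and months, you would likely keep a separate baseline per time bucket rather than one global one.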

 

  • A key point: Video on Prometheus Metric Types

In this video tutorial, we go through the basics of how monitoring systems work, particularly the role of Prometheus and its pull approach, along with the different metric types that Prometheus can scrape.

 


 

Monitoring: The role of logs

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about the events. This is important for troubleshooting or discovering the root cause of the events. Logs will have a lot more detail than metrics. So you will need some way to parse the logs or use a log shipper.

A typical log shipper will take these logs from the standard output of a Docker container and ship them to a backend for processing.

Fluentd and Logstash each have their pros and cons, and either can be used here to group the logs and send them to a backend database, which could be the ELK stack (Elasticsearch). Using this approach, you can add different things to the logs before sending them to the backend. For example, you can add GeoIP information, which gives you richer logs that can help you troubleshoot.
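
The enrichment step can be pictured with a small Python sketch. This is not how Fluentd or Logstash are configured; it only mimics the idea of parsing a log line and attaching extra context before shipping it. The `GEO_BY_PREFIX` table, the `client_ip` field name, and the country codes are hypothetical stand-ins for a real GeoIP database.

```python
import json

# Hypothetical lookup table standing in for a real GeoIP database.
# The prefixes are documentation address ranges; the countries are made up.
GEO_BY_PREFIX = {"203.0.113.": "AU", "198.51.100.": "US"}


def enrich(raw_line):
    """Parse one JSON log line and attach a country code when one is known."""
    event = json.loads(raw_line)
    ip = event.get("client_ip", "")
    event["geo_country"] = "unknown"
    for prefix, country in GEO_BY_PREFIX.items():
        if ip.startswith(prefix):
            event["geo_country"] = country
            break
    return event


line = '{"level": "error", "msg": "upstream timeout", "client_ip": "203.0.113.7"}'
enriched = enrich(line)
print(enriched)
```

The enriched event is what would be sent on to the backend, so a later search can filter errors by country as well as by message.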

 

Monitoring: The role of alerting

Then we have alerting, where it is best to balance how you monitor against what you alert on. Alerting is not always perfect, and getting the right alerting strategy in place will take time. It’s not a simple day-one installation and requires much effort and cross-team collaboration.

Alerting on too much will cause alert fatigue, and we are all too familiar with the problems and the tension between departments that alert fatigue can bring.

To minimize this, consider Service Level Objectives (SLOs) for alerts. SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they are the foundation for a reliability stack. You should also consider alert thresholds. If these are set too tightly, you will get a lot of false positives on your alerts.
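
One common way to turn an SLO into an alert is to look at how fast the error budget is being consumed, often called the burn rate. The post does not describe this technique, so treat the Python below as a sketch under assumptions: the 99.9% target, the `max_burn` paging threshold, and the function names are all illustrative.

```python
def burn_rate(failed, total, slo_target):
    """How fast the error budget is being spent in this window.

    A value of 1.0 means errors arrive at exactly the rate the SLO allows.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.1% of requests may fail
    return (failed / total) / error_budget


def should_page(failed, total, slo_target=0.999, max_burn=10.0):
    """Page a human only when the budget is burning much faster than allowed."""
    return burn_rate(failed, total, slo_target) > max_burn


print(should_page(failed=50, total=10_000))   # about 5x burn: no page
print(should_page(failed=200, total=10_000))  # about 20x burn: page
```

Alerting on budget consumption rather than on every threshold crossing is one way to cut the false positives that lead to alert fatigue.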

 

Monitoring is not enough.

Even with all of these in place, monitoring is not enough. Due to the sheer complexity of today’s landscape, you need to think differently about the tools you use and how you use the intelligence and data you receive from them to resolve issues before they become incidents.

A monitoring tool is just a tool that probably does not cross technical domains, and different groups of users will administer each tool without a holistic view. The tools alone can take you only halfway through the journey. What also needs to be addressed is the culture and the traditional way of working in silos. A siloed environment can affect the monitoring strategy you want to implement. Here you can look into an observability platform.

 

Observability vs Monitoring

So when it comes to observability vs monitoring, we know that monitoring can detect problems and tell you if a system is down. When your system is up, monitoring doesn’t care. Monitoring only cares when there is a problem, and the problem has to happen before monitoring takes action. It’s very reactive.

On the other hand, we have an observability platform, which supports a more proactive practice. It’s about what your system and services are doing and how they are doing it. Observability lets you improve your insight into how complex systems work and lets you quickly get to the root cause of any problem, known or unknown.

Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without having to predict the problem first. This is a proactive approach.

 

The pillars of observability

This is achieved by combining logs, metrics, and traces. So we need data collection, storage, and analysis across these domains, while also being able to alert on what matters most. Let’s say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app.

The observability platform pulls the context from different sources of information, like logs, metrics, events, and traces, into one central context. Distributed tracing adds a lot of value here.

Also, when everything is placed into one context, you can quickly switch between the necessary views to troubleshoot the root cause. A key component of any observability system is the ability to view these telemetry sources through a single pane of glass.
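
Pulling telemetry into one context usually depends on a shared identifier. The Python sketch below assumes every log entry and span carries a `trace_id` field (an assumption for this example, as are the sample records) and groups them so that one request can be inspected from a single place.

```python
from collections import defaultdict


def correlate(logs, spans, key="trace_id"):
    """Group logs and spans that share the same correlation key."""
    context = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        context[entry[key]]["logs"].append(entry)
    for span in spans:
        context[span[key]]["spans"].append(span)
    return dict(context)


# Made-up telemetry from two requests.
logs = [
    {"trace_id": "t1", "level": "error", "msg": "HTTP 502 from payments"},
    {"trace_id": "t2", "level": "info", "msg": "checkout ok"},
]
spans = [
    {"trace_id": "t1", "service": "frontend", "duration_ms": 900},
    {"trace_id": "t1", "service": "payments", "duration_ms": 850},
    {"trace_id": "t2", "service": "frontend", "duration_ms": 40},
]

view = correlate(logs, spans)
print(view["t1"])  # the error log next to the spans of the same request
```

A real platform does this at query time across separate stores, but the principle of joining on a common key is the same.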

Diagram: Distributed tracing in microservices.

 

Monitoring: Known Unknowns / Observability: Unknown Unknowns

Monitoring automatically reports whether known failure conditions are occurring or are about to occur. In other words, it is optimized for reporting on unknown conditions about known failure modes. These are referred to as known unknowns. In contrast, observability is centered around discovering if and why previously unknown failure modes may be occurring: in other words, discovering unknown unknowns.

The monitoring-based approach of metrics and dashboards is an investigative practice that relies on the experience and intuition of humans to detect and make sense of system issues. This is okay for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in unpredictable ways.

With modern applications, the complexity and scale of the underlying systems quickly make that approach untenable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex. You don’t need intimate system knowledge to generate a hunch, and you don’t need to react to one.

 

Monitoring vs Observability: Working together?

Monitoring best helps engineers understand infrastructure concerns, while observability best helps engineers understand software concerns. So observability and monitoring can work together. Infrastructure does not change too often, and when it fails, it fails more predictably. So we can use monitoring here.

Compare this with software system states, which change daily and are unpredictable. Observability fits this purpose. The conditions that affect infrastructure health change infrequently and are relatively easy to predict. We have several well-established practices for dealing with them, such as capacity planning and automatic remediation (e.g., auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues.

 

Monitoring and infrastructure problems

Because infrastructure is relatively predictable and slow to change, the aggregated-metrics approach to monitoring and alerting suits infrastructure problems well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached.

Software is different. To monitor it, we need access to high-cardinality fields, such as a user ID or a shopping cart ID. Code that is well-instrumented for observability allows you to answer complex questions that are easy to miss when examining aggregate performance.
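
Here is a small Python sketch of why high-cardinality fields matter. The events, the `user_id` and `duration_ms` field names, and the numbers are made up for illustration. The overall average hides the fact that nine users are fast and one is having a terrible time, while grouping by user makes it obvious.

```python
from collections import defaultdict
from statistics import mean


def mean_by(events, field, value_field="duration_ms"):
    """Average a numeric field per distinct value of a high-cardinality field."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event[value_field])
    return {key: mean(values) for key, values in groups.items()}


# Nine users served in 50 ms each, one user stuck at 2000 ms.
events = [{"user_id": f"u{i}", "duration_ms": 50} for i in range(9)]
events.append({"user_id": "u9", "duration_ms": 2000})

overall = mean(e["duration_ms"] for e in events)
per_user = mean_by(events, "user_id")
slowest = max(per_user, key=per_user.get)

print(overall)                      # the aggregate view
print(slowest, per_user[slowest])   # the user the aggregate hides
```

A metrics system that only stores the aggregate cannot answer the per-user question later, which is why the raw, well-instrumented events need to be kept.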

Conclusion:

Observability and monitoring are essential practices in modern software development and operations. While observability focuses on understanding system behavior through comprehensive data collection and analysis, monitoring uses predefined metrics to assess performance and generate alerts. By leveraging both approaches, organizations can gain a holistic view of their systems, enabling proactive measures, faster troubleshooting, and optimal performance. Embracing observability and monitoring as complementary practices can pave the way for more reliable, scalable, and efficient systems in the digital era.

 

 

 

Distributed Systems Observability

In today’s technology-driven world, distributed systems have become the backbone of numerous applications and services. These systems are designed to handle large-scale data processing, ensure fault tolerance, and provide high scalability. However, managing and monitoring distributed systems can be challenging. This is where observability comes into play. In this blog post, we will explore the significance of distributed systems observability and how it enables efficient management and troubleshooting.

Distributed systems observability refers to the ability to gain insights into the inner workings of a distributed system. It encompasses monitoring, logging, and tracing capabilities that allow engineers to effectively understand system behavior, performance, and potential issues. By adopting observability practices, organizations can ensure the smooth operation of their distributed systems and identify and resolve problems quickly.

 

Highlights: Distributed Systems Observability

  • The Role of Megatrends

A considerable drive for innovation has spawned several megatrends that affect how we manage and view our network infrastructure and that create the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of a monolith where everything is generally housed internally, many services and dependencies in multiple locations must now be managed and operated, hence the need for microservices observability. The megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

  • Shift in Control

There has also been a shift in the point of control. As we move towards new technologies, many of the loosely coupled services or infrastructures your services rely upon are not under your control. The edge of control has been pushed out, creating different network and security perimeters. These perimeters are now closer to the workload than to a central security stack. Therefore, the workloads themselves are concerned with security.

 

For background information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

 



Distributed Systems Observability.

Key Distributed Systems Observability points:


  • We no longer have predictable failures.

  • The different demands on networks.

  • The issues with the metric-based approach.

  • Static thresholds and alerting.

  • The 3 pillars of Observability.

 

Back to Basics with Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have been needed by just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

 

Diagram: Systems Observability design.

The Key Components of Observability:

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring:

Monitoring involves the continuous collection and analysis of system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.

2. Logging:

Logging involves the recording of events, activities, and errors occurring within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.

3. Tracing:

Tracing involves capturing the flow of requests and interactions between different components of the distributed system. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact with each other.

Benefits of Observability in Distributed Systems:

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting:

Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization:

By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management:

Observability facilitates the monitoring of system changes and their impact on the overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps in maintaining system stability and avoiding unexpected issues.

 

How This Affects Failures

The primary issue I have seen with my clients is that application failures are no longer predictable, and dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have a lot of partial failures that are not just unexpected but unknown or never seen before. Recall, for example, the network hero.

 

The network hero

The network hero is someone who knows every part of the network and has seen every failure at least once. That kind of knowledge is no longer enough in today’s world, which needs proper observability. When I was working as an engineer, we would have plenty of failures, but more than likely, we would have seen them before, and there was a system in place to fix the error. Today’s environment is much different.

We can no longer rely on simply seeing either up or down, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer’s perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customer’s perspective at what matters to them. Content Delivery Networks (CDNs) were among the first to realize this and to measure what matters most to the customer.

 

Distributed Systems Observability

The different demands

So modern, complex distributed systems place very different demands on your infrastructure and the people that manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and, therefore, be slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is, therefore, unavailable.
    • The microservice could be fine, but there could be slow-running database queries.

So we have a lot of partial failures.

 

Therefore: We can no longer predict

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and good system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization, more than likely managed by different people, trying to monitor a very dispersed application with multiple components and services in various places.

 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known, predictable failure modes. So we have predefined thresholds beyond which behavior is considered abnormal.

Monitoring can detect when these systems are over or under the previously set thresholds. Then we can set alerts and hope that these alerts are actionable. This is only useful for variants of predictable failure modes.

Traditional metrics and monitoring tools can show you performance spikes or tell you that a problem has occurred. But they don’t let you dig into the source of problems, slice and dice the data, or see correlations between errors. If the system is complex, it is harder with this approach to get to the root cause in a reasonable timeframe.

 

Traditional style metrics systems

With traditional-style metrics systems, you had to define custom metrics, and they were always defined upfront. With this approach, we can’t start to ask new questions about problems, because the questions to ask had to be defined upfront.

Then we set performance thresholds, pronounced them “good” or “bad,” and checked and re-checked those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. Always observe, instead of waiting for problems such as reaching a certain threshold before acting.

Diagram: System Observability analysis.

 

Metrics: Lack of connective event

Metrics do not retain the connective event. As a result, you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is happening.

An example of this could be an unusual number of running threads on one component, which might indicate that garbage collection is in progress. It might also indicate that slow response times are imminent in an upstream service.

 

User experience and static thresholds

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in different ways, using different components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

We should have only a few alerts, triggered by symptoms that directly impact user experience and not simply because a threshold was reached.

 

  • The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.

 

The Need For Distributed Systems Observability

Systems observability and reliability in distributed systems is a practice. Rather than just focusing on a tool that does logging, metrics, or alerting, observability is all about how you approach problems, and for this, you need to look at your culture. So you could say that observability is a cultural practice that allows you to act proactively on findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint and want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual) and the network, and how data looks in transit and at rest.

What level of observation do you need so you know that everything is performing as it should? And what should you be looking at to get this level of detail?

Monitoring is knowing the data points and the entities we are gathering them from. Observability, on the other hand, is putting all the data together. So monitoring is collecting the data and putting it into the systems, and observability is bringing it together in a single pane of glass and observing the patterns and deviations from the baseline. A vital part of an observability toolkit is service level objectives (SLOs).

 

The three pillars of distributed systems observability

Systems observability has three pillars: metrics, traces, and logs. It is an oversimplification to define observability as simply having these pillars, but you do need them in place. Observability is all about connecting the dots between the pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in the execution of a service request. As a result, it doesn’t matter if services have complex dependencies. You could say that distributed tracing abstracts away the complexity of dynamic systems.

 

  • Use Case: Challenges without tracing.

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem, because that same latency is now being seen in dozens of other services.
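
The stacking effect can be untangled once you have a trace. The Python sketch below computes each span's "self time", meaning its duration minus the time spent in its direct children, and picks the span with the most. The span layout, the service names, and the assumption that child calls run one after another (not in parallel) are simplifications made for this example.

```python
def self_times(spans):
    """Time spent in each span, excluding time spent in its direct children."""
    child_total = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["id"]: span["duration_ms"] - child_total.get(span["id"], 0)
        for span in spans
    }


# Made-up trace: every layer looks slow, but only one is the real cause.
trace = [
    {"id": "frontend", "parent": None, "duration_ms": 930},
    {"id": "cart", "parent": "frontend", "duration_ms": 900},
    {"id": "inventory", "parent": "cart", "duration_ms": 880},
    {"id": "database", "parent": "inventory", "duration_ms": 860},
]

times = self_times(trace)
culprit = max(times, key=times.get)
print(times)
print(culprit)
```

Looking only at per-service latency, all four services appear slow. Subtracting the children shows that three of them are simply waiting on the fourth.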

 

  • Distributed tracing: A winning formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Those interdependencies can obscure problems and make them challenging to debug unless the relationships are clearly understood. Distributed tracing shows the relationships between the various services and components in a distributed system, so traces help you understand system interdependencies.

Conclusion:

In the world of distributed systems, observability plays a vital role in ensuring the stability, performance, and reliability of complex architectures. Monitoring, logging, and tracing provide engineers with the necessary tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.