
Microservices Observability

Monitoring Microservices

In today's rapidly evolving software development landscape, microservices architecture has become a popular choice for building scalable and resilient applications. However, as the complexity of these systems increases, so does the need for effective observability. In this blog post, we will explore the concept of microservices observability and why it is crucial to ensuring the stability and performance of modern software systems.

Microservices observability refers to the ability to gain insight into the behavior and performance of individual microservices, as well as the system as a whole. It involves collecting, analyzing, and visualizing data from various sources, such as logs, metrics, traces, and events, to build a comprehensive understanding of the system's health and performance.


Highlights: Monitoring Microservices

 

The Role of Microservices Monitoring

Microservices monitoring is suitable for known patterns that can be automated, while microservices observability is suitable for detecting unknown and creative failures. Microservices monitoring is a critical part of successfully managing a microservices architecture. It involves tracking each microservice’s performance to ensure there are no bottlenecks in the system and that the microservices are running optimally.

Components of Microservices Monitoring

Additionally, microservices monitoring can detect anomalies and provide insights into the microservices architecture. There are several critical components of microservices monitoring, including:

Metrics: This includes tracking metrics such as response time, throughput, and error rate. This information can be used to identify performance issues or bottlenecks.

Logging: Logging allows administrators to track requests, errors, and exceptions. This can provide deeper insight into the performance of the microservices architecture.

Tracing: Tracing provides a timeline of events within the system. This can be used to identify the source of issues or to track down errors.

Alerts: Alerts notify administrators when certain conditions are met. For example, administrators can be alerted if a service is down or performance is degrading.

Finally, it is essential to note that microservices monitoring is not just limited to tracking performance. It can also detect security vulnerabilities and provide insights into the architecture.

By leveraging microservices monitoring, organizations can ensure that their microservices architecture runs smoothly and that any issues are quickly identified and resolved. This can help ensure the organization’s applications remain reliable and secure.

Related: For pre-information, you will find the following posts helpful:

  1. Observability vs Monitoring
  2. Chaos Engineering Kubernetes
  3. Distributed System Observability
  4. ICMPv6

 



Microservices Monitoring

Key Microservices Observability Discussion Points:


  • The challenges with traditional monitoring.

  • Tools of the past, logs and metrics.

  • Why we need Observability.

  • The use of Distributed Tracing.

  • Observability pillars.

 

Back to Basics: Containers and Microservices

The challenges

Teams increasingly adopt new technologies as companies transform and modernize applications to leverage containers and microservices. IT infrastructure monitoring has always been complex, but it is even more challenging with the changing software architecture and the new technology needed to support it. In addition, many of your existing monitoring tools may not fully support modern applications and frameworks, especially when you throw in serverless and hybrid IT. All of this creates a considerable gap in the management of application health and performance.

Containers

Containers can wrap up an application into its isolated package—everything the application needs to run successfully as a process is executed within the container. Kubernetes is an open-source container management tool that delivers an abstraction layer over the container to manage the container fleets, leveraging REST APIs.

Container-based technologies affect infrastructure management services, like backup, patching, security, high availability, disaster recovery, etc. Therefore, we must establish other monitoring and management technologies for containerization and microservices architectures. Prometheus, for example, has emerged as a go-to open-source monitoring and alerting solution for containers.

 

Diagram: Docker Container. Source Docker.

Microservices

Microservices are an architectural approach to software development that enables teams to create, deploy, and manage applications quickly. Microservices allow greater flexibility, scalability, and maintainability than traditional monolithic applications.

The microservices approach is based on building independent services that communicate with each other over an API. Each service is responsible for a specific business capability, so a single application can comprise many different services. This makes it easy to scale individual components and replace them with newer versions without affecting the rest of the application.

Diagram: Microservices. The source is AVI networks

 

The Benefits of Microservices Observability:

Implementing a robust observability strategy brings several benefits to a microservices architecture:

1. Enhanced Debugging and Troubleshooting:

Microservices observability gives developers the tools and insights to identify and resolve issues quickly. By analyzing logs, metrics, and traces, teams can pinpoint the root causes of failures, reducing mean time to resolution (MTTR) and minimizing the impact on end-users.

2. Improved Performance and Scalability:

Observability enables teams to monitor the performance of individual microservices and identify areas for optimization. By analyzing metrics and tracing requests, developers can fine-tune service configurations, scale services appropriately, and ensure efficient resource utilization.

3. Proactive Issue Detection:

With comprehensive observability, teams can detect potential issues before they escalate into critical problems. By setting up alerts and monitoring key metrics, teams can proactively identify anomalies, performance degradation, or security threats, allowing for timely intervention and prevention of system-wide failures.

 

Video: Microservices vs. Observability

We will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are unknown and unpredictable. Therefore, a pre-defined monitoring dashboard will have difficulty keeping up with the rate of change and unknown failure modes. For this, we should look to have the practice of observability for software and monitoring for infrastructure.

Observability vs Monitoring

 

Microservices Monitoring and Observability

Containers, cloud platforms, scalable microservices, and the complexity of monitoring distributed systems have highlighted significant gaps in the microservices monitoring space that had been static for some time. As a result, you must fully understand performance across the entire distributed and complex stack, including distributed traces across all microservices. To do this, you need a solution that can collect, process, and store the data used for monitoring. The data needs to cover several domains and then be combined and centralized for analysis.

This can be an all-in-one solution that bundles different components for application observability, such as an Application Performance Monitoring (APM) suite, or a single-purpose platform such as Prometheus, which lives in a world of metrics only.

Application Performance Monitoring

Application performance monitoring typically involves tracking the response time of an application, the number of requests it can handle, and the amount of memory or other system resources it uses. This data can be used to identify any issues with application performance or scalability. Organizations can take corrective action by monitoring application performance to improve the user experience and ensure their applications run as efficiently as possible.

Application performance monitoring also helps organizations better understand their users by providing insight into how applications are used and how well they are performing. In addition, this data can be used to identify trends and patterns in user behavior, helping organizations decide how to optimize their applications for better user engagement and experience.

Diagram: Observability: Microservices development.

 

The Need for Microservices Observability

Today’s challenges

1. Obfuscation

When you create microservices, your application becomes more distributed, failures become less coherent, and we live in a world of unpredictable failure modes. The distance between cause and effect also increases. For example, an outage at your cloud provider's blob storage could cause huge cascading latency for everyone. In today's environment, we face new cascading problems.

2. Inconsistency and high independence

Distributed applications might be reliable, but the state of individual components can be much less consistent than in monolithic or non-distributed applications, which have elementary and well-known failure modes. In addition, each element of a distributed application is designed to be highly independent, and each component can be affected by different upstream and downstream components.

3. Decentralization

How do you look for service failures when a thousand copies of that service may run on hundreds of hosts? How do you correlate those failures so you can make sense of what's going on?

 

Tools of the past: Logs and metrics

Traditionally, microservices monitoring has boiled down to two types of telemetry data: log data and time series statistics. The time series data is also known as metrics; to make sense of a metric, you need to view it over a period of time.

However, as we broke the software into tiny, independently operated services and distributed those fragmented services, the logs and metrics we captured told us very little about what was happening on the critical path.

Understanding the critical path is most important, as this is what the customer experiences. Looking at a single stack trace or watching CPU and memory utilization on predefined graphs and dashboards is insufficient. As software scales not just in depth but in breadth, telemetry data like logs and metrics alone don't provide the clarity you need to quickly identify production problems.

Diagram: Monitoring Observability. Source is Bravengeek

 

Introduction to Microservices Monitoring Categories

We have several different categories to consider. For microservices monitoring and Observability, you must first address your infrastructure, such as network devices, hypervisors, servers, and storage. Then, you should manage your application performance and health.

Then, you need to monitor and manage network quality and optimize where possible. For each category, you must consider white box and black box monitoring and potentially introduce new tools such as Artificial Intelligence (AI) for IT operations (AIOps).

A preventive approach to microservices monitoring: AI and ML

When choosing microservices observability software, consider a preventive approach over the reactive one better suited to traditional environments. Preventive approaches to monitoring can use historical health and performance telemetry as an early warning system, with the help of Artificial Intelligence (AI) and Machine Learning (ML) techniques.

White box monitoring offers more detail than black box monitoring, which tells you something is broken without telling you why. White box monitoring details the why, but you must ensure the data is easily consumable.

 

Diagram: White box monitoring and black box monitoring.

 

With predictable failures and known failure modes, black box microservices monitoring can help. Still, with the creative ways that applications and systems fail today, we need to examine the details of white-box microservices monitoring. Complex applications fail in unpredictable ways, often termed black holes.

Distributing your software presents new types of failure, and these systems can fail in creative ways and become more challenging to pin down. The service you’re responsible for may be receiving malformed or unexpected data from a source you don’t control because a team manages that service halfway across the globe.

White box monitoring: Exploring failures

White box monitoring relies on a different approach from black box monitoring. It uses a technique called Instrumentation that exposes details about the system's internals, helping you explore these black holes and better understand the creative ways in which applications fail today.

 

Microservices Observability: Techniques

Collection, storage, and analytics

Regardless of whether you are monitoring the infrastructure or the application service, monitoring requires three inputs, more than likely across three domains. We require:

    1. Data collection
    2. Storage, and 
    3. Analysis.

We need to look at metrics, traces, and logs for these three domains, or, let's say, components. Of these three, trace data is the most beneficial and an excellent way to isolate performance anomalies in distributed applications. Trace data falls under the umbrella of distributed tracing, which enables flexible consumption of captured traces.

What you need to do: The four golden signals 

First, you must establish a baseline comprising the four golden signals – latency, traffic, errors, and saturation. The golden signals are good indicators of health and performance and apply to most components of your environment, such as the infrastructure, applications, microservices, and orchestration systems.
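To make this concrete, here is a minimal sketch of instrumenting the four golden signals with the Prometheus Python client (prometheus_client). The metric names and the /checkout path are illustrative assumptions, not a prescribed schema:

```python
import time
from prometheus_client import Counter, Gauge, Histogram

# Traffic: the number of requests being made.
REQUESTS = Counter("http_requests_total", "Total requests", ["path"])
# Errors: the rate of failing requests.
ERRORS = Counter("http_errors_total", "Failed requests", ["path"])
# Latency: how long it takes to serve a request.
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["path"])
# Saturation: how utilized the service is (approximated here by in-flight requests).
IN_FLIGHT = Gauge("http_in_flight_requests", "Concurrent requests")

def handle_request(path: str) -> None:
    REQUESTS.labels(path=path).inc()
    IN_FLIGHT.inc()
    start = time.time()
    try:
        pass  # the real request handler would run here
    except Exception:
        ERRORS.labels(path=path).inc()
        raise
    finally:
        LATENCY.labels(path=path).observe(time.time() - start)
        IN_FLIGHT.dec()

handle_request("/checkout")  # hypothetical endpoint
```

Once a baseline exists for these four series, deviations from it become the trigger for alerts, rather than arbitrary static thresholds.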

 

Diagram: Application performance monitoring tools.

 

A quick recommendation: Alerts and SLIs

I recommend automating this baseline, along with automated alerts for deviations from it. The problem is that if you collect too much, you may alert on too much. Service Level Indicators (SLIs) can help you find what is best to alert on and what matters most to the user experience.

A key point: Distributed tracing

Navigate real-time alerts

Leveraging distributed tracing for directed troubleshooting lets users dig deep when a performance-impacting event occurs. No matter where an issue arises in your environment, you can navigate from real-time alerts directly to application traces and correlate performance trends between infrastructure, Kubernetes, and your microservices. Distributed tracing is essential to monitoring, debugging, and optimizing distributed software architectures, especially dynamic microservices architectures.

 

The Effect on Microservices: Microservices Monitoring

When considering a microservices application, many regard each microservice as independent, but this is nothing more than an illusion. These microservices are highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices.

A typical architecture may include a backend service and a front-end service, perhaps defined together in a docker-compose file. So, at a minimum, several containers must communicate to carry out operations.

For a simple microservices architecture, we might have a front end mimicking a distributed application setup, where a microservice serving static content sits at the front while the heavy lifting is done by another service.

Monolith and microservices monitoring.

We have more components to monitor than we had in the monolithic world. With a traditional monolith, there were only two components to monitor: the applications and the hosts.

Compared to the cloud-native world, we have containerized applications orchestrated by Kubernetes with multiple components requiring monitoring. We have, for example, the hosts, the Kubernetes platform itself, the Docker containers, and the containerized microservices.

Distributed systems have different demands.

Today, distributed systems are the norm, placing different demands on your infrastructure than the classic, three-tier application. Pinpointing issues in a microservices environment is more challenging than with a monolithic one, as requests traverse both between different layers of the stack and across multiple services. 

The Challenges: Microservices

The very things we love about microservices, independence and idempotence, also make them difficult to understand, especially when things go wrong. As a result, these systems are often referred to as deep systems, not because of their width but because of their complexity.

We can no longer monitor our applications by using a script to access the application over the network every few seconds and report any failures, or by using a custom script to check the operating system to understand when a disk is running out of space.

Saturation is an important signal, but it's just one of them. It quickly becomes unrealistic for a single human, or even a group, to understand enough of the services in the critical path of even a single request and continue maintaining it.

Node Affinity or Taints

Microservices-based applications are typically deployed on containers that are dynamic and transient. This creates an unpredictable environment: pods get deployed and run wherever the scheduler places them unless specific intent is expressed using affinity or taints. Even then, there can still be unpredictability in pod placement. The unpredictable nature of pod deployment and the depth of configuration can lead to complex troubleshooting.

 

The Beginnings of Distributed Tracing

Open Tracing

So, when you are ready to get started with distributed tracing, you will come across OpenTracing. OpenTracing is a set of standards that are exposed as frameworks. So, it’s a vendor-neutral API and Instrumentation for distributed tracing. 

OpenTracing does not give you the library itself; rather, it provides a set of rules and conventions that other libraries can adopt, so you can swap different libraries in and out and expect the same behavior.

Diagram: Distributed Tracing Example. Source is Simform

 

Microservices architecture example

Let’s examine an example of the request library for Python. So we have Requests, an elegant and simple HTTP library for Python. The request library talks to HTTP and will rely on specific standards; the standard here will be HTTP. So in Python, when making a “requests.get”.

The underlying library implementation will do a formal HTTP request using the GET method. So, the HTTP standards and the HTTP specs lay the ground rules of what is expected from the client and the server.
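As a quick illustration (the URL is a placeholder):

```python
import requests

# requests.get performs a formal HTTP request using the GET method,
# following the ground rules laid out by the HTTP standard.
response = requests.get("https://example.com/api/items")
print(response.status_code)              # e.g., 200
print(response.headers["Content-Type"])  # negotiated per the HTTP spec
```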

OpenTracing

So, the OpenTracing project does the same thing. It sets out the ground rules for distributed tracing, regardless of the implementation and the language used. It has libraries available in several languages: Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, and C++.

For example, the OpenTracing API for Python gives you the interface for using OpenTracing from Python. It is the set of standards for tracing with Python, providing examples of what the Instrumentation should look like and common ways to start a trace.
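A minimal sketch with the opentracing package is shown below; note that the global tracer is a no-op by default, and a concrete implementation (such as a Jaeger client) would be registered at startup. The operation and tag names are hypothetical:

```python
import opentracing

tracer = opentracing.global_tracer()  # no-op unless a real tracer is registered

with tracer.start_active_span("fetch-user") as scope:
    scope.span.set_tag("user.id", "42")        # metadata on the span
    with tracer.start_active_span("query-db"):  # child span in the same trace
        pass  # the database call would go here
```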

 

Video: Distributed Tracing

We generally have two types of telemetry data. We have log data and time-series statistics. The time-series data is also known as metrics in a microservices environment. The metrics, for example, will allow you to get an aggregate understanding of what’s happening to all instances of a given service.

Then we have logs, on the other hand, which provide highly fine-grained detail on a given service but have no built-in way to provide that detail in the context of a request. Due to how distributed systems fail, you can't use metrics and logs alone to discover and address all of your problems. We need a third piece of the puzzle: distributed tracing.

 

Distributed Tracing Explained

 

Connect the dots with distributed tracing

And this is the big difference between tracing and logging, and why you would use tracing. Tracing allows you to connect the dots from one end of the application to the other. So, if you start a request on the front end and want to see how it is handled on the backend, you can: a trace, with its connected child spans, provides that representation.

Visual Representation with Jaeger

You may want to use Jaeger for the visual representation. Jaeger is an open-source, end-to-end distributed tracing tool that provides a visual representation of traces, allowing you to monitor and troubleshoot transactions in complex distributed systems.

So, we have a dashboard where we can interact with and search for traces. Jaeger addresses problems such as distributed transaction monitoring, performance and latency optimization, root cause analysis, service dependency analysis, and distributed context propagation. Jaeger has different clients for different languages.

So, for example, if you are using Python, there will be client library features for Python. 
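For example, a hedged sketch of initializing the Jaeger Python client (the jaeger-client package) might look like this; the service name and sampler settings are illustrative:

```python
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample every trace (dev only)
        "logging": True,
    },
    service_name="frontend",  # hypothetical service name
)
tracer = config.initialize_tracer()  # also registers itself as the global tracer

with tracer.start_active_span("render-home-page"):
    pass  # traced work goes here

tracer.close()  # flush buffered spans to the Jaeger backend before exit
```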

OpenTelemetry

We also have OpenTelemetry, which is similar. It is described as an observability framework for cloud-native software and is in beta across several languages. It is geared towards traces, metrics, and logs, so it does more than OpenTracing.
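A brief sketch with the OpenTelemetry Python SDK is shown below; the console exporter is used purely for illustration, and in practice you would export to a backend such as Jaeger:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.size", 3)  # hypothetical attribute
```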

 

Diagram: Distributed tracing and scalable microservices.

 

Introduction to Microservices Observability

We know that Observability means that the internal states of a system can be inferred from its external outputs. Therefore, the tools used to complete an Observability system help understand the relationships between causes and effects in distributed systems.

The term Observability is borrowed from control theory. It suggests a holistic, data-centric view of microservices monitoring that enables exploration capabilities and the identification of unknown failures, alongside the more traditional anomaly detection and notification mechanisms.

Goal: The ultimate goal of Observability is to:

  • Improve baseline performance
  • Restore baseline performance (after a regression)

By improving the baseline, you improve the user experience. For user-facing applications, performance often means request latency. Then, we have regressions in performance, including application outages, which can result in lost revenue and negatively impact the brand. The regression time that is acceptable comes down to user expectations: what is acceptable, and what is in the SLA?

Chaos engineering

Chaos Engineering tests help you understand your limits and discover new places where your system and applications can break. Chaos Engineering helps you know your system by introducing controlled experiments when debugging microservices.

 

 Video: Chaos Engineering

This educational tutorial begins with guidance on how applications have changed from the monolithic style to the microservices-based approach and how this has affected failures. I will then introduce how this can be addressed by knowing exactly how your application and infrastructure perform under stress and where their breaking points are.

Chaos Engineering: How to Start A Project

 

Microservices Observability Pillars

So, to fully understand a system's internal state, we need tools, some of which are old and others new. These tools are known as the pillars of Observability: a combination of logs, metrics, and distributed tracing. These tools must be combined to understand internal behavior and fully satisfy the definition of observability.

Data must be collected continuously across all Observability domains to fully understand the symptoms and causes.

A key point: Massive amount of data

Remember that instrumenting potentially generates massive amounts of data, which can cause challenges in storage and analysis. You must collect, store, and analyze data across the metrics, traces, and logs domains. And then, you need to alert on these domains and on what matters most, not just when an arbitrary threshold is met.

The role of metrics

A metric is known to most, comprising a value, timestamp, and metadata. Metrics are collections of statistics that need to be analyzed over time. A single instance of a metric is of limited value. Examples include request rate, average duration, and queue size. These values are usually captured as time series so that operators can see and understand changes to metrics over time. 

Add labels to metrics.

To better understand metrics, we can add labels as key-value pairs. The labels add additional context to this data point. So, the label is a key-value pair indexed with the metrics as part of the injection process. In addition, metrics can now be broken down into sub-metrics.

As we enter the world of labels and tags for metrics, we need to understand the effect this may have on Cardinality. Each indexed label value adds a time series, which comes at a storage and processing cost. Therefore, we use Cardinality to understand the impact of labels on a metric store.
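A short sketch with the Prometheus Python client shows the trade-off; the metric and label names are illustrative:

```python
from prometheus_client import Counter

REQUESTS = Counter(
    "api_requests_total",
    "Total API requests",
    ["method", "status"],  # low-cardinality labels: a handful of values each
)

REQUESTS.labels(method="GET", status="200").inc()
REQUESTS.labels(method="POST", status="500").inc()

# Anti-pattern: an unbounded label value such as a user ID would create a new
# time series per user and explode the Cardinality of the metric store.
# REQUESTS.labels(method="GET", status="200", user_id="8f3a...")
```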

Diagram: Observability: The issue with metrics.

 

Aggregated metrics

The issue I continue to see is with how often metrics are aggregated: typically every minute, sometimes six to twelve times per minute. Metrics must be aggregated and visualized within at most one minute, but ideally even more quickly. Key questions are: What is the window across which values are aggregated? How are the windows from different sources aligned?

A key point: The issues of Cardinality

Aggregated Metrics allow you to get an aggregate understanding of what’s happening to all instances of a given service and even narrow your query to specific groups of services but fail to account for infinite Cardinality. Due to issues with “high-cardinality” within a time series storage engine, it is recommended to use labels rather than hierarchical naming for metrics.

 

Prometheus Monitoring and Prometheus Metric Types

Examples: Push and Pull

So, to get metrics, you need either a push or a pull approach. A push agent transmits data upstream, more than likely on a schedule. A pull agent expects to be polled. Then, we have Prometheus and its several metric types. The Prometheus server takes a pull approach, which fits better in larger environments.

Prometheus does not use the term agent and has what is known as exporters. They allow the Prometheus server to pull metrics back from software that cannot be instrumented using the Prometheus client libraries.
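The pull model in practice: the process exposes an HTTP /metrics endpoint, and the Prometheus server scrapes it on its own schedule. A minimal sketch (the port and metric are assumptions):

```python
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("worker_queue_depth", "Items waiting in the queue")

start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
while True:
    QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real reading
    time.sleep(5)
```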

Prometheus and Kubernetes

Prometheus is an open-source monitoring platform that originated at SoundCloud and was released in 2012. Its capabilities include metric collection, storage, data analysis, and visualization. We can use Prometheus with Grafana for the visualizations.

Storing Metrics

You can store metrics, which are time-series data, in a general-purpose relational database. However, they should be stored in a repository optimized for storing and retrieving time-series data. We have several time-series storage options, such as Atlas, InfluxDB, and Prometheus. Prometheus is the one that stands out, but keep in mind that, as far as I'm aware, there is no commercial support and only limited professional services for Prometheus.

Diagram: Prometheus and Grafana.

The Role of Logs

Then, we have logs, which can be highly detailed. Logs can be anything, unlike metrics, which have a fairly uniform format. However, logs do tell you why something is broken. Logs capture activity that can be printed to the screen or sent to a backend to be centrally stored and viewed.

There is very little standard structure to logs apart from a timestamp indicating when the event occurred. There is minimal log schema, and log structure will depend on how the application uses it and how developers create logs.
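One way to tame that free-form structure is to emit logs as JSON so a backend can parse them. Here is a hedged sketch using only the standard library, with illustrative field names:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),       # the one near-universal field: a timestamp
            "level": record.levelname,
            "service": "payments",   # hypothetical service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()  # stdout, where a log shipper can pick it up
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge authorized")  # emits one machine-parsable JSON line
```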

Emitting Logs

Logs are emitted by almost every entity: the basic infrastructure, network and storage, servers and compute nodes, operating systems, and application software. The variety of log sources, and the several tools involved in transporting and interpreting them, make log collection a complex task. And remember that you can assume a large amount of log data will need to be stored.

Search engines such as Google have developed several techniques for searching extensive datasets using arbitrary queries, which have proved very efficient; all of these can be applied to log data.

Logstash, Beats, and FluentD

Logstash is a cloud-scale ingestion tool and is part of the Elasticsearch suite. However, there have been concerns about the performance and scalability of Logstash, which brings us to its lightweight alternative, Beats. So, if you don't need the sophisticated data manipulation and filtering of Logstash, you can use Beats. FluentD provides a unified logging layer, a way to aggregate logs from many different sources and distribute them to many destinations, with the ability to transform data.

Storing Logs

Structured data such as logs and events is made of key-value pairs, any of which may be searched. This leads us to repositories called nonrelational, or NoSQL, databases. Storing logs therefore presents a different storage problem from storing metrics. Examples of KV databases include Memcached and Redis.

However, they are not a good choice for log storage due to the inefficiency of indexing and searching. The ELK stack, with its indexing and searching engine (Elasticsearch), its collector (Logstash), and its visualization tool (Kibana), is the dominant storage mechanism for log and event data.

A key point: Analyze logs with AI

So, once you store the logs, they need to be analyzed and viewed. Here, you could, for example, use Splunk. Its data analysis capabilities range from security to AI for IT operations (AIOps). Kibana, which is part of the Elastic Stack, can also be used.

 

Introducing Distributed Tracing

Distributed tracing is used in microservices and other distributed applications because a single operation touches many services. Distributed tracing is a type of correlated logging that helps you gain visibility into the workings of a distributed software system. It consists of collecting request data from the application and then analyzing and visualizing this data as traces.

Tracing data, in the form of spans, must be collected from the application, transmitted, and stored so that complete requests can be reconstructed. This is useful for performance profiling, debugging in production, and root cause analysis of failures and other incidents.

A key point: The value of distributed tracing

Distributed tracing allows you to understand what a particular service is doing as part of the whole, providing visibility into the operation of your microservices architecture. The trace data you generate can display the overall shape of your distributed system and show individual service performance inside a single request.

Diagram: Distributed tracing.

 

Distributed tracing components 

  1. What is a trace?

Consider your software in terms of requests. Each component of your software stack works in response to a request or a remote procedure call from another service. So, we have a trace encapsulating a single operation within the application, end to end, and represented as a series of spans. 

Each traceable unit of work within the operation generates a span. There are two ways you can get trace data: it can be generated through the Instrumentation of your service processes, or by transforming existing telemetry data into trace data.

  2. Introducing a SPAN

We call each service's work a span, as in the period it takes for the work to occur. These spans can be annotated with additional information, such as attributes, tags, or logs. So, a combination of metadata and events can be added to spans, creating effective spans that unlock insights into the behavior of your service. The span data produced by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed, and stored for further insights.
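Continuing the earlier OpenTracing sketch, annotating a span with tags and log events might look like this (the tag and event names are hypothetical):

```python
import opentracing

tracer = opentracing.global_tracer()

with tracer.start_active_span("charge-card") as scope:
    scope.span.set_tag("customer.tier", "gold")          # metadata about the whole span
    scope.span.log_kv({"event": "retry", "attempt": 2})  # a timestamped event within it
```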

Summary: Monitoring Microservices

Monitoring microservices has become a critical aspect of maintaining the performance and reliability of modern applications. With the increasing adoption of microservices architecture, understanding how to monitor and manage these distributed systems effectively has become indispensable. In this blog post, we explored the key considerations and best practices for monitoring microservices.

Section 1: The Need for Comprehensive Monitoring

Microservices are highly distributed and decentralized, which poses unique monitoring challenges. Traditional monolithic applications are easier to monitor, but microservices require a different approach. Understanding the need for comprehensive monitoring is the first step toward ensuring the reliability and performance of your microservices-based applications.

Section 2: Choosing the Right Monitoring Tools

This section will delve into the various monitoring tools available for monitoring microservices. From open-source solutions to commercial platforms, there is a wide range of options. We will discuss the critical criteria for selecting a monitoring tool: scalability, real-time visibility, alerting capabilities, and integration with existing systems.

Section 3: Defining Relevant Metrics

To effectively monitor microservices, it is essential to define relevant metrics that provide insights into the health and performance of individual services as well as the overall system. In this section, we will explore the key metrics to monitor, including response time, error rates, throughput, resource utilization, and latency. We will also discuss the importance of setting appropriate thresholds for these metrics to trigger timely alerts.

Section 4: Implementing Distributed Tracing

Distributed tracing plays a crucial role in understanding the flow of requests across microservices. By instrumenting your services with distributed tracing, you can gain visibility into the entire request journey and identify bottlenecks or performance issues. We will explore the benefits of distributed tracing and discuss popular tracing frameworks like Jaeger and Zipkin.

Section 5: Automating Monitoring and Alerting

Keeping up with the dynamic nature of microservices requires automation. This section will discuss the importance of automated monitoring and alerting processes. From automatically discovering new services to scaling monitoring infrastructure, automation plays a vital role in ensuring the effectiveness of your monitoring strategy.

Conclusion:

Monitoring microservices is a complex task, but with the right tools, metrics, and automation in place, it becomes manageable. By understanding the unique challenges of monitoring distributed systems, choosing appropriate monitoring tools, defining relevant metrics, implementing distributed tracing, and automating monitoring processes, you can stay ahead of potential issues and ensure optimal performance and reliability for your microservices-based applications.


Observability vs Monitoring

In today's fast-paced digital landscape, where complex systems and applications drive businesses, it's crucial to have a clear understanding of observability and monitoring. These two terms are often used interchangeably, but they represent distinct concepts in the realm of system management and troubleshooting. In this blog post, we will delve into the differences between observability and monitoring, shedding light on their unique features and benefits.

What is Observability? Observability refers to the ability to gain insight into the internal state of a system through its external outputs. It focuses on understanding the behavior and performance of a system from an external perspective, without requiring deep knowledge of its internal workings. Observability provides a holistic view of the system, enabling comprehensive analysis and troubleshooting.

The Essence of Monitoring: Monitoring, on the other hand, involves the systematic collection and analysis of various metrics and data points within a system. It primarily focuses on tracking predefined performance indicators, such as CPU usage, memory utilization, and network latency. Monitoring provides real-time data and alerts to ensure that system health is maintained and potential issues are promptly identified.

Data Collection and Analysis: Observability emphasizes comprehensive data collection and analysis, aiming to capture the entire system's behavior, including its interactions, dependencies, and emergent properties. Monitoring, however, focuses on specific metrics and predefined thresholds, often using predefined agents, plugins, or monitoring tools.

Contextual Understanding: Observability aims to provide a contextual understanding of the system's behavior, allowing engineers to trace the flow of data and understand the cause and effect of different components. Monitoring, while offering real-time insights, lacks the contextual understanding provided by observability.

Reactive vs Proactive: Monitoring is primarily reactive, alerting engineers when predefined thresholds are exceeded or when specific events occur. Observability, on the other hand, enables a proactive approach, empowering engineers to explore and investigate the system's behavior even before issues arise.

In conclusion, observability and monitoring are both crucial elements in system management, but they have distinct focuses and approaches. Observability provides a holistic and contextual understanding of the system's behavior, allowing for comprehensive analysis and proactive troubleshooting. Monitoring, on the other hand, offers real-time data and alerts based on predefined metrics, ensuring system health is maintained. Understanding the differences between these two concepts is vital for effectively managing and optimizing complex systems.

Highlights: Observability vs Monitoring

Observability: The First Steps

The first step towards achieving modern observability is to gather metrics, traces, and logs. From the collected data points, observability aims to generate valuable outcomes for decision-making. The decision-making process goes beyond resolving problems as they arise. Next-generation observability goes beyond application remediation, focusing on creating business value to help companies achieve their operational goals. This decision-making process can be enhanced by incorporating user experience, topology, and security data.

Observability Platform

A full-stack observability platform monitors every host in your environment. Depending on the technologies used, an average of 500 metrics are generated per computational node. AWS, Azure, Kubernetes, and VMware Tanzu are some of the platforms from which observability tooling collects important key performance metrics for services and real-user-monitored applications.

Within a microservices environment, there can be dozens, if not hundreds, of microservices calling one another. Distributed tracing can help you understand how the different services connect and how your requests flow through them. 

The three pillars of observability form a strong foundation for making data-driven decisions, but there are opportunities to extend observability. User experience and security details must be considered to gain a deeper understanding. A holistic, context-driven approach to advanced observability enables proactively addressing potential problems before they arise.

The Role of Monitoring

To understand the difference between observability and monitoring, we need first to discuss the role of monitoring. Monitoring is the evaluation that helps identify the most practical and efficient use of resources. So, the big question I put to you is what to monitor. This is the first step to preparing a monitoring strategy.

You can ask yourself a couple of questions to fully understand if monitoring is enough or if you need to move to an observability platform. Firstly, you should consider what you should be monitoring, why you should be monitoring it, and how you should be monitoring it. 

Options: Open source or commercial

Knowing this lets you move into the different tools and platforms available. Some of these tools will be open source, and others commercial. When evaluating these tools, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos are breaking agility in every form of technology.

For pre-information, you may find the following posts helpful:

  1. Microservices Observability
  2. Auto Scaling Observability
  3. Network Visibility
  4. WAN Monitoring
  5. Distributed Systems Observability
  6. Prometheus Monitoring
  7. Correlate Disparate Data Points
  8. Segment Routing



Monitoring vs Observability

Key Observability vs Monitoring Discussion points:


  • The difference between Monitoring vs Observability. 

  • Google’s four Golden signals.

  • The role of metrics, logs and alerts.

  • The need for Observability.

  • Observability and Monitoring working together.

Back to Basics with Observability vs Monitoring

Monitoring and distributed systems

By utilizing distributed architectures, the cloud native ecosystem allows organizations to build scalable, resilient, and novel software architectures. However, the ever-changing nature of distributed systems means that previous approaches to monitoring can no longer keep up. The introduction of containers made the cloud flexible and empowered distributed systems.

Nevertheless, the ever-changing nature of these systems can cause them to fail in many ways. Distributed systems are inherently complex, and, as systems theorist Richard Cook notes, “Complex systems are intrinsically hazardous systems.”

Cloud-native systems require a new approach to monitoring, one that is open-source compatible, scalable, reliable, and able to control massive data growth. However, cloud-native monitoring can’t exist in a vacuum: it needs to be part of a broader observability strategy.

Diagram: Observability vs monitoring.

Key Features of Observability:

1. High-dimensional data collection: Observability involves collecting a wide variety of data from different system layers, including metrics, logs, traces, and events. This comprehensive data collection provides a holistic view of the system’s behavior.

2. Distributed tracing: Observability allows tracing requests as they flow through a distributed system, enabling engineers to understand the path and identify performance bottlenecks or errors.

3. Contextual understanding: Observability emphasizes capturing contextual information alongside the data, enabling teams to correlate events and understand the impact of changes or incidents.

Benefits of Observability:

1. Faster troubleshooting: By providing detailed insights into system behavior, observability helps teams quickly identify and resolve issues, minimizing downtime and improving system reliability.

2. Proactive monitoring: Observability allows teams to detect potential problems before they become critical, enabling proactive measures to prevent service disruptions.

3. Improved collaboration: With observability, different teams, such as developers, operations, and support, can have a shared understanding of the system’s behavior, leading to improved collaboration and faster incident response.

Monitoring:

On the other hand, monitoring focuses on collecting and analyzing metrics to assess the health and performance of a system. It involves setting up predefined thresholds or rules and generating alerts based on specific conditions.

Key Features of Monitoring:

1. Metric-driven analysis: Monitoring relies on predefined metrics collected and analyzed to measure system performance, such as CPU usage, memory consumption, response time, or error rates.

2. Alerting and notifications: Monitoring systems generate alerts and notifications when predefined thresholds or rules are violated, enabling teams to take immediate action.

3. Historical analysis: Monitoring systems provide historical data, allowing teams to analyze trends, identify patterns, and make informed decisions based on past performance.

Benefits of Monitoring:

1. Performance optimization: Monitoring helps identify performance bottlenecks and inefficiencies within a system, enabling teams to optimize resources and improve overall system performance.

2. Capacity planning: By monitoring resource utilization and workload patterns, teams can accurately plan for future growth and ensure sufficient resources are available to meet demand.

3. Compliance and SLA enforcement: Monitoring systems help organizations meet compliance requirements and enforce service level agreements (SLAs) by tracking and reporting on key metrics.

Observability and Monitoring: A Unified Approach:

While observability and monitoring differ in their approaches and focus, they are not mutually exclusive. When used together, they complement each other and provide a more comprehensive understanding of system behavior.

Observability enables teams to gain deep insights into system behavior, understand complex interactions, and troubleshoot issues effectively. Conversely, monitoring provides a systematic approach to tracking predefined metrics, generating alerts, and ensuring the system meets performance requirements.

Combining observability and monitoring can help organizations create a robust system monitoring and management strategy. This integrated approach empowers teams to detect, diagnose, and resolve issues quickly, improving system reliability, performance, and customer satisfaction.

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environment, which will be done with several tools. This will let you know what is affecting your application performance and infrastructure. As a good starting point, there are four golden signals: latency, traffic, errors, and saturation. These are Google's Four Golden Signals. The four most important metrics to keep track of are:

      1. Latency: How long it takes to serve a request
      2. Traffic: The number of requests being made.
      3. Errors: The rate of failing requests. 
      4. Saturation: How utilized the service is.

Now that we have some guidance on what to monitor, let us apply this to Kubernetes. For example, for a front-end web service that is part of a tiered application, we would be looking at the following:

      1. How many requests the front end is processing at a particular point in time,
      2. How many 500 errors users of the service are receiving, and 
      3. Whether requests are overutilizing the service.

We already know that monitoring is a form of evaluation that helps identify the most practical and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. Within this, we have metrics, logs, and alerts. Each has a different role and purpose.

Monitoring: The role of metrics

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values instead of unstructured text, such as documents and web pages. Metric data is typically also a time series, where values or measures are recorded over some time. 

Available bandwidth and latency are examples of such metrics. Understanding baseline values is essential. Without a baseline, you will not know if something is happening outside the norm.

What are the average baseline values for bandwidth and latency metrics? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? This may change over different days, weeks, and months.

If you notice a rise in these values during normal operations, this would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Remember that these values should not be gathered as a once-off but can be gathered over time to understand your application and its underlying infrastructure better.
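As a toy sketch of that idea, the check below flags a latency sample that deviates from a rolling baseline by more than three standard deviations; the window size and threshold are assumptions you would tune per metric:

```python
from collections import deque
import statistics

window = deque(maxlen=60)  # rolling baseline: the last 60 samples

def looks_abnormal(latency_ms: float) -> bool:
    """Return True if the sample deviates sharply from the baseline."""
    abnormal = False
    if len(window) >= 30:  # wait for enough history to form a stable baseline
        mean = statistics.mean(window)
        stdev = statistics.pstdev(window)
        abnormal = stdev > 0 and abs(latency_ms - mean) > 3 * stdev
    window.append(latency_ms)
    return abnormal
```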

Monitoring: The role of logs

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about events, which is important for troubleshooting or discovering the root cause of the events. Logs will have a lot more detail than metrics, so you will need some way to parse the logs or use a log shipper.

A typical log shipper will take these logs from the standard out in a Docker container and ship them to a backend for processing.

FluentD and Logstash each have pros and cons. A log shipper can be used here to send logs to a backend database, which could be the ELK stack (Elasticsearch). Using this approach, you can add different things to logs before sending them to the backend. For example, you can add GeoIP information. This adds richer information to the logs that can help you troubleshoot.

Monitoring: The role of alerting

Then we have alerting, and you need to strike a balance between how you monitor and what you alert on. We know that alerting is never perfect, and getting the right alerting strategy in place will take time. It's not a simple day-one installation; it requires much effort and cross-team collaboration.

You know that alerting on too much can cause alert fatigue. We are all too familiar with the problems alert fatigue can bring and the tensions it can create in departments.

To minimize this, consider Service Level Objectives (SLOs) for alerts. SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they are the foundation of a reliability stack. You should also consider alert thresholds: if these are too tight, you will get a lot of false positives on your alerts.
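To illustrate the difference, a minimal SLO-style check alerts only when the measured SLI drops below the objective rather than on any single error; the 99.9% availability objective here is an example value:

```python
SLO_AVAILABILITY = 0.999  # example objective: 99.9% of requests succeed

def sli_availability(success_count: int, total_count: int) -> float:
    """The SLI: the measured fraction of successful requests."""
    return success_count / total_count if total_count else 1.0

def should_alert(success_count: int, total_count: int) -> bool:
    return sli_availability(success_count, total_count) < SLO_AVAILABILITY

print(should_alert(99_950, 100_000))  # False: 99.95% meets the objective
print(should_alert(99_800, 100_000))  # True: 99.80% breaches it
```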

Monitoring is not enough.

Even with all of these in place, monitoring is not enough. Due to the sheer complexity of today's landscape, you need to think differently about the tools you use, and about how you use the intelligence and data you receive from them, to resolve issues before they become incidents.

The tool used to monitor is just a tool; it probably does not cross technical domains, and different groups of users will administer each tool without a holistic view. The tools alone can take you only halfway through the journey. What also needs to be addressed is the culture and the traditional way of working in silos. A siloed environment can hinder the monitoring strategy you want to implement. Here, you can look to an observability platform.

Observability vs Monitoring

When it comes to observability vs. monitoring, we know that monitoring can detect problems and tell you if a system is down. When your system is up, monitoring doesn't care; it only cares when there is a problem. The problem has to happen before monitoring takes action. It's very reactive. So, if everything is working, monitoring doesn't care.

On the other hand, we have an observability platform, which is a more proactive practice. It’s about what and how your system and services are doing. Observability lets you improve your insight into how complex systems work and quickly get to the root cause of any problem, known or unknown.

Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without first predicting. This is a proactive approach.

The pillars of observability

This is achieved by combining logs, metrics, and traces. So, we need data collection, storage, and analysis across these domains while also being able to perform alerting on what matters most. Let’s say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app.

The Observability platform pulls context from different sources of information, such as logs, metrics, events, and traces, into one central context. Distributed tracing adds a lot of value here.

Also, when everything is placed into one context, you can quickly switch between the necessary views to troubleshoot the root cause. Viewing these telemetry sources with one single pane of glass is an excellent key component of any observability system. 

Diagram: Distributed tracing in microservices.

Monitoring: known unknowns / Observability: unknown unknowns

Monitoring automatically reports whether known failure conditions are occurring or are about to occur. In other words, it is optimized for reporting on unknown conditions about known failure modes, which are referred to as known unknowns. In contrast, Observability is centered around discovering if and why previously unknown failure modes may be occurring, in other words, to find unknown unknowns.

The monitoring-based approach of metrics and dashboards is an investigative practice that relies on humans’ experience and intuition to detect and understand system issues. This is okay for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in unpredictable ways.

With modern applications, the complexity and scale of their underlying systems quickly make that approach unattainable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex. You don’t need to react to a hunch or have intimate system knowledge to generate a hunch.

Monitoring vs Observability: Working together?

Monitoring helps engineers understand infrastructure concerns, while observability helps engineers understand software concerns. So, Observability and Monitoring can work together. First, the infrastructure does not change too often, and when it fails, it will fail more predictably. So, we can use monitoring here.

This is in contrast to software system states, which change daily and are unpredictable. Observability fits this purpose. The conditions that affect infrastructure health change infrequently and are relatively easier to predict. We have several well-established practices for this, such as capacity planning and automatic remediation (e.g., auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues.

Monitoring and infrastructure problems

Due to its relatively predictable and slowly changing nature, the aggregated metrics approach monitors and alerts perfectly for infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached.

Next, we need to look at monitoring the software, where we need access to high-cardinality fields. This may include the user ID or a shopping cart ID. Code that is well-instrumented for Observability allows you to answer complex questions that are easy to miss when examining aggregate performance.

Observability and monitoring are essential practices in modern software development and operations. While observability focuses on understanding system behavior through comprehensive data collection and analysis, monitoring uses predefined metrics to assess performance and generate alerts. By leveraging both approaches, organizations can gain a holistic view of their systems, enabling proactive measures, faster troubleshooting, and optimal performance. Embracing observability and monitoring as complementary practices can pave the way for more reliable, scalable, and efficient systems in the digital era.

 

Summary: Observability vs Monitoring

As technology advances rapidly, understanding and managing complex systems becomes increasingly important. Two terms that often arise in this context are observability and monitoring. While they may seem interchangeable, they represent distinct approaches to gaining insights into system performance. In this blog post, we delved into observability and monitoring, exploring their differences, benefits, and how they can work together to provide a comprehensive understanding of system behavior.

Section 1: Understanding Monitoring

Monitoring is a well-established practice in the world of technology. It involves collecting and analyzing data from various sources to ensure the smooth functioning of a system. Monitoring typically focuses on key performance indicators (KPIs) such as response time, error rates, and resource utilization. Organizations can proactively identify and resolve issues by tracking these metrics, ensuring optimal system performance.

Section 2: Unveiling Observability

Observability takes a more holistic approach compared to monitoring. It emphasizes understanding the internal state of a system by leveraging real-time data and contextual information. Unlike monitoring, which focuses on predefined metrics, observability aims to provide a clear picture of how a system behaves under different conditions. It achieves this by capturing fine-grained telemetry data, including logs, traces, and metrics, which can be analyzed to uncover patterns, anomalies, and root causes of issues.

Section 3: The Benefits of Observability

One of the key advantages of observability is its ability to handle unexpected scenarios and unknown unknowns. Capturing detailed data about system behavior enables teams to investigate issues retroactively, even those that were not anticipated during the design phase. Additionally, observability allows for better collaboration between different teams, as the shared visibility into system internals facilitates more effective troubleshooting and faster incident resolution.

Section 4: Synergy between Observability and Monitoring

While observability and monitoring are distinct concepts, they are not mutually exclusive. They can complement each other to provide a comprehensive understanding of system performance. Monitoring can provide high-level insights into system health and performance trends, while observability can dive deeper into specific issues and offer a more granular view. By combining these approaches, organizations can achieve a proactive and reactive system management approach, ensuring stability and resilience.

Conclusion:

Observability and monitoring are two powerful tools in the arsenal of system management. While monitoring focuses on predefined metrics, observability takes a broader and more dynamic approach, capturing fine-grained data to gain deeper insights into system behavior. By embracing observability and monitoring, organizations can unlock a comprehensive understanding of their systems, enabling them to proactively address issues, optimize performance, and deliver exceptional user experiences.