In today’s rapidly evolving software development landscape, microservices architecture has become a popular choice for building scalable and resilient applications. However, as the complexity of these systems increases, so does the need for effective observability. In this blog post, we will explore the concept of microservices observability and why it is crucial to ensuring the stability and performance of modern software systems.
Microservices observability refers to the ability to gain insights into the behavior and performance of individual microservices, as well as the entire system as a whole. It involves collecting, analyzing, and visualizing data from various sources, such as logs, metrics, traces, and events, to comprehensively understand the system’s health and performance.
Highlights: Microservices Observability
- The Role of Microservices Monitoring
Microservices monitoring is suitable for known patterns that can be automated, while microservices observability is suitable for detecting unknown and creative failures. Microservices monitoring is a critical part of successfully managing a microservices architecture. It involves tracking each microservice’s performance to ensure there are no bottlenecks in the system and that the microservices are running optimally.
- Components of Microservices Monitoring
Additionally, microservices monitoring can detect anomalies and provide insights into the microservices architecture. There are several critical components of microservices monitoring, including:
– Metrics: Tracking metrics such as response time, throughput, and error rate. This information can be used to identify performance issues or bottlenecks.
– Logging: Tracking requests, errors, and exceptions. This can provide deeper insight into the performance of the microservices architecture.
– Tracing: Tracing provides a timeline of events within the system. This can be used to identify the source of issues or to track down errors.
– Alerts: Alerts notify administrators when certain conditions are met. For example, administrators can be alerted if a service is down or performance is degrading.
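As a minimal sketch of how the metrics and alerting components above might fit together, the check below flags any service whose response time or error rate crosses a threshold, or which is down entirely. The service names, metric fields, and threshold values are all illustrative, not from any particular monitoring product:

```python
# Illustrative snapshot of per-service metrics (values are made up).
metrics = {
    "checkout": {"avg_response_ms": 950, "error_rate": 0.07, "up": True},
    "catalog":  {"avg_response_ms": 120, "error_rate": 0.001, "up": True},
    "payments": {"avg_response_ms": 0,   "error_rate": 0.0,  "up": False},
}

def evaluate_alerts(metrics, max_response_ms=500, max_error_rate=0.05):
    """Return alert messages for services breaching simple thresholds."""
    alerts = []
    for service, m in metrics.items():
        if not m["up"]:
            alerts.append(f"{service}: service is down")
        elif m["avg_response_ms"] > max_response_ms:
            alerts.append(f"{service}: response time degrading")
        elif m["error_rate"] > max_error_rate:
            alerts.append(f"{service}: error rate too high")
    return alerts

alerts = evaluate_alerts(metrics)
assert alerts == ["checkout: response time degrading", "payments: service is down"]
```

A real system would evaluate these rules continuously against a metrics store rather than a static snapshot, but the shape of the logic is the same.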
Finally, it is essential to note that microservices monitoring is not just limited to tracking performance. It can also detect security vulnerabilities and provide insights into the architecture.
By leveraging microservices monitoring, organizations can ensure that their microservices architecture runs smoothly and that any issues are quickly identified and resolved. This can help ensure the organization’s applications remain reliable and secure.
- A key point: Video on microservices observability and SRE
In this video, we will discuss the importance of distributed systems and the need to fully understand them with practices like Chaos Engineering and Site Reliability Engineering (SRE). We will also discuss the issues with microservices monitoring and static thresholds.
Back to Basics: Containers and Microservices
Teams increasingly adopt new technologies as companies transform and modernize applications to leverage containers and microservices. IT infrastructure monitoring has always been complex, but it is even more challenging with the changing software architecture and the new technology needed to support it. In addition, many of your existing monitoring tools may not fully support modern applications and frameworks, especially when you throw in serverless and hybrid IT. All of this creates a larger gap in the management of application health and performance.
Containers can wrap up an application into its isolated package—everything the application needs to run successfully as a process is executed within the container. Kubernetes is an open-source container management tool that delivers an abstraction layer over the container to manage the container fleets, leveraging REST APIs.
Container-based technologies affect infrastructure management services, like backup, patching, security, high availability, disaster recovery, etc. Therefore, we must establish other monitoring and management technologies for containerization and microservices architecture. Prometheus is an example of a container monitoring tool that comes up as a go-to open-source monitoring and alerting solution.
Microservices are an architectural approach to software development that enables teams to create, deploy, and manage applications quickly. Microservices allow greater flexibility, scalability, and maintainability than traditional monolithic applications.
The microservices approach is based on building independent services that communicate with each other over an API. Each service is responsible for a specific business capability so that a single application can comprise many different services. This makes it easy to scale individual components and replace them with newer versions without affecting the rest of the application.
The Benefits of Microservices Observability:
Implementing a robust observability strategy brings several benefits to a microservices architecture:
1. Enhanced Debugging and Troubleshooting:
Microservices observability gives developers the tools and insights to identify and resolve issues quickly. By analyzing logs, metrics, and traces, teams can pinpoint the root causes of failures, reducing mean time to resolution (MTTR) and minimizing the impact on end-users.
2. Improved Performance and Scalability:
Observability enables teams to monitor the performance of individual microservices and identify areas for optimization. By analyzing metrics and tracing requests, developers can fine-tune service configurations, scale services appropriately, and ensure efficient resource utilization.
3. Proactive Issue Detection:
With comprehensive observability, teams can detect potential issues before they escalate into critical problems. By setting up alerts and monitoring key metrics, teams can proactively identify anomalies, performance degradation, or security threats, allowing for timely intervention and prevention of system-wide failures.
- A key point: Video on Microservices vs. Observability
We will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are unknown and unpredictable. Therefore, a pre-defined monitoring dashboard will have difficulty keeping up with the rate of change and unknown failure modes. For this, we should look to have the practice of observability for software and monitoring for infrastructure.
Microservices Monitoring and Observability
Containers, cloud platforms, scalable microservices, and the complexity of monitoring distributed systems have highlighted significant gaps in the microservices monitoring space, which has been static for some time. As a result, you must fully understand performance across the entire distributed and complex stack, including distributed traces across all microservices. To do this, you need a solution that can collect, process, and store the data used for monitoring. The data needs to cover several domains and then be combined and centralized for analysis.
This can be an all-in-one solution that bundles different components for application observability. For example, it could be an Application Performance Monitoring (APM) suite that consists of several application performance monitoring tools, or a single platform such as Prometheus, which lives in a world of metrics only.
Application Performance Monitoring
Application performance monitoring typically involves tracking the response time of an application, the number of requests it can handle, and the amount of memory or other system resources it uses. This data can be used to identify any issues with application performance or scalability. Organizations can take corrective action by monitoring application performance to improve the user experience and ensure their applications run as efficiently as possible.
Application performance monitoring also helps organizations better understand their users by providing insight into how applications are used and how well they are performing. In addition, this data can be used to identify trends and patterns in user behavior, helping organizations decide how to optimize their applications for better user engagement and experience.
The Need for Microservices Observability
When creating microservices, your application becomes more distributed, the coherence of failures decreases, and we live in a world of unpredictable failure modes. In addition, the distance between cause and effect increases. For example, an outage at your cloud provider’s blob storage could cause huge cascading latency for everyone. In today’s environment, we have new cascading problems.
Inconsistency and high independence
Distributed applications might be reliable, but the state of individual components can be much less consistent than in monolithic or non-distributed applications, which have elementary and well-known failure modes. In addition, each element of a distributed application is designed to be highly independent, and each component can be affected by different upstream and downstream components.
How do you look for service failures when a thousand copies of that service may run on hundreds of hosts? How do you correlate those failures so you can make sense of what’s going on?
Tools of the past: Logs and metrics
Traditionally, microservices monitoring has boiled down to two types of telemetry data: log data and time-series statistics. The time-series data is also known as metrics; to make sense of a metric, you need to view it over a period of time.
However, as we broke the software into tiny, independently operated services and distributed those fragmented services, the logs and metrics we captured told us very little of what was happening on the critical path.
Understanding the critical path is most important, as this is what the customer is experiencing. Looking at a single stack trace or watching CPU and memory utilization on predefined graphs and dashboards is insufficient. As software scales not just in depth but in breadth, telemetry data like logs and metrics alone don’t provide the clarity you need to quickly identify production problems.
Introduction to Microservices Monitoring Categories
We have several different categories to consider. For microservices monitoring and Observability, you must first address your infrastructure, such as your network devices, hypervisors, servers, and storage. Then, you should manage your application performance and health.
Next, you need to monitor and manage network quality and optimize where possible. For each category, you must consider white box and black box monitoring and potentially introduce new tools, such as Artificial Intelligence (AI) for IT operations (AIOps).
Preventive approach to microservices monitoring: AI and ML.
When choosing microservices observability software, consider a more preventive approach rather than a reactive one better suited to traditional environments. Preventive approaches to monitoring can use historical health and performance telemetry as an early warning, with the use of Artificial Intelligence (AI) and Machine Learning (ML) techniques.
White box monitoring offers more detail than black box monitoring, which tells you something is broken without telling you why. White box monitoring details the why, but you must ensure the data is easily consumable.
With predictable failures and known failure modes, black box microservices monitoring can help. Still, with the creative ways that applications and systems fail today, we need to examine the details of white-box microservices monitoring. Complex applications fail in unpredictable ways, often termed black holes.
Distributing your software presents new types of failure, and these systems can fail in creative ways and become harder to pin down. The service you’re responsible for may be receiving malformed or unexpected data from a source you don’t control because a team manages that service halfway across the globe.
White box monitoring: Exploring failures
White box monitoring relies on a different approach from black box monitoring. It uses a technique called Instrumentation that exposes details about the system’s internals to help you explore these black holes and better understand the creative ways in which applications fail today.
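To make Instrumentation concrete, here is a small sketch of a decorator that exposes internal details, such as call counts, errors, and cumulative duration, that black box monitoring could never see from the outside. The telemetry store, function names, and fields are illustrative, not any particular library’s API:

```python
import time
from functools import wraps

# Internal telemetry store exposed by the instrumented process.
telemetry = {"calls": 0, "errors": 0, "total_seconds": 0.0}

def instrument(func):
    """Record call counts, errors, and cumulative duration for func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        telemetry["calls"] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            telemetry["errors"] += 1
            raise
        finally:
            telemetry["total_seconds"] += time.perf_counter() - start
    return wrapper

@instrument
def handle_request(payload):
    if payload is None:
        raise ValueError("malformed input")
    return {"status": "ok"}

handle_request({"user": "a"})
try:
    handle_request(None)
except ValueError:
    pass

assert telemetry["calls"] == 2 and telemetry["errors"] == 1
```

In practice, libraries such as the Prometheus client do this for you, but the principle is the same: the service itself reports its internal state.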
Microservices Observability: Techniques
Collection, storage, and analytics
Regardless of what you are monitoring, the infrastructure or the application service, monitoring requires three inputs, more than likely across three domains. We require:
- Data collection,
- Storage, and
- Analytics.
We need to look at metrics, traces, and logs for these three domains or, let’s say, components. Of these three, trace data is the most beneficial and an excellent way to isolate performance anomalies in distributed applications. Trace data falls under the umbrella of distributed tracing, which enables flexible consumption of captured traces.
What you need to do: The four golden signals
First, you must establish a baseline comprising the four golden signals – latency, traffic, errors, and saturation. The golden signals are good indicators of health and performance and apply to most components of your environment, such as the infrastructure, applications, microservices, and orchestration systems.
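As a minimal sketch, all four golden signals can be derived from a window of request records. The record fields, window size, and capacity figure below are illustrative assumptions, not values from any real system:

```python
# One window of request records from a single service (values made up).
requests = [
    {"latency_ms": 45, "error": False},
    {"latency_ms": 310, "error": True},
    {"latency_ms": 80, "error": False},
    {"latency_ms": 55, "error": False},
]
window_seconds = 2
capacity_rps = 10  # assumed request rate the service can comfortably handle

traffic_rps = len(requests) / window_seconds                          # traffic
latency_avg = sum(r["latency_ms"] for r in requests) / len(requests)  # latency
error_rate = sum(r["error"] for r in requests) / len(requests)        # errors
saturation = traffic_rps / capacity_rps                               # saturation

assert traffic_rps == 2.0
assert latency_avg == 122.5
assert error_rate == 0.25
assert saturation == 0.2
```

A real baseline would track these per service over many windows, but the four signals themselves reduce to simple arithmetic over raw request telemetry.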
A quick recommendation: Alerts and SLIs
I recommend automating this baseline, along with automated alerts on deviations from the baseline. The problem is that if you collect too much, you may alert on too much. Service Level Indicators (SLIs) can help you find what is best to alert on and what matters to the user experience.
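As a sketch of the SLI idea, a latency-based SLI can be computed as the fraction of requests that meet a target, and alerted on only when it drops below the objective. The latency values, target, and objective here are illustrative:

```python
# Observed request latencies in one evaluation window (values made up).
latencies_ms = [40, 55, 38, 900, 42, 47, 61, 39, 44, 50]
target_ms = 100          # a request "counts" if it completes under this
slo_objective = 0.95     # objective: 95% of requests under the target

# SLI = fraction of good requests; alert only when the objective is missed.
sli = sum(1 for l in latencies_ms if l < target_ms) / len(latencies_ms)
should_alert = sli < slo_objective

print(sli)  # 9 of 10 requests met the target
assert sli == 0.9 and should_alert
```

Alerting on this single user-centric number, rather than on every raw metric, is what keeps the alert volume proportional to actual user impact.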
A key point: Distributed tracing
Navigate real-time alerts
Leveraging distributed tracing for directed troubleshooting gives users the ability to dig deep when a performance-impacting event occurs. No matter where an issue arises in your environment, you can navigate from real-time alerts directly to application traces and correlate performance trends between infrastructure, Kubernetes, and your microservices. Distributed tracing is essential to monitoring, debugging, and optimizing distributed software architectures such as microservices, especially dynamic ones.
The Effect on Microservices: Microservices Monitoring
When considering a microservices application, many think of each microservice as independent, but this is nothing more than an illusion. These microservices are highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices.
A typical architecture may include a backend service and a front-end service, perhaps tied together with a docker-compose file. So, at a minimum, several containers must communicate to carry out operations.
For a simple microservices architecture, we would have a simple front end mimicking a distributed application setup, where the microservice serving static content sits at the front end while the heavy lifting is done by the other service.
Monolith and microservices monitoring.
We have more components to monitor than we had in the monolithic world. With a traditional monolith, there were only two components to monitor: the application and the hosts.
In the cloud-native world, by contrast, we have containerized applications orchestrated by Kubernetes, with multiple components requiring monitoring: for example, the hosts, the Kubernetes platform itself, the Docker containers, and the containerized microservices.
Distributed systems have different demands.
Today, distributed systems are the norm, placing different demands on your infrastructure than the classic, three-tier application. Pinpointing issues in a microservices environment is more challenging than with a monolithic one, as requests traverse both between different layers of the stack and across multiple services.
The Challenges: Microservices
The things we love about microservices, their independence and idempotence, are also what make them difficult to understand, especially when things go wrong. As a result, these systems are often referred to as deep systems, not due to their width but their complexity.
We can no longer monitor the application by using a script that accesses it over the network every few seconds and reports any failures, or by using a custom script that checks the operating system to understand when a disk is running out of space.
Saturation is an important signal, but it is just one of them. It quickly becomes unrealistic for a single human, or even a group, to understand enough of the services in the critical path of even a single request and to keep that understanding up to date.
Node Affinity or Taints
Microservices-based applications are typically deployed on containers that are dynamic and transient. This leaves an unpredictable environment where the pods get deployed and run unless specific intent is expressed using affinity or taints. However, there can still be unpredictability with pod placement. The unpredictable nature of pod deployment and depth of configuration can lead to complex troubleshooting.
The Beginnings of Distributed Tracing
So, when you are ready to get started with distributed tracing, you will come across OpenTracing. OpenTracing is a set of standards exposed as frameworks: a vendor-neutral API and Instrumentation specification for distributed tracing.
It is not that OpenTracing gives you a library; rather, it is a set of rules and extensions that other libraries can adopt, so you can swap different libraries around and expect the same behavior.
Microservices architecture example
Let’s examine an example using the Requests library for Python, an elegant and simple HTTP library. The Requests library speaks HTTP and relies on specific standards; the standard here is HTTP. So, in Python, you make a call such as requests.get.
The underlying library implementation will perform a formal HTTP request using the GET method. The HTTP standards and the HTTP spec lay the ground rules for what is expected from the client and the server.
For example, the OpenTracing API for Python gives you an implementation of OpenTracing for use in Python. It is the set of standards for tracing with Python, and it provides examples of what the Instrumentation should look like and common ways to start a trace.
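To make the idea concrete, here is a toy, in-memory sketch of the span and tracer shape that an OpenTracing-style API describes. The class names, fields, and tracer behavior are illustrative, not the real opentracing package API:

```python
from contextlib import contextmanager
import time
import uuid

class Span:
    """A single unit of work, in the spirit of the OpenTracing span model."""
    def __init__(self, operation_name, parent=None):
        self.operation_name = operation_name
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None
        self.start_time = time.time()
        self.tags = {}
        self.finish_time = None

    def set_tag(self, key, value):
        self.tags[key] = value

    def finish(self):
        self.finish_time = time.time()

class Tracer:
    """A toy tracer that records finished spans in memory."""
    def __init__(self):
        self.finished_spans = []
        self._active = None

    @contextmanager
    def start_active_span(self, operation_name):
        span = Span(operation_name, parent=self._active)
        previous, self._active = self._active, span
        try:
            yield span
        finally:
            span.finish()
            self.finished_spans.append(span)
            self._active = previous

tracer = Tracer()
with tracer.start_active_span("http.request") as parent:
    parent.set_tag("http.method", "GET")
    with tracer.start_active_span("db.query") as child:
        child.set_tag("db.statement", "SELECT 1")

# The child span finished first and points back at its parent.
assert tracer.finished_spans[0].parent_id == tracer.finished_spans[1].span_id
```

The point of the standard is exactly this swap-ability: any library implementing the same interface can replace the toy tracer above without changing the instrumented code.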
- A key point: Video on Distributed Tracing
We generally have two types of telemetry data. We have log data and time-series statistics. The time-series data is also known as metrics in a microservices environment. The metrics, for example, will allow you to get an aggregate understanding of what’s happening to all instances of a given service.
Then we have logs, on the other hand, which provide highly fine-grained detail on a given service but have no built-in way to provide that detail in the context of a request. Due to how distributed systems fail, you can’t use metrics and logs alone to discover and address all of your problems. We need a third piece of the puzzle: distributed tracing.
A key point: Connect the dots with distributed tracing
And this is the big difference between tracing and logging. Tracing allows you to connect the dots from one end of the application to the other. So, if you start a request on the front end and want to see how it behaves on the backend, tracing shows you that. A trace, with its connected child spans, gives you that representation.
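The dots get connected by propagating trace context from service to service, typically in request headers. The sketch below simulates a front end calling a backend; the header names are illustrative, standing in for a real propagation format such as W3C Trace Context or B3:

```python
import uuid

def make_headers(trace_id, parent_span_id):
    """Encode trace context into headers, as a propagation format would.
    Header names here are illustrative, not a real standard."""
    return {"x-trace-id": trace_id, "x-parent-span-id": parent_span_id}

def frontend_service():
    """Starts a new trace and calls the backend with the trace context."""
    trace_id = uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:8]
    headers = make_headers(trace_id, span_id)
    return trace_id, backend_service(headers)

def backend_service(headers):
    """Continues the caller's trace instead of starting a new one."""
    return {
        "trace_id": headers["x-trace-id"],        # same trace as the frontend
        "parent_span_id": headers["x-parent-span-id"],
        "span_id": uuid.uuid4().hex[:8],          # new span for this service's work
    }

trace_id, backend_span = frontend_service()
assert backend_span["trace_id"] == trace_id  # both ends share one trace id
```

Because every service stamps its spans with the same trace id, a backend can later stitch them into a single end-to-end view of the request.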
Visual Representation with Jaeger
You may want to use Jaeger for the visual representation. Jaeger is an open-source, end-to-end distributed tracing tool that allows you to monitor and troubleshoot transactions in complex distributed systems.
So, we have a dashboard and can interact with it and search for traces. Jaeger addresses problems such as distributed transaction monitoring, performance and latency optimization, root cause analysis, service dependency analysis, and distributed context propagation. Jaeger has different clients for different languages.
So, for example, if you are using Python, there will be client library features for Python.
We also have OpenTelemetry, which is similar. It is described as an observability framework for cloud-native software and is in beta across several languages. It covers traces, metrics, and logs, so it does more than OpenTracing.
Introduction to Microservices Observability
We know that Observability means that the internal states of a system can be inferred from its external outputs. Therefore, the tools used to complete an Observability system help understand the relationships between causes and effects in distributed systems.
The term Observability is borrowed from control theory. It suggests a holistic, data-centric view of microservices monitoring that enables exploration capabilities and the identification of unknown failures, alongside the more traditional anomaly detection and notification mechanisms.
Goal: The ultimate goal of Observability is to:
- Improve baseline performance
- Restore baseline performance (after a regression)
By improving the baseline, you improve the user experience. For user-facing applications, performance often means request latency. Then, we have regressions in performance, including application outages, which can result in lost revenue and negatively impact the brand. The acceptable time to recover from a regression comes down to user expectations: what is acceptable, and what is in the SLA?
Chaos Engineering tests help you understand your limits and discover new ways your system and applications can break. Chaos Engineering helps you know your system by introducing controlled experiments when debugging microservices.
- A key point: Video on Chaos Engineering
This educational tutorial will start with guidance on how applications have changed from the monolithic style to the microservices-based approach, along with how this has affected failures. I will introduce how this can be solved by knowing exactly how your application and infrastructure perform under stress and what their breaking points are.
Microservices Observability Pillars
So, to fully understand a system’s internal state, we need tools, some of which are old and others new. These tools are known as the pillars of Observability: a combination of logs, metrics, and distributed tracing. These tools must be combined to fully understand internal behavior and fulfill the definition of observability.
Data must be collected continuously across all Observability domains to understand the symptoms and causes fully.
A key point: Massive amount of data
Remember that instrumentation potentially generates massive amounts of data, which can cause challenges in storing and analyzing it. You must collect, store, and analyze data across the metrics, traces, and logs domains. Then, you need alerting across these domains on what matters most, not just when an arbitrary threshold is met.
The role of metrics
A metric is known to most, comprising a value, timestamp, and metadata. Metrics are collections of statistics that need to be analyzed over time. A single instance of a metric is of limited value. Examples include request rate, average duration, and queue size. These values are usually captured as time series so that operators can see and understand changes to metrics over time.
Add labels to metrics.
To better understand metrics, we can add labels as key-value pairs. The labels add additional context to the data point. So, the label is a key-value pair indexed with the metric as part of the ingestion process. In addition, metrics can now be broken down into sub-metrics.
As we enter the world of labels and tags for metrics, we need to understand the effect this may have on Cardinality. Each indexed label value adds a time series, and this comes at storage and processing cost. Therefore, we use Cardinality to understand the impact of labels on a metric store.
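The cost is easy to see with a small calculation: every unique combination of label values becomes its own time series, so cardinality is the product of the value counts per label. The metric and label values below are illustrative:

```python
from itertools import product

# Hypothetical label values for a single metric, e.g. http_requests_total.
label_values = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
    "region": ["us-east", "eu-west"],
}

# Each unique combination of label values is a separate time series,
# so cardinality is the product of the value counts per label.
cardinality = 1
for values in label_values.values():
    cardinality *= len(values)

series = [dict(zip(label_values, combo))
          for combo in product(*label_values.values())]

print(cardinality)  # 3 * 3 * 2 = 18 time series for one metric
assert cardinality == len(series) == 18
```

This is why an unbounded label such as a user id or request id is dangerous: one label with millions of distinct values multiplies the series count by millions.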
The issue I continue to see is with aggregation windows: metrics are typically aggregated every minute, or six to twelve times per minute. Metrics must be aggregated and visualized within at most one minute, but ideally even more quickly. Key questions are: What is the window across which values are aggregated? How are the windows from different sources aligned?
A key point: The issues of Cardinality
Aggregated metrics allow you to get an aggregate understanding of what’s happening to all instances of a given service, and even to narrow your query to specific groups of services, but they fail to account for infinite Cardinality. Due to issues with “high cardinality” within a time-series storage engine, it is recommended to use labels rather than hierarchical naming for metrics.
Prometheus Monitoring and Prometheus Metric Types
Examples: Push and Pull
So, to collect metrics, you need either a push or a pull approach. A push agent transmits data upstream, more than likely on a scheduled basis. A pull agent expects to be polled. Then, we have Prometheus and the several Prometheus metric types. The Prometheus server uses a pull approach, which fits better into larger environments.
Prometheus does not use the term agent and has what is known as exporters. They allow the Prometheus server to pull metrics back from software that cannot be instrumented using the Prometheus client libraries.
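As a sketch of what the pull side actually scrapes, a Prometheus server reads a plain-text endpoint in the exposition format. The helper below renders that format from a list of metrics; the metric names and values are illustrative, and a real exporter would use the official client library instead:

```python
def render_exposition(metrics):
    """Render metrics in the Prometheus text exposition format:
    metric_name{label="value",...} value"""
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines)

# Illustrative metrics a service might expose on its /metrics endpoint.
metrics = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("http_request_duration_seconds_sum", {}, 53.2),
]

output = render_exposition(metrics)
print(output)
assert 'http_requests_total{method="GET",status="200"} 1027' in output
```

The pull model works because the scraped endpoint is this simple: any process that can serve a text page in this shape can be monitored.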
Prometheus is an open-source monitoring platform that originated at SoundCloud in 2012. Its capabilities include metric collection, storage, data analysis, and visualization. So, we can use Prometheus together with Grafana for the visualizations.
You can store metrics, which are time-series data, in a general-purpose relational database. However, they should be stored in a repository optimized for storing and retrieving time-series data. We have several time-series storage options, such as Atlas, InfluxDB, and Prometheus. Prometheus is the one that stands out, but keep in mind that, as far as I’m aware, there is no commercial support and only limited professional services for Prometheus.
The Role of Logs
Then, we have logs, which can be highly detailed. Logs can be anything, unlike metrics, which have a fairly uniform format. However, logs do provide you with the why behind something being broken. Logs capture activity that can be printed to the screen or sent to a backend to be centrally stored and viewed.
There is very little standard structure to logs, apart from a timestamp indicating when the event occurred. There is minimal log schema, and log structure will depend on how the application uses logging and how developers create logs.
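One common way to impose structure anyway is to emit each event as a single JSON object, so a backend can index and search the key-value pairs. This is a minimal sketch using Python’s standard logging module; the field names are illustrative, not a standard schema:

```python
import json
import logging

# Emit each log event as one JSON object so a backend can index it.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_event(level, message, **fields):
    """Serialize a structured event; extra fields become searchable keys."""
    record = {"level": level, "message": message, **fields}
    line = json.dumps(record)
    logger.info(line)
    return line

line = log_event("error", "payment failed", service="checkout",
                 trace_id="abc123", status_code=502)
parsed = json.loads(line)
assert parsed["service"] == "checkout" and parsed["status_code"] == 502
```

Shippers such as FluentD or Beats can forward lines like these unchanged, which is what makes downstream search and correlation tractable.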
Logs are emitted by almost every entity: the basic infrastructure, such as network and storage; servers and compute nodes; operating systems; and application software. So, there is a variety of log sources, and several tools are involved in transport and interpretation, making log collection a complex task. Remember, too, that a large amount of log data must be stored.
Search engines such as Google have developed several techniques for searching extensive datasets using arbitrary queries, which have proved very efficient. All of these can be applied to log data.
Logstash, Beats, and FluentD
Logstash is a cloud-scale ingestion tool and part of the Elasticsearch suite. However, there have been concerns with the performance and scalability of Logstash, which brings us to the lightweight alternative, Beats. So, if you don’t need the sophisticated data manipulation and filtering of Logstash, you can use Beats. FluentD provides a unified logging layer: a way to aggregate logs from many different sources and distribute them to many destinations, with the ability to transform data.
Structured data such as logs and events are made of key-value pairs, any of which may be searched. This leads us to repositories called nonrelational, or NoSQL, databases. Storing logs presents a different storage problem from that of metrics. Examples of key-value (KV) databases include Memcached and Redis.
However, they are not a good choice for log storage due to the inefficiency of indexing and searching them. The ELK stack has an indexing and searching engine (Elasticsearch), a collector (Logstash), and a visualization tool (Kibana), and it is the dominant storage mechanism for log and event data.
A key point: Analyze logs with AI
So, once you store the logs, they need to be analyzed and viewed. Here, you could, for example, use Splunk. Its data analysis capabilities range from security to AI for IT operations (AIOps). Kibana can also be used, which is part of the Elastic Stack.
Introducing Distributed Tracing
Distributed tracing is used in microservices and other distributed applications because a single operation can touch many services. Distributed tracing is a type of correlated logging that helps you gain visibility into the workings of a distributed software system. It consists of collecting request data from the application and then analyzing and visualizing this data as traces.
Tracing data, in the form of spans, must be collected from the application, transmitted, and stored so that complete requests can be reconstructed. This can be useful for performance profiling, debugging in production, and root cause analysis of failures or other incidents.
A key point: The value of distributed tracing
Distributed tracing allows you to understand what a particular service is doing as part of the whole, providing visibility into the operation of your microservices architecture. The trace data you generate can display the overall shape of your distributed system and show individual service performance inside a single request.
Distributed tracing components
- What is a trace?
Consider your software in terms of requests. Each component of your software stack works in response to a request or a remote procedure call from another service. So, we have a trace encapsulating a single operation within the application, end to end, represented as a series of spans.
Each traceable unit of work within the operation generates a span. There are two ways to get trace data: it can be generated through the Instrumentation of your service processes, or by transforming existing telemetry data into trace data.
- Introducing a SPAN
We call each service’s work a span, as in the span of time it takes for the work to occur. These spans can be annotated with additional information, such as attributes, tags, or logs. So, we can have a combination of metadata and events added to spans, creating effective spans that unlock insights into the behavior of your service. The span data produced by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed, and stored for further insights.
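That aggregation step can be sketched in a few lines: given finished spans collected from several services, the external process rebuilds the trace tree by following each span’s parent id. The span records and field names below are illustrative:

```python
# Finished spans collected from several services (fields are illustrative).
spans = [
    {"span_id": "a1", "parent_id": None, "name": "GET /checkout", "duration_ms": 120},
    {"span_id": "b2", "parent_id": "a1", "name": "auth.verify",   "duration_ms": 15},
    {"span_id": "c3", "parent_id": "a1", "name": "db.query",      "duration_ms": 80},
]

def build_trace(spans):
    """Rebuild the trace tree by linking each span to its parent."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    def attach(span):
        return {**span,
                "children": [attach(c) for c in children.get(span["span_id"], [])]}

    root = next(s for s in spans if s["parent_id"] is None)
    return attach(root)

trace = build_trace(spans)
assert trace["name"] == "GET /checkout"
assert [c["name"] for c in trace["children"]] == ["auth.verify", "db.query"]
```

Tools like Jaeger perform this reconstruction at scale and render the resulting tree as the familiar waterfall view of a request.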
Conclusion: Microservices observability is a critical aspect of modern software architecture. It empowers teams to gain insights into the behavior and performance of individual microservices and the system as a whole. By leveraging logging, metrics, and distributed tracing, developers can enhance debugging, improve performance, and proactively detect issues. In an era of increasingly complex software systems, prioritizing observability is essential to ensure the stability, reliability, and scalability of microservices architectures.