
Microservices Observability

 


 


Teams increasingly adopt new technologies as companies transform and modernize applications to take advantage of container- and microservices-based development. IT infrastructure monitoring has always been difficult to do well, but it is even more challenging with changing software architecture and the new technology needed to support it. In addition, many existing monitoring tools may not fully support modern applications and frameworks, especially once serverless and hybrid IT enter the picture. All of this widens the gap in managing application health and performance. That gap can be filled with the practice of microservices observability, using techniques such as distributed tracing across microservices. To continue, you may find it useful to understand the difference between observability and monitoring: microservices monitoring is good for known failure patterns that can be automated, while microservices observability is good for detecting unknown and creative failures. Practices such as Chaos Engineering on Kubernetes complement observability efforts by helping you understand your system through controlled experiments.

 



Microservices Monitoring

Key Microservices Observability Discussion Points:


  • The challenges with traditional monitoring.

  • Tools of the past, logs and metrics.

  • Why we need Observability.

  • The use of Distributed Tracing.

  • Observability pillars.

 

Microservices Monitoring and Observability

Containers, cloud platforms, scalable microservices, and the complexity of monitoring distributed systems have highlighted significant gaps in the microservices monitoring space, which had been static for some time. As a result, you must fully understand performance across the entire distributed and complex stack, including distributed traces across all microservices. To do this, you need a solution that can collect, process, and store the data used for monitoring. The data needs to cover several domains and then be combined and centralized for analysis.

This can be an all-in-one solution that bundles different components together for application observability, such as an Application Performance Monitoring (APM) suite consisting of application performance monitoring tools, or a single-purpose platform such as Prometheus, which lives in a world of metrics only.

Diagram: Observability: Microservices development.

 

The Need for Microservices Observability

Today’s challenges

1. Obfuscation

When creating microservices, your application becomes more distributed, the coherence of failures decreases, and we live in a world of unpredictable failure modes; the distance between cause and effect also increases. For example, an outage at your cloud provider’s blob storage could cause huge cascading latency for everyone. In today’s environment, we face new cascading problems.

 

2. Inconsistency and high independence

Distributed applications might be reliable, but the state of individual components can be much less consistent than in monolithic or non-distributed applications, which have well-known failure modes. In addition, each component of a distributed application is designed to be highly independent, yet each component can be affected by different upstream and downstream components.

 

3. Decentralization

How do you look for service failures when there may be a thousand copies of that service running on hundreds of hosts? How do you correlate those failures so you can make sense of what’s going on?

 

  • A key point: Video on microservices observability and SRE

In this video, we will discuss the importance of distributed systems and the need to fully understand them with practices like Chaos Engineering and Site Reliability Engineering (SRE). We will also discuss the issues with microservices monitoring and static thresholds.

 

 

 

Tools of the past: Logs and metrics

Traditionally, microservices monitoring has boiled down to two types of telemetry data: log data and time-series statistics. Time-series data is also known as metrics; to make sense of a metric, you need to view it over a period. However, as we broke software into tiny, independently operated services and distributed those fragments, the logs and metrics we captured told us very little about what was happening on the critical path.

Understanding the critical path is the most important part, as this is what the customer experiences. It’s not enough to look at a single stack trace or watch CPU and memory utilization on predefined graphs and dashboards. As software scales in depth as well as in breadth, telemetry data like logs and metrics alone doesn’t provide the clarity you need to identify problems in production quickly.

 

Diagram: The need for an Observability practice.

 

Introduction to Microservices Monitoring Categories

We have several categories to consider. For microservices monitoring and Observability, you first need to address your infrastructure, such as your network devices, hypervisors, servers, and storage. Then you should address application performance and health. Then you need monitoring specific to managing network quality and optimizing where you can. For each of these categories, consider white box and black box monitoring, and potentially introduce new tools such as Artificial Intelligence for IT operations (AIOps).

 

A preventive approach to microservices monitoring: AI and ML

When choosing microservices observability software, consider a preventive approach rather than the reactive one better suited to traditional environments. Preventive approaches to monitoring can use historical health and performance telemetry as an early warning, with the help of Artificial Intelligence (AI) and Machine Learning (ML) techniques. White box monitoring offers more detail than black box monitoring, which tells you that something is broken without telling you why. White box monitoring provides details on the why, but you must ensure the data is easily consumable.

 

Diagram: White box monitoring and black box monitoring.

 

With predictable failures and known failure modes, black box microservices monitoring can help, but with the creative ways applications and systems fail today, we need the detail of white box microservices monitoring. Complex applications fail in unpredictable ways, often termed black holes. Distributing your software introduces new types of failure, and these systems can fail in creative ways that are harder to pin down. The service you’re responsible for may be receiving malformed or unexpected data from a source you don’t control because that service is managed by a team halfway across the globe.

 

White box monitoring: Exploring failures

White box monitoring takes a different approach from black box monitoring. It uses a technique known as instrumentation, which exposes details about the system’s internals that help you explore these black holes and better understand the creative ways applications fail today.

 

Microservices Observability: Techniques

Collection, storage, and analytics

Regardless of what you are monitoring, the infrastructure or the application service, monitoring requires three inputs, more than likely across three domains. We require:

    1. Data collection
    2. Storage, and 
    3. Analysis.

We need to look at metrics, traces, and logs for these three domains or, let’s say, components. Of these three, I find trace data the most beneficial and an excellent way to isolate performance anomalies in distributed applications. Trace data falls under distributed tracing, which enables flexible consumption of captured traces.

 

What you need to do: The 4 golden signals 

The first thing you need to do is establish a baseline comprising the four golden signals: latency, traffic, errors, and saturation. The golden signals are good indicators of health and performance and apply to most components of your environment, such as the infrastructure, applications, microservices, and orchestration systems.
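As a sketch of what computing the four signals over one window might look like (the request records, window length, and capacity figure below are made-up assumptions, not measurements from any real system):

```python
# Hypothetical request records over a 60-second window: (duration_ms, is_error).
requests_window = [(120, False), (95, False), (310, True), (88, False), (450, False)]
window_seconds = 60
capacity_rps = 2.0  # assumed maximum throughput of the service

durations = sorted(d for d, _ in requests_window)
latency_p95 = durations[int(0.95 * (len(durations) - 1))]   # rough p95 latency
traffic = len(requests_window) / window_seconds             # requests per second
errors = sum(1 for _, e in requests_window if e) / len(requests_window)
saturation = traffic / capacity_rps                         # fraction of capacity used

print(latency_p95, errors)
```

Each signal is a single number per window, which is exactly what makes them suitable as a baseline to alert against.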

 

Diagram: Application performance monitoring tools.

 

  • A quick recommendation: Alerts and SLIs

I would recommend automating this baseline, along with automated alerts on deviations from it. The problem is that if you collect too much, you may alert on too much. Service Level Indicators (SLIs) can help you find what is worth alerting on and what matters to the user experience.
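A hedged sketch of the idea: rather than alerting on a raw resource threshold, compute an SLI (the "good request" definition and SLO target below are illustrative assumptions) and alert only when it falls below the objective:

```python
# Hypothetical SLI: fraction of requests served under 300 ms without error,
# compared against an assumed 99% SLO target.
SLO_TARGET = 0.99
requests_seen = [(120, False), (95, False), (310, False), (88, True), (150, False)]

good = sum(1 for duration_ms, is_error in requests_seen
           if duration_ms < 300 and not is_error)
sli = good / len(requests_seen)
should_alert = sli < SLO_TARGET  # alert on user impact, not CPU percentages

print(sli, should_alert)
```

The point is that the alert condition is expressed in terms of user experience, which keeps alert volume tied to what actually matters.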

 

  • A key point: Distributed tracing

Navigate real-time alerts

Leveraging distributed tracing for directed troubleshooting gives users the ability to dig deep when a performance-impacting event occurs. No matter where an issue arises in your environment, you can navigate from real-time alerts directly to application traces and correlate performance trends between infrastructure, Kubernetes, and your microservices. Distributed tracing is essential for monitoring, debugging, and optimizing distributed software architectures such as microservices, especially dynamic ones.

 

The Effect on Microservices: Microservices Monitoring

When considering a microservice application, many regard each microservice as independent, but this is nothing more than an illusion. These microservices are highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices. A typical architecture may include a backend service, a front-end service, or maybe even just a docker-compose file. So, at a minimum, several containers need to communicate to carry out operations.

For a simple microservice architecture, we would have a simple front end mimicking a distributed application setup, where a microservice serving static content sits at the front end while the heavy lifting is done by the other service.

 


 

Monolith and microservices monitoring

We have considerably more components to monitor than we had in the monolithic world. With a traditional monolith, there were only two components to monitor: the applications and the hosts. In the cloud-native world, by contrast, we have containerized applications orchestrated by Kubernetes, with multiple components requiring monitoring: the hosts, the Kubernetes platform itself, the Docker containers, and the containerized microservices.

 

Distributed systems have different demands

Distributed systems are the norm today and place different demands on your infrastructure than the classic, three-tier application. Pinpointing issues in a microservices environment is more challenging than with a monolithic one, as requests traverse both between different layers of the stack and across multiple services. 


The Challenges: Microservices

The things we love about microservices, independence and idempotence, are also what make them difficult to understand, especially when things go wrong. As a result, these systems are often referred to as deep systems, not due to their width but their complexity. We can no longer monitor the application with a script that accesses it over the network every few seconds and reports failures, or a custom script that checks the operating system to understand when a disk is running out of space. Saturation is an important signal, but it’s just one of them. It quickly becomes unrealistic for a single human, or even a group, to understand enough of the services in the critical path of even a single request and keep that understanding current.

 

Node Affinity or Taints

Microservices-based applications are typically deployed in containers that are dynamic and transient. This leaves an unpredictable environment in which pods get deployed and run, unless specific intent is expressed using affinity or taints. Even then, there can still be unpredictability in pod placement. The unpredictable nature of pod deployment and the depth of configuration can lead to complex troubleshooting.

 

The Beginnings of Distributed Tracing

When you are ready to get started with distributed tracing, you will come across OpenTracing. OpenTracing is a set of standards exposed as frameworks: a vendor-neutral API and instrumentation for distributed tracing. OpenTracing does not give you the library itself; it is a set of rules and extensions that other libraries can adopt, so you can swap different libraries in and out and expect the same behavior.

 

Microservices architecture example

Let’s examine an example with Requests, an elegant and simple HTTP library for Python. The Requests library talks HTTP and relies on certain standards; the standard here is HTTP. In Python, when you call “requests.get”, the underlying implementation performs a formal HTTP request using the GET method. The HTTP standard and spec lay the ground rules for what is expected from the client and the server.
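The same spec-driven behavior can be seen without sending anything over the network. The sketch below uses Python’s standard library rather than Requests so it is fully self-contained (the URL is a made-up example); `requests.get` prepares the same kind of HTTP message under the hood:

```python
from urllib.request import Request

# Build a GET request object without sending it; the HTTP spec, not the
# library, dictates that a request with no body defaults to the GET method.
req = Request("https://example.com/api?q=traces")

print(req.get_method())  # the method a spec-compliant client would send
print(req.host)          # the host parsed from the URL
```

Any HTTP library, in any language, must honor these same ground rules, which is exactly the role OpenTracing plays for tracing.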

 

OpenTracing

The OpenTracing project does the same thing. It sets out the ground rules for what distributed tracing should look like, regardless of the implementation and the language used. It has libraries available in several languages, including Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, and C++. For example, the OpenTracing API for Python gives you an implementation of OpenTracing to be used from Python. It is the set of standards for tracing with Python, providing examples of what the instrumentation should look like and common ways to start a trace.

 

  • A key point: Connect the dots with distributed tracing

And this is a big difference between tracing and logging. Tracing allows you to connect the dots from one end of the application to the other. If you start a request on the front end, you can see how it travels the entire way to the backend, represented as a trace with its connected child spans.
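A toy model (not any real tracing library; all names are illustrative) shows the mechanics of how those dots get connected: every span carries the same trace ID, and each child points at its parent:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Minimal span record: child spans share the trace_id and reference their
# parent's span_id, which is what lets a backend rebuild the full request tree.
@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    parent_id: Optional[str] = None

root = Span("frontend:GET /checkout", trace_id=uuid.uuid4().hex)
child = Span("backend:charge_card", trace_id=root.trace_id, parent_id=root.span_id)

# Both spans belong to one trace; the parent link gives the tree structure.
print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```

In a real system, the trace and parent IDs are propagated across service boundaries in request headers, so spans emitted by different processes can still be stitched together.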

 

Visual Representation with Jaeger

You may want to use Jaeger for the visual representation. Jaeger is an open-source, end-to-end distributed tracing tool that allows you to monitor and troubleshoot transactions in complex distributed systems. It provides a dashboard where we can interact with and search for traces. Jaeger addresses problems such as distributed transaction monitoring, performance and latency optimization, root cause analysis, service dependency analysis, and distributed context propagation. Jaeger has different clients for different languages; for example, if you are using Python, there is a client library for Python.

 

OpenTelemetry

We also have OpenTelemetry, which is similar. It is described as an observability framework for cloud-native software and is in beta across several languages. It is geared toward traces, metrics, and logs, so it does a bit more than OpenTracing.

 

Diagram: Distributed tracing and scalable microservices.

 

Introduction to Microservices Observability

We know that Observability means the internal states of a system can be inferred from its external outputs. Therefore, the tools that make up an Observability system help us understand the relationships between causes and effects in distributed systems. The term Observability is borrowed from control theory. It suggests a holistic, data-centric view of microservices monitoring that enables exploration and the identification of unknown failures, alongside the more traditional anomaly detection and notification mechanisms.

 

Goal: The ultimate goal of Observability is to:

  • Improve baseline performance
  • Restore baseline performance (after a regression)

 

By improving the baseline, you improve the user experience. For user-facing applications, performance often means request latency. Then we have regressions in performance, including application outages, which can result in lost revenue and negatively impact the brand. How much regression is accepted comes down to user expectation: what is acceptable, and what is in the SLA?

 

Chaos engineering

Understanding your limits, and the new places your system and applications can fail, can be done with Chaos Engineering tests. Chaos Engineering helps you understand your system by introducing controlled experiments when debugging microservices.

 

Microservices Observability Pillars

So, to fully understand a system’s internal state, we need tools, some old and some new. These tools are known as the pillars of Observability: a combination of logs, metrics, and distributed tracing. All of these must be combined to fully understand internal behaviour and fulfil the observability definition. To fully understand symptoms and causes, data must be collected continuously across all of the Observability domains.

 

  • A key point: Massive amount of data

Remember that instrumentation can generate massive amounts of data, which can cause challenges in storage and analysis. You need to perform data collection, storage, and analysis across the three domains of metrics, traces, and logs. And then you need to alert on what matters most within those domains, not just when an arbitrary threshold is met.

 

The role of metrics

A metric is familiar to most: a value and a timestamp, along with any metadata. Metrics are collections of statistics that need to be analyzed over time. A single instance of a metric is of limited value. Examples include request rate, average duration, and queue size. These values are usually captured as time series so that operators can see and understand changes to metrics over time.
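To make the "limited value of a single sample" point concrete, here is a tiny sketch with made-up queue-size samples: no individual data point tells you much, but the series reveals the queue is growing at an accelerating rate:

```python
# Hypothetical metric samples as (unix_timestamp, queue_size) pairs,
# one sample per minute.
samples = [(1000, 5), (1060, 9), (1120, 17), (1180, 33)]

# Rate of change (items per second) between consecutive samples:
rates = [
    (t2, (v2 - v1) / (t2 - t1))
    for (t1, v1), (t2, v2) in zip(samples, samples[1:])
]
print(rates)
```

Each rate roughly doubles from one interval to the next, a trend that is invisible if you only ever look at the latest value.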

 

Add labels to metric

To better understand metrics, we can add labels as key-value pairs. Labels add context to a data point; a label is a key-value pair indexed with the metric as part of the ingestion process. Metrics can then be broken down into sub-metrics. As we enter the world of labels and tags for metrics, we need to understand the effect this has on cardinality: each indexed label value adds a time series, which comes at a storage and processing cost. We therefore use cardinality to understand the impact of labels on a metric store.
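A quick sketch of what cardinality means in practice (the label names and values below are invented for illustration): every distinct combination of label values becomes its own time series, and cardinality is simply the count of those combinations:

```python
# Hypothetical label sets attached to samples of one metric, e.g. http_requests.
label_sets = [
    {"method": "GET", "status": "200", "pod": "api-1"},
    {"method": "GET", "status": "200", "pod": "api-2"},
    {"method": "POST", "status": "500", "pod": "api-1"},
    {"method": "GET", "status": "200", "pod": "api-1"},  # duplicate -> same series
]

# Distinct label combinations = number of time series the store must keep.
cardinality = len({tuple(sorted(ls.items())) for ls in label_sets})
print(cardinality)
```

Adding a high-variance label such as a user ID or request ID would multiply this count, which is why unbounded label values are dangerous for a metric store.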

Diagram: Observability: The issue with metrics.

 

Aggregated metrics

The issue I continue to see is that metrics are typically aggregated only once per minute, or even only once every six to twelve minutes. However, metrics should be aggregated and visualized within at most one minute, and ideally even more quickly. Key questions to ask: over what window are values aggregated? How are the windows from different sources aligned?

 

  • A key point: The issues of Cardinality

Aggregated metrics allow you to get an aggregate understanding of what’s happening to all instances of a given service, and even to narrow your query to specific groups of services, but they fail to account for infinite cardinality. Due to issues with high cardinality within a time-series storage engine, it is recommended to use labels rather than hierarchical naming for metrics.

 

Prometheus Monitoring and Prometheus Metric Types

Examples: Push and Pull

To collect metrics, you need either a push or a pull approach. A push agent transmits data upstream, more than likely on a schedule. A pull agent expects to be polled. Then we have Prometheus and its Prometheus metric types: the Prometheus server operates with a pull approach, which fits better into larger environments. Prometheus does not use the term agent; it has what are known as exporters. Exporters allow the Prometheus server to pull metrics from software that could not be instrumented using the Prometheus client libraries.
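What a pull-based scraper actually retrieves is plain text in the Prometheus exposition format, so a sketch that renders it is enough to see the shape of the data (the metric name and labels below are made up; a real exporter would serve this text over HTTP at /metrics):

```python
# Render one counter metric with labeled samples in the Prometheus text format:
# "# HELP" and "# TYPE" comment lines, then one line per label combination.
def render_prometheus(name, help_text, samples):
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

text = render_prometheus(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET"}, 1027), ({"method": "POST"}, 3)],
)
print(text)
```

The simplicity of this format is part of why the pull model scales: any process that can serve a text page can be scraped.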

 

Prometheus and Kubernetes

Prometheus is an open-source monitoring platform, widely used with Kubernetes, that originated at SoundCloud in 2012. Its capabilities include metric collection, storage, data analysis, and visualization. We can pair Prometheus with Grafana for the visualizations.

 

Storing Metrics

You can store metrics, which are time-series data, in a general-purpose relational database. However, they should be stored in a repository optimized for the storage and retrieval of time-series data. We have a few time-series storage options, such as Atlas, InfluxDB, and Prometheus. Prometheus is the one that stands out, but keep in mind that, as far as I’m aware, there is no commercial support and only limited professional services for Prometheus.

Diagram: Prometheus and Grafana.

 

 

The role of Logs

Then we have logs, which can be extremely detailed. Logs can be anything, unlike metrics, which have a fairly uniform format. However, logs do tell you why something is broken. Logs capture activity that can be printed to the screen or sent to a backend to be centrally stored and viewed. There is very little standard structure to logs apart from a timestamp indicating when the event occurred. There is very little log schema, and log structure depends on how the application uses it and how developers create logs.
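One common way to tame that lack of schema is structured logging: emitting each event as JSON so downstream tools can parse it reliably. A minimal sketch with Python’s standard logging module (the logger name and field names are assumptions for illustration):

```python
import io
import json
import logging

# Attach a handler writing to an in-memory stream so the example is
# self-contained; in production this would be stdout or a log shipper.
stream = io.StringIO()
logger = logging.getLogger("checkout")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)

# Emit one structured event: machine-readable key-value pairs, not free text.
logger.info(json.dumps({"event": "payment_failed", "order_id": "A17", "ms": 412}))

record = json.loads(stream.getvalue())
print(record["event"])
```

Because every event carries named fields, a backend can index and query on `order_id` or `event` instead of grepping free-form text.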

 

Emitting Logs

Logs are emitted by almost every entity: basic infrastructure such as network and storage, servers and compute nodes, operating systems, and application software. The variety of log sources, and the number of tools involved in transporting and interpreting them, make log collection a complex task. Also remember that a large amount of log data must be stored.

Search engines such as Google have developed several techniques for searching extremely large datasets with arbitrary queries, which have proved very efficient. All of these can be applied to log data.

 

Diagram: Microservice logging.

 

 

Logstash, Beats, and FluentD

Logstash is a cloud-scale ingestion tool and part of the Elastic Stack. However, there have been concerns about the performance and scalability of Logstash, which brings us to its lightweight counterpart, Beats. If you don’t need the sophisticated data manipulation and filtering of Logstash, you can use Beats. FluentD provides a unified logging layer, a way to aggregate logs from many different sources and distribute them to many destinations, with the ability to transform data.

 

Storing Logs

Structured data such as logs and events are made up of key-value pairs, any of which may be searched. This leads us to repositories called non-relational or NoSQL databases. Storing logs presents a different storage problem from metrics. Examples of key-value databases include Memcached and Redis; however, they are not a good choice for log storage due to the inefficiency of indexing and searching them. The ELK stack, with its indexing and search engine (Elasticsearch), collector (Logstash), and visualization tool (Kibana), is the dominant storage mechanism for log and event data.
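To see why a plain key-value store searches logs poorly while engines like Elasticsearch do it well, here is a toy inverted index, the core structure search engines build (the log lines are invented): it maps each term to the set of log entries containing it, so a query is one lookup instead of a scan over every stored value:

```python
from collections import defaultdict

# Three hypothetical log lines, keyed by document id.
logs = {
    1: "error connecting to payment service",
    2: "payment accepted for order 42",
    3: "retrying connection to payment service",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, line in logs.items():
    for term in line.split():
        index[term].add(doc_id)

# Search is now a dictionary lookup, not a scan of every log line.
print(sorted(index["payment"]))
```

Real engines add tokenization, ranking, and compression on top, but the lookup-versus-scan distinction is the heart of why they outperform KV stores for arbitrary log queries.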

 

  • A key point: Analyze logs with AI

So, once you store the logs, they need to be analyzed and viewed. Here you could, for example, use Splunk. Its data analysis capabilities are diverse and range from security to AI for IT operations (AIOps). Kibana can also be used, which is part of the Elastic Stack.

 

Introducing Distributed Tracing

Distributed tracing is used in microservices and other distributed applications because a single operation touches many services. Distributed tracing is a type of correlated logging that helps you gain visibility into the operation of a distributed software system. It consists of collecting request data from the application and then analyzing and visualizing that data as traces. Tracing data, in the form of spans, must be collected from the application, transmitted, and stored so that complete requests can be reconstructed. This is useful for performance profiling, debugging in production, and root cause analysis of failures or other incidents.

 

  • A key point: The value of distributed tracing

Distributed tracing allows you to understand what a particular individual service is doing as part of the whole, providing visibility into the operation of your microservice architecture. The trace data you generate can display the overall shape of your distributed system and show individual service performance inside a single request.

 

Diagram: Distributed tracing.

 

Distributed tracing components 

  1. What is a trace?

Consider your software in terms of requests. Each component of your software stack carries out some work in response to a request, which may be a remote procedure call from another service. A trace encapsulates a single operation within the application, end to end, represented as a series of spans. Each traceable unit of work within the operation generates a span. There are two ways to get trace data: it can be generated through instrumentation of your service processes, or by transforming existing telemetry data into trace data.

 

  2. Introducing a span

We call each service’s work a span, as in the span of time it takes for the work to occur. Spans can be annotated with additional information such as attributes, tags, or even logs, so a combination of metadata and events can be added to them. Creating effective spans unlocks insights into the behaviour of your service. The span data created by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed, and stored for further insights.
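The span lifecycle above can be sketched with a toy recorder (not any real tracing API; the span name and tag are invented): time the unit of work, attach tags, and hand the finished span to a collection point on exit:

```python
import time
from contextlib import contextmanager

# Finished spans would normally be exported to a collector; here we just
# append them to a list so the mechanics are visible.
finished_spans = []

@contextmanager
def span(name, **tags):
    start = time.monotonic()
    try:
        yield
    finally:
        finished_spans.append(
            {"name": name, "duration_s": time.monotonic() - start, "tags": tags}
        )

# Wrap a unit of work; annotations ride along as tags on the span.
with span("db.query", table="orders"):
    time.sleep(0.01)  # stand-in for real work

print(finished_spans[0]["name"], finished_spans[0]["tags"])
```

Note the span is recorded even if the wrapped work raises, which is exactly when you most want the timing and tags.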

 


Matt Conran: The Visual Age
