Distributed Systems Observability

In modern technology, distributed systems have become the backbone of countless applications and services. However, the increasing complexity of such systems poses significant challenges when it comes to monitoring and understanding their behavior. This is where observability steps in, offering a comprehensive way to gain insight into the intricate workings of distributed systems. In this blog post, we will explore distributed systems observability: its key concepts, tools, and benefits.
Observability, as a concept, enables us to gain deep insights into the internal state of a system based on its external outputs. When it comes to distributed systems, observability takes on a whole new level of complexity. It encompasses the ability to effectively monitor, debug, and analyze the behavior of interconnected components across a distributed architecture. By employing various techniques and tools, observability allows us to gain a holistic understanding of the system's performance, bottlenecks, and potential issues.

To achieve observability in distributed systems, it is crucial to focus on three interconnected components: logs, metrics, and traces.

Logs provide a chronological record of events and activities within the system, offering valuable insights into what has occurred. By analyzing logs, engineers can identify anomalies, track down errors, and troubleshoot issues effectively.
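To make this concrete, here is a minimal sketch of structured, machine-parseable logging using only Python's standard library; the service name and field set are illustrative assumptions, not prescriptions:

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each log record as a single JSON line, which makes
    # downstream aggregation and searching much easier.
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")
logger.error("payment gateway timeout")

Each event lands as one JSON line, so a log pipeline can filter by level or service without fragile text parsing.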

Metrics, on the other hand, provide quantitative measurements of the system's performance and behavior. They offer a rich source of data that can be analyzed to gain a deeper understanding of resource utilization, response times, and overall system health.
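As a rough sketch of what this looks like in practice, the snippet below exposes a request counter and a latency histogram with the open-source prometheus_client library; the metric names and port are illustrative:

import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.inc()
    with LATENCY.time():  # records how long the block takes
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # metrics are served at :8000/metrics
    while True:
        handle_request()

A Prometheus server can then scrape :8000/metrics on an interval and store the time series for querying and alerting.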

Traces enable the visualization and analysis of transactions as they traverse through the distributed system. By capturing the flow of requests and their associated metadata, traces allow engineers to identify bottlenecks, latency issues, and performance optimizations.
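Here is a minimal sketch of emitting nested spans with the OpenTelemetry SDK for Python; the span and service names are hypothetical, and the console exporter stands in for a real tracing backend:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))  # print spans
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")  # hypothetical service name

with tracer.start_as_current_span("handle_order"):          # parent span
    with tracer.start_as_current_span("charge_card"):       # child: payment call
        pass
    with tracer.start_as_current_span("update_inventory"):  # child: database write
        pass

Because the child spans share the parent's trace ID, a tracing backend can reassemble the whole request path across services.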

In the ever-evolving landscape of distributed systems observability, a plethora of tools and frameworks have emerged to simplify the process. Prominent examples include:

1. Prometheus: A powerful open-source monitoring and alerting system that excels at collecting and storing metrics from distributed environments (a query sketch follows this list).

2. Jaeger: An end-to-end distributed tracing system that enables the visualization and analysis of transaction flows across complex systems.

3. ELK Stack: A comprehensive combination of Elasticsearch, Logstash, and Kibana, which collectively offer powerful log management, analysis, and visualization capabilities.

4. Grafana: A widely-used open-source platform for creating rich and interactive dashboards, allowing engineers to visualize metrics and logs in real-time.
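To show how these tools fit together, here is a small sketch that queries Prometheus's HTTP API for a request rate; the server URL and the PromQL expression are assumptions for the example:

import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # default Prometheus port

# PromQL: per-second request rate averaged over the last 5 minutes
resp = requests.get(PROM_URL, params={"query": "rate(http_requests_total[5m])"})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(series["metric"], "=>", series["value"][1])

The same query, pasted into Grafana, becomes a live dashboard panel rather than a one-off script.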

The adoption of observability in distributed systems brings forth a multitude of benefits. It empowers engineers and DevOps teams to proactively detect and diagnose issues, leading to faster troubleshooting and reduced downtime. Observability also aids in capacity planning, resource optimization, and identifying performance bottlenecks. Moreover, it facilitates collaboration between teams by providing a shared understanding of the system's behavior and enabling effective communication.

Conclusion: In the ever-evolving landscape of distributed systems, observability plays a pivotal role in unraveling the complexity and gaining insights into system behavior. By leveraging the power of logs, metrics, and traces, along with robust tools and frameworks, engineers can navigate the intricate world of distributed systems with confidence. Embracing observability empowers organizations to build resilient, high-performing systems that can withstand the challenges of today's digital landscape.

Highlights: Distributed Systems Observability

The Role of Distributed Systems

Only a few decades ago, just a handful of mission-critical services worldwide had to meet the availability and reliability requirements that today’s always-on applications and APIs treat as standard. In response to user demand, every application must now be built to scale nearly instantly to accommodate the potential for rapid, viral growth. Almost every app built today, whether a consumer mobile app or a backend payment system, must meet these constraints and requirements.

When appropriately designed, distributed systems can be more reliable because no single machine is a point of failure, and the engineering teams that build them can benefit from more scalable organizational models. There is, however, a price to pay for these advantages. Designing, building, and debugging distributed systems can be challenging. A reliable distributed system requires significantly more engineering skill than a single-machine application, such as a mobile app or a web frontend. Regardless, distributed systems are becoming increasingly important, and there is a corresponding need for tools, patterns, and practices to build them.

Multicloud environments are increasingly distributed.

As digital transformation accelerates, organizations adopt multicloud environments to drive secure innovation and achieve speed, scale, and agility. As a result, technology stacks are becoming increasingly complex and scalable. Today, even the simplest digital transaction is supported by an array of cloud-native services and platforms delivered by a variety of providers. In order to improve user experience and resilience, IT and security teams must monitor and manage their applications.

Fragmented Monitoring Tools

Fragmented monitoring tools and manual analytics strategies challenge IT and security teams. The lack of a single source of truth and real-time insight makes it increasingly difficult for these teams to access the answers they need to accelerate innovation and optimize digital services. To gain insight, they must manually query data from various monitoring tools and piece together different sources of information. This complex and time-consuming process distracts team members from driving innovation and creating new value for the business and customers. In addition, many teams monitor only their mission-critical applications because of the effort involved in managing all these tools, platforms, and dashboards. The result is a multitude of blind spots across the technology stack, which makes it harder for teams to gain insights.

Kubernetes is Complex

It is imperative to understand how Kubernetes adds to the complexity of technology stacks. Kubernetes is the platform of choice for organizations refactoring their applications for the cloud-native world. Through dynamic resource provisioning, Kubernetes architectures can quickly scale services to new users and increase efficiency.

However, the constant changes in cloud environments make it difficult for IT and security teams to maintain visibility into them. Manually configuring an assortment of traditional monitoring tools cannot provide observability in Kubernetes environments. As a result, these teams are often unable to gain the real-time insights they need to improve user experience, optimize costs, and strengthen security. Because of this visibility challenge, many organizations are delaying moving more of their mission-critical services to Kubernetes.

The Role of Megatrends

A considerable drive of innovation has spawned several megatrends that affect how we manage and view our network infrastructure, and that create the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of a monolith, where everything is generally housed internally, we must now manage and operate many services and dependencies in multiple locations, which is what microservices observability addresses. The megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

Shift in Control

There has also been a shift in the point of control. As we move toward new technologies, many of the loosely coupled services or infrastructures your services depend on are not under your control. The edge of control has been pushed outward, creating different network and security perimeters. These perimeters are now closer to the workload than a central security stack, so the workloads themselves must be concerned with security.

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions



Distributed Systems Observability

Key Distributed Systems Observability points:


  • We no longer have predictable failures.

  • The different demands on networks.

  • The issues with the metric-based approach.

  • Static thresholds and alerting.

  • The 3 pillars of Observability.

Back to Basics with Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have applied to just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

Diagram: Systems Observability design.

The Key Components of Observability:

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring:

Monitoring involves the continuous collection and analysis of system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.
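As an illustrative sketch, the loop below samples host-level metrics with the psutil library; the 15-second interval and the chosen metrics are assumptions, not a recommendation:

import time
import psutil

def sample():
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),  # averaged over 1 second
        "mem_percent": psutil.virtual_memory().percent,
        "net_bytes_sent": psutil.net_io_counters().bytes_sent,
    }

while True:
    metrics = sample()
    print(metrics)  # in practice, push these to a metrics backend
    time.sleep(15)  # a typical scrape-style interval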

2. Logging:

Logging involves recording events, activities, and errors within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.
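One sketch of centralizing logs is to ship every warning to a collector over HTTP using the standard library's HTTPHandler; the collector endpoint here is hypothetical:

import logging
import logging.handlers

handler = logging.handlers.HTTPHandler(
    host="logs.example.internal:8080",  # hypothetical aggregation endpoint
    url="/ingest",
    method="POST",
)

logger = logging.getLogger("inventory-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

logger.warning("cache miss rate elevated")  # the record is POSTed to the collector

Real deployments usually rely on an agent such as Logstash or Fluentd instead, but the principle, one central sink rather than per-host files, is the same.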

3. Tracing:

Tracing involves capturing the flow of requests and interactions between different distributed system components. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact.
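To make the mechanics concrete, here is a simplified sketch of propagating W3C trace context (the traceparent header) between services by hand; production systems would rely on an instrumentation library to do this:

import secrets

def new_traceparent():
    trace_id = secrets.token_hex(16)  # 32 hex chars: one ID for the whole request
    span_id = secrets.token_hex(8)    # 16 hex chars: this hop's span
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent_header):
    # Keep the trace ID, mint a fresh span ID for the next hop.
    version, trace_id, _, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = new_traceparent()            # as set by an upstream service
outgoing = child_traceparent(incoming)  # forwarded on the next HTTP call
print(incoming)
print(outgoing)

Because every hop reuses the same trace ID, a backend can stitch the spans back into one end-to-end picture of the request.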

Benefits of Observability in Distributed Systems:

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting:

Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization:

By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management:

Observability facilitates monitoring system changes and their impact on overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps maintain system stability and avoid unexpected issues.

How This Affects Failures

The primary issue I have seen with my clients is that application failures are no longer predictable, and dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have many partial failures that are not just unexpected but previously unknown, never seen before. Consider, for example, the network hero.

The network hero

The network hero is someone who knows every part of the network and has seen every failure at least once. Heroes alone are no longer enough in today’s world; we need proper observability. When I was working as an engineer, we would have plenty of failures, but more than likely, we would have seen them before, and there was a system in place to fix the error. Today’s environment is much different.

We can no longer rely on simply seeing UP or DOWN, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer’s perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers’ perspective at what matters to them. Content delivery networks (CDNs) were among the first to recognize this and to measure what matters most to the customer.
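A small sketch of the difference, with illustrative numbers: judge health by the latency your users actually experience, not by an internal CPU gauge:

import math

def percentile(values, pct):
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)  # nearest-rank method
    return ordered[k]

latencies_ms = [42, 51, 38, 40, 1450, 47, 44, 39, 43, 41]  # sampled requests
p99 = percentile(latencies_ms, 99)
TARGET_MS = 300  # hypothetical user-facing latency target

if p99 > TARGET_MS:
    print(f"p99={p99}ms breaches the {TARGET_MS}ms target: users feel this")
else:
    print(f"p99={p99}ms is within target, whatever the CPU gauge says")

Nine of the ten requests look healthy, yet the p99 view surfaces the one request a real customer suffered through.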

Distributed Systems Observability

The different demands

So, the new, modern, and complex distributed systems place very different demands on your infrastructure and the people who manage it. For example, several things can go wrong with a particular microservice:

    • The microservice could be running under high resource utilization and therefore slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice could be fine, but slow-running database queries are dragging it down.

In other words, we have a lot of partial failures.

Therefore, we can no longer predict

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and good system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are failing to do what you ask of them. But they work in siloed environments, with a lack of connectivity between them. So we have monitoring tools working in silos in different parts of the organization, more than likely managed by different people, all trying to monitor a very dispersed application with multiple components and services in various places.

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes. It depends on known failures and predictable failure modes, so we set predictable thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems are either over or under the predictable thresholds that were previously set. Then, we can set alerts, and we hope that these alerts are actionable. This is only useful for variants of predictable failure modes.
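A deliberately simple sketch of that style, with an illustrative threshold, shows its limits:

CPU_THRESHOLD = 80.0  # chosen upfront, usually from past incidents

def check(cpu_percent):
    if cpu_percent > CPU_THRESHOLD:
        return "ALERT: cpu above threshold"  # fires only on anticipated patterns
    return "ok"

print(check(92.5))  # caught: matches the predicted failure mode
print(check(65.0))  # silent, even if users are suffering for another reason

The second case is the problem: the check stays green for any failure mode nobody predicted.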

Traditional metrics and monitoring tools can flag performance spikes and tell you that a problem has occurred. But they don’t let you dig into the source of the issue, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it much harder to get to the root cause in a reasonable timeframe.

Traditional style metrics systems

With traditional metrics systems, you had to define custom metrics, and they were always defined upfront. This approach prevents us from asking new questions about problems later: you must know, in advance, every question you will want to ask.

Then we set performance thresholds, pronounced them “good” or “bad,” and checked and re-checked those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. Always observe, instead of waiting for a problem, such as a threshold being crossed, before acting.
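By contrast, here is a sketch of recording wide, high-context events instead of a fixed metric set; the fields and values are invented for illustration:

import json

def emit_event(**fields):
    print(json.dumps(fields))  # stand-in for writing to a real event pipeline

emit_event(
    service="checkout",
    route="/pay",
    user_tier="free",     # dimensions you did not know you would need
    region="eu-west-1",
    duration_ms=412,
    db_query_ms=390,
    status=200,
)

Because every field travels with the event, you can later ask questions nobody defined upfront, such as whether latency is worse for free-tier users in one region.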

Diagram: System Observability analysis.

Metrics: Lack of connective event

Metrics do not retain the connective event. As a result, you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is actually happening.

An example of this could be an odd number of running threads on one component, which might indicate garbage collection is in progress or that slow response times are imminent in an upstream service.

User experience and static thresholds.

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in other ways, using various components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

We should have only a few alerts, and they should be triggered by symptoms that directly impact user experience, not because some threshold was reached.

The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static monitoring thresholds can’t reflect impacts on user experience. They lack context and are too coarse.
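One common alternative, sketched here with illustrative numbers, is to alert on how fast you are burning the error budget tied to a service level objective; the 14.4x multiplier is a commonly cited fast-burn value for one-hour windows:

SLO_ERROR_BUDGET = 0.001  # a 99.9% success objective

def burn_rate(errors, requests):
    observed = errors / requests if requests else 0.0
    return observed / SLO_ERROR_BUDGET

# e.g. 200 failed requests out of 10,000 in the last hour
rate = burn_rate(errors=200, requests=10_000)
if rate > 14.4:  # fast-burn condition: page a human
    print(f"page: burning error budget {rate:.1f}x too fast")

This fires only when real users are failing at a rate that endangers the objective, regardless of what shape the system is in internally.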

The Need For Distributed Systems Observability

Systems observability and reliability in distributed systems is a practice. Rather than just a tool that does logging, metrics, or alerting, Observability is about how you approach problems, and for this, you need to look at your culture. So you could say that Observability is a cultural practice, one that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint: we want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers, physical or virtual, and the network, and what data transfer looks like both in transit and at rest.

What level of observation is needed to ensure everything is performing as it should? What should you look at to obtain this level of detail?

Monitoring means knowing the data points and the entities from which we gather information. Observability, on the other hand, is about putting all that data together. Monitoring collects the data; Observability assembles it into one single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is gathering the data and feeding it into the systems. A vital part of an Observability toolkit is service level objectives (SLOs).

The three pillars of distributed systems observability

We have three pillars of Systems Observability: metrics, traces, and logging. Defining or viewing Observability as just these three pillars is an oversimplification, but for Observability, you need them in place. Observability is all about connecting the dots between each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in a service request’s execution. As a result, it doesn’t matter if services have complex dependencies. You could say that the complexity of dynamic systems is abstracted away by distributed tracing.

Use Case: Challenges without tracing.

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.
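A toy sketch of how tracing untangles this: walk the spans of one request and find where the time is actually spent. The trace structure below is simplified, and the services are hypothetical:

spans = [
    {"name": "api-gateway",   "parent": None,            "duration_ms": 930},
    {"name": "order-service", "parent": "api-gateway",   "duration_ms": 910},
    {"name": "inventory-svc", "parent": "order-service", "duration_ms": 60},
    {"name": "orders-db",     "parent": "order-service", "duration_ms": 820},
]

def self_time(span):
    # Time spent in this span itself, excluding its children.
    child_total = sum(s["duration_ms"] for s in spans if s["parent"] == span["name"])
    return span["duration_ms"] - child_total

culprit = max(spans, key=self_time)
print(f"most self-time: {culprit['name']} ({self_time(culprit)}ms)")

The latency that looked like an api-gateway problem resolves to the database three layers down, which is precisely the answer upstream metrics could not give.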

Distributed tracing: A winning formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Distributed tracing shows the relationships between the various services and components in a distributed system, helping you understand the interdependencies. Those interdependencies can obscure problems and make them challenging to debug unless the relationships between components are clearly understood.

In distributed systems, observability is vital in ensuring complex architectures’ stability, performance, and reliability. Monitoring, logging, and tracing provide engineers with the tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.

Summary: Distributed Systems Observability

In the vast landscape of distributed systems, observability is crucial in ensuring their reliable and efficient functioning. This blog post delves into the critical components of distributed systems observability and sheds light on their significance.

Telemetry

Telemetry forms the foundation of observability in distributed systems. It involves collecting, processing, and analyzing various metrics, logs, and traces. By monitoring and measuring these data points, developers gain valuable insights into the performance and behavior of their distributed systems.

Logging

Logging is an essential component of observability, providing a detailed record of events and activities within a distributed system. It captures important information such as errors, warnings, and informational messages, which aids in troubleshooting and debugging. Properly implemented logging mechanisms enable developers to identify and resolve issues promptly.

Metrics

Metrics are quantifiable measurements that provide a high-level view of the health and performance of a distributed system. They offer valuable insights into resource utilization, throughput, latency, error rates, and other critical indicators. By monitoring and analyzing metrics, developers can proactively identify bottlenecks, optimize performance, and ensure the smooth operation of their systems.

Tracing

Tracing allows developers to understand the flow and behavior of requests as they traverse through a distributed system. It provides detailed information about the path a request takes, including the various services and components it interacts with. Tracing is instrumental in diagnosing and resolving performance issues, as it highlights potential latency hotspots and bottlenecks.

Alerting and Visualization

Alerting mechanisms and visualization tools are vital for effective observability in distributed systems. Alerts notify developers when certain predefined thresholds or conditions are met, enabling them to take timely action. Visualization tools provide intuitive and comprehensive representations of system metrics, logs, and traces, making identifying patterns, trends, and anomalies easier.

Conclusion

In conclusion, the key components of distributed systems observability, namely telemetry, logging, metrics, tracing, alerting, and visualization, form a comprehensive toolkit for monitoring and understanding the intricacies of such systems. By leveraging these components effectively, developers can ensure their distributed systems’ reliability, performance, and scalability.
