Starting Observability

In the ever-evolving landscape of technology, the complexity of modern systems continues to increase. As software developers, system administrators, and IT professionals, we are constantly faced with the challenge of ensuring the smooth operation of these intricate systems. This is where observability comes into play. In this blog post, we will delve into the concept of observability, its significance in modern systems, and the benefits it brings to the table.

Observability is the ability to gain insight into the inner workings of a system through its outputs, allowing us to infer its internal state. Unlike traditional monitoring, which focuses on measuring predefined metrics, observability takes a more holistic approach. It emphasizes the collection, analysis, and interpretation of various data points, enabling us to gain a deeper understanding of our systems.

 

Highlights: Starting Observability

  • A New Paradigm Shift

To support these new demands, your infrastructure is in the midst of a paradigm shift. As systems become more distributed and complex, the methods for building and operating them are evolving, making network visibility into your services and infrastructure more critical than ever. This leads you to adopt new practices, such as Starting Observability and implementing service level objectives (SLOs).

  • The Internal States

Observability aims to provide a level of introspection that lets you understand the internal state of your systems and applications. That insight can be achieved in various ways. The most common way to understand this state fully is with a combination of logs, metrics, and traces as debugging signals, and these need to be viewed together, not as separate silos.

So, you have probably come across articles on the difference between monitoring and Observability. But how many have you come across that offer guidance on actually starting an observability project?

 

For additional pre-information, you may find the following helpful:

  1. Observability vs Monitoring
  2. Distributed Systems Observability
  3. WAN Monitoring
  4. Reliability In Distributed System

 

Back to basics with Starting Observability

Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt Observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand every user’s experience. In this sense, Observability for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre.

 

Diagram: Observability engineering, the starting point.

The Three Pillars of Observability

Observability rests upon three main pillars: logs, metrics, and traces. Logs capture textual records of system events, providing valuable context and aiding in post-incident analysis. Metrics, on the other hand, are quantitative measurements of system behavior, allowing us to track performance and identify anomalies. Lastly, traces provide a detailed view of request flows and interactions between system components, facilitating troubleshooting and understanding of system dependencies.

The Power of Proactive Maintenance

One of the key advantages of observability lies in its ability to enable proactive maintenance. By continuously monitoring and analyzing system data, we can identify potential issues or anomalies before they escalate into critical problems. This proactive approach empowers us to take preventive measures, reducing downtime and improving overall system reliability.

Unleashing the Potential of Data Analysis

Observability generates a wealth of data that can be harnessed to drive informed decision-making. By leveraging data analysis techniques, we can uncover patterns, identify performance bottlenecks, and optimize system behavior. This data-driven approach empowers us to make data-backed decisions that enhance system performance, scalability, and user experience.

The Immediate Starting Strategy

Start your observability project in the middle of your stack, not at the fringes, and start with something important. There is no point in starting an observability project on something that no one cares about or rarely uses; it will not attract any interest from stakeholders. Choose something that matters, and the results will be noticed.

 

Service level objectives (SLO)

So, to start an observability project on something that matters and will attract interest, you need to look at metrics that matter, which means working with Service Level Objectives (SLOs). With service level objectives, we attach the needs of the product and the business to the behavior of the individual components, finding the right balance point for starting an observability project.

The service level objective aggregates over time and is the mathematical equivalent of an error budget. So over this period, am I breaching my target? If you meet or exceed your SLO target, your users will be happy with the state of your service.

If you are missing your SLO target, your users are unhappy with the state of your service. It’s as simple as that. So the SLO is a goal measured over a period, and it contains two things: a target and a measurement window. Example: 99.9% of checkout requests in the past 30 days have been successful. Thirty days is the measurement window.
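
To make that concrete, here is a minimal sketch in Python of checking such an SLO; the request counts are invented for illustration, and in practice they would come from your metrics or log pipeline.

```python
# Minimal sketch: evaluating an availability SLO over its measurement window.
# Request counts are illustrative; in practice they come from metrics or logs.

SLO_TARGET = 0.999   # 99.9% of checkout requests should succeed
WINDOW_DAYS = 30     # the measurement window

total_requests = 4_200_000        # all checkout requests in the window
successful_requests = 4_196_500   # requests that completed successfully

sli = successful_requests / total_requests  # the measured service level indicator

print(f"SLI over the last {WINDOW_DAYS} days: {sli:.4%}")
if sli >= SLO_TARGET:
    print("SLO met: users should be happy with the service.")
else:
    print("SLO breached: users are likely unhappy with the service.")
```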

 

    • Key Point: Take advantage of Error Budgets

Once you have determined your service level objectives, it helps to look at your error budgets. Nothing can be reliable all the time, and it’s OK to fail; accepting that is the only way to run tests and innovate to better meet user requirements, which is why we have an error budget. An error budget is a budget of failure that you are allowed to spend per hour or per month.

It is the amount of unreliability we are willing to tolerate, and we need a way to measure it. Once you know how much of the error budget you have left, you can take more risks and roll out new features. Error budgets help you balance velocity and reliability: the practices of SLOs and error budgets let you prioritize both.
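
As a rough sketch of how the error budget falls out of the SLO (the figures below are again hypothetical), you can compute how much failure you can still afford within the measurement window:

```python
# Minimal sketch: deriving an error budget from the SLO and tracking what is left.
# All numbers are hypothetical.

SLO_TARGET = 0.999
total_requests = 4_200_000   # requests expected in the measurement window
failed_requests = 3_500      # failures observed so far

# The error budget is the fraction of requests allowed to fail in the window.
allowed_failures = (1 - SLO_TARGET) * total_requests   # 0.1% of traffic
budget_remaining = allowed_failures - failed_requests

print(f"Allowed failures this window: {allowed_failures:.0f}")
print(f"Failures so far:              {failed_requests}")
if budget_remaining > 0:
    print(f"Budget left: {budget_remaining:.0f} requests; safe to take some risk.")
else:
    print("Error budget exhausted; prioritize reliability over new features.")
```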

 

    • Key Point: Issues with MTTR

SLOs are an excellent way to start and to get an early win. They can be approached on a team-by-team basis and are a much more accurate way to measure reliability than Mean Time to Recovery (MTTR).

The issue with MTTR is that you measure the time it takes to resolve each incident, which is subject to measurement error. An SLO is harder to cheat and is a better way to measure.

So we have key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs). These are the first things to implement for Observability, rather than just looking at KPIs. Monitor your KPIs and SLx alongside the system’s internal state; from there, you can derive your service level metrics, and these are the best place to start an observability project.

 

What is a KPI and SLI: User experience

A key performance indicator is tied to the system implementation; it conveys health and performance and may change if there are architectural changes to the system. For example, database latency would be a KPI. In contrast to KPIs, we have service level indicators. An SLI is a measurement of your users’ experience and can be derived from several signals.

The SLI does not change unless the users’ needs change. It’s the metric that matters most to the user. This indicator tells you whether your service is acceptable or not, whether you have a happy or a sad user. It’s a performance measurement, a metric that describes a user’s experience.

 

  • Types of service level indicators

Examples of SLIs include availability, latency, correctness, quality, freshness, and throughput. We need to gather these metrics, which can be supported by several measurement strategies, such as application-level metrics, log processing, and client-side instrumentation.

So, if we look at an SLI implementation for availability, it could be, for example, the proportion of HTTP GET requests for a given endpoint that complete successfully.
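
As a hedged illustration of that kind of SLI implementation, the following sketch computes availability from a handful of invented request records; a real pipeline would read these from logs or metrics rather than a hard-coded list.

```python
# Minimal sketch: computing an availability SLI from request outcomes.
# The records below are invented; a real pipeline would read them from logs.

requests = [
    {"method": "GET", "path": "/checkout", "status": 200},
    {"method": "GET", "path": "/checkout", "status": 200},
    {"method": "GET", "path": "/checkout", "status": 503},
    {"method": "GET", "path": "/checkout", "status": 200},
]

gets = [r for r in requests if r["method"] == "GET"]
successful = [r for r in gets if r["status"] < 500]  # treat 5xx responses as failures

availability_sli = len(successful) / len(gets)
print(f"Availability SLI: {availability_sli:.2%}")   # 75.00% in this toy example
```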

Users care about SLIs, not KPIs. That is not to say that database latency is unimportant; you should measure it and put it on a predefined dashboard. But users don’t care about database latency or how their requests are served internally.

Instead, the role of the SLI is to capture the user’s expectation of how the system behaves. If your database is too slow, you might front it with a cache. The cache hit ratio then becomes a KPI, but the user’s expectations have not changed.

 

Starting Observability Engineering

Good Quality Telemetry

You need to understand the importance of high-quality telemetry, and you must adopt it carefully; it is the first step toward good Observability. Quality logs and metrics, together with a modern approach such as Observability, are required for long-term success. For this, you need good telemetry; without it, it is going to be hard to shorten the length of outages.

Instrumentation: OpenTelemetry 

The first step to consider is how your applications will emit telemetry data. For instrumentation of both frameworks and application code, OpenTelemetry is the emerging standard. With OpenTelemetry’s pluggable exporters, you can configure your instrumentation to send data to the analytics tool of your choice.

In addition, OpenTelemetry helps you with distributed tracing, which helps you understand system interdependencies. Those interdependencies can obscure problems and make them challenging to debug unless the relationships between them are clearly understood.
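
As a minimal sketch of what such instrumentation can look like with the OpenTelemetry Python SDK (exporting spans to the console here; a real deployment would configure an exporter for your chosen backend, and the span names are made up):

```python
# Minimal sketch: emitting a trace with the OpenTelemetry Python SDK.
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Configure the SDK: in production you would swap ConsoleSpanExporter for an
# exporter that ships spans to your analytics backend of choice.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout():
    # Each span records one step of the request; nested spans capture dependencies.
    with tracer.start_as_current_span("checkout"):
        with tracer.start_as_current_span("query-inventory"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment provider here

handle_checkout()
```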

Diagram: Distributed tracing example.

 

Data storage and analytics

Once you have high-quality telemetry data, you must consider how it’s stored and analyzed. Data storage and analytics are often bundled into the same solution, depending on whether you use open-source or proprietary solutions.

Commercial vendors typically bundle storage and analytics. These solutions will be proprietary all-in-one solutions, including Honeycomb, Lightstep, New Relic, Splunk, Datadog, etc.

Then we have the open-source solutions that typically require separate data storage and analytics approaches. These open-source frontends include solutions like Grafana, Prometheus, or Jaeger. While they handle analytics, they all need an independent data store to scale. Popular open-source data storage layers include Cassandra, Elastic, M3, and InfluxDB.

 

  • A final note: Buy instead of building?

Knowing how to start is the most significant pain point. When deciding between building your own observability tooling and buying a commercially available solution, consider which will prove a return on investment (ROI) more quickly. If you don’t have enough time to build, you will need to buy.

I prefer buying to get a quick win and stakeholder attention, while on the side you start to build with open-source components.

Essentially, you are running two projects in parallel. You buy to get immediate benefits and gain stakeholder attention, and then, on the side, you start to build your own, which may be more flexible for you in the long term.

Conclusion:

Observability has become an indispensable tool in the realm of modern systems. Its holistic approach, encompassing logs, metrics, and traces, enables us to gain deep insights into system behavior. By adopting observability practices, we can proactively maintain systems, analyze data to drive informed decisions, and ultimately ensure the smooth operation of complex systems in today’s technology-driven world.

 

Distributed Systems Observability

In today’s technology-driven world, distributed systems have become the backbone of numerous applications and services. These systems are designed to handle large-scale data processing, ensure fault tolerance, and provide high scalability. However, managing and monitoring distributed systems can be challenging. This is where observability comes into play. In this blog post, we will explore the significance of distributed systems observability and how it enables efficient management and troubleshooting.

Distributed systems observability refers to the ability to gain insights into the inner workings of a distributed system. It encompasses monitoring, logging, and tracing capabilities that allow engineers to effectively understand system behavior, performance, and potential issues. By adopting observability practices, organizations can ensure the smooth operation of their distributed systems and identify and resolve problems quickly.

 

Highlights: Distributed Systems Observability

  • The Role of Megatrends

A considerable wave of innovation has spawned several megatrends that affect how we manage and view our network infrastructure and that drive the need for distributed systems observability. We have seen the decomposition of everything, from one to many.

Many services and dependencies in multiple locations, aka microservices observability, must be managed and operated, instead of the monolith where everything is generally housed internally. These megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

  • Shift in Control

There has also been a shift in the point of control. As we move toward new technologies, many of the loosely coupled services or infrastructures your services rely upon are not under your control. The edge of control has been pushed outward, creating different network and security perimeters. These perimeters are now closer to the workload than a central security stack, so the workloads themselves must be concerned with security.

 

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

 




Key Distributed Systems Observability points:

  • We no longer have predictable failures.
  • The different demands on networks.
  • The issues with the metric-based approach.
  • Static thresholds and alerting.
  • The three pillars of Observability.

 

Back to Basics with Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have been expected of just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

 

Diagram: Systems Observability design.

The Key Components of Observability:

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring:

Monitoring involves the continuous collection and analysis of system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.
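
As a small, hedged example of this kind of metric collection, the sketch below uses the prometheus_client Python library to expose a request counter and a latency histogram for a monitoring system to scrape; the metric names and the simulated work are illustrative only.

```python
# Minimal sketch: exposing request metrics for a monitoring system to scrape.
# Requires the prometheus_client package; metric names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                        # record how long the request took
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(status="200").inc()         # count the outcome

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request()
```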

2. Logging:

Logging involves the recording of events, activities, and errors occurring within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.
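
To illustrate, here is a minimal sketch of emitting structured (JSON) log events with Python's standard logging module, which makes records much easier to ship to a centralized store and aggregate; the field names and messages are invented.

```python
# Minimal sketch: emitting structured (JSON) log events so they can be shipped
# to a centralized log store and aggregated. Field names are illustrative.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "ts": time.time(),
            "level": record.levelname,
            "service": "checkout",
            "message": record.getMessage(),
        }
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorised for order 1234")
logger.error("inventory lookup timed out")
```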

3. Tracing:

Tracing involves capturing the flow of requests and interactions between different components of the distributed system. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact with each other.

Benefits of Observability in Distributed Systems:

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting:

Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization:

By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management:

Observability facilitates the monitoring of system changes and their impact on the overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps in maintaining system stability and avoiding unexpected issues.

 

How This Affects Failures

The primary issue I have seen with my clients is that application failures are no longer predictable: dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have a lot of partial failures that are not just unexpected but have never been seen before. Consider, for example, the network hero.

 

The network hero

The network hero is someone who knows every part of the network and has seen every failure at least once. That role is no longer enough in today’s world; we need proper Observability. When I was working as an engineer, we would have plenty of failures, but more than likely we would have seen them before, and there was a system in place to fix the error. Today’s environment is much different.

We can no longer rely on simply seeing UP or DOWN, setting static thresholds, and then alerting on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer’s perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers’ perspective at what matters to them. Content delivery network (CDN) providers were among the first to recognize this and to measure what matters most to the customer.

 

Distributed Systems Observability

The different demands

So these new, complex distributed systems place very different demands on your infrastructure and on the people who manage it. For example, in a microservices architecture, several things can go wrong with a particular microservice:

    • The microservice could be running under high resource utilization and is therefore slow to respond, causing timeouts.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice could be fine, but slow-running database queries could be dragging it down.

In short, we have a lot of partial failures.

 

Therefore: We can no longer predict

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and good system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But, they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization and more than likely managed by different people trying to monitor a very dispersed application with multiple components and services in various places. 

 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes: it depends on known failures and predictable failure patterns. So we set predictable thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems move above or below the thresholds that were previously set, and we can then raise alerts and hope they are actionable. This is only useful for variants of predictable failure modes.

Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred. But they don’t let you dig into the source of the problem, slice and dice the data, or see correlations between errors. In a complex system, this approach makes it hard to get to the root cause in a reasonable timeframe.

 

Traditional style metrics systems

With traditional-style metrics systems, you had to define custom metrics, and they were always defined upfront. With this approach, you can’t ask new questions about problems; you have to define the questions you want to ask before you need them.

Then we set performance thresholds, pronounced them “good” or “bad,” and checked and re-checked those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. We should always be observing instead of waiting for a problem, such as a threshold being crossed, before acting.

Diagram: System Observability analysis.

 

Metrics: Lack of connective event

Metrics do not retain the connective event, so you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems, and the condition detected via system metrics might be unrelated to what is actually happening.

An example could be an unusual number of running threads on one component, which might indicate that garbage collection is in progress, or that slow response times may be imminent in an upstream service.

 

User experience and static thresholds

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in different ways, using different components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

Alerts should be few, and they should be triggered only by symptoms that directly impact user experience, not simply because a threshold was reached.

 

  • The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.
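
To illustrate the difference, the following hedged sketch contrasts a static resource threshold with an alert driven by a user-facing SLI; all numbers and thresholds are hypothetical.

```python
# Minimal sketch: alerting on a user-facing SLI instead of a static resource threshold.
# All values are hypothetical.

cpu_utilisation = 0.85      # static-threshold view: "CPU above 80%, page someone!"
error_rate = 0.0004         # user-facing view: fraction of requests failing
SLO_ERROR_BUDGET = 0.001    # implied by a 99.9% availability target

# Static threshold: fires even though users may be perfectly happy.
static_alert = cpu_utilisation > 0.80

# Symptom-based: fires only when the error budget is being burned.
burn_rate = error_rate / SLO_ERROR_BUDGET
symptom_alert = burn_rate > 1.0   # consuming budget faster than it accrues

print(f"Static CPU alert:    {static_alert}")   # True, but no user impact
print(f"SLO burn-rate alert: {symptom_alert}")  # False, users are fine
```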

 

The Need For Distributed Systems Observability

Systems observability and reliability in distributed systems are a practice. Rather than focusing on a tool that does logging, metrics, or alerting, Observability is about how you approach problems, and for this, you need to look at your culture. You could say that Observability is a cultural practice: it allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint, and we want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual) and the network, and what the data looks like both in transit and at rest.

What level of observation do you need so you know that everything is performing as it should? And what should you be looking at to get this level of detail?

Monitoring is knowing the data points and the entities we are gathering them from. Observability, on the other hand, is putting all of that data together: monitoring collects the data, and Observability assembles it into a single pane of glass. Observability is observing patterns and deviations from the baseline; monitoring is getting the data into the systems. A vital part of an Observability toolkit is service level objectives (SLOs).

 

The three pillars of distributed systems observability

We have three pillars of systems observability: metrics, traces, and logging. It is an oversimplification to define Observability as just these pillars, but you do need them in place. Observability is all about connecting the dots between each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing lets you visualize each step in a service request’s execution, no matter how complex the service dependencies are. You could say that the complexity of dynamic systems is abstracted away by distributed tracing.

 

  • Use Case: Challenges without tracing.

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

 

  • Distributed tracing: A winning formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them challenging to debug unless their relationships are clearly understood.

Conclusion:

In the world of distributed systems, observability plays a vital role in ensuring the stability, performance, and reliability of complex architectures. Monitoring, logging, and tracing provide engineers with the necessary tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.