To support the new variations, your infrastructure is in the middle of a paradigm shift. As systems become more distributed and complex, the methods for building and operating them are evolving, making visibility into your services and infrastructure more important than ever. All of this leads you to adopt new practices, such as Observability and the implementation of service level objectives (SLO). Observability aims to provide a level of introspection to understand the internal state of the systems and applications. That understanding can be achieved in various ways. The most common way to fully understand this state is with a combination of logs, metrics, and traces as debugging signals. These signals need to be viewed together, not in isolation. So, you have probably come across the difference between monitoring and Observability. But how many articles have you come across with guidance on starting an observability project?
Diagram: Observability Engineering. The Starting Point
The Immediate Starting Strategy
You should start your observability project in the middle, not on the fringe, and start with something important. There is no point in starting an observability project on something no one cares about or uses that much. So choose something that matters, and the result will be noticed. Something that no one cares about, on the other hand, will not attract any interest from stakeholders.
Service Level Objectives (SLO)
So, to start an observability project on something that matters and will attract interest, you need to look at metrics that matter, and those come from service level objectives (SLO). With service level objectives, we attach the needs of the product and business to the needs of the individual components, finding the right balance for starting observability projects. The service level objective is aggregated over time, and it is the mathematical counterpart of an error budget. The question it answers is: over this period, am I breaching my target? If you meet or exceed your SLO target, your users are happy with the state of your service. If you are missing your SLO target, your users are unhappy with the state of your service. It’s as simple as that. So the SLO is the goal for the service over a measurement period. It contains two things: a target and a measurement window. Example: 99.9% of checkout requests in the past 30 days have been successful. Here, 99.9% is the target, and thirty days is the measurement window.
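The checkout example above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `CheckoutRequest` record, the function names, and the choice to treat a window with no traffic as meeting the SLO are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CheckoutRequest:
    """Hypothetical record of one checkout request."""
    timestamp: datetime
    success: bool


def success_ratio(requests: List[CheckoutRequest], now: datetime,
                  window: timedelta) -> Optional[float]:
    """Fraction of requests inside the measurement window that succeeded."""
    start = now - window
    in_window = [r for r in requests if start <= r.timestamp <= now]
    if not in_window:
        return None  # no traffic in the window, nothing to measure
    good = sum(1 for r in in_window if r.success)
    return good / len(in_window)


def meets_slo(requests: List[CheckoutRequest], now: datetime,
              target: float = 0.999,
              window: timedelta = timedelta(days=30)) -> bool:
    """True when the success ratio over the window reaches the target."""
    ratio = success_ratio(requests, now, window)
    # Assumption: an empty window counts as meeting the SLO.
    return ratio is None or ratio >= target


now = datetime(2024, 1, 31)
requests = (
    [CheckoutRequest(now - timedelta(days=1), True)] * 999
    + [CheckoutRequest(now - timedelta(days=2), False)]
    # This failure is 45 days old, so it falls outside the 30-day window.
    + [CheckoutRequest(now - timedelta(days=45), False)]
)

print(success_ratio(requests, now, timedelta(days=30)))  # 0.999
print(meets_slo(requests, now))                          # True
```

Note that the old failure does not count against the service: only events inside the measurement window are considered.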
Diagram: Site Reliability Engineering (SRE) and Observability. Link to YouTube video.
- Key Point: Take advantage of Error Budgets
Once you have determined your service level objectives, you should look at your error budgets. Nothing can be reliable all the time, and it’s ok to fail. Accepting some failure is what lets you run tests and innovate to better meet user requirements, which is why we have an error budget. An error budget is the amount of failure you are allowed to have over a given period, such as per hour or per month. It is the amount of unreliability we will tolerate, and it gives us a way to measure it. So once you know how much of the error budget you have left, you can decide whether to take more risks and roll out new features. Error budgets help you balance velocity and reliability. So the practices of SLO and error budgets let you prioritize between reliability and velocity.
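To make the arithmetic concrete, here is a small sketch of how an error budget follows from an SLO target. The function names and the `reserve` threshold are hypothetical choices for illustration; real teams set their own policy for when a release is too risky.

```python
def allowed_failures(target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates over the window."""
    return (1.0 - target) * total_events


def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Time-based view of the same budget, in minutes of full outage."""
    return (1.0 - target) * window_days * 24 * 60


def budget_remaining(target: float, total_events: int,
                     failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = allowed_failures(target, total_events)
    if allowed == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1.0 - failed_events / allowed


def can_take_risk(target: float, total_events: int, failed_events: int,
                  reserve: float = 0.2) -> bool:
    """Assumed policy: ship risky changes only while more than
    `reserve` of the budget is left."""
    return budget_remaining(target, total_events, failed_events) > reserve


# A 99.9% target over 30 days tolerates about 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(0.999, 30), 1))

# 1,000,000 requests at 99.9% allow roughly 1,000 failures.
# With 250 failures so far, about 75% of the budget is left.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
print(can_take_risk(0.999, 1_000_000, 250))    # budget left: release
print(can_take_risk(0.999, 1_000_000, 1_200))  # budget overspent: hold
```

The design choice here is that the budget is derived entirely from the SLO target, so tightening the target automatically shrinks the room for risk.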
- Key Point: Issues with MTTR
The SLO is a good way to start and win. This can be approached on a team-by-team basis. It’s a much more accurate way to measure reliability than Mean Time to Recovery (MTTR). The issue with MTTR is that for every incident, you measure the time it took to resolve it, and that measurement can be subject to error. The SLO is harder to cheat and a better way to measure. So we have key performance indicators (KPI), service level indicators (SLI), and service level objectives (SLO). These are the first ways to implement Observability, rather than just looking at the KPI. First, you should monitor the KPI and SLx along with the system’s internal state. From there, you can derive your service level metrics. And these will be the best place to start an observability project.
What Is a KPI and SLI: User Experience
The key performance indicator is tied to system implementation. It conveys health and performance, and it may change if there are architectural changes to the system. For example, database latency would be a KPI. In contrast to the KPI, we have the service level indicator. An SLI is a measurement of your user experience and can be derived from several signals. The SLI does not change unless the user’s needs change. It’s the metric that matters most to the user. This indicator tells you if your service is acceptable or not. So this is the line that tells you if you have a happy or sad user. It’s a performance measurement, like a metric that describes a user’s experience.
- Types of SLI
Examples of SLIs include availability, latency, correctness, quality, freshness, and throughput. We need to gather these metrics, which can be done by implementing several measurement strategies such as application-level metrics, log processing, and client-side instrumentation. So, if we look at an SLI implementation for availability, it would be, for example, the proportion of HTTP GET requests for a given type of request that succeed.
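A common way to express an SLI is as the ratio of good events to valid events. The sketch below applies that ratio to availability and latency. The `HttpRequest` record, the `/checkout` path, the rule that a 5xx status counts as a failure, and the 300 ms threshold are all assumptions chosen for the example, not values from any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class HttpRequest:
    """Hypothetical request record, e.g. parsed from access logs."""
    method: str
    path: str
    status: int
    latency_ms: float


def sli(events: Iterable[HttpRequest],
        is_valid: Callable[[HttpRequest], bool],
        is_good: Callable[[HttpRequest], bool]) -> Optional[float]:
    """Good events divided by valid events, or None with no valid events."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None
    return sum(1 for e in valid if is_good(e)) / len(valid)


def is_checkout_get(r: HttpRequest) -> bool:
    # Only GET requests to the (hypothetical) checkout path are in scope.
    return r.method == "GET" and r.path.startswith("/checkout")


log = [
    HttpRequest("GET", "/checkout", 200, 120.0),
    HttpRequest("GET", "/checkout", 200, 250.0),
    HttpRequest("GET", "/checkout", 503, 900.0),
    HttpRequest("GET", "/checkout", 200, 80.0),
    HttpRequest("POST", "/login", 500, 40.0),  # out of scope, ignored
]

# Availability: share of in-scope requests without a server error.
availability = sli(log, is_checkout_get, lambda r: r.status < 500)

# Latency: share of in-scope requests served within the threshold.
fast_enough = sli(log, is_checkout_get, lambda r: r.latency_ms <= 300.0)

print(availability)  # 0.75
print(fast_enough)   # 0.75
```

Keeping the validity filter separate from the success rule makes it easy to reuse the same function for other SLI types, such as freshness or correctness.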
Users care about SLIs and not KPIs. I’m not saying that database latency is not important. You should measure it and put it in a predefined dashboard. But users don’t care about database latency or how quickly their requests can be restored. Instead, the role of the SLI is to capture the user’s expectation of how the system is behaving. So, if your database is too slow, you might front your database with a cache. The cache hit ratio then becomes a KPI, but the user’s expectations have not changed.
Starting Observability Engineering
- Good Quality Telemetry
You need to understand the importance of high-quality telemetry. Adopting it carefully is the first step to good Observability. Quality logs and metrics, combined with a modern approach such as Observability, are required for long-term success. For this, you need good telemetry; without it, it is going to be hard to shorten the length of outages.
- Instrumentation: OpenTelemetry
The first step to consider is how your applications will emit telemetry data. For instrumentation of both frameworks and application code, OpenTelemetry is the emerging standard. With OpenTelemetry’s pluggable exporters, you can configure your instrumentation to send data to the analytics tool of your choice. In addition, OpenTelemetry helps you with distributed tracing, which helps you understand system interdependencies. Those interdependencies can obscure problems and make them difficult to debug unless the relationships between them are clearly understood.
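To show why tracing exposes interdependencies, here is a standard-library-only sketch of the idea behind it. This is not the OpenTelemetry API; it only mimics the data model, where every span carries a shared trace ID and a pointer to its parent span. The service names, span names, and helper functions are hypothetical.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Span:
    """One unit of work in one service, linked to its caller."""
    name: str
    service: str
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span


def start_span(name: str, service: str,
               parent: Optional[Span] = None) -> Span:
    """Create a span, inheriting the trace ID from the parent if any."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, service, trace_id, secrets.token_hex(8), parent_id)


def dependency_edges(spans: List[Span]) -> Set[Tuple[str, str]]:
    """Derive caller -> callee service pairs from parent links."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is not None and parent.service != span.service:
            edges.add((parent.service, span.service))
    return edges


# One request flowing through three hypothetical services.
root = start_span("GET /checkout", "frontend")
charge = start_span("charge-card", "payments", parent=root)
query = start_span("SELECT orders", "payments-db", parent=charge)

print(sorted(dependency_edges([root, charge, query])))
# [('frontend', 'payments'), ('payments', 'payments-db')]
```

In a real deployment, the instrumentation library propagates the trace and span IDs across process boundaries for you, and the analytics backend performs this kind of reconstruction.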
Diagram: Distributed Tracing Example
- Data Storage and Analytics
Once you have high-quality telemetry data, you need to consider how it’s stored and analyzed. Data storage and analytics are often bundled into the same solution, but that depends on whether you use open source or proprietary solutions. Commercial vendors typically bundle storage and analytics. These solutions will be proprietary all-in-one solutions, including Honeycomb, Lightstep, New Relic, Splunk, Datadog, etc. Then we have the open-source solutions that typically require separate data storage and analytics approaches. These open-source frontends include solutions like Grafana, Prometheus, or Jaeger. While they handle analytics, they all require a separate data store to scale. Popular open-source data storage layers include Cassandra, Elastic, M3, and InfluxDB.
- A Final Note: Buy Instead of Building.
Knowing how to start is the biggest pain point. The decision is whether to build your own observability tooling or buy a commercially available solution, and buying proves return on investment (ROI) quickly. You will need to buy if you don’t have enough time. I prefer buying to get a quick return and stakeholder attention. On the side, you can start to build with open-source components. Essentially, you are running two projects in parallel: you buy to get quick benefits and gain stakeholder attention, and at the same time you start to build your own, which may be more flexible for you in the long term.