Site Reliability Engineering (SRE) teams have tools such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets that guide them toward building a reliable system, with the customer’s viewpoint as the metric. These tools form the basis of a reliable system and are the core building blocks of a reliable stack. The first thing you need to understand is what is expected of the service. This brings us to service-level management and its components. The core concepts of service-level management are the Service Level Agreement (SLA), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). Common indicators are availability, latency, duration, and efficiency. It is critical to monitor these indicators so that you catch problems before your SLO is violated. Together, these concepts are the cornerstone of a good SRE practice.
- SLI: Service Level Indicator: A well-defined measure of “successful enough.” It is a quantifiable measurement of whether a given user interaction was good enough: did it meet the user’s expectations? For example, did a web page load within a certain time? This allows you to categorize a given interaction as good or bad.
- SLO: Service Level Objective: A top-line target for the fraction of interactions that must be successful.
- SLA: Service Level Agreement: The consequences that follow when the agreed service level is not met. It is more of a legal construct.
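To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. The `Interaction` record, the 300 ms latency threshold, and the 99% target are all assumptions chosen for illustration, not values taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One user interaction, such as a page load (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def is_good(event: Interaction, latency_threshold_ms: float = 300.0) -> bool:
    """SLI rule: an interaction is good if it succeeded and was fast enough."""
    return event.succeeded and event.latency_ms <= latency_threshold_ms


def sli(events: list[Interaction], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of good interactions; an empty window is treated as fully good."""
    if not events:
        return 1.0
    good = sum(is_good(e, latency_threshold_ms) for e in events)
    return good / len(events)


def meets_slo(events: list[Interaction], slo_target: float = 0.99) -> bool:
    """SLO check: is the measured SLI at or above the target fraction?"""
    return sli(events) >= slo_target


events = [
    Interaction(True, 120.0),
    Interaction(True, 450.0),   # succeeded, but too slow: counted as bad
    Interaction(False, 80.0),   # fast, but failed: counted as bad
    Interaction(True, 200.0),
]
print(f"SLI = {sli(events):.2f}, SLO met: {meets_slo(events)}")
```

The SLI classifies each interaction as good or bad, and the SLO is only a target on the resulting fraction. An SLA would sit outside this code entirely, in a contract.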
So, reliability is not so much a feature as a practice that must be prioritized and taken into consideration from the very beginning; it should not be added later, for example once a system or service is already in production. Reliability is the most important feature of any system, yet it is not one that a vendor can sell you. So if someone tries to sell you an add-on solution called Reliability, don’t buy it, especially if they offer you 100% reliability. Nothing can be 100% reliable all the time. If you strive for 100% reliability, you will miss out on opportunities to innovate, experiment, and take the risks that help you build better products and services.
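A quick calculation shows why a 100% target leaves no room for anything. The sketch below converts an availability target into allowed downtime over a window. The 30-day window and the function name are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a target permits over the window (assumed 30 days)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


for target in (0.99, 0.999, 0.9999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows {minutes:.2f} minutes of downtime")
```

Each extra nine cuts the allowance by a factor of ten. At 100% the allowance is zero, so any change that carries risk is ruled out.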
Components of a Reliable System
To build reliable systems that can tolerate a variety of failures, the system needs to be distributed so that a problem in one location doesn’t stop your entire service. You need a system that can handle, for example, a node dying, and that performs adequately under a given load. To create a reliable system, you need to understand it fully, including what happens when the components that make it up reach certain thresholds. This is where practices such as Chaos Engineering can help you.
Practices like Chaos Engineering can confirm your expectations, give you confidence in your system at different levels, and show how much failure the system can tolerate while remaining reliable. Chaos Engineering allows you to find weaknesses and vulnerabilities in complex systems. It is an important task that can be automated in your CI/CD pipelines, so you can run various Chaos Engineering verifications before you reach production. These tests, such as load and latency tests, can all be automated with little or no human interaction. Site Reliability Engineering (SRE) teams often use Chaos Engineering to improve resilience, and it should be part of your software development and deployment process.
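As a rough illustration of an automated chaos check, the sketch below injects a single node failure into a toy in-memory cluster and verifies that requests still succeed. The `Cluster` class, node names, and retry policy are all hypothetical stand-ins. A real experiment would target actual infrastructure through a chaos tool rather than a simulation like this one.

```python
import random


class Cluster:
    """Toy stand-in for a replicated service (hypothetical, for illustration)."""

    def __init__(self, node_names):
        self.healthy = {name: True for name in node_names}

    def kill(self, name):
        """Inject a fault: mark one node as dead."""
        self.healthy[name] = False

    def handle_request(self, rng, max_attempts):
        """Try up to max_attempts distinct nodes in random order."""
        candidates = list(self.healthy)
        rng.shuffle(candidates)
        for name in candidates[:max_attempts]:
            if self.healthy[name]:
                return True
        return False


def run_experiment(max_attempts, requests=1000, seed=7):
    """Kill one of three nodes, then measure the request success rate."""
    rng = random.Random(seed)  # seeded so the run is repeatable in CI
    cluster = Cluster(["node-a", "node-b", "node-c"])
    cluster.kill("node-b")
    ok = sum(cluster.handle_request(rng, max_attempts) for _ in range(requests))
    return ok / requests


def test_survives_single_node_failure():
    """Steady-state hypothesis: with one retry, losing a node is invisible."""
    assert run_experiment(max_attempts=2) == 1.0


test_survives_single_node_failure()
print("success rate without retry:", run_experiment(max_attempts=1))
```

With a second attempt on a different node, a single dead node can never fail a request, so the hypothesis holds. Without the retry, a noticeable share of requests fail, which is the kind of weakness such an experiment is meant to surface before production.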
It’s All About Perception: Customer-Centric View
Reliability is all about perception. If users consider your service unreliable, you will lose their trust, so it’s important to keep your services as consistent as you can. It’s OK to have some outages. Users expect outages at some point, but not all the time and not for long durations. User perception is everything: if the user thinks you are unreliable, you are. Therefore, you need a customer-centric view, with customer satisfaction as a critical metric to measure. This is where the key components of service-level management, such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), come into play. There is a balance to find between velocity and stability. You can’t stop innovation, but you can’t take too many risks either. An Error Budget and Site Reliability Engineering (SRE) principles will help you here.
User Experience: Static Thresholds
User experience means different things to different sets of users. We now have a model where different users of a service may be routed through the system in different ways, using different components, so their experiences can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over. With complex microservices and many software interactions, we see many unpredictable failures that have never been seen before. These are often referred to as black holes. We should have fewer alerts, triggered only by symptoms that directly impact user experience and not merely because a threshold was reached. If your pod network reaches a certain threshold, this tells you nothing about user experience. You can’t rely on static thresholds anymore, as they have no relationship to customer satisfaction.
Static thresholds can’t reliably indicate issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short here, as it usually relies on predefined dashboards that look for something that has happened before. This is the core challenge with traditional metrics-based monitoring: we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience. Modern systems, however, change shape dynamically under different workloads, so static thresholds lack context and are too coarse to reflect the impact on users.
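One common alternative to a static resource threshold is to alert on how fast user-facing failures are consuming the error budget, often called the burn rate. The sketch below is one possible form of that idea. The function names, the 99.9% target, and the paging threshold of 14.4 are assumptions to be tuned per service, not fixed rules.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    budget_rate = 1.0 - slo_target  # fraction of events allowed to be bad
    return error_rate / budget_rate


def should_page(bad_events: int, total_events: int,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when users are being hurt fast enough to threaten the SLO."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# Last hour: 1000 requests, 20 of them bad from the user's point of view.
print(should_page(bad_events=20, total_events=1000))
# Last hour: 1000 requests, only 5 bad.
print(should_page(bad_events=5, total_events=1000))
```

The alert input here is the count of bad user interactions, so it fires only on symptoms users actually feel. A busy network or a hot CPU that hurts nobody never triggers it.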
How to Approach Reliability
New Tools and Technologies
We have new tools such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so we can figure out where time is being spent, and it can be used across a distributed microservice architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
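The idea behind tracing can be illustrated with a small standard-library sketch. It is not the OpenTelemetry API: the `Tracer` and `FakeClock` classes and the span names below are hypothetical, and a real deployment would use an OpenTelemetry SDK to create spans and export them. The sketch only shows how nested, timed spans point to where the time went.

```python
import time
from contextlib import contextmanager


class FakeClock:
    """Controllable clock so the example is repeatable (illustration only)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Tracer:
    """Minimal span recorder; assumes span names are unique within a trace."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = []    # finished spans
        self._stack = []   # names of currently open spans

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = self.clock()
        try:
            yield
        finally:
            duration = self.clock() - start
            self._stack.pop()
            self.spans.append({"name": name, "parent": parent, "duration": duration})

    def self_times(self):
        """Time spent in each span, excluding time spent in its direct children."""
        result = {s["name"]: s["duration"] for s in self.spans}
        for s in self.spans:
            if s["parent"] is not None:
                result[s["parent"]] -= s["duration"]
        return result


clock = FakeClock()
tracer = Tracer(clock=clock)
with tracer.span("handle_request"):
    with tracer.span("authenticate"):
        clock.advance(0.010)   # simulated work
    with tracer.span("query_database"):
        clock.advance(0.250)
    with tracer.span("render_response"):
        clock.advance(0.020)
    clock.advance(0.005)       # work done directly in the parent span

times = tracer.self_times()
bottleneck = max(times, key=times.get)
print("bottleneck:", bottleneck)
```

Subtracting child durations from the parent keeps the outer span from always looking like the slowest one, so the span that dominates is the step worth investigating first.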
We have already touched on Service Level Objectives, Service Level Indicators, and the Error Budget. You want to know why and how something has happened, not just when it happened, so that you are not merely reacting to an event without looking at it from the customer’s perspective. We need to understand whether we are meeting the Service Level Agreement (SLA) by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist you with these measurements.
Service Level Objectives (SLOs) and Service Level Indicators (SLIs) do more than assist you with measurements. They offer a tool for achieving better system reliability and form the base of the Reliability Stack. SLIs and SLOs help us interact with reliability differently and offer us a path to building a reliable system. So now we have the tools and a discipline within which to use them. Can you recall what that discipline is? The discipline is Site Reliability Engineering (SRE).
SLO-Based Approach to Reliability
If you’re too reliable all the time, you’re also missing out on some of the fundamental benefits that SLO-based approaches give you. The main thing you will be missing is the freedom to do what you want, test, and innovate: to experiment, perform chaos engineering, ship features quicker than before, or even introduce structured downtime to see how your dependencies react. To learn a system, you need to break it. If you must be 100% reliable, you can’t touch your system, so you will never truly learn and understand it. You want to give your users a good experience, but you’ll run out of resources in various ways if you try to ensure this good experience happens 100% of the time. SLOs let you pick a target that lives between those two worlds.
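The space between those two worlds is the error budget: the number of bad interactions the SLO still allows. A minimal sketch follows, assuming an event-based SLO. The function name, the request counts, and the targets are illustrative.

```python
def error_budget(total_events: int, bad_events: int, slo_target: float) -> dict:
    """Summarize how much of the error budget a window has used."""
    allowed_bad = (1.0 - slo_target) * total_events
    remaining = allowed_bad - bad_events
    # With a 100% target there is no budget at all, so nothing remains to spend.
    remaining_fraction = remaining / allowed_bad if allowed_bad > 0 else 0.0
    return {
        "allowed_bad": allowed_bad,
        "consumed": bad_events,
        "remaining": remaining,
        "remaining_fraction": remaining_fraction,
    }


# 1,000,000 requests this window with a 99.9% target and 400 bad requests.
budget = error_budget(total_events=1_000_000, bad_events=400, slo_target=0.999)
print(f"{budget['remaining']:.0f} bad requests still allowed "
      f"({budget['remaining_fraction']:.0%} of the budget left)")

# The same traffic under a 100% target leaves no room for any experiment.
strict = error_budget(total_events=1_000_000, bad_events=0, slo_target=1.0)
print("budget under a 100% target:", strict["allowed_bad"])
```

Whatever budget remains can be spent deliberately on experiments, chaos tests, or faster releases. A negative remainder is a signal to slow down and favor stability.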
Balance Velocity and Stability
You can’t just have reliability by itself; you also need new features and innovation. Therefore, you need to find a balance between velocity and stability, weighing reliability against the other features you have and are proposing to offer. Suppose you have a system with an amazing feature that doesn’t work: the users who have a choice will leave. The framework for finding the balance between velocity and stability is Site Reliability Engineering. So how do you know what level of reliability you need to provide to your customers? This all goes back to the business needs that reflect the customer’s expectations. With SRE, we have a customer-centric approach.
The main source of outages is change, even when the change is planned. Change comes in many forms, such as pushing new features, applying security patches, deploying new hardware, and scaling up to meet customer demand, and all of these will have a large impact if you strive for a 100% reliability target. If nothing changed in the physical or logical infrastructure or in other components, we would introduce no new bugs; we could freeze our current user base and never have to scale the system. In reality, this will not happen. There will always be changes, so you need to find a balance.