
Service Level Objectives (SLOs): Customer-centric view

 

 

Service Level Objectives (SLOs)

In today’s fast-paced digital world, businesses heavily rely on various software applications and online services to ensure smooth operations and deliver value to their customers. However, the increasing complexity of these systems often poses challenges in terms of reliability, availability, and performance. This is where Service Level Objectives (SLOs) come into play. In this blog post, we will delve into the concept of SLOs and explore their significance in achieving service excellence.

Service Level Objectives, or SLOs, are measurable targets that define the desired level of service performance, availability, and reliability. They are a critical component of Service Level Agreements (SLAs) between service providers and customers. SLOs help set clear expectations and enable businesses to monitor, measure, and improve their service delivery based on agreed-upon metrics.

 

Highlights: Service Level Objectives (SLOs)

Site Reliability Engineering (SRE) teams have tools such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets that guide them toward building a reliable system, with the customer’s viewpoint as the metric. These tools form the basis for reliability in distributed systems and are the core building blocks of a reliable stack that assist with baseline engineering. The first thing you need to understand is what is expected of the service. This introduces service-level management and its components.

  • The Role of Service-Level Management

The core concepts of service-level management are Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). Commonly used indicators are availability, latency, duration, and efficiency. Monitoring these indicators to catch problems before your SLO is violated is critical. These concepts are the cornerstone of a good SRE practice.

    • SLI (Service Level Indicator): A well-defined measure of “successful enough.” It is a quantifiable measurement of whether a given user interaction was good enough. Did it meet the users’ expectations? Did the web page load within a specific time? This allows you to categorize a given interaction as good or bad.
    • SLO (Service Level Objective): A top-line target for the fraction of interactions that must be successful.
    • SLA (Service Level Agreement): The consequences of missing the agreed targets. It’s more of a legal construct.
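
As a rough illustration of how an SLI and an SLO relate, the sketch below classifies each request as good or bad (the SLI) and then compares the fraction of good requests against a target (the SLO). The `Request` type, the 300 ms latency limit, and the 99% target are assumptions chosen for the example, not values from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def is_good(req: Request, max_latency_ms: float = 300.0) -> bool:
    """SLI: was this single interaction 'successful enough'?"""
    return req.status < 500 and req.latency_ms <= max_latency_ms


def sli(requests: list[Request]) -> float:
    """Fraction of good interactions, between 0.0 and 1.0."""
    if not requests:
        return 1.0  # no traffic means nothing was bad
    return sum(is_good(r) for r in requests) / len(requests)


def meets_slo(requests: list[Request], target: float = 0.99) -> bool:
    """SLO: is the fraction of good interactions at or above the target?"""
    return sli(requests) >= target


requests = (
    [Request(200, 120.0)] * 97    # fast and successful
    + [Request(200, 900.0)] * 2   # successful but too slow
    + [Request(503, 50.0)]        # fast but failed
)
print(f"SLI = {sli(requests):.2%}, meets 99% SLO: {meets_slo(requests)}")
```

Here 97 of 100 interactions are good, so the 99% objective is missed even though only one request returned an error. The slow responses count against the user experience as well.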

 

For background information, you may find the following helpful:

  1. Starting Observability
  2. Distributed Firewalls
  3. Network Traffic Engineering
  4. Brownfield Network Automation

 



Service Level Objectives

Key Service Level Objectives (SLOs) discussion points:


  • Required for baseline engineering. 

  • Components of a Reliable system.

  • Chaos Engineering.

  • The issue with static thresholds.

  • How to approach Reliability.

 

  • A key point: Video on System Reliability and SLOs

The following video will discuss the importance of distributed systems observability and the need to fully comprehend them with practices like Chaos Engineering and Site Reliability Engineering (SRE). In addition, we will again discuss the problems with monitoring and static thresholds.

 

Site Reliability Engineering | Observability

 

  • A key point: Back to basics with Service Level Objectives

Site Reliability Engineering (SRE)

Pioneered by Google to make large-scale systems more scalable and reliable, SRE has become one of today’s most valuable software innovation opportunities. SRE is a concrete, opinionated implementation of the DevOps philosophy. Its main goal is to create scalable and highly reliable software systems.

According to Benjamin Treynor Sloss, the founder of Google’s Site Reliability Team, “SRE is what happens when a software engineer is tasked with what used to be called operations.”

System Reliability Meaning
Diagram: System reliability meaning.

 

 

Reliability is not so much a feature as a practice. It must be prioritized and taken into consideration from the very beginning, not added later on, for example, once a system or service is already in production. Reliability is the most essential property of any system, and it is not something a vendor can sell you.

So if someone tries to sell you an add-on solution called Reliability, don’t buy it, especially if they offer 100% reliability. Nothing can be 100% reliable all the time. If you strive for 100% reliability, you will miss out on opportunities to innovate, experiment, and take the risks that help you build better products and services.
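
To make the cost of each extra “nine” concrete, this short sketch converts an availability target into the downtime it permits. The 30-day window is an assumption for illustration.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


for target in (0.99, 0.999, 0.9999, 1.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.1f} minutes of downtime per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days for deployments, experiments, and failures. A 100% target leaves none.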

Why are SLOs Important?

SLOs play a vital role in ensuring customer satisfaction and meeting business objectives. Here are a few reasons why SLOs are essential:

1. Accountability: SLOs provide a framework for holding service providers accountable for meeting the promised service levels. They establish a baseline for evaluating the performance and quality of the service.

2. Customer Experience: By setting SLOs, businesses can align their service offerings with customer expectations. This helps deliver a superior customer experience, foster customer loyalty, and gain a competitive edge in the market.

3. Performance Monitoring and Improvement: SLOs enable businesses to continuously monitor their services’ performance and identify areas for improvement. Regularly tracking SLO metrics allows for proactive measures and optimizations to enhance service reliability and availability.

Critical Elements of SLOs:

To effectively implement SLOs, it is essential to consider the following key elements:

1. Metrics: SLOs should be based on relevant, measurable metrics that accurately reflect the desired service performance. Standard metrics include response time, uptime percentage, error rate, and throughput.

2. Targets: SLOs must define specific targets for each metric, considering customer expectations, industry standards, and business requirements. Targets should be achievable yet challenging enough to drive continuous improvement.

3. Monitoring and Alerting: Establishing robust monitoring and alerting mechanisms allows businesses to track the performance of their services in real time. This enables timely intervention and remediation in case of deviations from the defined SLOs.

4. Communication: Effective communication with customers is crucial to ensure transparency and manage expectations. Businesses should communicate SLOs, including the metrics, targets, and potential limitations, to foster trust and maintain a healthy customer-provider relationship.
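
The metric, target, and alerting elements above can be sketched as a small data structure. This is a minimal illustration: the `SLO` class, the metric names, and the target values are hypothetical, and the observed values would normally come from your monitoring system, aggregated over the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    metric: str             # name of the measured metric
    target: float           # agreed threshold for that metric
    higher_is_better: bool  # True for uptime, False for latency or error rate

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


def breached(slos: list[SLO], observed: dict[str, float]) -> list[str]:
    """Return the metric names whose observed value misses its target."""
    return [s.metric for s in slos if not s.is_met(observed[s.metric])]


slos = [
    SLO("uptime_percent", 99.9, higher_is_better=True),
    SLO("p95_response_ms", 400.0, higher_is_better=False),
    SLO("error_rate_percent", 0.5, higher_is_better=False),
]
# Example measurements (made up for the illustration).
observed = {"uptime_percent": 99.95, "p95_response_ms": 520.0, "error_rate_percent": 0.2}

for metric in breached(slos, observed):
    print(f"ALERT: {metric} is outside its SLO target")
```

Keeping the definitions in one place like this also gives you something concrete to share with customers when communicating targets.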

 

Components of a Reliable System

Distributed system

To build reliable systems that can tolerate various failures, the system needs to be distributed so that a problem in one location doesn’t mean your entire service stops operating. So you need to build a system that can handle, for example, a node dying or perform adequately with a particular load.

To create a reliable system, you need to understand it fully, including what happens when the different components that make up the system reach certain thresholds. This is where practices such as Chaos Engineering on Kubernetes can help you.

 

Chaos Engineering 

Practices like Chaos Engineering can confirm your expectations, give you confidence in your system at different levels, and prove that it tolerates certain levels of failure. Chaos Engineering allows you to find weaknesses and vulnerabilities in complex systems. It is an important task that can be automated in your CI/CD pipelines.

So you can run various Chaos Engineering verifications before you reach production. These tests, such as load and latency tests, can all be automated with little or no human interaction. Site Reliability Engineering (SRE) teams often use Chaos Engineering to improve resilience, and it should be part of your software development and deployment process.
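
As a toy illustration of an automated chaos check, the sketch below injects failures into a stand-in dependency and verifies whether the caller still meets a success-ratio target. Everything here is hypothetical: `FaultInjector`, the failure pattern (every tenth call), and the 99.9% target are assumptions. Real chaos tooling injects faults into live infrastructure rather than into a local function.

```python
from dataclasses import dataclass
from typing import Callable


class InjectedFault(Exception):
    """Simulates a failing dependency."""


@dataclass
class FaultInjector:
    """Wraps a callable and fails every `fail_every`-th call (0 disables faults)."""
    target: Callable[[int], int]
    fail_every: int
    calls: int = 0

    def __call__(self, arg: int) -> int:
        self.calls += 1
        if self.fail_every and self.calls % self.fail_every == 0:
            raise InjectedFault(f"injected failure on call {self.calls}")
        return self.target(arg)


def call_with_retry(dependency: Callable[[int], int], arg: int, attempts: int):
    """Try the dependency up to `attempts` times; return None if all fail."""
    for _ in range(attempts):
        try:
            return dependency(arg)
        except InjectedFault:
            continue
    return None


def success_ratio(fail_every: int, attempts: int, requests: int = 1000) -> float:
    """Run the experiment and return the fraction of requests that succeeded."""
    dependency = FaultInjector(target=lambda x: x * 2, fail_every=fail_every)
    good = sum(
        call_with_retry(dependency, i, attempts) is not None
        for i in range(requests)
    )
    return good / requests


SLO_TARGET = 0.999  # assumed target for the experiment

without_retry = success_ratio(fail_every=10, attempts=1)
with_retry = success_ratio(fail_every=10, attempts=2)
print(f"no retry:  {without_retry:.1%}, meets SLO: {without_retry >= SLO_TARGET}")
print(f"one retry: {with_retry:.1%}, meets SLO: {with_retry >= SLO_TARGET}")
```

In a pipeline, the final comparison would typically be an assertion that fails the build when the system cannot tolerate the injected fault.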

 

  • A key point: Video on Starting a Chaos Engineering Project

This educational tutorial begins with guidance on how applications have changed from the monolithic style to the microservices-based approach and how this has affected failures. I will show how this can be addressed by knowing exactly how your application and infrastructure perform under stress and where their breaking points are.

 

Chaos Engineering: How to Start A Project

 

It’s All About Perception: Customer-Centric View

Reliability is all about perception. If users consider your service unreliable, you will lose their trust, so it’s important to keep your services as consistent as possible. It’s OK to have some outages. Outages are expected, but you can’t have them all the time or for long durations.

Users expect outages at some point, but not long ones. User perception is everything; if the user thinks you are unreliable, you are. Therefore, you need a customer-centric view, and customer satisfaction is a critical metric to measure.

This is where the key components of service management, such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), come into play. You need to find a balance between velocity and stability. You can’t stop innovation, but you can’t take too many risks either. An Error Budget, together with Site Reliability Engineering (SRE) principles, will help you here.
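
An error budget is the amount of unreliability the SLO permits, that is, one minus the target. The sketch below shows the arithmetic and a very simple policy built on it. The function names, the request counts, and the “ship while any budget remains” rule are assumptions for illustration; real policies are usually more nuanced.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of bad requests the SLO tolerates in the measurement window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - bad_requests / budget


def can_take_risks(slo_target: float, total_requests: int, bad_requests: int) -> bool:
    """Simple policy: keep shipping and experimenting while some budget is left."""
    return budget_remaining(slo_target, total_requests, bad_requests) > 0


# A 99.9% target over one million requests tolerates about 1,000 bad requests.
print(f"remaining after 250 bad requests: {budget_remaining(0.999, 1_000_000, 250):.0%}")
print("ship new features:", can_take_risks(0.999, 1_000_000, 250))
print("ship new features:", can_take_risks(0.999, 1_000_000, 1_200))
```

While budget remains, the team can spend it on velocity. Once it is overspent, the balance shifts toward stability.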

 

User Experience and Static Thresholds

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in different ways, using different components, so their experiences can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over.

With complex microservices and many software interactions, we have many unpredictable failures that we have never seen before. These are often referred to as black holes. We should have only a few alerts, triggered by symptoms that directly impact user experience and not simply because a threshold was reached.

If your POD network reaches a certain threshold, this tells you nothing about user experience. You can’t rely on static thresholds anymore, as they have no relationship to customer satisfaction.

Static thresholds can’t reliably indicate issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short here, as it usually relies on predefined dashboards that look for something that has happened before.

This brings us back to the challenges with traditional metrics-based monitoring: we rely on static thresholds to define optimal system conditions, and those thresholds have nothing to do with user experience. Modern systems change shape dynamically under different workloads, so static thresholds can’t reflect impacts on user experience. They lack context and are too coarse.
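
One common alternative to a static resource threshold is to alert on how fast the error budget is being consumed, often called the burn rate. The sketch below is a simplified, single-window version. The 99.9% target and the alerting factor of 10 are assumptions for illustration, and production setups usually combine several windows.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 uses up the budget exactly at the end of the SLO period.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget is being spent
    error_ratio = bad / total
    return error_ratio / (1.0 - slo_target)


def should_alert(bad: int, total: int, slo_target: float = 0.999,
                 max_burn_rate: float = 10.0) -> bool:
    """Alert only when users are being hurt fast enough to threaten the SLO."""
    return burn_rate(bad, total, slo_target) >= max_burn_rate


# A busy but healthy window: few user-visible errors, so no alert,
# regardless of how high CPU or network utilization happens to be.
print(should_alert(bad=5, total=10_000))

# A window where 2% of requests fail: the budget is burning about 20x too fast.
print(should_alert(bad=200, total=10_000))
```

The alert condition is expressed entirely in terms of failed user interactions, which is what ties it back to customer satisfaction.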

 

How to Approach Reliability 

New tools and technologies

We have new tools, such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so we can figure out where time is being spent, and it can be used across a distributed microservice architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
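
To show the idea behind a trace span, here is a standard-library-only sketch that times nested operations and reports the slowest one. It is not the OpenTelemetry API: the `span` helper and the operation names are hypothetical, and the sleeps stand in for real work. A real tracing library would also propagate context between services and export the spans to a backend.

```python
import time
from contextlib import contextmanager

spans: list[tuple[str, float]] = []  # (span name, duration in seconds)


@contextmanager
def span(name: str):
    """Record how long the wrapped block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))


def handle_checkout() -> None:
    with span("checkout"):
        with span("load_cart"):
            time.sleep(0.005)  # stand-in for a fast cache lookup
        with span("charge_card"):
            time.sleep(0.05)   # stand-in for a slow downstream call


handle_checkout()

# Ignore the parent span and find which child operation took the longest.
children = [s for s in spans if s[0] != "checkout"]
slowest = max(children, key=lambda s: s[1])
print(f"bottleneck: {slowest[0]} ({slowest[1] * 1000:.0f} ms)")
```

Because each span records its own duration, the slow downstream call stands out immediately instead of being hidden inside the total request time.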

We have already touched on Service Level Objectives, Service Level Indicators, and Error Budgets. You want to know why and how something has happened, not just when it happened. Simply reacting to an event is not looking at the system from the customer’s perspective.

We need to understand whether we are meeting our Service Level Agreement (SLA) by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist you with these measurements.

SLOs and SLIs do more than assist with measurements. They offer a tool for achieving better system reliability and form the base of the Reliability Stack. SLIs and SLOs help us interact with reliability differently and offer us a path to building a reliable system.

So now we have the tools and a discipline to use the tools within. Can you recall what that discipline is? It is Site Reliability Engineering (SRE).

System Reliability Formula
Diagram: System Reliability Formula.

 

  • SLO-Based approach to reliability

If you’re too reliable all the time, you’re also missing out on some of the fundamental features that SLO-based approaches give you. The main area you will miss is the freedom to do what you want, test, and innovate. If you’re too reliable, you’re missing out on opportunities to experiment, perform chaos engineering, ship features quicker than before, or even introduce structured downtime to see how your dependencies react.

To learn a system, you need to break it. So if you are 100% reliable, you can’t touch your system, so you will never truly learn and understand your system. You want to give your users a good experience, but you’ll run out of resources in various ways if you try to ensure this good experience happens 100% of the time. SLOs let you pick a target that lives between those two worlds.

 

  • Balance velocity and stability

So you can’t just have reliability by itself; you must also have new features and innovation. Therefore, you need to find a balance between velocity and stability, weighing reliability against the features you have and are proposing to offer. Suppose you have a system with a fantastic feature that doesn’t work: the users who have a choice will leave.

Site Reliability Engineering is the framework for balancing velocity and stability. How do you know what level of reliability you need to provide to your customers? This all goes back to the business needs that reflect the customer’s expectations. So with SRE, we have a customer-centric approach.

The primary source of outages is change, even when the changes are planned. Change comes in many forms, such as pushing new features, applying security patches, deploying new hardware, and scaling up to meet customer demand. All of these will have a significant impact if you strive for a 100% reliability target.

If nothing changed in the physical or logical infrastructure or other components, we would introduce no new bugs. We could freeze our current user base and never have to scale the system. In reality, this will not happen. There will always be change, so you need to find a balance.

Conclusion:

In conclusion, Service Level Objectives (SLOs) are a cornerstone for delivering reliable and high-quality services in today’s technology-driven world. By setting measurable targets, businesses can align their service performance with customer expectations, drive continuous improvement, and ultimately enhance customer satisfaction. Implementing and monitoring SLOs allows businesses to proactively address issues, optimize service delivery, and stay ahead of the competition. By embracing SLOs, businesses can pave the way for successful service delivery and long-term growth.

 

Matt Conran: The Visual Age