
Reliability In Distributed System




When considering reliability in a distributed system, considerable shifts in our environmental landscape have caused us to examine how we operate and run our systems and networks. We have had a mega shift with the introduction of various cloud platforms, their services, and containers. With it came the complexity of managing distributed systems observability and microservices observability, which has unveiled significant gaps in current practices, not to mention flaws in the operational practices around these technologies.

This has caused a knee-jerk reaction in the form of a welcome drive for innovation in system reliability. Yet some of the technologies and tools used to manage these innovations have not kept pace with them. Many of these tools have stayed relatively static in our dynamic environment. So we have static tools used in a dynamic environment, which causes friction for reliability in distributed systems and raises the need for more efficient network visibility.


Preliminary Information: Useful Links to Relevant Content

Before you proceed, you may find the following post helpful:

  1. Distributed Firewalls


Reliability In Distributed Systems

Key Reliability in Distributed System Discussion Points:

  • Complexity managing distributed systems.

  • Static tools in a dynamic environment.

  • Observability vs Monitoring.

  • Creative failures and black holes.

  • SRE teams and service level objectives.

  • New tools: Distributed tracing.


  • A key point: Video reliability in the distributed system

In the following video, we will discuss the essential feature of any system, reliability, which is not a feature that a vendor can sell you. We will discuss the importance of distributed systems and the need to fully understand them with practices like Chaos Engineering and Site Reliability Engineering (SRE). We will also discuss the issues with monitoring and static thresholds.



A Key Point: Knowledge Check 

  • A key point: Back to basics with distributed systems.

Distributed systems are required to implement the reliability, agility, and scale expected of modern computer programs. Distributed systems are applications made up of many different components running on many different machines. Containers are the foundational building block, and groups of containers co-located on a single machine make up the atomic elements of distributed system patterns.

Distributed System Observability

The big shift we see with software platforms is that they evolve much more quickly than the products and paradigms we use to monitor them. We need to consider new practices and technologies, with dedicated platform teams, to enable a new era of system reliability in a distributed system. We also need the practices of Observability, which are a step up from the traditional monitoring of static infrastructure: Observability vs monitoring.


Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they look to capture and examine signals in isolation. The monitoring systems work in a siloed environment, similar to developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” common in modern distributed systems, which often leads to service disruptions. So you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today don’t have much sense of predictability, certainly not enough to rely on static thresholds, alerts, and old monitoring tools. If something is static, it can be automated, and we do have static events, such as a Kubernetes Pod failing or reaching a limit. A ReplicaSet can then start a replacement Pod, which the scheduler places on a node that satisfies constraints such as Kubernetes labels and node selectors. However, this is only a tiny piece of the failure puzzle in a distributed environment. Today, we have what are known as partial failures, and systems that fail in very creative ways.


Reliability In Distributed System: Creative ways to fail

So we know that some of these failures are easily predicted, and actions can be taken automatically. For example, if a Kubernetes node reaches a certain utilization, Pods can be automatically rescheduled onto a different node to stay within our known scale limits. We have predictable failures that can be automated, not just in Kubernetes but with any infrastructure. An Ansible playbook is useful when we have predictable events. However, we have much more to deal with than Pod scaling; we have many partial and complicated failures known as black holes.


In today’s world of partial failures

Microservices applications are distributed and susceptible to many external factors. In the traditional monolithic application style, by contrast, all the functions reside in the same process. It was either switched on or off, and not much happened in between. So if there is a failure in the process, the application as a whole fails. The result is binary: the application is either up or down.

With some essential monitoring, this was easy to detect, and failures were predictable. A significant benefit of these monoliths is that you don’t have partial failures.

However, in a cloud-native world, where we have broken the old monolith into a microservices-based application, a request made from a client can go through multiple hops of microservices, and we can have several problems to deal with. There is a lack of connectivity between the different domains. Monitoring tools and knowledge are often tied to a single domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is a critical metric to care about.
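To make the gap between threshold alerts and user satisfaction concrete, here is a minimal sketch in Python. The service names, the 90% CPU threshold, and the 99% success target are all assumptions chosen for illustration and are not taken from any particular monitoring product. It compares a static CPU check with a check on the share of user requests that succeed.

```python
from dataclasses import dataclass


@dataclass
class ServiceSample:
    """One observation window for a single (hypothetical) service."""
    name: str
    cpu_percent: float      # infrastructure signal, watched in isolation
    requests: int           # user requests seen in the window
    failed_requests: int    # user requests that returned an error


CPU_ALERT_THRESHOLD = 90.0   # assumed static threshold
MIN_SUCCESS_RATIO = 0.99     # assumed user-facing target


def static_threshold_alerts(samples):
    """Traditional check: alert only when a fixed CPU threshold is crossed."""
    return [s.name for s in samples if s.cpu_percent > CPU_ALERT_THRESHOLD]


def user_facing_alerts(samples):
    """Alert when the share of successful requests falls below the target."""
    alerts = []
    for s in samples:
        if s.requests == 0:
            continue  # no traffic in this window, nothing to judge
        success_ratio = (s.requests - s.failed_requests) / s.requests
        if success_ratio < MIN_SUCCESS_RATIO:
            alerts.append(s.name)
    return alerts


# Hypothetical data: checkout looks healthy on CPU but fails 12% of requests.
samples = [
    ServiceSample("checkout", cpu_percent=35.0, requests=1000, failed_requests=120),
    ServiceSample("catalog", cpu_percent=42.0, requests=5000, failed_requests=5),
]

print(static_threshold_alerts(samples))
print(user_facing_alerts(samples))
```

With this made-up data, the static check stays silent while the user-facing check flags the checkout service. That is the kind of partial failure a threshold watched in isolation can miss.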


System reliability: Today, you have no way to predict

So modern, complex distributed systems place very different demands on your infrastructure, considerably different from the simple three-tier application where everything was generally housed in one location. We can no longer predict how these systems will fail, which puts the brakes on traditional monitoring approaches. When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcome strategy.



  • A note on Blackholes: Strange failure modes

When considering a distributed system, many things can happen. A service or region can disappear entirely, or disappear for a few seconds or milliseconds and then reappear. When we see strange failure modes like this, we describe it as going into a black hole: anything that goes into it disappears. Strange failure modes are unexpected and surprising, and there is certainly nothing predictable about them. So what happens when your banking transactions are in a black hole? What if your bank balance is displayed incorrectly, or you make a transfer to an external account and it does not show up?
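One way to see why brief disappearances are so hard to catch is to simulate a periodic health check against a few outage windows. The following Python sketch is purely illustrative: the outage timestamps and the 30-second probe interval are assumed values, not measurements from a real system.

```python
def probe_times(start_s, end_s, interval_s):
    """Timestamps (in seconds) at which a periodic health check runs."""
    times = []
    t = start_s
    while t <= end_s:
        times.append(t)
        t += interval_s
    return times


def outages_seen(outages, probes):
    """Return the outages that at least one probe landed inside."""
    seen = []
    for begin, end in outages:
        if any(begin <= p < end for p in probes):
            seen.append((begin, end))
    return seen


# Hypothetical outage windows in seconds: two brief blips and one long outage.
outages = [(12.2, 12.7), (47.0, 49.5), (80.0, 125.0)]
probes = probe_times(0, 180, interval_s=30)   # assumed check every 30 seconds

detected = outages_seen(outages, probes)
missed = [o for o in outages if o not in detected]
print("detected:", detected)
print("missed:", missed)
```

In this simulation, only the long outage overlaps a probe. The two short blips fall between checks, so from the monitoring system's point of view they never happened, even though any request sent during them would have vanished.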


Highlighting Site Reliability Engineering (SRE) and Observability

Site Reliability Engineering (SRE) and Observability practices are needed to manage this kind of unpredictability and these unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. As a rule of thumb, about 20% of your issues cause 80% of your problems. You need to be proactive, fix these issues upfront, and get ahead of the curve to stop incidents from occurring. In practice, the appetite for this work usually arrives in the wake of a massive incident. The incident acts as a teachable moment and gives the organization a reason to listen to a proposal for a Chaos Engineering project.


New tools and technologies: Distributed tracing

We have new tools, such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so that we can figure out where the time has been spent, and it can be used across a distributed microservice architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
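The idea behind tracing can be sketched with nothing but the Python standard library. To be clear, this is a toy illustration and not the OpenTelemetry API: the `span` helper, the `finished_spans` list, and the service names are all hypothetical. A real setup would use the OpenTelemetry SDK and export spans to a tracing backend.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []   # stand-in for an exporter that ships spans to a backend


@contextmanager
def span(name, trace_id, parent_id=None):
    """Record how long the wrapped block takes as one span of a trace."""
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        finished_spans.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_request():
    """Simulated request crossing three hypothetical services."""
    trace_id = uuid.uuid4().hex
    with span("api-gateway", trace_id) as root:
        with span("auth-service", trace_id, parent_id=root):
            time.sleep(0.01)
        with span("inventory-service", trace_id, parent_id=root):
            time.sleep(0.05)   # the deliberately slow hop
    return trace_id


trace_id = handle_request()
children = [s for s in finished_spans
            if s["trace_id"] == trace_id and s["parent_id"]]
slowest = max(children, key=lambda s: s["duration_ms"])
print(f"slowest hop: {slowest['name']} ({slowest['duration_ms']:.1f} ms)")
```

Because every span carries the same trace ID and a reference to its parent, the time spent in each hop of a single request can be stitched back together. Here that points at the simulated inventory service as the bottleneck.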




  • SLA, SLI, SLO, and Error Budgets

We don’t just want to know that something has happened and then react to an event that says nothing about the customer’s perspective. We need to understand whether we are meeting our SLA by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) not only assist you with these measurements. They also offer a tool for achieving better reliability and form the base of the Reliability Stack.
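As a rough sketch of how these measurements relate, the following Python snippet computes an availability SLI, compares it with an SLO, and reports how much of the error budget is left. The 99.9% target and the request counts are assumed values for illustration only, and the function names are hypothetical.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events that were good (e.g. successful requests)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


SLO_TARGET = 0.999          # assumed objective: 99.9% of requests succeed
total = 1_000_000           # hypothetical requests in the window
good = 999_400              # hypothetical successful requests

sli = availability_sli(good, total)
remaining = error_budget_remaining(SLO_TARGET, good, total)
print(f"SLI: {sli:.4%}")
print(f"SLO met: {sli >= SLO_TARGET}")
print(f"error budget remaining: {remaining:.0%}")
```

With these numbers, the objective allows roughly 1,000 failed requests in the window and 600 have occurred, so the SLO is met and about 40% of the error budget remains. Framing reliability this way ties alerting and release decisions to what customers experience, not to an isolated infrastructure signal.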



Matt Conran: The Visual Age