system reliability

Reliability In Distributed System



Reliability In Distributed System

Distributed systems have become an integral part of our modern technological landscape. Whether it’s cloud computing, internet banking, or online shopping, these systems play a crucial role in providing seamless services to users worldwide. However, as distributed systems grow in complexity, ensuring their reliability becomes increasingly challenging. In this blog post, we will explore the concept of reliability in distributed systems and discuss various techniques to achieve fault-tolerant operations.

Reliability in distributed systems refers to the ability of the system to consistently function as intended, even in the presence of hardware failures, network partitions, and other unforeseen events. To achieve reliability, system designers employ various techniques, such as redundancy, replication, and fault tolerance, to minimize the impact of failures and ensure continuous service availability.


Highlights: Reliability in Distributed System

  • Shift in Landscape

When considering reliability in a distributed system, considerable shifts in our environmental landscape have caused us to examine how we operate and run our systems and networks. We have had a mega shift with the introduction of various cloud platforms and their services and containers, along with the complexity of managing distributed systems observability and microservices observability that unveil significant gaps in current practices in our technologies. Not to mention the flaws with the operational practices around these technologies.

  • Existing Static Tools

This has caused a knee-jerk reaction to a welcomed drive-in innovation to system reliability. Yet, some technologies and tools used to manage these innovations do not align with the innovative events. Many of these tools have stayed relatively static in our dynamic environment. So we have static tools used in a dynamic environment, which causes friction to reliability in distributed systems and the rise for more efficient network visibility.


Before you proceed, you may find the following post helpful:

  1. Distributed Firewalls
  2. SD WAN Static Network Based


Reliability In Distributed Systems

Key Reliability in Distributed System Discussion Points:

  • Complexity managing distributed systems.

  • Static tools in a dynamic environment.

  • Observability vs Monitoring.

  • Creative failures and black holes.

  • SRE teams and service level objectives.

  • New tools: Disributed tracing.


  • A key point: Video reliability in the distributed system

In the following video, we will discuss the essential feature of any system, reliability, which is not a feature that a vendor can sell you. We will discuss the importance of distributed systems and the need to fully understand them with practices like Chaos Engineering and Site Reliability Engineering (SRE). We will also discuss the issues with monitoring and static thresholds.



Back to basics with Reliability in Distributed Systems

Distributed Systems.

Distributed systems are required to implement the reliability, agility, and scale expected of modern computer programs. Distributed systems are applications of many different components running on many other machines. Containers are the foundational building block, and groups of containers co-located on a single device comprise the atomic elements of distributed system patterns.

Distributed System Observability

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. We need to consider new practices and technologies with dedicated platform teams to enable a new era of system reliability in a distributed system. Along with the practices of Observability that are a step up to the traditional monitoring of static infrastructure: Observability vs monitoring.


Reliability in Distributed Systems: Components

Redundancy and Replication:

Redundancy and replication are two fundamental concepts used in distributed systems to enhance reliability. Redundancy involves duplicating critical system components, such as servers, storage devices, or network links, so that if one fails, the redundant component can seamlessly take over. Replication, on the other hand, involves creating multiple copies of data across different nodes in a system, enabling efficient data access and fault tolerance. By incorporating redundancy and replication, distributed systems can continue to operate even when individual components fail.

Fault Tolerance:

Fault tolerance is a key aspect of achieving reliability in distributed systems. It involves designing systems that can continue to operate correctly even when one or more components encounter failures. There are several techniques employed to achieve fault tolerance, such as error detection, error recovery, and error prevention mechanisms.

Error Detection:

Error detection techniques, such as checksums, hashing, and cyclic redundancy checks (CRC), are used to identify errors or data corruption during transmission or storage. By verifying the integrity of data, these techniques help identify and mitigate potential failures in distributed systems.

Error Recovery:

Error recovery mechanisms, such as checkpointing and rollback recovery, aim to restore the system to a consistent state after a failure. Checkpointing involves periodically saving the system’s state and data, allowing recovery to a previously known good state in case of failures. Rollback recovery, on the other hand, involves undoing the effects of failed operations and bringing the system back to a consistent state.

Error Prevention:

To enhance reliability, distributed systems employ error prevention techniques, such as redundancy elimination, consensus algorithms, and load balancing. Redundancy elimination reduces unnecessary duplication of data or computation, thereby reducing the chances of errors. Consensus algorithms ensure that all nodes in a distributed system agree on a common state despite failures or message delays. Load balancing techniques distribute computational tasks evenly across multiple nodes to prevent overloading and potential failures.


Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they look to capture and examine signals in isolation. The monitoring systems work in a siloed environment, similar to developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “Unknowns Unknowns” familiar with modern distributed systems. This often leads to disruptions of services. So you may be asking what an “Unknown Unknown” is.

I’ll put it to you this way: distributed systems we see today don’t have much sense of predictability—certainly not enough predictability to rely on static thresholds, alerts, and old monitoring tools. If something is static, it can be automated, and we have static events such as in Kubernetes, a POD reaching a limit.

Then a replica set introduces another pod on a different node if specific parameters are met, such as Kubernetes Labels and Node Selectors. However, this is only a tiny piece of the failure puzzle in a distributed environment.  Today, we have what’s known as partial failures and systems that fail in very creative ways.


Reliability In Distributed System: Creative ways to fail

So we know that some of these failures are quickly predicted, and actions are taken. For example, if this Kubernetes POD node reaches a certain utilization, we can automatically reschedule PODs on a different node to stay within our known scale limits.

We have predictable failures that can be automated, not just in Kubernetes but with any infrastructure. An Ansible script is useful when we have predictable events. However, we have much more to deal with than POD scaling; we have many partial and complicated failures known as black holes.


In today’s world of partial failures

Microservices applications are distributed and susceptible to many external factors. On the other hand, if you examine the traditional monolithic application style, all the functions reside in the same process. It was either switched ON or OFF!! Not much happened in between. So if there is a failure in the process, the application as a whole will fail. The results are binary, usually either a UP or Down.

And with some essential monitoring, this was easy to detect, and failures were predictable. There was no such thing as a partial failure. In a monolith application, all application functions are within the same process. And a significant benefit of these monoliths is that you don’t have partial failures.

However, in a cloud-native world, where we have broken the old monolith into a microservices-based application, a request made from a client can go through multiple hops of microservices, and we can have several problems to deal with.

There is a lack of connectivity between the different domains. Many monitoring tools and knowledge will be tied to each domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is a critical metric to care about.


System reliability: Today, you have no way to predict

So the new modern and complex distributed systems place very different demands on your infrastructure—considerably different from the simple three-tier application where everything was generally housed in one location.  We can’t predict anything anymore, which puts the brakes on traditional monitoring approaches.

When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcomed strategy.


Blackholes: Strange failure modes

When considering a distributed system, many things can happen. A service or region can disappear or disappear for a few seconds or ms and reappear. We consider this as going into a black hole when we have strange failure modes. So when anything goes into it will disappear. So strange failure modes are unexpected and surprising.

There is certainly nothing predictable about strange failure modes. So what happens when your banking transactions are in a black hole? What if your banking balance is displayed incorrectly or if you make a transfer to an external account and it does not show up? 


  • A key point: Video on Observability and Monitoring

We will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are unknown and unpredictable. Therefore a pre-defined monitoring dashboard will have a hard time keeping up with the rate of change and unknown failure modes. For this, we should look to have the practice of observability for software and monitoring for infrastructure.



Highlighting Site Reliability Engineering (SRE) and Observability

Site Reliability Engineering (SRE) and Observability practices are needed to manage these types of unpredictability and unknown failures. SRE is about making systems more reliable. And everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems.

You need to be proactive and fix these issues upfront. You need to be able to get ahead of the curve and do these things to stop the incidents from occurring. This usually happens in the wake of a massive incident. This usually acts as a teachable moment. This gives the power to be the reason to listen to a Chaos Engineering project. 


New tools and technologies: Distributed tracing

We have new tools, such as distributed tracing. So if the system becomes slow, what is the best way to find the bottleneck? Here you can use Distributed Tracing and Open Telemetry. So the tracing helps us instrument our system, so we figure out where the time has been spent and can be used across distributed microservice architecture to troubleshoot problems. Open Telemetry provides a standardized way of instrumenting our system and providing those traces.

distributed tracing



  • SLA, SLI, SLO, and Error Budgets

So we don’t just want to know when something has happened and then react to an event that is not looking from the customer’s perspective. We need to understand if we are meeting SLA by gathering the number and frequency of the outages and any performance issues.

Service Level Objectives (SLO) and Service Level Indicators (SLI) can assist you with measurements. Service Level Objectives (SLOs) and Service Level Indicators (SLI) not only assist you with measurements. They offer a tool for having better Reliability and form the base for the Reliability Stack.


Conclusion: Reliability is a critical aspect of distributed systems, ensuring that operations continue seamlessly despite component failures or unexpected events. By employing redundancy, replication, fault tolerance, and error prevention techniques, system designers can enhance the reliability of distributed systems. As technology advances and distributed systems become more complex, ensuring reliability will remain a key challenge, requiring continuous innovation and adaptation.


Matt Conran: The Visual Age
Latest posts by Matt Conran: The Visual Age (see all)

Comments are closed.