Distributed Systems Observability

In the realm of modern technology, distributed systems have become the backbone of numerous applications and services. However, the increasing complexity of such systems poses significant challenges when it comes to monitoring and understanding their behavior. This is where observability steps in, offering a comprehensive way to gain insight into the intricate workings of distributed systems. In this blog post, we will explore distributed systems observability: its key concepts, tools, and benefits.

Observability, as a concept, enables us to gain deep insights into the internal state of a system based on its external outputs. When it comes to distributed systems, observability takes on a whole new level of complexity. It encompasses the ability to effectively monitor, debug, and analyze the behavior of interconnected components across a distributed architecture. By employing various techniques and tools, observability allows us to gain a holistic understanding of the system's performance, bottlenecks, and potential issues.

To achieve observability in distributed systems, it is crucial to focus on three interconnected components: logs, metrics, and traces.

Logs: Logs provide a chronological record of events and activities within the system, offering valuable insights into what has occurred. By analyzing logs, engineers can identify anomalies, track down errors, and troubleshoot issues effectively.

Metrics: Metrics, on the other hand, provide quantitative measurements of the system's performance and behavior. They offer a rich source of data that can be analyzed to gain a deeper understanding of resource utilization, response times, and overall system health.

Traces: Traces enable the visualization and analysis of transactions as they traverse through the distributed system. By capturing the flow of requests and their associated metadata, traces allow engineers to identify bottlenecks, latency issues, and performance optimizations.
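To make the logs pillar a little more concrete, here is a minimal, illustrative Python sketch of structured logging: each record is emitted as JSON and carries a correlation ID so it can later be matched against metrics and traces. The service name and fields are hypothetical.

```python
import json
import logging
import uuid

# Structured-logging sketch: every record is JSON and carries a correlation ID
# so the same request can later be matched against metrics and traces.
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-service")  # hypothetical service name

def handle_request(payload: dict) -> None:
    correlation_id = payload.get("correlation_id", str(uuid.uuid4()))
    logger.info(json.dumps({
        "severity": "INFO",
        "service": "checkout-service",
        "correlation_id": correlation_id,
        "event": "request_received",
        "item_count": len(payload.get("items", [])),
    }))

handle_request({"items": ["sku-1", "sku-2"]})
```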

In the ever-evolving landscape of distributed systems observability, a plethora of tools and frameworks have emerged to simplify the process. Prominent examples include:

1. Prometheus: A powerful open-source monitoring and alerting system that excels in collecting and storing metrics from distributed environments (a brief instrumentation sketch follows this list).

2. Jaeger: An end-to-end distributed tracing system that enables the visualization and analysis of transaction flows across complex systems.

3. ELK Stack: A comprehensive combination of Elasticsearch, Logstash, and Kibana, which collectively offer powerful log management, analysis, and visualization capabilities.

4. Grafana: A widely-used open-source platform for creating rich and interactive dashboards, allowing engineers to visualize metrics and logs in real-time.
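As a brief illustration of the first tool above, the sketch below uses the official prometheus_client Python package to expose a request counter and a latency histogram for Prometheus to scrape; the metric names and endpoint are illustrative.

```python
import random
import time

# Requires the official prometheus_client package (pip install prometheus-client).
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds", ["endpoint"])

def handle(endpoint: str) -> None:
    with LATENCY.labels(endpoint=endpoint).time():  # observe how long the work takes
        time.sleep(random.uniform(0.01, 0.1))       # stand-in for real request handling
    REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle("/checkout")
```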

The adoption of observability in distributed systems brings forth a multitude of benefits. It empowers engineers and DevOps teams to proactively detect and diagnose issues, leading to faster troubleshooting and reduced downtime. Observability also aids in capacity planning, resource optimization, and identifying performance bottlenecks. Moreover, it facilitates collaboration between teams by providing a shared understanding of the system's behavior and enabling effective communication.

In the ever-evolving landscape of distributed systems, observability plays a pivotal role in unraveling the complexity and gaining insights into system behavior. By leveraging the power of logs, metrics, and traces, along with robust tools and frameworks, engineers can navigate the intricate world of distributed systems with confidence. Embracing observability empowers organizations to build resilient, high-performing systems that can withstand the challenges of today's digital landscape.

Highlights: Distributed Systems Observability

The Role of Distributed Systems

A – Several decades ago, only a handful of mission-critical services worldwide had to meet the availability and reliability requirements that today’s always-on applications and APIs must satisfy. In response to user demand, every application must be built to scale nearly instantly to accommodate the potential for rapid, viral growth. Almost every app built today, whether a mobile app for consumers or a backend payment system, must meet these constraints and requirements.

B – Distributed systems are inherently more reliable because of their distributed nature. When these systems are appropriately designed, the software engineers who build them can also benefit from more scalable organizational models. There is, however, a price to pay for these advantages.

C – Designing, building, and debugging these distributed systems can be challenging. A reliable distributed system requires significantly more engineering skills than a single-machine application, such as a mobile app or a web frontend. Regardless, distributed systems are becoming increasingly important. There is a corresponding need for tools, patterns, and practices to build them.

D – As digital transformation accelerates, organizations adopt multicloud environments to drive secure innovation and achieve speed, scale, and agility. As a result, technology stacks are becoming increasingly complex and scalable. Today, even the most straightforward digital transaction is supported by an array of cloud-native services and platforms delivered by various providers. To improve user experience and resilience, IT and security teams must monitor and manage their applications.

**Key Components of Observability**

Observability in distributed systems typically relies on three pillars: logs, metrics, and traces. Logs provide detailed records of events within the system, offering context for debugging issues. Metrics offer quantitative data, such as CPU usage and request rates, allowing teams to monitor system health and performance over time. Traces enable the tracking of requests as they move through the system, helping to pinpoint where latency or failures occur. Together, these components create a comprehensive picture of the system’s state and behavior.

**Challenges in Achieving Observability**

While observability is essential, achieving it in distributed systems is not without its challenges. The sheer volume of data generated by these systems can be overwhelming. Additionally, correlating data from disparate sources to form a cohesive narrative requires sophisticated tools and techniques. Moreover, ensuring that observability doesn’t introduce too much overhead or affect system performance is a delicate balancing act. Organizations must invest in the right infrastructure and expertise to tackle these challenges effectively.

**Best Practices for Enhancing Observability**

To maximize observability in distributed systems, organizations should adopt several best practices. Firstly, they should implement centralized logging and monitoring solutions that can aggregate data from all system components. Secondly, leveraging open standards like OpenTelemetry can facilitate consistent data collection and integration with various tools. Thirdly, incorporating automated alerting and anomaly detection can help teams proactively address issues before they impact users. Lastly, fostering a culture of collaboration between development and operations teams can ensure that observability is an ongoing, shared responsibility.
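As a small example of the OpenTelemetry standard mentioned above, the sketch below configures the Python SDK with a console exporter and records a span; the service and span names are hypothetical, and in practice you would swap in a Jaeger or Cloud Trace exporter.

```python
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payments-service")  # hypothetical service name

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment gateway here ...

charge_card("order-42")
```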

Cloud Service Mesh

### What is a Cloud Service Mesh?

A Cloud Service Mesh is a dedicated infrastructure layer that facilitates service-to-service communication in a microservices architecture. It abstracts the complex communication patterns between services into a manageable, secure, and observable framework. By deploying a service mesh, organizations can effectively manage the interactions of their microservices, ensuring seamless connectivity, security, and resilience.

### Key Benefits of Implementing a Cloud Service Mesh

1. **Enhanced Security**: A service mesh provides robust security features such as mutual TLS authentication, which encrypts communications between services. This ensures that data remains secure and tamper-proof as it travels across the network.

2. **Traffic Management**: With a service mesh, you can implement sophisticated traffic management policies, including load balancing, circuit breaking, and retries. This leads to improved performance and reliability of your distributed systems.

3. **Observability**: One of the standout features of a service mesh is its ability to provide deep observability into the interactions between services. Metrics, logs, and traces are collected and analyzed, offering invaluable insights into system health and performance.

### Enhancing Observability in Distributed Systems

Observability is a key concern in managing distributed systems. With the proliferation of microservices, tracking and understanding service interactions can become overwhelmingly complex. A Cloud Service Mesh addresses this challenge by offering comprehensive observability features:

– **Metrics Collection**: Collects real-time metrics on service performance, latency, error rates, and more.

– **Distributed Tracing**: Enables tracing of requests as they propagate through multiple services, helping identify bottlenecks and performance issues.

– **Centralized Logging**: Aggregates logs from various services, providing a unified view for easier troubleshooting and analysis.

These capabilities empower teams to detect issues early, optimize performance, and ensure the reliability of their applications.

### Real-World Applications and Use Cases

Several organizations have successfully implemented Cloud Service Meshes to transform their operations. For instance, financial institutions use service meshes to secure sensitive transactions, while e-commerce platforms leverage them to manage high traffic volumes during peak shopping seasons. By providing a robust framework for service communication, a service mesh enhances scalability, reliability, and security across industries.

Google's Ops Agent

Ops Agent is a lightweight agent that runs on your Compute Engine instances, collecting and forwarding metrics and logs to Google Cloud Monitoring and Logging. By installing Ops Agent on your instances, you gain real-time visibility into your Compute Engine’s performance and behavior.

To start monitoring your Compute Engine, you must install Ops Agent on your instances. The installation process is straightforward and can be done manually or through automation tools like Cloud Deployment Manager or Terraform. Once installed, the Ops Agent will automatically begin collecting metrics and logs from your Compute Engine.

Ops Agent allows you to customize the metrics and logs you want to monitor for your Compute Engine. Various options are available, allowing you to choose specific metrics and logs relevant to your application or system. By configuring metrics and logs, you can gain deeper insights and track the performance of critical components.

**Challenge: Fragmented Monitoring Tools**

Fragmented monitoring tools and manual analytics strategies challenge IT and security teams. The lack of a single source of truth and real-time insight makes it increasingly difficult for these teams to access the answers they need to accelerate innovation and optimize digital services. To gain insight, they must manually query data from various monitoring tools and piece together different sources of information.

This complex and time-consuming process distracts team members from driving innovation and creating new value for the business and customers. In addition, many teams monitor only their mission-critical applications due to the effort involved in managing all these tools, platforms, and dashboards. The result is a multitude of blind spots across the technology stack, which makes it harder for teams to gain insights.

**Challenge: Kubernetes is Complex**

Understanding how Kubernetes adds to the complexity of technology stacks is imperative. In the drive toward modern technology stacks, it is the platform of choice for organizations refactoring their applications for the cloud-native world. Through dynamic resource provisioning, Kubernetes architectures can quickly scale services to new users and increase efficiency.

However, the constant changes in cloud environments make it difficult for IT and security teams to maintain visibility into them. These teams cannot provide observability in their Kubernetes environments by manually configuring various traditional monitoring tools. The result is that they are often unable to gain real-time insights to improve user experience, optimize costs, and strengthen security. Due to this visibility challenge, many organizations are delaying moving more mission-critical services to Kubernetes.

GKE-Native Monitoring

The Basics of GKE-Native Monitoring

GKE-Native Monitoring is a comprehensive monitoring solution provided by Google Cloud Platform (GCP) designed explicitly for GKE clusters. It offers deep insights into your applications’ performance and behavior, allowing you to proactively detect and resolve issues. With GKE-Native Monitoring, you can easily collect and analyze metrics, monitor logs, and set up alerts to ensure the reliability and availability of your applications.

One of the critical features of GKE-Native Monitoring is its ability to collect and analyze metrics from your GKE clusters. It provides preconfigured dashboards that display essential metrics such as CPU usage, memory utilization, and network traffic. Additionally, you can create custom dashboards tailored to your specific requirements, allowing you to better understand your application’s performance and resource consumption.

The Role of Megatrends

A considerable drive in innovation has spawned several megatrends that have affected how we manage and view our network infrastructure and created the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of a monolith, where everything is generally housed internally, we must now manage and operate many services and dependencies in multiple locations, which is where microservices observability comes in. The megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

Shift in Control

There has also been a shift in the point of control. As we move towards new technologies, many of the loosely coupled services or infrastructures your services depend on are not under your control. The edge of control has been pushed out, creating different network and security perimeters. These perimeters are now closer to the workload than a central security stack. Therefore, the workloads themselves are concerned with security.

Example Product: Cisco AppDynamics

### What is Cisco AppDynamics?

Cisco AppDynamics is an application performance management (APM) solution designed to provide real-time visibility into the performance of your applications. It helps IT professionals identify bottlenecks, diagnose issues, and optimize performance, ensuring a seamless user experience. With its powerful analytics, you can gain deep insights into your application stack, from the user interface to the backend infrastructure.

### Key Features and Capabilities

#### Real-Time Monitoring

One of the standout features of Cisco AppDynamics is its ability to monitor applications in real-time. This allows IT teams to detect and resolve issues as they occur, minimizing downtime and ensuring a smooth user experience. Real-time monitoring covers everything from user interactions to server performance, providing a comprehensive view of your application’s health.

#### End-User Experience Monitoring

Understanding how users interact with your application is crucial for delivering a high-quality experience. Cisco AppDynamics offers end-user experience monitoring, which tracks user sessions and interactions. This data helps you identify any pain points or performance issues that may be affecting user satisfaction.

#### Business Transaction Monitoring

Cisco AppDynamics takes a unique approach to monitoring by focusing on business transactions. By tracking the performance of individual transactions, you can gain a clearer understanding of how different parts of your application are performing. This level of granularity allows for more targeted optimizations and quicker issue resolution.

### Benefits of Using Cisco AppDynamics

#### Improved Application Performance

With its comprehensive monitoring and diagnostic capabilities, Cisco AppDynamics helps you identify and resolve performance issues quickly. This leads to faster load times, fewer errors, and an overall improved user experience.

#### Enhanced Operational Efficiency

By automating many of the monitoring and diagnostic processes, Cisco AppDynamics reduces the workload on your IT team. This allows them to focus on more strategic initiatives, driving greater value for your business.

#### Better Decision Making

The insights provided by Cisco AppDynamics enable better decision-making at all levels of your organization. Whether you’re looking to optimize resource allocation or plan for future growth, the data and analytics provided can inform your strategies and drive better outcomes.

### Integrations and Flexibility

Cisco AppDynamics offers seamless integrations with a wide range of third-party tools and platforms. Whether you’re using cloud services like AWS and Azure or CI/CD tools like Jenkins and GitHub, AppDynamics can integrate into your existing workflows, providing a unified view of your application’s performance.

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have been asked of just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

**The Key Components of Observability**

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring: Monitoring involves continuously collecting and analyzing system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.

2. Logging: Logging involves recording events, activities, and errors within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.

3. Tracing: Tracing involves capturing the flow of requests and interactions between different distributed system components. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact.

**Benefits of Observability in Distributed Systems**

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting: Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization: By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management: Observability facilitates monitoring system changes and their impact on overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps maintain system stability and avoid unexpected issues.

What are VPC Flow Logs?

VPC Flow Logs is a feature offered by Google Cloud that captures and records network traffic information within Virtual Private Cloud (VPC) networks. Each network flow is logged, providing a comprehensive view of the traffic traversing the network. These logs include valuable information such as source and destination IP addresses, ports, protocol, and packet counts.

Once the VPC Flow Logs are enabled and data is being recorded, we can start leveraging the power of analysis. Google Cloud provides several tools and services for analyzing VPC Flow Logs. One such tool is BigQuery, a scalable and flexible data warehouse. By exporting VPC Flow Logs to BigQuery, we can perform complex queries, visualize traffic patterns, and detect anomalies using industry-standard SQL queries.
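As a hedged sketch of that BigQuery analysis, the snippet below uses the google-cloud-bigquery client to find the top talkers for the current day; the project, dataset, table, and field names are placeholders whose real values depend on how your flow-log export was configured.

```python
# Requires the google-cloud-bigquery package and application default credentials.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and field names; adjust to match your flow-log export sink.
query = """
SELECT
  jsonPayload.connection.src_ip AS src_ip,
  jsonPayload.connection.dest_ip AS dest_ip,
  SUM(CAST(jsonPayload.bytes_sent AS INT64)) AS total_bytes
FROM `my-project.flow_logs_dataset.compute_googleapis_com_vpc_flows`
WHERE DATE(timestamp) = CURRENT_DATE()
GROUP BY src_ip, dest_ip
ORDER BY total_bytes DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.src_ip, row.dest_ip, row.total_bytes)
```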

**How This Affects Failures**

The primary issue I have seen with my clients is that application failures are no longer predictable; dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have many partial failures that are not just unexpected but unknown, never seen before. Consider, for example, the network hero.

**The Network Hero**

The network hero is someone who knows every part of the network and has seen every failure at least once. That kind of heroism is no longer enough in today’s world; what we need instead is proper observability. When I was working as an engineer, we would have plenty of failures, but more than likely we had seen them before, and there was a system in place to fix the error. Today’s environment is much different.

We can no longer rely on simply seeing up or down, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer’s perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers’ perspective at what matters to them. Content delivery network (CDN) providers were among the first to realize this and measure what matters most to the customer.

Distributed Systems Observability

The different demands

So, the new, modern, and complex distributed systems place very different demands on your infrastructure and the people who manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and, therefore, slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is, therefore, unavailable.
    • The microservice could be fine, but there could be slow-running database queries.
    • So we have a lot of partial failures; one common symptom is illustrated in the sketch after this list.
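As a minimal client-side sketch of coping with these partial failures, the snippet below bounds every call with a timeout and retries a couple of times with backoff; the endpoint is hypothetical and the requests package is assumed to be available.

```python
import time
import requests  # assumes the requests package is installed

# Bound every call with a timeout and retry with simple backoff, so one slow
# or crashed microservice does not hang the caller indefinitely.
def call_with_retry(url: str, attempts: int = 3, timeout_s: float = 2.0):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout_s)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(0.2 * attempt)  # back off a little before the next try
    raise RuntimeError(f"service unavailable after {attempts} attempts") from last_error

# call_with_retry("http://inventory.internal/items/42")  # hypothetical endpoint
```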

Consequently, we can no longer predict

The significant shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and sound system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But, they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization and more than likely managed by other people trying to monitor a very dispersed application with multiple components and services in various places. 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes and on those failure modes being predictable. So, we set predictable thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems are either over or under the predictable thresholds that were previously set. Then, we can set alerts, and we hope that these alerts are actionable. This is only useful for variants of predictable failure modes.
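To make that limitation concrete, here is a minimal, hypothetical sketch of this style of monitoring: a fixed threshold, a polled metric, and an alert when the threshold is crossed. The metric source and notification hook are placeholders.

```python
# Minimal static-threshold sketch; the metric source and alert sink are placeholders.
CPU_THRESHOLD = 0.80  # "abnormal" is whatever we guessed in advance

def send_alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for a pager or chat notification

def check_cpu(current_utilization: float) -> None:
    if current_utilization > CPU_THRESHOLD:
        send_alert(f"CPU at {current_utilization:.0%} exceeds {CPU_THRESHOLD:.0%}")

check_cpu(0.91)  # fires an alert, whether or not any user is actually affected
```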

Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred, but they don’t let you dig into the source of the issue, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it harder to get to the root cause in a reasonable timeframe.

Google Cloud Trace

Example: Application Latency & Cloud Trace

Before we discuss Cloud Trace’s specifics, let’s establish a clear understanding of application latency. Latency refers to the time delay between a user’s action or request and the corresponding response from the application. It includes network latency, server processing time, and database query execution time. By comprehending the different factors contributing to latency, developers can proactively optimize their applications for improved performance.

Google Cloud Trace is a powerful diagnostic tool offered by Google Cloud Platform (GCP) that enables developers to identify and analyze application latency bottlenecks. It provides detailed insights into the flow of requests and events within an application, allowing developers to pinpoint areas of concern and optimize accordingly. Cloud Trace integrates seamlessly with other GCP services and provides a comprehensive view of latency across various components of an application stack.
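As a hedged illustration of how latency data might reach Cloud Trace, the sketch below uses OpenTelemetry with the GCP trace exporter; it assumes the opentelemetry-exporter-gcp-trace package and application default credentials, and package or class names may differ by version.

```python
# Requires opentelemetry-sdk and the opentelemetry-exporter-gcp-trace package,
# plus application default credentials for a GCP project.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("frontend")  # hypothetical service name

with tracer.start_as_current_span("handle_checkout"):
    with tracer.start_as_current_span("query_orders_db"):
        pass  # database call goes here; its latency appears as a child span in Cloud Trace
```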

Traditional style metrics systems

With traditional metrics systems, you had to define custom metrics, and they were always defined upfront. This approach prevents us from asking new questions about problems later; you have to know which questions to ask before the fact.

Then, we set performance thresholds, pronounce them “good” or “bad,” and check and re-check those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to merely predict how a system can fail. We should always observe, instead of waiting for a problem, such as a certain threshold being reached, before acting.

**Metrics: Lack of connective event**

The metrics did not retain the connective event, so you cannot ask new questions in the existing dataset. These traditional system metrics could miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is happening.

An example of this could be an odd number of running threads on one component, which might indicate garbage collection is in progress or that slow response times are imminent in an upstream service.

**User experience and static thresholds**

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in other ways, using various components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

We should have only a few alerts, triggered by symptoms that directly impact user experience, not simply because a threshold was reached.

The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in this regard. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static monitoring thresholds can’t reflect impacts on user experience. They lack context and are too coarse.

Required: Distributed Systems Observability

Systems observability and reliability in distributed systems are practices. Rather than just focusing on a tool that logs, measures, or alerts, observability is all about how you approach problems, and for this, you need to look at your culture. So you could say that observability is a cultural practice that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint and want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual) and the network, and what the data looks like both in transit and at rest.

Levels of Abstraction

What level of observation is needed to ensure everything performs as it should? What should you look at to obtain this level of detail?

Monitoring is knowing the data points and the entities from which we gather information. Observability, on the other hand, is putting all of that data together. So monitoring is collecting data, and observability is bringing it together in one single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is getting the data into the systems. A vital part of an observability toolkit is service level objectives (SLOs).
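As a small, illustrative example of what an SLO-driven view looks like (the figures are made up), the sketch below computes an availability error budget rather than a raw resource threshold:

```python
# Illustrative SLO and error-budget arithmetic; all figures are made up.
SLO_TARGET = 0.999            # 99.9% of requests should succeed
WINDOW_REQUESTS = 1_000_000   # requests observed in the rolling window

error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS   # 1,000 failed requests allowed
observed_failures = 350

remaining = error_budget - observed_failures
print(f"Error budget remaining: {remaining:.0f} failed requests "
      f"({remaining / error_budget:.0%} of budget left)")
```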

**Preference: Distributed Tracing**

We have three pillars of systems observability: metrics, traces, and logging. Defining or viewing observability as just having these pillars is an oversimplification, but you do need them in place. Observability is all about connecting the dots across each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in service request executions. As a result, it doesn’t matter if services have complex dependencies. You could say that the complexity of the Dynamic systems is abstracted with distributed tracing.

**Use Case: Challenges without tracing**

For example, latency can stack up if a downstream database service experiences performance bottlenecks, resulting in high end-to-end latency. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

**Distributed tracing: A winning formula**

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them challenging to debug unless their relationships are clearly understood.

In distributed systems, observability is vital in ensuring complex architectures’ stability, performance, and reliability. Monitoring, logging, and tracing provide engineers with the tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.

Summary: Distributed Systems Observability

In the vast landscape of distributed systems, observability is crucial in ensuring their reliable and efficient functioning. This blog post aims to delve into the critical components of distributed systems observability and shed light on their significance.

Telemetry

Telemetry forms the foundation of observability in distributed systems. It involves collecting, processing, and analyzing various metrics, logs, and traces. By monitoring and measuring these data points, developers gain valuable insights into the performance and behavior of their distributed systems.

Logging

Logging is an essential component of observability, providing a detailed record of events and activities within a distributed system. It captures important information such as errors, warnings, and informational messages, which aids in troubleshooting and debugging. Properly implemented logging mechanisms enable developers to identify and resolve issues promptly.

Metrics

Metrics are quantifiable measurements that provide a high-level view of the health and performance of a distributed system. They offer valuable insights into resource utilization, throughput, latency, error rates, and other critical indicators. By monitoring and analyzing metrics, developers can proactively identify bottlenecks, optimize performance, and ensure the smooth operation of their systems.

Tracing

Tracing allows developers to understand the flow and behavior of requests as they traverse through a distributed system. It provides detailed information about the path a request takes, including the various services and components it interacts with. Tracing is instrumental in diagnosing and resolving performance issues, as it highlights potential latency hotspots and bottlenecks.

Alerting and Visualization

Alerting mechanisms and visualization tools are vital for effective observability in distributed systems. Alerts notify developers when certain predefined thresholds or conditions are met, enabling them to take timely action. Visualization tools provide intuitive and comprehensive representations of system metrics, logs, and traces, making identifying patterns, trends, and anomalies easier.

Conclusion

In conclusion, the key components of distributed systems observability, namely telemetry, logging, metrics, tracing, alerting, and visualization, form a comprehensive toolkit for monitoring and understanding the intricacies of such systems. By leveraging these components effectively, developers can ensure their distributed systems’ reliability, performance, and scalability.

Reliability in Distributed Systems

Distributed systems have become an integral part of our modern technological landscape. Whether it's cloud computing, internet banking, or online shopping, these systems play a crucial role in providing seamless services to users worldwide. However, as distributed systems grow in complexity, ensuring their reliability becomes increasingly challenging.

In this blog post, we will explore the concept of reliability in distributed systems and discuss various techniques to achieve fault-tolerant operations.

Reliability in distributed systems refers to the ability of the system to consistently function as intended, even in the presence of hardware failures, network partitions, and other unforeseen events. To achieve reliability, system designers employ various techniques, such as redundancy, replication, and fault tolerance, to minimize the impact of failures and ensure continuous service availability.

Highlights: Reliability in Distributed Systems

Understanding Distributed Systems

At their core, distributed systems consist of multiple interconnected nodes working together to achieve a common goal. These nodes can be geographically dispersed and communicate through various protocols. Understanding the structure and behavior of distributed systems is crucial before exploring reliability measures.

To grasp the inner workings of distributed systems, it’s essential to familiarize ourselves with their key components. These include communication protocols, consensus algorithms, fault tolerance mechanisms, and distributed data storage. Each component plays a crucial role in ensuring the reliability and efficiency of distributed systems.

– Challenges and Risks: Reliability in distributed systems faces several challenges and risks due to their inherent nature. Network failures, node crashes, message delays, and data inconsistency are common issues compromising system reliability. Furthermore, the complexity of these systems amplifies the difficulty of diagnosing and resolving failures promptly.

– Replication and Redundancy: To mitigate the risks associated with distributed systems, replication and redundancy techniques are employed. Replicating data and functionalities across multiple nodes ensures fault tolerance and enhances reliability. The system can continue to operate with redundant components even if specific nodes fail.

– Consistency and Coordination: Maintaining data consistency is crucial in distributed systems. Distributed consensus protocols, such as the Paxos or Raft consensus algorithms, ensure that all nodes agree on the same state despite failures or network partitions. Coordinating actions among distributed nodes is essential to prevent conflicts and ensure reliable system behavior.

– Monitoring and Failure Detection: Continuous monitoring and failure detection mechanisms are essential for identifying and resolving issues promptly. Various monitoring tools and techniques, such as heartbeat protocols and health checks, can help detect failures and initiate recovery processes. Proactive monitoring and regular maintenance significantly contribute to the overall reliability of distributed systems.

**Shift in Landscape**

When considering reliability in a distributed system, considerable shifts in the technology landscape have forced us to examine how we operate and run our systems and networks. The introduction of various cloud platforms, their services, and containers has been a major change.

In addition, the complexity of managing distributed systems observability and microservices observability has unveiled significant gaps in our current technologies, not to mention the flaws in the operational practices around them.

Managed Instance Groups (MIGs)

 

**Understanding the Basics of Managed Instance Groups**

Managed Instance Groups are collections of virtual machine (VM) instances that are treated as a single entity. They are designed to simplify the management of multiple instances by automating tasks like scaling, updating, and load balancing. With MIGs, you can ensure that your application has the right number of instances running at any given time, responding dynamically to changes in demand.

Google Cloud’s MIGs make it easy to deploy applications with high availability and reliability. By using templates, you can define the configuration for all instances in the group, ensuring consistency and reducing the potential for human error.
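As a hedged sketch of how a MIG might be created programmatically, the snippet below uses the google-cloud-compute Python client; the project, zone, and template names are placeholders, and the exact client surface can vary by library version.

```python
# Requires the google-cloud-compute package and application default credentials.
from google.cloud import compute_v1

def create_mig(project: str, zone: str) -> None:
    # Placeholder names; the instance template must already exist in the project.
    mig = compute_v1.InstanceGroupManager(
        name="web-mig",
        base_instance_name="web",
        instance_template=f"projects/{project}/global/instanceTemplates/web-template",
        target_size=3,  # the group keeps three instances running and replaces failures
    )
    client = compute_v1.InstanceGroupManagersClient()
    operation = client.insert(
        project=project, zone=zone, instance_group_manager_resource=mig
    )
    print("MIG creation requested:", operation)

# create_mig("my-project", "us-central1-a")
```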

**Ensuring Reliability in Distributed Systems**

Reliability is a critical component of any distributed system, and Managed Instance Groups play a significant role in achieving it. By distributing workloads across multiple instances and regions, MIGs help prevent single points of failure. If an instance fails, the group automatically replaces it with a new one, minimizing downtime and ensuring continuous service availability.

Moreover, Google Cloud’s infrastructure ensures that your instances are backed by a robust network, providing low-latency access and high-speed connectivity. This further enhances the reliability of your applications and services, giving you peace of mind as you scale.

**Scaling with Ease and Flexibility**

One of the standout features of Managed Instance Groups is their ability to scale quickly and efficiently. Whether you’re dealing with sudden spikes in traffic or planning for steady growth, MIGs offer flexible scaling policies to meet your needs. You can scale based on CPU utilization, load balancing capacity, or even custom metrics, allowing for precise control over your application’s performance.

Google Cloud’s autoscaling capabilities mean you only pay for the resources you use, optimizing cost-efficiency while maintaining high performance. This flexibility makes MIGs an ideal choice for businesses looking to grow their cloud infrastructure without unnecessary expenditure.

**Integrating with Google Cloud Ecosystem**

Managed Instance Groups seamlessly integrate with other Google Cloud services, providing a cohesive ecosystem for your applications. They work in harmony with Cloud Load Balancing to distribute traffic efficiently and with Stackdriver for monitoring and logging, giving you comprehensive insights into your application’s performance.

By leveraging Google Cloud’s extensive suite of tools, you can build, deploy, and manage applications with greater agility and confidence. This integration streamlines operations and simplifies the complexities of managing a distributed system.


Example Product: Cisco AppDynamics

### What is Cisco AppDynamics?

Cisco AppDynamics is an application performance management (APM) solution that offers deep insights into your application’s performance, user experience, and business impact. It helps IT teams detect, diagnose, and resolve issues quickly, ensuring a seamless digital experience for end-users. By leveraging machine learning and artificial intelligence, AppDynamics provides actionable insights to optimize your applications.

### Key Features of Cisco AppDynamics

#### Real-Time Performance Monitoring

With Cisco AppDynamics, you can monitor the performance of your applications in real-time. This feature allows you to detect anomalies and performance issues as they happen, ensuring that you can address them before they impact your users.

#### End-User Monitoring

Understanding how your users interact with your applications is crucial. AppDynamics offers end-user monitoring, which provides visibility into the user journey, from the front-end user interface to the back-end services. This helps you identify and resolve issues that directly affect user experience.

#### Business Transaction Monitoring

AppDynamics breaks down your application into business transactions, which are critical user interactions within your application. By monitoring these transactions, you can gain insights into how your application supports key business processes and identify areas for improvement.

#### AI-Powered Analytics

The platform’s AI-powered analytics enable you to predict and prevent performance issues before they occur. By analyzing historical data and identifying patterns, AppDynamics helps you proactively manage your application’s performance.

### Benefits of Using Cisco AppDynamics

#### Improved Application Performance

By continuously monitoring your application’s performance, AppDynamics helps you identify and resolve issues quickly, ensuring that your application runs smoothly and efficiently.

#### Enhanced User Experience

With end-user monitoring, you can gain insights into how users interact with your application and address any issues that may affect their experience. This leads to increased user satisfaction and retention.

#### Better Business Insights

Business transaction monitoring provides a clear understanding of how your application supports critical business processes. This helps you make data-driven decisions to optimize your application and drive business growth.

Monitoring GKE Environment

The Significance of Monitoring in GKE

Monitoring in GKE goes beyond simply monitoring resource utilization. It provides valuable insights into the health and performance of your Kubernetes clusters, nodes, and applications running within them. You gain a comprehensive understanding of your system’s behavior by closely monitoring key metrics such as CPU usage, memory utilization, and network traffic. You can proactively address issues before they escalate.

GKE-Native Monitoring has many powerful features that simplify the monitoring process. One notable feature is its integration with Stackdriver, Google Cloud’s monitoring and observability platform.

With this integration, you can access a rich set of monitoring tools, including customizable dashboards, alerts, and logging capabilities, all designed explicitly for GKE deployments. Additionally, GKE-Native Monitoring seamlessly integrates with other Google Cloud services, enabling you to leverage advanced analytics and machine learning capabilities.

Challenges to Gaining Reliability

**Existing Static Tools**

There has been a welcome drive of innovation toward system reliability, but it has also exposed a problem: some of the technologies and tools used to manage these innovations do not keep pace with them. Many of these tools have stayed relatively static in our dynamic environment. So, we have static tools used in a dynamic environment, which creates friction for reliability in distributed systems and drives the need for more efficient network visibility.

**Understanding the Complexity**

Distributed systems are inherently complex, with multiple components across different machines or networks. This complexity introduces challenges like network latency, hardware failures, and communication bottlenecks. Understanding the intricate nature of distributed systems is crucial to devising reliable solutions.

Gaining Reliability

**Redundancy and Replication**

One critical approach to enhancing reliability in distributed systems is redundancy and replication. By duplicating critical components or data across multiple nodes, the system becomes more fault-tolerant. This ensures the system can function seamlessly even if one component fails, minimizing the risk of complete failure.

**Consistency and Consensus Algorithms**

Maintaining consistency in distributed systems is a significant challenge due to the possibility of concurrent updates and network delays. Consensus algorithms, such as the Paxos or Raft algorithms, are vital in achieving consistency by ensuring agreement among distributed nodes. These algorithms enable reliable decision-making and guarantee that all nodes reach a consensus state.

**Monitoring and Failure Detection**

To ensure reliability, robust monitoring mechanisms are essential. Monitoring tools can track system performance, resource utilization, and network health. Additionally, implementing efficient failure detection mechanisms allows for prompt identification of faulty components, enabling proactive measures to mitigate their impact on the overall system.
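As a minimal illustration of the heartbeat-style failure detection mentioned above, the sketch below flags any node that has not reported within a fixed timeout; the node IDs and timeout value are illustrative.

```python
import time

# Heartbeat failure detector sketch: each node reports periodically, and any
# node silent for longer than the timeout is flagged as suspected down.
HEARTBEAT_TIMEOUT_S = 5.0
last_seen = {}  # node_id -> timestamp of the most recent heartbeat

def record_heartbeat(node_id: str) -> None:
    last_seen[node_id] = time.monotonic()

def suspected_failures() -> list:
    now = time.monotonic()
    return [node for node, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT_S]

record_heartbeat("node-a")
record_heartbeat("node-b")
time.sleep(0.1)
print(suspected_failures())  # [] while both nodes are still reporting
```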

**Load Balancing and Scalability**

Load balancing is crucial in distributing the workload evenly across nodes in a distributed system. It ensures that no single node is overwhelmed, reducing the risk of system instability. Furthermore, designing systems with scalability in mind allows for seamless expansion as the workload grows, ensuring that reliability is maintained even during periods of high demand.

Required: Distributed Tracing

Using distributed tracing, you can profile or monitor the results of requests across a distributed system. Distributed systems can be challenging to monitor since each node generates its logs and metrics. To get a complete view of a distributed system, it is necessary to aggregate these separate node metrics holistically. 

A request to a distributed system generally doesn’t touch its entire set of nodes but rather a path through those nodes. With distributed tracing, teams can analyze and monitor commonly accessed paths through a distributed system. Distributed tracing instrumentation is installed on each system node, allowing teams to query the system for information on node health and performance.

Benefits: Distributed Tracing

Despite the challenges, distributed systems offer a wide array of benefits. One notable advantage is enhanced fault tolerance. Distributing tasks and data across multiple nodes improves system reliability, as a single point of failure does not bring down the entire system.

Additionally, distributed systems enable improved scalability, accommodating growing demands by adding more nodes to the network. The applications of distributed systems are vast, ranging from cloud computing and large-scale data processing to peer-to-peer networks and distributed databases.

Google Cloud Trace

Understanding Cloud Trace

Cloud Trace, an integral part of Google Cloud’s observability offerings, provides developers with a detailed view of their application’s performance. It enables tracing and analysis of requests as they flow through different components of a distributed system. By visualizing the latency, bottlenecks, and dependencies, Cloud Trace empowers developers to optimize their applications for better performance and user experience.

Cloud Trace offers a range of features to simplify the monitoring and troubleshooting process. With its distributed tracing capabilities, developers can gain insights into how requests traverse various services, identify latency issues, and pinpoint the root causes of performance bottlenecks. The integration with Google Cloud’s ecosystem allows seamless correlation between traces and other monitoring data, enabling a comprehensive view of application health.

Improve Performance & Reliability 

By leveraging Cloud Trace, developers can significantly improve the performance and reliability of their applications. The ability to pinpoint and resolve performance issues quickly translates into enhanced user satisfaction and higher productivity. Moreover, Cloud Trace enables proactive monitoring, ensuring that potential bottlenecks and inefficiencies are identified before they impact end-users. For organizations, this translates into cost savings, improved scalability, and better resource utilization.

Adopting Cloud Trace

Cloud Trace has been adopted by numerous organizations across various industries, with remarkable outcomes. From optimizing the response time of e-commerce platforms to enhancing the efficiency of complex microservices architectures, Cloud Trace has proven its worth in diagnosing performance issues and driving continuous improvement. The real-time visibility provided by Cloud Trace empowers organizations to make data-driven decisions and deliver exceptional user experiences.

Related: Before you proceed, you may find the following posts helpful:

  1. Distributed Firewalls
  2. SD WAN Static Network Based

Reliability in Distributed Systems

Adopting Distributed Systems

Distributed systems refer to a network of interconnected computers that communicate and coordinate their actions to achieve a common goal. Unlike traditional centralized systems, where a single entity controls all components, distributed systems distribute tasks and data across multiple nodes. This decentralized approach enables enhanced scalability, fault tolerance, and resource utilization.

**Key Components of Distributed Systems**

To comprehend the inner workings of distributed systems, we must familiarize ourselves with their key components. These components include nodes, communication channels, protocols, and distributed file systems. Nodes represent individual machines or devices within the network; communication channels facilitate data transmission; protocols ensure reliable communication; and distributed file systems enable data storage across multiple nodes.

Distributed vs. centralized

**Distributed Systems Use Cases**

Many modern applications use distributed systems, including mobile and web applications with high traffic. Web browsers or mobile applications serve as clients in a client-server environment, and the server becomes its own distributed system. The modern web server follows a multi-tier system pattern. Requests are delegated to several server logic nodes via a load balancer.

Kubernetes is popular among distributed systems since it enables containers to be combined into a distributed system. Kubernetes orchestrates network communication between the distributed system nodes and handles dynamic horizontal and vertical scaling of the nodes. 

Cryptocurrencies like Bitcoin and Ethereum are also peer-to-peer distributed systems. The currency ledger is replicated at every node in a cryptocurrency network. To bootstrap, a currency node connects to other nodes and downloads its full ledger copy. Additionally, cryptocurrency wallets use JSON RPC to communicate with the ledger nodes.
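As a hedged illustration of that JSON-RPC pattern, the sketch below asks a local ledger node for its current block height; the endpoint, credentials, and method name follow Bitcoin Core conventions and are placeholders for whichever node you actually run.

```python
import requests  # assumes the requests package is installed

# JSON-RPC call to a local ledger node; endpoint, credentials, and method name
# are placeholders following Bitcoin Core conventions.
payload = {"jsonrpc": "1.0", "id": "observe", "method": "getblockcount", "params": []}
response = requests.post(
    "http://127.0.0.1:8332",
    json=payload,
    auth=("rpcuser", "rpcpassword"),
    timeout=5,
)
print(response.json()["result"])  # current chain height according to this node
```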

Challenges in Distributed Systems

While distributed systems offer numerous advantages, they also pose various challenges. One significant challenge is achieving consensus among distributed nodes. Ensuring that all nodes agree on a particular value or decision can be complex, especially in the presence of failures or network partitions. Additionally, maintaining data consistency across distributed nodes and mitigating issues related to concurrency control requires careful design and implementation.

**Example: Distributed System of Microservices**

Microservices are one type of distributed system since they decompose an application into individual components. A microservice architecture, for example, may have services corresponding to business features (payments, users, products, etc.), with each element handling the corresponding business logic. Multiple redundant copies of the services will then be available, so there is no single point of failure.

**Distributed Systems: The Challenge**

Distributed systems are required to implement the reliability, agility, and scale expected of modern computer programs. Distributed systems are applications composed of many different components running on many different machines. Containers are the foundational building block, and groups of containers co-located on a single device comprise the atomic elements of distributed system patterns.

The significant shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. We need to consider new practices and technologies, with dedicated platform teams, to enable a new era of system reliability in distributed systems, along with the practices of observability, which are a step up from the traditional monitoring of static infrastructure (see Observability vs Monitoring).

Knowledge Check: Distributed Systems Architecture

  • Client-Server Architecture

A client-server architecture has two primary responsibilities. The client presents user interfaces and is connected to the server via a network. The server handles business logic and state management. Unless the server is redundant, a client-server architecture can quickly degrade into a centralized architecture. A truly distributed client-server setup will consist of multiple server nodes that distribute client connections. In modern client-server architectures, clients connect to encapsulated distributed systems on the server.

  • Multi-tier Architecture

Multi-tier architectures are extensions of client-server architectures. Multi-tier architectures decompose servers into further granular nodes, decoupling additional backend server responsibilities like data processing and management. By processing long-running jobs asynchronously, these additional nodes free up the remaining backend nodes to focus on responding to client requests and interacting with the data store.

  • Peer-to-Peer Architecture

Peer-to-peer distributed systems contain complete instances of applications on each node. There is no separation between presentation and data processing at the node level. A node consists of a presentation layer and a data handling layer. Peer nodes may contain the entire system’s state data. 

Peer-to-peer systems have a great deal of redundancy. When initiated and brought online, peer-to-peer nodes discover and connect to other peers, thereby synchronizing their local state with the system’s. As a result, the failure of one node won’t disrupt the others, and the peer-to-peer system as a whole will persist.

  • Service-oriented Architecture

A service-oriented architecture (SOA) is a precursor to microservices. Microservices differ from SOA primarily in their node scope, which is at the feature level. Each microservice node encapsulates a specific set of business logic, such as payment processing, and interfaces with its own independent database. In contrast, SOA nodes encapsulate an entire application or enterprise division, and database systems are typically included within the service boundary of SOA nodes.

Because of these benefits, microservices have become more popular than SOA. Their small service nodes provide functionality that teams can reuse across the system. The advantages of microservices include greater robustness and a greater ability to scale vertically and horizontally on demand.

Reliability in Distributed Systems: Components

A – Redundancy and Replication:

Redundancy and replication are two fundamental concepts distributed systems use to enhance reliability. Redundancy involves duplicating critical system components, such as servers, storage devices, or network links, so the redundant component can seamlessly take over if one fails. Replication, on the other hand, involves creating multiple copies of data across different nodes in a system, enabling efficient data access and fault tolerance. By incorporating redundancy and replication, distributed systems can continue to operate even when individual components fail.
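To make the idea concrete, here is a minimal Python sketch of a replicated write, assuming a hypothetical set of replica nodes and a stand-in for the network call; a write is accepted only when a majority of replicas acknowledge it.

```python
import concurrent.futures

# Hypothetical replica endpoints; in a real system these would be
# the network addresses of storage nodes.
REPLICAS = ["node-a", "node-b", "node-c"]

def write_to_replica(node: str, key: str, value: str) -> bool:
    """Stand-in for a network call that sends the write to one replica."""
    print(f"replicating {key}={value} to {node}")
    return True  # a real call may time out or fail

def replicated_write(key: str, value: str) -> bool:
    """Send the write to all replicas in parallel; succeed on a majority of acks."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        acks = list(pool.map(lambda n: write_to_replica(n, key, value), REPLICAS))
    return sum(acks) > len(REPLICAS) // 2  # majority quorum

if __name__ == "__main__":
    print("write accepted:", replicated_write("balance:42", "100"))
```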

B – Fault Tolerance:

Fault tolerance is a crucial aspect of achieving reliability in distributed systems. It involves designing systems to operate correctly even when one or more components encounter failures. Several techniques, such as error detection, recovery, and prevention mechanisms, are employed to achieve fault tolerance.

C – Error Detection:

Error detection techniques, such as checksums, hashing, and cyclic redundancy checks (CRC), identify errors or data corruption during transmission or storage. By verifying data integrity, these techniques help identify and mitigate potential failures in distributed systems.
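As a rough illustration, the following Python snippet uses a CRC32 checksum and a SHA-256 hash to verify that a payload has not been corrupted; the payload and the comparison flow are illustrative only.

```python
import hashlib
import zlib

def crc32_of(data: bytes) -> int:
    """Cyclic redundancy check: cheap to compute, catches accidental corruption."""
    return zlib.crc32(data)

def sha256_of(data: bytes) -> str:
    """Cryptographic hash: stronger integrity guarantee at a higher cost."""
    return hashlib.sha256(data).hexdigest()

payload = b'{"order_id": 1234, "amount": 99.95}'
sent_crc = crc32_of(payload)

# ... payload travels over the network or sits on disk ...
received = payload  # swap in the bytes actually received
if crc32_of(received) != sent_crc:
    raise ValueError("payload corrupted in transit")
print("integrity check passed:", sha256_of(received))
```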

D – Error Recovery:

Error recovery mechanisms, such as checkpointing and rollback recovery, aim to restore the system to a consistent state after a failure. Checkpointing involves periodically saving the system’s state and data, allowing recovery to a previously known good state in case of failures. On the other hand, rollback recovery involves undoing the effects of failed operations and returning the system to a consistent state.
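Below is a minimal checkpointing sketch in Python, assuming a hypothetical checkpoint file and a simple loop of work; the state is written atomically so a crash never leaves a half-written checkpoint, and recovery resumes from the last saved state.

```python
import json
import os

CHECKPOINT_FILE = "state.checkpoint.json"  # hypothetical location

def save_checkpoint(state: dict) -> None:
    """Write the state to a temp file, then atomically rename it into place."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: no half-written checkpoint

def load_checkpoint() -> dict:
    """Recover the last known-good state after a crash, or start fresh."""
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"processed": 0}

state = load_checkpoint()
for item in range(state["processed"], 10):
    state["processed"] = item + 1   # do some work
    save_checkpoint(state)          # periodically persist progress
```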

E – Error Prevention:

Distributed systems employ error prevention techniques, such as redundancy elimination, consensus algorithms, and load balancing, to enhance reliability. Redundancy elimination reduces unnecessary duplication of data or computation, thereby reducing the chances of errors. Consensus algorithms ensure that all nodes in a distributed system agree on a shared state despite failures or message delays. Load balancing techniques distribute computational tasks evenly across multiple nodes to prevent overloading and potential failures.

Challenges: Traditional Monitoring

**Lack of Connected Events**

If you examine traditional monitoring systems, they capture and investigate signals in isolation. They work in a siloed environment, much as developers and operators did before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” that are common in modern distributed systems, which often leads to service disruptions. So, you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today lack predictability, certainly not enough predictability to rely on static thresholds, alerts, and old monitoring tools. If a condition is fixed and known in advance, it can be automated, and we have static events for it, such as a Kubernetes Pod reaching a resource limit.

Then, a ReplicaSet brings up another Pod on a different node if specific parameters, such as Kubernetes labels and node selectors, are met. However, this is only a tiny piece of the failure puzzle in a distributed environment. Today, we have what are known as partial failures, and systems that fail in very creative ways.
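For contrast, here is the kind of static-threshold rule traditional monitoring relies on, sketched in Python with made-up metric values; it handles the failure mode we predicted and says nothing about the ones we didn’t.

```python
# Hypothetical metrics; in practice these would come from a metrics backend.
node_metrics = {"node-1": 0.62, "node-2": 0.91, "node-3": 0.45}  # CPU utilisation

CPU_THRESHOLD = 0.80  # a static, known-in-advance limit

for node, cpu in node_metrics.items():
    if cpu > CPU_THRESHOLD:
        # Covers the failure mode we predicted; says nothing about the ones we didn't.
        print(f"ALERT: {node} above {CPU_THRESHOLD:.0%} CPU, reschedule Pods elsewhere")
```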

Reliability in Distributed Systems: Creative ways to fail

So, we know that some of these failures can be quickly predicted and acted upon. For example, if a Kubernetes node reaches a specific utilization, we can automatically reschedule Pods onto a different node to stay within our known scale limits.

Predictable failures can be automated in Kubernetes, and with any infrastructure; an Ansible script is useful when these events occur. However, we have much more to deal with than Pod scaling; we have many partial and complicated failures, known as black holes.

**In today’s world of partial failures**

Microservices applications are distributed and susceptible to many external factors. By contrast, if you examine the traditional monolithic application style, all the functions reside in the same process. The application was either switched on or off; not much happened in between. So, if there was a failure in that process, the application as a whole failed. The results were binary: either up or down.

This was easy to detect with some basic monitoring, and failures were predictable. There was no such thing as a partial failure: because all application functions live within the same process, a significant benefit of monoliths is that you don’t have partial failures.

However, in a cloud-native world, where we have broken the old monolith into a microservices-based application, a client request can go through multiple hops of microservices, and we can have several problems to deal with.

There is a lack of connectivity between the different domains. Monitoring tools and knowledge are often tied to each individual domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction, which is the critical metric to care about.

**System reliability: Today, you have no way to predict**

So, modern, complex distributed systems place very different demands on your infrastructure, considerably different from the simple three-tier application where everything is generally housed in one location. We can no longer predict everything, which breaks traditional monitoring approaches.

When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcome strategy.

**Blackholes: Strange failure modes**

When considering a distributed system, many things can happen. A service or an entire region can disappear, or it can vanish for a few seconds or milliseconds and then reappear. When we see these strange failure modes, we say a request has gone into a black hole: whatever goes in simply disappears. Peculiar failure modes like these are unexpected and surprising.

Strange failure modes are, by definition, unpredictable. So, what happens when your banking transactions go into a black hole? What if your bank balance is displayed incorrectly, or you make a transfer to an external account and it never shows up?

Site Reliability Engineering (SRE) and Observability

Site reliability engineering (SRE) and observability practices are needed to manage these types of unpredictable and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems.

You need to be proactive and fix these issues up front; you need to get ahead of the curve to prevent incidents from occurring. In practice, this shift usually happens in the wake of a massive incident, which acts as a teachable moment and gives teams the impetus to take on a chaos engineering project.

New tools and technologies:

1 – Distributed tracing

We have new tools, such as distributed tracing. So, what is the best way to find the bottleneck if the system becomes slow? Here, you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system and figure out where the time has been spent, and it can be used across a distributed microservice architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
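As a rough sketch, the snippet below instruments a hypothetical checkout flow with the OpenTelemetry Python SDK and prints finished spans to the console; a real deployment would export spans to a collector and a tracing backend such as Jaeger, and the service and span names here are purely illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to stdout; a real deployment
# would export to an OpenTelemetry Collector and on to a tracing backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment service; its spans join the same trace
        with tracer.start_as_current_span("reserve_inventory"):
            pass  # call the inventory service

handle_checkout("order-1234")
```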

2 – SLA, SLI, SLO, and Error Budgets

We don’t just want to know that something has happened and then react to the event; that view ignores the customer’s perspective. We need to understand whether we are meeting our SLA by gathering the number and frequency of outages and any performance issues.

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) not only help you with these measurements but also offer a tool for achieving better reliability, forming the base of the reliability stack.
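To show how an error budget falls out of an SLO, here is a small back-of-the-envelope calculation in Python; the SLO target and request counts are made-up numbers.

```python
# A minimal error-budget calculation for a hypothetical availability SLO.
SLO = 0.999                      # target: 99.9% of requests succeed this month
total_requests = 4_200_000       # measured over the SLO window
failed_requests = 2_900          # measured failures

sli = 1 - failed_requests / total_requests   # what actually happened
error_budget = (1 - SLO) * total_requests    # failures we are allowed
budget_remaining = error_budget - failed_requests

print(f"SLI: {sli:.4%}  (target {SLO:.1%})")
print(f"Error budget: {error_budget:.0f} failed requests allowed")
print(f"Remaining budget: {budget_remaining:.0f} requests")
```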

Summary: Reliability in Distributed Systems

In modern technology, distributed systems have become the backbone of numerous applications and services. These systems, consisting of interconnected nodes, provide scalability, fault tolerance, and improved performance. However, maintaining reliability in such distributed environments is a challenging endeavor. This blog post explored the key aspects and strategies for ensuring reliability in distributed systems.

Understanding the Challenges

Distributed systems face a myriad of challenges that can impact their reliability. These challenges include network failures, node failures, message delays, and data inconsistencies. These aspects can introduce vulnerabilities that may disrupt system operations and compromise reliability.

Replication for Resilience

One fundamental technique for enhancing reliability in distributed systems is data replication. By replicating data across multiple nodes, system resilience is improved. Replication increases fault tolerance and enables load balancing and localized data access. However, managing consistency and synchronization among the replicated copies is crucial to maintaining reliability.

Consensus Protocols

Consensus protocols play a vital role in achieving reliability in distributed systems. These protocols enable nodes to agree on a shared state despite failures or network partitions. Popular consensus algorithms such as Paxos and Raft ensure that distributed nodes reach a consensus, making them resilient against failures and maintaining system reliability.
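The sketch below is not Paxos or Raft themselves, but it illustrates the majority-quorum rule they build on: a value is decided only when a strict majority of the cluster agrees, so a minority partition can never make progress on its own.

```python
from collections import Counter

def majority_value(votes: dict[str, str], cluster_size: int) -> str | None:
    """Return a value only if a strict majority of the cluster agrees on it."""
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if count > cluster_size // 2 else None

# Five-node cluster; two nodes are partitioned away and never respond.
votes = {"node-1": "commit", "node-2": "commit", "node-3": "commit"}
print(majority_value(votes, cluster_size=5))    # "commit": 3 of 5 agree
print(majority_value({"node-1": "commit"}, 5))  # None: no quorum, decision stalls
```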

Fault Detection and Recovery

Detecting faults in a distributed system is crucial for maintaining reliability. Techniques like heartbeat monitoring, failure detectors, and health checks aid in identifying faulty nodes or network failures. Once a fault is detected, recovery mechanisms such as automatic restarts, replica synchronization, or reconfigurations can be employed to restore system reliability.
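Here is a minimal heartbeat-based failure detector in Python, with a hypothetical timeout; nodes that stay silent longer than the timeout are flagged as suspected failures, at which point recovery actions can be triggered.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is suspected

last_heartbeat: dict[str, float] = {}   # node -> timestamp of last heartbeat

def record_heartbeat(node: str) -> None:
    """Called whenever a heartbeat message arrives from a node."""
    last_heartbeat[node] = time.monotonic()

def suspected_failures() -> list[str]:
    """Nodes that have gone quiet for longer than the timeout."""
    now = time.monotonic()
    return [n for n, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT]

record_heartbeat("node-a")
record_heartbeat("node-b")
time.sleep(0.1)
print("suspected:", suspected_failures())  # trigger restarts or reconfiguration here
```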

Load Balancing and Scalability

Load balancing and scalability can also enhance reliability in distributed systems. By evenly distributing the workload among nodes and dynamically scaling resources, the system can handle varying demands and prevent bottlenecks. Load-balancing algorithms and auto-scaling mechanisms contribute to overall system reliability.
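As a simple illustration, the following Python sketch implements round-robin load balancing over a hypothetical pool of backends, cycling requests evenly across nodes.

```python
import itertools

class RoundRobinBalancer:
    """Spread incoming requests evenly across a pool of backend nodes."""

    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def next_backend(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # hypothetical nodes
for _ in range(6):
    print("route request to", lb.next_backend())
```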

Conclusion:

In the world of distributed systems, reliability is a paramount concern. By understanding the challenges, employing replication techniques, utilizing consensus protocols, implementing fault detection and recovery mechanisms, and focusing on load balancing and scalability, we can embark on a journey of resilience. Reliability in distributed systems requires careful planning, robust architectures, and continuous monitoring. By addressing these aspects, we can build distributed systems that are truly reliable, empowering businesses and users alike.