Prometheus Monitoring

Prometheus Monitoring: The Pull Approach


In the world of monitoring and observability, Prometheus has emerged as a powerful tool for collecting and analyzing metrics. One of the key aspects of Prometheus is its unique approach to data collection, known as the pull approach. In this blog post, we will explore what the pull approach entails and why it has gained popularity among DevOps engineers and operators.

Prometheus is an open-source monitoring system that was developed at SoundCloud. It is designed to monitor highly dynamic containerized environments and provides a flexible and scalable solution for collecting time-series data. By using a multi-dimensional data model and a powerful query language, Prometheus allows users to gain deep insights into the performance and health of their systems.

Prometheus Monitoring offers a plethora of features that make it a preferred choice for monitoring modern systems. Its powerful query language, PromQL, allows users to slice and dice metrics, create alerts, and build custom dashboards. The multi-dimensional data model provides flexibility in organizing and querying metrics. Additionally, Prometheus' alerting system enables proactive identification of anomalies, helping organizations mitigate potential issues before they impact end-users.

Prometheus Server: The backbone of the Prometheus ecosystem is the Prometheus Server. This component is responsible for collecting time-series data, processing it, and storing it in a highly efficient and scalable manner. It leverages a pull-based model, periodically scraping metrics from configured targets. With its built-in data storage and querying capabilities, the Prometheus Server acts as a central hub for collecting and storing metrics from various sources.

Exporters: To monitor different types of systems and applications, Prometheus relies on exporters. Exporters are responsible for converting metrics from various formats into Prometheus-compatible data. Whether it's monitoring a database, a web server, or a cloud service, exporters provide the necessary bridges to gather metrics from these systems and make them available to the Prometheus Server. Popular examples include the Node Exporter, Blackbox Exporter, and Prometheus Pushgateway.

Alertmanager: Effective alerting is a crucial aspect of any monitoring system. Prometheus achieves this through its Alertmanager component. Alertmanager receives alerts from the Prometheus Server and applies various routing and grouping rules to ensure that the right people are notified at the right time. It supports multiple notification channels, such as email, PagerDuty, and Slack, making it highly flexible for integrating with existing incident management workflows.

Grafana Integration: Prometheus's power is further enhanced by its integration with Grafana, a popular open-source visualization tool. Grafana allows users to create stunning dashboards and visualizations using data from Prometheus. With its vast array of panels and plugins, Grafana enables users to gain deep insights into their monitoring data and build custom monitoring views tailored to their specific needs.

Highlights: Prometheus Monitoring

Created by SoundCloud

In this post, I would like to discuss Prometheus monitoring, its pull-based approach (Prometheus Pull) to metric collection, and the Prometheus metric types. Prometheus is a powerful open-source monitoring system created by SoundCloud to monitor and alert on the infrastructure and applications within their environment. It has since become one of the most popular monitoring systems due to its ability to monitor various services, from simple applications to complex distributed systems.

Pull-based System:

– Prometheus is designed to be simple to use. It uses a pull-based system, meaning it collects data from the services it monitors rather than having the services push the data to Prometheus. This makes it easy to set up and configure, allowing for great flexibility in the services it can monitor. It also has an intuitive user interface, making it easy to use and navigate.

– Understanding Prometheus’s architecture is crucial for effectively deploying and leveraging its capabilities. At its core, Prometheus consists of a server responsible for data collection, a time-series database for storing metrics, and a user interface for visualization and querying. The server scrapes metrics from various targets using exporters or service discovery, while the time-series database facilitates efficient storage and retrieval of collected data.

– Prometheus Monitoring integrates with many systems, making it highly versatile in diverse environments. It provides exporters for popular technologies like Kubernetes, Docker, and AWS, allowing easy collection of relevant metrics. Moreover, Prometheus can be integrated with other monitoring tools, such as Grafana, to create comprehensive dashboards and visualizations.

Key Features of Prometheus Monitoring:

1. Time-Series Database: Prometheus utilizes its own time-series database, tailored explicitly for efficiently storing and querying time-series data. This allows for real-time monitoring and analysis, enabling quick troubleshooting and performance optimization.

2. Flexible Query Language: PromQL, Prometheus’s query language, offers a powerful and expressive way to retrieve and manipulate metrics data. It supports various operations, including filtering, aggregation, and mathematical calculations, empowering users to extract meaningful insights from their monitoring data.

3. Alerting and Alert Manager: Prometheus has a built-in alerting mechanism that allows users to define alert rules based on specific conditions. When those conditions are met, Prometheus fires alerts, and the Alertmanager component manages and routes them to the right receivers; a minimal rule is sketched just after this list.
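As a hedged illustration, here is a minimal sketch of an alerting rule file that could be referenced from prometheus.yml under rule_files. The built-in up metric and the five-minute hold are standard Prometheus conventions; the group name and severity label are illustrative assumptions.

```yaml
# rules.yml - minimal alerting rule sketch (loaded via rule_files in prometheus.yml)
groups:
  - name: example-alerts        # illustrative group name
    rules:
      - alert: InstanceDown
        # 'up' is set to 0 by Prometheus whenever a scrape of a target fails
        expr: up == 0
        for: 5m                 # condition must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has been unreachable for 5 minutes"
```

The server evaluates the expression on each rule-evaluation interval; once it has been true for the full five minutes, the alert fires and is handed to the Alertmanager for grouping, routing, and notification.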

**Challenge: The traditional approach** 

First, let us roll back in time to before we had Prometheus network monitoring, say ten years, and look at the birth of monitoring and the tools used. The monitoring tools often operated in silos, which led to more blind spots.

The old approach to monitoring is considerably different from today's approach to Observability. Traditionally, you could use something like Ganglia for monitoring. Ganglia was often used to monitor CDN networks involving several PoPs in different locations. Within such a CDN network, however, the PoPs look the same: the same servers, storage, and so on, with the only differences being the number of transit providers and servers.

Then, to alert people, we could use Icinga, sitting behind Ganglia. With this monitoring design, the infrastructure pushes metrics to central collectors. The central collectors sit in one location, or perhaps two for redundancy.

**Complexity and distributed systems**

However, you will see some issues as infrastructure grows (and infrastructure does grow at alarming rates) and more metrics need to be pushed into Ganglia. With push-style metric collection, some monitoring systems hit scalability issues as the number of servers increases, especially in distributed systems observability use cases.

Within this CDN monitoring design, only one or two machines collect the telemetry for your entire infrastructure. So, as you scale your infrastructure and throw more data at the system, you have to scale up instead of out. This can be costly and will often hit bottlenecks.

**Scale with the Infrastructure**

However, you want a monitoring solution that scales with your infrastructure's growth. As you roll out new infrastructure to meet demand, you want monitoring systems that can scale alongside it, which is exactly the use case for Prometheus network monitoring. With Ganglia and Icinga, we also had limited graphing functions.

Creating custom dashboards on unique metrics was difficult, and no built-in alerting support existed. There was also no API for getting and consuming the metric data at that time. If you wanted to access the data and consume it in a different system or perform interesting analyses, all of it was locked inside Ganglia.

Related: Before you proceed, you may find the following posts helpful.

  1. Correlate Disparate Data Points
  2. Service Level Objectives
  3. Kubernetes Networking 101

Prometheus Monitoring

Understanding Prometheus’ Pull-Based System

Prometheus operates on a pull-based model, meaning it actively fetches metrics from the targets it monitors. Instead of waiting for the targets to push metrics to it, Prometheus takes the initiative and pulls the data at regular intervals. This approach offers several advantages over traditional push-based systems.

1. Flexibility and Reliability: One of the key benefits of the pull-based system is its flexibility. Prometheus can quickly adapt to dynamic environments and handle changes in the configuration of monitored targets. It can automatically discover new targets and adjust the scraping frequency based on the importance of metrics. This flexibility ensures that Prometheus can keep up with the ever-changing nature of modern infrastructure.

2. Efficiency and Scalability: The pull-based system also provides efficiency and scalability. By fetching metrics directly from the targets, Prometheus reduces the resource overhead on each target. This is particularly beneficial in scenarios where the number of targets is large, or the resources on the targets are limited. Additionally, Prometheus can distribute the scraping workload across multiple instances, enabling horizontal scalability and ensuring smooth operations even under heavy loads.

3. Data Integrity and Consistency: Another advantage of the pull-based system is its ability to ensure data integrity and consistency. Since Prometheus fetches metrics directly from the targets, potential data loss in a push-based system is eliminated. By actively pulling the data, Prometheus guarantees that the most up-to-date and accurate metrics are available for analysis and alerting.

4. Alerting and Monitoring: Prometheus’ pull-based system seamlessly integrates with its powerful alerting and monitoring capabilities. By regularly fetching metrics, Prometheus can evaluate them against predefined rules and trigger alerts when certain thresholds are exceeded. This proactive approach to monitoring ensures that any potential issues or anomalies are promptly detected, allowing for timely remedial actions.

Example: Prometheus in Google Cloud

In the following example, I have set up a Google Kubernetes Engine cluster and then deployed the Managed Service for Prometheus to ingest metrics from a simple application. Remember that Managed Service for Prometheus is Google Cloud's fully managed storage and query service for Prometheus metrics. This service is built on Monarch, the same globally scalable data store that backs Cloud Monitoring.

Note: A thin fork of Prometheus replaces existing Prometheus deployments and sends data to the managed service with no user intervention. This data can then be queried using PromQL through the Prometheus Query API supported by the managed service and the existing Cloud Monitoring query mechanisms.

In this last section, I quickly use gcloud to deploy a custom monitoring dashboard that shows the metrics from this application in a line chart. Once created, navigate to Monitoring > Dashboards to see the newly created dashboard.
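As a rough sketch (not the exact command from my setup), the dashboard can be created from a JSON definition with gcloud; my-dashboard.json is a placeholder for your own dashboard definition, whose schema is documented by Cloud Monitoring.

```bash
# Sketch: create a custom Cloud Monitoring dashboard from a JSON definition file
# (my-dashboard.json is a placeholder; its contents define the line chart and metrics)
gcloud monitoring dashboards create --config-from-file=my-dashboard.json
```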

Prometheus Network Monitoring

Prometheus network monitoring is an open-source, metrics-based system. It includes a robust data model and a query language that lets you analyze how your applications and infrastructure are performing. It does not try to solve problems outside the metrics space and works solely in metric-based monitoring.

However, it can be augmented with tools and a platform for additional observability. For Prometheus to work, you need to instrument your code.

Available Client libraries

Client libraries are available in all the popular languages and runtimes for instrumenting your code, including Go, Java/JVM, C#/.NET, Python, Ruby, Node.js, Haskell, Erlang, and Rust. In addition, software like Kubernetes and Docker is already instrumented with Prometheus client libraries, so their metrics are available out of the box, ready for Prometheus to scrape.
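To give a flavour of what instrumenting your own code looks like, here is a minimal sketch using the official Python client library; the metric name, port, and fake workload are illustrative assumptions, not anything prescribed by Prometheus.

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, start_http_server

# A counter the application increments as it does work (name is an arbitrary example)
REQUESTS = Counter("myapp_requests_total", "Total requests handled by myapp")

if __name__ == "__main__":
    start_http_server(8000)      # exposes /metrics on port 8000 for Prometheus to scrape
    while True:
        REQUESTS.inc()           # simulate handling a request
        time.sleep(random.random())
```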

In the following diagram, you will see the Prometheus settings from a fresh install. I have downloaded Prometheus from the Prometheus website on my local machine. Prometheus, by default, listens on port 9090 and contains a highly optimized Time Series Database (TSDB), which you can see is started. Also displayed at the very end of the screenshot is the default name of the Prometheus configuration file.

The Prometheus configuration file is written in YAML format and is defined by a schema. I have done a cat on the Prometheus configuration file to give you an idea of what it looks like. In the default configuration, a single job called prometheus scrapes the time series data exposed by the Prometheus server. The job contains a single, statically configured target: localhost on port 9090.
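For readers who cannot see the screenshot, the default configuration described above looks roughly like the following; exact values may differ slightly between Prometheus releases.

```yaml
# prometheus.yml - approximate default self-scrape configuration
global:
  scrape_interval: 15s       # how often to scrape targets
  evaluation_interval: 15s   # how often to evaluate rules

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]   # the Prometheus server scrapes itself
```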

The Transition to Prometheus network monitoring

Around eight years ago, SaaS-based monitoring solutions arrived on the scene. These solved some of the earlier problems, with alerting built in and an API to get at the data. However, there are now two systems, which introduces complexity: the collectors and agents push to the SaaS-based system alongside the on-premises design.

These systems may need to be managed by two different teams. There can be cloud teams looking after the cloud-based SaaS solution and on-premises network or security teams looking at the on-premises monitoring. So, there is already a communication gap, not to mention the creation of a considerably siloed environment in one technology set—monitoring.

Also, questions arise about where to put the metrics: in the SaaS-based product or in Ganglia? For example, we could end up with different metrics in the same place, or the same metrics in only one spot. How can you keep track and ensure consistency?

Ideally, if you have a dispersed PoP design and expect your infrastructure to grow and plan for the future, you don’t want to have centralized collectors. But unfortunately, most on-premise solutions still have a push-based centralized model. 

Prometheus Monitoring: Prometheus Pull

Then Prometheus Pull came around and offered a new approach to monitoring. It can handle millions of metrics on modest hardware. Rather than having external services push metrics to it, Prometheus takes a pull approach and fetches the metrics itself.

Prometheus network monitoring is a server application written in Go. It is an open-source, decentralized monitoring tool that can be centralized using the federation option. Prometheus has a server component, which you run in each environment. You can also run Prometheus as a container inside a Kubernetes cluster.

We use a time-series database for Prometheus monitoring, and every metric is recorded with a timestamp. Prometheus is not an SQL database; you need to use PromQL as its query language, which allows you to query the metrics. 

Prometheus Monitoring: Legacy System

So, let us expand on this and look at two environments for monitoring: a legacy environment and a modern Kubernetes environment. For the legacy environment, we are running a private cloud with many SQL, Windows, and Linux servers. There is nothing new here. You would run Prometheus on the same subnet, and a Prometheus agent would also be installed.

We would have Node Exporters for both Linux and Windows, extracting metrics and creating a metric endpoint on your servers. The metric endpoint is needed on each server or host so Prometheus can scrape the metrics. So, a daemon is running, collecting all of the metrics. These metrics are exposed on a page, for example, http://host:port/metrics, that Prometheus can scrape.
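A hedged sketch of how such hosts might appear in the Prometheus scrape configuration; the hostnames are placeholders, 9100 is the Node Exporter's default port, and 9182 is the Windows exporter's common default.

```yaml
scrape_configs:
  - job_name: "linux-nodes"
    static_configs:
      - targets: ["linux-host-1:9100", "linux-host-2:9100"]   # placeholder hostnames
  - job_name: "windows-nodes"
    static_configs:
      - targets: ["windows-host-1:9182"]                      # placeholder hostname
```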

There is also a Prometheus federation feature. You can have a federated endpoint, allowing Prometheus to expose its metrics to other Prometheus servers. This enables you to pull metrics across different subnets. So, we can have another Prometheus in a different subnet scraping the first Prometheus. The federate option lets you link these two Prometheus servers quickly.
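As a sketch, the "upstream" Prometheus would scrape the /federate endpoint of the other server; the target hostname and the match[] selectors below are illustrative assumptions.

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true            # keep the labels as exposed by the downstream server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="prometheus"}'    # illustrative selectors for which series to pull
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-other-subnet:9090"]   # placeholder for the other Prometheus
```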

Diagram: Prometheus Monitoring. Source: Opcito.

Prometheus Monitoring: Modern Kubernetes

Here, we have a microservices observability platform and a bunch of containers or VMs running in a Kubernetes cluster. In this environment, we usually create a namespace; for example, we could call the namespace monitoring. So, we deploy a Prometheus pod in our environments.

The Prometheus pod YAML file will point to the Kubernetes API. The Kubernetes API has a metric server that gets all metrics from your environments. So here we are, getting metrics for the container processes. You can deploy the library in your code if you want to instrument your application.

This can be done with Prometheus code libraries. We now have a metrics endpoint similar to before, and we can grab metrics specific to your application. We also have a metrics endpoint on each container that Prometheus can scrape.
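One common way to wire this up (a sketch, not the only option) is Kubernetes service discovery combined with a scrape annotation; the prometheus.io/scrape annotation is a widely used convention rather than a Prometheus requirement.

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod                 # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (convention, not required)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the namespace and pod name over as query labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```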

**What is a Prometheus Exporter?**

A Prometheus exporter is a specialized component that extracts and exposes metrics from third-party systems, applications, and services. It acts as a bridge between Prometheus, an open-source monitoring and alerting toolkit, and the target system or application.

By implementing the Prometheus exporter, users can conveniently collect and monitor custom metrics, enabling them to gain valuable insights into their systems’ health and performance.

How Does Prometheus Exporter Work?

Prometheus exporter follows a simple and efficient architecture. It utilizes a built-in HTTP server to expose metrics in a format that Prometheus understands. The exporter periodically collects metrics from the target system or application and makes them available over HTTP endpoints. Prometheus then scrapes these endpoints, discovers and stores the metrics, and performs analysis and alerting based on the defined rules.
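A rough Python sketch of that pattern: a tiny exporter that polls a hypothetical third-party system and exposes a gauge over HTTP. The read_queue_depth function and the metric name are made up purely for illustration.

```python
# pip install prometheus-client
import time

from prometheus_client import Gauge, start_http_server

# Gauge exposed to Prometheus (arbitrary example name)
QUEUE_DEPTH = Gauge("thirdparty_queue_depth", "Current queue depth of a third-party system")

def read_queue_depth() -> int:
    """Hypothetical call into the system being exported; replace with a real client."""
    return 42

if __name__ == "__main__":
    start_http_server(9400)              # the exporter's own HTTP endpoint for Prometheus
    while True:
        QUEUE_DEPTH.set(read_queue_depth())
        time.sleep(15)                   # refresh roughly once per scrape interval
```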

Benefits of Prometheus Exporter:

1. Flexibility: The Prometheus exporter provides flexibility in monitoring various metrics, making it suitable for multiple systems and applications. With its support for custom metrics, users can easily monitor specific aspects and gain insights into their systems’ behavior.

2. Compatibility: Due to its popularity, many systems and applications offer native support for Prometheus exporters. This compatibility allows users to effortlessly integrate the exporter into their existing monitoring infrastructure, eliminating the need for complex configurations or additional tools.

3. Extensibility: Prometheus exporter encourages extensibility by offering a straightforward mechanism to develop and expose custom metrics. This capability enables users to monitor specific application-level metrics critical to their unique monitoring requirements.

4. Scalability: With Prometheus exporter, users can scale their monitoring infrastructure as their systems grow. The exporter’s lightweight design and efficient data collection mechanism ensure that monitoring remains reliable and efficient, even in high-throughput environments.

Exposing Runtime Metrics: The Prometheus Exporter

  • Exporter Types:

To enable Prometheus monitoring, you must add a metrics API to the application containers. For applications that don't have their own metrics API, we use what is known as an Exporter. This utility reads the runtime metrics the app has already collected and exposes them on an HTTP endpoint.

Prometheus can then scrape this HTTP endpoint. So we have different types of Exporters that collect metrics for different runtimes, such as a Java Exporter, which will give you a set of JVM statistics, and a .NET Exporter, which will give you a set of Windows performance metrics.

Essentially, we are adding a Prometheus endpoint to the application. In addition, we use an Exporter utility alongside the application. So, we will have two processes running in the container. 

With this approach, you don’t need to change the application. This could be useful for some regulatory environments where you can’t change the application code. So now you have application runtime metrics without changing any code.

This covers the operating system and application host data already collected in the containers. To make these metrics available to Prometheus, add an Exporter to the Docker image. Many use Exporters for legacy applications instead of changing the code to support Prometheus monitoring.

Essentially, we are exporting the statistics to a metric endpoint.

Summary: Prometheus Monitoring

In today’s rapidly evolving digital landscape, businesses rely heavily on their IT infrastructure to ensure smooth operations and deliver uninterrupted customer services. As the complexity of these infrastructures grows, so does the need for effective network monitoring solutions.

Prometheus, an open-source monitoring and alerting toolkit, has emerged as a popular choice for organizations seeking deep insights into their network performance.

Prometheus offers a robust and flexible platform for monitoring various aspects of an IT infrastructure, including network components such as routers, switches, and servers. It collects and stores time-series data, allowing administrators to analyze historical trends, detect anomalies, and make informed decisions.

Prometheus pull-based model

One of the critical strengths of Prometheus lies in its ability to collect data through a pull-based model. Instead of relying on agents installed on each monitored device, Prometheus pulls data from the targets it monitors. This lightweight approach minimizes the impact on the monitored systems, making it an efficient and scalable solution for networks of any size.

Prometheus PromQL

Prometheus employs a powerful query language called PromQL, which enables administrators to explore and manipulate the collected data. With PromQL, users can define custom metrics, create complex queries, and generate insightful visualizations. This flexibility allows organizations to tailor Prometheus to their specific monitoring needs and gain a comprehensive understanding of their network performance.
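For example, assuming an application that exports a conventional http_requests_total counter and a request-duration histogram (both assumptions for illustration), typical PromQL looks like this:

```promql
# Per-second request rate over the last 5 minutes, summed per job
sum by (job) (rate(http_requests_total[5m]))

# Approximate 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```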

Prometheus Monitoring and Alerting

Another notable feature of Prometheus is its alerting system. Administrators can define custom alert rules based on specific metrics or thresholds, ensuring that they are promptly notified of any network issues. This proactive approach to monitoring allows businesses to mitigate potential downtime, minimize the impact on end-users, and maintain a high level of service availability.

Prometheus also integrates seamlessly with other widespread monitoring tools and platforms, such as Grafana, allowing users to create visually appealing dashboards and gain even deeper insights into their network performance. This interoperability makes Prometheus a versatile choice for organizations that already have an existing monitoring ecosystem in place.

Furthermore, Prometheus has a thriving community of contributors, ensuring constant updates, bug fixes, and new features. The active development community ensures that Prometheus stays relevant and up-to-date with the latest industry trends and best practices in network monitoring.

Conclusion:

Prometheus Monitoring has revolutionized how developers and DevOps teams monitor and troubleshoot modern software systems. With its efficient time-series database, flexible query language, and extensive ecosystem, Prometheus provides a comprehensive solution for monitoring and alerting needs. By leveraging its powerful features, organizations can gain valuable insights, ensure system reliability, and foster continuous improvement in their software delivery pipelines.

service level objectives

Starting Observability


Openshift Networking is a fundamental aspect of managing and orchestrating containerized applications within the Openshift platform. In this blog post, we will embark on a journey to explore the intricacies of Openshift Networking, shedding light on its key components and functionalities.

Openshift Networking operates on the concepts of pods, services, and routes. Pods are the smallest deployable units in Openshift, services enable communication between pods, and routes provide external access to services. By grasping these fundamental building blocks, we can delve deeper into the networking fabric of Openshift.

Networking Modes: Openshift offers multiple networking modes to suit various requirements. We will explore the three primary modes: Overlay Networking, Host Networking, and IPvlan Networking. Each mode has its advantages and considerations, allowing users to tailor their networking setup to meet specific needs.

Network Policies: Network Policies play a crucial role in controlling the flow of traffic within Openshift. We will discuss how network policies enable fine-grained control over ingress and egress traffic, allowing administrators to enforce security measures and isolate applications effectively.

Service Mesh Integration: Openshift seamlessly integrates with popular service mesh solutions like Istio, enabling advanced networking features such as traffic routing, load balancing, and observability. We will explore the benefits of leveraging a service mesh within Openshift and how it can enhance application performance and reliability.

Openshift Networking forms the backbone of containerized application management, providing a robust and flexible networking environment. By understanding the basics, exploring networking modes, and leveraging network policies and service mesh integration, users can harness the full potential of Openshift Networking to optimize their applications and enhance overall performance.

Highlights: Starting Observability

Understanding Observability

To comprehend the essence of observability, we must first grasp its fundamental components. Observability involves collecting, analyzing, and interpreting data from various sources within a system. This data provides insights into the system’s performance, health, and potential issues. By monitoring metrics, logs, traces, and other relevant data points, observability equips organizations with the ability to diagnose problems, optimize performance, and ensure reliability.

The benefits of observability extend far beyond the realm of technology. Organizations across industries are leveraging observability to enhance their operations and drive innovation. In software development, observability enables teams to identify and resolve bugs, bottlenecks, and vulnerabilities proactively.

Additionally, it facilitates efficient troubleshooting and aids in the optimization of critical systems. Moreover, observability is pivotal in enabling data-driven decision-making, empowering organizations to make informed choices based on real-time insights.

**Key Components of Observability: Logs, Metrics, and Traces**

To fully grasp observability, you must understand its key components:

1. **Logs**: Logs are records of events that occur within your system. They provide detailed information about what happened, where, and why. Effective log management allows you to search, filter, and analyze these events to detect anomalies and pinpoint issues.

2. **Metrics**: Metrics are numerical values that represent the performance and health of your system over time. They help you track resource utilization, response times, and other critical parameters. By visualizing these metrics, you can easily identify trends and potential bottlenecks.

3. **Traces**: Traces follow the path of requests as they flow through your system. They offer a high-level view of your application’s behavior, highlighting dependencies and latency issues. Tracing is invaluable for diagnosing complex problems in distributed systems.

A New Paradigm Shift

Your infrastructure is undergoing a paradigm shift to support the new variations. As systems become more distributed and complex, methods for building and operating them are evolving, making network visibility into your services and infrastructure more critical than ever. This leads you to adopt new practices, such as Starting Observability and implementing service level objectives (SLO).

Note: The Internal States

Observability aims to provide a level of introspection that lets you understand the internal state of your systems and applications. That understanding can be achieved in various ways. The most common is a combination of logs, metrics, and traces as debugging signals, all of which need to be viewed together rather than as isolated entities.

You have probably encountered the difference between monitoring and Observability. But how many articles have you read that guide starting an observability project?

**Getting Started with Observability Tools**

With a basic understanding of observability, it’s time to explore the tools that can help you implement it. Several observability platforms cater to different needs, including:

– **Prometheus**: Known for its powerful time-series database and flexible querying language, Prometheus is ideal for collecting and analyzing metrics.

– **ELK Stack (Elasticsearch, Logstash, Kibana)**: This open-source stack excels at log management, providing robust search and visualization capabilities.

– **Jaeger**: Designed for distributed tracing, Jaeger helps you trace requests and identify performance bottlenecks in microservices environments.

Choosing the right tools depends on your specific requirements, but starting with these popular options can set you on the right path.

Observability & Service Mesh

**The Role of Service Mesh in Observability**

Service mesh has emerged as a powerful tool in enhancing observability within distributed systems. At its core, a service mesh is a dedicated infrastructure layer that controls service-to-service communication. By handling aspects like load balancing, authentication, and encryption, service meshes provide a centralized way to manage network traffic. More importantly, they offer detailed telemetry, which is essential for observability. With a service mesh, you can collect metrics, traces, and logs in a standardized manner, making it easier to monitor the health and performance of your services.

**Key Benefits of Using a Service Mesh**

One of the primary advantages of implementing a service mesh is the increased visibility it provides. By abstracting the complexities of network communication, a service mesh allows teams to focus on building and deploying applications without worrying about underlying connectivity issues. Additionally, service meshes facilitate better security practices through features like mutual TLS, enabling encryption and authentication between services. This not only boosts observability but also enhances the overall security posture of your applications.

The Basics: How Cloud Service Mesh Works

At its core, a cloud service mesh is a dedicated infrastructure layer that handles service-to-service communication within a cloud environment. It provides a more secure, reliable, and observable way to manage these communications. Instead of embedding complex logic into each service to handle things like retries, timeouts, and security policies, a service mesh abstracts these concerns into a separate layer, often through sidecar proxies that accompany each service instance.

**Key Benefits: Why Adopt a Cloud Service Mesh?**

1. **Enhanced Security**: By centralizing security policies and automating mutual TLS (mTLS) for service-to-service encryption, a service mesh significantly reduces the risk of data breaches.

2. **Observability**: With built-in monitoring and tracing capabilities, a service mesh provides deep insights into the behavior and performance of services, making it easier to detect and resolve issues.

3. **Traffic Management**: Advanced traffic management features such as load balancing, fine-grained traffic routing, and fault injection help optimize the performance and reliability of services.

**Real-World Applications: Success Stories**

Many organizations have successfully implemented cloud service meshes to solve complex challenges. For instance, a large e-commerce platform might use a service mesh to manage its microservices architecture, ensuring that customer transactions are secure and resilient. Another example is a financial institution that leverages a service mesh to comply with stringent regulatory requirements by enforcing consistent security policies across all services.

**Choosing the Right Service Mesh: Key Considerations**

When selecting a cloud service mesh, it’s essential to consider factors such as compatibility with your existing infrastructure, ease of deployment, and community support. Popular options like Istio, Linkerd, and Consul each offer unique features and benefits. Evaluating these options based on your specific needs and constraints will help you make an informed decision.

Compute Engine Monitoring with Ops Agent

Understanding Ops Agent

Google Cloud developed Ops Agent, a lightweight and versatile monitoring agent. It seamlessly integrates with Compute Engine instances and provides real-time insights into system metrics, logs, and more. By installing Ops Agent, you gain granular visibility into your instances’ performance, allowing you to address any issues proactively.

To start with Ops Agent, you must install and configure it on your Compute Engine instances. This section will guide you through the steps, from enabling the necessary APIs to deploying Ops Agent on your instances. 

Ops Agent offers an extensive range of metrics and logs you can monitor to gain valuable insights into your infrastructure’s health and performance. This section will delve into the available metrics and logs, such as CPU utilization, disk I/O, network traffic, and application logs. 

Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt Observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand every user’s experience.

In that case, Observability for software systems measures how well you can understand and explain any state your system can get into, no matter how novel or bizarre.

Observability & Managed Instance Groups

**The Core of Google Cloud’s Scalability**

Managed Instance Groups are a fundamental feature of Google Cloud, designed to automate the deployment and management of virtual machine instances. By leveraging MIGs, developers can automatically scale their applications based on current demand, ensuring optimal performance without manual intervention. This automation is achieved through policy-based scaling, which adjusts the number of instances in real-time.

**Enhanced Observability with Managed Instance Groups**

Observability is crucial for maintaining the health and performance of cloud-based applications. With Managed Instance Groups, Google Cloud provides robust monitoring and logging tools that offer deep insights into application performance. These tools enable developers to track metrics, set up alerts, and gain a comprehensive view of their infrastructure’s behavior. By integrating observability with MIGs, Google Cloud ensures that any issues can be swiftly identified and resolved, minimizing downtime and enhancing reliability.

**Advantages of Using Managed Instance Groups**

Managed Instance Groups offer several advantages that make them indispensable for modern cloud computing. First and foremost, they provide automatic scaling, which optimizes resource usage and reduces costs. Additionally, MIGs ensure high availability by distributing instances across multiple zones, thereby enhancing fault tolerance. Moreover, they support rolling updates, enabling developers to deploy new features or fixes with minimal disruption.

Managed Instance Group

For additional background information, you may find the following helpful:

  1. Observability vs Monitoring
  2. Distributed Systems Observability
  3. WAN Monitoring
  4. Reliability In Distributed System

Starting Observability

The Three Pillars of Observability

Three Pillars of Observability: Observability rests upon three main pillars: logs, metrics, and traces. Logs capture textual records of system events, providing valuable context and aiding post-incident analysis. Metrics, meanwhile, are quantitative measurements of system behavior, allowing us to track performance and identify anomalies. Traces give a detailed view of request flows and interactions between system components, facilitating troubleshooting and understanding system dependencies.

The Power of Proactive Maintenance: One critical advantage of observability is its ability to enable proactive maintenance. We can continuously monitor and analyze system data to identify potential issues or anomalies before they escalate into critical problems. This proactive approach empowers us to take preventive measures, reducing downtime and improving system reliability.

Unleashing the Potential of Data Analysis: Observability generates a wealth of data that can be harnessed to drive informed decision-making. We can uncover patterns, identify performance bottlenecks, and optimize system behavior by leveraging data analysis techniques. This data-driven approach empowers us to make data-backed decisions that enhance system performance, scalability, and user experience.

The Starting Strategy

Start your observability project in the middle, not on the fringe, and start with something meaningful. There is no point in starting an observability project on something no one cares about or uses that much. So, choose something that matters, and the result will be noticed. On the other hand, something no one cares about will not attract any stakeholder interest.

Service level objectives (SLO)

So, to start an observability project on something that matters and will attract interest, you need to look at metrics that matter: Service Level Objectives (SLOs). With service level objectives, we attach the needs of the product and business to the needs of the individual components, finding the perfect balance for starting observability projects.

The service level objective aggregates over time and is a mathematical equivalent of an error budget. So, over this period, am I breaching my target? If you exceed your SLO target, your users will be happy with the state of your service.

If you are missing your SLO target, your users are unhappy with the state of your service. It’s as simple as that. So, the SLO is the target’s goal over a measurement period. The SLO includes two things: it contains the target and a measurement window. For example, 99.9% of checkout requests in the past 30 days have been successful. Thirty days is the measurement window.  

Example: Understanding GKE-Native Monitoring

GKE-Native Monitoring provides real-time insights into your GKE clusters, allowing you to monitor the health and performance of your applications. You can easily track resource utilization, latency, error rates, and more with built-in metrics and dashboards. This proactive monitoring enables you to identify bottlenecks, troubleshoot issues, and optimize your cluster’s performance.

GKE-Native Logging simplifies the collection, storage, and analysis of logs generated by your applications running on GKE. By consolidating logs from multiple sources, such as containers, system components, and services, GKE-Native Logging provides a unified view of your application’s behavior. This centralized log management enhances troubleshooting capabilities and lets you gain valuable insights into your application’s performance and security.

**Key Point: Take advantage of Error Budgets**

Once you have determined your service level objectives, you should look at your error budgets. Nothing can be reliable all the time, and it's OK to fail; failing is the only way to test and innovate to better meet user requirements, which is why we have an error budget. An error budget is a budget of failure that you are allowed to spend per hour or per month.

We need a way to measure the amount of unreliability we will tolerate. Once you know how much of the error budget you have left, you can take more risks and roll out new features. Error budgets help you balance velocity and reliability. So, the practices of SLO and error budgets prioritize reliability and velocity.
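As a rough worked example, the 99.9%-over-30-days target mentioned earlier translates into about three-quarters of an hour of error budget:

```text
30 days x 24 h x 60 min  = 43,200 minutes in the window
allowed failure          = 100% - 99.9% = 0.1%
error budget             = 43,200 x 0.001 ≈ 43.2 minutes of unreliability per 30 days
```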

**Key Point: Issues with MTTR**

SLO is an excellent way to start and win. This can be approached on a team-by-team basis. It’s a much more accurate way to measure reliability than Mean Time to Recovery (MTTR). The issue with MTTR is that you measure the time it takes to resolve every incident. However, this measurement method can be subject to measurement error. The SLO is more challenging to cheat and a better way to measure.

So, we have key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs). These are the first steps toward implementing Observability, rather than just looking at KPIs. First, monitor the KPIs, the SLx metrics, and the system's internal state. From there, you can derive your service level metrics, which are the best place to start an observability project.

What is a KPI and SLI: User experience

The key performance indicator (KPI) is tied to system implementation. It conveys health and performance and may change if the system’s architecture changes. For example, database latency would be a KPI. In contrast to KPI, we have service level indicators (SLI). An SLI measures your user experience and can be derived from several signals.

The SLI does not change unless the user's needs do. It's the metric that matters most to the user. This indicator tells you whether your service is acceptable or not; in other words, it tells you whether you have a happy or an unhappy user. It's a performance measurement, a metric that describes the user's experience.

Types of Service Level Indicators

Examples of SLIs include availability, latency, correctness, quality, freshness, and throughput. We need to gather these metrics, which can be done by implementing several measurement strategies, such as application-level metrics, log processing, and client-side instrumentation. So, if we look at an SLI implementation for availability, it would be, for example, the proportion of HTTP GET requests for a given path that complete successfully.
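As a hedged sketch, that kind of availability SLI is often computed in PromQL as a ratio of good requests to total requests; the metric and label names here are assumptions about how the application is instrumented.

```promql
# Fraction of GET requests that did not return a 5xx status, over a 30-day window
sum(rate(http_requests_total{method="GET", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{method="GET"}[30d]))
```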

The users care about SLI and not KPI. I’m not saying that database latency is not essential. You should measure it and put it in a predefined dashboard. However, users don’t care about database latency or how quickly their requests can be restored. Instead, the role of the SLI is to capture the user’s expectation of how the system behaves. If your database is too slow, you must front it with a cache. So, the cache hit ratio becomes a KPI, but the user’s expectations have not changed. 

Starting Observability Engineering

Good Quality Telemetry

You need to understand the importance of high-quality telemetry and adopt it carefully, as it is the first step to good Observability. You should have quality logs and metrics, and a modern approach such as Observability is required for long-term success. For this, you need good telemetry; without it, it will be hard to shorten the length of outages.

What are VPC Flow Logs?

VPC Flow Logs provide detailed information about the IP traffic flowing into and out of Virtual Private Clouds (VPCs) within Google Cloud. These logs capture data at the network interface level, including source and destination IP addresses, ports, protocols, and timestamps. By enabling VPC Flow Logs, network administrators gain a comprehensive view of network traffic, facilitating analysis and troubleshooting.

Analyzing VPC Flow Logs can yield valuable insights for various use cases. Firstly, it enables network administrators to monitor and detect anomalies in network traffic. By studying patterns and identifying unexpected or suspicious behavior, potential security threats can be identified and mitigated promptly. Additionally, VPC Flow Logs provide performance optimization opportunities by identifying bottlenecks, analyzing traffic patterns, and optimizing resource allocation.

Instrumentation: OpenTelemetry 

The first step to consider is how your applications will emit telemetry data. OpenTelemetry is the emerging standard for instrumenting frameworks and application code. With OpenTelemetry's pluggable exporters, you can configure your instrumentation to send data to the analytics tool of your choice.

In addition, OpenTelemetry helps with distributed tracing, which enables you to understand system interdependencies. Unless their relationships are clearly understood, those interdependencies can obscure problems and make them challenging to debug.
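A minimal sketch of OpenTelemetry tracing in Python, exporting spans to the console; in practice you would swap the console exporter for an OTLP exporter pointing at your chosen backend. The service and span names are illustrative.

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider with a pluggable exporter (console here, OTLP in production)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")   # illustrative service name

with tracer.start_as_current_span("process_order"):
    # Nested span shows how interdependencies between steps appear in a trace
    with tracer.start_as_current_span("charge_payment"):
        pass  # call the downstream payment service here
```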

Understanding Application Latency

Application latency refers to the time it takes for a request to travel from the user to the server and back. Various factors influence it, including network latency, database queries, and processing time. By understanding and measuring application latency, developers can identify areas that require optimization and improve overall performance.

Example: Google Cloud Trace

Google Cloud Trace is a powerful tracing tool provided by Google Cloud Platform. It enables developers to analyze application latency by capturing and visualizing detailed traces of requests as they flow through various components of their applications. With Cloud Trace, developers gain insights into their applications’ performance, enabling them to identify and resolve latency issues efficiently.

**Data storage and analytics**

Once you have high-quality telemetry data, you must consider how it's stored and analyzed. Depending on whether you use open-source or proprietary solutions, data storage and analytics are often bundled into the same solution. Commercial vendors typically bundle storage and analytics. These are proprietary all-in-one solutions, including Honeycomb, Lightstep, New Relic, Splunk, Datadog, and others.

Then, we have the open-source solutions that typically require separate data storage and analytics approaches. These open-source frontends include solutions like Grafana, Prometheus, or Jaeger. While they handle analytics, they all need an independent data store to scale. Popular open-source data storage layers include Cassandra, Elastic, M3, and InfluxDB.

**A final note: Buy instead of building?**

The most significant pain point is knowing how to start. Building your own observability tooling rather than buying a commercially available solution takes longer to prove a return on investment (ROI). So, if you don't have enough time, you will need to buy. I prefer buying to get quick results and stakeholder attention, while on the side, you can start to build with open-source components.

Essentially, you are running two projects in parallel. You buy to get immediate benefits and gain stakeholder attention, and then, on the side, you can start to build your own, which may be more flexible for you in the long term.

Summary: Starting Observability

In today’s complex digital landscape, gaining insights into the inner workings of our systems has become vital. This has led to the rise of observability, a concept that empowers us to understand and effectively monitor our systems. In this blog post, we explored observability from its core principles to its practical applications.

Understanding Observability

Observability, at its essence, is the measure of how well we can understand a system’s internal state based on its external outputs. It goes beyond traditional monitoring by providing the ability to gain insight into what is happening inside a system, even when things go awry. By capturing and analyzing relevant data, observability enables us to identify issues, understand their root causes, and make informed decisions to improve system performance.

The Three Pillars of Observability

To fully understand observability, we must examine its three fundamental pillars: logs, metrics, and traces.

Logs: Logs are textual records that capture system events and activities. They allow us to retrospectively analyze system behavior, diagnose errors, and gain valuable context for troubleshooting.

Metrics: Metrics are numerical measurements that provide quantitative data about the system’s performance. They help us track key indicators, identify anomalies, and make data-driven decisions to optimize the system.

Traces: Traces provide a detailed chronology of events that occur during a specific request’s lifecycle. By following a request’s path across various components, we can pinpoint bottlenecks, optimize performance, and enhance the overall user experience.

Implementing Observability in Practice

Now that we understand the basics, let's explore how observability can be implemented in real-world scenarios.

Step 1: Define Key Observability Metrics: Identify the crucial metrics that align with your system’s goals and user expectations. Consider factors such as response time, error rates, and resource utilization.

Step 2: Set Up Logging Infrastructure: Implement a centralized solution to collect and store relevant logs from different components. This allows for efficient log aggregation, searchability, and analysis.

Step 3: Instrument Your Code: Embed monitoring libraries or agents within your application code to capture metrics and traces. This enables you to gain real-time insights into your system’s health and performance.

Step 4: Utilize Observability Tools: Leverage observability platforms or specialized tools that provide comprehensive dashboards, visualizations, and alerting mechanisms. These tools simplify the analysis process and facilitate proactive monitoring.

Conclusion: Observability has transformed the way we understand and monitor complex systems. By embracing its principles and leveraging the power of logs, metrics, and traces, we can gain valuable insights into our systems, detect anomalies, and optimize performance. As technology advances, observability will remain indispensable for ensuring system reliability and delivering exceptional user experiences.

System Observability

Distributed Systems Observability


In the realm of modern technology, distributed systems have become the backbone of numerous applications and services. However, the increasing complexity of such systems poses significant challenges when it comes to monitoring and understanding their behavior. This is where observability steps in, offering a comprehensive solution to gain insights into the intricate workings of distributed systems. In this blog post, we will embark on a captivating journey into the realm of distributed systems observability, exploring its key concepts, tools, and benefits.

Observability, as a concept, enables us to gain deep insights into the internal state of a system based on its external outputs. When it comes to distributed systems, observability takes on a whole new level of complexity. It encompasses the ability to effectively monitor, debug, and analyze the behavior of interconnected components across a distributed architecture. By employing various techniques and tools, observability allows us to gain a holistic understanding of the system's performance, bottlenecks, and potential issues.

To achieve observability in distributed systems, it is crucial to focus on three interconnected components: logs, metrics, and traces.

Logs: Logs provide a chronological record of events and activities within the system, offering valuable insights into what has occurred. By analyzing logs, engineers can identify anomalies, track down errors, and troubleshoot issues effectively.

Metrics: Metrics, on the other hand, provide quantitative measurements of the system's performance and behavior. They offer a rich source of data that can be analyzed to gain a deeper understanding of resource utilization, response times, and overall system health.

Traces: Traces enable the visualization and analysis of transactions as they traverse through the distributed system. By capturing the flow of requests and their associated metadata, traces allow engineers to identify bottlenecks, latency issues, and performance optimizations.

In the ever-evolving landscape of distributed systems observability, a plethora of tools and frameworks have emerged to simplify the process. Prominent examples include:

1. Prometheus: A powerful open-source monitoring and alerting system that excels in collecting and storing metrics from distributed environments.

2. Jaeger: An end-to-end distributed tracing system that enables the visualization and analysis of transaction flows across complex systems.

3. ELK Stack: A comprehensive combination of Elasticsearch, Logstash, and Kibana, which collectively offer powerful log management, analysis, and visualization capabilities.

4. Grafana: A widely-used open-source platform for creating rich and interactive dashboards, allowing engineers to visualize metrics and logs in real-time.

The adoption of observability in distributed systems brings forth a multitude of benefits. It empowers engineers and DevOps teams to proactively detect and diagnose issues, leading to faster troubleshooting and reduced downtime. Observability also aids in capacity planning, resource optimization, and identifying performance bottlenecks. Moreover, it facilitates collaboration between teams by providing a shared understanding of the system's behavior and enabling effective communication.

In the ever-evolving landscape of distributed systems, observability plays a pivotal role in unraveling the complexity and gaining insights into system behavior. By leveraging the power of logs, metrics, and traces, along with robust tools and frameworks, engineers can navigate the intricate world of distributed systems with confidence. Embracing observability empowers organizations to build resilient, high-performing systems that can withstand the challenges of today's digital landscape.

Highlights: Distributed Systems Observability

The Role of Distributed Systems

A – Several decades ago, only a handful of mission-critical services worldwide were required to meet the availability and reliability requirements of today’s always-on applications and APIs. In response to user demand, every application must be built to scale nearly instantly to accommodate the potential for rapid, viral growth. Almost every app built today—whether a mobile app for consumers or a backend payment system—must meet these constraints and requirements.

B – Inherently, distributed systems are more reliable due to their distributed nature. When software engineers design and build these systems appropriately, they can also benefit from more scalable organizational models. There is, however, a price to pay for these advantages.

C – Designing, building, and debugging these distributed systems can be challenging. A reliable distributed system requires significantly more engineering skills than a single-machine application, such as a mobile app or a web frontend. Regardless, distributed systems are becoming increasingly important. There is a corresponding need for tools, patterns, and practices to build them.

D – As digital transformation accelerates, organizations adopt multicloud environments to drive secure innovation and achieve speed, scale, and agility. As a result, technology stacks are becoming increasingly complex and scalable. Today, even the most straightforward digital transaction is supported by an array of cloud-native services and platforms delivered by various providers. To improve user experience and resilience, IT and security teams must monitor and manage their applications.

**Key Components of Observability**

Observability in distributed systems typically relies on three pillars: logs, metrics, and traces. Logs provide detailed records of events within the system, offering context for debugging issues. Metrics offer quantitative data, such as CPU usage and request rates, allowing teams to monitor system health and performance over time. Traces enable the tracking of requests as they move through the system, helping to pinpoint where latency or failures occur. Together, these components create a comprehensive picture of the system’s state and behavior.

**Challenges in Achieving Observability**

While observability is essential, achieving it in distributed systems is not without its challenges. The sheer volume of data generated by these systems can be overwhelming. Additionally, correlating data from disparate sources to form a cohesive narrative requires sophisticated tools and techniques. Moreover, ensuring that observability doesn’t introduce too much overhead or affect system performance is a delicate balancing act. Organizations must invest in the right infrastructure and expertise to tackle these challenges effectively.

**Best Practices for Enhancing Observability**

To maximize observability in distributed systems, organizations should adopt several best practices. Firstly, they should implement centralized logging and monitoring solutions that can aggregate data from all system components. Secondly, leveraging open standards like OpenTelemetry can facilitate consistent data collection and integration with various tools. Thirdly, incorporating automated alerting and anomaly detection can help teams proactively address issues before they impact users. Lastly, fostering a culture of collaboration between development and operations teams can ensure that observability is an ongoing, shared responsibility.
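
As a rough illustration of the OpenTelemetry and centralization points above, the sketch below wires a Python service's tracer to an OTLP exporter so spans flow to a central collector. The collector endpoint, service name, and span names are assumptions for illustration, not details taken from this post.

```python
# Minimal OpenTelemetry tracing setup (sketch).
# Assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this service in the centralized backend (name is a placeholder).
resource = Resource.create({"service.name": "checkout-service"})

provider = TracerProvider(resource=resource)
# Endpoint of your OpenTelemetry Collector (assumption; adjust to your environment).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.route", "/checkout")
    # ... business logic goes here ...
```

Because the instrumentation targets the open standard rather than a specific vendor, swapping the exporter for one that points at a managed backend is typically a small, isolated change.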

Cloud Service Mesh

### What is a Cloud Service Mesh?

A Cloud Service Mesh is a dedicated infrastructure layer that facilitates service-to-service communication in a microservices architecture. It abstracts the complex communication patterns between services into a manageable, secure, and observable framework. By deploying a service mesh, organizations can effectively manage the interactions of their microservices, ensuring seamless connectivity, security, and resilience.

### Key Benefits of Implementing a Cloud Service Mesh

1. **Enhanced Security**: A service mesh provides robust security features such as mutual TLS authentication, which encrypts communications between services. This ensures that data remains secure and tamper-proof as it travels across the network.

2. **Traffic Management**: With a service mesh, you can implement sophisticated traffic management policies, including load balancing, circuit breaking, and retries. This leads to improved performance and reliability of your distributed systems.

3. **Observability**: One of the standout features of a service mesh is its ability to provide deep observability into the interactions between services. Metrics, logs, and traces are collected and analyzed, offering invaluable insights into system health and performance.

### Enhancing Observability in Distributed Systems

Observability is a key concern in managing distributed systems. With the proliferation of microservices, tracking and understanding service interactions can become overwhelmingly complex. A Cloud Service Mesh addresses this challenge by offering comprehensive observability features:

– **Metrics Collection**: Collects real-time metrics on service performance, latency, error rates, and more.

– **Distributed Tracing**: Enables tracing of requests as they propagate through multiple services, helping identify bottlenecks and performance issues.

– **Centralized Logging**: Aggregates logs from various services, providing a unified view for easier troubleshooting and analysis.

These capabilities empower teams to detect issues early, optimize performance, and ensure the reliability of their applications.
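
To make the metrics-collection capability concrete, here is a hedged sketch that pulls a mesh-level request-rate metric from a Prometheus server over its HTTP API. The Prometheus URL and the istio_requests_total metric name assume an Istio-style mesh scraped by Prometheus; they are illustrative, not prescribed by this post.

```python
# Sketch: query service-to-service request rates that a service mesh
# (e.g., Istio) exposes via Prometheus. Assumes the `requests` package
# and a reachable Prometheus server scraping the mesh sidecars.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # assumption

# Per-destination request rate over the last 5 minutes (PromQL).
query = 'sum(rate(istio_requests_total[5m])) by (destination_service, response_code)'

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]  # [timestamp, value-as-string]
    print(f"{labels.get('destination_service')} "
          f"code={labels.get('response_code')} rps={float(value):.2f}")
```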

### Real-World Applications and Use Cases

Several organizations have successfully implemented Cloud Service Meshes to transform their operations. For instance, financial institutions use service meshes to secure sensitive transactions, while e-commerce platforms leverage them to manage high traffic volumes during peak shopping seasons. By providing a robust framework for service communication, a service mesh enhances scalability, reliability, and security across industries.

Google's Ops Agent

Ops Agent is a lightweight agent that runs on your Compute Engine instances, collecting and forwarding metrics and logs to Google Cloud Monitoring and Logging. By installing Ops Agent on your instances, you gain real-time visibility into your Compute Engine’s performance and behavior.

To start monitoring your Compute Engine, you must install Ops Agent on your instances. The installation process is straightforward and can be done manually or through automation tools like Cloud Deployment Manager or Terraform. Once installed, the Ops Agent will automatically begin collecting metrics and logs from your Compute Engine.

Ops Agent allows you to customize the metrics and logs you want to monitor for your Compute Engine. Various options are available, allowing you to choose specific metrics and logs relevant to your application or system. By configuring metrics and logs, you can gain deeper insights and track the performance of critical components.

**Challenge: Fragmented Monitoring Tools**

Fragmented monitoring tools and manual analytics strategies challenge IT and security teams. The lack of a single source of truth and real-time insight makes it increasingly difficult for these teams to access the answers they need to accelerate innovation and optimize digital services. To gain insight, they must manually query data from various monitoring tools and piece together different sources of information.

This complex and time-consuming process distracts team members from driving innovation and creating new value for the business and customers. In addition, many teams monitor only their mission-critical applications because of the effort involved in managing all these tools, platforms, and dashboards. The result is a multitude of blind spots across the technology stack, which makes it harder for teams to gain insights.

**Challenge: Kubernetes is Complex**

Understanding how Kubernetes adds to the complexity of technology stacks is imperative. In the drive toward modern technology stacks, it is the platform of choice for organizations refactoring their applications for the cloud-native world. Through dynamic resource provisioning, Kubernetes architectures can quickly scale services to new users and increase efficiency.

However, the constant changes in cloud environments make it difficult for IT and security teams to maintain visibility into them. These teams cannot achieve observability in their Kubernetes environments by manually configuring an assortment of traditional monitoring tools. As a result, they are often unable to gain the real-time insights they need to improve user experience, optimize costs, and strengthen security. Due to this visibility challenge, many organizations are delaying moving more mission-critical services to Kubernetes.

GKE-Native Monitoring

The Basics of GKE-Native Monitoring

GKE-Native Monitoring is a comprehensive monitoring solution provided by Google Cloud Platform (GCP) designed explicitly for GKE clusters. It offers deep insights into your applications’ performance and behavior, allowing you to proactively detect and resolve issues. With GKE-Native Monitoring, you can easily collect and analyze metrics, monitor logs, and set up alerts to ensure the reliability and availability of your applications.

One of the critical features of GKE-Native Monitoring is its ability to collect and analyze metrics from your GKE clusters. It provides preconfigured dashboards that display essential metrics such as CPU usage, memory utilization, and network traffic. Additionally, you can create custom dashboards tailored to your specific requirements, allowing you to better understand your application's performance and resource consumption.
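
Those dashboards sit on top of Cloud Monitoring metrics that you can also read programmatically. The following sketch assumes the google-cloud-monitoring client library, a placeholder project ID, and a standard GKE container CPU metric; treat it as an illustration rather than as this post's prescribed workflow.

```python
# Sketch: read GKE container CPU usage from Cloud Monitoring.
# Assumes the google-cloud-monitoring package and application default credentials.
import time
from google.cloud import monitoring_v3

project_id = "my-gcp-project"  # placeholder
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 3600},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "kubernetes.io/container/cpu/core_usage_time"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    labels = series.resource.labels
    print(labels.get("cluster_name"), labels.get("pod_name"), len(series.points))
```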

The Role of Megatrends

A considerable wave of innovation has spawned several megatrends that affect how we manage and view our network infrastructure and that drive the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of a monolith, where everything is generally housed internally, we must now manage and operate many services and dependencies in multiple locations, which is where microservices observability comes in. These megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

Shift in Control

There has also been a shift in the point of control. As we move towards new technologies, many of the loosely coupled services or infrastructures your services depend on are not under your control. The edge of control has been pushed outward, creating different network and security perimeters. These perimeters are now closer to the workload than to a central security stack. Therefore, the workloads themselves must take on security concerns.

Example Product: Cisco AppDynamics

### What is Cisco AppDynamics?

Cisco AppDynamics is an application performance management (APM) solution designed to provide real-time visibility into the performance of your applications. It helps IT professionals identify bottlenecks, diagnose issues, and optimize performance, ensuring a seamless user experience. With its powerful analytics, you can gain deep insights into your application stack, from the user interface to the backend infrastructure.

### Key Features and Capabilities

#### Real-Time Monitoring

One of the standout features of Cisco AppDynamics is its ability to monitor applications in real-time. This allows IT teams to detect and resolve issues as they occur, minimizing downtime and ensuring a smooth user experience. Real-time monitoring covers everything from user interactions to server performance, providing a comprehensive view of your application’s health.

#### End-User Experience Monitoring

Understanding how users interact with your application is crucial for delivering a high-quality experience. Cisco AppDynamics offers end-user experience monitoring, which tracks user sessions and interactions. This data helps you identify any pain points or performance issues that may be affecting user satisfaction.

#### Business Transaction Monitoring

Cisco AppDynamics takes a unique approach to monitoring by focusing on business transactions. By tracking the performance of individual transactions, you can gain a clearer understanding of how different parts of your application are performing. This level of granularity allows for more targeted optimizations and quicker issue resolution.

### Benefits of Using Cisco AppDynamics

#### Improved Application Performance

With its comprehensive monitoring and diagnostic capabilities, Cisco AppDynamics helps you identify and resolve performance issues quickly. This leads to faster load times, fewer errors, and an overall improved user experience.

#### Enhanced Operational Efficiency

By automating many of the monitoring and diagnostic processes, Cisco AppDynamics reduces the workload on your IT team. This allows them to focus on more strategic initiatives, driving greater value for your business.

#### Better Decision Making

The insights provided by Cisco AppDynamics enable better decision-making at all levels of your organization. Whether you’re looking to optimize resource allocation or plan for future growth, the data and analytics provided can inform your strategies and drive better outcomes.

### Integrations and Flexibility

Cisco AppDynamics offers seamless integrations with a wide range of third-party tools and platforms. Whether you’re using cloud services like AWS and Azure or CI/CD tools like Jenkins and GitHub, AppDynamics can integrate into your existing workflows, providing a unified view of your application’s performance.

For background information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, applied to just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

**The Key Components of Observability**

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring: Monitoring involves continuously collecting and analyzing system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance (a minimal metrics-export sketch follows this list).

2. Logging: Logging involves recording events, activities, and errors within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.

3. Tracing: Tracing involves capturing the flow of requests and interactions between different distributed system components. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact.
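
As a minimal sketch of the monitoring pillar referenced in the list above, the snippet below exposes a request counter and a latency histogram with the Python prometheus_client library so a monitoring server can scrape them. The port, metric names, and simulated workload are illustrative assumptions.

```python
# Sketch: expose basic health metrics for scraping by a monitoring system.
# Assumes the prometheus_client package.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
    REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```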

**Benefits of Observability in Distributed Systems**

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting: Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization: By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management: Observability facilitates monitoring system changes and their impact on overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps maintain system stability and avoid unexpected issues.

What are VPC Flow Logs?

VPC Flow Logs is a feature offered by Google Cloud that captures and records network traffic information within Virtual Private Cloud (VPC) networks. Each network flow is logged, providing a comprehensive view of the traffic traversing the network. These logs include valuable information such as source and destination IP addresses, ports, protocol, and packet counts.

Once the VPC Flow Logs are enabled and data is being recorded, we can start leveraging the power of analysis. Google Cloud provides several tools and services for analyzing VPC Flow Logs. One such tool is BigQuery, a scalable and flexible data warehouse. By exporting VPC Flow Logs to BigQuery, we can perform complex queries, visualize traffic patterns, and detect anomalies using industry-standard SQL queries.
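
As a hedged sketch of that BigQuery-based analysis, the query below sums bytes by source and destination IP from an exported flow-logs table. The project, dataset, table pattern, and field names follow a typical flow-log export layout but are assumptions here; adjust them to match your own export configuration.

```python
# Sketch: top talkers from VPC Flow Logs exported to BigQuery.
# Assumes the google-cloud-bigquery package; names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

sql = """
SELECT
  jsonPayload.connection.src_ip AS src_ip,
  jsonPayload.connection.dest_ip AS dest_ip,
  SUM(CAST(jsonPayload.bytes_sent AS INT64)) AS total_bytes
FROM `my-gcp-project.flow_logs.compute_googleapis_com_vpc_flows_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
GROUP BY src_ip, dest_ip
ORDER BY total_bytes DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.src_ip, row.dest_ip, row.total_bytes)
```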

**How This Affects Failures**

The primary issue I have seen with my clients is that application failures are no longer predictable, and dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We now see many partial failures that are not just unexpected but have never been seen before. For example, recall the network hero.

**The Network Hero**

The network hero is someone who knows every part of the network and has seen every failure at least once. That experience is no longer enough in today's world; we need proper observability instead. When I was working as an engineer, we would have plenty of failures, but more than likely we had seen them before, and there was a system in place to fix the error. Today's environment is much different.

We can no longer rely on simply seeing an up-or-down status, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer's perspective. If your Pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers' perspective at what matters to them. Content delivery network (CDN) providers were among the first to realize this and to measure what matters most to the customer.

Distributed Systems Observability

The different demands

So, the new, modern, and complex distributed systems place very different demands on your infrastructure and the people who manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and is therefore slow to respond, causing a timeout
    • The microservice could have crashed or been stopped and is therefore unavailable
    • The microservice could be fine, but a slow-running database query could be dragging it down
    • So we have a lot of partial failures.

Consequently, we can no longer predict

The significant shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and sound system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But, they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization and more than likely managed by other people trying to monitor a very dispersed application with multiple components and services in various places. 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes: it assumes known failures and predictable failure patterns. So, we set thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems are either over or under the predictable thresholds that were previously set. Then, we can set alerts, and we hope that these alerts are actionable. This is only useful for variants of predictable failure modes.
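
For illustration only, here is a tiny sketch of that static-threshold style: a fixed CPU limit is chosen upfront and checked against sampled values. The threshold and the sampling function are made-up stand-ins, and the point is precisely that such a check says nothing about what the customer is experiencing.

```python
# Sketch: the classic static-threshold check that traditional monitoring relies on.
import random

CPU_ALERT_THRESHOLD = 80.0  # arbitrary limit chosen upfront

def read_cpu_percent() -> float:
    # Stand-in for an agent or exporter call; returns a simulated sample.
    return random.uniform(0.0, 100.0)

def check_cpu_and_alert() -> None:
    cpu = read_cpu_percent()
    if cpu > CPU_ALERT_THRESHOLD:
        # Fires whether or not any user-facing request is actually slow.
        print(f"ALERT: CPU at {cpu:.1f}% exceeds {CPU_ALERT_THRESHOLD}%")

if __name__ == "__main__":
    check_cpu_and_alert()
```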

Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred. But they don't let you dig into the source of the issue, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it much harder to get to the root cause in a reasonable timeframe.

Google Cloud Trace

Example: Application Latency & Cloud Trace

Before we discuss Cloud Trace’s specifics, let’s establish a clear understanding of application latency. Latency refers to the time delay between a user’s action or request and the corresponding response from the application. It includes network latency, server processing time, and database query execution time. By comprehending the different factors contributing to latency, developers can proactively optimize their applications for improved performance.

Google Cloud Trace is a powerful diagnostic tool offered by Google Cloud Platform (GCP) that enables developers to identify and analyze application latency bottlenecks. It provides detailed insights into the flow of requests and events within an application, allowing developers to pinpoint areas of concern and optimize accordingly. Cloud Trace integrates seamlessly with other GCP services and provides a comprehensive view of latency across various components of an application stack.
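
As a hedged sketch of how such latency data can reach Cloud Trace, the snippet below wraps a request handler and its database call in OpenTelemetry spans and exports them with a Cloud Trace exporter. The exporter package (opentelemetry-exporter-gcp-trace), the span names, and the simulated delays are assumptions for illustration, not details from this post.

```python
# Sketch: break request latency into spans and export them to Cloud Trace.
# Assumes opentelemetry-sdk plus the opentelemetry-exporter-gcp-trace package
# and application default credentials for the target project.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_checkout() -> None:
    with tracer.start_as_current_span("checkout-request"):      # total server time
        with tracer.start_as_current_span("db-query"):          # database portion
            time.sleep(0.05)  # simulated query
        with tracer.start_as_current_span("render-response"):   # processing portion
            time.sleep(0.01)  # simulated work

handle_checkout()
```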

Traditional style metrics systems

With traditional metrics systems, you had to define custom metrics, and they were always defined upfront. This approach prevents us from asking new questions about problems after the fact, so you had to determine the questions to ask before anything went wrong.

Then, we set performance thresholds, pronounce them "good" or "bad," and check and re-check those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. Instead of waiting for a problem, such as a certain threshold being reached, before acting, we should always be observing.

**Metrics: Lack of connective event**

The metrics did not retain the connective event, so you cannot ask new questions in the existing dataset. These traditional system metrics could miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is happening.

An example of this could be an odd number of running threads on one component, which might indicate garbage collection is in progress or that slow response times are imminent in an upstream service.

**User experience and static thresholds**

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in other ways, using various components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

We should have only a few alerts, triggered by symptoms that directly impact user experience, not simply because a threshold was reached.

The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in this regard. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static monitoring thresholds can’t reflect impacts on user experience. They lack context and are too coarse.

Required: Distributed Systems Observability

Systems observability and reliability in distributed systems are practices. Rather than just focusing on a tool for logs, metrics, or alerts, observability is all about how you approach problems, and for this, you need to look at your culture. So you could say that observability is a cultural practice that allows you to be proactive about finding issues instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint and want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers, physical or virtual, and the network, and what data transfer looks like both in transit and at rest.

Levels of Abstraction

What level of observation is needed to ensure everything performs as it should? What should you look at to obtain this level of detail?

Monitoring is knowing the data points and the entities from which we gather information. Observability, on the other hand, is putting all that data together. So monitoring is collecting data, and observability is assembling it into one single pane of glass: observing the patterns and deviations from the baseline, while monitoring gets the data into the systems. A vital part of an observability toolkit is service level objectives (SLOs).
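
Since SLOs are called out as a vital part of the toolkit, here is a small illustrative sketch of the arithmetic behind an availability SLO and its error budget. The 99.9% target and the request counts are made-up numbers, not figures from this post.

```python
# Sketch: availability SLO and remaining error budget over a 30-day window.
SLO_TARGET = 0.999           # 99.9% of requests should succeed (assumed target)
total_requests = 10_000_000  # made-up volume for the window
failed_requests = 4_200      # made-up failure count

availability = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests  # failures we can "afford"
budget_remaining = error_budget - failed_requests

print(f"availability:     {availability:.5%}")
print(f"error budget:     {error_budget:,.0f} failed requests")
print(f"budget remaining: {budget_remaining:,.0f} failed requests")
print("SLO met" if availability >= SLO_TARGET else "SLO breached")
```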

**Preference: Distributed Tracing**

We have three pillars of systems observability: metrics, traces, and logging. Defining or viewing observability as just having these pillars is an oversimplification, but you do need them in place. Observability is all about connecting the dots across each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in a service request's execution. As a result, it doesn't matter if services have complex dependencies. You could say that the complexity of dynamic systems is abstracted away with distributed tracing.

**Use Case: Challenges without tracing**

For example, latency can stack up if a downstream database service experiences performance bottlenecks, resulting in high end-to-end latency. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

**Distributed tracing: A winning formula**

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them challenging to debug unless their relationships are clearly understood.

In distributed systems, observability is vital in ensuring complex architectures’ stability, performance, and reliability. Monitoring, logging, and tracing provide engineers with the tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.

Summary: Distributed Systems Observability

In the vast landscape of distributed systems, observability is crucial in ensuring their reliable and efficient functioning. This blog post delves into the critical components of distributed systems observability and sheds light on their significance.

Telemetry

Telemetry forms the foundation of observability in distributed systems. It involves collecting, processing, and analyzing various metrics, logs, and traces. By monitoring and measuring these data points, developers gain valuable insights into the performance and behavior of their distributed systems.

Logging

Logging is an essential component of observability, providing a detailed record of events and activities within a distributed system. It captures important information such as errors, warnings, and informational messages, which aids in troubleshooting and debugging. Properly implemented logging mechanisms enable developers to identify and resolve issues promptly.

Metrics

Metrics are quantifiable measurements that provide a high-level view of the health and performance of a distributed system. They offer valuable insights into resource utilization, throughput, latency, error rates, and other critical indicators. By monitoring and analyzing metrics, developers can proactively identify bottlenecks, optimize performance, and ensure the smooth operation of their systems.

Tracing

Tracing allows developers to understand the flow and behavior of requests as they traverse through a distributed system. It provides detailed information about the path a request takes, including the various services and components it interacts with. Tracing is instrumental in diagnosing and resolving performance issues, as it highlights potential latency hotspots and bottlenecks.

Alerting and Visualization

Alerting mechanisms and visualization tools are vital for effective observability in distributed systems. Alerts notify developers when certain predefined thresholds or conditions are met, enabling them to take timely action. Visualization tools provide intuitive and comprehensive representations of system metrics, logs, and traces, making identifying patterns, trends, and anomalies easier.

Conclusion

In conclusion, the key components of distributed systems observability, namely telemetry, logging, metrics, tracing, alerting, and visualization, form a comprehensive toolkit for monitoring and understanding the intricacies of such systems. By leveraging these components effectively, developers can ensure their distributed systems’ reliability, performance, and scalability.