
Auto Scaling Observability

Observability in the context of autoscaling is a crucial aspect of managing and optimizing the scalability and efficiency of modern applications. This blog post will delve into autoscaling observability and its significance in today’s dynamic and rapidly evolving technological landscape.

 

Highlights: Auto Scaling Observability

  • The Role of the Metric

“What is a metric? Good for the known.” When it comes to auto-scaling observability and auto-scaling metrics, one needs to understand the limitations of the metric. A metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are disposable and cheap, and they have a predictable storage footprint.

A metric is a numerical representation of a system state over the recorded time interval and can tell you if a particular resource is over or underutilized at a specific moment. For example, CPU utilization might be at 75% right now.
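To make that concrete, here is a minimal sketch (illustrative field names, not any particular vendor's format) of what a metric sample amounts to: a name, a number, a timestamp, and optional tags:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class MetricSample:
        name: str                                  # e.g. "cpu.utilization"
        value: float                               # e.g. 75.0 (percent)
        timestamp: float = field(default_factory=time.time)
        tags: dict = field(default_factory=dict)   # e.g. {"host": "web-1"}

    sample = MetricSample("cpu.utilization", 75.0, tags={"host": "web-1"})
    print(sample)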

  • Prometheus Pull Approach

Many tools can gather metrics, such as Prometheus, and there are several techniques for collecting them, such as the PUSH and PULL approaches. There are pros and cons to each method, but Prometheus metric types and its PULL approach are prevalent in the market. However, if you want full observability and controllability, remember that you cannot rely solely on metrics-based monitoring solutions. For additional information on monitoring and observability and their differences, visit this post on observability vs monitoring.
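As a hedged illustration of the PULL approach, the sketch below uses the prometheus_client Python library to expose a /metrics endpoint that a Prometheus server can scrape on its own schedule; the port and metric name are illustrative:

    # pip install prometheus-client
    import random
    import time

    from prometheus_client import Gauge, start_http_server

    # A gauge is a metric that can go up and down, such as CPU utilization.
    cpu_utilization = Gauge("cpu_utilization_percent", "Current CPU utilization")

    start_http_server(8000)  # Prometheus PULLs from http://<host>:8000/metrics

    while True:
        cpu_utilization.set(random.uniform(20, 90))  # stand-in for a real reading
        time.sleep(5)

Note the inversion of control in this model: the application only exposes its current state, and the Prometheus server decides when and how often to collect it.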

 

Related: Before you proceed, you may find the following helpful

  1. Load Balancing
  2. Microservices Observability
  3. Network Functions
  4. Distributed Systems Observability

 



Auto Scaling Metrics

Key Auto Scaling Observability Discussion points:


  • Metrics are good for "known" issues. 

  • Challenges and issues around metrics for monitoring.

  • Observability considerations.

  • No need to predict.

  • Used for unknown/unknown failure modes.

 

Back to basics with Auto Scaling Observability

Understanding Autoscaling

Before we dive into observability, let’s briefly explore the concept of autoscaling. Autoscaling refers to the ability of an application or infrastructure to automatically adjust its resources based on demand. It enables organizations to handle fluctuating workloads and optimize resource allocation efficiently.

Observability, in the context of autoscaling, refers to gaining insights into an autoscaling system’s performance, health, and efficiency. It involves collecting, analyzing, and visualizing relevant data to understand the behavior and patterns of the application and infrastructure. Organizations can make informed decisions to optimize autoscaling algorithms, resource allocation, and overall system performance through observability.

Main Auto Scaling Observability Components

  • Metrics and Monitoring

  • Logging and Tracing

  • Alerting and Thresholds

  • Numerous Tools and Platforms.

Critical Components of Autoscaling Observability

To achieve effective autoscaling observability, several critical components come into play. These include:

Metrics and Monitoring: Gathering and monitoring key metrics such as CPU utilization, response times, request rates, and error rates are fundamental for understanding the performance of the application and infrastructure.

Logging and Tracing: Logging captures detailed information about events and transactions within the system, while tracing provides insights into the flow of requests across various components. Both logging and tracing contribute to a comprehensive understanding of system behavior.

Alerting and Thresholds: Setting up appropriate alerts and thresholds based on predefined criteria ensures timely notifications when specific conditions are met. This allows teams to investigate and act before an issue escalates.
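As a minimal sketch of threshold-based alerting (the threshold, sample window, and notification hook are placeholders, not any specific product's API):

    def notify(message):
        print(message)  # stand-in for a pager or webhook integration

    def check_threshold(samples, threshold=0.9):
        """Fire an alert when the average of recent samples crosses a threshold."""
        average = sum(samples) / len(samples)
        if average > threshold:
            notify(f"ALERT: average {average:.2f} exceeded threshold {threshold}")

    check_threshold([0.92, 0.95, 0.91])  # fires: average 0.93 exceeds 0.9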

Tools and Technologies for Autoscaling Observability

A wide range of tools and technologies are available to facilitate autoscaling observability. Prominent examples include Prometheus, Grafana, Elasticsearch, Kibana, and CloudWatch. These tools provide robust monitoring, visualization, and analysis capabilities, enabling organizations to gain deep insights into their autoscaling systems.
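For instance, Prometheus exposes an HTTP query API. The hedged sketch below (assuming a Prometheus server reachable at localhost:9090 and node-exporter-style metric names) asks for per-instance CPU usage:

    # pip install requests
    import requests

    # Prometheus instant-query endpoint; the PromQL expression is illustrative.
    resp = requests.get(
        "http://localhost:9090/api/v1/query",
        params={"query": "sum by (instance) (rate(node_cpu_seconds_total[5m]))"},
        timeout=5,
    )
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("instance"), result["value"][1])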

The first component of observability is the channels that convey observations to the observer. There are three channels: logs, traces, and metrics. These channels are common to all areas of observability, including data observability.

  • Logs

Logs are the most typical channel and take several forms (e.g., a line of free text, or JSON). Logs are intended to encapsulate information about an event.
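A minimal sketch of a structured JSON log line using only the Python standard library (the event and field names are illustrative):

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO)

    def log_event(event, **fields):
        """Emit one event as a single JSON log line."""
        record = {"ts": time.time(), "event": event, **fields}
        logging.getLogger("app").info(json.dumps(record))

    log_event("order_created", order_id="o-123", user_id="u-456", amount=42.5)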

  • Traces

Traces allow you to do what logs don’t: reconnect the dots of a process. Because traces represent the links between all events of the same process, they allow the whole context to be derived from logs efficiently. Each pair of events, forming an operation, is a span, and spans can be distributed across multiple servers.
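A minimal sketch of spans using the OpenTelemetry Python SDK (the span names and attribute are illustrative; a real deployment would export to a tracing backend rather than the console):

    # pip install opentelemetry-sdk
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("checkout")

    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.route", "/checkout")
        with tracer.start_as_current_span("query_database"):
            pass  # the nested block becomes a child span of the same trace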

  • Metrics

Finally, we have metrics. Every system state has some component that can be represented with numbers, and these numbers change as the state changes. Metrics provide a basis of information that allows an observer not only to understand the system using factual information but also to leverage mathematical methods to derive insights from even a large number of metrics (e.g., the CPU load, the number of open files, the average number of rows, the minimum date).

 

Diagram: Auto scaling observability and metric overload.

 


Metrics: Resource Utilization Only

So, metrics tell us about resource utilization. Within a Kubernetes environment, these metrics are used for auto-healing and auto-scheduling. When it comes to metrics, monitoring performs several functions. First, it collects, aggregates, and analyzes metrics to sift through known patterns that indicate troubling trends.

The critical point here is that it sifts through known patterns. Then, based on a known event, metrics trigger alerts that notify us when further investigation is needed. Finally, on top of all of this, we have dashboards that display the metric data trends, adapted for visual consumption.
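To make the “known pattern” point concrete, Kubernetes’ Horizontal Pod Autoscaler scales on exactly this kind of metric. Its documented rule, sketched below in simplified form (the real controller adds tolerances and stabilization windows), is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric):

    import math

    def desired_replicas(current_replicas, current_metric, target_metric):
        """Simplified Kubernetes HPA scaling rule."""
        return math.ceil(current_replicas * (current_metric / target_metric))

    # Four pods averaging 90% CPU against a 60% target scale out to six.
    print(desired_replicas(4, 90, 60))  # -> 6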

These monitoring systems work well for identifying previously encountered known failures, but they don’t help as much with the unknown. Unknown failures are the norm today, given distributed systems and complex system interactions.

Metrics are suitable for dashboards, but there won’t be a predefined dashboard for unknowns; a dashboard can’t track something it does not know about. Using metrics and dashboards like this is a very reactive approach, yet it’s an approach widely accepted as the norm. Monitoring is a reactive approach best suited for detecting known problems and previously identified patterns.

Metrics and intermittent problems?

So, metrics can tell you whether a microservice is healthy or unhealthy within a microservices environment. Still, a metric will have difficulty telling you if a microservice’s function takes a long time to complete or if there is an intermittent problem with an upstream or downstream dependency. We need different tools to gather this type of information.

We have an issue with auto-scaling metrics because they only look at individual microservices with a given set of attributes, so they don’t give you a holistic view of the problem. The application stack now exists in numerous locations and location types, and we need a holistic viewpoint.

A metric does not give us this. Metrics track simplistic system states that might indicate a service is running poorly, or that might be a leading indicator or early warning signal. However, while those measures are easy to collect, they don’t turn out to be proper measures for triggering alerts.

Auto-scaling metrics: Issues with dashboards (useful only for a few metrics)

So, these metrics are gathered and stored in time-series databases, and we have several dashboards to display them. When dashboards were first built, there weren’t many system metrics to worry about; you could have gotten away with 20 or so dashboards, and that was about it. As a result, it was easy to see the critical data anyone should know about for any given service. Those systems were simple and did not have many moving parts. This contrasts with modern services, which typically collect so many metrics that fitting them all into the same dashboard is impossible.

Auto-scaling metrics: Issues with aggregate metrics

So, we must find ways to fit all the metrics into a few dashboards. Here, the metrics are often pre-aggregated and averaged. The issue is that the aggregate values no longer provide meaningful visibility, even when we have filters and drill-downs. Therefore, we need to predeclare the conditions we expect to see in the future.

This is where we fall back on instinct, past experience, and gut feeling. Remember the network and software hero? It is best to avoid aggregation and averaging within the metrics store. Percentiles, on the other hand, offer a richer view; keep in mind, however, that they require raw data.
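A quick sketch of why averages mislead and why percentiles need the raw data (the numbers are made up): the mean below looks tolerable, while the 99th percentile exposes the slow tail.

    latencies_ms = [20] * 980 + [2000] * 20  # 2% of requests are very slow

    mean = sum(latencies_ms) / len(latencies_ms)
    p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]  # crude percentile

    print(f"mean: {mean} ms")  # 59.6 ms, looks tolerable
    print(f"p99:  {p99} ms")   # 2000 ms, the tail the average hides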

Auto Scaling Observability: Ask Any Question

Auto-scaling observability takes an entirely different approach: it strives for exploratory methods of finding problems. Essentially, those operating observability systems don’t sit back and wait for an alert or for something to happen. Instead, they are always actively looking, asking the observability system whatever questions come to mind.

Observability tools should gather rich telemetry for every possible event, capturing the full content of every request, and then be able to store and query it. In addition, these new auto-scaling observability tools are specifically designed to query against high-cardinality data. High cardinality allows you to interrogate your event data in any arbitrary way you see fit: you can ask any question about your system and inspect its corresponding state.
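A toy sketch of what “ask any question” means in practice: if every request is stored as a wide event with many fields, arbitrary ad-hoc filters become possible after the fact (the event fields are illustrative):

    events = [
        {"route": "/checkout", "user": "u-17", "region": "eu-west",
         "duration_ms": 1800, "status": 200},
        {"route": "/checkout", "user": "u-42", "region": "us-east",
         "duration_ms": 35, "status": 200},
        {"route": "/login", "user": "u-17", "region": "eu-west",
         "duration_ms": 40, "status": 500},
    ]

    # A question nobody predeclared: slow checkouts in one region.
    slow = [e for e in events
            if e["route"] == "/checkout"
            and e["region"] == "eu-west"
            and e["duration_ms"] > 1000]
    print(slow)  # -> the single 1800 ms checkout event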

 

Key Auto Scaling Observability Considerations

No predictions in advance.

Due to the nature of modern software systems, you want to understand any inner state and services without anticipating or predicting them in advance. For this, we need to gain valuable telemetry and use some new tools and technological capabilities to gather and interrogate this data once it has been collected. Telemetry needs to be constantly gathered in flexible ways to debug issues without predicting how failures may occur. 

The conditions affecting infrastructure health change infrequently and are relatively easy to monitor. In addition, we have several well-established practices for the predictable, such as capacity planning, and the ability to remediate automatically, e.g., auto-scaling in a Kubernetes environment. All of these can be used to tackle known issues.

Diagram: Auto Scaling Observability and observability tools.

 

Because infrastructure is relatively predictable and slowly changing, the aggregated metrics approach monitors and alerts perfectly well for infrastructure problems. So here, a metrics-based system works well. Metrics-based systems and their associated signals help you see when capacity limits or known error conditions of underlying systems are being reached.

So, metrics-based systems work well for infrastructure problems that don’t change much but fall dramatically short in complex distributed systems. You should opt for an observability and controllability platform for these systems. 

 

Summary: Understanding Autoscaling

Autoscaling is a mechanism that automatically adjusts the number of computing resources allocated to an application based on its demand. By dynamically scaling resources up or down, autoscaling enables organizations to handle fluctuating workloads efficiently. However, to truly harness the power of autoscaling, it is crucial to have robust observability in place.

Section 1: The Role of Observability in Autoscaling

Observability is the ability to gain insights into the internal state of a system based on its external outputs. When it comes to autoscaling, observability plays a pivotal role in understanding the system’s behavior, identifying bottlenecks, and making informed scaling decisions. It provides visibility into key metrics like CPU utilization, memory usage, and network traffic. With observability, you can make data-driven decisions and ensure optimal resource allocation.

Section 2: Monitoring and Metrics

To achieve effective autoscaling observability, comprehensive monitoring is essential. Monitoring tools collect various metrics, such as response times, error rates, and resource utilization, to provide a holistic view of your infrastructure. These metrics can be analyzed to identify patterns, detect anomalies, and trigger autoscaling actions when necessary. You can proactively address performance issues and optimize resource utilization by monitoring and analyzing metrics.

Section 3: Logging and Tracing

In addition to monitoring, logging and tracing are critical components of autoscaling observability. Logging captures detailed information about system events, errors, and activities, enabling you to troubleshoot issues and gain insights into system behavior. Tracing helps you understand the flow of requests across different services. Together, logging and tracing provide a granular view of your application’s performance, aiding in autoscaling decisions and ensuring smooth operation.

Section 4: Automation and Alerting

To truly master autoscaling observability, automation and alerting mechanisms are vital. By setting up automated processes, you can configure thresholds and triggers that initiate autoscaling actions based on predefined conditions. This allows for proactive scaling, ensuring your system is constantly optimized for performance. Additionally, timely alerts can notify you of critical events or anomalies, enabling you to take immediate action and maintain the desired scalability.

Conclusion:

Autoscaling observability is the key to unlocking the true potential of autoscaling. By understanding the behavior of your system through comprehensive monitoring, logging, and tracing, you can make informed decisions and ensure optimal resource allocation. With automation and alerting mechanisms in place, you can proactively respond to changing demands and maintain high efficiency. Embrace autoscaling observability and take your infrastructure management to new heights!

 

System Observability

Distributed Systems Observability

In today’s technology-driven world, distributed systems have become the backbone of numerous applications and services. These systems are designed to handle large-scale data processing, ensure fault tolerance, and provide high scalability. However, managing and monitoring distributed systems can be challenging. This is where observability comes into play. In this blog post, we will explore the significance of distributed systems observability and how it enables efficient management and troubleshooting.

Distributed systems observability refers to the ability to gain insights into the inner workings of a distributed system. It encompasses monitoring, logging, and tracing capabilities that allow engineers to effectively understand system behavior, performance, and potential issues. By adopting observability practices, organizations can ensure the smooth operation of their distributed systems and identify and resolve problems quickly.

 

Highlights: Distributed Systems Observability

  • The Role of Megatrends

A considerable wave of innovation has spawned several megatrends that affect how we manage and view our network infrastructure, driving the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of the monolith, where everything is generally housed internally, we now have many services and dependencies in multiple locations (aka microservices observability) that must be managed and operated. The megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolithic era, forcing us to look at different systems observability tools and network visibility practices.

  • Shift in Control

There has also been a shift in the point of control. We are moving towards new technologies, and many of the loosely coupled services or infrastructures your services rely upon are not under your control. The edge of control has been pushed out, creating different network and security perimeters. These perimeters are now closer to the workload than a central security stack; therefore, the workloads themselves are concerned with security.

 

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

 



Distributed Systems Observability

Key Distributed Systems Observability points:


  • We no longer have predictable failures.

  • The different demands on networks.

  • The issues with the metric-based approach.

  • Static thresholds and alerting.

  • The 3 pillars of Observability.

 

Back to Basics with Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have applied to just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

 

Diagram: Systems Observability design.

The Key Components of Observability:

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring:

Monitoring involves the continuous collection and analysis of system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.

2. Logging:

Logging involves the recording of events, activities, and errors occurring within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.

3. Tracing:

Tracing involves capturing the flow of requests and interactions between different components of the distributed system. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact with each other.

Benefits of Observability in Distributed Systems:

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting:

Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization:

By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management:

Observability facilitates the monitoring of system changes and their impact on the overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps in maintaining system stability and avoiding unexpected issues.

 

How This Affects Failures

The primary issue I have seen with my clients is that application failures are no longer predictable: dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have a lot of partial failures that are not just unexpected but have never been seen before. For example, if you recall, we have the network hero.

 

The network hero

The network hero is someone who knows every part of the network and has seen every failure at least once. That role is no longer enough in today’s world; we need proper observability instead. When I was working as an engineer, we would have plenty of failures, but more than likely we would have seen them before, and there was a system in place to fix the error. Today’s environment is much different.

We can no longer rely on simply seeing UP or DOWN, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer’s perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers’ perspective at what matters to them. Content delivery networks (CDNs) were among the first to realize this and to measure what matters most to the customer.

 

Distributed Systems Observability

The different demands

So the new, modern, complex distributed systems place very different demands on your infrastructure and on the people who manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and therefore slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice could be fine, but there could be slow-running database queries.

So we have a lot of partial failures.

 

Therefore: We can no longer predict

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and good system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. However, they work in siloed environments, and there is a lack of connectivity. We have monitoring tools working in silos in different parts of the organization, more than likely managed by different people, trying to monitor a very dispersed application with multiple components and services in various places.

 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes. It depends on known failures and predictable failure modes, with predictable thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems go over or under the thresholds that were previously set. We can then set alerts and hope that these alerts are actionable. This is only useful for variants of predictable failure modes.

Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred. But they don’t let you dig into the source of the problem, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it harder to get to the root cause in a reasonable timeframe.

 

Traditional style metrics systems

With traditional-style metrics systems, you had to define custom metrics, and they were always defined upfront. With this approach, we can’t ask new questions about problems; you have to define the questions to ask upfront.

Then we set performance thresholds, pronounce them “good” or “bad,” and check and re-check those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to predict how a system can fail. Always observe instead of waiting for problems, such as waiting for a certain threshold to be reached before acting.

Diagram: System Observability analysis.

 

Metrics: Lack of connective event

Metrics do not retain the connective event. As a result, you cannot ask new questions of the existing dataset. These traditional system metrics could miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is actually happening.

An example of this could be an odd number of running threads on one component that might indicate garbage collection is in progress. It might also indicate that slow response times might be imminent in an upstream service.

 

User experience and static thresholds

User experience means different things to different sets of users. We now have a model where different users of a service may be routed through the system in different ways, using different components, and experiencing behavior that can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over.

We should have only a few alerts, triggered by symptoms that directly impact user experience, not because a threshold was reached.

 

  • The Challenge: Can’t reliably indicate any issues with user experience

Static thresholds can’t reliably indicate issues with user experience. Alerts should be set up to detect failures that impact user experience, and traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.

 

The Need For Distributed Systems Observability

Systems observability and reliability in distributed systems is a practice. Rather than just focusing on a tool that does logging, metrics, or alerting, observability is all about how you approach problems, and for this, you need to look at your culture. You could say that observability is a cultural practice: it allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint: we want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual) and the network, and what the data looks like in transit and at rest.

What level of observation do you need so you know that everything is performing as it should? And what should you be looking at to get this level of detail?

Monitoring is knowing the data points and the entities we are gathering from. Observability, on the other hand, is putting all the data together. So monitoring is collecting data, and observability is assembling it into one single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is getting the data and putting it into the systems. A vital part of an observability toolkit is service level objectives (SLOs).
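As a minimal sketch of an SLO calculation (the target and request counts are illustrative), an availability SLO compares good events against total events and tracks how much of the error budget has been burned:

    def slo_report(good, total, target=0.999):
        """Availability SLO: fraction of good requests versus an agreed target."""
        availability = good / total
        error_budget = 1 - target                     # allowed failure fraction
        budget_used = (1 - availability) / error_budget
        return availability, budget_used

    availability, budget_used = slo_report(good=999_100, total=1_000_000)
    print(f"availability={availability:.4%}, error budget used={budget_used:.0%}")

In this made-up example, 900 failed requests against a 99.9% target mean 90% of the error budget is already spent, which is a far more customer-centric signal than a CPU threshold.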

 

The three pillars of distributed systems observability

We have three pillars of systems observability: metrics, traces, and logging. It is an oversimplification to define or view observability as just these pillars, but you do need them in place, because observability is all about connecting the dots between them.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in a service request’s execution. As a result, it doesn’t matter if services have complex dependencies. You could say that the complexity of dynamic systems is abstracted away with distributed tracing.

 

  • Use Case: Challenges without tracing.

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

 

  • Distributed tracing: A winning formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Distributed tracing shows the relationships between the various services and components in a distributed system, helping you understand system interdependencies. Those interdependencies can obscure problems and make them challenging to debug unless the relationships between them are clearly understood.

Conclusion:

In the world of distributed systems, observability plays a vital role in ensuring the stability, performance, and reliability of complex architectures. Monitoring, logging, and tracing provide engineers with the necessary tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.