Chaos Engineering Kubernetes


When considering Chaos Engineering for Kubernetes, we must start from the beginning. Not long ago, applications ran in a single private data center, perhaps two for high availability. These data centers were on-premises, and all components were housed internally. Life was easy: troubleshooting and monitoring could be handled by a single team, if not a single person, with predefined dashboards. Failures were known, and the capacity planning strategy did not change much. The network and infrastructure had fixed perimeters and were fairly static, and there weren't many changes to the stack on any given day. Agility was at an all-time low, but that did not matter for the environments in which the application and infrastructure were housed.

Chaos Engineering Principles

Diagram: Chaos Engineering Principles.


However, nowadays we are in a completely different environment. Complexity is at an all-time high, and business agility is a critical factor. We now have distributed applications with components and services located in many different places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. So, in this land of complexity, we need to find system reliability. A reliable system is one you can trust to behave as expected, even under turbulent conditions.



Diagram: System Reliability: Link to YouTube video.


Beyond the Complexity Horizon

Therefore, monitoring and troubleshooting are a lot harder, especially as everything is interconnected in ways that make it difficult for a single person on one team to fully understand what is going on. The edge of the network and the application boundary surpass one location and one team. Enterprise systems have gone beyond the complexity horizon: no one can understand every bit of every single system, not even a developer who is close to the system and truly understands the nuts and bolts of the application and its code. So being able to find the correct information is essential, but once you find it, you have to get it to those who can fix it. Monitoring is not just about finding out what is wrong; it needs to alert, and those alerts need to be actionable.


Troubleshooting: Chaos Engineering Kubernetes

The goal of Chaos Engineering is to improve the reliability of a system by ensuring it can withstand turbulent conditions. Chaos Engineering makes Kubernetes environments more resilient, so if you are adopting Kubernetes, you should adopt Chaos Engineering and make it an integral part of your monitoring and troubleshooting strategy. First, we can pinpoint application errors and understand, as best we can, how those errors arose. This could be anything from badly ordered scripts on a web page to, let's say, a database query with bad SQL calls, or even unoptimized code-level issues. Or there could be something more fundamental going on: it is common to have issues with how something is packaged into a container. You can pull in the incorrect libraries, or you may be using a debug version of the container image. Or there could be nothing wrong with the packaging and containerization at all; it may be all about where the container is being deployed. Here, there could be something wrong with the infrastructure, either a physical or a logical problem: a bad configuration or a hardware fault somewhere in the application path.


Diagram: Chaos Engineering Kubernetes. Link to YouTube video.


Non-ephemeral and ephemeral services

With the introduction of containers and microservices, monitoring solutions need to manage both non-ephemeral and ephemeral services. We are collecting data for applications that consist of many different services. So when it comes to container monitoring and performing Chaos Engineering Kubernetes tests, we need to fully understand the nature of the application and the platform it runs on. Everything is dynamic by nature, and you need monitoring and troubleshooting in place that can handle this dynamic and transient behavior. When it comes to monitoring a containerized infrastructure, you should consider the following.


    • Container Lifespan: Containers have a short lifespan; they are provisioned and decommissioned based on demand. This is in contrast to VM or bare-metal workloads, which generally live longer. As a generic guideline, containers have an average lifespan of 2.5 days, while traditional and cloud-based VMs average 23 days. Containers can move, and they move frequently: one day cluster host A could be running workload A, and the next day, or even the same day, it could be hosting workload B. Therefore, a failure can have different types of impact depending on the time of day.


    • Containers are Temporary: Containers are dynamically provisioned for specific use cases on a temporary basis. For example, we could spin up a new container from a certain image; new network connections, storage, and integrations to other services that make the application work are then set up for that container. All of this is done dynamically and can be temporary.


    • Different levels of monitoring: There are many different levels to monitor in a Kubernetes environment, and the components that make up the deployment all affect application performance. We have, for example, nodes, pods, and application containers, and we have monitoring at different layers, such as the VM, storage, and microservice levels.


    • Microservices change fast and often: Applications composed of microservices are constantly changing, with new microservices added and existing ones decommissioned in quick succession. What does this mean for usage patterns? It results in different usage patterns on the infrastructure, and if everything is changing often, it can be hard to derive a baseline and build a topology map unless you have something automatic in place.


    • Metric overload: We now have loads of metrics. With the different levels of containers and infrastructure, there are additional metrics to consider: metrics for the nodes, the cluster components, the cluster add-ons, the application runtime, and custom application metrics. Compare this with a traditional application stack, where we would use metrics for components such as the operating system and the application.


      • A Key Point: Metric Explosion

In the traditional world, we did not have to be concerned with additional components such as an orchestrator, or with the dynamic nature of many containers. With a container cluster, we have to consider metrics from the operating system and the application, but also from the orchestrator and the containers. We refer to this as a metric explosion: there are now loads of metrics that need to be gathered. There are also different ways to pull or scrape these metrics. Prometheus is common in the world of Kubernetes and uses a very scalable pull approach, collecting metrics from HTTP endpoints exposed either through Prometheus client libraries or exporters.
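The pull model above can be illustrated with a short sketch: a client library or exporter renders metrics in Prometheus's plain-text exposition format, and the Prometheus server scrapes that text over HTTP. The function and metric names below are illustrative, not taken from any specific exporter.

```python
def render_metrics(counters):
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to (help text, current value).
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Two illustrative application counters a Prometheus server would scrape.
payload = render_metrics({
    "http_requests_total": ("Total HTTP requests served.", 1027),
    "pod_restarts_total": ("Container restarts observed.", 3),
})
print(payload)
```

In practice, an exporter serves this text on a `/metrics` endpoint, and Prometheus pulls it on a schedule rather than the application pushing it.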



Diagram: Prometheus Monitoring Application: Scraping Metrics.


      • A Key Point: What happens to visibility  

So we need complete visibility now more than ever, and not just into single components but at a holistic level. We need to monitor far more data points than we had to in the past: the application servers, the pods and containers, the clusters running the containers, the network for service/pod/cluster communication, and the host OS. All of the monitoring data needs to land in a central place so trends can be seen and queries against the data can be acted on. In a large multi-tier application with Docker containers, correlating local logs would be challenging. We can use log forwarders, or log shippers, such as Fluentd or Logstash, which can transform and ship logs to a backend such as Elasticsearch.
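The transform step a log shipper such as Fluentd or Logstash performs can be sketched simply: parse a raw log line into a structured document that a backend like Elasticsearch can index. The log line format and field names here are assumed purely for illustration.

```python
import json
import re

# Hypothetical log line format: "<ISO timestamp> <LEVEL> <service> <message>"
LINE = re.compile(r"^(\S+) (\w+) (\S+) (.*)$")

def to_document(line):
    """Turn one raw log line into a structured document for indexing."""
    m = LINE.match(line)
    if not m:
        return None  # unparsable lines could be routed to a dead-letter index
    ts, level, service, message = m.groups()
    return {"@timestamp": ts, "level": level,
            "service": service, "message": message}

doc = to_document("2024-05-01T10:00:00Z ERROR checkout payment gateway timeout")
print(json.dumps(doc))
```

Once logs are structured like this, queries such as "all ERROR lines from the checkout service in the last hour" become trivial in the central backend.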


      • A Key Point: New Avenues for Monitoring

Containers are now the norm for managing workloads and adapting quickly to new markets, and new avenues have opened up for monitoring these environments. We have, for example, AppDynamics and Elasticsearch, which is part of the ELK stack, along with the various log shippers that provide a welcome layer of unification. We also have Prometheus to gather metrics; keep in mind that Prometheus works in the land of metrics only. There are different ways to visualize all this data, such as Grafana and Kibana.



Diagram: Prometheus Monitoring: Link to YouTube video.




Microservices Complexity: Management is Complex

So with the move towards microservices, we get the benefits of scalability and business continuity, but management is very complex; the monolith is much easier to manage and monitor. Because microservices are separate components, they don't need to be written in the same language or even with the same toolkits, so you can mix and match technologies. There is a lot of flexibility with this approach, but also increased latency and complexity. There are a lot more moving parts: reverse proxies, load balancers, firewalls, and other supporting infrastructure services. What used to be method calls or interprocess calls within the monolith's host now go over the network and are susceptible to deviations in latency.


  • Debugging Microservices

With the monolith, the application runs in a single process, and it is relatively easy to debug; a lot of the traditional tooling and code instrumentation technologies were built assuming a single process. With microservices, we have a completely different, distributed application: it now has multiple processes running in different places, and debugging it is the core challenge. Much of the tooling we have today was built for traditional monolithic applications. There are new monitoring tools for these new applications, but there is a steep learning curve and a high barrier to entry; new tools and technologies such as distributed tracing and Chaos Engineering for Kubernetes are not the simplest to pick up on day one.


  • Automation and Monitoring: Checking and Health Checks

Automation comes into play with the new environment. With automation, we can run periodic checks, not just on the functionality of the underlying components but also health checks on how the application is performing. All of this can be automated at specific intervals or in reaction to certain events. With the rise of complex systems and microservices, it is more important than ever to have real-time monitoring of performance and of the metrics that tell you how the systems behave. For example, what is the usual RTT, and how long do transactions take under normal conditions?
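A health check of this kind can be sketched as a small probe wrapper that records whether the check passed and how long it took (the RTT). The function name, the result fields, and the timeout value are all illustrative.

```python
import time

def check_health(probe, timeout_s=1.0):
    """Run one health probe and record its round-trip time.

    `probe` is any callable returning True/False, e.g. an HTTP ping
    against a /healthz endpoint; exceptions count as unhealthy.
    """
    start = time.monotonic()
    try:
        healthy = bool(probe())
    except Exception:
        healthy = False
    rtt = time.monotonic() - start
    return {"healthy": healthy, "rtt_s": rtt, "slow": rtt > timeout_s}

# A stand-in probe; in practice this would call a real service endpoint.
result = check_health(lambda: True)
print(result["healthy"])
```

Run on an interval (or triggered by events), results like these feed the baseline for "how long do transactions take under normal conditions."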


The Rise of Chaos Engineering

There is a growing complexity of infrastructure, and let's face it, a lot can go wrong. It is imperative to have a global view of all the components of the infrastructure and a good understanding of application performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and trying to manually validate the health of each piece is hard to do. In these new environments, especially cloud-native at scale, complexity is at its highest, and there are far more things that can go wrong. For this reason, you need to prepare as much as possible so the impact on users is minimal.

So the dynamic deployment patterns that you get with frameworks like Kubernetes allow you to build better applications, but you need to be able to examine the environment and see whether it is working as expected. Most importantly, and the focus of this course, is that to prepare effectively, you need a solid strategy for monitoring production environments.




Diagram: Chaos Engineering Testing.


    • Chaos Engineering Kubernetes

For this, you need to understand practices like Chaos Engineering, the available Chaos Engineering tools, and how they can improve the reliability of the overall system. Chaos Engineering is the ability to perform tests in a controlled way. Essentially, we are breaking things on purpose in order to learn how to build more resilient systems: we inject a variety of issues and faults in a controlled way so we can make the overall application more robust. In reality, it comes down to a trade-off and your willingness to accept it. There is a considerable trade-off with distributed computing: you have to monitor efficiently, have performance management in place, and, more importantly, accurately test the distributed system in a controlled manner.
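The idea of injecting faults in a controlled way can be sketched as a wrapper that makes a call fail or slow down at a configurable rate; seeding the random generator keeps the experiment reproducible. This is a toy illustration of the principle, not the behavior of any particular Chaos Engineering tool.

```python
import random
import time

def chaos(fn, failure_rate=0.2, max_delay_s=0.05, rng=None):
    """Wrap fn so calls sometimes fail or slow down, in a controlled,
    reproducible way (pass a seeded rng for repeatable experiments)."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("chaos: injected fault")
        time.sleep(rng.uniform(0, max_delay_s))  # injected latency
        return fn(*args, **kwargs)

    return wrapped

# Hypothetical downstream call wrapped with a 50% injected failure rate.
flaky_lookup = chaos(lambda key: key.upper(), failure_rate=0.5,
                     max_delay_s=0, rng=random.Random(7))
```

Running the wrapped call under load then answers the real question: does the rest of the system retry, degrade gracefully, or fall over when this dependency misbehaves?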


    • Service Mesh Chaos Engineering

Service Mesh is one option for implementing Chaos Engineering. You can also implement it with Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates tests in the Kubernetes environment. The Chaos Mesh project offers a rich selection of experiment types, currently including pod lifecycle tests, network tests, Linux kernel and I/O tests, and many other types of stress tests. Implementing practices like Chaos Engineering will help you understand and better manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.
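As a sketch of what such an experiment looks like, here is a representative Chaos Mesh PodChaos manifest that kills one pod matching a label selector; the experiment name, namespaces, and labels are illustrative.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
  namespace: chaos-testing
spec:
  action: pod-kill        # other actions include pod-failure and container-kill
  mode: one               # affect a single randomly chosen matching pod
  selector:
    namespaces:
      - demo              # illustrative target namespace
    labelSelectors:
      app: web            # illustrative target workload label
```

Applied to the cluster, this declaratively describes the blast radius (one `app: web` pod in the `demo` namespace) rather than requiring you to pick and kill a pod by hand.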


System Reliability in an Unpredictable World

There have been considerable shifts in our environmental landscape that have caused us to examine how we operate and run our systems and networks. The introduction of various cloud platforms and their services, and of containers, along with the complexity of managing distributed systems, has unveiled large gaps in current practices and in the technologies we use, not to mention the flaws in the operational practices around those technologies. All of this has driven a welcome wave of innovation in system reliability. Yet some of the technologies and tools used to manage these innovations have not kept pace; many have stayed relatively static in our dynamic environment. So we have static tools used in a dynamic environment, which causes friction.

The big shift we see with software platforms is that they are evolving much more quickly than the products and paradigms we use to monitor them. We need to consider new practices and technologies, with dedicated platform teams, to enable a new era of system reliability.


    • Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they capture and examine signals in isolation. These monitoring systems work in a siloed environment, similar to developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the "unknown unknowns" common in modern distributed systems, which often leads to service disruptions. So you may be asking what an "unknown unknown" is.

I'll put it to you this way: the distributed systems we see today don't have much, if any, predictability; certainly not enough to rely on static thresholds, alerts, and old monitoring tools. If something is static, it can be automated. We do have static events, such as a pod in Kubernetes reaching a resource limit; a ReplicaSet then introduces another pod on a different node, as long as certain parameters such as Kubernetes labels and node selectors are met. However, this is only a small piece of the failure puzzle in a distributed environment. Today, we have what are known as partial failures, and systems that fail in very creative ways.
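The static, automatable piece mentioned above can be seen in an ordinary pod spec: labels identify the workload, and a node selector constrains where the scheduler may place it. The label keys, values, and image here are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web              # label that services and chaos tests select on
spec:
  nodeSelector:
    disktype: ssd         # schedule only onto nodes labelled disktype=ssd
  containers:
    - name: web
      image: nginx:1.25
```

This is exactly the predictable, declarative behavior that automation handles well; the partial failures discussed next are what fall outside it.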


    • System Reliability: Creative Ways to Fail

So we know that some of these failures are easily predicted, and actions are taken. For example, if a Kubernetes node reaches a certain utilization, we can automatically reschedule pods onto a different node to stay within our known scale limits. We have predictable failures that can be automated, and not just in Kubernetes but in any infrastructure; an Ansible script is useful when we have predictable events. However, we have much more to deal with than pod scaling: we have many partial failures, and complicated failures known as black holes.


    • Today’s World of Partial Failures

Microservices applications are distributed and susceptible to many external factors. If you examine the traditional monolithic application style, on the other hand, all the functions reside in the same process. It was either switched on or off; not much happened in between. If there was a failure in the process, the application as a whole failed. The results were binary, usually either up or down, and with some basic monitoring in place this was easy to detect, and failures were predictable. There was no such thing as a partial failure: in a monolithic application, all application functions live within the same process, and a major benefit of the monolith is that you don't have partial failures.

However, in a cloud-native world, where we have taken the old monolith and broken it into a microservices-based application, a request made from a client can go through multiple hops of microservices, and we can have several problems to deal with. There is a lack of connectivity between the different domains: there will be many monitoring tools and much knowledge tied to each domain, and alerts are often tied to threshold or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is the key metric to care about.



Diagram: Chaos Engineering – How to start a project: Link to YouTube video.


Today You Have No Way to Predict

So the new, modern, and complex distributed systems place very different demands on your infrastructure, considerably different from the simple three-tier application where everything was generally housed in one location. We really can't predict much anymore, which puts the brakes on some traditional monitoring approaches. When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcome strategy.


A Quick Note on Blackholes: Strange Failure Modes

When considering a distributed system, many things can happen. A service or region can disappear, or vanish for a few seconds or milliseconds and then appear again. We consider this going into a black hole: anything that goes into a black hole disappears, and these strange failure modes are unexpected and surprising. There is certainly nothing predictable about them. So what happens when your banking transactions are in a black hole? What if your bank balance is displayed incorrectly, or you make a transfer to an external account and it does not show up? I did a demo on this in my training course, where I examined the effects of black holes on system reliability using a sample application called Bank of Anthos, in the course DevOps: Operational Strategies.


Highlighting Site Reliability Engineering (SRE) and Observability

The practices of Site Reliability Engineering (SRE) and Observability are what is needed to manage these types of unpredictable and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems, so you need to be proactive and fix these issues up front; you need to get ahead of the curve to stop incidents from occurring. This realization usually happens in the wake of a massive incident, which acts as a teachable moment and gives you the leverage to make the case for a Chaos Engineering project.



Diagram: Site Reliability Engineering and Observability: Link to YouTube video.


  • New Tools and Technologies: Distributed Tracing

We have new tools such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so we can figure out where the time has been spent, and it can be used across a distributed microservice architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
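OpenTelemetry's client libraries provide the real APIs for this; purely to illustrate what a span records, here is a minimal hand-rolled sketch (the names and fields are our own, not the OpenTelemetry API). Each span captures a name, its duration, and the IDs that link it to its trace and parent span.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in a real system, finished spans are exported to a tracing backend

@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one timed operation; a shared trace_id links spans across hops."""
    record = {"name": name,
              "trace_id": trace_id or uuid.uuid4().hex,
              "span_id": uuid.uuid4().hex,
              "parent_id": parent_id}
    start = time.monotonic()
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - start
        spans.append(record)

# A parent operation with one child span, as in a two-hop request.
with span("checkout") as parent:
    with span("charge-card", trace_id=parent["trace_id"],
              parent_id=parent["span_id"]):
        time.sleep(0.01)  # stand-in for a downstream service call
```

Because every span carries the same `trace_id`, a backend can reassemble the whole request path and show exactly which hop consumed the time.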


  • SLA, SLI, SLO, and Error Budgets

So we don't just want to know when something has happened and then react to an event; that is not looking at things from the customer's perspective. We need to understand whether we are meeting our SLAs by gathering the number and frequency of outages and performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist you with these measurements. And they do more than assist with measurement: they offer a tool for achieving better reliability and form the base of the Reliability Stack.
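The relationship between an SLO and its error budget can be made concrete with a small calculation: the budget is simply the fraction of the window that the SLO allows you to fail. The function name and the 30-day window are illustrative.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, over the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% availability SLO over 30 days leaves about 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
```

Spent budget then becomes the decision signal: while budget remains, you can ship and run chaos experiments; once it is exhausted, the priority shifts to reliability work.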