When considering Chaos Engineering on Kubernetes, we must start from the beginning. It was not too long ago that applications ran in a single private data center, or perhaps two data centers for high availability. These data centers were on-premises and all components were housed internally. Life was easy: troubleshooting and monitoring could be done by a single team, if not a single person, with predefined dashboards. Failures were well known, and the capacity planning strategy did not change much. The network and infrastructure had fixed perimeters and were fairly static. There weren’t many changes to the stack on, say, a daily basis. Agility was at an all-time low, but that did not matter for the environments in which the application and infrastructure were housed.
Diagram: Chaos Engineering Principles.
However, nowadays we are in a completely different environment. Complexity is at an all-time high and agility in business is a critical factor. Now we have distributed applications with components/services located in many different places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. So, in this land of complexity, we need to find system reliability. A reliable system is one that continues to behave as expected, even when its components are under stress or failing.
Diagram: System Reliability: Link to YouTube video.
Beyond the Complexity Horizon
Therefore, monitoring and troubleshooting are a lot harder, especially as everything is interconnected in ways that make it difficult for a single person in one team to fully understand what is going on. The edge of the network and the application boundary extend beyond one location and one team. Enterprise systems have gone beyond the complexity horizon, and you can’t understand every bit of every single system. Even if you are a developer closely involved with the system who truly understands the nuts and bolts of the application and its code, no one can understand every bit of every single system. So being able to find the correct information is essential, but once you find it, you have to give it to those who can fix it. Monitoring is therefore not just about finding out what is wrong; it needs to alert, and those alerts need to be actionable.
Troubleshooting: Chaos Engineering Kubernetes
The goal of Chaos Engineering is to improve the reliability of a system by ensuring it can withstand turbulent conditions. Chaos Engineering makes Kubernetes more resilient. So if you are adopting Kubernetes, you should adopt Chaos Engineering and make it an integral part of your monitoring and troubleshooting strategy. Firstly, we can pinpoint application errors and understand, as best we can, how those errors arose. This could be anything from badly ordered scripts on a web page to, say, a database query with inefficient SQL calls, or even unoptimized code-level issues. Or there could be something more fundamental going on. It is common to have issues with how something is packaged into a container: you can pull in the incorrect libraries, or you may be running a debug version of the container. Or there could be nothing wrong with the packaging and containerization at all; it may be all about where the container is being deployed. Here, there could be something wrong with the infrastructure, either a physical or a logical problem, such as a bad configuration or a hardware fault somewhere in the application path.
Diagram: Chaos Engineering Kubernetes. Link to YouTube video.
Non-ephemeral and ephemeral services
With the introduction of containers and microservices, monitoring solutions need to manage both non-ephemeral and ephemeral services. We are collecting data for applications that consist of many different services. So when it comes to container monitoring and performing Chaos Engineering tests on Kubernetes, we need to fully understand the nature of the application that runs on top of it. Everything is dynamic by nature. You need to have monitoring and troubleshooting in place that can handle this dynamic and transient behavior. When it comes to monitoring a containerized infrastructure, you should consider the following.
- Container Lifespan: Containers have a short lifespan, whereby containers are provisioned and decommissioned based on demand. This is in comparison to VM or bare-metal workloads, which generally have a longer lifespan. As a generic guideline, containers have an average lifespan of 2.5 days, while traditional and cloud-based VMs have an average lifespan of 23 days. Containers can move, and they do move frequently. One day we could have workload A on cluster host A, and the next day, or even the same day, the same cluster host could be hosting application workload B. Therefore, there could be different types of impacts depending on the time of day.
- Containers are Temporary: Containers are dynamically provisioned for specific use cases on a temporary basis. For example, we could have a new container based on a certain image. There will be new network connections set up for that container, storage, and any integrations to other services that make the application work. All of which is done dynamically and can be on a temporary basis.
- Different levels of monitoring: We have many different levels to monitor in a Kubernetes environment. The components that make up the Kubernetes deployment will affect application performance. We have, for example, nodes, pods, and application containers. We have monitoring at different levels, such as the VM, storage, and the microservice level.
- Microservices change fast and often: Apps that consist of microservices are constantly changing. New microservices are added and existing ones decommissioned in quick succession. So what does this mean for usage patterns? It results in different usage patterns on the infrastructure. If everything changes often, it can be hard to derive a baseline and build a topology map, unless you have something automatic in place.
- Metric overload: We now have loads of metrics. With the different levels of containers and the infrastructure, we now have additional metrics to consider. We need to consider metrics for the nodes, cluster components, cluster add-ons, application runtime metrics, and custom application metrics. This is in comparison with a traditional application stack, where we would use metrics for components such as the operating system and the application.
- A Key Point: Metric Explosion
In the traditional world, we didn’t have to be concerned with additional components such as an orchestrator, or with the dynamic nature of many containers. With a container cluster, we have to consider metrics from the operating system and the application, but also from the orchestrator and the containers. We refer to this as a metric explosion. So now we have loads of metrics that need to be gathered, and there are different ways to pull or scrape them. Prometheus is common in the world of Kubernetes and uses a very scalable pull approach to getting those metrics from HTTP endpoints, either through Prometheus client libraries or exporters.
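To make the pull model concrete, the sketch below renders metrics in the Prometheus text exposition format that a scrape target returns over HTTP. It is a minimal stdlib-only illustration; a real service would normally use the official Prometheus client library, and the metric names here are invented for the example.

```python
# Minimal sketch of the Prometheus text exposition format that a scrape
# target serves at an HTTP endpoint (commonly /metrics). Metric names
# below are hypothetical; real services would use a Prometheus client
# library rather than hand-rolling this.

def render_metrics(metrics):
    """Render (name, type, help, value) tuples as Prometheus text format."""
    lines = []
    for name, mtype, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    sample = [
        ("http_requests_total", "counter", "Total HTTP requests served.", 1027),
        ("container_memory_bytes", "gauge", "Current memory usage.", 734003200),
    ]
    print(render_metrics(sample))
```

Prometheus scrapes this plain-text payload on a schedule, which is what makes the pull approach scale: targets stay simple and stateless, and the server decides when to collect.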
Diagram: Prometheus Monitoring Application: Scaping Metrics.
- A Key Point: What happens to visibility
So we need complete visibility now more than ever, and not just for single components but at a holistic level. Therefore, we need to monitor a lot more data points than we had to in the past. We need to monitor the application servers, Pods and containers, the clusters running the containers, the network for service/pod/cluster communication, and the host OS. All of the data from the monitoring needs to be in a central place so trends can be seen and different queries to the data can be acted on. In a large multi-tier application with Docker containers, correlating local logs would be challenging. We can use log forwarders or log shippers such as FluentD or Logstash, which can transform and ship logs to a backend such as Elasticsearch.
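Log shippers have a much easier time when each log line is already structured. As a hedged sketch of that idea, the snippet below uses Python's standard `logging` module to emit one JSON object per line, which a shipper such as FluentD or Logstash can parse and forward to Elasticsearch without fragile regex parsing. The field names and the `payments` logger name are illustrative, not a required schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, easy for a log shipper to parse."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),      # timestamp
            "level": record.levelname,          # severity for filtering
            "logger": record.name,              # which service/component
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order processed")
```

Because every line is self-describing JSON, the same pipeline works whether the container lives for minutes or days; the shipper, not the application, decides where the logs end up.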
- A Key Point: New Avenues for Monitoring
Containers are the norm for managing workloads and adapting quickly to new markets. Therefore, new avenues have opened up for monitoring these environments. We have, for example, AppDynamics and Elasticsearch, which is part of the ELK stack, along with the various log shippers that help provide a welcome layer of unification. We also have Prometheus to gather metrics; keep in mind that Prometheus works in the land of metrics only. There are different ways to visualize all this data, such as Grafana and Kibana.
Diagram: Prometheus Monitoring: Link to YouTube video.
Microservices Complexity: Management is Complex
So with the move toward microservices, we get the benefits of scalability and business continuity, but management is very complex. The monolith is much easier to manage and monitor. Also, as microservices are separate components, they don’t need to be written in the same language or even with the same toolkits, so you can mix and match different technologies. There is a lot of flexibility with this approach, but we can have increased latency and complexity. There are a lot more moving parts that increase complexity: we have, for example, reverse proxies, load balancers, firewalls, and other infrastructure support services. What used to be method calls or interprocess calls within the monolith’s host now go over the network and are susceptible to deviations in latency.
- Debugging Microservices
With the monolith, the application simply runs in a single process, and it is relatively easy to debug. A lot of traditional tooling and code instrumentation technologies were built on the assumption of a single process. With microservices, however, we have a completely different approach: a distributed application whose processes run in different places. The core challenge is that much of the tooling we have today was built for traditional monolithic applications. There are new monitoring tools for these new applications, but they come with a steep learning curve and a high barrier to entry. New tools and technologies such as distributed tracing and Chaos Engineering on Kubernetes are not the simplest to pick up on day one.
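The core idea behind distributed tracing is that every request carries a trace ID, and each service attaches that ID to its logs and spans so one request can be followed across process boundaries. The toy in-process sketch below illustrates just that propagation with Python's `contextvars`; real systems use standards like W3C Trace Context and libraries such as OpenTelemetry, and the handler names here are hypothetical.

```python
import contextvars
import uuid

# Toy illustration of trace-context propagation: a trace ID is set once
# per request and every log line downstream is tagged with it. In a real
# distributed system the ID travels in request headers between services.

current_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_trace():
    """Begin a new trace, as an edge service would on an incoming request."""
    trace_id = uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id

def annotate(message):
    """Tag a log line with the active trace ID."""
    return f"[trace={current_trace_id.get()}] {message}"

def handle_checkout():          # hypothetical request handler
    return annotate("checkout started"), charge_card()

def charge_card():              # hypothetical downstream call
    return annotate("card charged")

if __name__ == "__main__":
    start_trace()
    for line in handle_checkout():
        print(line)
```

Because both log lines share one trace ID, a log backend can stitch the request back together, which is exactly what a full tracing system does across many machines.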
- Automation and Monitoring: Checking and Health Checks
Automation comes into play with the new environment. With automation, we can run periodic checks not just on the functionality of the underlying components, but also health checks of how the application is performing. All of this can be automated at specific intervals or in reaction to certain events. With the rise of complex systems and microservices, it is more important than ever to have real-time monitoring of performance and of metrics that tell you how the systems behave. For example, what is the usual RTT, and how long do transactions take under normal conditions?
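One way to frame such automated checks is as a set of named probes that are retried on transient failure and reported centrally. The sketch below is a minimal, assumption-laden version of that pattern: the check names are hypothetical, and a real probe would open a TCP connection or hit an HTTP `/healthz` endpoint rather than call a lambda.

```python
import time

def run_health_checks(checks, retries=2, delay=0.1):
    """Run each named check, retrying transient failures, and report status.

    `checks` maps a name to a zero-argument callable returning True/False.
    A raised exception counts as an unhealthy result for that attempt.
    """
    results = {}
    for name, check in checks.items():
        healthy = False
        for _ in range(retries + 1):
            try:
                healthy = bool(check())
            except Exception:
                healthy = False
            if healthy:
                break               # stop retrying once the check passes
            time.sleep(delay)
        results[name] = healthy
    return results

if __name__ == "__main__":
    # Hypothetical checks; real ones would probe a database, cache, etc.
    print(run_health_checks({"db": lambda: True, "cache": lambda: False}))
```

Scheduling this function at fixed intervals, or triggering it after a deployment event, gives you the periodic and event-driven checks described above.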
The Rise of Chaos Engineering
There is a growing complexity of infrastructure, and let’s face it, a lot can go wrong. It’s imperative to have a global view of all the components of the infrastructure and a good understanding of application performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and trying to manually validate the health of each piece is hard to do. In these new environments, especially cloud-native at scale, complexity is at its highest and there are a lot more things that can go wrong. For this reason, you need to prepare as much as possible so the impact on users is minimal.
So the dynamic deployment patterns that you get with frameworks like Kubernetes allow you to build better applications. But you need to be able to examine the environment and see if it is working as expected. Most importantly, and the focus of this course, to effectively prepare you need to implement a solid strategy for monitoring in production environments.
Diagram: Chaos Engineering Testing.
- Chaos Engineering Kubernetes
For this, you need to understand practices like Chaos Engineering and Chaos Engineering tools, and how they can improve the reliability of the overall system. Chaos Engineering is the ability to perform tests in a controlled way. Essentially, we are breaking things on purpose in order to learn how to build more resilient systems: we inject a variety of issues and faults in a controlled way so we can make the overall application more resilient. In reality, it comes down to a trade-off and your willingness to accept it. There is a considerable trade-off with distributed computing. You have to monitor efficiently, have performance management in place, and, more importantly, accurately test the distributed system in a controlled manner.
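"Injecting faults in a controlled way" can be as simple as wrapping a call so it fails with an explicit, tunable probability. The sketch below shows that idea as a Python decorator; it is an illustration of the principle, not any particular chaos tool, and `lookup_price` is a hypothetical downstream call.

```python
import functools
import random

def inject_fault(failure_rate, exc=RuntimeError, rng=None):
    """Wrap a function so calls fail with the given probability.

    The failure rate and RNG are explicit so an experiment is repeatable,
    can be dialed down to zero, and has a known blast radius - the essence
    of controlled fault injection.
    """
    rng = rng or random.Random()
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise exc(f"chaos: injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.5, rng=random.Random(42))  # seeded: repeatable
def lookup_price(item):         # hypothetical downstream call
    return {"item": item, "price": 9.99}
```

Running load against a service wrapped this way tells you whether retries, timeouts, and fallbacks actually hold up before a real outage forces the question.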
- Service Mesh Chaos Engineering
Service Mesh is one option for implementing Chaos Engineering. You can also implement Chaos Engineering with Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates tests in the Kubernetes environment. The Chaos Mesh project offers a rich selection of experiment types, currently including Pod lifecycle tests, network tests, Linux kernel faults, I/O tests, and many other types of stress tests. Implementing practices like Chaos Engineering will help you understand and better manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.
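A Chaos Mesh experiment is declared as a Kubernetes custom resource. The sketch below builds a pod-kill experiment as a plain Python dict and prints it as JSON; the field names follow the Chaos Mesh `PodChaos` resource as documented, but the namespace and labels are hypothetical, so check the Chaos Mesh documentation for your version before applying anything to a cluster.

```python
import json

# Sketch of a Chaos Mesh PodChaos experiment manifest. The structure
# (apiVersion/kind/spec.action/spec.selector) follows the PodChaos
# resource; "demo" and the app=web labels are made-up targets.

def pod_kill_experiment(name, namespace, target_labels):
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",   # kill a pod; the controller reschedules it
            "mode": "one",          # blast radius: a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": target_labels,
            },
        },
    }

if __name__ == "__main__":
    manifest = pod_kill_experiment("kill-one-web-pod", "demo", {"app": "web"})
    print(json.dumps(manifest, indent=2))
```

Applying such a manifest with `kubectl apply` asks Chaos Mesh to kill one matching pod, letting you verify that Kubernetes reschedules it and that users never notice, which is exactly the controlled experiment this section describes.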