
Chaos Engineering: Don’t forget the baseline

In the past, applications ran in a single private data center, perhaps two for high availability. There may have been some satellite PoPs, but generally everything was housed in a few locations. These data centers were on-premises, with all components housed internally. As a result, troubleshooting and monitoring were relatively easy. The network and infrastructure were fairly static, the network and security perimeters were well known, and the stack didn't change much from day to day. Nowadays, however, we are in a completely different environment: distributed applications with components and services spread across many locations and location types, both on-premises and in the cloud, with dependencies on local and remote services alike. In comparison to the monolith, today's applications expose many different entry points to the external world.


However! A Lot Can Go Wrong

Infrastructure complexity keeps growing, and let's face it: a lot can go wrong. It's imperative to have a global view of every component of the infrastructure and a good understanding of application performance and health. In a large-scale, container-based application design, there are many moving pieces, and manually validating the health of each one is hard to do. If you want some tips on how to monitor, and more importantly how to react to events, you can check out my YouTube video on Chaos Engineering for a quick overview, or, for more detail, my course on Monitoring NetDevOps.

Therefore, monitoring and troubleshooting are much harder, especially as everything is interconnected in ways that make it difficult for a single person on one team to fully understand what is going on. Nothing is static anymore; things are moving around all the time. This is why it is even more important to focus on patterns and to be able to efficiently trace the path to where an issue lies. A modern application may run in multiple clouds and multiple location types at the same time, so there are many data points to consider. If each of these segments is even slightly overloaded, the delays of the individual segments add up to poor performance at the application level.
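To make the "delays add up" point concrete, here is a minimal sketch. The segment names and millisecond values are hypothetical, just to show how small per-hop delays along one request path accumulate into a noticeable end-to-end number:

```python
# Hypothetical per-segment latencies (ms) along one request path.
# Each segment is only slightly loaded, but the delays are additive.
segments = {
    "client -> CDN edge": 12,
    "edge -> API gateway": 18,
    "gateway -> auth service": 25,
    "auth -> user service": 30,
    "user service -> database": 22,
}

total_ms = sum(segments.values())
print(f"End-to-end latency: {total_ms} ms")  # 107 ms
```

No single hop looks alarming on its own, which is exactly why per-component dashboards can all be green while users still experience a slow application.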


What Does This Mean for Latency?

Distributed computing involves many components and services, often far apart, in contrast to a monolith, which keeps everything in one place. Because of this distributed nature, latency adds up. We have both network latency and application latency, and network latency is several orders of magnitude larger. As a result, you need to minimize the number of round-trip times (RTTs) and cut any unneeded communication to an absolute minimum. When communication across the network is required, it's better to gather data together into bigger packets, which are more efficient to transfer.
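A back-of-the-envelope model shows why batching matters. The numbers below (a 50 ms RTT and a 0.5 ms per-item cost) are assumptions for illustration, not measurements:

```python
import math

RTT_MS = 50        # assumed network round-trip time
PER_ITEM_MS = 0.5  # assumed per-item serialization/processing cost

def chatty(n_items):
    """One network round trip per item: RTT dominates."""
    return n_items * (RTT_MS + PER_ITEM_MS)

def batched(n_items, batch_size=100):
    """Gather items into larger requests: one round trip per batch."""
    batches = math.ceil(n_items / batch_size)
    return batches * RTT_MS + n_items * PER_ITEM_MS

print(chatty(1000))   # 50500.0 ms
print(batched(1000))  # 1000.0 ms
```

Under these assumptions, fetching 1,000 items one at a time costs over 50 seconds of accumulated RTT, while batching the same work into 100-item requests brings it down to about a second. The exact figures will differ in your environment, but the shape of the result holds whenever RTT dwarfs per-item cost.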

With a monolith, the application runs in a single process and is relatively easy to debug. A lot of traditional tooling and code-instrumentation technology was built on the assumption of a single process; in other words, much of the tooling we have today was designed for monolithic applications. Debugging microservices applications is a different kind of challenge. Newer monitoring tools do target these new architectures, but they come with a steep learning curve and a high barrier to entry.


A New Approach: Chaos Engineering

For this, you need to understand practices like Chaos Engineering and how they can improve the reliability of the overall system. Chaos Engineering is the practice of performing tests in a controlled way. Essentially, we break things on purpose in order to learn how to build more resilient systems: by injecting a variety of faults and failures in a controlled way, we make the overall application more robust. Implementing practices like Chaos Engineering will help you understand and better manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.
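As a minimal sketch of what "injecting faults in a controlled way" can look like in code, here is a hypothetical Python decorator that randomly adds latency and raises errors. The function name, rates, and delays are all made up for illustration; real chaos tooling works at the infrastructure level, but the principle is the same:

```python
import functools
import random
import time

def inject_faults(failure_rate=0.1, max_delay_s=0.2, seed=None):
    """Wrap a function so calls randomly slow down or fail, deterministically if seeded."""
    rng = random.Random(seed)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(rng.uniform(0, max_delay_s))  # inject latency
            if rng.random() < failure_rate:          # inject a failure
                raise RuntimeError(f"chaos: injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.3, max_delay_s=0.01, seed=42)
def fetch_profile(user_id):
    # Stand-in for a real service call.
    return {"id": user_id, "name": "demo"}

# Exercise the wrapped call and count how often the injected fault surfaces.
failures = 0
for _ in range(100):
    try:
        fetch_profile(1)
    except RuntimeError:
        failures += 1
print(f"{failures} injected failures out of 100 calls")
```

The controlled part is key: the seed makes runs reproducible, and the rates are tunable, so you can dial faults up gradually and observe whether retries, timeouts, and fallbacks actually behave as designed.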


A Final Note On Baselines: Don’t Forget Them!!

Creating a good baseline is critical. You need to understand how things behave under normal circumstances. A baseline is a fixed point of reference used for comparison. You need to know how long it usually takes from application start to actual login, and how long the basic services take, before there are any issues or heavy load. Baselines are critical to monitoring. It's like security: you can't protect what you can't see. The same assumption applies here. Go for a good baseline and, if you can, have it fully automated. Tests need to be carried out against the baseline on an ongoing basis; you need to test continually how long it takes users to use your services. Without baseline data, it's difficult to evaluate changes or demonstrate progress.
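A baseline check can be automated with very little code. The sketch below, with hypothetical helper names and a hypothetical 25% tolerance, times an operation repeatedly, records a median, and flags when later runs drift past the baseline:

```python
import statistics
import time

def measure(fn, runs=20):
    """Time repeated runs of an operation and summarize the distribution in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {"p50": statistics.median(samples), "max": max(samples)}

def within_baseline(current_p50, baseline_p50, tolerance=0.25):
    """Return False if the current median has regressed beyond the tolerance."""
    return current_p50 <= baseline_p50 * (1 + tolerance)

# Example: a stand-in workload for "time from app start to login".
baseline = measure(lambda: sum(range(10_000)))
print(f"baseline p50: {baseline['p50']:.3f} ms")
```

In practice you would persist the baseline numbers, rerun `measure` on a schedule (or in CI), and alert when `within_baseline` returns False, which is exactly the ongoing testing against the baseline described above.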
