For Prometheus monitoring, we want to collect as many useful metrics as possible. These need to be stored so we can follow trends, review what has happened historically, and better predict issues. A Prometheus monitoring solution therefore has several parts: we must collect the metrics (known as scraping), store them, and then analyze them. In addition, we need to consider security, compliance, and regulatory concerns around storage. Monitoring the right metrics is key: metrics represent raw measurements of resource usage, show you how the system is performing, tell you how many resources are being used, and help you plan for upgrades.
Metrics can be applied to various components; a metric is a consistently measured unit for evaluating an item. Common examples include CPU utilization, memory utilization, and interface utilization: numbers describing how your resources are performing. On the metrics side, we have runtime metrics, infrastructure metrics, and application metrics, gathered through Prometheus exporters and client libraries and covering things such as response codes and time to serve data. We also have CI/CD pipeline metrics such as build time and build failures. Let’s discuss these in more detail.
Highlighting Prometheus Monitoring
Previously, Heapster was the monitoring solution that came out of the box with Kubernetes. Today, Prometheus is the de facto standard monitoring system for Kubernetes clusters, and it brings many benefits. Firstly, Prometheus scales well thanks to its pull-based approach and its federation options. The challenge with a push-based approach is that when you run microservices at scale and every service pushes metrics to a central metrics server, the monitoring traffic can flood the network; you may also need to scale the metrics server up instead of out, which can be costly. We may have many different systems to monitor, and the metrics content naturally differs between systems and components, but Prometheus collects and exposes them all in the same format. This provides a welcome layer of unification across the different systems in your network.
Diagram: Prometheus Monitoring Architecture.
Prometheus Exporters and Client Libraries
With Prometheus, you can get metrics from the systems you want to monitor using pre-built exporters and custom client libraries. Prometheus works very well with Docker and Kubernetes, but it can also work outside the container world with non-cloud-native applications by using exporters, so you can monitor your entire stack. For cloud-native applications, we gather custom application and runtime metrics by installing a client library: by adding a small amount of code to the application, we can expose the custom metrics that matter most to us.
Diagram: Distributed Tracing: Link to YouTube video.
- Metric Type: Runtime Metrics
Runtime metrics are statistics collected from the operating system and the application host. These include CPU usage, memory load, and web server requests; for example, CPU and memory usage from the JVM running a Java application on Tomcat.
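As a minimal sketch of what runtime metrics look like at the process level, the snippet below reads a few statistics for the current process using only the Python standard library. The metric names in the returned dictionary are illustrative, not a standard.

```python
import os
import resource

def runtime_metrics() -> dict:
    """Collect a few basic runtime statistics for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "process_cpu_user_seconds": usage.ru_utime,    # CPU time in user mode
        "process_cpu_system_seconds": usage.ru_stime,  # CPU time in kernel mode
        "process_max_rss": usage.ru_maxrss,            # peak resident set size
        "process_pid": os.getpid(),                    # process identifier
    }

metrics = runtime_metrics()
```

Note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, which is exactly the kind of inconsistency a monitoring system like Prometheus helps normalize.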
- Metric Type: Infrastructure metrics
For infrastructure metrics, we examine CPU utilization, latency, bandwidth, memory, and temperature. These metrics should be collected over a long period and apply to the infrastructure itself, such as the different types of networking equipment, hypervisors, and host-based systems.
- Metric Type: Application metrics
Then we have application metrics: custom statistics relevant only to the application, not the infrastructure. This may include the number of API calls made during a particular period. Web-based applications make this easy to measure, since status codes provide useful information and are available immediately: an HTTP status code of 200 is good, and anything 400 or above indicates an issue.
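The status-code rule above can be sketched as a small classifier. The category labels are illustrative; the 2xx/3xx/4xx boundaries follow the HTTP status code classes.

```python
def classify_status(code: int) -> str:
    """Bucket an HTTP status code: 2xx is healthy, 400+ signals a problem."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if code >= 400:
        return "issue"
    return "informational"  # 1xx responses

assert classify_status(200) == "ok"
assert classify_status(404) == "issue"
```

A metrics pipeline would typically count responses per bucket and alert when the "issue" rate rises above a threshold.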
- Metric Type: Time to first byte
Another important metric is how long a web server takes to start responding: time to first byte (TTFB). This is the time between the browser requesting a page and receiving the first byte of the response from the server. If this metric is higher than usual, you may need to use caching, faster storage, or a better CPU. Taking a content delivery network (CDN) as an example, what is an acceptable time to first byte? On average, anything with a TTFB under 100 ms is fantastic, anything between 200-500 ms is standard, and anything between 500 ms and 1 second is less than ideal. Anything greater than 1 second should likely be investigated further.
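These rough bands can be expressed as a small rating function. The text leaves the 100-200 ms range unspecified, so this sketch folds it into the "standard" band; the labels mirror the wording above.

```python
def ttfb_rating(ttfb_ms: float) -> str:
    """Bucket a time-to-first-byte measurement (milliseconds) into rough bands."""
    if ttfb_ms < 100:
        return "fantastic"
    if ttfb_ms <= 500:
        return "standard"       # 100-200 ms is assumed standard here
    if ttfb_ms <= 1000:
        return "less than ideal"
    return "investigate"

assert ttfb_rating(80) == "fantastic"
assert ttfb_rating(1500) == "investigate"
```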
- Metric Type: CI/CD Pipeline Metrics
For the CI/CD pipeline metrics, we want to measure how long static code analysis takes, the number of errors encountered while running the pipeline, and the build time and build failures. These metrics include how long it took to build an application, how long tests took to complete, and how often builds fail.
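Two of these measures, average build time and failure rate, can be derived from build records. The record format and field names below are hypothetical, standing in for whatever your CI system exports.

```python
def pipeline_stats(builds: list) -> dict:
    """Compute average build time and failure rate from build records."""
    total = len(builds)
    failures = sum(1 for b in builds if b["status"] == "failed")
    return {
        "builds": total,
        "failure_rate": failures / total,
        "avg_build_seconds": sum(b["duration_s"] for b in builds) / total,
    }

# Hypothetical records from four pipeline runs.
builds = [
    {"status": "passed", "duration_s": 120},
    {"status": "failed", "duration_s": 95},
    {"status": "passed", "duration_s": 110},
    {"status": "passed", "duration_s": 130},
]
stats = pipeline_stats(builds)  # one failure out of four builds
```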
- Metric Type: Docker Metrics
Docker metrics come from the Docker platform. This may include container health checks, the number of online and offline nodes in a cluster, and the number of containers in each state, such as running, paused, and stopped. These built-in metrics from Docker give additional visibility into the running containers. When running containers in production, monitoring their runtime metrics, such as CPU and memory usage, is important.
A Key Point: Docker Metrics
Metrics from the Docker platform are very important when Docker is stopping and starting application containers for you. You shouldn't examine one metric type in isolation: if you look at just the application metrics, you are only seeing half of the puzzle, and you may miss the problem. For example, if one of your applications is performing poorly and the Docker platform is constantly spinning up new containers, you would not see that under the application metrics alone; your application and runtime metrics may appear to be within normal thresholds. Combining them with the Docker platform metrics reveals the container stats, which will show a spike in container creation.
- Exposing Application Metrics to Prometheus
Application metrics give you additional information, but unlike runtime metrics, which you get for free, you need to explicitly record the things you care about. This is where the client libraries Prometheus offers come in: all the major languages have a Prometheus client library that provides the metrics endpoint. The client library makes application metrics available to Prometheus, giving you a very high level of visibility into what is happening inside the application. Together, Prometheus exporters and client libraries allow Prometheus to monitor everything.
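To make the "metrics endpoint" concrete, here is a stdlib-only sketch of the kind of body a client library serves at `/metrics`: the Prometheus text exposition format. Real applications should use an official client library (for example, `prometheus_client` in Python); the metric names here are illustrative.

```python
def render_metrics(counters: dict) -> str:
    """Render counter metrics in the Prometheus text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")  # metadata line for the scraper
        lines.append(f"{name} {value}")          # sample line: name and value
    return "\n".join(lines) + "\n"

body = render_metrics({"app_requests_total": 42, "app_errors_total": 3})
```

When Prometheus scrapes the endpoint, it parses exactly this plain-text format into time series.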
Diagram: Exposing metrics to Prometheus. Link to YouTube video.
- Exposing Docker Metrics to Prometheus
The Docker Engine interacts with all its clients, and it collects and records metrics as it works; for example, when you build a Docker image, the Engine records a metric. Since we need insight into the Docker platform, we can expose these metrics to Prometheus: the Docker Engine has a built-in mechanism to export metrics in Prometheus format, covering the Engine itself, containers, and images.
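Turning this on is a daemon configuration change: you give the Engine a metrics address in `daemon.json` (typically `/etc/docker/daemon.json`), and it serves Prometheus-format metrics there. Note that older Engine versions also required the `experimental` flag to be set. A minimal example:

```json
{
  "metrics-addr": "127.0.0.1:9323"
}
```

After restarting the daemon, Prometheus can scrape `http://127.0.0.1:9323/metrics` like any other target.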
Docker Metric Types: Three Types
Docker metrics fall into three areas: the Docker Engine, builds, and containers.
- Engine metrics give you information about the host, such as the CPU count, the O/S version, and the build of the Docker Engine.
- Build metrics are useful for information such as the number of builds triggered, canceled, and failed.
- Container metrics show the number of containers stopped and paused, as well as the number of health checks fired and failed.
- Wrap Up: Prometheus Monitoring
So we have Prometheus exporters that can expose metrics for, let's say, a Linux server, and application metrics that can support Prometheus via a client library. Both provide an HTTP endpoint that returns metrics in the standard Prometheus format. Once the HTTP endpoint is up and running on the application (legacy or cloud-native), Prometheus will scrape (collect) the metrics using one of two approaches: static configuration or dynamic service discovery. Exporters add metrics to systems that don't have native Prometheus support, while client libraries add Prometheus support inside the application and can provide both out-of-the-box runtime metrics and custom metrics relevant to the application.
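Both approaches live in the `scrape_configs` section of `prometheus.yml`. The sketch below shows a static target list for a fixed Linux exporter alongside Kubernetes service discovery for dynamic targets; the job names and the target address are illustrative.

```yaml
scrape_configs:
  # Static: a known node_exporter instance on a Linux host.
  - job_name: "node-exporter"
    static_configs:
      - targets: ["10.0.0.5:9100"]

  # Dynamic: discover pods through the Kubernetes API.
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
```

With service discovery, targets appear and disappear as pods are scheduled, which is what makes the pull model workable for containers that Docker or Kubernetes is constantly starting and stopping.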