In this post, I would like to discuss Prometheus monitoring and its pull-based approach to metric collection. First, let us roll back in time, say ten years, and look at how monitoring was done. Traditionally, you could use something like Ganglia. Ganglia was often used to monitor CDN networks involving several PoPs in different geographic locations. Within such a CDN, the PoPs look largely the same: the same servers, storage, and so on, differing only in the number of transit providers and servers. For alerting, we could use Icinga on the back of Ganglia. With this type of monitoring design, the infrastructure pushes metrics to central collectors. The central collectors are in one location, or often two when a backup is needed.
- The Challenges
However, as infrastructure grows (and infrastructure does grow at alarming rates) and you need to push more metrics into Ganglia, you will start to see some issues. For example, with some monitoring systems, the push style of metric collection can cause scalability issues as the number of servers increases. Within this CDN monitoring design, only one or two machines collect the telemetry for all of your infrastructure. So as you scale your infrastructure and throw more data at the system, you have to scale up instead of out. This can be costly and will often hit bottlenecks.
Diagram: Prometheus Monitoring: The Challenges
However, you want a monitoring solution that scales with your infrastructure growth. As you roll out new infrastructure to meet demand, the monitoring system should scale along with it. With Ganglia and Icinga, we also had limited graphing functions. Creating custom dashboards on unique metrics was hard, and Ganglia on its own had no alerting support. Also, around that time there was no API to get at and consume the metric data. So if you wanted to consume the data in a different system or perform interesting analyses, you could not: all of this data was essentially locked into Ganglia.
- The Transitions
Around eight years ago, SaaS-based monitoring solutions were introduced. These solved some of the problems, with alerting built in and an API to get to the data. However, now there are two systems, and this introduces complexity. The collectors and the agents push to the SaaS-based system alongside the on-premises design. These systems may need to be managed by two different teams: a cloud team looking after the cloud-based SaaS solution and on-premises network or security teams looking after the on-premises monitoring. So there is already a communication gap, not to mention a considerable silo within a single technology set: monitoring.
Also, questions arise about where to put the metrics: in the SaaS-based product or in Ganglia. For example, we could have different metrics in each place or the same metrics in both. How can you keep track and ensure consistency? Ideally, if you have a dispersed PoP design and expect your infrastructure to grow, you don’t want to have centralized collectors. But unfortunately, most on-premises solutions still have a push-based centralized model.
Diagram: Prometheus Monitoring Tutorial.
Prometheus Monitoring: A Pull-based approach
Then Prometheus came around and offered a new approach to monitoring that can handle millions of metrics on modest hardware. Rather than having external services push metrics to it, Prometheus uses a pull approach: it fetches the metrics from its targets. Prometheus is a server application written in Go. It is an open-source, decentralized monitoring tool, but it can be centralized when you use the federate option. Prometheus has a server component, and you run this in each environment. You can, if you want, run a Prometheus container in each Kubernetes pod. Prometheus stores its data in a time-series database, and every metric is recorded with a timestamp. Prometheus is not a SQL database; you query the metrics with its own query language, PromQL.
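To make the pull model concrete, here is a minimal Python sketch of one scrape round. It is not how Prometheus itself is implemented; the names `Sample`, `scrape_once`, and `fetch_http` are made up for illustration, and the parser assumes simple `name value` lines with no trailing timestamps.

```python
import urllib.request
from typing import Callable, Iterable, List, NamedTuple


class Sample(NamedTuple):
    target: str       # host:port the value was pulled from
    name: str         # metric name, including any label set
    value: float
    timestamp: float  # stamped by the collector at scrape time


def fetch_http(target: str) -> str:
    """Pull the metrics page from a target (assumes a /metrics path)."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def scrape_once(targets: Iterable[str],
                fetch: Callable[[str], str],
                now: float) -> List[Sample]:
    """Visit every target once and return the samples that were collected."""
    samples: List[Sample] = []
    for target in targets:
        try:
            body = fetch(target)
        except OSError:
            # An unreachable target yields no samples this round;
            # the collector, not the target, notices the failure.
            continue
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            name, _, value = line.rpartition(" ")
            samples.append(Sample(target, name, float(value), now))
    return samples
```

Calling `scrape_once(["host-a:9100"], fetch_http, time.time())` on a timer would give a crude scrape loop. The point of the design is that the collector decides when and what to fetch, so adding targets means adding entries to a list, not reconfiguring every server.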
Diagram: Observability and Monitoring. Link to YouTube video.
- Prometheus Monitoring: Legacy System
So let us now expand on this and look at two environments for monitoring: a legacy environment and a modern Kubernetes environment. For the legacy environment, let’s say we are running a private cloud with many SQL, Windows, and Linux servers. Nothing new here. Here you would run Prometheus on the same subnet. There would also be a Prometheus agent installed on the servers. We would have exporters for both Linux and Windows (Node Exporter on Linux, Windows Exporter on Windows), which extract the metrics and create a metric endpoint on each of your servers. The metric endpoint is needed on each server or host so Prometheus can scrape the metrics. So there is a daemon running on each host, collecting all of the metrics and exposing them on a page, for example, http://host:port/metrics, that Prometheus can scrape.
There is also a Prometheus federation feature. You can have a federate endpoint and allow Prometheus to expose its metrics to other Prometheus servers. This allows you to pull metrics across different subnets. So we can have another Prometheus in a different subnet scraping the first Prometheus. The federate option lets you link these two Prometheus deployments together very easily.
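As a rough illustration of what such a metric endpoint looks like from the inside, the following Python sketch serves a `/metrics` page in the Prometheus text format using only the standard library. It is not the Node Exporter; `render_metrics`, `make_server`, and the `collect` callback are hypothetical names, and every value is treated as a gauge to keep things short.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict

# Content type used for the Prometheus text exposition format.
CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8"


def render_metrics(metrics: Dict[str, float]) -> str:
    """Turn {name: value} into text-format lines, one sample per metric."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_server(port: int, collect: Callable[[], Dict[str, float]]) -> HTTPServer:
    """Build an HTTP server that answers GET /metrics with fresh values."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            # Values are collected at request time, i.e. when scraped.
            body = render_metrics(collect()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", CONTENT_TYPE)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real daemon would log

    return HTTPServer(("0.0.0.0", port), MetricsHandler)
```

Something like `make_server(9100, my_collect).serve_forever()` would then run the daemon, where `my_collect` is whatever function gathers the host's numbers. The port is an assumption here; any free port works as long as Prometheus is pointed at it.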
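Federation is itself just another scrape: one Prometheus pulls from the `/federate` endpoint of another, passing one or more `match[]` selectors to say which series it wants. The Python sketch below only builds such a URL; the host name `prom-subnet-a` and the helper `federate_url` are assumptions for illustration.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: Iterable[str]) -> str:
    """Build the URL a downstream Prometheus would scrape.

    Each selector becomes one match[] query parameter, so several
    selectors can be requested in a single scrape.
    """
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical upstream server in another subnet.
url = federate_url("http://prom-subnet-a:9090", ['{job="node"}'])
print(url)
```

In practice the same information normally lives in the scrape configuration of the downstream Prometheus, not in application code; the sketch just shows that nothing more exotic than an HTTP GET with query parameters is involved.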
- Prometheus Monitoring: Modern Kubernetes
So here we have a Kubernetes platform, with a bunch of containers or VMs running in a Kubernetes cluster. In this type of environment, we would usually create a namespace; for example, we could call the namespace monitoring. We then deploy a Prometheus pod in our environment. The Prometheus pod’s YAML file will point to the Kubernetes API. The Kubernetes API has a metrics server, which will get metrics from your environment. So here we are getting metrics for the container processes. If you want instrumentation for your application, you have the option to add a library to your code. This can be done with the Prometheus client libraries. So we now have a metrics endpoint similar to before, and we can grab metrics specific to your application. We end up with a metrics endpoint on each container that Prometheus can scrape.
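To give a feel for what instrumenting application code means, here is a small, dependency-free Python sketch of a labelled counter. In a real service you would more likely use the official Python client library (`prometheus_client`), which provides a `Counter` type; the `Counter` class below is a simplified stand-in written for this post, not that library's implementation.

```python
import threading
from typing import Dict, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


class Counter:
    """A monotonically increasing metric, tracked per label set."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[LabelKey, float] = {}
        self._lock = threading.Lock()  # handlers may run on many threads

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Return this counter in the text format served at /metrics."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                series = f"{self.name}{{{label_str}}}" if label_str else self.name
                lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical application metric, incremented inside request handlers.
requests_total = Counter("demo_http_requests_total", "Requests handled.")
requests_total.inc(method="GET")
requests_total.inc(method="GET")
requests_total.inc(method="POST")
```

The application only increments the counter where the interesting event happens; the text produced by `render()` is what the container's metrics endpoint would return when Prometheus scrapes it.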
Diagram: Exposing metrics to Prometheus. Link to YouTube video.
Exposing Runtime Metrics: The Prometheus Exporter
- Exporter Types:
To enable Prometheus monitoring, you must add a metric API to the application containers. For applications that don’t have their own metric API, we use what is known as an Exporter. This utility reads the runtime metrics the app has already collected and exposes them on an HTTP endpoint. Prometheus can then scrape this HTTP endpoint. There are different types of Exporters that collect metrics for different runtimes, such as a Java Exporter, which will give you a set of JVM statistics, and a .NET Exporter, which will give you a set of Windows performance metrics. Essentially, we are adding a Prometheus endpoint to the application by running an Exporter utility alongside it. So we will have two processes running in the container.
Diagram: Prometheus Monitoring Example.
With this approach, you don’t need to change the application. This could be useful for some regulatory environments or cases where you simply can’t make changes to the application code. So now you have application runtime metrics without changing any code. This is the operating system and application host data already being collected in the containers. To make these metrics available to Prometheus, you need to add an Exporter to the Docker image. Many teams use Exporters for legacy applications instead of changing the code to support Prometheus monitoring. So essentially, what we are doing is exporting the statistics to a metric endpoint.
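The core job of an Exporter, translating statistics the application already produces into metric lines, can be sketched in a few lines of Python. This assumes, purely for illustration, that the application publishes its runtime statistics as a JSON document; the functions `flatten` and `export` and the `app` prefix are hypothetical and do not correspond to any particular Exporter.

```python
import json
import re
from typing import Any, Dict, Iterator, Tuple

# Characters that are not allowed in the metric names this sketch emits.
_INVALID = re.compile(r"[^a-zA-Z0-9_]")


def flatten(stats: Dict[str, Any], prefix: str) -> Iterator[Tuple[str, float]]:
    """Walk nested stats and yield (metric_name, value) for numeric leaves."""
    for key, value in stats.items():
        name = _INVALID.sub("_", f"{prefix}_{key}")
        if isinstance(value, dict):
            yield from flatten(value, name)
        elif isinstance(value, (bool, int, float)):
            yield name, float(value)
        # Non-numeric leaves (strings, lists) are skipped in this sketch.


def export(raw_json: str, prefix: str = "app") -> str:
    """Convert the app's own stats document into text-format metric lines."""
    stats = json.loads(raw_json)
    return "".join(f"{name} {value}\n" for name, value in flatten(stats, prefix))


# Example stats document, as an untouched legacy app might publish it.
raw = '{"heap": {"used-bytes": 1024}, "threads": 8, "version": "1.2"}'
print(export(raw))
```

The application is never modified: the Exporter process reads whatever the app already exposes and serves the translated text on its own HTTP endpoint, which is why the two run side by side in the container.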