microservices development

Microservices Observability

Monitoring Microservices

In the world of software development, microservices have gained significant popularity due to their scalability, flexibility, and ease of deployment. However, as the complexity of microservices architectures grows, so does the need for robust observability practices. In this blog post, we will delve into the realm of microservices observability, exploring its importance, key components, and best practices.

Observability is the ability to gain insights into the internal workings of a system through monitoring and instrumentation. In the context of microservices, observability allows developers and operators to understand how individual services interact, diagnose issues, and ensure optimal performance. By employing observability techniques, organizations can effectively manage the complexity that arises from a distributed architecture.

To achieve comprehensive observability in a microservices environment, several key components come into play. These include:

1. Distributed Tracing: Distributed tracing enables the tracking of requests as they flow through various microservices. It provides a holistic view of the request flow, allowing for performance analysis, bottleneck identification, and troubleshooting.

2. Logging and Log Aggregation: Logging plays a crucial role in capturing important events and data from microservices. By aggregating logs from different services into a central location, it becomes easier to monitor and analyze system behavior, detect anomalies, and perform root cause analysis.

3. Metrics and Monitoring: Metrics provide quantitative data about the behavior and performance of microservices. Monitoring these metrics in real-time helps identify trends, set performance baselines, and trigger alerts when predefined thresholds are breached.

To ensure effective observability in microservices architectures, organizations should consider the following best practices:

1. Instrumentation: Properly instrumenting microservices with observability tools and frameworks is essential. This includes adding code snippets to capture relevant data, such as request/response times, error rates, and resource utilization.

2. Standardization: Establishing common standards for logging, tracing, and metrics across microservices simplifies the observability process. Adopting industry-standard formats and protocols enables seamless integration and interoperability between different observability tools.

3. Automation: Automating observability processes, such as log aggregation, metric collection, and alerting, reduces manual effort and ensures consistent monitoring across the entire microservices ecosystem. Leveraging automation tools and frameworks can streamline observability workflows.

Microservices observability is a critical aspect of managing complex distributed architectures. By understanding the key components and implementing best practices, organizations can gain valuable insights into the behavior and performance of their microservices. Embracing observability empowers developers and operators to proactively identify and resolve issues, optimize performance, and deliver reliable and scalable microservices-based applications.

Highlights: Monitoring Microservices

**Understanding Microservices Observability**

Before we dive deeper, let’s establish a clear understanding of what microservices observability entails. Observability refers to gaining insights into a system’s internal state based on its external outputs. In microservices, observability involves collecting and analyzing data from various sources, such as logs, metrics, and traces, to understand the system’s behavior and performance comprehensively.

Key Points:

Focusing on key components that enable comprehensive monitoring and troubleshooting is crucial to achieving effective observability. These components include logging, metrics, and distributed tracing. Logging provides a detailed record of events and system activities, while metrics measure quantitative system performance. Distributed tracing allows tracking requests propagating through multiple microservices, giving valuable insights into the latency and dependencies between services.

Numerous tools and technologies have emerged to support the observability of microservices. Prominent examples include popular open-source solutions like Prometheus, Grafana, and Jaeger. These tools provide capabilities for collecting, storing, visualizing, and analyzing observability data. Additionally, cloud-based platforms like AWS CloudWatch and Azure Monitor offer managed services that simplify the setup and management of observability infrastructure.

Microservices monitoring is suitable for known patterns that can be automated, while microservices observability is suitable for detecting unknown and creative failures. Microservices monitoring is a critical part of successfully managing a microservices architecture. It involves tracking each microservice’s performance to ensure there are no bottlenecks in the system and that the microservices are running optimally.

The Need for Microservices Observability:

1. Obfuscation: When creating microservices, your application becomes more distributed, the coherence of failures decreases, and we live in a world of unpredictable failure mode—also, the distance between cause and effect increases. For example, an outage at your cloud provider’s blob storage could cause huge cascading latency for everyone. In today’s environment, we have new cascading problems.

2. Inconsistency and highly independent: Distributed applications might be reliable, but the state of individual components can be much less consistent than in monolithic or non-distributed applications, which have elementary and well-known failure modes. In addition, each element of a distributed application is designed to be highly independent, and each component can be affected by different upstream and downstream components.

3. Decentralization: How do you look for service failures when a thousand copies of a service may run on hundreds of hosts? How do you correlate those failures so you can understand what’s going on?

Cloud Monitoring: Compute Engine & Ops Agent

What is an Ops Agent?

Ops Agent is a monitoring agent provided by Google Cloud that allows users to collect and export monitoring data from their Compute Engine instances. It acts as a bridge between your virtual machines and the Google Cloud Monitoring service, enabling real-time visibility into the health and performance of your infrastructure.

Ops Agent Advantages:

Ops Agent offers several advantages when it comes to monitoring a Compute Engine. Firstly, it provides a unified solution for collecting metrics, logs, and events from your instances. This means you can easily access and analyze all the necessary data in one centralized location. Additionally, Ops Agent offers resource-efficient monitoring, minimizing the impact on your instances’ performance while providing accurate and timely information.

Implementing Ops Agent:

To start monitoring your Compute Engine instances with Ops Agent, you need to follow a few simple steps. First, ensure that you have the necessary permissions and enable the necessary APIs in your Google Cloud project. Then, install Ops Agent on your instances using the provided installation script or by creating a custom image with Ops Agent pre-installed. Finally, configure the agent to collect the desired metrics, logs, and events based on your monitoring requirements.

**Tools of the past: Log data and series statistics**

Traditionally, microservices monitoring has boiled down to two types of telemetry data: log data and time series statistics. Time series data is also known as metrics, as to make sense of a metric, you need to view a period. However, as we broke the software into tiny, independently operated services and distributed those fragmented services, the logs, and metrics we captured told you very little of what was happening to the critical path.

Understanding the critical path is the most important thing, as this is what the customer is experiencing. Looking at a single stack trace or watching CPU and memory utilization on predefined graphs and dashboards is insufficient. As software scales in-depth but breadth, telemetry data like logs and metrics alone don’t provide clarity; you must quickly identify production problems.

Components – Microservices Monitoring

Metrics 

A. Metrics: This includes tracking metrics such as response time, throughput, and error rate. This information can be used to identify performance issues or bottlenecks. By collecting and interpreting metrics, organizations gain valuable insights into their microservices-based applications, enabling them to make informed decisions and proactively address potential problems.

Logs 

B. Logs: Logging allows administrators to track requests, errors, and exceptions, which can provide deeper insight into the performance of microservices architecture. Logs provide a unique perspective by capturing valuable information about system events and activities. Logs act as a breadcrumb trail, documenting the inner workings of microservices.

One can detect anomalies, identify bottlenecks, and troubleshoot errors effectively by analyzing logs. Capturing log data from each microservice and centralizing it in a log management system allows comprehensive monitoring across the entire architecture. Logs can reveal valuable insights such as response times, error rates, and resource usage, empowering teams to make data-driven decisions.

Tracing 

C. Tracing: Tracing provides a timeline of events within the system. This can be used to identify the source of issues or to track down errors. In microservices, tracing refers to capturing and analyzing the flow of requests as they traverse through different services. By tracing requests, we can gain valuable insights into the performance and behavior of our microservices architecture. From identifying latency issues to detecting errors and bottlenecks, tracing provides a holistic view of the entire request journey.

Alerts

D. Alerts: Alerts notify administrators when certain conditions are met. For example, administrators can be alerted if a service is down or performance is degrading . Configuring alerting rules is a critical step in microservices monitoring. It involves defining thresholds or conditions that, when breached, trigger alerts. These rules should be set based on the specific requirements of each microservice, considering factors like expected response time, error rates, or resource thresholds. Additionally, it’s essential to determine the appropriate severity levels for different alerts.

Finally, it is essential to note that microservices monitoring is not just limited to tracking performance. It can also detect security vulnerabilities and provide insights into the architecture.

By leveraging microservices monitoring, organizations can ensure that their microservices architecture runs smoothly and that any issues are quickly identified and resolved. This can help ensure the organization’s applications remain reliable and secure.

Example Product: Cisco AppDynamics

### Why Choose Cisco AppDynamics?

Cisco AppDynamics stands out in the crowded APM market for several compelling reasons. First, it offers end-to-end visibility into your application’s performance, from the end-user experience down to the underlying infrastructure. This comprehensive view allows you to pinpoint issues quickly and resolve them before they impact your users. Additionally, AppDynamics employs machine learning algorithms to detect anomalies and provide actionable insights, enabling proactive problem-solving.

### Key Features of Cisco AppDynamics

One of the standout features of AppDynamics is its ability to automatically map your application’s topology, giving you a clear picture of how different components interact. This dynamic mapping is invaluable for troubleshooting and optimizing your application. Another key feature is its robust alerting system, which notifies you of performance issues in real-time, allowing for immediate intervention. Furthermore, AppDynamics offers detailed analytics and reporting capabilities, helping you make data-driven decisions to improve your application’s performance.

### Integrations and Extensibility

Cisco AppDynamics is designed to integrate seamlessly with a wide range of technologies and platforms, making it a versatile tool for any IT environment. Whether you’re using cloud services like AWS or Azure, container orchestration platforms like Kubernetes, or traditional on-premise infrastructure, AppDynamics has you covered. The platform also supports custom extensions, allowing you to tailor it to your specific needs and workflows.

### Real-World Use Cases

Many organizations have successfully leveraged Cisco AppDynamics to achieve significant improvements in their application performance and user experience. For instance, a leading e-commerce company used AppDynamics to identify and resolve a critical bottleneck in their checkout process, resulting in a 20% increase in conversion rates. Similarly, a financial services firm utilized AppDynamics’ machine learning capabilities to predict and prevent potential outages, ensuring uninterrupted service for their customers.

Example: What are VPC Flow Logs?

VPC Flow Logs provide detailed information about the IP traffic flowing through your Virtual Private Cloud (VPC). They capture metadata about each network flow, including source and destination IP addresses, ports, protocol, packet and byte counts, and more. Enabling VPC Flow Logs allows you to gain visibility into the traffic patterns within your VPC, allowing you to monitor, troubleshoot, and analyze network activity.

VPC Logs & Cloud Monitoring

Google Cloud offers a variety of powerful tools for analyzing VPC Flow Logs and extracting meaningful insights. One such tool is Cloud Logging, which allows you to view and search flow logs in real-time, set up alerts and notifications, and create custom dashboards for visualization. Additionally, you can leverage BigQuery, Google Cloud’s data warehouse solution, to store and analyze large volumes of flow log data using SQL queries and advanced analytics techniques.

Related: For pre-information, you will find the following posts helpful:

  1. Observability vs Monitoring
  2. Chaos Engineering Kubernetes
  3. Distributed System Observability
  4. ICMPv6

Monitoring Microservices

Microservices Monitoring and Observability

– Containers, cloud platforms, scalable microservices, and the complexity of monitoring distributed systems have highlighted significant gaps in the microservices monitoring space that have been static for some time. As a result, you must fully understand performance across the entire distributed and complex stack, including distributed traces across all microservices.

– So, to do this, you need a solution that can collect, process, and store data used for monitoring. And the data needs to cover several domains and then be combined and centralized for analysts.

– This can be an all-in-one solution that represents or bundles different components for application observability. The bundled solutions would be, for example, an Application Performance Monitoring (APM) that consists of application performance monitoring tools or a single platform, such as Prometheus, which lives in a world of metrics only.  

Application Performance Monitoring:

Application performance monitoring typically involves tracking an application’s response time, the number of requests it can handle, and the amount of memory or other system resources it uses. This data can be used to identify any issues with application performance or scalability. Organizations can take corrective action by monitoring application performance to improve the user experience and ensure their applications run as efficiently as possible.

Identify Trends & Patterns:

Application performance monitoring also helps organizations better understand their users by providing insight into how applications are used and how well they are performing. In addition, this data can be used to identify trends and patterns in user behavior, helping organizations decide how to optimize their applications for better user engagement and experience.

Monitoring observability
Diagram: Monitoring Observability. Source is Bravengeek

**Microservices Monitoring Categories**

We have several different categories to consider. For microservices monitoring and Observability, you must first address your infrastructures, such as your network devices, hypervisors, servers, and storage. Then, you should manage your application performance and health.

Then, you need to monitor how to manage network quality and optimize when possible. For each category, you must consider white box and black box monitoring and potentially introduce new tools such as Artificial Intelligence (AI) for IT operations (AIOps).

Prevented approach to Microservice monitoring: AI and ML.

When choosing microservices observability software, consider a more preventive approach than a reactive one that is better suited for traditional environments. Prevented approaches to monitoring can use historical health and performance telemetry as an early warning with the use of Artificial Intelligence (AI) and Machine Learning (ML) techniques.

Whitebox Monitoring

White box monitoring offers more details than a black box, which tells you something is broken without telling you why. White box monitoring details the why, but you must ensure the data is easily consumable. Black box microservices monitoring can help with predictable failures and known failure modes. Still, given the creative ways that applications and systems fail today, we must examine the details of white-box microservices monitoring. Complex applications fail in unpredictable ways, often termed black holes.

New failures & failure modes

Distributing your software presents new types of failure, and these systems can fail in creative ways and become more challenging to pin down. For example, the service you’re responsible for may be receiving malformed or unexpected data from a source you don’t control because a team manages that service halfway across the globe.

White box monitoring takes a different approach from black box monitoring. It uses Instrumentation, which exposes details about the system’s internals to help you explore these black holes and better understand the creative mode in which applications fail today.

Example: Application Latency

Application latency refers to the time it takes for an application to respond to a user’s request. It is influenced by various factors such as network latency, processing time, and database queries. Monitoring and analyzing application latency can help identify bottlenecks and optimize performance.

Google Cloud Service Mesh

**What is a Cloud Service Mesh?**

A cloud service mesh is a configurable infrastructure layer for microservices applications that makes communication between service instances flexible, reliable, and fast. It typically includes a set of network proxies deployed alongside application code, which handle tasks such as load balancing, service discovery, and authentication. The service mesh enables developers to focus on the business logic while the mesh handles communication concerns.

**Key Features and Benefits**

1. **Improved Security**: One of the main advantages of a service mesh is its ability to enhance security. By managing service-to-service authentication, authorization, and encryption, it ensures that data is protected during transit.

2. **Observability**: A service mesh provides comprehensive observability through metrics, logs, and traces. This enables better monitoring and troubleshooting, helping teams identify and resolve issues quickly.

3. **Traffic Management**: Service meshes allow for sophisticated traffic management capabilities, such as load balancing, traffic splitting, and fault injection. This ensures high availability and resilience of applications.

**Google’s Approach to Service Mesh**

Google has been a pioneer in developing service mesh technology, with its flagship product, Istio. Istio is an open-source service mesh that provides a uniform way to secure, connect, and monitor microservices. Google Cloud Platform (GCP) integrates Istio to offer these capabilities as part of its suite of managed services. This integration allows developers to leverage the power of service mesh without the operational overhead of managing it themselves.

**Case Studies and Real-World Applications**

Several organizations have successfully implemented Google’s service mesh solutions to optimize their operations. For instance, e-commerce giants and financial institutions have seen significant improvements in their system reliability and security by using Istio on GCP. These real-world applications highlight the practical benefits and transformative potential of adopting a service mesh.

Introducing Cloud Trace

Cloud Trace is a comprehensive performance analysis tool provided by Google Cloud. It allows developers to trace and visualize the latency of requests across their applications. By collecting detailed information about each request, including timing data and associated events, Cloud Trace offers valuable insights into application performance.

Microservices Observability: Techniques

Collection, storage, and analytics: Regardless of what you are monitoring, the infrastructure, or the application service, monitoring requires 3 three inputs, more than likely across three domains. We require:

    1. Data collection, 
    2. Storage, and 
    3. Analysis.

We need to look at metrics, traces, and logs for these three domains or, let’s say, components. Out of these three domains, trace data is the most beneficial and excellent way to isolate performance anomalies for distributed applications. Trace data falls into distributed tracing brackets, enabling flexible consumption of capture traces. 

First, you must establish a baseline comprising the four golden signals – latency, traffic, errors, and saturation. The golden signals are good indicators of health and performance and apply to most components of your environment, such as the infrastructure, applications, microservices, and orchestration systems.

I recommend automating this baseline and the automation alerts for deviations from baselines. However, if you collect too much data, you may be alerted to too much. Service Level Indicators (SLI) can help you determine what to alert about and what matters to the user experience. 

The Effect on Microservices: Microservices Monitoring

When considering a microservice application, many believe this independent microservice is independent, but this is nothing more than an illusion. These microservices are highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices.

A typical architecture may include a backend service, a front-end service, or maybe even a docker-compose file. So, several containers must communicate to carry out operations. 

For a simple microservice architecture, we would have a simple front end minimizing a distributed application setup, where microservices serving static contents are at the front end. At the same time, the heavy lifting is done with the other service.   

**Monolith and microservices monitoring**

We have more components to monitor than we had in the monolithic world. With their traditional monolithic, there are only two components to monitor. Then, we had the applications and the hosts.

Compared to the cloud-native world, we have containerized applications orchestrated by Kubernetes with multiple components requiring monitoring. These components include the hosts, Kubernetes platform, Docker containers, and containerized microservices.

**Distributed systems have different demands**

Today, distributed systems are the norm, placing different demands on your infrastructure than the classic, three-tier application. Pinpointing issues in a microservices environment is more challenging than with a monolithic one, as requests traverse both between different layers of the stack and across multiple services. 

**The Challenges: Microservices**

The things we love about microservices are independence and idempotence, which make them difficult to understand, especially when things go wrong. As a result, these systems are often referred to as deep systems, not due to their width but their complexity.

We can no longer monitor their application by using a script to access the application over the network every few seconds, report any failures, or use a custom script to check the operating system to understand when a disk is running out of space.

Understanding saturation is an implemented signal, but it’s just one of them. It quickly becomes unrealistic for a single human, or even a group, to understand enough of the services in the critical path of even a single request and continue maintaining it. 

**Node Affinity or Taints**

Microservices-based applications are typically deployed on dynamic and transient containers. This leaves an unpredictable environment where the pods get deployed and run unless specific intent is expressed using affinity or taints. However, pod placement can still be unpredictable. The unpredictable nature of pod deployment and depth of configuration can lead to complex troubleshooting.

Understanding GKE-Native Monitoring

Prometheus Integration

GKE-Native Monitoring provides a comprehensive and real-time view of the health and performance of your Kubernetes workloads. Leveraging built-in Prometheus integration enables automatic metrics collection and aggregation, offering deep insights into resource utilization, latency, errors, and more. With GKE-Native Monitoring, developers can quickly identify bottlenecks, optimize resource allocation, and proactively detect and troubleshoot issues before they impact users.

Stackdriver Logging

GKE-Native Monitoring integrates with Stackdriver Logging, Google Cloud’s powerful log management and analysis tool. By combining metrics and logs in a unified platform, developers and operators gain complete application observability. Stackdriver Logging provides advanced filtering and querying capabilities, allowing users to search and analyze logs across multiple Kubernetes clusters quickly. With log-based metrics and alerts, teams can set up proactive monitoring to detect anomalies or specific events, ensuring the reliability and performance of their applications.

The Beginnings of Distributed Tracing

Introducing Distributed Tracing

Distributed tracing is used in microservices and other distributed applications because a single operation touches many services. It is a type of correlated logging that helps you gain visibility into the process of a distributed software system. Distributed tracing consists of collecting request data from the application and then analyzing and visualizing this data as traces.

Tracing data, in the form of spans, must be collected from the application, transmitted, and stored to reconstruct complete requests. This can be useful for performance profiling, debugging in production, and root causes analysis of failures or other incidents. 

A key point: The value of distributed tracing

Distributed tracing allows you to understand what a particular service is doing as part of the whole. Thus providing visibility into the operation of your microservice architecture. The trace data you generate can display the overall shape of your distributed system and view individual service performance inside a single request.

**Distributed Tracing Components** 

  1. What is a trace?

Consider your software in terms of requests. Each component of your software stack works in response to a request or a remote procedure call from another service. So, we have a trace encapsulating a single operation within the application, end to end, and represented as a series of spans. 

Each traceable unit of work within the operations generates a span. You can get trace data in two ways: through the Instrumentation of your service processes or by transforming existing telemetry data into trace data. 

  1. Introducing a SPAN

We call each service’s work a span, as in the period it takes for the work to occur. These spans can be annotated with additional information, such as attributes, tags, or logs. So, we can have a combination of metadata and events that can be added to spans—creating effective spans that unlock insights into the behavior of your service. The span data produced by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed, and stored for further insights.

**Example: Open Tracing**

When you are ready to get started with distributed tracing, you will come across OpenTracing. OpenTracing is a set of standards exposed as frameworks. It’s a vendor-neutral API and Instrumentation for distributed tracing. 

Open tracing does not give you the library but rather a set of rules and extensions that another library can adopt. Thus, you can use and swap around different libraries and expect the same things. 

Diagram: Distributed Tracing Example. Source is Simform

Microservices Architecture Example

Let’s examine an example of the request library for Python. So we have Requests, an elegant and simple HTTP library for Python. The request library talks to HTTP and will rely on specific standards; the standard here will be HTTP. So in Python, when making a “requests.get”.

The underlying library implementation will make a formal HTTP request using the GET method. Thus, the HTTP standards and specs lay the ground rules for what is expected from the client and the server.

The OpenTracing projects do the same thing. They set out the ground rules for distributed tracing, regardless of the implementation and the language used. They have several liabilities available in nine languages: Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, and C++.

For example, the OpenTracing API for Python implements open tracing. This set of standards for tracing with Python provides examples of what Instrumentation should look like and common ways to start a trace. 

Connect the dots with distributed tracing.

And this is a big difference in why you would use tracing and logging. Tracing allows you to connect the dots from one end of the application to the other end of the application. So, if you are starting a request on the front end and want to see how that works on the backend, that works. A trace and child traces connected will have a representation. 

Visual Representation with Jaeger: You may need to use Jaeger for the visual representation. Jaeger is an open-source end-to-end visual representation of tracing that allows you to monitor and troubleshoot transactions in complex distributed systems.

So, we have a dashboard where we can interact and search for traces. Jaeger addresses problems such as monitoring distributed tracing, performance and latency optimizations, root cause analysis, service dependency analysis, and distributed content propagation. It has clients in different languages.

So, for example, if you are using Python, there will be client library features for Python. 

The Role of OpenTelementry

OpenTelementry: We also have OpenTelementry, which is similar. It is described as an observability framework for cloud-native software and is in beta across several languages. It is geared towards traces, metrics, and logs, so it does more than OpenTracing. 

Observability means a system’s internal states can be inferred from its external outputs. Therefore, the tools used to complete an Observability system help understand the relationships between causes and effects in distributed systems.

The term Observability is borrowed from the control theory. It suggests a holistic, data-centric view of microservices monitoring that enables exploration capabilities and identifying unknown failures with the more traditional anomaly detection and notification mechanisms.

Goal: The ultimate goal of Observability is to :

  • Improving baseline performance
  • Restoring baseline performance (after a regression)

By improving the baseline, you improve the user experience. This could be because performance often means request latency for user-facing applications. Then, we have performance regressions, including application outages, which can result in a loss of revenue and negatively impact the brand. The regressions’ time accepted comes down to user expectations. What is accessible, and what is in the SLA?

With chaos engineering tests, you understand your limits and the new places where your system and applications can be made. Chaos Engineering helps you know your system by introducing controlled experiments when debugging microservices. 

A key point: Massive amount of data

Remember that instrumenting potentially generates massive amounts of data, which can cause challenges in storing and analyzing. You must collect, store, and analyze data across the metrics, traces, and logs domains. Then, you need to alert me to these domains and what matters most, not just when an arbitrary threshold is met.

The role of metrics: Most people know a metric comprising a value, timestamp, and metadata. Metrics are collections of statistics that need to be analyzed over time. A single instance of a metric is of limited value. Examples include request rate, average duration, and queue size. These values are usually captured as time series so that operators can see and understand changes to metrics over time. 

Add labels to metric: We can add labels as key-value pairs to better understand metrics. The labels add additional context to this data point. So, the label is a key-value pair indexed with the metrics as part of the injection process. In addition, metrics can now be broken down into sub-metrics.

As we enter the world of labels and tags for metrics, we need to understand the effects this may have on Cardinality. While each indexed label value adds time series, this will come at storage and processing costs. Therefore, we use Cardinality to understand the impact of labels on a metric store.

Aggregated metrics: I continue to see the issue that metrics are typically aggregated every minute or even six to twelve times per minute. However, metrics must be aggregated and visualized within at most one minute but ideally even more quickly. Key questions are: What is the window across which values are aggregated? How are the windows from different sources aligned?

A key point: The issues of Cardinality

Aggregated Metrics allow you to get an aggregate understanding of what’s happening to all instances of a given service and even narrow your query to specific groups of services but fail to account for infinite Cardinality. Due to issues with “high-cardinality” within a time series storage engine, it is recommended to use labels rather than hierarchical naming for metrics.

Prometheus Monitoring

Examples: Push and Pull

So, to get metric, you need to have a push or pull approach. A push agent transmits data upstream and, more than likely, on a scheduled basis. A pull agent expects to be polled. Then, we have Prometheus and several Prometheus metric types. We have a Prometheus server with a pull approach that fits better into larger environments.

Prometheus does not use the term agent and has what is known as exporters. They allow the Prometheus server to pull metrics back from software that cannot be instrumented using the Prometheus client libraries.

a) Prometheus Kuberentes

Prometheus Kubernetes is an open-source monitoring platform that originated at SoundCloud and was released in 2012. Its capabilities include metric collection, storage, data analyses, and visualizations. For visualizations, we can use Prometheus and Grafana.

b) Storing Metrics

You can sort metrics that are time-series data in a general-purpose relational database. However, they should be stored in an optimized repository for storing and retrieving time-series data. We have several time-series storage options, such as Altas, InfluxDB, and Prometheus. Prometheus stands out, but keep in mind that, as far as I’m aware, there is no commercial support and limited professional services for Prometheus.

c) The Role of Logs

Then, we have highly detailed logs. Logs can be anything, unlike metrics, which have a daily uniform format. However, logs do provide you with why something is broken. Logs capture activity that can be printed to the screen or sent to a backend to be centrally stored and viewed.

There is very little standard structure to logs apart from a timestamp indicating when the event occurred. There is minimal log schema, and log structure will depend on how the application uses it and how developers create logs.

d) Emitting Logs

Logs are emitted by almost every entity, such as the basic infrastructure, network and storage, servers and computer notes, operating system nodes, and application software. So, there are various log sources and several tools involved in transport and interpretation, making log collection a complex task. However, remember that you may assume a large amount of log data must be stored.

Search engines such as Google have developed several techniques for searching extensive datasets using arbitrary queries, which have proved very efficient. All of these techniques can be applied to log data.  

e) Logstash, Beats, and FluentD

Logstash is a cloud-scale ingestion tool and is part of the Elasticsearch suit. However, there have been concerns with the performance and scalability of Logstash, which brings us to the lightweight version of Beats. So, if you don’t need the sophisticated data manipulation and filtering of Logstash, you can use Beasts. FluentD provides a unified logging layer or a way to aggregate logs from many different sources and distribute them to many destinations with the ability to transform data.

f) Storing Logs

Structure data such as logs and events are made of key-value pairs, any of which may be searched. This leads us to repositories called nonrelational or no SQL databases. So, storing logs represents a different storage problem from that of metrics. Examples of KV databases include Memcache and Redis.

However, they are not a good choice for log storage due to the inefficiency of indexing and searching. The ELK stack has an indexing and searching engine, a collector, a Logstash, a visualization tool, and the dominant storage mechanism for soft log and event data.

A key point: Analyze logs with AI

So, once you store the logs, they need to be analyzed and viewed. Here, you could, for example, use Splunk. Its data analysis capabilities range from security to AI for IT operations (AIOps). Kibana, part of the Elastic Stack, can also be used.

Summary: Monitoring Microservices

Monitoring microservices has become a critical aspect of maintaining the performance and reliability of modern applications. With the increasing adoption of microservices architecture, understanding how to monitor and manage these distributed systems effectively has become indispensable. In this blog post, we explored the key considerations and best practices for monitoring microservices.

The Need for Comprehensive Monitoring

Microservices are highly distributed and decentralized, which poses unique challenges regarding monitoring. Traditional monolithic applications are more accessible to monitor, but microservices require a different approach. Understanding the need for comprehensive monitoring is the first step toward ensuring the reliability and performance of your microservices-based applications.

Choosing the Right Monitoring Tools

This section will delve into the various monitoring tools available for monitoring microservices. From open-source solutions to commercial platforms, there is a wide range of options. We will discuss the critical criteria for selecting a monitoring tool: scalability, real-time visibility, alerting capabilities, and integration with existing systems.

Defining Relevant Metrics

To effectively monitor microservices, it is essential to define relevant metrics that provide insights into the health and performance of individual services as well as the overall system. In this section, we will explore the key metrics to monitor, including response time, error rates, throughput, resource utilization, and latency. We will also discuss the importance of setting appropriate thresholds for these metrics to trigger timely alerts.

Implementing Distributed Tracing

Distributed tracing plays a crucial role in understanding the flow of requests across microservices. By instrumenting your services with distributed tracing, you can gain visibility into the entire request journey and identify bottlenecks or performance issues. We will explore the benefits of distributed tracing and discuss popular tracing frameworks like Jaeger and Zipkin.

Automating Monitoring and Alerting

Keeping up with the dynamic nature of microservices requires automation. This section will discuss the importance of automated monitoring and alerting processes. From automatically discovering new services to scaling monitoring infrastructure, automation plays a vital role in ensuring the effectiveness of your monitoring strategy.

Conclusion:

Monitoring microservices is a complex task, but with the right tools, metrics, and automation in place, it becomes manageable. By understanding the unique challenges of monitoring distributed systems, choosing appropriate monitoring tools, defining relevant metrics, implementing distributed tracing, and automating monitoring processes, you can stay ahead of potential issues and ensure optimal performance and reliability for your microservices-based applications.

Cisco ACI

Cisco ACI | ACI Infrastructure

Cisco ACI | ACI Infrastructure

In the ever-evolving landscape of network infrastructure, Cisco ACI (Application Centric Infrastructure) stands out as a game-changer. This innovative solution brings a new level of agility, scalability, and security to modern networks. In this blog post, we will delve into the world of Cisco ACI, exploring its key features, benefits, and the transformative impact it has on network operations.

At its core, Cisco ACI is a software-defined networking (SDN) solution that provides a holistic approach to managing and automating network infrastructure. It combines physical and virtual elements, allowing for simplified policy-based management and enhanced visibility across the entire network fabric. By abstracting network services from the underlying hardware, Cisco ACI enables organizations to achieve greater flexibility and efficiency in network operations.

Cisco ACI offers a wide array of features that empower organizations to optimize their network infrastructure. Some of the notable features include:
1. Application-Centric Policy Model: Cisco ACI shifts the focus from traditional network-centric approaches to an application-centric model. This means that policies are built around applications and their specific requirements, allowing for more granular control and easier application deployment.

2. Automated Network Provisioning: With Cisco ACI, network provisioning becomes a breeze. The solution automates the configuration and deployment of network resources, eliminating manual errors and significantly reducing the time required to provision new services.

3. Enhanced Security and Microsegmentation: Security is a top priority in today's digital landscape. Cisco ACI incorporates advanced security capabilities, including microsegmentation, which enables organizations to isolate and secure different parts of the network, reducing the attack surface and improving overall security posture.

Implementing Cisco ACI involves a well-planned deployment and integration strategy. It seamlessly integrates with existing network infrastructure, making it easier for organizations to adopt and extend their networks. Whether it's a greenfield deployment or a gradual migration from legacy systems, Cisco ACI provides a smooth transition path, ensuring minimal disruption to ongoing operations.

To truly grasp the power of Cisco ACI, let's explore some real-world use cases where organizations have leveraged this technology to revolutionize their network infrastructure:

1. Data Centers: Cisco ACI simplifies data center operations, enabling organizations to achieve greater agility, scalability, and automation. It provides a centralized view of the entire data center fabric, allowing for efficient management and faster application deployments.

2. Multi-Cloud Environments: With the rise of multi-cloud environments, managing network connectivity and security across different cloud providers becomes a challenge. Cisco ACI offers a unified approach to network management, making it easier to extend policies and maintain consistent security across multiple clouds

Cisco ACI is a transformative force in the world of network infrastructure. Its application-centric approach, automated provisioning, enhanced security, and seamless integration capabilities make it a compelling choice for organizations seeking to modernize their networks. By embracing Cisco ACI, businesses can unlock new levels of efficiency, scalability, and agility, enabling them to stay ahead in today's digital landscape.

Highlights:Cisco ACI | ACI Infrastructure

The ACI Cisco Architecture

The ACI Cisco operates with several standard ACI building blocks. These include Endpoint Groups (EPGs) that classify and group similar workloads; then, we have the Bridge Domains (BD), VRFs, Contract constructs, COOP protocol in ACI, and micro-segmentation. With micro-segmentation in the ACI, you can get granular policy enforcement right on the workload anywhere in the network.

Unlike in the traditional network design, you don’t need to place certain workloads in specific VLANs or, in some cases, physical locations. The ACI can incorporate devices separate from the ACI, such as a firewall, load balancer, or an IPS/IDS, for additional security mechanisms. This enables the dynamic service insertion of Layer 4 to Layer 7 services. Here, we have a lot of flexibility with the redirect option and service graphs.

1: – The ACI Infrastructure

The Cisco ACI architecture is optimized to learn endpoints dynamically with its dynamic endpoint learning functionality. So, we have endpoint learning in the data plane. Therefore, the other devices learn of the endpoints connected to that local leaf switch; the spines have a mapping database that saves many resources on the spine and can optimize the data traffic forwarding. So you don’t need to flood traffic any more. If you want, you can turn off flooding in the ACI fabric. Then, we have an overlay network.

As you know, the ACI network has both an overlay and a physical underlay; this would be a virtual underlay in the case of Cisco Cloud ACI. The ACI uses VXLAN, the overlay protocol that rides on top of a simple leaf and spine topology, with standards-based protocols such as IS-IS and BGP for route propagation. 

2: – ACI Networks

ACI Networks also introduces the concept of the Application Policy Infrastructure Controller (APIC), which acts as the central point of control for the network. The APIC allows administrators to define and enforce network policies, monitor performance, and troubleshoot issues.

In addition to network virtualization and policy management, ACI Cisco offers a range of other features. These include integrated security, intelligent workload placement, and seamless integration with other Cisco products and technologies.

3: – COOP Protocol in ACI

The spine proxy receives mapping information (location and identity) via the Council of Oracle Protocol (COOP). Using Zero Message Queue (ZMQ), leaf switches forward endpoint address information to spine switches. As part of COOP, the spine nodes maintain a consistent copy of the endpoint address and location information and maintain the distributed hash table (DHT) database for mapping endpoint identity to location.

4: – Micro-segmentation

Integrated security is achieved through micro-segmentation, which allows administrators to define fine-grained security policies at the application level. This helps to prevent the lateral movement of threats within the network and provides better protection against attacks.

Intelligent workload placement ensures that applications are placed in the most appropriate locations within the network based on their specific requirements. This improves application performance and resource utilization.

Data Center Network Challenges

Let us examine well-known data center challenges and how the Cisco ACI network solves them.

Challenge: Traditional Complicated Topology

A traditional data center network design usually uses core distribution access layers. When you add more devices, this topology can be complicated to manage. Cisco ACI uses a simple spine-leaf topology, wherein all the connections within the Cisco ACI fabric are from leaf-to-spine switches, and a mesh topology is between them. There is no leaf-to-leaf and no spine-to-spine connectivity.

Required: How ACI Cisco overcomes this

The Cisco ACI architecture uses the leaf-spine, consisting of a two-tier “fat tree” topology with equidistant bandwidths. The leaf layer connects to the physical and virtual workloads and network services, while the spine layer is the transport layer that interconnects the leaves.

Challenge: Oversubscription

Oversubscription generally means potentially requiring more resources from a device, link, or component than are available. Therefore, the oversubscription ratio must be examined at multiple aggregation points in the design, including the line card to switch fabric bandwidth and the switch fabric input to uplink bandwidth.

Oversubscription Example

Let’s look at a typical 2-layer network topology with access switches and a central core switch. The access switches have 24 user ports and one uplink port connected to the core switch. Each access switch has 24 1Gb user ports and a 10Gb uplink port. So, in theory, if all the user ports are transmitted to a server simultaneously, they would require 24 GB of bandwidth (24 x 1 GB).

However, the uplink port is only 10, limiting the maximum bandwidth to all the user ports. The uplink port is oversubscribed because the theoretical required bandwidth (24Gb) exceeds the available bandwidth (10Gb). Oversubscription is expressed as a ratio of bandwidth needed to available bandwidth. In this case, it’s 24Gb/10Gb or 2.

Challenge: Varying bandwidths

We have layers of oversubscription with the traditional core, distribution, and access designs. We have oversubscribed at the access, distribution, and core layers. The cause of this will give varying bandwidth to endpoints if they want to communicate with an endpoint that is near or an endpoint that is far away. With this approach, endpoints on the same switch will have more bandwidth than two endpoints communicating across the core layer.

Users and application owners don’t care about networks; they want to place their workload wherever the computer is and want the same BW regardless of where you put it. However, with traditional designs, the bandwidth available depends on where the endpoints are located.

Required: How ACI Cisco overcomes this

The ACI leaf and spine have equidistant endpoints between any two endpoints. So if any two servers have the same bandwidths, which is a big plus for data center performance, then it doesn’t matter where you place the workload, which is a big plus for virtualized workloads. This gives you unlimited workload placement.

Challenge: Lack of portability

Applications are built on top of many building blocks. We use contracts such as VLANs, IP addresses, and ACLs to create connectivity. We use these constructs to create and translate the application requirements to the network infrastructure. These constructs are hardened into the network with configurations applied before connectivity.

These configurations are not very portable. It’s not that they were severely designed; they were never meant to be portable. Location Independent Separation Protocol (LISP) did an excellent job making them portable. However, they are hard-coded for a particular requirement at that time. Therefore, if we have the exact condition in a different data center location, we must reconfigure the IP address, VLANs, and ACLs. 

Required: How ACI Cisco overcomes this

An application refers to a set of networking components that provides connectivity for a given set of workloads. These workloads’ relationship is what ACI calls an “application,” and the connection is expressed by what ACI calls an application network profile. With a Cisco ACI design, we can create what is known as Application Network Profiles (ANPs).

The ANP expresses the relationship between the application and its communications. It is a configuration template used to express the relationship between segments. The ACI then translates those relationships into networking constructs such as VLANs, VXLAN, VRF, and IP addresses that the devices in the network can then implement.

Challenge: Issues with ACL

The traditional ACL is very tightly coupled with the network topology, and anything that is tightly coupled will kill agility. It is configured on a specific ingress and egress interface and pre-set to expect a particular traffic flow. These interfaces are usually at demarcation points in the network. However, many other points in the network could do so with security filtering.

Required: How ACI Cisco overcomes this

The fundamental security architecture of the Cisco ACI design follows an allow-list model, where we explicitly define what traffic should be permitted. A contract is a policy construct used to describe communication between EPGs. Without a contract, no unicast communication is possible between those EPGs unless the VRF is configured in “unenforced” mode or those EPGs are in a preferred group.

A contract is not required to communicate between endpoints in the same EPG (although transmission can be prevented with intra-EPG isolation or intra-EPG contract). We have a different construct for applying the policy in ACI. We use the contract construct, and within the contract construct, we have subjects and filters that specify how endpoints are allowed to communicate.

These managed objects are not tied to the network’s topology because they are not applied to a specific interface. Instead, the contracts are used in the intersection between EPGs. They represent rules the network must enforce irrespective of where these endpoints are connected.   

Challenge: Issues with Spanning Tree Protocol (STP)

A significant shortcoming of STP is that it is a brittle failure mode that can bring down entire data centers or campus networks when something goes wrong. Though modifications and enhancements have addressed some of these risks, this has happened at the cost of technical debt in design and maintenance.

When you think about how this works, we have a BPDU that acts as a HELLO mechanism. When we stop receiving the BPDUs and the link stays up, we forward all the links. So, spanning Tree Protocol causes outages.

Spanning Tree Root Switch stp port states

Required: How ACI Cisco overcomes this

The Cisco ACI does not run Spanning Tree Protocol natively, meaning the ACI control plane does not run STP. Inside the fabric, we are running IS-IS as the interior routing protocol. If we stop receiving, we don’t go into an all-forwarding state with IS-IS. As we have IP reachability between Leaf and Spine, we don’t have to block ports and see actual traffic flows that are not the same as the physical topology.

Within the ACI fabric, we have all the advantages of layer three networks, which are more robust and predictable than with an STP design. With ACI, we don’t rely on SPT for the topology design. Instead, the ACI uses ECMP for layer 2 and Layer 3 forwarding. We can use ECMP because we have routed links between the leaves and spines in the ACI fabric.

Challenge: Core-distribution design

The traditional design uses VLANs to segment Layer 2 boundaries and broadcast domains logically. VLANs use network links inefficiently, resulting in rigid device placement. We also have a cap on the number of VLANs we can create. Some applications require that you need Layer 2 adjacencies.

For example, clustering software requires Layer 2 adjacency between source and destination servers. However, if we are routing at the access layer, only servers connected to the same access switch with the same VLANs trunked down would be Layer 2-adjacent. 

Required: How ACI Cisco overcomes this

VXLAN solves this dilemma in ACI by decoupling Layer 2 domains from the underlying Layer 3 network infrastructure. With ACI, we use the concept of overlays to provide this abstraction. Isolated Layer 2 domains can be connected over a Layer 3 network using VXLAN, and packets are transported across the fabric using Layer 3 routing.

This paradigm fully supports layer 2 networks. Large layer-2 domains will always be needed, for example, for VM mobility, clusters that don’t or can’t use dynamic DNS and non-IP traffic, and broadcast-based intra-subnet communication.

**Cisco ACI Architecture: Leaf and Spine**

The fabric is symmetric with a leaf and spine design, and bandwidth is consistent across it. Regardless of where a device connects to the fabric, it has the same bandwidth as every other device connected to the same fabric. This removes the placement restrictions of traditional data center designs. A spine-leaf architecture is a data center network topology that consists of two switching layers: a spine and a leaf.

The leaf layer comprises access switches that aggregate server traffic and connect directly to the spine or network core. Spine switches interconnect all leaf switches in a full-mesh topology.

Optimized traffic flows with low-latency east-west traffic are imperative for performance, especially for time-sensitive or data-intensive applications. A spine-leaf architecture aids this by ensuring traffic is always the same number of hops from its destination, so latency is lower and predictable.

Displaying a VXLAN tunnel 

We have expanded the original design and added VXLAN. We are creating a Layer 2 network, specifically, a Layer 2 overlay over a Layer 3 routed core. The Layer 2 extension allows the hosts, desktop 0 and desktop 1, to communicate over the Layer 2 overlay that VXLAN creates.

The hosts’ IP addresses are 10.0.0.1 and 10.0.0.2, which are not reachable via the Leaf switches. The Leaf switches cannot ping these. Consider the Leaf and Spine switches a standard Layer 3 WAN or network for this lab. So, we have unicast connectivity over the WAN.

The only IP routing addition I have added is the new loopback addresses on Leafs 1 and 2, of 1.1.1.1/32 and 2.2.2.2/32, used for ingress replication for VXLAN. Remember that the ACI is one of many products that use Layer 2 overlays. VXLAN can be used as a Layer 2 DCI.

Notice below I am running a ping from desktop 0 to the corresponding desktop. These hosts are in the 10.0.0.0/8 range, and the core does not know these subnets. I’m also running a packet capture on the link Gi1 connected to Leaf A.

Notice that the source and destination are 1.1.1.1 and 2.2.2.2, which are the VTEPs. The ICMP traffic is encapsulated into UDP port 1024, which is explicitly set in the configuration as the VXLAN port to use.
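
To get a feel for what the capture on Gi1 decodes, here is a hedged Scapy sketch that builds the same kind of VXLAN-encapsulated ICMP packet: the outer header uses the VTEP loopbacks and UDP port 1024 from this lab, while the inner frame carries the desktop hosts' addresses. The VNI and MAC addresses are made up purely for illustration.

```python
from scapy.all import Ether, IP, UDP, ICMP
from scapy.layers.vxlan import VXLAN

# Outer header: VTEP-to-VTEP across the routed core (Leaf A -> Leaf B).
outer = (
    Ether()
    / IP(src="1.1.1.1", dst="2.2.2.2")   # VTEP loopbacks from this lab
    / UDP(sport=49152, dport=1024)       # VXLAN port explicitly set to 1024 here
    / VXLAN(vni=10010)                   # VNI is illustrative
)

# Inner frame: the original ICMP echo between the two desktops.
inner = (
    Ether(src="00:50:56:00:00:01", dst="00:50:56:00:00:02")  # made-up MACs
    / IP(src="10.0.0.1", dst="10.0.0.2")
    / ICMP()
)

packet = outer / inner
packet.show()   # Display the layered encapsulation as a capture would decode it
```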

ACI Network: VXLAN transport network

In a leaf-spine ACI fabric, we have a native Layer 3 IP fabric that supports equal-cost multi-path (ECMP) routing between any two endpoints in the network. Using VXLAN as the overlay protocol allows any workload to exist anywhere in the network.

We can have physical and virtual machines in the same logical layer 2 domain while running layer 3 routing to the top of each rack. Thus, we can connect several endpoints to each leaf, and for one endpoint to communicate with another, we use VXLAN.

So, the transport of the ACI fabric is carried out with VXLAN. The ACI encapsulates traffic with VXLAN and forwards the data traffic across the fabric. Any policy that needs to be implemented gets applied at the leaf layer. All traffic on the fabric is encapsulated with VXLAN. This allows us to support standard bridging and routing semantics without the standard location constraints.

Diagram: VXLAN operations. The source is Cisco.

Council of Oracle Protocol

COOP protocol in ACI and the ACI fabric

The fabric appears to the outside as one switch capable of forwarding at Layers 2 and 3. In addition, the fabric is a Layer 3 routed network in which all links are active, providing ECMP forwarding for both Layer 2 and Layer 3 traffic. Inside the fabric, we run routing protocols such as BGP and Intermediate System-to-Intermediate System (IS-IS), and we use the Council of Oracle Protocol (COOP) to distribute the endpoint information needed for endpoint-to-endpoint forwarding.

The COOP protocol in ACI communicates the mapping information (location and identity) to the spine proxy. A leaf switch forwards endpoint address information to the spine switch ‘Oracle’ using Zero Message Queue (ZMQ). The COOP protocol in ACI is something new to data centers. The Leaf switches use COOP to report local station information to the Spine (Oracle) switches.

COOP protocol in ACI

Let’s look at an example of how the COOP protocol in ACI works. We have a Leaf that learns of a host. The Leaf reports this information—let’s say it knows Host B—and sends it to one of the Spine switches chosen randomly using the Council Of Oracle Protocol.

The Spine switch then relays this information to all the other Spines in the ACI fabric so that every Spine has a complete record of every single endpoint. The Spine switches record the information learned via COOP in the Global Proxy Table, which is used to resolve unknown destination MAC/IP addresses when traffic is sent to the proxy address.
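
The spine proxy behavior can be pictured as a simple lookup table keyed on the endpoint identity. Below is a purely conceptual Python sketch of such a table; it is not the actual COOP data structure, just a way to visualize what the spines store and resolve. The VTEP addresses and leaf names are invented for illustration, while the VNID and MAC reuse the values shown later in this section.

```python
# Conceptual model of the spine "Global Proxy Table" built via COOP.
# Keyed by (bridge-domain VNID, endpoint MAC) -> the VTEP (leaf) where it lives.
proxy_table = {
    (16154554, "0050.5690.3eeb"): {"vtep": "10.0.72.64", "leaf": "leaf-101"},  # illustrative VTEP
    (16154554, "0050.5690.4a1c"): {"vtep": "10.0.72.67", "leaf": "leaf-102"},  # illustrative entry
}

def report_endpoint(vnid: int, mac: str, vtep: str, leaf: str) -> None:
    """A leaf reports a locally learned endpoint; every spine records it."""
    proxy_table[(vnid, mac)] = {"vtep": vtep, "leaf": leaf}

def spine_proxy_lookup(vnid: int, mac: str):
    """Resolve an unknown destination the way the spine proxy conceptually would."""
    return proxy_table.get((vnid, mac))

print(spine_proxy_lookup(16154554, "0050.5690.3eeb"))
```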

COOP database.

So, we know that each spine holds a COOP database of all endpoints in the fabric, built from the location and identity information that the leaf switches report to the spine 'Oracle' over Zero Message Queue (ZMQ).

The command: Show coop internal info repo key allows us to verify that the endpoint is in the COOP database using the BD VNID of 16154554 mapped to the MAC address of 0050.5690.3eeb. With this command, you can also see the tunnel next hop and IPv4 and IPv6 addresses tied to this MAC address.

Diagram: COOP protocol in ACI

**The fabric constructs**

The ACI Fabric contains several new network constructs specific to ACI that enable us to abstract much of the complexity we had with traditional data center designs. These new concepts are ACI’s Endpoint Groups, Contracts, Bridge Domains, and COOP protocol.

In addition, we have a distributed Layer 3 Anycast gateway function that ensures optimal Layer 3 and Layer 2 forwarding. We also have familiar constructs you may have used before, such as VRFs. The Layer 3 Anycast gateway feature is popular and allows flexible placement of the default gateway, which suits agile designs.

Related: For pre-information, you may find the following helpful:

  1. Data Center Security
  2. Data Center Topologies
  3. Dropped Packet Test
  4. DMVPN
  5. Stateful Inspection Firewall
  6. Cisco ACI Components

Cisco ACI | ACI Infrastructure

ACI Infrastructure

Below are the key components that make up the ACI Cisco architecture. By understanding these components, network administrators and IT professionals can harness the power of ACI to optimize their data center operations.

1. Application Policy Infrastructure Controller (APIC):

The cornerstone of the Cisco ACI architecture is the Application Policy Infrastructure Controller (APIC). APIC is the central management and policy engine for the entire ACI fabric. It provides a single point of control, enabling administrators to define and enforce policies that govern the behavior of applications and services within the data center. APIC offers a user-friendly interface for policy configuration, monitoring, and troubleshooting, making it an essential component for managing the ACI fabric.

2. Spine Switches:

Spine switches form the backbone of the ACI fabric. These high-performance switches provide connectivity between leaf switches and facilitate east-west traffic within the fabric. Spine switches operate at Layer 3 and use routing protocols to efficiently distribute traffic across the fabric. With the ability to handle massive amounts of data, spine switches ensure high-speed connectivity and optimal performance in the ACI environment.

3. Leaf Switches:

Leaf switches act as the access layer switches in the ACI fabric. They connect directly to the endpoints, such as servers, storage devices, and other network devices, and serve as the entry and exit points for traffic entering and leaving the fabric. Leaf switches provide Layer 2 connectivity for endpoint devices and Layer 3 connectivity for communication between endpoints within the fabric. They also play a crucial role in implementing policy enforcement and forwarding traffic based on predefined policies.

**Example: Cisco ACI & IS-IS**

Cisco ACI under the covers runs ISIS. The ISIS routing protocol is an Interior Gateway Protocol (IGP) that enables routers within a network to exchange routing information and make informed decisions on the best path to forward packets. It operates at the OSI model’s Layer 2 (Data Link Layer) and Layer 3 (Network Layer).

ISIS organizes routers into logical groups called areas, simplifying network management and improving scalability. It allows for hierarchical routing, reducing the overhead of exchanging routing information across large networks.

**Note: IS-IS Parameters**

Below, we have four routers. R1 and R2 are in area 12, and R3 and R4 are in area 34. R1 and R3 are intra-area routers, so they will be configured as level 1 routers. R2 and R4 form the backbone, so these routers will be configured as level 1-2.

Network administrators need to configure IS-IS parameters on each participating router to implement IS-IS. These parameters include the router's IS-IS system ID, area assignments, and interface settings. Unlike OSPF, IS-IS does not run over IP; it exchanges routing information in its own PDUs (hellos, link-state PDUs, and sequence number PDUs) carried directly over the data link layer.

Diagram: Routing Protocol. ISIS.

4. Application Network Profiles (ANPs):

Application Network Profiles (ANPs) are a key Cisco ACI policy model component. They define the policies and configurations required for specific applications or application groups and encapsulate all the necessary information, including network connectivity, quality of service (QoS) requirements, security policies, and service chaining.

By associating endpoints with ANPs, administrators can easily manage and enforce consistent policies across the ACI fabric, simplifying application deployment and ensuring compliance.

5. Endpoint Groups (EPGs):

Endpoint Groups (EPGs) are logical containers that group endpoints with similar network requirements. EPGs provide a way to define and enforce policies at a granular level. Endpoints within an EPG share common policies, such as security, QoS, and network connectivity.

This grouping allows administrators to apply policies consistently to specific endpoints, regardless of their physical location within the fabric. EPGs enable seamless application mobility and simplify policy enforcement within the ACI environment.

**Specific ACI Cisco architecture**

In some of the lab guides in this blog post, we are using the following hardware from a rack rental from Cloudmylabs. Remember that the ACI Fabric is built on the Nexus 9000 Product Family.

The Cisco Nexus 9000 Series Switches are designed to meet the increasing demands of modern networks. With high-performance capabilities, these switches deliver exceptional speeds and low latency, ensuring smooth and uninterrupted data flow. They support high-density 10/25/40/100 Gigabit Ethernet interfaces, allowing businesses to scale and adapt to growing network requirements.

Enhanced Security:

The Cisco Nexus 9000 Series Switches offer comprehensive security features to protect networks from evolving threats. They leverage Cisco TrustSec technology, which provides secure access control, segmentation, and policy enforcement. With integrated security features, businesses can mitigate risks and safeguard critical data, ensuring peace of mind.

Application Performance Optimization:

To meet the demands of modern applications, the Cisco Nexus 9000 Series Switches are equipped with advanced features that optimize application performance. These switches support Cisco Tetration Analytics, which provides deep insights into application behavior, enabling businesses to enhance performance, troubleshoot issues, and improve efficiency.

Diagram: The source is Cloudmylabs.

Cisco ACI Simulator:

Below is a screenshot from the Cisco ACI Simulator. At the start, you will be asked for the details of the fabric. Remember that once you set the out-of-band management address for the APIC, you need to change the port group settings on the ESXi VM network. If you don't change “Promiscuous mode, MAC address changes, and Forged Transmits,” you cannot access the UI from your desktop.

Diagram: Cisco ACI fabric Details

Leaf and spine design

Network Design Methodology

Leaf and spine architecture is a network design methodology commonly used in data centers. It provides a scalable and resilient infrastructure that can handle the increasing demands of modern applications and services. The term “leaf and spine” refers to the physical and logical structure of the network.

In leaf and spine architecture, the network is divided into two main layers: the leaf and spine layers. The leaf layer consists of leaf switches connected to the servers or endpoints in the data center. These leaf switches act as the access points for the servers, providing high-bandwidth connectivity and low-latency communication.

The spine layer, on the other hand, consists of spine switches that connect the leaf switches. The spine switches provide high-speed and non-blocking interconnectivity between the leaf switches, forming a fully connected fabric. This allows for efficient and predictable traffic patterns, as any leaf switch can communicate directly with any other leaf switch through the spine layer.

ACI Cisco with leaf and spine.

The following lab guide has a leaf and spine ACI design with two leaf switches acting as the leaf layer where the workloads connect, and a spine layer attached to the leaves. When the ACI hardware installation is done, all spines and leaves are cabled and powered up. Once the basic configuration of the APIC is completed, the fabric discovery process starts.

Note: IFM process

In the discovery process, ACI uses the Intra-Fabric Messaging (IFM) process in which APIC and nodes exchange heartbeat messages.

The process the APIC uses to push policy to the fabric leaf nodes is called the IFM process. ACI fabric discovery is completed in three stages. The leaf node directly connected to the APIC is discovered in the first stage. The second stage discovers the spines connected to that initial leaf. The third stage discovers the remaining leaf nodes and APICs in the cluster.

The fabric membership diagram below shows the inventory, including serial number, Pod, Node ID, Model, Role, Fabric IP, and Status. Cisco ACI consists of the following hardware components: APIC Controller Spine Switches and Leaf Switches.

Diagram: ACI fabric discovery

Analysis: Overlay based on VXLAN

Cisco ACI uses an overlay based on VXLAN to virtualize the physical infrastructure. Like most overlays, this one requires the data path at the network's edge to map the tenant endpoint address in the packet, referred to as its identifier, to the endpoint's location, known as its locator. This mapping occurs in a function called the VXLAN tunnel endpoint (VTEP).

VTEP Addressing

The VTEP addresses are displayed in the INFRASTRUCTURE IP column. The TEP address pool 10.0.0.0/16 has been configured on the Cisco APIC using the initial setup dialog. The APIC assigns the TEP addresses to the fabric switches via DHCP, so the infrastructure IP addresses in your fabric will differ from the figure.
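
The TEP addressing can be pictured with a tiny sketch that hands out addresses from the configured 10.0.0.0/16 pool, much as the APIC's DHCP function does for each discovered switch. The allocation order and node names below are illustrative; the APIC decides the actual assignments.

```python
import ipaddress

# TEP pool configured during the APIC initial setup dialog.
tep_pool = ipaddress.ip_network("10.0.0.0/16")
hosts = tep_pool.hosts()   # iterator over usable addresses in the pool

# Hand the next free TEP to each fabric node as it is discovered (illustrative order).
assignments = {node: next(hosts) for node in ["leaf-101", "leaf-102", "spine-201"]}
for node, tep in assignments.items():
    print(node, tep)
```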

This configuration is perfectly valid for a lab but not for a production environment. The minimum physical fabric hardware for a production environment includes two spines, two leaves, and three APICs. In addition to discovering and configuring the fabric and applying the tenant design, the following functionality can be configured:

  1. Routing at Layer 3
  2. Connecting a legacy network at layer 2
  3. Virtual Port Channels at Layer 2

Note: Border Leaf

A note about border leafs: ACI fabrics often use this designation along with “Compute Leafs” and “Storage Leafs.” Border Leaf is merely a convention for identifying the leaf pair that hosts all connectivity external to the fabric, just as Compute Leaf identifies the leaf pair that hosts server connectivity.

Note: The Link Layer Discovery Protocol (LLDP)

LLDP is responsible for discovering directly adjacent neighbors. When run between the Cisco APIC and a leaf switch, it precedes three other processes: Tunnel endpoint (TEP) IP address assignment, node software upgrade (if necessary), and the intra-fabric messaging (IFM) process, which the Cisco APIC uses to push policy to the leaves.


Leaf and Spine: Traffic flows

The leaf and spine network topology is suitable for east-to-west network traffic and comprises leaf switches to which the workloads connect and spine switches to which the leaf switches connect. The spines have a simple role and are geared around performance, while all the intelligence is distributed to the edge of the network, where the leaf layers sit.

This allows engineers to move away from managing individual devices and more efficiently manage the data center architecture with policy. In this model, the Application Policy Infrastructure Controller (APIC) controllers can correlate information from the entire fabric.

**Understanding Leaf and Spine Traffic Flow**

In a leaf and spine architecture, traffic flow follows a structured path. When a device connected to a leaf switch wants to communicate with another device, the traffic is routed through the spine switch to the destination leaf switch. This approach minimizes the hops required for data transmission and reduces latency. Additionally, traffic can be evenly distributed since every leaf switch is connected to every spine switch, preventing congestion and bottlenecks.

**ACI Cisco with leaf and spine**

In the following lab guide, we continue to verify the ACI leaf and spine. To check the ACI fabric, we can run the diagnostic tool acidiag fnvread. It is also recommended that the LLDP and IS-IS adjacencies be checked. With a leaf and spine design, the leaf switches do not connect to each other, and we can see this in the LLDP and IS-IS adjacency information below.

Diagram: ACI leaf and spine

Leaf and Spine Switch Functions

Based on a two-tier (spine and leaf switches) or three-tier (spine switch, tier-1 leaf switch, and tier-2 leaf switch) architecture, Cisco ACI switches provide the following functions:

What are Leaf Switches?

Leaf switches connect end devices, such as servers, to the network fabric. They are typically deployed in a leaf-spine network architecture, connecting directly to the spine switches. Leaf switches provide high-speed, low-latency connectivity to end devices within a data center network.

Functionalities of Leaf Switches:

1. Aggregation: Leaf switches aggregate traffic from multiple servers and send it to the spine switches for further distribution. This aggregation helps reduce the network's complexity and enables efficient traffic flow.

2. High-density Port Connectivity: Leaf switches are designed to provide a high-density port connectivity environment, allowing multiple devices to connect simultaneously. This is crucial in data centers where numerous servers and devices must be interconnected.

These devices have ports connected to classic Ethernet devices, such as servers, firewalls, and routers. In addition, these leaf switches provide the VXLAN Tunnel Endpoint (VTEP) function at the edge of the fabric. In Cisco ACI terminology, the IP addresses representing leaf switch VTEPs are called Physical Tunnel Endpoints (PTEPs). The leaf switches route or bridge tenant packets and apply network policies.

What are Spine Switches?

Spine switches, also known as spine or core switches, are high-performance switches that form the backbone of a network. They play a vital role in data centers and large enterprise networks and facilitate the seamless data flow between various leaf switches.

These devices interconnect leaf switches and can also connect Cisco ACI pods to IP networks or WAN devices to build a Cisco ACI Multi-Pod fabric. Spine switches also store the proxy mapping entries between endpoints and VTEPs. Within a pod, leaf switches connect only to spine switches.

No direct connection between tier-1 leaf switches, tier-2 leaf switches, or spine switches is allowed. If you incorrectly cable spine switches to each other or leaf switches in the same tier to each other, the interfaces will be disabled.

Diagram: Cisco ACI Fabric. Source Cisco Live.

BGP Route Reflection

Under the cover, Cisco ACI works with BGP Route-Reflection. BGP Route Reflection creates a hierarchy of routers within the ACI fabric. At the top of the hierarchy is a Route-Reflector (RR), a central point for collecting routing information from other routers within the fabric. The RR then reflects this information to other routers, ensuring that every router in the network has a complete view of the routing table.

The ACI uses MP-BGP to distribute external network subnets or prefixes inside the ACI fabric. To create an MP-BGP route reflector, we select two spines to act as route reflectors and establish iBGP neighborships from them to all the leaves.
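
Conceptually, the spine route reflectors take external prefixes learned at a border leaf and reflect them to every other leaf, so no full iBGP mesh is needed. The sketch below models only that reflection logic with invented leaf names and prefixes; it is not a BGP implementation.

```python
# Conceptual model of MP-BGP route reflection inside the fabric.
leaves = ["leaf-101", "leaf-102", "leaf-103", "leaf-104"]   # RR clients
rib = {leaf: set() for leaf in leaves}                      # per-leaf external routes

def reflect(from_leaf: str, prefix: str) -> None:
    """A spine RR reflects a prefix learned from one client to all other clients."""
    for leaf in leaves:
        if leaf != from_leaf:
            rib[leaf].add(prefix)

# A border leaf learns an external prefix and advertises it to the spine RRs.
reflect("leaf-101", "192.0.2.0/24")
print(rib["leaf-103"])   # every other leaf now holds the external prefix
```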

ACI Cisco and endpoints

In a traditional network, three tables are used to maintain the network addresses of external devices: a MAC address table for Layer 2 forwarding, a Routing Information Base (RIB) for Layer 3 forwarding, and an ARP table for the combination of IP addresses and MAC addresses. Cisco ACI, however, maintains this information differently, as shown below.

Diagram: Endpoint Learning. Source Cisco.com

What is ACI Endpoint Learning?

ACI endpoint learning refers to discovering and monitoring the network endpoints within an ACI fabric. Endpoints include devices, virtual machines, physical servers, users, and applications. Network administrators can make informed decisions regarding network policies, security, and traffic optimization by gaining insights into these endpoints’ location, characteristics, and behavior.

How Does ACI Endpoint Learning Work?

ACI fabric leverages a distributed, controller-based architecture to facilitate endpoint learning. When an endpoint is connected to the fabric, ACI utilizes various mechanisms to gather information about it. These mechanisms include Address Resolution Protocol (ARP) snooping, Link Layer Discovery Protocol (LLDP), and even integration with hypervisor-based systems.

Once an endpoint is detected, the ACI fabric builds a comprehensive endpoint database, organized around Endpoint Groups (EPGs). This database contains vital information such as MAC addresses, IP addresses, VLANs, and associated policies. By continuously monitoring and updating this database, ACI ensures real-time visibility and control over the network endpoints.
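
To picture the endpoint database described above, here is a small, hypothetical Python model of an endpoint record and how a learning event adds or refreshes it. The field names and values are illustrative, not the actual APIC schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Endpoint:
    mac: str
    ip: str
    encap: str          # e.g. "vlan-10"
    epg: str            # the endpoint group the endpoint is classified into
    interface: str      # where the endpoint was learned

# Keyed by (EPG, MAC) purely for illustration.
endpoint_db: Dict[Tuple[str, str], Endpoint] = {}

def learn(ep: Endpoint) -> None:
    """Insert or refresh an endpoint record as learning events arrive."""
    endpoint_db[(ep.epg, ep.mac)] = ep

learn(Endpoint("0050.5690.3eeb", "10.1.1.10", "vlan-10", "Web-EPG", "eth1/5"))
print(endpoint_db[("Web-EPG", "0050.5690.3eeb")])
```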

Implementation Endpoint Learning Considerations:

To leverage the benefits of ACI endpoint learning, organizations need to consider a few key aspects:

1. Infrastructure Design: A well-designed ACI fabric with appropriate leaf and spine switches is crucial for efficient endpoint learning. Proper VLAN and subnet design should be implemented to ensure accurate endpoint identification and classification.

2. Endpoint Group (EPG) Definition: Defining and associating EPGs with appropriate policies is essential. EPGs help categorize endpoints based on their characteristics, allowing for granular policy enforcement and simplified management.

Diagram: ACI Endpoint Learning. The source is Cisco.

Forwarding Behavior: The COOP Database

Local and remote endpoints are learned from the data plane, but remote endpoints are local caches. Cisco ACI’s fabric relies heavily on local endpoints for endpoint information. A leaf is responsible for reporting its local endpoints to the Council Of Oracle Protocol (COOP) database located on each spine switch, which implies that all endpoint information in the Cisco ACI fabric is stored there.

Each leaf does not need to know about all the remote endpoints to forward packets to the remote endpoints because this database is accessible. When a leaf does not know about a remote endpoint, it can still forward packets to spine switches. This forwarding behavior is called spine proxy.

Diagram: Endpoint Learning. The source is Cisco.

In a traditional network environment, switches rely on the Address Resolution Protocol (ARP) to map IP addresses to MAC addresses. However, this approach becomes inefficient as the network scales, resulting in increased network traffic and complexity. Cisco ACI addresses this challenge by utilizing local endpoint learning, a more intelligent and efficient method of mapping MAC addresses to IP addresses.

Diagram: Local and Remote endpoint learning. The source is Cisco.

ACI Cisco: The Main Features

We are experiencing many changes right now that are impacting almost every aspect of IT. Applications are changing immensely, and we see their life cycles broken into smaller windows as they become less structured. In addition, containers and microservices are putting new requirements on the underlying infrastructure, such as the data centers they live in. This is one of the main reasons why a distributed system, including a data center, is better suited for this environment.

Distributed system/Intelligence at the edge

Like all networks, the Cisco ACI network still has a control and data plane. From the control and data plane perspective, the Cisco ACI architecture is still a distributed system. Each switch has intelligence and knows what it needs to do. This is one of the differences between ACI and traditional SDN approaches that try to centralize the control plane. If you try to centralize the control plane, you may hit scalability limits, not to mention a single point of failure and an avenue for bad actors to penetrate.

Diagram: Cisco ACI Design. Source Cisco Live.

Two large core devices

If we examine the traditional data center architecture, intelligence is often in two central devices. You could have two large core devices. What the network used to control and secure has changed dramatically with virtualization via hypervisors. We’re seeing faster change with containers and microservices being deployed more readily.

As a result, an overlay networking model is better suited. However, in a VXLAN overlay network, the intelligence is distributed across the leaf switch layer.

Therefore, distributed systems are better than centralized systems for more scale, resilience, and security. By distributing intelligence to the leaf layer, scalability is not determined by the scalability of each leaf but by the fabric level. However, there are scale limits on each device. Therefore, scalability as a whole is determined by the network design.

A key point: Overlay networking

The Cisco ACI architecture provides an integrated Layer 2 and 3 VXLAN-based overlay networking capability to offload network encapsulation processing from the compute nodes onto the top-of-rack or ACI leaf switches. This architecture provides the flexibility of software overlay networking in conjunction with the performance and operational benefits of hardware-based networking. 

ACI Cisco New Concepts

Networking in the Cisco ACI architecture differs from what you may use in traditional network designs. It’s not different because we use an entirely new set of protocols. ACI uses standards-based protocols such as BGP, VXLAN, and IS-IS. However, the new networking constructs inside the ACI fabric exist only to support policy.

ACI has been referred to as stateless architecture. As a result, the network devices have no application-specific configuration until a policy is defined stating how that application or traffic should be treated on the network.

This is a new and essential concept to grasp. So, now, with ACI, the network devices in the fabric have no application-specific configuration until there is a defined policy. No configuration is tied to a device. With a traditional configuration model, we have a lot of configuration on a device even if it is not being used. For example, we had ACL and QoS parameters configured, but nothing was using them.

The APIC controller

The APIC, the management plane that defines the policy, does not need to push configuration when nothing connected is using it. The APIC controller can see the entire fabric and has a holistic viewpoint.

Therefore, it can correlate configurations and integrate them with devices to help manage and maintain the security policy you define. We see every device on the fabric, physical or virtual, and can maintain policy consistency and, more importantly, recognize when policy needs to be enforced. 

Diagram: APIC Controller. Source Cisco Live.
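
Since the APIC exposes that holistic viewpoint through a REST API, a quick way to see the whole fabric is to log in and pull the fabric node inventory. The sketch below is a hedged example: the URL and credentials are placeholders, certificate verification is disabled only for a lab, and you should confirm the endpoints against your APIC documentation.

```python
import requests

APIC = "https://apic.example.com"   # placeholder APIC address
session = requests.Session()
session.verify = False              # lab only; use proper certificates in production

# Authenticate: the APIC returns a session token as a cookie on this session.
login_body = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
session.post(f"{APIC}/api/aaaLogin.json", json=login_body).raise_for_status()

# Query the fabric node inventory (leaf, spine, and controller records).
resp = session.get(f"{APIC}/api/class/fabricNode.json")
resp.raise_for_status()

for item in resp.json().get("imdata", []):
    attrs = item["fabricNode"]["attributes"]
    print(attrs["name"], attrs["role"], attrs["serial"])
```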

Endpoint groups (EPG)

We touched on this a moment ago. Groups or endpoint groups (EPGs) and contracts are core to the ACI. Because this is a zero-trust network by default, communication is blocked in hardware until a policy consisting of groups and contracts is defined. With Endpoint Groups, we can decouple and separate the physical or virtual workloads from the constraints of IP addresses and VLANs. 

So, we are grouping similar workloads into groups known as Endpoint Groups. Then, we can control group behavior by applying policy to the groups and not the endpoints in the group. As a security best practice, it is essential to group similar workloads with similar security sensitivity levels and then apply the policy to the endpoint group.

For example, a traditional data center network could have database and application servers in the same segment controlled by a VLAN with no intra-VLAN filtering. The EPG approach removes the barriers we have had with traditional networks, with the limitation of the IP address being used as the identifier and locator and the restrictions of the VLANs.

This is a new way of thinking, and it allows devices to communicate with each other without changing their IP address, VLAN, or subnet.

Diagram: ACI Endpoint Groups. Source Cisco Live.

EPG Communication

The EPG provides a better segmentation method than the VLAN, which was never meant to live in a world of security. By default, anything in the group can communicate freely, and inter-EPG communication needs a policy. This policy construct that ACI uses is called a contract. So, having similar workloads of similar security levels in the same EPG makes sense. All devices inside the same endpoint group can talk to each other freely.

This behavior can be modified with intra-EPG isolation, similar to a private VLAN where communication between group members is not allowed. Or, intra-EPG contracts can be used only to allow specific communications between devices in an EPG.
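
The allow-list behavior described above can be summarized in a few lines of pseudologic: traffic inside an EPG is allowed unless isolation is enabled, and traffic between EPGs is dropped unless a contract exists. This is a conceptual sketch with made-up EPG and contract names, not how the hardware policy TCAM is actually programmed.

```python
# Conceptual zero-trust policy check between two endpoints.
contracts = {("Web-EPG", "App-EPG"): "web-to-app"}   # EPG pairs joined by a contract
isolated_epgs = {"PCI-EPG"}                          # EPGs with intra-EPG isolation enabled

def is_permitted(src_epg: str, dst_epg: str) -> bool:
    if src_epg == dst_epg:
        # Intra-EPG traffic is free by default, unless isolation is configured.
        return src_epg not in isolated_epgs
    # Inter-EPG traffic is dropped unless a contract exists in either direction.
    return (src_epg, dst_epg) in contracts or (dst_epg, src_epg) in contracts

print(is_permitted("Web-EPG", "App-EPG"))   # True: contract present
print(is_permitted("Web-EPG", "DB-EPG"))    # False: zero trust, no contract
print(is_permitted("PCI-EPG", "PCI-EPG"))   # False: intra-EPG isolation
```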

Extending the ACI Fabric

Developing the Cisco ACI architecture

I have always found extending the data center risky when undergoing data center network design projects. However, the Cisco ACI architecture can be extended without the traditional Layer 2 and 3 Data Center Interconnect (DCI) mechanisms. Here, we can use Multi-Pod and Multi-Site to better control large environments that need to span multiple locations and to let applications share those locations in active-active deployments.

Diagram: Extending the ACI fabric. Source is Cisco

When considering data center designs, terms such as active-active and active-passive are often discussed. In addition, enterprises are generally looking for data center solutions that provide geographical redundancy for their applications.

Enterprises also need to be able to place workloads in any data center where computing capacity exists—and they often need to distribute members of the same cluster across multiple data center locations to provide continuous availability in the event of a data center failure. The ACI gives us options for extending the fabric to multiple locations and location types.

For example, there are stretched fabric, multi-pod, multi-site designs, and, more recently, Cisco Cloud ACI.

ACI design: Multi pod

The ACI Multi-Pod is the next evolution of the original stretch fabric design we discussed. The architecture consists of multiple ACI Pods connected by an IP Inter-Pod Layer 3 network. With the stretched fabric, we have one Pod across several locations. Cisco ACI MultiPod is part of the “single APIC cluster/single domain” family of solutions; a single APIC cluster is deployed to manage all the interconnected ACI networks.

These ACI networks are called “pods,” and each looks like a regular two-tier spine-leaf topology. The same APIC cluster can manage several pods. All of the nodes deployed across the individual pods are under the control of the same APIC cluster, and the separate pods are managed as if they were logically a single entity. This gives you operational simplicity. We also have a fault-tolerant fabric, since each pod runs isolated control plane protocols.

Diagram: Multi-pod. Source is Cisco

ACI design: Cisco cloud ACI

Cisco Cloud APIC is an essential new solution component introduced in the architecture of Cisco Cloud ACI. It plays the equivalent of APIC for a cloud site. Like the APIC for on-premises Cisco ACI sites, Cloud APIC manages network policies for the cloud site it runs on by using the Cisco ACI network policy model to describe the policy intent.

ACI design: Multisite

ACI Multi-Site enables you to interconnect separate APIC cluster domains or fabric, each representing a separate availability zone. As a result, we have separate and independent APIC domains and fabrics. This way, we can manage multiple fabrics as regions or availability zones. ACI Multi-Site is the easiest DCI solution in the industry. Communication between endpoints in separate sites (Layers 2 and 3) is enabled simply by creating and pushing a contract between the endpoints’ EPGs.


SASE Definition

In today's ever-evolving digital landscape, businesses are seeking agile and secure networking solutions. Enter SASE (Secure Access Service Edge), a revolutionary concept that combines network and security functionalities into a unified cloud-based architecture. In this blog post, we will delve into the definition of SASE, its components, implementation benefits, and its potential impact on the future of networking.

At its core, SASE represents a shift from traditional networking models towards a more integrated approach. It brings together wide area networking (WAN), network security services, and cloud-native architecture, resulting in a unified and simplified networking framework. SASE aims to provide organizations with secure access to applications and data from any location, while reducing complexity and improving performance.

SASE is built on several fundamental components that work together harmoniously. These include software-defined wide area networking (SD-WAN), secure web gateways (SWG), cloud access security brokers (CASB), zero-trust network access (ZTNA), and firewall as a service (FWaaS). Each component plays a crucial role in delivering a comprehensive and secure networking experience.

Implementing SASE offers numerous advantages for businesses. Firstly, it simplifies network management by consolidating various services into a single platform. This leads to increased operational efficiency and cost savings. Additionally, SASE enhances security by applying consistent policies across all network traffic, regardless of the user's location. It also improves application performance through intelligent traffic routing and optimization.

As digital transformation continues to shape the business landscape, SASE emerges as a transformative force. Its cloud-native architecture aligns perfectly with the growing adoption of cloud services, enabling seamless integration and scalability. Moreover, SASE accommodates the rise of remote work and the need for secure access from anywhere. As organizations embrace hybrid and multi-cloud environments, SASE is poised to become the backbone of modern networking infrastructure.

SASE represents a paradigm shift in networking, blending security and networking functionalities into a unified framework. By embracing SASE, organizations can streamline operations, enhance security posture, and adapt to the evolving digital landscape. As we move forward, it is essential for businesses to explore the potential of SASE and leverage its benefits to drive innovation and growth.

Highlights: SASE Definition

SASE: A Cloud-Centric Approach

Firstly, SASE is a response to how our environment has changed. In a cloud-centric world, users and devices require access to services everywhere. The focal point has shifted to the identity of the user and device, whereas the traditional model focused solely on the data center and its many network security components. These environmental changes have created a new landscape we must protect and connect.

Many common problems challenge the new landscape. Because separate appliances are deployed for the different technology stacks, enterprises are loaded with complexity and overhead. The legacy network and security designs increase latency. In addition, most traffic is now encrypted; when considering Zero Trust SASE, this traffic needs to be inspected without degrading application performance.

These are reasons to leverage a cloud-delivered secure access service edge (SASE). SASE means a tailored network fabric optimized where it makes the most sense for the user, device, and application: at geographically dispersed PoPs that secure your environment with technologies such as single packet authorization.

**Driving Forces to Adopting SASE**

Challenge 1: Managing the Network: Converging network and security into a single platform does not require multiple integration points. This will eliminate the need to deploy these point solutions and the complexities of managing each.

Challenge 2: Site Connectivity: SASE handles all management complexities. As a result, the administrative overhead for managing and operating a global network that supports site-to-site connectivity and enhanced security, cloud, and mobility is kept to an absolute minimum.

Challenge 3: Performance Between Locations: The SASE cloud already has an optimized converged network and security platforms. Therefore, sites need to connect to the nearest SASE PoP.

Challenge 4: Cloud Agility: SASE natively supports cloud data centers (IaaS) and applications (SaaS) without additional configuration, complexity, or point solutions, enabling built-in cloud connectivity.

**Components of SASE**

1. Network as a Service (NaaS): SASE integrates network services such as SD-WAN (Software-Defined Wide Area Network) and cloud connectivity to provide organizations with a flexible and scalable network infrastructure. With NaaS, businesses can optimize network performance, reduce latency, and ensure reliable connectivity across different environments.

2. Security as a Service (SECaaS): SASE incorporates various security services, including secure web gateways, firewall-as-a-service, data loss prevention, and zero-trust network access. By embedding security into the network infrastructure, SASE enables organizations to enforce consistent security policies, protect against threats, and simplify the management of security measures.

3. Zero-Trust Architecture: SASE adopts a zero-trust approach, which assumes that no user or device should be trusted by default, even within the network perimeter. By implementing continuous authentication, access controls, and micro-segmentation, SASE ensures that every user and device is verified before accessing network resources, reducing the risk of unauthorized access and data breaches.

4. Cloud-Native Architecture: SASE leverages cloud-native technologies to provide a scalable, agile, and elastic network and security infrastructure. By transitioning from legacy hardware appliances to software-defined solutions, SASE enables organizations to respond more quickly to changing business requirements, reduce costs, and improve overall efficiency.

Note: Point Solutions

SASE is becoming increasingly popular among organizations because it provides a more flexible and cost-effective approach to networking and security. The traditional approach involves deploying multiple devices or appliances, each with its functions. This approach can be complex, time-consuming, and expensive to manage. On the other hand, SASE simplifies this process by integrating all the necessary functions into a single platform.

Example Technology: Web Security Scanner


Example Product: Cisco Meraki

### Simplified Network Management

One of the standout features of the Cisco Meraki platform is its simplified network management. Gone are the days of complex configurations and time-consuming setups. With Meraki’s cloud-based interface, administrators can manage their entire network from a single dashboard. This ease of use not only saves time but also reduces the need for specialized IT staff, making it an ideal solution for businesses of all sizes.

### Robust Security Features

Security is a top priority for any network, and Cisco Meraki does not disappoint. The platform comes equipped with robust security features, including advanced threat protection, intrusion detection, and automated firmware updates. These features work seamlessly to protect your network from potential threats, ensuring that your data remains secure and your operations run smoothly.

### Scalability and Flexibility

As your business grows, so too does the need for a scalable network solution. Cisco Meraki’s platform is designed with scalability in mind, allowing you to easily add new devices and extend your network without any hassle. Whether you’re expanding to a new office location or integrating additional IoT devices, Meraki’s flexible architecture ensures that your network can grow alongside your business.

### Enhanced Visibility and Analytics

Understanding your network’s performance is crucial for making informed decisions. Cisco Meraki offers enhanced visibility and analytics, providing detailed insights into network usage, device performance, and potential issues. With these analytics, administrators can proactively address problems before they impact operations, optimize resource allocation, and ensure that their network is running at peak efficiency.

### Streamlined Troubleshooting

Troubleshooting network issues can be a daunting task, but Cisco Meraki makes it easier than ever. The platform’s intuitive dashboard provides real-time alerts and diagnostic tools, allowing administrators to quickly identify and resolve issues. This streamlined troubleshooting process minimizes downtime and keeps your network running smoothly.

**SASE Meaning: SASE wraps up**

SASE is a network and security architecture consolidating numerous network and security functions, traditionally delivered as siloed point solutions, into an integrated cloud service. It combines several network and security capabilities along with cloud-native security functions. The functions are produced from the cloud and provided by the SASE vendor.

They are essentially providing a consolidated, platform-based approach to security. We have a cloud-delivered solution consolidating multiple edge network security controls and network services into a unified solution with centralized management and distributed enforcement.

**The appliance-based perimeter**

Even though there has been a shift to the cloud, the traditional perimeter network security solution has remained appliance-based. The case for moving security controls to the cloud rests on better protection and performance, plus ease of deployment and maintenance.

The performance limitations of the earlier cloud-delivered solutions have been overcome with the introduction of optimized routing and a global footprint. However, opinions are still split: many consider protection and performance the prime reasons to keep network security solutions on-premises.

Related: For additional pre-information, you may find the following helpful:

  1. SD-WAN SASE
  2. SASE Solution
  3. Security Automation
  4. SASE Model
  5. Cisco Secure Firewall
  6. eBOOK on SASE

SASE Definition

SASE Definition with Challenge 1: Managing the Network

Across the entire networking and security industry, everyone sells individual point solutions that are not a holistic joined-up offering. Thinking only about MPLS replacement leads to incremental point solution acquisitions when confronted by digital initiatives, making networks more complex and costly.

Principally, distributed appliances for network and security at every location require additional tasks such as installation, ongoing management, regular updates, and refreshes. This results in far too many security and network configuration points. We see this all the time with NOC and SOC integration efforts.

A: Numerous integration points:

The point-solution approach addresses one issue and requires considerable integration. Therefore, you must constantly add solutions to the stack, likely resulting in management overhead and increased complexity. Let’s say you are searching for a new car. Would you prefer to build the car with all the different parts or buy the already-built one?

In the same way, if we examine the network and security industry as it is presently geared up, everything is provided in parts. It's your job to support, manage, and build the stack over time and scale it when needed. Fundamentally, you need to be an expert in all the different parts. However, if you abstract complexity into one platform, you don't need to be an expert in everything. SASE is one effective way to abstract management and operational complexity.

B: Required: How SASE solves this

Converging network and security into a single platform does not require multiple integration points. This will eliminate the need to deploy these point solutions and the complexities of managing each. Essentially, with SASE, we can bring each point solution’s functionalities together and place them under one hood—the SASE cloud. SASE merges all of the networking and security capabilities into a single platform.

This way, you now have a holistic, joined-up offering. Customers don't need to perform upgrades or size and scale their network. Instead, all this is done for them in the SASE cloud, creating a fully managed and self-healing architecture. Besides, if something goes wrong in one of the SASE PoPs, convergence is fast and automatic; there is no need to set up new tunnels or have administrators step in to perform configurations.

SASE Definition with Challenge 2: Site Connectivity

SD-WAN appliances require other solutions for global connectivity and to connect, secure, and manage mobile users and cloud resources. As a result, many users are turning to Service Providers to handle the integration. The carrier-managed SD-WAN providers integrate a mix of SD-WAN and security devices to form SD-WAN services.

A: Lack of Agility

Unfortunately, this often makes the Service Providers inflexible in accommodating new requests. The telco’s lack of agility and high bandwidth costs will remain problematic. Deploying new locations has been the biggest telco-related frustration, especially when connecting offices outside of the telco’s operating region to the company’s MPLS network. For this, they need to integrate with other telcos.

B: Required: How SASE solves this

SASE handles all of the complexities of management. As a result, the administrative overhead for managing and operating a global network that supports site-to-site connectivity and enhanced security, cloud, and mobility is kept to an absolute minimum.

SASE Definition with Challenge 3: Performance Between Locations

The throughput is primarily determined by latency and packet loss, not bandwidth. Therefore, for an optimal experience for global applications, we must explore ways to manage the latency and packet loss end-to-end for last-mile and middle-mile segments. Most SD-WAN vendors don’t control these segments, affecting application performance and service agility.

Consequently, constant tweaking at the remote ends will be required to attain the best performance for your application. With SD-WAN, we can bundle transports and perform link bonding to solve the last mile. However, this does not create any benefits for the middle mile bandwidth. MPLS will help you overcome the middle-mile problems, but you will likely pay a high price.

A: Required: How SASE solves this

The SASE cloud already has an optimized converged network and security platforms. Therefore, sites need to connect to the nearest SASE PoP. This way, the sites are placed on the global private backbone to take advantage of global route optimization, dynamic path selection, traffic optimization, and end-to-end encryption. The traffic can also be routed over MPLS, directly between sites (not through the SASE PoP), and from IPsec tunnels to third-party devices. The SASE architecture optimizes the last and middle-mile traffic flows.

B: Required: Optimization techniques:

The SASE global backbone uses several techniques to improve network performance, resulting in predictable, consistent latency and packet loss. The SASE cloud has complete control of each PoP and can employ optimizations. It uses proprietary routing algorithms that factor in latency, packet loss, and jitter. These routing algorithms favor performance over cost and select the optimal route for every network packet. This is compared to Internet routing, where metrics don’t consider what is best for the application or the type.
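
A routing decision that favors performance over cost can be sketched as a weighted score over measured latency, loss, and jitter for each candidate path. The weights, path names, and telemetry below are invented for illustration; a real SASE backbone uses its own proprietary algorithms.

```python
# Illustrative per-path telemetry between two SASE PoPs (latency ms, loss %, jitter ms).
paths = {
    "pop-nyc -> pop-lon (direct)":   {"latency": 72.0, "loss": 0.40, "jitter": 3.0},
    "pop-nyc -> pop-par -> pop-lon": {"latency": 81.0, "loss": 0.05, "jitter": 1.2},
}

def score(m: dict, w_latency: float = 1.0, w_loss: float = 50.0, w_jitter: float = 5.0) -> float:
    """Lower is better: weight loss and jitter heavily relative to raw latency."""
    return w_latency * m["latency"] + w_loss * m["loss"] + w_jitter * m["jitter"]

best = min(paths, key=lambda p: score(paths[p]))
print("Selected path:", best)   # the longer but cleaner path wins here
```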

Example TCP Performance Parameters.

SASE Definition with Challenge 4: Cloud Agility

Cloud applications are becoming more critical to organizations, even more so than those hosted in private data centers. When delivering cloud resources, we must consider more than just providing connectivity. In the past, when we spoke about agility, we were concerned only with the addition of new on-premises sites.

However, now, this conversation needs to encompass the cloud. Delivering cloud applications is primarily about providing an application experience that is as responsive as on-premises. However, most SD-WANs have a low response rate for rapidly offering new public cloud infrastructure. MPLS is expensive, rigid, and not built for cloud access.

A: Required: How SASE solves this

SASE natively supports cloud data centers (IaaS) and applications (SaaS) without additional configuration, complexity, or point solutions, enabling built-in cloud connectivity. This further allows the rapid delivery of new public cloud infrastructure.

The SASE PoPs are collocated in the data centers and directly connected to the IXP of the leading IaaS providers, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform. In addition, cloud applications are optimized through SASE’s ability to define the egress points.

This helps exit the cloud application traffic at the points closest to the customer’s application instance. The optimal global routing algorithms can determine the best path from anywhere to the customer’s cloud application instance. This provides optimal performance to the cloud applications regardless of the user’s location.

So, when we talk about performance to the cloud with SASE, the latency to the cloud is comparable to the optimized access provided by the cloud providers, such as AWS Direct Connect or Azure Express Route. So, authentically, SASE provides out-of-the-box cloud performance.

SASE Definition with Challenge 5: Security

The security landscape is constantly evolving, so network security solutions must evolve with it. Ransomware and malware will continue to be the primary security concerns from 2020 onward. Managing the various point solutions, with their complex integration points scattered throughout the network domain, is challenging for the entire organization.

Security must be part of any WAN transformation initiative. It must protect users and resources regardless of the underlying network, managed through a single-pane-of-glass. However, a bundle of non-integrated security products results in appliance sprawl that hinders your security posture instead of strengthening it. The security solution must defend against emerging threats like malware and ransomware. In addition, it must boost the ability to enforce corporate security policies on mobile users.

Finally, the security solution must also address the increasing cost of buying and managing security appliances and software.

**Security and encryption**

The complexity increases due to the disparate tools required to address the different threat vectors. For example, DLP functionality can be spread across the SWG, the CASB, and a dedicated DLP product, with three different teams managing each. What about the impact of encrypted web traffic on the security infrastructure?

The issue is that most internet traffic is now encrypted, and attackers deliver payloads, send command and control instructions, and exfiltrate data over encrypted protocols. Organizations cannot decrypt all network traffic, both for performance reasons and to avoid looking at sensitive employee information. There are also scalability issues with encrypted traffic management solutions, which can cause further performance problems.

Example Technology: Sensitive Data Protection


Example Technology: Security Backdoors

Backdoor access refers to a hidden method or vulnerability intentionally created within a system or software that allows unauthorized access or control. It is an alternative entry point that bypasses conventional security measures, often undetected.

Using Bash: Bash, short for “Bourne Again SHell,” is a widely used command-line interpreter in Unix-based systems. It provides powerful scripting capabilities, making it a favorite among system administrators and developers. However, this versatility also brings the potential for misuse. This section will explain what a Bash backdoor is and how it functions.

Note: In the following, I created a backdoor on a corporate machine to maintain persistence within the environment. I used a bash script and system configuration via cron jobs. You will then connect to the created backdoor. Here, we demonstrate how tools available on standard operating system installations can be used to bypass an organization's security controls.

Cron jobs, derived from the word “chronos,” meaning time in Greek, are scheduled tasks that run automatically in the background of your server. They follow a specific syntax, using fields to specify when and how often a task should be executed. You can create precise and reliable automated processes by grasping the structure and components of cron jobs.


First, a file called file is deleted with the rm command if it already exists. Next, a named pipe, a special file that acts as a new communications channel, is created with the same name. Any information passed to the bash terminal, such as typed commands, is transmitted to a specific IP address and port through the pipe. The | symbol marks the point at which the output from one Linux command is passed as input to the next command. With this single line, you can create a network connection to a remote machine, giving a user remote access.

In the cron entry, errors from running the task are suppressed rather than printed to the screen. The new cron job is then echoed; in this example, the backdoor bash shell will run every minute. Finally, the output of the echoed command is written to the cron file with crontab.

SASE Definition with Challenge 6: MPLS and SD-WAN

MPLS does not protect resources and users, certainly not those connected to the Internet. On the other hand, SD-WAN service offerings are not all created equal, since many do not include firewall/security features for threat protection across all edges: mobile devices, sites, and cloud resources. This lack of integrated security complicates SD-WAN deployments and often leads to malware getting past the perimeter unnoticed.

Challenge: The cost involved

Security solutions are expensive, and there is never a fixed price. Some security vendors charge according to usage models for quantities you cannot yet predict. This makes the planning process extraordinarily problematic and complex. As costs keep increasing, security professionals often trade off point security solutions because of the associated costs. This is not an effective risk-management strategy.

Security controls for mobile users are often limited to mobile VPN solutions. More often than not, they are very coarse, forcing IT to open access to all network resources. Protecting mobile users requires additional security tools like next-generation firewalls (NGFWs), so we end up with yet another point solution. In addition, mobile VPN solutions provide no last- or middle-mile optimization.

SASE Meaning: How SASE solves this

SASE converges a complete security stack into the network, bringing granular control to sites, mobile users, and cloud resources by enforcing zero-trust principles for all edges. SASE provides anti-malware protection for both WAN and Internet traffic. In addition, for malware detection and prevention, SASE can offer signature-based and machine learning-based protection consisting of several integrated anti-malware engines.

For malware communication, SASE can stop the outbound traffic to C&C servers based on reputation feeds and network behavioral analysis. Mobile user traffic is fully protected by SASE’s advanced security services, including NGFW, secure web gateway (SWG), threat prevention, and managed threat detection and response. Furthermore, in the case of mobile, SASE mobile users can dynamically connect to the closest SASE PoP regardless of the location. Again, as discussed previously, the SASE cloud’s relevant optimizations are available for mobile users.

Rethink the WAN: The shift to the cloud, edge computing, and mobility offers new opportunities for IT professionals. Network professionals must rethink their WAN transformation approach to support these digital initiatives. WAN transformation is not just about replacing MPLS with SD-WAN. An all-encompassing solution is needed that provides the proper network performance and security level for enhanced site-to-site connectivity, security, mobile, and cloud.

Example Product: Cisco Umbrella

### What is Cisco Umbrella?

Cisco Umbrella acts as a first line of defense against internet-based threats by leveraging the cloud. It uses DNS (Domain Name System) to block malicious domains, IPs, and URLs before a connection can be established. By analyzing and learning from internet activity patterns, it can predict and prevent potential threats, ensuring that your network remains secure.
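As a quick way to see DNS-layer enforcement in action, you can point a test query at the Umbrella (OpenDNS) public resolvers; the domain below is only an illustration.

```bash
# Resolve a domain through the Cisco Umbrella / OpenDNS resolvers
# (208.67.222.222 and 208.67.220.220). A domain blocked by policy
# typically resolves to an Umbrella block page rather than its real address.
dig +short example.com @208.67.222.222
```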

### Key Features of Cisco Umbrella

1. **DNS Layer Security**: Cisco Umbrella provides a protective shield at the DNS layer, stopping threats before they reach your network or endpoints. This means that harmful requests are blocked at the source, reducing the risk of malware infections.

2. **Secure Web Gateway**: The solution offers a secure web gateway that inspects web traffic and enforces security policies. It ensures that only safe and compliant traffic is allowed, providing an additional layer of security.

3. **Cloud-Delivered Firewall**: Cisco Umbrella includes a built-in firewall to block unwanted traffic, adding another layer of protection. This firewall can be managed from the cloud, simplifying the process of maintaining network security.

4. **Threat Intelligence**: With real-time threat intelligence updates from Cisco Talos, one of the world’s largest commercial threat intelligence teams, Cisco Umbrella ensures that your defenses are always up to date against the latest threats.

### Benefits of Using Cisco Umbrella

1. **Simplified Security Management**: Being cloud-based, Cisco Umbrella is easy to deploy and manage. There’s no need for complex hardware or software installations, reducing the burden on IT teams.

2. **Improved Visibility**: Cisco Umbrella provides comprehensive insights into internet activity across all devices and locations. This visibility helps in identifying and responding to potential threats swiftly.

3. **Enhanced User Experience**: Because malicious requests are blocked at the DNS layer before a connection is made, users see faster lookups and reduced latency, leading to a smoother browsing experience.

4. **Scalability**: Whether you are a small business or a large enterprise, Cisco Umbrella can scale according to your needs. Its cloud-native architecture ensures that it can handle an increasing number of users and devices without compromising on performance.

Summary: SASE Definition

With the ever-evolving landscape of technology and the increasing demand for secure and efficient networks, a new paradigm has emerged in the realm of network security – SASE, which stands for Secure Access Service Edge. In this blog post, we delved into the definition of SASE, its key components, and its transformative impact on network security.

Understanding SASE

SASE, pronounced “sassy,” is a comprehensive framework that combines network security and wide area networking (WAN) capabilities into a single cloud-based service model. It aims to provide users with secure access to applications and data, regardless of their location or the devices they use. By converging networking and security functions, SASE simplifies the network architecture and enhances overall performance.

The Key Components of SASE

To fully grasp the essence of SASE, it is essential to explore its core components. These include:

1. Secure Web Gateway (SWG): The SWG component of SASE ensures safe web browsing by inspecting and filtering web traffic, protecting users from malicious websites, and enforcing internet usage policies.

2. Cloud Access Security Broker (CASB): CASB provides visibility and control over data as it moves between the organization’s network and multiple cloud platforms. It safeguards against cloud-specific threats and helps enforce data loss prevention policies.

3. Firewall-as-a-Service (FWaaS): FWaaS offers scalable and flexible firewall protection, eliminating the need for traditional hardware-based firewalls. It enforces security policies and controls access to applications and data, regardless of their location.

4. Zero Trust Network Access (ZTNA): ZTNA ensures that users and devices are continuously authenticated and authorized before accessing resources. It replaces traditional VPNs with more granular and context-aware access policies, reducing the risk of unauthorized access.

The Benefits of SASE

SASE brings numerous advantages to organizations seeking enhanced network security and performance:

1. Simplified Architecture: By consolidating various network and security functions, SASE eliminates the need for multiple-point solutions, reducing complexity and management overhead.

2. Enhanced Security: With its comprehensive approach, SASE provides robust protection against emerging threats, ensuring data confidentiality and integrity across the network.

3. Improved User Experience: SASE enables secure access to applications and data from any location, offering a seamless user experience without compromising security.

Conclusion:

SASE represents a paradigm shift in network security, revolutionizing how organizations approach their network architecture. By converging security and networking functions, SASE provides a comprehensive and scalable solution that addresses the evolving challenges of today’s digital landscape. Embracing SASE empowers organizations to navigate the complexities of network security and adopt a future-ready approach.

SD WAN Overlay

In today's digital age, businesses rely on seamless and secure network connectivity to support their operations. Traditional Wide Area Network (WAN) architectures often struggle to meet the demands of modern companies due to their limited bandwidth, high costs, and lack of flexibility. A revolutionary SD-WAN (Software-Defined Wide Area Network) overlay has emerged to address these challenges, offering businesses a more efficient and agile network solution. This blog post will delve into SD-WAN overlay, exploring its benefits, implementation, and potential to transform how businesses connect.

SD-WAN employs the concepts of overlay networking. Overlay networking is a virtual network architecture that allows for the creation of multiple logical networks on top of an existing physical network infrastructure. It involves the encapsulation of network traffic within packets, enabling data to traverse across different networks regardless of their physical locations. This abstraction layer provides immense flexibility and agility, making overlay networking an attractive option for organizations of all sizes.

- Scalability: One of the key advantages of overlay networking is its ability to scale effortlessly. By decoupling the logical network from the underlying physical infrastructure, organizations can rapidly deploy and expand their networks without disruption. This scalability is particularly crucial in cloud environments or scenarios where network requirements change frequently.

- Security and Isolation: Overlay networks provide enhanced security by isolating different logical networks from each other. This isolation ensures that data traffic remains segregated and prevents unauthorized access to sensitive information. Additionally, overlay networks can implement advanced security measures such as encryption and access control, further fortifying network security.

Highlights: SD WAN Overlay

Understanding Overlay Networking

Overlay networking is a revolutionary approach to network design that enables the creation of virtual networks on top of existing physical networks. By abstracting the underlying infrastructure, overlay networks provide a flexible and scalable solution to meet the dynamic demands of modern applications and services. Whether in cloud environments, data centers, or even across geographically dispersed locations, overlay networking opens up a world of possibilities.

Overlay networks act as a virtual layer on the physical network infrastructure. They enable the creation of logical connections independent of the physical network topology. In the context of SD-WAN, overlay networks facilitate the seamless integration of multiple network connections, including MPLS, broadband, and LTE, into a unified and efficient network.

The advantages of overlay networking are manifold:

– First, it allows for seamless network segmentation, enabling different applications or user groups to operate in isolation while sharing the same physical infrastructure. This enhances security and simplifies network management.

– Secondly, overlay networks facilitate deploying advanced network services such as load balancing, firewalling, and encryption without complex changes to the underlying network infrastructure. This abstraction level empowers organizations to adapt and respond rapidly to evolving business needs.

**So, what exactly is an SD-WAN overlay?**

In simple terms, it is a virtual layer added to the existing network infrastructure. These network overlays connect different locations, such as branch offices, data centers, and the cloud, by creating a secure and reliable network.

1. Tunnel-Based Overlays:

One of the most common types of SD-WAN overlays is tunnel-based overlays. This approach encapsulates network traffic within a virtual tunnel, allowing it to traverse multiple networks securely. Tunnel-based overlays are typically implemented using IPsec or GRE (Generic Routing Encapsulation) protocols. They offer enhanced security through encryption and provide a reliable connection between the SD-WAN edge devices.

Example Technology: IPSec and GRE

Organizations can achieve enhanced network security and improved connectivity by combining GRE and IPSec. The integration allows for creating secure tunnels between networks, ensuring that data transmitted between them remains protected from potential threats. This combination also enables the establishment of Virtual Private Networks (VPNs), enabling secure remote access and seamless connectivity for geographically dispersed teams.

By encrypting GRE tunnels, IPsec provides secure VPN tunnels. This approach offers many benefits, including support for dynamic IGP routing protocols, non-IP protocols, and IP multicast. Furthermore, the headend IPsec termination points can support QoS policies and deterministic routing metrics.

There is built-in redundancy due to the pre-established primary and backup GRE over IPsec tunnels. Static IP addresses are required for the headend site, but dynamic IP addresses are permitted for the remote sites. To differentiate primary tunnels from backup tunnels, routing metrics can be modified slightly to favor one or the other.

Diagram: GRE over IPsec.

2. Segment-Based Overlays:

Segment-based overlays are designed to segment the network traffic based on specific criteria such as application type, user group, or location. This allows organizations to prioritize critical applications and allocate network resources accordingly. By segmenting the traffic, SD-WAN can optimize the performance of each application and ensure a consistent user experience. Segment-based overlays are particularly beneficial for businesses with diverse network requirements.

3. Policy-Based Overlays:

Policy-based overlays enable organizations to define rules and policies that govern the behavior of the SD-WAN network. These overlays use intelligent routing algorithms to dynamically select the most optimal path for network traffic based on predefined policies. By leveraging policy-based overlays, businesses can ensure efficient utilization of network resources, minimize latency, and improve overall network performance.

4. Hybrid Overlays:

Hybrid overlays combine the benefits of both public and private networks. This overlay allows organizations to utilize multiple network connections, including MPLS, broadband, and LTE, to create a robust and resilient network infrastructure. Hybrid overlays intelligently route traffic through the most suitable connection based on application requirements, network availability, and cost. Businesses can achieve high availability, cost-effectiveness, and improved application performance by leveraging mixed overlays.

5. Cloud-Enabled Overlays:

As more businesses adopt cloud-based applications and services, seamless connectivity to cloud environments becomes crucial. Cloud-enabled overlays provide direct and secure connectivity between the SD-WAN network and cloud service providers. These overlays ensure optimized performance for cloud applications by minimizing latency and providing efficient data transfer. Cloud-enabled overlays simplify the management and deployment of SD-WAN in multi-cloud environments, making them an ideal choice for businesses embracing cloud technologies.

**Challenge: The traditional network**

The networks we depend on for business are sensitive to many factors that can result in a slow and unreliable experience. We can experience latency, the time between a data packet being sent and received, or round-trip time (RTT), the time it takes for the packet to be sent and for a reply to come back.

We can also experience jitter, the variance in delay between data packets in the network, which shows up as a disruption in the sending and receiving of packets. We also have fixed-bandwidth networks that can experience congestion. For example, with five people sharing the same Internet link, each could experience a stable and swift network. Add another 20 or 30 people onto the same link, and the experience will be markedly different.
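You can get a rough feel for these metrics from any Linux host. The sketch below uses ping, whose summary line reports minimum, average, and maximum round-trip time plus mdev, a jitter-like measure of variation; the target address is just an example.

```bash
# Send 20 probes and keep only the summary lines.
# "rtt min/avg/max/mdev" shows round-trip latency and its variation.
ping -c 20 8.8.8.8 | tail -n 2
```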

Google Cloud & SD WAN

SD WAN Cloud Hub

Google Cloud is renowned for its robust infrastructure, scalability, and advanced services. By integrating SD-WAN with Google Cloud, businesses can tap into this powerful ecosystem. They gain access to Google’s global network, enabling optimized routing and lower latency. Additionally, the scalability and flexibility of Google Cloud allow organizations to adapt to changing network demands seamlessly.

Key Advantages:

When SD-WAN Cloud Hub is integrated with Google Cloud, it unlocks a host of advantages. Firstly, it enables organizations to seamlessly connect their branch offices, data centers, and cloud resources, providing a unified network fabric. This integration optimizes traffic flow and enhances application performance, ensuring a consistent user experience.

Key Features and Benefits:

Intelligent Traffic Steering: SD-WAN Cloud Hub with Google Cloud allows organizations to intelligently steer traffic based on application requirements, network conditions, and security policies. This dynamic traffic routing ensures optimal performance and minimizes latency.

Simplified Network Management: The centralized management platform of SD-WAN Cloud Hub simplifies network configuration, monitoring, and troubleshooting. Integration with Google Cloud provides a unified view of the entire network infrastructure, streamlining operations and reducing complexity.

Enhanced Security: SD-WAN Cloud Hub leverages Google Cloud’s security features, such as Cloud Armor and Cloud Identity-Aware Proxy, to protect network traffic and ensure secure communication between branches, data centers, and the cloud.

VPN and Overlay technologies

  • Performance-Based Routing & DMVPN 

Performance-based routing is a dynamic routing technique beyond traditional static routing protocols. It leverages real-time data and network monitoring to make intelligent routing decisions based on latency, bandwidth availability, and network congestion. By constantly analyzing network conditions, performance-based routing algorithms can adapt and reroute traffic to the most efficient paths, ensuring speedy and reliable data transmission.

  • DMVPN Phase 3

DMVPN Phase 3 is the latest iteration of the DMVPN technology, designed to address the limitations of its predecessors. It introduces significant improvements in scalability, routing protocol support, and encryption capabilities. By utilizing multipoint GRE tunnels, DMVPN Phase 3 enables efficient and dynamic communication between multiple sites securely and efficiently.

One of DMVPN Phase 3’s key advantages is its scalability. The introduction of the NHRP (Next Hop Resolution Protocol) redirect allows for the dynamic and efficient allocation of network resources, making it an ideal solution for large-scale deployments. Additionally, DMVPN Phase 3 supports a wide range of routing protocols, including OSPF, EIGRP, and BGP, providing network administrators with flexibility and ease of integration.

Multipoint GRE (mGRE) is an underlying technology that plays a crucial role in DMVPN’s functionality. It enables the establishment of multiple tunnels over a single GRE interface, maximizing network resource utilization. Encapsulating packets in GRE headers, mGRE facilitates traffic routing between remote sites, creating a secure and efficient communication path.

Redundancy can be achieved by configuring spokes to terminate tunnels on multiple headends at one or more hub locations. Cryptographic attributes are typically mapped to the tunnel initiated by the remote peer through IPsec tunnel protection.

FlexVPN Site-to-Site Smart Defaults

FlexVPN Site-to-Site Smart Defaults is a feature designed to simplify and streamline the configuration process of site-to-site VPN connections. It provides a set of predefined default values for various parameters, eliminating the need for complex manual configurations. Network administrators can save time and effort by leveraging these smart defaults while ensuring robust security.

One key advantage of FlexVPN Site-to-Site Smart Defaults is its ease of use. Network administrators can quickly deploy secure site-to-site VPN connections without extensive knowledge of complex VPN configurations. The predefined defaults ensure that essential security features are automatically enabled, minimizing the risk of misconfigurations and vulnerabilities.

FlexVPN IKEv2 Routing

FlexVPN IKEv2 routing is a robust routing protocol that combines the flexibility of FlexVPN with the security of IKEv2. It allows for dynamic routing between different sites and simplifies the management of complex network infrastructures. By utilizing the power of cryptographic security and advanced routing techniques, FlexVPN IKEv2 routing ensures secure and efficient communication across networks.

FlexVPN IKEv2 routing offers numerous benefits, making it a preferred choice for network administrators. Firstly, it provides enhanced scalability, allowing networks to grow and adapt to changing requirements quickly. Additionally, the protocol ensures end-to-end security by encapsulating data within secure tunnels, protecting it from unauthorized access. Moreover, FlexVPN IKEv2 routing supports multipoint connectivity, enabling seamless communication between multiple sites.

**Transport Fabric Technology**

SD-WAN leverages transport-independent fabric technology to connect remote locations. This is achieved by using overlay technology. The SDWAN overlay works by tunneling traffic over any transport between destinations within the WAN environment.

This gives genuine flexibility to route applications across any portion of the network, regardless of the circuit or transport type. This is the definition of transport independence. Having a fabric SDWAN overlay network means that every remote site, regardless of physical or logical separation, is always a single hop away from another. DMVPN works on the same transport-agnostic principle.

SD-WAN vs Traditional WAN

SD-WAN overlays offer several advantages over traditional WANs, including improved scalability, reduced complexity, and better control over traffic flows. They also provide better security, as each site is protected by its dedicated security protocols. Additionally, SD-WAN overlays can improve application performance and reliability and reduce latency.

Key Point: SD-WAN abstracts the underlay

With SD-WAN, the virtual WAN overlays are abstracted from the physical device’s underlay. Therefore, the virtual WAN overlays can take on topologies independent of each other without being pinned to the configuration of the underlay network. SD-WAN changes how you map application requirements to the network, allowing for the creation of independent topologies per application.

For example, mission-critical applications may use expensive leased lines, while lower-priority applications can use inexpensive best-effort Internet links. This can all change on the fly if specific performance metrics are unmet.

Previously, the application had to match and “fit” into the network with the legacy WAN, but with an SD-WAN, the application now controls the network topology. Multiple independent topologies per application are a crucial driver for SD-WAN.

Example Technology: PTP GRE

Point-to-point GRE (Generic Routing Encapsulation) is a protocol that encapsulates and transports various network layer protocols over an IP network. By providing a virtual point-to-point link, it enables efficient communication between remote networks and, when paired with IPsec, secure connectivity as well. Point-to-point GRE offers a flexible and scalable solution for organizations seeking to establish connections over public or private networks.
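For a feel of how such a tunnel is built, here is a minimal Linux (iproute2) sketch; the public endpoints, tunnel addresses, and remote subnet are placeholders, and a mirror-image configuration would be applied on the far side.

```bash
# Create a point-to-point GRE tunnel to a remote peer (placeholder addresses).
ip tunnel add gre1 mode gre local 198.51.100.1 remote 203.0.113.2 ttl 255
ip addr add 10.10.10.1/30 dev gre1      # overlay-side tunnel address
ip link set gre1 up
# Send traffic for a remote subnet over the tunnel.
ip route add 192.168.20.0/24 dev gre1
```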

SD-WAN combines transports, SDWAN overlay, and underlay

Look at it this way. With an SD-WAN topology, there are different levels of networking: an underlay network, the physical infrastructure, and an SDWAN overlay network. The physical infrastructure is the routers, switches, and WAN transports; the overlay network is the set of virtual WAN overlays.

The SDWAN overlay presents a different network to the application. For example, the voice application will see only the voice overlay. The logical virtual pipe the overlay creates, and that the application sees, differs from the underlay.

An SDWAN overlay network is a virtual or logical network created on top of an existing physical network. The internet itself, which began as an overlay built on top of the telephone network, is a classic example of an overlay network. An overlay network is any virtual layer on top of physical network infrastructure.

  • Consider an SDWAN overlay as a flexible tag.

This may be as simple as a virtual local area network (VLAN), but it typically refers to more complex virtual layers from an SDN or an SD-WAN. Think of an SDWAN overlay as a tag, so building the overlays is not expensive or time-consuming. In addition, you don’t need to buy physical equipment for each overlay, as the overlay is virtualized in software.

Similar to software-defined networking (SDN), the critical part is that SD-WAN works by abstraction. All the complexities are abstracted into application overlays. For example, application type A can use this SDWAN overlay, and application type B can use that SDWAN overlay. 

  • IP and port numbers, orchestration, and end-to-end

Recent application requirements drive a new type of WAN that more accurately supports today’s environment with an additional layer of policy management. The world has moved away from using IP addresses and port numbers to identify applications and make the correct forwarding decision.

Example Product: Cisco Meraki

**Section 1: Simplified Network Management**

One of the standout features of the Cisco Meraki platform is its user-friendly interface. Gone are the days of complex configurations and cumbersome setups. With Meraki, IT administrators can manage their entire network from a single, intuitive dashboard. This centralized management capability allows for real-time monitoring, troubleshooting, and updates, all from the cloud. The simplicity of the platform means that even those with limited technical expertise can effectively manage and optimize their network.

**Section 2: Robust Security Features**

In today’s digital landscape, security is paramount. Cisco Meraki understands this and has built comprehensive security features into its platform. From advanced malware protection to intrusion prevention and content filtering, Meraki offers a multi-layered approach to cybersecurity. The platform also includes built-in security analytics, providing IT teams with valuable insights to proactively address potential threats. This level of security ensures that your network remains protected against both internal and external vulnerabilities.

**Section 3: Scalability and Flexibility**

Another significant advantage of the Cisco Meraki platform is its scalability. As your business grows, so too can your network. Meraki’s cloud-based nature allows for seamless integration of new devices and locations without the need for extensive hardware upgrades. This flexibility makes it an ideal solution for businesses of all sizes, from small startups to large multinational corporations. The platform’s ability to adapt to changing needs ensures that it can grow alongside your business.

**Section 4: Comprehensive Support and Training**

Cisco Meraki doesn’t just provide a platform; it offers a complete ecosystem of support and training. From comprehensive documentation and online tutorials to live webinars and a dedicated support team, Meraki ensures that you have all the resources you need to make the most of its platform. This commitment to customer success means that you’re never alone in your network management journey.

Challenges to Existing WAN

Traditional WAN architectures consist of private MPLS links complemented with Internet links as a backup. Standard templates in most Service Provider environments are usually broken down into Bronze, Silver, and Gold SLAs. 

However, these types of SLA do not fit all geographies and often need to be fine-tuned per location. Capacity, reliability, analytics, and security are all central parts of the WAN that should be available on demand. Traditional infrastructure is very static; bandwidth upgrades and service changes require considerable processing time, locking out agility for new sites.

It’s not agile enough, and nothing can be performed on the fly to meet growing business needs. In addition, the cost per bit for the private connection is high, which is problematic for bandwidth-intensive applications, especially when upgrades are too costly to afford.

  • A distributed world of dissolving perimeters

Perimeters are dissolving, and the world is becoming distributed. Applications require a WAN to support distributed environments along with flexible network points. Centralized-only designs result in suboptimal traffic engineering and increased latency. Increased latency disrupts the application performance, and only a particular type of content can be put into a Content Delivery Network (CDN). CDN cannot be used for everything.

Traditional WANs are operationally complex; different people typically handle the network and security functions. For example, you may have a DMVPN specialist, a security specialist, and a networking specialist. Some wear all hats, but they are few and far between. Different hats have different ideas, and agreeing on a minor network change can take ages.

  • The world of SD-WAN

SD-WAN replaces traditional WAN routers and is agnostic to the underlying transport technology. You can use various link types, such as MPLS, LTE, and broadband, all combined. Based on policies generated by the business, SD-WAN enables load sharing across different WAN connections that more efficiently support today’s application environment.

It pulls policy and intelligence out of the network and places them into an end-to-end solution orchestrated by a single pane of glass. SD-WAN is not just about tunnels. It consists of components that work together to simplify network operations while meeting all bandwidth and resilience requirements.

Centralized network points are no longer adequate; we need network points positioned where they make the most sense for the application and user. Backhauling traffic to a central data center is illogical, and connecting remote sites to a SaaS or IaaS model over the public Internet is far more efficient. The majority of enterprises prefer to connect remote sites directly to cloud services. So why not let them do this in the best possible way?

A new style of WAN and SD-WAN

We require a new WAN style and a shift from a network-based approach to an application-based approach. The new WAN no longer looks solely at the network to forward packets. Instead, it looks at the business application and decides how to optimize it with the correct forwarding behavior. This new style of forwarding is problematic with traditional WAN architecture.

Making business logic decisions with IP and port number information is challenging. Standard routing works packet by packet and can only see part of the picture. Routers have routing tables and perform forwarding, but they essentially operate on their own little islands, losing the holistic view required for accurate end-to-end decision-making. An additional layer of information is needed.

A controller-based approach offers the necessary holistic view. We can now make decisions based on global information, not solely on a path-by-path basis. Getting all the routing information and compiling it into a dashboard to make a decision is much more efficient than making local decisions that only see parts of the network. 

From a customer’s viewpoint, what would the perfect WAN look like if you could roll back the clock and start again?   

Related: For additional pre-information, you may find the following helpful:

  1. Transport SDN
  2. SD WAN Diagram 
  3. Overlay Virtual Networking

SD WAN Overlay

Introducing the SD-WAN Overlay

SD-WAN decouples (separates) the WAN infrastructure, whether physical or virtual, from its control plane mechanism and allows applications or application groups to be placed into virtual WAN overlays. The separation will enable us to bring many enhancements and improvements to a WAN that has had little innovation in the past compared to the rest of the infrastructure, such as server and storage modules.

With server virtualization, several virtual machines create application isolation on a physical server. Applications placed in VMs operate in isolation, yet the VMs are installed on the same physical hosts.

Consider SD-WAN to operate with similar principles. Each application or group can operate independently when traversing the WAN to endpoints in the cloud or other remote sites. These applications are placed into a virtual SDWAN overlay.

Overlay Networking

Overlay networking is an approach to computer networking that involves building a layer of virtual networks on top of an existing physical network. This approach improves the underlying infrastructure’s scalability, performance, and security. It also allows for the creation of virtual networks that span multiple physical networks, allowing for greater flexibility in traffic routes.
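To make the idea concrete, here is a minimal overlay sketch using VXLAN on Linux, one of several encapsulations an overlay can use; the interface names, VNI, and addresses are illustrative only.

```bash
# Create a VXLAN interface (VNI 100) that tunnels Ethernet frames over the
# existing physical network via eth0, using the standard UDP port 4789.
ip link add vxlan100 type vxlan id 100 dev eth0 dstport 4789 \
    remote 203.0.113.2
ip addr add 172.16.0.1/24 dev vxlan100   # overlay-side address
ip link set vxlan100 up
```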

**Virtualization**

At the core of overlay networking is the concept of virtualization. This involves separating the physical infrastructure from the virtual networks, allowing greater control over allocating resources. This separation also allows the creation of virtual network segments that span multiple physical networks. This provides an efficient way to route traffic and the ability to provide additional security and privacy measures.

**Underlay network**

A network underlay is a physical infrastructure that provides the foundation for a network overlay, a logical abstraction of the underlying physical network. The network underlay provides the physical transport of data between nodes, while the overlay provides logical connectivity.

The network underlay can comprise various technologies, such as Ethernet, Wi-Fi, cellular, satellite, and fiber optics. It is the foundation of a network overlay and essential for its proper functioning. It provides data transport and physical connections between nodes. It also provides the physical elements that make up the infrastructure, such as routers, switches, and firewalls.

Example: DMVPN over IPSec

Understanding DMVPN

DMVPN is a dynamic VPN technology that simplifies establishing secure connections between multiple sites. Unlike traditional VPNs, which require point-to-point tunnels, DMVPN uses a hub-and-spoke architecture, allowing any-to-any connectivity. This flexibility enables organizations to quickly scale their networks and accommodate dynamic changes in their infrastructure.

On the other hand, IPsec provides a robust framework for securing IP communications. It offers encryption, authentication, and integrity mechanisms, ensuring that data transmitted over the network remains confidential and protected against unauthorized access. IPsec is a widely adopted standard that is compatible with various network devices and software, making it an ideal choice for securing DMVPN connections.

The combination of DMVPN and IPsec brings numerous benefits to organizations. Firstly, DMVPN’s dynamic nature allows for easy scalability and improved network resiliency. New sites can be added seamlessly without the need for manual configuration changes. Additionally, DMVPN over IPsec provides strong encryption, ensuring the confidentiality of sensitive data. Moreover, DMVPN’s any-to-any connectivity optimizes network traffic flow, enhancing performance and reducing latency.

 

Diagram: Overlay networking. Source: Researchgate.

Key Challenges: Driving Overlay Networking & SD-WAN

**Challenge: We need more bandwidth**

Modern businesses demand more bandwidth than ever to connect their data, applications, and services. As a result, we have many things to consider with the WAN, such as regulations, security, visibility, branch and data center sites, remote workers, internet access, cloud, and traffic prioritization. These demands are driving the need for SD-WAN.

The concepts and design principles of creating a wide area network (WAN) to provide resilient and optimal transit between endpoints have continuously evolved. However, the driver behind building a better WAN is to support applications that demand performance and resiliency.

**Challenge: Suboptimal traffic flow**

The optimal route will be the fastest or most efficient and, therefore, preferred to transfer data. Sub-optimal routes will be slower and, hence, not the selected route. Centralized-only designs resulted in suboptimal traffic flow and increased latency, which will degrade application performance.

A key point to note is that traditional networks focus on centralized points in the network that all applications, network, and security services must adhere to. These network points are fixed and cannot be changed.

**Challenge: Network point intelligence**

However, the network should evolve to have network points positioned where it makes the most sense for the application and user, not based on a previously validated design for a different application era. For example, many branch sites do not have local Internet breakouts.

So, for this reason, we backhauled internet-bound traffic to secure, centralized internet portals at the H.Q. site. As a result, we sacrificed the performance of Internet and cloud applications. Designs that place the H.Q. site at the center of connectivity requirements inhibit the dynamic access requirements for digital business.

**Challenge: Hub and spoke drawbacks**

Simple spoke-type networks are sub-optimal because you always have to go to the center point of the hub and then out to the machine you need rather than being able to go directly to whichever node you need. As a result, the hub becomes a bottleneck in the network as all data must go through it. With a more scattered network using multiple hubs and switches, a less congested and more optimal route could be found between machines.

Diagram: Cisco SD-WAN overlay. Source: Network Academy.

The Fabric:

The word fabric comes from the fact that there are many paths to move from one server to another, which eases load balancing and traffic distribution. SDN aims to centralize the control that decides how flows are distributed over all the fabric paths. Then, we have an SDN controller device. The SDN controller can also control several fabrics simultaneously, managing intra- and inter-datacenter flows.

SD-WAN is used to control and manage a company’s multiple WANs. There are different types of WAN transport: Internet, MPLS, LTE, DSL, fiber, wired circuits, and so on. SD-WAN uses SDN technology to control the entire environment. As in SDN, the data plane and control plane are separated. A centralized controller is added to manage flows, routing and switching policies, packet priority, network policies, and more. SD-WAN technology is based on an overlay, meaning logical nodes and tunnels are built on top of the underlying networks.

Centralized logic:

In a traditional network, the transport functions and the control layer are resident on each device. This is why any configuration or change must be done box by box. Configuration was carried out manually or, at most, with an Ansible script. SD-WAN brings software-defined networking (SDN) concepts to the enterprise branch WAN.

Software-defined networking (SDN) is an architecture, whereas SD-WAN is a technology that can be purchased and built on SDN’s foundational concepts. SD-WAN’s centralized logic stems from SDN. SDN separates the control from the data plane and uses a central controller to make intelligent decisions, similar to the design that most SD-WAN vendors operate.

A holistic view:

The controller and the SD-WAN overlay have a holistic view. The controller supports central policy management, enabling network-wide policy definitions and traffic visibility. The SD-WAN edge devices perform the data plane. The data plane is where simple forwarding occurs, and the control plane, which is separate from the data plane, sets up all the controls for the data plane to forward.

Like SDN, the SD-WAN overlay abstracts network hardware into a control plane with multiple data planes to make up one large WAN fabric. As the control layer is abstracted and decoupled from the physical devices and runs in software, services can be virtualized and delivered from a central location to any point on the network.

SD-WAN Overlay Features

SD-WAN Overlay Feature 1: Combining the transports:

At its core, SD-WAN shapes and steers application traffic across multiple WAN transports. Building on the concept of link bonding, which combines numerous transports and transport types, the SD-WAN overlay moves this functionality up the stack. First, SD-WAN aggregates last-mile services, representing them as a single pipe to the application. SD-WAN allows you to combine all transport links into one big pipe. SD-WAN is transport agnostic: because it works by abstraction, it does not care what transport links you have. Maybe you have MPLS, private Internet, or LTE; it can combine all of these or use them separately.

SD-WAN Overlay Feature 2: A central location:

From a central location, SD-WAN pulls all of these WAN resources together, creating one large WAN fabric that allows administrators to slice up the WAN to match the application requirements that sit on top. Different applications traverse the WAN, so we need the WAN to react differently to each. For example, if you’re running a call center, you want low delay, low latency, and high availability for voice traffic, and you may wish to steer that traffic over a path with an excellent service-level agreement.

Diagram: SD-WAN traffic steering. Source: Cisco.

SD-WAN Overlay Feature 3: Traffic steering:

Traffic steering may also be required: for example, moving voice traffic to another path if the first path is experiencing high latency. If it’s not possible to steer traffic automatically to a better-performing link, a series of path remediation techniques can be run to try to improve performance. File transfer differs from real-time voice: you can tolerate more delay but need more bandwidth. Here, you may want to use a combination of WAN transports (such as customer broadband and LTE) to achieve higher aggregate bandwidth.

This also allows you to automatically steer traffic over different WAN transports when there is degradation on one link. With the SD-WAN overlay, we must start thinking about paths, not links.

SD-WAN Overlay Feature 4: Intelligent decisions:

At its core, SD-WAN enables real-time application traffic steering over any link, such as broadband, LTE, and MPLS, assigning pre-defined policies based on business intent. Steering policies support many application types, making intelligent decisions about how WAN links are utilized and which paths are taken.

The concept of an underlay and overlay are not new, and SD-WAN borrows these designs. First, the underlay is the physical or virtual world, such as the physical infrastructure. Then, we have the overlay, where all the intelligence can be set. The SDWAN overlay represents the virtual WANs that hold your different applications.

A virtual WAN overlay enables us to steer traffic and combine all bandwidths. Similar to how applications are mapped to VMs in the server world, with SD-WAN, each application is mapped to its own virtual SD-WAN overlay. Each virtual SD-WAN overlay can have its own SD-WAN security policies, topologies, and performance requirements.

SD-WAN Overlay Feature 5: Application-Aware Routing Capabilities

Not only do we need application visibility to forward efficiently over either transport, but we also need the ability to look deep inside the application and examine the sub-applications. For example, we can distinguish Facebook chat from regular Facebook traffic. This removes the application’s mystery and allows you to balance loads based on sub-applications. It’s like configuring the network with a scalpel instead of a sledgehammer.

SD-WAN Overlay Feature 6: Ease of Integration With Existing Infrastructure

The risk of introducing new technologies may come with a disruptive implementation strategy. Loss of service damages more than the engineer’s reputation. It hits all areas of the business. The ability to seamlessly insert new sites into existing designs is a vital criterion. With any network change, a critical evaluation is to know how to balance risk with innovation while still meeting objectives.

How aligned is marketing content with what happens in reality? It’s easy for marketing materials to claim a solution can be inserted at Layer 2 or 3; actually doing it is an entirely different ball game. SD-WAN requires a certain amount of due diligence. One way to read between the noise is to examine who has real-life deployments with proven proof of concept (POC) and validated designs. A proven POC will help guide your transition in a step-by-step manner.

SD-WAN Overlay Feature 7: Region-Specific Routing Topologies

Every company has different requirements for hub and spoke full mesh and Internet PoP topologies. For example, Voice should follow a full mesh design, while Data requires a hub and spokes connecting to a central data center. Nearest service availability is the key to improved performance, as there is little we can do about the latency Gods except by moving services closer together. 

SD-WAN Overlay Feature 8: Centralized Device Management & Policy Administration

The manual box-by-box approach to policy enforcement is not the way forward. It’s similar to stepping back into the Stone Age to request a catered flight. The ability to tie everything to a template and automate enables rapid branch deployments, security updates, and configuration changes. The optimal solutions have everything in one place and can dynamically push out upgrades.

SD-WAN Overlay Feature 9: Highly Available With Automatic Failovers

You cannot apply a singular viewpoint to high availability. An end-to-end solution should address the high availability requirements of the device, link, and site level. WANs can fail quickly, but this requires additional telemetry information to detect failures and brownout events. 

SD-WAN Overlay Feature 10: Encryption On All Transports

Irrespective of link type, whether MPLS, LTE, or the Internet, we need the capacity to encrypt on all those paths without the excess baggage of IPsec. Encryption should happen automatically, and the complexity of IPsec should be abstracted.

**Application-Orientated WAN**

Push to the cloud:  

When geographically dispersed users connect back to central locations, their consumption triggers additional latency, degrading the application’s performance. No one can get away from latency unless we find ways to change the speed of light. One way is to shorten the link by moving to cloud-based applications.

The push to the cloud is inevitable. Most businesses are now moving away from on-premise, in-house hosting to cloud-based management. The benefits of moving to the cloud are manifold; it is easier for so many reasons.

The ready-made global footprint enables the usage of SaaS-based platforms that negate the drawbacks of dispersed users tromboning to a central data center. This software is pivotal to millions of businesses worldwide, which explains why companies such as Capsifi are so popular.

Logically positioned cloud platforms are closer to the mobile user. It’s increasingly far more efficient from the technical and business standpoint to host these applications in the cloud, which makes them available over the public Internet.

Bandwidth intensive applications:

Richer applications, multimedia traffic, and growth in the cloud application consumption model drive the need for additional bandwidth. Unfortunately, we can only fit so much into a single link. The resulting congestion leads to packet drops, ultimately degrading application performance and user experience. In addition, most applications ride on TCP, yet TCP was not designed with performance in mind.

Organic growth:

Organic business growth is a significant driver of additional bandwidth requirements. The challenge is that existing network infrastructures are static and unable to respond adequately to this growth in a reasonable time frame. The last mile of MPLS locks you in and kills agility. Circuit lead times impede the organization’s productivity and create an overall lag.

Costs:

A WAN virtualization solution should be simple. To serve the new era of applications, we need to increase the link capacity by buying more bandwidth. However, nothing is as easy as it may seem. The WAN is one of the network’s most expensive parts, and employing link oversubscription to reduce congestion is too costly.

Furthermore, bandwidth comes at a cost, and the existing TDM-based MPLS architectures cannot accommodate application demands. 

Traditional MPLS is feature-rich and offers many benefits. No one doubts this fact. MPLS will never die. However, it comes at a high cost for relatively low bandwidth. Unfortunately, MPLS’s price and capabilities are not a perfect couple.

Hybrid connectivity:

Since there is not one stamp for the entire world, similar applications will have different forwarding preferences. Therefore, application flows are dynamic and change depending on user consumption. Furthermore, the MPLS, LTE, and Internet links often complement each other since they support different application types.

For example, Storage and Big data replication traffic are forwarded through the MPLS links, while cheaper Internet connectivity is used for standard Web-based applications.

Limitations of protocols:

When left to its defaults, IPsec is challenged by hybrid connectivity. IPSec architecture is point-to-point, not site-to-site. As a result, it doesn’t natively support redundant uplinks. Complex configurations are required when sites have multiple uplinks to multiple providers.

By default, IPsec is not abstracted; one session cannot be used over multiple uplinks, causing additional issues with transport failover and path selection. It’s a Swiss Army knife of features, and many of IPsec’s complexities should be abstracted. Secure tunnels should be brought up and torn down immediately, and new sites should be incorporated into a secure overlay without much delay or manual intervention.

Internet of Things (IoT):

Security and bandwidth consumption are key issues when introducing IP-enabled objects and IoT access technologies. IoT is all about Data and will bring a shed load of additional overheads for networks to consume. As millions of IoT devices come online, how efficiently do we segment traffic without complicating the network design further? Complex networks are hard to troubleshoot, and simplicity is the mother of all architectural success. Furthermore, much IoT processing requires communication to remote IoT platforms. How do we account for the increased signaling traffic over the WAN? The introduction of billions of IoT devices leaves many unanswered questions.

Branch NFV:

There has been strong interest in infrastructure consolidation by deploying Network Function Virtualization (NFV) at the branch. Enabling on-demand service and chaining application flows are key drivers for NFV. However, traditional service chaining is static since it is bound to a particular network topology. Moreover, it is typically built through manual configuration and is prone to human error.

 SD-WAN overlay path monitoring:

SD-WAN monitors the paths and the application performance on each link (Internet, MPLS, LTE ) and then chooses the best path based on real-time conditions and the business policy. In summary, the underlay network is the physical or virtual infrastructure above which the overlay network is built. An SDWAN overlay network is a virtual network built on top of an underlying Network infrastructure/Network layer (the underlay).

Controller-based policy:

An additional layer of information is needed to make more intelligent decisions about how and where to forward application traffic. SD-WAN offers a controller-based policy approach that incorporates a holistic view.

A central controller can now make decisions based on global information, not solely on a path-by-path basis with traditional routing protocols.  Getting all the routing information and compiling it into the controller to make a decision is much more efficient than making local decisions that only see a limited part of the network.

The SD-WAN Controller provides physical or virtual device management for all SD-WAN Edges associated with the controller. This includes, but is not limited to, configuration and activation, IP address management, and pushing down policies onto SD-WAN Edges located at the branch sites.

SD-WAN Overlay Case Study:

Personal Note: I recently consulted for a private enterprise. Like many enterprises, they have many applications, both legacy and new. No one knew what was actually running over the WAN; visibility was low. For the network design, the H.Q. has MPLS and direct Internet access. Nothing is new here; this design has been in place for the last decade. All traffic is backhauled to the HQ/MPLS headend for security screening. The security stack, including firewalls, IDS/IPS, and anti-malware, was in the H.Q. The remote sites have high latency and limited connectivity options.

More importantly, they are transitioning their ERP system to the cloud. As apps move to the cloud, they want to avoid fixed WAN, a big driver for a flexible SD-WAN solution. They also have remote branches, which are hindered by high latency and poorly managed IT infrastructure. But they don’t want an I.T. representative at each site location. They have heard that SD-WAN has a centralized logic and can view the entire network from one central location. These remote sites must receive large files from the H.Q.; the branch sites’ transport links are only single-customer broadband links.

Some remote sites have LTE, and the bills are getting more significant. The company wants to reduce costs with dedicated Internet access or customer/business broadband. They have heard that you can combine different transports with SD-WAN and have several path remediations on degraded transports for better performance. So, they decided to roll out SD-WAN. From this new architecture, they gained several benefits.

SD-WAN Visibility

When your business-critical applications operate over different provider networks, troubleshooting and finding the root cause of problems becomes more challenging. So, visibility is critical to business. SD-WAN allows you to see network performance data in real-time and is essential for determining where packet loss, latency, and jitter are occurring so you can resolve the problem quickly.

You also need to be able to see who or what is consuming bandwidth so you can spot intermittent problems. For all these reasons, SD-WAN visibility needs to go beyond network performance metrics and provide greater insight into the delivery chains that run from applications to users.

  • Understand your baselines:

Visibility is needed to complete the network baseline before the SD-WAN is deployed. This enables the organization to understand existing capabilities, the norm, what applications are running, the number of sites connected, what service providers used, and whether they’re meeting their SLAs. Visibility is critical to obtaining a complete picture, so teams understand how to optimize the business infrastructure. SD-WAN gives you an intelligent edge, so you can see all the traffic and act on it immediately.

First, look at the visibility of the various flows, the links used, and any issues on those links. Then, if necessary, you can tweak the bonding policy to optimize the traffic flow. Before the rollout of SD-WAN, there was no visibility into the types of traffic or which applications used what bandwidth. They had limited knowledge of WAN performance.

  • SD-WAN offers higher visibility:

With SD-WAN, they have the visibility to control and classify traffic on layer 7 values, such as the URL being used and the domain being requested, along with the standard port and protocol. All applications are not equal; some run better on different links. If an application is not performing correctly, you can route it to a different circuit. With the SD-WAN orchestrator, you have complete visibility across all locations, all links, and all the traffic across all circuits.

  • SD-WAN High Availability:

Any high-availability solution aims to ensure that all network services are resilient to failure. It aims to provide continuous access to network resources by addressing the potential causes of downtime through functionality, design, and best practices. The previous high-availability design was active and passive with manual failover. It was hard to maintain, and there was a lot of unused bandwidth. Now, they use resources more efficiently and are no longer tied to the bandwidth of the first circuit.

There is a more granular application failover mechanism. You can also select which apps are prioritized if a link fails or when a certain congestion ratio is hit. For example, you may have LTE as a backup, which can be expensive. So applications marked high priority are steered over the backup link, but guest Wi-Fi traffic isn’t.

  • Flexible topology:

Before, they had a hub-and-spoke MPLS design for all applications. They wanted a full-mesh architecture for some applications while keeping the existing hub and spoke for others. However, the service provider couldn’t accommodate the level of granularity they wanted.

With SD-WAN, they can choose topologies better suited to the application type. As a result, the network design is now more flexible and matches the application, rather than the application being forced to match a network design that doesn’t suit it.

Types of SD-WAN

The market for branch office wide-area network functionality is shifting from dedicated routing, security, and WAN optimization appliances to feature-rich SD-WAN. As a result, WAN edge infrastructure now incorporates a widening set of network functions, including secure routers, firewalls, SD-WAN, WAN path control, and WAN optimization, along with traditional routing functionality. Therefore, consider the following approach to deploying SD-WAN.

1. Application-based approach

With SD-WAN, we are shifting from a network-based approach to an application-based approach. The new WAN no longer looks solely at the network to forward packets. Instead, it looks at the business requirements and decides how to optimize the application with the correct forwarding behavior. This new way of forwarding would be problematic when using traditional WAN architectures.

Making business logic decisions with IP and port number information is challenging. Standard routing is the most common way to forward application traffic today, but it only assesses part of the picture when making its forwarding decision.

These devices have routing tables to perform forwarding. Still, with this model, each device operates and decides on its own little island, losing the holistic view required for accurate end-to-end decision-making.

2. SD-WAN: Holistic decision

The WAN must start making decisions holistically. It should not be viewed as a single module in the network design. Instead, it must incorporate several elements it has not integrated to capture the correct per-application forwarding behavior. The ideal WAN should be automatable to form a comprehensive end-to-end solution centrally orchestrated from a single pane of glass.

Managed and orchestrated centrally, this new WAN fabric is transport agnostic. It offers application-aware routing, regional-specific routing topologies, encryption on all transports regardless of link type, and high availability with automatic failover. All of these will be discussed shortly and are the essence of SD-WAN.  

3. SD-WAN and central logic        

Besides the virtual SD-WAN overlay, another key SD-WAN concept is centralized logic. On a standard router, local routing tables are computed by an algorithm to forward a packet to a given destination.

It receives routes from its peers or neighbors but computes paths locally and makes local routing decisions. The critical point to note is that everything is calculated locally. SD-WAN functions on a different paradigm.

Rather than using distributed logic, it utilizes centralized logic. This allows you to view the entire network holistically while keeping a distributed forwarding plane that makes real-time decisions based on better metrics than before.

This paradigm enables SD-WAN to see how the flows behave along the path. It takes the fragmented control approach and centralizes it while still benefiting from a distributed forwarding plane.

The SD-WAN controller, which acts as the brain, can set different applications to run over different paths based on business requirements and performance SLAs, not on a fixed topology. So, for example, if one path does not have acceptable packet loss and latency is high, we can move to another path dynamically.
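To make the idea of controller-driven, SLA-based steering concrete, here is a minimal Python sketch. The path names, SLA thresholds, and metric values are illustrative assumptions, not any vendor's API; a real controller would continuously re-evaluate this decision as telemetry arrives.

```python
# Minimal sketch of centralized, SLA-driven path selection (illustrative only;
# path names, thresholds, and metrics are assumptions, not a vendor API).
from dataclasses import dataclass

@dataclass
class PathMetrics:
    loss_pct: float      # measured packet loss in percent
    latency_ms: float    # measured latency in milliseconds

# Per-application SLA targets set by business policy (hypothetical values).
APP_SLA = {
    "voice": PathMetrics(loss_pct=1.0, latency_ms=150),
    "erp":   PathMetrics(loss_pct=2.0, latency_ms=250),
}

def select_path(app: str, paths: dict) -> str:
    """Return a path that meets the app's SLA, else the least-bad path."""
    sla = APP_SLA[app]
    compliant = {name: m for name, m in paths.items()
                 if m.loss_pct <= sla.loss_pct and m.latency_ms <= sla.latency_ms}
    candidates = compliant or paths
    # Prefer the lowest-latency candidate; the controller re-runs this as metrics change.
    return min(candidates, key=lambda name: candidates[name].latency_ms)

current = {"mpls": PathMetrics(0.1, 40), "internet": PathMetrics(3.5, 120)}
print(select_path("voice", current))  # -> "mpls": internet exceeds the loss target
```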

4. Independent topologies

SD-WAN has different levels of networking and brings the concepts of SDN into the wide area network. Similar to SDN, with SD-WAN we have an underlay and an overlay network. The WAN infrastructure, either physical or virtual, is the underlay, and the SD-WAN overlay sits in software on top of the underlay, where the applications are mapped.

This decoupling or separation of functions allows different application or group overlays. Previously, the application had to work with a fixed and pre-built network infrastructure. With SD-WAN, the application can choose the type of topology it wants, such as a full mesh or hub and spoke. The topologies with SD-WAN are much more flexible.

5. The SD-WAN overlay

SD-WAN optimizes traffic over multiple available connections. It dynamically steers traffic to the best available link. If the available links show any transmission issues, it will immediately move traffic to a better path, or apply remediation to a link if, for example, you only have a single link. SD-WAN delivers application flows from a source to a destination based on the configured policy and the best available network path. A core concept of SD-WAN is the overlay.

SD-WAN solutions provide the software abstraction to create the SD-WAN overlay and decouple network software services from the underlying physical infrastructure. Multiple virtual overlays may be defined to abstract the underlying physical transport services, each supporting a different quality of service, preferred transport, and high availability characteristics.

6. Application mapping

Application mapping also allows you to steer traffic over different WAN transports. This steering is automatic and can be implemented when specific performance metrics are unmet. For example, if Internet transport has a 15% packet loss, the policy can be set to steer all or some of the application traffic over to better-performing MPLS transport.

Applications are mapped to different overlays based on business intent, not infrastructure details like IP addresses. When you think about overlays, it’s common to have, on average, four overlays. For example, you may have platinum, gold, and bronze SD-WAN overlays, and then you can map the applications to these overlays.

The applications will have different networking requirements, and overlays allow you to slice and dice your network if you have multiple application types. 
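As a rough illustration of intent-based application mapping, the following Python sketch maps applications to named overlays rather than to IP addresses. The overlay tiers and application names are assumptions made for the example.

```python
# Illustrative sketch: map applications to overlays by business intent,
# not by IP address. Overlay names and app categories are assumptions.
OVERLAY_POLICY = {
    "platinum": {"voice", "video-conferencing"},    # tightest loss/latency targets
    "gold":     {"erp", "crm"},                     # business-critical transactional apps
    "bronze":   {"guest-wifi", "software-updates"}, # best-effort traffic
}

def overlay_for(app: str, default: str = "bronze") -> str:
    """Return the overlay an application is mapped to, falling back to best effort."""
    for overlay, apps in OVERLAY_POLICY.items():
        if app in apps:
            return overlay
    return default

print(overlay_for("erp"))      # -> gold
print(overlay_for("unknown"))  # -> bronze (deny-nothing here; a real policy may differ)
```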

SD-WAN & WAN metrics

SD-WAN captures metrics that go far beyond the standard WAN measurements. The traditional method measures packet loss, latency, and jitter to determine path quality, and routing protocols only make the forwarding decision at layer 3 of the OSI model.

As we know, layer 3 of the OSI model lacks application intelligence and misses the overall user experience. We must start looking at application transactions rather than relying on bits, bytes, jitter, and latency.

SD-WAN incorporates better metrics beyond those a standard WAN edge router considers. These metrics may include application response time, network transfer time, and service response time. Some SD-WAN solutions monitor each flow’s RTT, sliding windows, and ACK delays, not just the IP or TCP headers. This creates a more accurate view of the application’s performance.
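One simple way to picture per-flow measurement is an exponentially weighted moving average of RTT samples, similar in spirit to TCP's smoothed RTT. The sketch below is illustrative; the smoothing factor and sample values are assumptions.

```python
# Sketch of per-flow RTT smoothing, similar in spirit to the per-flow
# measurements some SD-WAN solutions keep. Alpha and samples are assumptions.
def smoothed_rtt(samples_ms, alpha=0.125):
    """Exponentially weighted moving average of RTT samples (TCP-style SRTT)."""
    srtt = None
    for sample in samples_ms:
        srtt = sample if srtt is None else (1 - alpha) * srtt + alpha * sample
    return srtt

flow_samples = [42.0, 45.0, 44.0, 120.0, 47.0]   # a latency spike in the middle
print(round(smoothed_rtt(flow_samples), 1))       # the smoothed value dampens the spike
```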

Overlay Use Case: DMVPN Dual Cloud

Exploring Single Hub Dual Cloud Configuration

The Single Hub, Dual Cloud configuration is a DMVPN setup in which a central hub site connects to two or more cloud service providers simultaneously. This configuration offers several advantages, such as increased redundancy, improved performance, and enhanced security.

By connecting to multiple cloud service providers, the Single Hub Dual Cloud configuration ensures redundancy if one provider experiences an outage. This redundancy enhances network availability and minimizes the risk of downtime, providing a robust and reliable network infrastructure.

With the Single Hub Dual Cloud configuration, network traffic can be load-balanced across multiple cloud service providers. This load balancing distributes the workload evenly, optimizing network performance and preventing bottlenecks. It allows for efficient utilization of network resources, resulting in enhanced user experience and improved application performance.

Summary: SD WAN Overlay

In today’s digital landscape, businesses increasingly rely on cloud-based applications, remote workforces, and data-driven operations. As a result, the demand for a more flexible, scalable, and secure network infrastructure has never been greater. This is where SD-WAN overlay comes into play, revolutionizing how organizations connect and operate.

SD-WAN overlay is a network architecture that allows organizations to abstract and virtualize their wide area networks, decoupling them from the underlying physical infrastructure. It utilizes software-defined networking (SDN) principles to create an overlay network that runs on top of the existing WAN infrastructure, enabling centralized management, control, and optimization of network traffic.

Key benefits of SD-WAN overlay 

1. Enhanced Performance and Reliability:

SD-WAN overlay leverages multiple network paths to distribute traffic intelligently, ensuring optimal performance and reliability. By dynamically routing traffic based on real-time conditions, businesses can overcome network congestion, reduce latency, and maximize application performance. This capability is particularly crucial for organizations with distributed branch offices or remote workers, as it enables seamless connectivity and productivity.

2. Cost Efficiency and Scalability:

Traditional WAN architectures can be expensive to implement and maintain, especially when organizations need to expand their network footprint. SD-WAN overlay offers a cost-effective alternative by utilizing existing infrastructure and incorporating affordable broadband connections. With centralized management and simplified configuration, scaling the network becomes a breeze, allowing businesses to adapt quickly to changing demands without breaking the bank.

3. Improved Security and Compliance:

In an era of increasing cybersecurity threats, protecting sensitive data and ensuring regulatory compliance are paramount. SD-WAN overlay incorporates advanced security features to safeguard network traffic, including encryption, authentication, and threat detection. Businesses can effectively mitigate risks, maintain data integrity, and comply with industry regulations by segmenting network traffic and applying granular security policies.

4. Streamlined Network Management:

Managing a complex network infrastructure can be a daunting task. SD-WAN overlay simplifies network management with centralized control and visibility, enabling administrators to monitor and manage the entire network from a single pane of glass. This level of control allows for faster troubleshooting, policy enforcement, and network optimization, resulting in improved operational efficiency and reduced downtime.

5. Agility and Flexibility:

In today’s fast-paced business environment, agility is critical to staying competitive. SD-WAN overlay empowers organizations to adapt rapidly to changing business needs by providing the flexibility to integrate new technologies and services seamlessly. Whether adding new branch locations, integrating cloud applications, or adopting emerging technologies like IoT, SD-WAN overlay offers businesses the agility to stay ahead of the curve.

Implementation of SD-WAN Overlay:

Implementing SD-WAN overlay requires careful planning and consideration. The following steps outline a typical implementation process:

1. Assess Network Requirements: Evaluate existing network infrastructure, bandwidth requirements, and application performance needs to determine the most suitable SD-WAN overlay solution.

2. Design and Architecture: Create a network design incorporating SD-WAN overlay while considering factors such as branch office connectivity, data center integration, and security requirements.

3. Vendor Selection: Choose a reliable and reputable SD-WAN overlay vendor based on their technology, features, support, and scalability.

4. Deployment and Configuration: Install the required hardware or virtual appliances and configure the SD-WAN overlay solution according to the network design. This includes setting up policies, traffic routing, and security parameters.

5. Testing and Optimization: Thoroughly test the SD-WAN overlay solution, ensuring its compatibility with existing applications and network infrastructure. Optimize the solution based on performance metrics and user feedback.

Conclusion: SD-WAN overlay is a game-changer for businesses seeking to optimize their network infrastructure. By enhancing performance, reducing costs, improving security, streamlining management, and enabling agility, SD-WAN overlay unlocks the true potential of connectivity. Embracing this technology allows organizations to embrace digital transformation, drive innovation, and gain a competitive edge in the digital era. In an ever-evolving business landscape, SD-WAN overlay is the key to unlocking new growth opportunities and future-proofing your network infrastructure.

SD-WAN topology

SD WAN | SD WAN Tutorial

SD WAN Tutorial

In the ever-evolving landscape of networking technology, SD-WAN has emerged as a powerful solution that revolutionizes the way businesses connect and operate. This blog post delves into the world of SD-WAN, exploring its key features, benefits, and the impact it has on modern networks.

SD-WAN, which stands for Software-Defined Wide Area Networking, is a technology that simplifies the management and operation of a wide area network. By separating the network hardware from its control mechanism, SD-WAN enables businesses to have more flexibility and control over their network infrastructure. Unlike traditional WAN setups, SD-WAN utilizes software to intelligently route traffic across multiple connection types, optimizing performance and enhancing security.

One of the fundamental features of SD-WAN is its ability to provide centralized network management. This means that network administrators can easily configure and monitor the entire network from a single interface, streamlining operations and reducing complexity. Additionally, SD-WAN offers dynamic path selection, allowing traffic to be routed based on real-time conditions, such as latency, congestion, and link availability. This dynamic routing capability ensures optimal performance and resilience.

Another significant benefit of SD-WAN is its ability to support multiple connection types, including MPLS, broadband, and cellular networks. This enhances network reliability and scalability, as businesses can leverage multiple connections to avoid single points of failure and accommodate growing bandwidth demands. Moreover, SD-WAN solutions often incorporate advanced security features, such as encryption and segmentation, ensuring data integrity and protecting against cyber threats.

SD-WAN has had a profound impact on modern networks, empowering businesses with greater agility and cost-efficiency. With the rise of cloud computing and the increasing adoption of SaaS applications, traditional network architectures were often unable to provide the necessary performance and reliability. SD-WAN addresses these challenges by enabling direct and secure access to cloud resources, eliminating the need for backhauling traffic to a central data center.

Furthermore, SD-WAN enhances network visibility and control, allowing businesses to prioritize critical applications, apply Quality of Service (QoS) policies, and optimize bandwidth utilization. This level of granular control ensures that essential business applications receive the required resources, enhancing user experience and productivity. Additionally, SD-WAN simplifies network deployments, making it easier for organizations to expand their networks, open new branches, and integrate acquisitions seamlessly.

SD-WAN represents a significant evolution in networking technology, offering businesses a comprehensive solution to modern connectivity challenges. With its centralized management, dynamic path selection, and support for multiple connection types, SD-WAN empowers organizations to build robust, secure, and agile networks. As businesses continue to embrace digital transformation, SD-WAN is poised to play a pivotal role in shaping the future of networking.

Highlights: SD WAN Tutorial

Network Abstraction

In an era where digital transformation is no longer a luxury but a necessity, businesses are constantly looking for ways to optimize their network infrastructure. Enter Software-Defined Wide Area Networking (SD-WAN), a revolutionary technology that is transforming how organizations approach WAN architecture. SD-WAN is not just a buzzword; it is a robust solution designed to simplify the management and operation of a WAN by decoupling the networking hardware from its control mechanism.

### The Basics of SD-WAN Technology

At its core, SD-WAN is a virtualized WAN architecture that allows enterprises to leverage any combination of transport services, including MPLS, LTE, and broadband internet services, to securely connect users to applications.

Unlike traditional WANs, which require proprietary hardware and complex configurations, SD-WAN uses a centralized control function to direct traffic across the WAN, increasing application performance and delivering a high-quality user experience. This separation of the data plane from the control plane is what makes SD-WAN a game-changer in the world of networking.

### WAN Virtualization: The Heart of SD-WAN

WAN virtualization is a critical component of SD-WAN. It abstracts the underlying network infrastructure, creating a virtual overlay that provides seamless connectivity across different network types. This virtualization enables businesses to manage network traffic more effectively, prioritize critical applications, and ensure reliable performance irrespective of the physical network.

With WAN virtualization, businesses can rapidly deploy new applications and services, respond to changes in network conditions, and optimize bandwidth usage without costly hardware upgrades.

### Benefits of Adopting SD-WAN

The benefits of adopting SD-WAN are manifold. Firstly, it offers cost savings by reducing the need for expensive MPLS circuits and allowing the use of more cost-effective broadband connections. Secondly, it provides enhanced security through integrated encryption and advanced threat protection. Additionally, SD-WAN simplifies network management by providing a centralized dashboard that offers visibility into network traffic and performance. This simplification leads to improved agility, allowing IT teams to deploy and manage applications with greater speed and efficiency.

### The Future of Networking with SD-WAN

As businesses continue to embrace cloud services and remote work becomes increasingly prevalent, the demand for flexible, scalable, and secure networking solutions will only grow. SD-WAN is well-positioned to meet these demands, offering a future-proof solution that can adapt to the ever-changing landscape of enterprise networking. With its ability to integrate with cloud platforms and support IoT deployments, SD-WAN is paving the way for the next generation of network connectivity.

Picture This: Personal Note – 

Now imagine these virtual WANs individually holding a single application running over the WAN, but consider them end-to-end instead of being in one location, i.e., on a server. The individual WAN runs to the cloud or enterprise location, having secure, isolated paths with different policies and topologies. Wide Area Network (WAN) virtualization is an emerging technology revolutionizing how networks are designed and managed.

Note: WAN virtualization allows for decoupling the physical network infrastructure from the logical network, enabling the same physical infrastructure for multiple logical networks. It allows organizations to utilize a single physical infrastructure to create multiple virtual networks, each with unique characteristics. WAN virtualization is a core requirement enabling SD-WAN.

SD WAN Overlay & Underlay Design 

This SD-WAN tutorial will address the SD-WAN vendors’ approach to an underlay and an overlay, including the SD-WAN requirements. The underlay consists of the physical or virtual infrastructure; the overlay is the SD-WAN overlay network to which the applications are mapped.

SD-WAN solutions are designed to provide secure, reliable, and high-performance connectivity across multiple locations and networks. Organizations can manage their network configurations, policies, and security infrastructure with SD-WAN.

In addition, SD-WAN solutions can be deployed over any type of existing WAN infrastructure, such as MPLS, Frame Relay, and more. SD-WAN offers enhanced security features like encryption, authentication, and access control. This ensures that data is secure and confidential and that only authorized users can access the network.

Example Technology: GRE Overlay with IPsec

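As a rough illustration of the encapsulation step, the following Python sketch uses scapy (a third-party packet-crafting library, assumed installed) to build a GRE packet: an inner private IP packet wrapped in an outer IP/GRE header between tunnel endpoints. In practice, this GRE traffic would then be protected by IPsec (for example, ESP), which is not shown here; all addresses are examples.

```python
# Sketch of GRE encapsulation with scapy (third-party library; assumed installed).
# Only the GRE overlay step is shown; IPsec protection of the GRE packet is out of scope.
from scapy.all import IP, ICMP, GRE

inner = IP(src="10.0.0.1", dst="10.0.1.1") / ICMP()          # private overlay traffic
outer = IP(src="203.0.113.1", dst="198.51.100.1") / GRE()    # tunnel endpoints (underlay)
gre_packet = outer / inner

gre_packet.show()   # display the layered headers: outer IP / GRE / inner IP / ICMP
```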

Google SD WAN Cloud Hub

The integration of SD-WAN with Google Cloud takes connectivity to new heights. By deploying an SD-WAN Cloud Hub, businesses can seamlessly connect their branch networks to the cloud, leveraging the power of Google Cloud’s infrastructure.

This enables organizations to optimize network performance, reduce latency, and enhance overall user experience. The centralized management capabilities of SD-WAN further simplify network operations, allowing businesses to efficiently control traffic routing, prioritize critical applications, and ensure maximum uptime.

Seamless Integration

One of the standout aspects of SD-WAN Cloud Hub is its seamless integration with Google Cloud. Organizations can extend their on-premises network to Google Cloud, enabling them to leverage Google Cloud’s extensive services and resources. This integration empowers businesses to adopt a hybrid cloud strategy, seamlessly connecting their on-premises infrastructure with the scalability and flexibility of Google Cloud.

SD-WAN Enables

A) Performance-Based Routing

Performance-based routing is a dynamic routing technique that selects the best path for data transmission based on real-time performance metrics. Unlike traditional routing protocols that solely consider static factors like hop count, performance-based routing considers parameters such as latency, packet loss, and bandwidth availability.

-Enhanced Network Performance: Performance-based routing minimizes latency and packet loss by dynamically selecting the optimal path, improving overall network performance. This leads to faster data transfer speeds and better user experiences for applications and services.

-Efficient Bandwidth Utilization: Performance-based routing intelligently allocates network resources by diverting traffic to less congested routes. This ensures that available bandwidth is utilized optimally, preventing bottlenecks and congestion in the network.

-Redundancy and Failover: Another advantage of performance-based routing is its ability to provide built-in redundancy and failover mechanisms. By constantly monitoring performance metrics, it can automatically reroute traffic when a network link or node fails, ensuring uninterrupted connectivity.
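A toy scoring function helps show how these metrics might be combined into a single path decision. The weights below are assumptions for the sketch, not a standard; a real implementation would also apply hysteresis so traffic doesn't flap between paths.

```python
# Illustrative path-scoring function for performance-based routing.
# The weights and sample metrics are assumptions, not a standard.
def path_score(latency_ms: float, loss_pct: float, jitter_ms: float) -> float:
    """Lower is better: combine real-time metrics into one comparable number."""
    return latency_ms + 50 * loss_pct + 2 * jitter_ms

paths = {
    "mpls":      path_score(latency_ms=40, loss_pct=0.1, jitter_ms=2),
    "broadband": path_score(latency_ms=25, loss_pct=1.5, jitter_ms=8),
    "lte":       path_score(latency_ms=60, loss_pct=0.5, jitter_ms=12),
}
best = min(paths, key=paths.get)
print(best, paths)  # traffic is steered to the lowest-scoring (best) path
```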

B) Understanding DMVPN Phase 3

DMVPN Phase 3 is an advanced networking solution that provides scalable and efficient connectivity for organizations with distributed networks. Unlike its predecessors, DMVPN Phase 3 introduces the concept of Spokes connecting directly with each other, eliminating the need for traffic to traverse through the Hub. This dynamic spoke-to-spoke tunneling architecture enhances network performance and reduces latency, making it an ideal choice for modern network infrastructures.

DMVPN Phase 3 offers many advantages for organizations seeking streamlined network connectivity. First, it provides enhanced scalability, allowing for easy addition or removal of spokes without impacting the overall network. Additionally, direct spoke-to-spoke communication reduces the dependency on the Hub, resulting in improved network resiliency and reduced bandwidth consumption. Moreover, DMVPN Phase 3 supports dynamic routing protocols, enabling efficient and automated network management.

C) Securing DMVPN with IPSec

DMVPN is a Cisco proprietary solution that simplifies the deployment of VPN networks, offering scalability, flexibility, and ease of management. It utilizes multipoint GRE tunnels to establish secure connections between multiple sites, creating a virtual mesh network. This architecture eliminates the need for point-to-point tunnels between every site, reducing overhead and enhancing scalability.

IPsec, short for Internet Protocol Security, is a widely used protocol suite that provides secure communication over IP networks. With features like authentication, encryption, and data integrity, IPsec ensures confidentiality and integrity of data transmitted over the network. When combined with DMVPN, IPsec adds an additional layer of security to the virtual network, safeguarding sensitive information from unauthorized access.

DMVPN over IPsec offers numerous advantages for organizations. Firstly, it enables dynamic and on-demand connectivity, adding new sites seamlessly without manual configuration. This scalability reduces administrative overhead and streamlines network expansion. Secondly, DMVPN over IPsec provides enhanced security, ensuring that data remains confidential and protected from potential threats. Lastly, it improves network performance by leveraging multipoint connectivity, optimizing bandwidth usage, and reducing latency.

Example Product: Cisco Meraki

### Centralized Management

One of the standout features of the Cisco Meraki platform is its centralized management system. Gone are the days of configuring each device individually. With Meraki, all your network devices can be managed from a single, intuitive dashboard. This not only simplifies the administration process but also ensures that your network remains consistent and secure. The centralized dashboard provides real-time monitoring, configuration, and troubleshooting capabilities, allowing IT administrators to manage their entire network with ease and efficiency.

### Robust Security Features

Security is a top priority for any network, and Cisco Meraki excels in this area. The platform offers a comprehensive suite of security features designed to protect your network from a wide range of threats. Built-in firewall, intrusion detection, and prevention systems work seamlessly to safeguard your data. Additionally, Meraki’s advanced malware protection and content filtering ensure that harmful content is kept at bay. With automatic firmware updates, your network is always protected against the latest vulnerabilities, giving you peace of mind.

### Unparalleled Scalability

As your business grows, so does your network. Cisco Meraki is designed to scale effortlessly with your organization. Whether you are managing a small business or a large enterprise, Meraki’s cloud-based architecture allows you to add new devices and locations without the need for complex configurations or costly hardware investments. The platform supports a wide range of devices, including switches, routers, access points, and security cameras, all of which can be easily integrated into your existing network.

### Seamless Integration and Automation

Integration and automation are key to streamlining network management, and Cisco Meraki shines in these areas. The platform supports API integrations, allowing you to connect with third-party applications and services. This opens up a world of possibilities for automating routine tasks, such as device provisioning, network monitoring, and reporting. By leveraging these capabilities, businesses can reduce manual workload, minimize errors, and improve overall operational efficiency.

### Enhanced User Experience

User experience is at the heart of the Cisco Meraki platform. The user-friendly dashboard is designed with simplicity in mind, making it accessible even to those with limited technical expertise. Detailed analytics and reporting tools provide valuable insights into network performance, helping administrators make informed decisions. Additionally, the platform’s mobile app allows for on-the-go management, ensuring that your network is always within reach.

Related: Before you proceed, you may find the following posts helpful for pre-information:

  1. SD WAN Security 
  2. WAN Monitoring
  3. Zero Trust SASE
  4. Forwarding Routing Protocols

SD WAN Tutorial

Transition: The era of client-server

The design for the WAN and branch sites was conceived in the client-server era. At that time, the WAN design satisfied the applications’ needs: applications and data resided behind the central firewall in the on-premises data center. Today, we are in a different space, with hybrid IT and multi-cloud designs making applications and data distributed. Data is now omnipresent. The type of WAN and branch design that originated in the client-server era was not built for cloud applications.

Hub and spoke designs:

The “hub and spoke” model was designed for client/server environments where almost all of an organization’s data and applications resided in the data center (i.e., the hub location) and were accessed by workers in branch locations (i.e., the spokes).  Internet traffic would enter the enterprise through a single ingress/egress point, typically into the data center, which would then pass through the hub and to the users in branch offices.

Push to the Cloud:

The birth of the cloud resulted in a significant shift in how we consume applications, traffic types, and network topology. There was a big push to the cloud, and almost everything was offered as a SaaS. In addition, the cloud era changed traffic patterns, as traffic goes directly to the cloud from the branch site and doesn’t need to be backhauled to the on-premise data center.

**Challenge: Hub and spoke design**

The hub and spoke model needs to be updated. Because the model is centralized, day-to-day operations may be relatively inflexible, and changes at the hub, even in a single route, may have unexpected consequences throughout the network. It may be difficult or even impossible to handle occasional periods of high demand between two spokes.

The result of the cloud acceleration is that the best access point is not always in the central location. Why would branch sites direct all internet-bound traffic to the central HQ, causing traffic tromboning and adding latency, when it can go directly to the cloud? The hub-and-spoke design is no longer an efficient topology for cloud-based applications.

**Active/Active and Active/Passive**

Historically, WANs were built on “active-passive,” where a branch can be connected using two or more links, but only the primary link is active and passing traffic. In this scenario, the backup connection only becomes active if the primary connection fails. While this might seem sensible, it is not an efficient use of the available bandwidth.

Interest in active-active routing protocols has always existed, but it was challenging to configure and expensive to implement. In addition, active/active designs with traditional routing protocols are complex to design, inflexible, and a nightmare to troubleshoot.

Convergence & Application Performance:

Convergence and application performance problems can arise from active-active WAN edge designs. For example, with active-active, packets that reach the other end could arrive out of order because each link propagates at a different speed. The remote end then has to reassemble and reorder them, resulting in additional jitter and delay. Both high jitter and delay are bad for network performance.

Packet reordering:

The issues arising from active-active are often known as spray and pray. It increases bandwidth but decreases goodput. Spraying packets down both links can result in 20% drops or packet reordering. There will also be firewall issues, as firewalls may see asymmetric routes.

Diagram: TCP out-of-order packets. Source: F5.

Key SD-WAN Requirements 

1: SD-WAN requirement and active-active paths

For an active-active design, the solution must be aware of the application session and be designed to eliminate asymmetric routing. In addition, you need to slice up the WAN so application flows can work efficiently over either link. SD-WAN does this. WAN designs can also be active-standby, which requires routing protocol convergence in the event of a primary link failure.

Routing Protocol Convergence

Unfortunately, routing protocols are known to converge slowly. The emergence of SD-WAN technologies with multi-path capabilities combined with the ubiquity of broadband has made active-active highly attractive and something any business can deploy and manage quickly and easily.

An SD-WAN solution enables the creation of virtual overlays that bond multiple underlay links. Virtual overlays allow enterprises to classify and categorize applications based on their unique service-level requirements and provide fast failover should an underlay link experience congestion, a brownout, or an outage.

Example: BFD improving convergence

With traditional routing, regardless of the mechanism used to speed up convergence and failure detection, several convergence steps must still be carried out:

a) Detecting the topology change,

b) Notifying the rest of the network about the change,

c) Calculating the new best path, and

d) Switching to the new best path.
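The arithmetic behind fast failure detection is simple: a neighbor is declared down after a number of consecutive missed hellos. The sketch below contrasts BFD-like timers with default OSPF-style hello/dead timers; the specific values are illustrative.

```python
# Sketch of BFD-style failure detection: a neighbor is declared down after
# `multiplier` consecutive missed hello intervals. Values are illustrative.
def detection_time_ms(tx_interval_ms: int, multiplier: int) -> int:
    """Time until a neighbor is declared down if no control packets arrive."""
    return tx_interval_ms * multiplier

# Aggressive BFD-like timers versus default OSPF-style hello/dead timers.
print(detection_time_ms(tx_interval_ms=50, multiplier=3))      # 150 ms
print(detection_time_ms(tx_interval_ms=10_000, multiplier=4))  # 40,000 ms (40 s)
```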

Traditional WAN protocols route down one path and, by default, have no awareness of what’s happening at the application level. For this reason, there have been many attempts to enhance the WAN’s behavior.

Diagram: Example convergence time with OSPF. Source: INE.

2: SD-WAN requirements: Flexible topologies

For example, using DPI, we can have Voice over IP traffic go over MPLS. Here, the SD-WAN will look at the Real-time Transport Protocol (RTP) and the Session Initiation Protocol (SIP). We can also have less critical applications go to the Internet, while MPLS is used only for specific apps.

As a result, best-effort traffic is pinned to the Internet, and only critical apps get an SLA and go on the MPLS path. Now, the transports are better utilized, and circuits never need to be dormant. With SD-WAN, you use the bandwidth you have available to ensure an optimized experience.

The SD-WAN’s value is that the solution tracks network and path conditions in real time, revealing performance issues as they occur. It then dynamically redirects data traffic to the next available path.

When the network recovers to its normal state, the SD-WAN solution can redirect traffic back to its original path. Therefore, the effects of network degradation, such as brownouts and soft failures, can be minimized.

Diagram: VPN segmentation. Source: Cisco.

3: SD-WAN requirements: Encryption key rotation

Data security has never been a more important consideration than it is today. Therefore, businesses and other organizations must take robust measures to keep data and information safely under lock and key. Encryption keys must be rotated regularly (the standard interval is every 90 days) to reduce the risk of compromised data security.

However, regular VPN-based encryption key rotation can be complicated and disruptive, often requiring downtime. SD-WAN can offer automatic key rotation, allowing network administrators to pre-program rotations without manual intervention or system downtime.
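Below is a minimal sketch of what automated rotation can look like in code, assuming the 90-day interval mentioned above. Key distribution, secure storage, and re-keying of active tunnels are deliberately out of scope.

```python
# Minimal sketch of automated key rotation on a fixed interval (90 days, per the text).
# Key storage and distribution are out of scope for this illustration.
import os
from datetime import datetime, timedelta, timezone

ROTATION_INTERVAL = timedelta(days=90)

class KeyManager:
    def __init__(self):
        self.key = os.urandom(32)                     # 256-bit symmetric key
        self.rotated_at = datetime.now(timezone.utc)

    def rotate_if_due(self) -> bool:
        """Generate a fresh key if the rotation interval has elapsed."""
        if datetime.now(timezone.utc) - self.rotated_at >= ROTATION_INTERVAL:
            self.key = os.urandom(32)
            self.rotated_at = datetime.now(timezone.utc)
            return True
        return False

km = KeyManager()
print(km.rotate_if_due())  # False right after creation; a scheduler would call this periodically
```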

4: SD-WAN requirements: Push to the cloud 

Another critical feature of SD-WAN technology is cloud breakout. This lets you connect branch office users to cloud-hosted applications directly and securely, eliminating the inefficiencies of backhauling cloud-destined traffic through the data center. Given the ever-growing importance of SaaS and IaaS services, efficient and reliable access to the cloud is crucial for many businesses and other organizations. By simplifying how branch traffic is routed, SD-WAN makes setting up breakouts quicker and easier.

**The changing perimeter location**

Users are no longer positioned in one location with corporate-owned static devices. Instead, they are dispersed, and additional latency degrades application performance when they connect to central locations. Applications and network devices can be optimized, but the only real solution is to shorten the path by moving to cloud-based applications. There is a huge push toward cloud-based applications, and most organizations are now moving away from on-premises, in-house hosting to cloud-based management.

**SaaS-based Applications**

The ready-made global footprint of cloud providers enables the use of SaaS-based platforms, which removes the drawback of dispersed users tromboning to a central data center to access applications. Logically positioned cloud platforms are closer to the mobile user, and hosting these applications in the cloud is far more efficient than hosting them centrally and serving them over the public Internet.

5: SD-WAN requirements: Decentralization of traffic

A lot of traffic is now decentralized from the central data center to remote branch sites. Many branches do not run high bandwidth-intensive applications. These types of branch sites are known as light edges. Despite the traffic change, the traditional branch sites rely on hub sites for most security and network services.

The branch sites should connect to the cloud applications directly over the Internet without tromboning traffic to data centers for Internet access or security services. An option should exist to extend the security perimeter into the branch sites without requiring expensive onsite firewalls and IPS/IDS. SD-WAN builds a dynamic security fabric without the appliance sprawl of multiple security devices and vendors.

**The ability to service chain traffic** 

Service chaining through SD-WAN allows organizations to reroute their data traffic through one or more services, including intrusion detection and prevention devices or cloud-based security services. It thereby enables firms to declutter their branch office networks.

After all, they can automate handling particular types of traffic flows and assemble connected network services into a single chain.

6: SD-WAN requirements: Bandwidth-intensive applications 

Exponential growth in demand for high-bandwidth applications such as multimedia in cellular networks has triggered the need for new technologies capable of providing high-bandwidth, reliable links in wireless environments. Video streaming is the biggest user of internet bandwidth, accounting for more than half of total global traffic. The Cartesian study confirms the historical trend: consumer usage remains highly asymmetric, with video streaming the most popular.

**Richer and hungry applications**

Richer applications, multimedia traffic, and growth in the cloud application consumption model drive the need for additional bandwidth. Unfortunately, the congestion leads to packet drops, ultimately degrading application performance and user experience.

SD-WAN offers flexible bandwidth allocation, so you don’t manually allocate bandwidth for specific applications. Instead, SD-WAN allows you to classify applications and specify a particular service level requirement. This way, you can ensure your set-up is better equipped to run smoothly, minimizing the risk of glitchy and delayed performance on an audio conference call.

7: SD-WAN requirements: Organic growth 

We also have organic business growth, a big driver for additional bandwidth requirements. The challenge is that existing network infrastructures are static and struggle to respond to this growth in a reasonable period. The last mile of MPLS locks you in, destroying agility. Circuit lead times impede the organization’s productivity and create an overall lag.

A WAN solution should be simple: to serve the new era of applications, just increase link capacity by buying more bandwidth. In practice, life is more complex. The WAN is an expensive part of the network, and employing link oversubscription to reduce congestion is too costly.

Bandwidth is expensive, and the existing TDM-based MPLS architectures struggle to meet new application requirements. Feature-rich MPLS is costly for relatively low bandwidth, and you will need more bandwidth, not less, to beat latency.

On the more traditional side, MPLS and private ethernet lines (EPLs) can range in cost from $700 to $10,000 per month, depending on bandwidth size and distance of the link itself. Some enterprises must also account for redundancies at each site as uptime for higher-priority sites comes into play. Cost becomes exponential when you have a large number of sites to deploy.

8: SD-WAN requirements: Limitations of protocols 

We already mentioned some problems with routing protocols, but leaving IPsec at its defaults also raises challenges. The IPsec architecture is point-to-point, not site-to-site. Therefore, it does not natively support redundant uplinks. Complex configurations and potentially additional protocols are required when sites have multiple uplinks to multiple providers.

Left at its defaults, IPsec is not abstracted, and one session cannot be sent over various uplinks. This causes challenges with transport failover and path selection. Secure tunnels should be brought up and torn down immediately, and new sites should be incorporated into a secure overlay without much delay or manual intervention.

9: SD-WAN requirements: Internet of Things (IoT) 

As millions of IoT devices come online, how do we further segment and secure this traffic without complicating the network design? Many dumb IoT devices will require communication with the IoT platform in a remote location. Therefore, will there be increased signaling traffic over the WAN? 

Security and bandwidth consumption are vital issues concerning the introduction of IP-enabled objects. Although encryption is a great way to prevent hackers from accessing data, it is also one of the leading IoT security challenges.

Most IoT devices lack the storage and processing capabilities found on a traditional computer. The result is increased attacks where hackers can manipulate the algorithms designed for protection. Also, weak credentials and login details leave nearly all IoT devices vulnerable to password hacking and brute force. Any company that uses factory default credentials on its devices places its business, assets, customers, and valuable information at risk of a brute-force attack.

10: SD-WAN requirements: Visibility

Many service provider challenges stem from a lack of visibility into customer traffic. The lack of granular traffic profiles leads to expensive over-provisioning of bandwidth and link resilience. In addition, upgrades at both the packet and optical layers often require complete traffic visibility to be justified.

In case of an unexpected traffic spike, many networks are left running at half capacity. As a result, much money is spent on link underutilization that should be spent on innovation. This combination of underutilization and oversubscription is due to a lack of visibility.

**SD-WAN Use Case**

DMVPN: Exploring Single Hub Dual Cloud Architecture

Single Hub Dual Cloud architecture takes the traditional DMVPN setup to the next level. Instead of relying on a single cloud (service provider) for connectivity, this configuration utilizes two separate clouds, providing redundancy and improved performance. The hub device is the central point of contact for all remote sites, ensuring seamless communication between them.

1. Enhanced Redundancy: The Single Hub Dual Cloud configuration offers built-in redundancy with two separate clouds. If one cloud experiences downtime or connectivity issues, the network seamlessly switches to the alternate cloud, ensuring uninterrupted communication.

2. Improved Performance: By distributing the load across two clouds, Single Hub Dual Cloud architecture can handle higher traffic volumes efficiently. This leads to improved network performance and reduced latency for end-users.

3. Scalability: This architecture allows for easy scalability as new sites can be seamlessly added to the network without disrupting the existing infrastructure. The hub device manages the routing and connectivity, simplifying network management and reducing administrative overhead.

Summary: SD WAN Tutorial

SD-WAN, or Software-Defined Wide Area Networks, has emerged as a game-changing technology in the realm of networking. This tutorial delved into SD-WAN fundamentals, its benefits, and how it revolutionizes traditional WAN infrastructures.

Understanding SD-WAN

SD-WAN is an innovative approach to networking that simplifies the management and operation of a wide area network. It utilizes software-defined principles to abstract the underlying network infrastructure and provide centralized control, visibility, and policy-based management.

Key Features and Benefits

One of the critical features of SD-WAN is its ability to optimize network performance by intelligently routing traffic over multiple paths, including MPLS, broadband, and LTE. This enables organizations to leverage cost-effective internet connections without compromising performance or reliability. Additionally, SD-WAN offers enhanced security measures, such as encrypted tunneling and integrated firewall capabilities.

Deployment and Implementation

Implementing SD-WAN requires careful planning and consideration. This section will explore the different deployment models, including on-premises, cloud-based, and hybrid approaches. We will discuss the necessary steps in deploying SD-WAN, from initial assessment and design to configuration and ongoing management.

Use Cases and Real-World Examples

SD-WAN has gained traction across various industries due to its versatility and cost-saving potential. This section will showcase notable use cases, such as retail, healthcare, and remote office connectivity, highlighting the benefits and outcomes of SD-WAN implementation. Real-world examples will provide practical insights into the transformative capabilities of SD-WAN.

Future Trends and Considerations

As technology continues to evolve, staying updated on the latest trends and considerations in the SD-WAN landscape is crucial. This section will explore emerging concepts, such as AI-driven SD-WAN and integrating SD-WAN with edge computing and IoT technologies. Understanding these trends will help organizations stay ahead in the ever-evolving networking realm.

Conclusion:

In conclusion, SD-WAN represents a paradigm shift in how wide area networks are designed and managed. Its ability to optimize performance, ensure security, and reduce costs has made it an attractive solution for organizations of all sizes. By understanding the fundamentals, exploring deployment options, and staying informed about the latest trends, businesses can leverage SD-WAN to unlock new possibilities and drive digital transformation.

zero trust network design

Zero Trust Network Design

Zero Trust Network Design

In today's interconnected world, where data breaches and cyber threats have become commonplace, traditional perimeter defenses are no longer enough to protect sensitive information. Enter Zero Trust Network Design, a security approach that prioritizes data protection by assuming that every user and device, inside or outside the network, is a potential threat. In this blog post, we will explore the Zero Trust Network Design concept, its principles, and its benefits in securing the modern digital landscape.

Zero trust network design is a security concept that focuses on reducing the attack surface of an organization’s network. It is based on the assumption that users and systems inside a network are untrusted, and therefore, all traffic is considered untrusted and must be verified before access is granted. This contrasts traditional networks, which often rely on perimeter-based security to protect against external threats.

Key Points:

-Identity and Access Management (IAM): IAM plays a vital role in Zero Trust by ensuring that only authenticated and authorized users gain access to specific resources. Multi-factor authentication (MFA) and strong password policies are integral to this component.

-Network Segmentation: Zero Trust advocates for segmenting the network into smaller, more manageable zones. This helps contain potential breaches and restricts lateral movement within the network.

-Continuous Monitoring and Analytics: Real-time monitoring and analysis of network traffic, user behavior, and system logs are essential for detecting any anomalies or potential security breaches.

-Enhanced Security: By adopting a Zero Trust approach, organizations significantly reduce the risk of unauthorized access and lateral movement within their networks, making it harder for cyber attackers to exploit vulnerabilities.

-Improved Compliance: Zero Trust aligns with various regulatory and compliance requirements, providing organizations with a structured framework to ensure data protection and privacy.

-Greater Flexibility: Zero Trust allows organizations to embrace modern workplace practices, such as remote work and BYOD (Bring Your Own Device), without compromising security. Users can securely access resources from anywhere, anytime.

Implementing Zero Trust requires a well-defined strategy and careful planning. Here are some key steps to consider:

1. Assess Current Security Infrastructure: Conduct a thorough assessment of existing security measures, identify vulnerabilities, and evaluate the readiness for Zero Trust implementation.

2. Define Trust Boundaries: Determine the trust boundaries within the network and establish access policies accordingly. Consider factors like user roles, device types, and resource sensitivity.

3. Choose the Right Technologies: Select security solutions and tools that align with your organization's needs and objectives. These may include next-generation firewalls, secure web gateways, and identity management systems.

Highlights: Zero Trust Network Design

**Understanding Zero Trust**

Zero trust is a security concept that challenges the traditional perimeter-based network security model. It operates on the principle of never trusting any user or device, regardless of their location or network connection. Instead, it continuously verifies and authenticates every user and device attempting to access network resources.

Key Points:

A – Certain principles must be followed to implement a zero-trust network design successfully. One crucial principle is the principle of least privilege, where users and devices are granted only the necessary access to perform their tasks. Another principle is continuously monitoring and assessing all network traffic, ensuring that any anomalies or suspicious activities are detected and responded to promptly.

B – Implementing a zero-trust network design requires careful planning and consideration. It involves a combination of technological solutions, such as multi-factor authentication, network segmentation, encryption, and granular access controls. Additionally, organizations must establish comprehensive policies and procedures to govern user access, device management, and incident response.

C – Zero trust network design offers several benefits to organizations. Firstly, it enhances overall security posture by minimizing the attack surface and preventing lateral movement within the network. Secondly, it provides granular control over network resources, ensuring that only authorized users and devices can access sensitive data. Lastly, it simplifies compliance efforts by enforcing strict access controls and maintaining detailed audit logs.

“Never Trust, Always Verify”

D – The core concept of zero-trust network design and segmentation is never to trust, always verify. This means that all traffic, regardless of its origin, must be verified before access is granted. This is achieved through layered security controls, including authentication, authorization, encryption, and monitoring.

E – Authentication verifies users’ and devices’ identities before allowing access to resources. Authorization determines what resources a user or device is allowed to access. Encryption protects data in transit and at rest. Monitoring detects threats and suspicious activity.
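A small, hypothetical sketch ties these controls together: every request is authenticated, checked against an explicit least-privilege policy, and logged for monitoring. The users, resources, and policy data are made-up examples.

```python
# Illustrative "never trust, always verify" access check: authenticate,
# authorize against an explicit policy, and log the decision. Data is hypothetical.
import logging

logging.basicConfig(level=logging.INFO)

POLICY = {  # user -> resources they are explicitly allowed to reach
    "alice": {"crm", "erp"},
    "bob":   {"wiki"},
}

def access_request(user: str, authenticated: bool, resource: str) -> bool:
    """Deny by default; allow only verified identities with an explicit grant."""
    allowed = authenticated and resource in POLICY.get(user, set())
    logging.info("user=%s resource=%s allowed=%s", user, resource, allowed)
    return allowed

print(access_request("alice", authenticated=True, resource="erp"))    # True
print(access_request("bob", authenticated=True, resource="erp"))      # False: no grant
print(access_request("alice", authenticated=False, resource="erp"))   # False: not verified
```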

**Zero Trust Network Segmentation**

Zero-trust network design, including segmentation, is becoming increasingly popular as organizations move away from perimeter-based security. By verifying all traffic rather than relying on perimeter-based security, organizations can reduce their attack surface and improve their overall security posture. Segmentation can work at different layers of the OSI Model.

**Scanning Networks: Securing Networks**

Endpoint security refers to the protection of devices (endpoints) that have access to a network. These devices, which include laptops, smartphones, and servers, are often targeted by cybercriminals seeking unauthorized access, data breaches, or system disruptions. Businesses and individuals can fortify their digital realms against threats by implementing robust endpoint security measures.

Address Resolution Protocol (ARP):

ARP (Address Resolution Protocol) plays a vital role in establishing communication between devices within a network. It maps an IP address to a physical (MAC) address, allowing data transmission between devices. However, cyber attackers can exploit ARP to launch attacks, such as ARP spoofing, compromising network security. Understanding ARP and implementing countermeasures is crucial for adequate endpoint security.

The Role of Routing:

Routing is the process of forwarding network traffic between different networks. Secure routing protocols and practices are essential to prevent unauthorized access and ensure data integrity. By implementing secure routing mechanisms, organizations can establish trusted paths for data transmission, reducing the risk of data breaches and unauthorized network access.

Note: Netstat: Netstat, a command-line tool, provides valuable insights into network connections, active ports, and listening services. By utilizing Netstat, network administrators can identify suspicious connections, potential malware infections, or unauthorized access attempts. Regularly monitoring and analyzing Netstat outputs can aid in maintaining a secure network environment.
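For a programmatic equivalent, the sketch below uses psutil (a third-party Python library, assumed installed) to list listening sockets, much like a filtered netstat output. Depending on the operating system, enumerating other processes' sockets may require elevated privileges.

```python
# Sketch: list listening sockets with psutil (third-party library; assumed installed),
# similar to a filtered netstat output. May need elevated privileges on some systems.
import psutil

for conn in psutil.net_connections(kind="inet"):
    if conn.status == psutil.CONN_LISTEN:
        ip, port = conn.laddr
        print(f"listening on {ip}:{port} (pid={conn.pid})")
```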

Zero Trust Connectivity: NCC

### What is Google’s Network Connectivity Center?

Google’s Network Connectivity Center is a centralized platform that simplifies the management of hybrid and multi-cloud networks. It provides organizations with a unified view of their network, enabling them to connect, secure, and manage their infrastructure with ease. By leveraging Google’s global network, NCC ensures high availability, low latency, and optimized performance.

#### Unified Network Management

NCC offers a single pane of glass for managing all network connections, whether they are on-premises, in the cloud, or across different cloud providers. This unified approach reduces complexity and streamlines operations, making it easier for IT teams to maintain a cohesive network architecture.

#### Advanced Security Measures

Security is a core component of NCC. It integrates seamlessly with Google’s security services, providing advanced threat protection, encryption, and compliance monitoring. This ensures that data remains secure as it traverses the network, adhering to the principles of Zero Trust.

#### Scalability and Flexibility

One of the standout features of NCC is its scalability. Organizations can easily scale their network infrastructure to accommodate growth and changing business needs. Whether expanding to new regions or integrating additional cloud services, NCC offers the flexibility to adapt without compromising performance or security.

Zero Trust Connectivity: Private Service Connect

### What is Private Service Connect?

Private Service Connect is a feature offered by Google Cloud that allows users to securely connect services across different VPC networks. It leverages private IPs to ensure that data does not traverse the public internet, reducing the risk of exposure to potential threats. This service is particularly useful for organizations looking to maintain a high level of security while ensuring seamless connectivity between their cloud-based services.

### The Role of Zero Trust in Private Service Connect

Zero trust is a security framework that operates on the principle of “never trust, always verify.” It assumes that threats can come from both inside and outside the network. Private Service Connect embodies this principle by ensuring that services are only accessible to authorized users and devices. By integrating zero trust into its framework, Private Service Connect provides an additional layer of security, ensuring that data and services remain protected.


Network Policies: GKE 

**Understanding the Basics of Network Policy**

Network policies in GKE are akin to firewall rules that control the traffic flow between pods, effectively determining which pods can communicate with each other. These policies are essential for isolating applications, segmenting traffic, and protecting sensitive data. In essence, network policies provide a framework for defining how groups of pods can interact, allowing for fine-grained control over network communication.

**Implementing Zero Trust Network Design with GKE**

Zero trust network design is a security model that operates on the principle of “never trust, always verify.” In the context of GKE, this means that no pod should be able to communicate with another pod without explicit permission. Implementing zero trust in GKE involves carefully crafting network policies to ensure that only the necessary communication paths are open. This approach minimizes the risk of unauthorized access and lateral movement within the cluster, enhancing the overall security posture.

**Best Practices for Configuring Network Policies**

When configuring network policies in GKE, there are several best practices to consider. First, start by defining default deny policies to block all traffic by default, then incrementally add specific allow policies as required. It’s also important to regularly review and update these policies to reflect changes in the application architecture. Additionally, leveraging tools like Kubernetes Network Policy API can simplify the management and enforcement of these policies.
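To make this concrete, here is a minimal sketch of the default-deny-then-allow pattern described above, written against the standard Kubernetes NetworkPolicy API. The namespace, labels, and port are hypothetical placeholders, not values from any particular cluster.

```yaml
# Deny all ingress and egress for every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod          # hypothetical namespace
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Then explicitly allow only the frontend pods to reach the backend on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: backend         # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Applied together, these two policies give the zero trust behavior described above: nothing talks to anything until a specific allow rule is added.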


Zero Trust Google Cloud IAM

## Understanding the Basics

At its core, Google Cloud IAM allows you to define roles and permissions that determine what actions users can take with your resources. It’s a comprehensive tool that helps you manage access to Google Cloud services with precision. By assigning roles based on the principle of least privilege, you ensure that users have only the permissions they need to perform their jobs, minimizing potential security risks.
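As an illustration of least privilege in practice, the following is a minimal, hypothetical IAM policy snippet in the YAML shape returned by `gcloud projects get-iam-policy`; the service account, group, and roles are placeholders chosen for the example.

```yaml
# Narrow, purpose-built grants rather than broad roles such as roles/editor.
bindings:
  - role: roles/bigquery.dataViewer        # read-only access to BigQuery data
    members:
      - serviceAccount:reporting@my-project.iam.gserviceaccount.com
  - role: roles/logging.viewer             # read-only access to logs
    members:
      - group:sec-ops@example.com
version: 1
```

Each binding grants exactly one narrowly scoped role to one principal, which keeps the blast radius small if a credential is ever compromised.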

## Zero Trust Network Design

Incorporating a zero trust network design with Google Cloud IAM is an effective way to bolster security. Unlike traditional security models that rely heavily on perimeter defenses, zero trust assumes that threats could be both outside and inside the network. This approach requires strict identity verification for every person and device trying to access resources. By integrating zero trust principles, organizations can enhance their security posture and reduce the risk of unauthorized access.

## Advanced Features for Enhanced Security

Google Cloud IAM offers several advanced features that complement a zero trust strategy. These include conditional access based on attributes such as device security status and location, as well as support for multi-factor authentication. Additionally, IAM’s audit logs provide comprehensive visibility into who accessed what, when, and how, allowing for thorough monitoring and quick incident response.


Detecting Authentication Failures in Logs

Understanding Log Analysis

Log analysis is the process of examining log data to extract meaningful insights and identify potential security events. Logs act as a digital trail, capturing valuable information about system activities, user actions, and network traffic. By carefully analyzing logs, security teams can detect anomalies, track user behavior, and uncover potential threats lurking in the shadows.

Syslog is a standard protocol for message logging. It allows various devices and applications to send log messages to a central logging server. Syslog provides a standardized format, making aggregating and analyzing logs from different sources easier. Syslog messages contain essential details such as timestamps, log levels, and source IP addresses, which are crucial for detecting security events.

Auth.log, or the authentication log, is a specific log file that records authentication-related events on Unix-based systems. It includes valuable information about user logins, failed login attempts, and other authentication activities. Analyzing auth.log can help identify brute-force attacks, unauthorized access attempts, and potential security breaches targeting user accounts.

Understanding SELinux

SELinux is a security framework built into the Linux kernel that provides Mandatory Access Control (MAC) policies. Unlike traditional discretionary access control (DAC), which relies on user permissions, SELinux focuses on controlling access based on the security context of processes and resources. This means that even if an attacker gains unauthorized access to a system, SELinux can prevent them from compromising the entire system.

Implementing SELinux

To implement zero trust endpoint security with SELinux, organizations should start by defining security policies that align with their specific needs. These policies should enforce strict access controls, limit privileges, and define fine-grained permissions for processes and resources. By doing so, organizations can ensure that even if an endpoint is compromised, the attacker’s ability to move laterally within the network is significantly restricted.

Zero Trust Networking with Cloud Service Mesh

## What is a Cloud Service Mesh?

At its core, a Cloud Service Mesh is a configurable infrastructure layer for microservices applications that makes communication between service instances flexible, reliable, and observable. It decouples network and security policies from the application code, allowing developers to focus on their core functionality without worrying about the intricacies of service-to-service communication. Essentially, it acts as a dedicated layer for managing service-to-service communications, offering features like load balancing, service discovery, retries, and circuit breaking.

## The Benefits of Implementing a Cloud Service Mesh

Implementing a Cloud Service Mesh offers numerous benefits that streamline operations and enhance security:

1. **Enhanced Observability**: It provides deep insights into service behavior with monitoring and tracing capabilities, helping to quickly identify and resolve issues.

2. **Improved Security**: By enforcing security policies like mutual TLS and fine-grained access control, it ensures secure service-to-service communication.

3. **Resilience and Reliability**: Features like automatic retries, circuit breaking, and load balancing ensure that services remain resilient and available, even in the face of failures.

4. **Operational Simplicity**: By offloading the complexities of service management to the mesh, developers can focus on business logic, speeding up development cycles.

### Cloud Service Mesh and Zero Trust Networks

The concept of Zero Trust Networks (ZTN) revolves around the principle of “never trust, always verify.” In a ZTN, every request, whether it originates inside or outside the network, must be authenticated and authorized. Cloud Service Meshes align perfectly with ZTN principles by providing robust security features:

– **Mutual TLS**: Ensures that all communication between services is encrypted and authenticated.

– **Fine-Grained Policy Control**: Allows administrators to define and enforce policies at a granular level, ensuring that only authorized services can communicate.

Google has been at the forefront of integrating Cloud Service Mesh technology with Zero Trust principles. Their Istio service mesh, for example, offers robust security features that align with Zero Trust guidelines, making it a preferred choice for organizations looking to enhance their security posture.
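As a sketch of how these two controls look in practice, and assuming Istio is the mesh in use, the following resources enforce strict mutual TLS mesh-wide and allow only a named service account to call the backend workload. The namespaces, labels, and service account are hypothetical.

```yaml
# Require mTLS for all workload-to-workload traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Fine-grained policy: only the frontend service account may call the backend.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend-only
  namespace: prod                          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: backend                         # hypothetical label
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/prod/sa/frontend"
```

Any request that cannot present a valid workload identity over mTLS, or that comes from an identity not listed in the policy, is rejected, which is exactly the "never trust, always verify" behavior described above.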

### Google’s Contribution to Cloud Service Mesh

Google has played a significant role in advancing Cloud Service Mesh technology. Their open-source service mesh, Istio, has become a cornerstone in the industry. Istio simplifies service management by providing a uniform way to secure, connect, and monitor microservices. It integrates seamlessly with Kubernetes, making it an ideal choice for cloud-native applications. Google’s emphasis on security, observability, and operational efficiency in Istio reflects their commitment to fostering innovation in cloud technologies.

Example Product: Cisco Secure Workload

### What is Cisco Secure Workload?

Cisco Secure Workload is a comprehensive security solution that provides visibility, micro-segmentation, and workload protection for applications across multi-cloud environments. It leverages advanced analytics and machine learning to identify and mitigate threats, ensuring that your workloads remain secure, whether they are on-premises, in the cloud, or in hybrid environments.

#### 1. Enhanced Visibility

One of the standout features of Cisco Secure Workload is its ability to provide unparalleled visibility into your network. It offers real-time insights into application dependencies, communications, and behaviors, allowing you to detect anomalies and potential threats swiftly.

#### 2. Micro-Segmentation

Micro-segmentation is a critical component of modern security strategies. Cisco Secure Workload enables fine-grained segmentation of workloads, reducing the attack surface and preventing lateral movement of threats within your network. This granular approach to segmentation ensures that even if a threat breaches one segment, it cannot easily spread to others.

#### 3. Automated Policy Enforcement

Maintaining consistent security policies across diverse environments can be challenging. Cisco Secure Workload simplifies this process through automated policy enforcement. By defining security policies centrally, you can ensure they are uniformly applied across all workloads, reducing the risk of misconfigurations and human errors.

### How Cisco Secure Workload Works

#### 1. Data Collection

Cisco Secure Workload starts by collecting data from various sources within your network. This includes telemetry data from workloads, network traffic, and existing security tools. This data is then analyzed to create a comprehensive map of your application environment.

#### 2. Behavior Analysis

Using machine learning and advanced analytics, Cisco Secure Workload analyzes the collected data to identify normal and abnormal behaviors. This analysis helps in detecting potential threats and vulnerabilities that traditional security tools might miss.

#### 3. Threat Detection and Response

Once potential threats are identified, Cisco Secure Workload provides actionable insights and automated responses to mitigate these threats. This proactive approach ensures that your workloads remain protected even as new threats emerge.

### Real-World Applications

#### 1. Financial Services

Financial institutions handle sensitive data and are prime targets for cyberattacks. Cisco Secure Workload helps these organizations secure their workloads, ensuring compliance with regulatory requirements and protecting customer data from breaches.

#### 2. Healthcare

In the healthcare sector, patient data security is of utmost importance. Cisco Secure Workload provides healthcare organizations with the tools they need to protect electronic health records (EHRs) and ensure HIPAA compliance.

#### 3. Retail

Retailers face unique challenges with high transaction volumes and diverse IT environments. Cisco Secure Workload helps retailers secure their transactional data, protect customer information, and prevent fraud.

Example Product: Cisco Secure Network Analytics

Cisco Secure Network Analytics offers a plethora of features that make it stand out in the crowded cybersecurity market. Here are some of the core functionalities:

– **Comprehensive Network Visibility**: Cisco SNA provides a complete view of all network traffic, allowing you to see what’s happening across your entire infrastructure. This visibility is crucial for identifying potential threats and understanding normal network behavior.

– **Advanced Threat Detection**: Utilizing machine learning and behavioral analytics, Cisco SNA can detect anomalies that may indicate a security breach. This proactive approach helps in identifying threats before they can cause significant damage.

– **Automated Response and Mitigation**: When a threat is detected, Cisco SNA can automatically respond by triggering predefined actions, such as isolating affected devices or blocking malicious traffic. This automation ensures a swift and efficient response to security incidents.

### Benefits of Implementing Cisco Secure Network Analytics

Implementing Cisco Secure Network Analytics offers numerous benefits to organizations of all sizes. Some of the key advantages include:

– **Reduced Mean Time to Detect (MTTD) and Respond (MTTR)**: With its advanced detection and automated response capabilities, Cisco SNA significantly reduces the time it takes to identify and mitigate threats. This rapid response is crucial for minimizing the impact of security incidents.

– **Enhanced Network Performance**: By providing detailed insights into network traffic, Cisco SNA helps organizations optimize their network performance. This optimization leads to improved efficiency and reduced downtime.

– **Regulatory Compliance**: Many industries are subject to strict regulatory requirements regarding data protection and network security. Cisco SNA helps organizations meet these compliance standards by providing detailed audit trails and reporting capabilities.

### Real-World Applications of Cisco Secure Network Analytics

Cisco Secure Network Analytics is versatile and can be applied across various industries and use cases. Here are a few examples:

– **Financial Services**: Banks and financial institutions can use Cisco SNA to protect sensitive customer information and prevent fraud. The tool’s advanced threat detection capabilities are particularly valuable in identifying and stopping sophisticated cyber-attacks.

– **Healthcare**: In the healthcare sector, protecting patient data is paramount. Cisco SNA helps healthcare providers secure their networks against breaches and ensure compliance with regulations such as HIPAA.

– **Education**: Educational institutions can benefit from Cisco SNA by safeguarding student and faculty data. The tool also helps in maintaining the integrity of online learning platforms and preventing disruptions.

Related: For background information, you may find the following posts helpful:

  1. DNS Security Designs
  2. Zero Trust Access
  3. SD WAN Segmentation

 

Zero Trust Network Design

**Issue 1 – We Connect First and Then Authenticate**

  • Connect first, authenticate second.

TCP/IP is a fundamentally open network protocol facilitating easy connectivity and reliable communications between distributed computing nodes. It has served us well in enabling our hyper-connected world but—for various reasons—doesn’t include security as part of its core capabilities.

  • TCP has a weak security foundation

Transmission Control Protocol (TCP) has been around for decades and has a weak security foundation. When it was created, security was out of scope. TCP can detect and retransmit errored packets, but by default, communication packets are not encrypted, which poses security risks.

In addition, TCP operates with a Connect First, Authenticate Second model, which is inherently insecure. It leaves the two connecting parties wide open to attack. When clients want to communicate and access an application, they first set up a connection.

The authentication stage occurs only once the connect stage has been completed. Once the authentication stage has been completed, we can pass the data. 

Diagram: Zero Trust security. The TCP model of connectivity.

From a security perspective, the most important thing to understand is that this connection occurs purely at a network layer with no identity, authentication, or authorization. The beauty of this model is that it enables anyone with a browser to easily connect to any public web server without requiring any upfront registration or permission. This is a perfect approach for a public web server but a lousy approach for a private application.

Zero Trust Connectivity: Service Networking APIs

**Understanding Zero Trust: A Paradigm Shift in Security**

In the context of service networking APIs, zero trust ensures that only authorized users and devices can interact with the APIs, reducing the risk of unauthorized access and data breaches. Implementing zero trust can significantly enhance the security posture of an organization, safeguarding sensitive data and maintaining user trust.

**Integrating Google Cloud and Zero Trust for Enhanced API Security**

Combining Google Cloud’s robust platform with zero trust principles creates a powerful synergy for securing service networking APIs. Google Cloud’s identity and access management tools, such as Cloud Identity and Access Management (IAM), work seamlessly within a zero trust framework to enforce strict authentication and authorization policies. By leveraging these tools, organizations can create a secure environment where APIs are protected from potential threats, and data is kept confidential and integral.


**The potential for malicious activity**

With this process of Connect First and Authenticate Second, we are essentially opening up the door of the network and the application without knowing who is on the other side. Unfortunately, with this model, we have no idea who the client is until they have carried out the connect phase, and once they have connected, they are already in the network. Maybe the requesting client is not trustworthy and has bad intentions. If so, once they connect, they can carry out malicious activity and potentially perform data exfiltration. 

What is Network Monitoring?

Network monitoring is observing and analyzing network components and traffic to identify anomalies or performance issues. It uses specialized software and tools that provide real-time insights into network health, bandwidth utilization, device status, etc. By actively monitoring the network infrastructure, businesses can proactively detect and resolve issues before they escalate.

Network monitoring plays a pivotal role in safeguarding sensitive data from external threats. By monitoring network traffic for any suspicious activities or unauthorized access attempts, IT teams can quickly detect and respond to potential security breaches. Additionally, monitoring network devices for vulnerabilities and applying necessary patches and updates ensures a robust defense against cyber threats.

**Understanding Network Scanning**

Network scanning, at its core, involves systematically examining a network to identify its assets, configurations, and potential vulnerabilities. By employing various scanning techniques, security professionals can understand the network’s structure and potential risks.

Different methodologies for conducting network scanning exist, each catering to specific objectives. Passive scanning, for instance, focuses on observing network traffic without actively engaging with devices. On the other hand, active scanning involves sending requests to network devices to gather information about their configurations and potential vulnerabilities.

Numerous powerful tools are available to aid in network scanning endeavors. From widely used tools like Nmap and Wireshark to more specialized ones like Nessus and OpenVAS, the selection of tools depends on the desired scanning approach and the level of detail required. These tools provide many features, including port scanning, vulnerability assessment, and network mapping capabilities.

Additional Information on Network Mapping

Example: Identifying and Mapping Networks

To troubleshoot the network effectively, you can use a range of tools. Some are built into the operating system, while others must be downloaded and run. Depending on your experience, you may choose a top-down or a bottom-up approach.

**Developing a Zero Trust Architecture**

A zero-trust architecture requires endpoints to authenticate and be authorized before obtaining network access to protected servers. Then, real-time encrypted connections are created between requesting systems and application infrastructure. With a zero-trust architecture, we must establish trust between the client and the application before the client can set up the connection. Zero Trust is all about trust – never trust, always verify.

Trust is bidirectional between the client and the Zero Trust architecture (which can take many forms) and between the application and the Zero Trust architecture. It’s not a one-time check; it’s a continuous mode of operation. Once sufficient trust has been established, we move into the next stage, authentication. Once authentication has been completed, we can connect the user to the application. Zero Trust flips the entire security model and makes it more robust. 

  • We have gone from connecting first and authenticating second to authenticating first and connecting second.
Diagram: The Zero Trust model of connectivity.

Example of a zero-trust network access

A. Single Packet Authorization (SPA)

The user cannot see or know where the applications are located. SDP hides the application and creates a “dark” network by using Single Packet Authorization (SPA) for the authorization.

SPA, also known as Single Packet Authentication, aims to overcome the open and insecure nature of TCP/IP, which follows a “connect then authenticate” model. SPA is a lightweight security protocol that validates a device or user’s identity before permitting network access to the SDP. The purpose of SPA is to allow a service to be darkened via a default-deny firewall.

The systems use a One-Time Password (OTP) generated by an algorithm and embed the current password in the initial network packet sent from the client to the server. The SDP specification mentions using the SPA packet after establishing a TCP connection. In contrast, the open-source implementation from the creators of SPA uses a UDP packet before the TCP connection.

B. Understanding Port Knocking

At its core, port knocking is an access control method that conceals open ports on a server. Instead of leaving ports visibly open and vulnerable to attackers, port knocking requires a sequence of connection attempts to predefined closed ports. Once the correct sequence is detected, the server dynamically opens the desired port and allows access. This covert approach adds an extra layer of protection, making it an intriguing choice for those seeking to fortify their network security.

Implementing port knocking within a zero-trust framework can significantly enhance your network security. By obscuring open ports and allowing access only to authorized users who possess the correct port-knocking sequence, potential attackers face an additional barrier to overcome. This technique effectively reduces the attack surface and minimizes the risk of unauthorized access, making it an invaluable tool for security-conscious individuals and organizations.

**Issue 2 – Fixed perimeter approach to networking and security**

Traditionally, security boundaries were placed at the edge of the enterprise network in a classic “castle wall and moat” approach. However, as technology evolved, remote workers and workloads became more common. As a result, security boundaries necessarily followed and expanded from just the corporate perimeter.

**The traditional world of static domains**

The traditional world of networking started with static domains. Networks were initially designed to create internal segments separated from the external world by a fixed perimeter. The classical network model divided clients and users into trusted and untrusted groups. The internal network was deemed trustworthy, whereas the external was considered hostile.

The perimeter approach to network and security has several zones: for example, the Internet, DMZ, Trusted, and Privileged. In addition, public and private address spaces separate network access. Private addresses were deemed more secure than public ones as they were unreachable online. However, the assumption that everything behind a private address is safe is where our problems started. 

**The fixed perimeter** 

The digital threat landscape is concerning. Applications and networks are hit by external threats from all over the world. Threats also come from inside the network: insider threats within a user group and insider threats across user group boundaries. These types of threats need to be addressed one by one.

One issue with the fixed perimeter approach is that it assumes trusted internal and hostile external networks. However, we must assume that the internal network is as hostile as the external one.

Over 80% of threats are from internal malware or malicious employees. The fixed perimeter approach to networking and security is still the foundation for most network and security professionals, even though a lot has changed since the design’s inception. 

Zero Trust & VPC Service Controls

### Role of VPC Service Controls in Zero Trust Network Design

Zero Trust Network Design is rapidly gaining traction as an essential cybersecurity framework. Unlike traditional security models that assume trust within the network, Zero Trust operates on the principle of ‘never trust, always verify.’ This paradigm shift emphasizes the need for more granular controls and continuous verification of user and device identities. VPC Service Controls align perfectly with this approach by restricting access to critical resources and ensuring that only authenticated and authorized entities can interact with the data. This integration fortifies the network’s defenses, minimizes potential attack vectors, and ensures data integrity.

### Implementing VPC Service Controls in Google Cloud

Implementing VPC Service Controls within Google Cloud is a strategic move for organizations aiming to enhance their security posture. The process involves setting up security perimeters around sensitive resources, such as Cloud Storage buckets, BigQuery datasets, and Cloud Bigtable instances. By defining these perimeters, organizations can enforce policies that restrict access based on specific criteria, such as IP addresses, service accounts, or even user-defined attributes. This granular control not only prevents unauthorized access but also ensures compliance with industry regulations and standards.


We get hacked daily!

The 2022 Thales Data Threat Report found that almost half (45%) of US companies suffered a data breach in the past year, and the true figure could be higher given the potential for undetected breaches.

We are getting hacked daily, and even major networks run by skilled staff are being breached. Unfortunately, the perimeter approach to networking has failed to provide adequate security in today’s digital world. It works to an extent by delaying an attack, but with enough patience and skill, a bad actor will eventually penetrate your guarded walls.

If a large gate and walls guard your house, you would feel safe and fully protected inside. However large and thick that perimeter may be, there is still a chance that someone can climb the walls, reach your front door, and enter your property. In contrast, if a bad actor cannot even see your house, they cannot take the next step and try to breach your security.

Example: Security Scan Lynis

Lynis is an open-source security auditing tool designed to assess the security of Linux and Unix-based systems. Developed by CISOfy, Lynis performs comprehensive security scans by analyzing system configurations, checking for vulnerabilities, and recommending steps to improve overall security posture.

**Issue 3 – Dissolved perimeter caused by the changing environment**

The environment has changed with the introduction of the cloud, advanced BYOD, machine-to-machine connections, the rise in remote access, and phishing attacks. We have many internal devices and a variety of users, such as on-site contractors, that need to access network resources.

Corporate devices are also trending to move to the cloud, collocated facilities, and off-site to customer and partner locations. In addition, they are becoming more diversified with hybrid architectures.

These changes are causing major security problems for the fixed perimeter approach to networking and security. For example, with the cloud, the internal perimeter is stretched to the cloud, but traditional security mechanisms are still being used, even though it is an entirely new paradigm. We also have an abundance of remote workers working from a variety of devices and locations.

Again, traditional security mechanisms are still being used. As our environment evolves, security tools and architectures must evolve. Let’s face it: the network perimeter has dissolved as your remote users, things, services, applications, and data are everywhere. In addition, as the world moves to the cloud, mobile, and IoT, the ability to control and secure everything in the network is no longer available.

Phishing attacks are on the rise.

We have witnessed an increase in phishing attacks that can result in a bad actor landing on your local area network (LAN). Phishing is a type of social engineering where an attacker sends a fraudulent message designed to trick a person into revealing sensitive information or to deploy malicious software, such as ransomware, on the victim’s infrastructure. The term “phishing” was first used in 1994, when a group of teens worked manually to obtain credit card numbers from unsuspecting users on AOL.

Diagram: Phishing attacks. Source: helpnetsecurity.

Hackers are inventing new ways.

By 1995, they had created a program called AOHell to automate their work. Since then, hackers have continued to invent new ways to gather details from anyone connected to the internet. These actors have created several programs and types of malicious software that are still used today.

Recently, I was a victim of a phishing email. Clicking and downloading the file is very easy if you are not educated about phishing attacks. In my case, the particular file was a .wav file. It looked safe, but it was not.

**Issue 4 – Broad-level access**

So, you may have heard of broad-level access and lateral movement. With traditional network and security mechanisms, when a bad actor lands on a particular segment, i.e., a VLAN (known as zone-based networking), they can see everything on that segment, which gives them broad-level access. VLAN-to-VLAN communication is not the hardest thing to achieve either, which results in lateral movement.

The issue of lateral movements

Lateral movement is the technique attackers use to progress through the organizational network after gaining initial access. Adversaries use lateral movement to identify target assets and sensitive data for their attack. Lateral Movement is a tactic (TA0008) in the MITRE ATT&CK framework: the set of techniques attackers use to move through the network and gain access to credentials without being detected.

No intra-VLAN filtering

This is made possible as, traditionally, a security device does not filter this low down on the network, i.e., inside of the VLAN, known as intra-VLAN filtering. A phishing email can easily lead the bad actor to the LAN with broad-level access and the capability to move laterally throughout the network. 

For example, a bad actor can initially access an unpatched central file-sharing server; they move laterally between segments to the web developers’ machines and use a keylogger to get the credentials to access critical information on the all-important database servers.

They can then carry out data exfiltration with DNS or even a social media account like Twitter. However, firewalls generally do not check DNS as a file transfer mechanism, so data exfiltration using DNS will often go unnoticed. 

With a zero-trust network segmentation approach, networks are segmented into smaller islands with specific workloads. In addition, each segment has its own ingress and egress controls to minimize the “blast radius” of unauthorized access to data.

Example: Segmentation with Network Endpoint Groups (NEGs)


**Issue 5 – The challenges with traditional firewalls**

The limited world of 5-tuple

Traditional firewalls typically control access to network resources based on source IP addresses. This creates a fundamental challenge: we need to solve the user access problem, but we only have tools that control access based on IP addresses.

As a result, users who may work in different departments and roles end up grouped together to access the same service from the same IP addresses. The firewall rules are also static and don’t change dynamically based on the level of trust in a given device. They provide only network information.

Maybe the user moves to a riskier location, such as an Internet cafe, or the device’s local firewall or antivirus software is turned off by malware or even by accident. Unfortunately, a traditional firewall cannot detect any of this; it lives in the limited world of the 5-tuple. Traditional firewalls can only express static rule sets and cannot communicate or enforce rules based on identity information.

Diagram: The TCP 5-tuple. Source: packet-foo.

**Issue 6 – A Cloud-focused environment**

To examine the cloud, consider the analogy of a public parking space: a public cloud is a place where you park your car, compared with keeping your vehicle in your own garage. In a public parking space, multiple tenants park alongside you, and you don’t know what they might do to your car.

Today, we are very cloud-focused, but when moving applications to the cloud, we need to be very security-focused. However, the cloud environment is less mature in providing the traditional security control we use in our legacy environment. 

So, when putting applications in the cloud, you shouldn’t leave security at its defaults. Why? Firstly, we operate in a shared model, where another tenant could potentially get at your encryption keys or data, and there have been many cloud breaches. Cloud protection also suffers from static firewall rule sets and from authentication and key management issues.

**Control point change**

One of the biggest problems is that the perimeter has moved when you move to a cloud-based application. Servers are no longer under your control. Mobile devices and tablets exacerbate the problem, as they can be located anywhere, so trying to control the perimeter is very difficult. More importantly, firewalls only have access to, and control over, network information; they lack application context.

This new perimeter is defined by ZTNA architectures and the software-defined perimeter. When applications move to the cloud, it is now the cloud consumer, not the I.T. teams within the cloud providers, who manages the firewalls.

So when moving applications to the cloud, even though cloud providers provide security tools, the cloud consumer has to integrate security to have more visibility than they have today.

Before, we had clear network demarcation points set by a central physical firewall creating inside and outside trust zones. Anything outside was considered hostile, and anything on the inside was deemed trusted.

1. Connection-centric model

The Zero Trust model flips this around and considers everything untrusted. To do this, there are no longer pre-defined fixed network demarcation points. Instead, the network perimeter initially set in stone is now fluid and software-based.

Zero Trust is connection-centric, not network-centric. Each user on a specific device connected to the network gets an individualized connection to a particular service hidden by the perimeter.

Instead of having one perimeter every user uses, SDP creates many small perimeters purposely built for users and applications. These are known as micro perimeters. Clients are cryptographically signed into these microperimeters.

2. Micro perimeters: Zero trust network segmentation

The micro perimeter is based on user and device context and can dynamically adjust to environmental changes. So, as a user moves to different locations or devices, the Zero Trust architecture can detect this and set the appropriate security controls based on the new context.

The data center is no longer the center of the universe. Instead, the user on specific devices, along with their service requests, is the new center of the universe.

Zero Trust does this by decoupling the user and device from the network. The control plane, where authentication happens first, is separated from the data plane.

Then, the data plane, the client-to-application connection, transfers the data. Therefore, the users don’t need to be on the network to gain application access. As a result, they have the least privilege and no broad-level access.

3. Zero trust network segmentation

Zero-trust network segmentation is gaining traction in cybersecurity because it increases an organization’s network protection. This method of securing networks is based on the concept of “never trust, always verify,” meaning that all traffic must be authenticated and authorized before it can access the network.

This is accomplished by segmenting the network into multiple isolated zones accessible only through specific access points, which are carefully monitored and controlled.

Network segmentation is a critical component of a zero-trust network design. By dividing the network into smaller, isolated units, it is easier to monitor and control access to the network. Additionally, segmentation makes it harder for attackers to move laterally across the network, reducing the chance of a successful attack.

Zero-trust network design segmentation is essential to any organization’s cybersecurity strategy. By utilizing segmentation, authentication, and monitoring systems, organizations can ensure their networks are secure and their data is protected.

4. The IP address conundrum

Everything today relies on IP addresses for trust, but there is a problem: IP addresses lack user knowledge to assign and validate the device’s trust. There is no way for an IP address to do this. IP addresses provide connectivity but do not involve validating the trust of the endpoint or the user.

Also, IP addresses should not be used as an anchor for network locations, as they are today, because when a user moves from one place to another, the IP address changes. 

Security cannot be tied to an IP address.

But what about the security policy assigned to the old IP addresses? What happens when your IPs change? Anything tied to an IP address is fragile, as it gives us no reliable hook on which to hang security policy enforcement. There are several facets to policy: the user access policy touches on authorization, the network access policy touches on what to connect to, and the user account policies touch on authentication.

With any of these, IP addresses give us no policy visibility. This is also a significant problem for traditional firewalling, which relies on static configurations; for example, a static rule may state that this particular source can reach this destination using this port number. 

**Security-related issues with IP addresses**

  1. This has no meaning. There is no indication of why that rule exists or under what conditions a packet should be allowed to travel from one source to another.
  2. No contextual information is taken into consideration. When creating a robust security posture, we must consider more than ports and IP addresses.

For a robust security posture, you need complete visibility into the network to see who, what, when, and how they connect with the device. Unfortunately, today’s Firewall is static and only contains information about the network.

On the other hand, Zero Trust enables a dynamic firewall with the user and device context to open a firewall for a single secure connection. The Firewall remains closed at all other times, creating a ‘black cloud’ stance regardless of whether the connections are made to the cloud or on-premise. 

The rise of the next-generation firewall?

Next-generation firewalls are more advanced than traditional firewalls. They use the information in layers 5 through 7 (session, presentation, and application layers) to perform additional functions. They can provide advanced features such as intrusion detection, prevention, and virtual private networks.

Today, most enterprise firewalls are “next generation” and typically include IDS/IPS, traffic analysis and malware detection for threat detection, URL filtering, and some degree of application awareness/control.

Like the NAC market segment, vendors in this area began a journey toward identity-centric security around the same time that Zero Trust ideas began percolating through the industry. Today, many NGFW vendors offer Zero Trust capabilities, but many still operate within the perimeter security model.

Still, IP-based security systems

NGFWs are still IP-based systems offering limited identity and application-centric capabilities. In addition, they are static firewalls. Most do not employ zero-trust segmentation, and they often mandate traditional perimeter-centric network architectures with site-to-site connections and don’t offer flexible network segmentation capabilities. Similar to conventional firewalls, their access policy models are typically coarse-grained, providing users with broader network access than what is strictly necessary.

Example: Tags and Controls with firewalling


Summary: Zero Trust Network Design

Traditional network security measures are no longer sufficient in today’s digital landscape, where cyber threats are becoming increasingly sophisticated. Enter zero trust network design, a revolutionary approach that challenges the traditional perimeter-based security model. In this blog post, we will delve into the concept of zero-trust network design, its key principles, benefits, and implementation strategies.

Understanding Zero Trust Network Design

Zero-trust network design is a security framework that operates on the principle of “never trust, always verify.” Unlike traditional perimeter-based security, which assumes trust within the network, zero-trust treats every user, device, or application as potentially malicious. This approach is based on the belief that trust should not be automatically granted but continuously verified, regardless of location or network access method.

Key Principles of Zero Trust

Certain key principles must be followed to implement zero trust network design effectively. These principles include:

1. Least Privilege: Users and devices are granted the minimum level of access required to perform their tasks, reducing the risk of unauthorized access or lateral movement within the network.

2. Microsegmentation: The network is divided into smaller segments or zones, allowing granular control over network traffic and limiting the impact of potential breaches or lateral movement.

3. Continuous Authentication: Authentication and authorization are not just one-time events but are verified throughout a user’s session, preventing unauthorized access even after initial login.

Benefits of Zero Trust Network Design

Implementing a zero-trust network design offers several significant benefits for organizations:

1. Enhanced Security: By adopting a zero-trust approach, organizations can significantly reduce the attack surface and mitigate the risk of data breaches or unauthorized access.

2. Improved Compliance: Zero trust network design aligns with many regulatory requirements, helping organizations meet compliance standards more effectively.

3. Greater Flexibility: Zero trust allows organizations to embrace modern workplace trends, such as remote work and cloud-based applications, without compromising security.

Implementing Zero Trust

Implementing a zero trust network design requires careful planning and a structured approach. Some key steps to consider are:

1. Network Assessment: Conduct a thorough assessment of the existing network infrastructure, identifying potential vulnerabilities or areas that require improvement.

2. Policy Development: Define comprehensive security policies that align with zero trust principles, including access control, authentication mechanisms, and user/device monitoring.

3. Technology Adoption: Implement appropriate technologies and tools that support zero-trust network design, such as network segmentation solutions, multifactor authentication, and continuous monitoring systems.

Conclusion:

Zero trust network design represents a paradigm shift in network security, challenging traditional notions of trust and adopting a more proactive and layered approach. By implementing the fundamental principles of zero trust, organizations can significantly enhance their security posture, reduce the risk of data breaches, and adapt to evolving threat landscapes. Embracing the principles of least privilege, microsegmentation, and continuous authentication, organizations can revolutionize their network security and stay one step ahead of cyber threats.

Ansible Variables


Ansible Variable

Ansible, the powerful automation tool, offers a wide range of features to simplify IT infrastructure management. One such feature is the use of variables, which allow for dynamic and flexible configurations. In this blog post, we will dive deep into the world of Ansible variables and explore how they can enhance your automation workflows.

Variables in Ansible serve as placeholders for dynamic values that can be used across playbooks, roles, and templates. They provide a way to customize and parameterize your automation tasks. Whether it's defining host-specific properties or storing sensitive data securely, Ansible variables offer great versatility.

Ansible supports various types of variables, including global, playbook-level, and role-specific variables. Understanding the scope of variables is crucial for managing their values effectively. We will explore how to define and access variables within different contexts and discuss best practices for maintaining variable consistency.

In complex Ansible projects, it's common to have multiple variables defined at different levels. This section will shed light on the order of precedence for variables and how overrides work. We will learn how to prioritize variable values and handle conflicts gracefully, ensuring that the desired configuration is achieved.

Dynamic variables take Ansible's flexibility to the next level. These variables are derived from system facts, registered task results, or the output of other tasks. We will discover how to leverage dynamic variables to create more intelligent and adaptive automation playbooks.

Securing sensitive information is paramount in any automation solution. Ansible provides a secure way to encrypt variables using the Ansible Vault feature. In this section, we will explore how to encrypt sensitive data, such as passwords or API keys, and seamlessly integrate them into your automation workflows.

Ansible variables are a fundamental aspect of building robust and adaptable automation solutions. From customizing configurations to handling sensitive data, variables empower you to achieve greater control and flexibility. By mastering the art of using variables effectively, you can unlock the full potential of Ansible and streamline your IT operations.

Highlights: Ansible Variable

Managing & Configuring Systems

In the world of automation, Ansible has emerged as a popular choice for managing and configuring systems. One of the key features that sets Ansible apart is its ability to work with variables. Variables in Ansible enable users to define and store values that can be used throughout the playbook, making it a powerful tool for automation.

Variables in Ansible can be defined in various ways. They can be set globally, at the playbook level, or even at the task level. This flexibility allows users to customize their automation process based on their needs.

Variables: Values Specific to Different Environments

One everyday use case for variables is storing configuration values specific to different environments. For example, suppose you have a playbook that deploys a web application. Using variables, you can define the database connection string, the server IP address, and other environment-specific values separately for development, staging, and production environments. This makes it easy to reuse the same playbook across different environments without modifying the code.

Variables: Dynamically Created

Another powerful feature of variables in Ansible is that they can be generated dynamically. Instead of hardcoding values, you can use variables that are calculated or fetched at runtime. For example, the “lookup” plugin can read values from external sources like files or databases and assign them to variables. This makes your automation process more dynamic and adaptable.

**The Process of Decoupling**

For your automation journey, you want to be as flexible as possible. For this reason, within the Ansible architecture, we have a process known as decoupling. Here, we are separating site-specific code from static code. Anything specific to a server or managed device, such as an IP address, can be replaced with Ansible variables. As a best practice, always aim to have flexible playbooks, and if you want to share with someone else, all you need to change is the variables.

A) Variable Locations

As you know, variables can be defined in several places, such as the inventory (with inventory variables) or the play header, and each location has an order of precedence. A common approach is to set variables within a task: Ansible allows this directly using the set_fact module, which we will look at in a moment.

So, for an Ansible architecture with more extensive playbooks, think about the best place to hold your variables so that your playbooks do not stay site-specific. With Ansible, you can execute tasks and playbooks on multiple systems with a single command.

B) Ansible Tower

With Ansible Tower, you can have very complex automation requirements with the push of a button. Every site will have variations, and Ansible uses variables to manage system differences. To represent the variations among those systems, you can create variables with standard YAML syntax, including lists and dictionaries. 

Before you proceed, you may find the following posts helpful:

  1. Ansible Architecture
  2. Security Automation
  3. Network Configuration Automation
  4. Software Defined Perimeter Solutions

Ansible Variable

Defining Ansible Variables 

Ansible is open-source automation and orchestration software that can automate most of your operations with IT infrastructure components, including servers, storage, networks, and application platforms. It is one of the most popular automation tools in the IT world and has strong community support, with more than 5,000 contributors worldwide. With Ansible, we use variables. 

Ansible uses variables to manage differences between systems. With Ansible, you can execute tasks and playbooks on multiple systems with a single command. To represent the variations among those systems, you can create variables with standard YAML syntax, including lists and dictionaries.

Ansible set variables in the task

Ansible is not a full-fledged programming language. However, it does have several programming language components. One of the most significant of these is variable substitution. The most straightforward way to define variables is to put a vars section in your playbook with the names and values of variables.

Here, we can have Ansible set variables in the task. The following sections list the various ways you can set variables in Ansible. Keep in mind that there is an order of precedence when doing so.

You can define your variables in several places: your playbooks, inventory, reusable files or roles, or at the command line. During a playbook run, you can create variables by registering a task’s return value or value as a new variable.

When defining variables in multiple places, they have variable precedence. After creating variables, you can use them in module arguments, such as conditional “when” clauses, templates, and loops. All of these are potent constructs to have in your automation toolbox.

Define variables: Vars: Section

If you are starting your automation journey, the simplest way to define variables is to put a vars section in your playbook with the names and values of variables. This allows you to define several configuration-related variables. So, to define variables in plays, include a vars: section in the header of the play where the variables are needed.

Variables defined in plays are only valid within that specific play and don’t have playbook scope. So, if you need a variable in a different play, you must define it again. This may be inconvenient and difficult to manage across extensive playbooks with multiple teams working on playbook development.
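For example, a minimal play with a vars: section in its header might look like the following; the host group, variable names, and values are purely illustrative.

```yaml
---
- name: Configure web servers with play-level variables
  hosts: webservers
  vars:
    http_port: 8080
    app_user: webapp
  tasks:
    - name: Ensure the application user exists
      ansible.builtin.user:
        name: "{{ app_user }}"
        state: present

    - name: Report the port this play will configure
      ansible.builtin.debug:
        msg: "Configuring the application to listen on port {{ http_port }}"
```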

Define variables: Set_fact Module.

We also have the set_fact module. This module is used anywhere in a play to set variables. Any variable set this way applies as a fact to the host in which it is set. The set_fact relates the variable to the host used in the play.

Here, you can dynamically set variables based on the result of any task in the playbook. So, set_fact dynamically defines variables. Keep in mind that setting variables this way will have a playbook scope.
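Here is a small sketch of set_fact in action: one task inspects the host, and a follow-up task derives a variable from the result. The path and variable names are hypothetical.

```yaml
- name: Derive a host-level fact from an earlier task
  hosts: webservers
  tasks:
    - name: Check whether the application directory already exists
      ansible.builtin.stat:
        path: /opt/webapp
      register: app_dir

    - name: Record the deployment mode for this host
      ansible.builtin.set_fact:
        deploy_mode: "{{ 'upgrade' if app_dir.stat.exists else 'fresh_install' }}"

    - name: Show the computed value
      ansible.builtin.debug:
        var: deploy_mode
```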

Define variables: Vars_files

You can also put variables into one or more files using the section called vars_files. This lets you put variables in a file instead of directly in the playbook. What I like most about setting variables this way is that it allows you to separate variables that contain sensitive information. When you define variables in reusable variable files, the sensitive variables are separated from playbooks.

This separation enables you to store your playbooks in, for example, source control software and even share them without the risk of exposing passwords or other sensitive personal data. So, when you put variables in files, they are referenced in the playbook using vars_files.

Use vars_files to specify a list of files that include variables you want to include. This is convenient when you want to manage the variables independently of the plays that use them, and it is helpful for security purposes.
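A brief sketch of the vars_files approach is shown below; the file path and its contents are assumptions made for the example.

```yaml
# vars/web_settings.yml (hypothetical contents):
#   http_port: 8080
#   db_password: "{{ vault_db_password }}"

- name: Use variables kept in a separate file
  hosts: webservers
  vars_files:
    - vars/web_settings.yml
  tasks:
    - name: Render the application configuration from a template
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/webapp/app.conf
```

Keeping the sensitive file out of source control, or encrypting it with Ansible Vault, is what makes this separation useful in practice.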

Define variables: Vars_prompt

So, here, we can use vars_prompt in the play header to prompt users for a variable value. This has play scope. By default, the variable is flagged as private, so the user does not see anything while entering it. The input can be echoed by setting private to no. 
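A minimal sketch of vars_prompt, assuming a play that needs a value typed in at launch; the variable name and prompt text are illustrative.

```yaml
- name: Prompt the operator for a value at playbook start
  hosts: webservers
  vars_prompt:
    - name: admin_password
      prompt: "Enter the admin password"
      private: true      # input is hidden; set private to no to echo it
  tasks:
    - name: Confirm that a value was received without printing it
      ansible.builtin.debug:
        msg: "A password of length {{ admin_password | length }} was entered"
```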

Define variables: Defining variables at runtime  

When you run your playbook, you can define variables by passing them at the command line using the --extra-vars (or -e) argument.
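
Two illustrative ways to pass extra variables at the command line, assuming a playbook called site.yml:

```bash
# key=value pairs
ansible-playbook site.yml --extra-vars "version=1.2.3 env=staging"

# JSON keeps non-string types such as lists intact
ansible-playbook site.yml -e '{"version": "1.2.3", "deploy_hosts": ["web1", "web2"]}'
```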

Define variables: Task Variables

Task variables are made from data discovered while executing tasks or in the fact-gathering phase of a play. These variables are host-specific and are added to the host’s host vars. Variables of this type can be discovered via gather_facts and fact modules, populated from task return data via the register task key, or defined directly by a task using the set_fact or add_host modules. 
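
A small sketch of the register task key; the command and variable name are illustrative:

```yaml
- name: Capture task output with register
  hosts: all
  tasks:
    - name: Check which kernel is running
      ansible.builtin.command: uname -r
      register: kernel_result        # becomes a host-specific task variable

    - name: Use the registered variable in a later task
      ansible.builtin.debug:
        msg: "Running kernel {{ kernel_result.stdout }}"
```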

**Facts / Variables**

Ansible Fact: System Variables

Ansible facts are a type of variable. You don’t define Ansible facts; they are discovered. Facts are system variables that contain information about the managed system. Each playbook starts with an implicit task to gather facts. This can also be disabled in the play header. Gathering facts takes time, so if you are not going to use them, you can disable fact gathering.

We can also gather facts manually by using the setup module. All of the facts are stored in one big variable called ansible_facts; within it, the facts are categorized into second-tier variables. Because facts are variables, you can use them in conditionals and when statements.
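
A sketch of using a gathered fact in a conditional; the package and OS family check are illustrative:

```yaml
- name: Use gathered facts in a conditional
  hosts: all
  gather_facts: true                 # implicit by default, shown for clarity
  tasks:
    - name: Install Apache only on RedHat-family hosts
      ansible.builtin.yum:
        name: httpd
        state: present
      when: ansible_facts['os_family'] == 'RedHat'
```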

Speeding up Fact Gathering

Fact collection can be slow when you work against many hosts, so you can set up a fact cache. If fact caching is enabled, Ansible stores facts in a cache the first time it connects to a host. This is configured in your ansible.cfg, along with the “fact_caching_timeout” value. If your playbook does not reference any Ansible facts, you can turn off fact gathering for that play using the “gather_facts” clause in the play or in ansible.cfg.
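
A sketch of fact caching in ansible.cfg, assuming the jsonfile cache backend; the path and timeout are illustrative:

```ini
# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
# directory where cached facts are written (illustrative path)
fact_caching_connection = /tmp/ansible_fact_cache
# seconds before cached facts expire (illustrative value)
fact_caching_timeout = 86400
```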

**Ansible Inventory Variable**

Ansible Variables are a key component of Automation. They allow for dynamic play content and reusable plays across different sets of inventory. Variable data, such as specific details on how to connect to a particular host in your inventory, can be included along with an inventory in various ways.

While Ansible can discover data about a system during the setup phase, not all data can be found. We can define data with the inventory that expands on what Ansible has been able to discover.

Ansible variables: The Ansible inventory variables

    • Host and group variables

In inventory, you can store variable values related to a specific host or group. This allows you to add variables directly to the hosts and groups in your main inventory file. As you add more managed nodes to your Ansible inventory, you will likely want to store variables in separate host and group variable files.

    • [host_vars and group_vars]

Ansible looks for host variable files in a directory called host_vars and group variable files in a directory called group_vars. Remember that Ansible expects these directories to be in either the directory that contains your playbooks or the directory that contains your inventory file. You can break things out even further: Ansible lets us define, for example, group_vars/production as a directory instead of a file.
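
A hypothetical layout showing host_vars, group_vars, and a group directory broken into multiple files:

```text
inventory/
├── hosts                       # the inventory file itself
├── host_vars/
│   └── web1.example.com.yml    # variables that apply only to web1
└── group_vars/
    ├── all.yml                 # variables for every host
    └── production/             # a directory can replace a single file
        ├── network.yml
        └── packages.yml
```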

Behavioral inventory variables

Behavioral inventory parameters allow you to describe your machines with additional parameters in your inventory file. For example, the ansible_connection parameter may be helpful: Ansible supports multiple means of transport, which it uses to connect to the managed host. Here is a list of some common behavioral inventory parameters and the behaviors they intend to modify, followed by a short example:

    1. ansible_host: This is the DNS name or the Docker container name that Ansible will initiate a connection to.
    2. ansible_port: This specifies the port number that Ansible will use to connect to the inventory host if it is not the default value 22.
    3. ansible_user: This specifies the username Ansible will use to communicate with the inventory host, regardless of the connection type.
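
An illustrative INI inventory snippet setting behavioral parameters per host; the host names, addresses, and users are assumptions:

```ini
# inventory/hosts
[webservers]
web1 ansible_host=10.0.0.11 ansible_user=deploy
web2 ansible_host=10.0.0.12 ansible_port=2222

[dbservers]
db1  ansible_host=10.0.0.21 ansible_connection=ssh
```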

**A key point: Ansible inventory scalability**

If you don’t have many hosts, you can use the inventory to store host and group variables. However, as your environment grows, managing variables in the inventory will become more challenging. In this case, you need to find a more scalable approach to keeping track of your host and group variables.

Even though Ansible variables can hold Booleans, strings, lists, and dictionaries, you can specify only Booleans and strings in an inventory file. Therefore, we have a more scalable approach to keeping track of host and group variables: you can create a separate variable file for each host and group. Ansible expects these variable files to be in YAML format, which allows you to break the inventory into multiple files.

Highlighting Ansible conditionals

– In a playbook, you may want to execute different tasks depending on the value of a fact, a variable, or the result of a previous task. You may wish the value of some variables to depend on the value of other variables. You can do all of these things with conditionals. Ansible uses Jinja2 tests and filters in conditionals. Basic conditionals are used with the when clause.

– The most straightforward conditional statement applies to a single task. Create the Task, then add a when statement that applies a test. When running the task or playbook, Ansible evaluates the test for all hosts. For example, if you are installing MySQL on multiple machines, some of which have SELinux enabled, you might have a task to configure SELinux to allow MySQL to run.

– You would only want that Task to run on machines with SELinux enabled. Sometimes, you want to execute or skip a task based on facts. With conditionals based on facts, you can install a specific package only when the operating system is a particular version. You can skip configuring a firewall on hosts with internal IP addresses. You can perform cleanup tasks only when a filesystem is getting full.
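
A sketch of the SELinux scenario above using a when clause; it assumes the ansible.posix collection is installed, and the boolean name is illustrative:

```yaml
- name: Conditional task execution with when
  hosts: databases                  # assumed inventory group
  tasks:
    - name: Allow MySQL network connections only where SELinux is enabled
      ansible.posix.seboolean:
        name: mysql_connect_any     # illustrative SELinux boolean
        state: true
        persistent: true
      when: ansible_facts['selinux']['status'] == 'enabled'
```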

**Ansible conditionals and When Clause**

You can also create conditionals based on variables defined in the playbooks or inventory. So, we have playbook variables, registered variables, and facts that can all be used in conditions and ensure that tasks only run if specific conditions are true. 

**Handlers and When Statement**

There are several ways Ansible can be configured for conditional task execution. We have, for example, handlers for conditional task execution, which run when a task has changed something. Then we have the very powerful when statement, which allows you to run tasks only when specific conditions are true. You can also use register in combination with when statements.

**Using Handlers for conditional task execution**

A handler is a task that is only executed when triggered by a task that has changed something. Handlers are executed after all tasks in a play, so you need to organize the contents of your playbook accordingly. If any task fails after the task that notified the handler, the handlers are not executed. We can use the force_handlers setting to change this.

Keep in mind that handlers only run when something has changed. The force_handlers option allows you to run a handler that has been notified even if subsequent tasks fail. Simply put, a handler is a particular type of task that is called only if something changes.
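
A minimal handler sketch; the template file, service name, and handler name are illustrative:

```yaml
- name: Notify a handler only when something changes
  hosts: webservers
  tasks:
    - name: Deploy the web server configuration
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx         # fires only if the file changed

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```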

**Handlers in Pre and Post tasks**

Each task section in a playbook is handled separately; any handler notified in pre_tasks, tasks, or post_tasks is executed at the end of each section. As a result, it is possible to execute one handler several times in one play.

  • Using Blocks

Blocks create logical groups of tasks. Blocks also offer ways to handle task errors, similar to exception handling in many programming languages, so they can be used for error handling. You use a block to define the main tasks to run, and then rescue to define tasks that run if the tasks defined in the block fail. You can use always to define tasks that will run regardless of the success or failure of the block and rescue sections.
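
A sketch of block, rescue, and always; the commands and messages are placeholders:

```yaml
- name: Block, rescue, and always
  hosts: all
  tasks:
    - block:
        - name: Attempt the main task
          ansible.builtin.command: /usr/local/bin/deploy_app   # hypothetical script
      rescue:
        - name: Runs only if a task in the block failed
          ansible.builtin.debug:
            msg: "Deployment failed, rolling back"
      always:
        - name: Runs regardless of success or failure
          ansible.builtin.debug:
            msg: "Cleanup complete"
```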

  • Ansible loops

What happens, however, if you have a single task but need to run it against a list of data, for example, creating several user accounts, directories, or something more complex? Like any programming language, Ansible loops provide an easier way of executing repetitive tasks using fewer lines of code in a playbook.

Examples of commonly used loops include changing ownership of several files and directories with the file module, creating multiple users with the user module, and repeating a polling step until a particular result is reached.
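
A short loop sketch creating several users; the user names are illustrative:

```yaml
- name: Create several users with a loop
  hosts: all
  become: true
  tasks:
    - name: Add application users
      ansible.builtin.user:
        name: "{{ item }}"
        state: present
      loop:
        - alice
        - bob
        - carol
```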

**A final point: Managing failures with the Fail Module**

Ansible looks at a task’s exit status to determine whether it has failed. When any task fails, Ansible aborts the rest of the playbook on that host and continues with the next host. We can change this behavior in a few ways. For example, we can use ignore_errors in a task to ignore failure, or force_handlers to force a handler that has been notified to run even if another task fails.

But remember, a handler only runs when there has been a change. We can also use failed_when, which allows you to specify what to look for in command output to recognize a failure. You may have a playbook used to clean up resources, and you want the playbook to ignore every error, keep going to the end, and then fail if there were errors. In this case, when using the fail module, the failing task must have ignore_errors set to yes.
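
A sketch combining ignore_errors, failed_when, and the fail module for the cleanup scenario; the script path and error string are assumptions:

```yaml
- name: Control failure handling during cleanup
  hosts: all
  tasks:
    - name: Treat specific output as a failure, but keep going
      ansible.builtin.command: /usr/local/bin/cleanup.sh   # hypothetical script
      register: cleanup_result
      failed_when: "'ERROR' in cleanup_result.stdout"
      ignore_errors: true

    - name: Fail explicitly at the end if cleanup reported errors
      ansible.builtin.fail:
        msg: "Cleanup reported errors"
      when: cleanup_result is failed
```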

Summary: Ansible Variables

Ansible, a popular open-source automation tool, provides many features to streamline IT operations. One such powerful feature is the use of variables. In this blog post, we delved into the world of Ansible variables, exploring their importance, different types, best practices, and practical examples.

Understanding Ansible Variables

Variables in Ansible serve as placeholders for dynamic values that can be used across playbooks and roles. They enable flexibility and reusability, making your automation workflows more efficient. This section covered the basics of Ansible variables, including variable declaration, scoping, and precedence.

Types of Ansible Variables

Ansible offers various types of variables to cater to different needs. From global variables to environment variables, extra variables, and facts, each type serves a specific purpose. We explored these different types in detail, highlighting their use cases and how they can enhance your Ansible automation.

Best Practices for Working with Variables

To ensure smooth and maintainable Ansible projects, following best practices when working with variables is essential. This section provided valuable insights into structuring variable files, naming conventions, documentation, and organizing your variables for better maintainability and collaboration within your team.

Practical Examples

Putting theory into practice is the best way to solidify your understanding. This section will walk through real-world examples where Ansible variables play a crucial role. From dynamic inventory management to conditional statements and template rendering, you will gain hands-on experience leveraging variables to achieve desired automation outcomes.

Advanced Techniques and Tips

Once you fully grasp the fundamentals, it’s time to level up your Ansible variable skills. This section will dive into advanced techniques and tips, including variable manipulation, using Jinja2 filters, encrypted variables, and integrating Ansible with external data sources. Unlock the full potential of Ansible variables and take your automation to the next level.

Conclusion:

In conclusion, Ansible variables are fundamental to creating robust and flexible automation workflows. By understanding their significance, exploring different types, and adopting best practices, you can harness their power to simplify complex deployments, increase reusability, and maintain cleaner code. Embrace the world of Ansible variables and unlock unparalleled automation capabilities.

Ansible Architecture Diagram

Ansible Architecture | Ansible Automation

Ansible Architecture

When it comes to managing and automating IT infrastructure, Ansible has gained immense popularity. Its simplicity and flexibility make it an ideal choice for sysadmins, developers, and IT professionals alike. In this blog post, we will delve into the intricate architecture of Ansible, understanding its key components and how they work together seamlessly.

Ansible Controller: The Ansible Controller acts as the brain behind the entire automation process. It is responsible for orchestrating and managing the entire infrastructure. The controller is where playbooks and inventories are stored and executed. Additionally, it communicates with the managed hosts using SSH or other remote protocols.

Managed Hosts: Managed Hosts, also known as Ansible Nodes, are the machines that are being managed by Ansible. These can be physical servers, virtual machines, network devices, or even containers. Ansible uses SSH or other remote protocols to establish a connection with managed hosts and execute tasks or playbooks on them.

Playbooks: Playbooks are the heart and soul of Ansible automation. They are written in YAML and define a set of tasks to be executed on managed hosts. Playbooks provide a structured and reusable way to automate complex tasks. They can include variables, conditionals, loops, and handlers, allowing for powerful automation workflows.

Modules: Modules are reusable units of code that perform specific tasks on managed hosts. Ansible provides a wide range of modules for various purposes, such as managing files, installing packages, configuring network devices, and more. Modules can be executed directly from the command line or invoked from within playbooks.

Inventories: Inventories serve as a source of truth for Ansible, defining the hosts and groups that Ansible manages. They can be static files or dynamic sources, such as cloud providers or external databases. Inventories enable you to organize and group hosts, making it easier to apply configurations to specific subsets of machines.

Ansible's architecture is built on simplicity, yet it offers immense power and flexibility. Understanding the key components of Ansible, such as the controller, managed hosts, playbooks, modules, and inventories, is crucial for harnessing the full potential of this automation tool. Whether you are a seasoned sysadmin or a beginner in the world of IT automation, Ansible's architecture provides a solid foundation for managing and orchestrating your infrastructure efficiently.

Highlights: Ansible Architecture

Architecture: The Components

Ansible has emerged as one of the most popular automation tools, revolutionizing how organizations manage and deploy their IT infrastructure. With its simple yet robust architecture, Ansible has gained widespread adoption across diverse industries. In this blog post, we will delve into the intricacies of Ansible architecture, exploring how it works and its key components.

At its core, Ansible follows a client-server architecture model. It has three main components: control nodes, managed nodes, and communication channels. Let’s examine each of these components.

1. Control Node:

The control node acts as the central management point in Ansible architecture. It is the machine from which Ansible is installed and executed. The control node stores the inventory, playbooks, and modules to manage the managed nodes. Ansible uses a declarative language called YAML to define playbook tasks and configurations.

2. Managed Nodes:

Managed nodes are the machines that Ansible drives. They can be physical servers, virtual machines, or even network devices. Ansible connects to managed nodes over SSH or WinRM protocols, enabling seamless management across different operating systems.

3. Communication Channels:

Ansible utilizes SSH or WinRM protocols to establish secure communication channels between the control node and managed nodes. SSH is used for Linux-based systems, whereas WinRM is for Windows-based systems. This allows Ansible to execute commands, transfer files, and collect information from managed nodes.

**Ansible Workflow**

Ansible operates on a push-based model, where the control node pushes configurations and commands to the managed nodes. The workflow involves the following steps:

1. Inventory: The inventory is a file that contains a list of managed nodes. It provides Ansible with information such as IP addresses, hostnames, and connection details to establish communication.

2. Playbooks: Playbooks are YAML files that define the desired state of the managed nodes. They consist of a series of tasks or plays, each representing a specific action to be executed on the managed nodes. Playbooks can be as simple as a single task or as complex as a multi-step deployment process.

3. Execution: Ansible executes playbooks on the control node and communicates with managed nodes to perform the defined tasks. It uses modules, which are small programs written in Python or other scripting languages, to interact with the managed nodes and carry out the required operations.

4. Reporting: Ansible provides detailed reports on task execution status, allowing administrators to monitor and troubleshoot any issues. This helps maintain visibility and consistently apply the desired configurations across the infrastructure.

Advantages of Ansible Architecture:

The Ansible architecture offers several advantages, making it a preferred choice for automation:

1. Simplicity: Ansible’s architecture is designed to be simple and easy to understand. Using YAML playbooks and declarative language allows administrators to define configurations and tasks in a human-readable format.

2. Agentless: Unlike traditional configuration management tools, Ansible requires no agent software installed on managed nodes. This reduces complexity and eliminates the need for additional overhead.

3. Scalability: Ansible’s highly scalable architecture enables administrators to manage thousands of nodes simultaneously. SSH and WinRM protocols allow for efficient communication and coordination across large infrastructures.

**Playbooks and Inventory**

As a best practice, you don’t want your Ansible architecture, which consists of playbooks and inventory, to be too specific. Instead, you need a certain level of abstraction that keeps precise, site-specific information out of the code. To develop flexible code, you must separate site-specific information from the code, which is done with variables in Ansible.

Remember that when you develop dynamic code along with your static information, you can use this on any site with minor modifications to the variables themselves. However, you can have variables in different places, and where you place variables, such as a play header or Inventory, will take different precedence. So, variables can be used in your Ansible deployment architecture to provide site-specific code. 

Before you proceed, you may find the following posts helpful for pre-information:

  1. Network Configuration Automation
  2. Ansible Variables
  3. Network Traffic Engineering

Ansible Architecture

Rise of Distributed Computing

Several megatrends have driven the move to an Ansible architecture. Firstly, the rise of distributed computing made the manual approach to almost anything in the IT environment obsolete. This was not only because it caused many errors and mistakes but also because the configuration drift from the desired to the actual state was considerable.

This is not only an operational burden but also a considerable security risk. Today, deploying applications by combining multiple services that run on a distributed set of resources is expected. As a result, configuration and maintenance are more complex than in the past.

Two Options:

– You have two options to implement all of this. First, you can connect these services by manually spinning up the servers, installing the necessary packages, and SSHing to each one, or you can go down the path of automation, in particular, automation with Ansible.

– So, with Ansible deployment architecture, we have the Automation Engine, the CLI, and Ansible Tower, which is more of an automation platform for enterprise-grade automation. This post focuses on Ansible Engine. 

– As a quick note, if you have environments with more than a few teams automating, I recommend Ansible Tower or the open-source version of AWX. Ansible Tower has a 60-day trial license, while AWX is fully open-sourced and does not require a license. The open-source version of AWX could be a valuable tool for your open networking journey.

A Key Point: Risky: The Manual Way.

Let me put it this way: If you configure manually, you will have to maintain all the settings by hand. What about mitigating vulnerabilities and determining what patches or packages are installed in a large environment?

How can you ensure all your servers are patched and secured manually? Manipulating configuration files by hand is tedious, error-prone, and time-consuming. Equally, performing pattern matching to make changes to existing files is risky. 

A Key Point: The issue of Configuration Drift

The manual approach will result in configuration drift, where some servers will drift from the desired state. Configuration drift is caused by inconsistent configuration items across devices, usually due to manual changes and updates and not following the automation path. Ansible is all about maintaining the desired state and eliminating configuration drift.

Components: Ansible Deployment Architecture

Configuration management

The Ansible architecture is based on a configuration management tool that can help alleviate these challenges. Ansible replaces the need for an operator to tune configuration files manually and does an excellent job in application deployment and orchestrating multi-deployment scenarios. It can also be integrated into CI/CD pipelines.

In reality, Ansible is relatively easy to install and operate. However, it is not a single entity. Instead, it comprises tools, modules, and software-defined infrastructure that form the Ansible toolset, configured from a single host that can manage multiple hosts.

We will discuss the value of idempotency with Ansible modules later. Even with modules’ idempotency, you can still have users of Ansible automating over each other. Ansible Tower or AWX is the recommended solution for multi-team automation efforts.

Ansible vs Tower
Diagram: Ansible vs Tower. Source Red Hat.

Pre-deployed infrastructure: Terraform

Ansible does not deploy the infrastructure; you could use other solutions like Terraform that are best suited for this. Terraform is infrastructure as a code tool. Ansible Engine is more of a configuration as code. The physical or virtual infrastructure needs to be there for Ansible to automate, compared to Terraform, which does all of this for you.

Ansible is an easy-to-use DevOps tool that manages configuration as code with the same design through any sized environment. Therefore, the size of the environment is largely irrelevant to Ansible.

Because Ansible connectivity uses SSH, which runs over TCP, there are multiple optimizations you can apply to increase performance and optimize connectivity, which we will discuss shortly. Ansible is often described as a configuration management tool and is typically mentioned along the same lines as Puppet, Chef, and Salt. However, there is a considerable difference in how they operate, most notably in the installation of agents.

Ansible architecture: Agentless

The Ansible architecture is agentless and requires nothing to be installed on the managed systems. It is also serverless, so it has a minimal footprint. Some configuration management systems, such as Chef and Puppet, are “pull-based” by default.

Where agents are installed, they periodically check in with the central service and pull down configuration. Ansible is agentless and does not require the installation of an agent on the target to communicate with the target host.

However, it requires connectivity from the control host to the target inventory (which contains a list of hosts that Ansible manages) with a trusted relationship. For convenience, we can have passwordless sudo connectivity between our hosts. This allows you to log in without a password, which can be a security risk: if someone gets to your control machine, they could have escalated privileges on all the Ansible-managed hosts.

Agentless Automation
Diagram: Agentless Automation. Source Docs at Ansible

Key Ansible features:

Easy-to-Read Syntax: Ansible uses the YAML file format and Jinja2 templating. Jinja2 is the template engine for the Python programming language. Ansible uses Jinja2 templating to access variables and facts and extends the defaults of Ansible for more advanced use cases.

Not a full programming language: Although Ansible is not a full-fledged programming language, it has several good features. One of the most important is variable substitution, or using the values of variables in strings or other variables.

In addition, the variables in Ansible make the Ansible playbooks, which are like executable documents, very flexible. Variables are a powerful construct within Ansible and can be used in various ways. Nearly every single thing done in Ansible can include a variable reference. We also have dynamic variables known as facts.

Jinja2 templating language: Ansible’s defaults are extended using the Jinja2 templating language. In addition, Ansible’s use of Jinja2 templating adds more advanced use cases. One great benefit is that it is self-documenting, so when someone looks at your playbook, it’s easy to understand, unlike Python code or a Bash script.

So, not only is Ansible easy to understand, but with just a few lines of YAML, the language used for Ansible, you can install, let’s say, web servers on as many hosts as you like.

Ansible Architecture: Scalability: Ansible can scale. For example, Ansible uses advanced features like SSH multiplexing to optimize SSH performance. Some use cases manage thousands of nodes with Ansible from a single machine.

SSH Connection: Parallel connections: We have three managed hosts: web1, web2, and web3. Ansible makes parallel SSH connections to web1, web2, and web3 and then executes the first task on the list on all three hosts simultaneously. In this example, the first task is installing the Nginx package, so the task in the playbook would look like the sketch below. Ansible uses the SSH protocol to communicate with all hosts except Windows hosts.
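
A sketch of that first task; ansible.builtin.package is used here as a generic installer, though the exact module may vary by distribution:

```yaml
- name: Install the Nginx package
  ansible.builtin.package:
    name: nginx
    state: present
```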

This SSH service is usually integrated with the operating system authentication stack, enabling you to use Kerberos to improve authentication security. Ansible uses the same authentication methods that you are already familiar with. SSH keys are typically the easiest way to proceed as they remove the need for users to input the authentication password every time a playbook is run.

Optimizing SSH 

Ansible uses SSH to manage hosts, and establishing an SSH connection takes time. However, you can optimize SSH with several features. Because the SSH protocol runs on top of the TCP protocol, you need to create a new TCP connection when you connect to a remote host with SSH.

You don’t want to open a new SSH connection for every activity. Here, you can use ControlMaster, which allows multiple simultaneous SSH sessions with a remote host over one network connection. ControlPersist, or multiplexing, keeps a connection open for a configurable number of seconds. Pipelining allows more commands to reuse a simultaneous SSH connection. Recall how Ansible executes a task:

    1. It generates a Python script based on the module being invoked
    2. It then copies the Python script to the host
    3. Finally, it executes the Python script

The pipelining optimization executes the Python script by piping it to the SSH session instead of copying it, so we use one SSH session instead of two. These options can be configured in ansible.cfg under the ssh_connection section, where you can specify how these connections are used.
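
An illustrative ssh_connection section in ansible.cfg; the ControlPersist duration is an assumption to tune for your environment:

```ini
# ansible.cfg
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
```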

A note on scalability: Ansible and modularity

Ansible scales down well because simple tasks are easy to implement and understand in playbooks. Ansible scales up well because it allows decomposing complex jobs into smaller pieces. So, we can bring the concept of modularity into playbooks as they become more complex. I like using tags during playbook development; they can save time and effort when testing different parts of the playbook once you know certain parts are working.

Security wins! No daemons and no listening agents

Once Ansible is installed, it will not add a database, and there will be no daemons to start or keep running. You only need to install it on one machine (which could easily be a laptop), and it can manage an entire fleet of remote machines from that central point. No Ansible agent is listening on a port. Therefore, when you use Ansible, there is no extra attack surface for a bad actor to play with.

This is a big win for security, following one of the leading security principles of reducing the attack surface. When you run the ansible-playbook command, Ansible connects to the remote servers and does what you want. By default, Ansible is pretty streamlined out of the box, but you can enhance it by configuring the ansible.cfg file to make it even more automated.

Ansible Architecture: Ansible Architecture diagram

Ansible Inventory: Telling Ansible About Your Servers

The Ansible architecture diagram has several critical components. First, the Ansible inventory is all about telling Ansible about your servers. Ansible can manage only the servers it explicitly knows about. Ansible comes with one default server of the local host, the control host.

You provide Ansible with information about servers by specifying them in an inventory. We usually create a directory called “inventory” to hold this information.

Example: A straightforward inventory file might contain a list of hostnames. The Ansible inventory is the system against which a playbook runs: a list of systems in your infrastructure against which the automation is executed. The following Ansible architecture diagram shows all the Ansible components, including modules, playbooks, and plugins.

Ansible Architecture Diagram
Diagram: Ansible Architecture Diagram.

Ansible architecture diagram: Inventory highlights

The inventory commonly contains hosts but can also comprise other components such as network devices, storage arrays, and other physical and virtual appliances. It also holds valuable information that can be used against our targets during execution.

The inventory can be as simple as a text file or more dynamic, where it is an executable and the data is sourced dynamically. This way, we can store data externally and use it during runtime. So, we can have a dynamic inventory via Amazon Web Services or create our own dynamic inventory script.

Example: AWS EC2 External inventory script  

The above Ansible architecture diagram shows a connection to the cloud. If you use Amazon Web Services EC2, maintaining an inventory file might not be the best approach because hosts may come and go over time, be managed by external applications, or be affected by AWS autoscaling.

For this reason, you can use the EC2 external inventory script. In addition, if your hosts run on Amazon EC2, then EC2 tracks information about your hosts for you. Ansible inventory is flexible; you can use multiple inventory sources simultaneously. Mixing dynamic and statically managed inventory sources in the same Ansible run is possible. Many refer to this as an instant hybrid cloud.

Ansible and NMAP
Diagram: Ansible and NMAP. Source Red Hat.

Ansible deployment architecture and Ansible modules

Next, within the Ansible deployment architecture, we have the Ansible modules, which are considered Ansible’s main workhorse. You use modules to perform various tasks, such as installing a package, restarting a service, or copying a configuration file. Ansible modules cater to a wide range of system administration tasks.

This list of categories covers the kinds of modules that you can use; there are over 3,000 modules. So you may be thinking, who is looking after all these modules? That’s where collections, introduced in more recent versions of Ansible, come in.

**Extending Ansible Modules**

The modules are the scripts (written in Python) that ship with Ansible packages and perform some action on a managed host. Ansible has extensive modules covering many areas, including networking, cloud computing, server configuration, containerization, and virtualization.

In addition, many modules support your automation requirements. If no modules exist, you can create a custom module with the extensive framework of Ansible. Each task is correlated with the module one-to-one. For example, a template task will use the template module.

**A key point: Idempotency**

Modules strive to be idempotent, allowing the module to run repeatedly without a negative impact. In Ansible, the input is in the form of command-line arguments to the module, and the output is delivered as JSON to STDOUT. Input is generally provided in the space-separated key=value syntax, and it’s up to the module to deconstruct these into usable data.

Most Ansible modules are also idempotent, which means running an Ansible playbook multiple times against a server is safe. For example, if the deploy user does not exist, Ansible will create it. If it does exist, Ansible will not do anything. This is a significant improvement over the shell script approach, where running the script a second time might have different and potentially unintended effects.

**The Use of Ansible Ad Hoc Commands**

For instance, if you wanted to power off all of your labs for the weekend, you could execute a quick one-liner in Ansible without writing a playbook. An ad hoc command is suitable for running a single command against hosts, and it uses the modules. Ad hoc commands are handy for checking configuration on a host and are also good for learning Ansible.

Note that Ansible is the executable for ad hoc one-task executions, and ansible-playbook is the executable for processing playbooks to orchestrate multiple tasks.
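
A few illustrative ad hoc one-liners; the host patterns (all, labs, webservers) are assumed inventory groups:

```bash
ansible all -m ansible.builtin.ping
ansible labs -b -m ansible.builtin.command -a "poweroff"            # power off the lab group
ansible webservers -m ansible.builtin.setup -a "filter=ansible_distribution*"
```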

The other side of the puzzle is the playbook command, which is used for more complex tasks and is better for use cases where dependencies have to be managed. A playbook can take care of all application deployments and dependencies.

Ansible Ad Hoc commands
Diagram: Ansible Commands. Source Docs at Ansible.

Ansible Plays

1. Ansible playbooks

An Ansible playbook can contain multiple plays, each of which can execute against different managed assets. An Ansible play is all about “what am I automating.” It then connects to the hosts to perform the actions. Each playbook comprises one or more ‘plays’ in a list. The goal of a play is to map a group of hosts to some well-defined roles, represented by things Ansible calls tasks.

At a basic level, a task is just a call to an Ansible module. By composing a playbook of multiple ‘plays’, it is possible to orchestrate multi-machine deployments: running specific steps on all machines in the web servers group, then particular actions on the database server group, then more commands back on the web servers group, and so on.

2. Ansible tasks

Once a playbook is parsed and the hosts are determined, Ansible is ready to execute tasks. Tasks include a name, a module reference, module arguments, and task control directives. By default, Ansible executes each task in order, one at a time, against all machines matched by the host pattern. Each task executes a module with specific arguments. You can also use the --start-at-task <task name> flag to tell ansible-playbook to start a playbook in the middle, at a particular task.

3. Task execution

Each play contains a list of tasks. Tasks are executed in order, one at a time, against all machines matched by the host pattern before moving on to the next task. It is essential to understand that, within a play, all hosts will get the same task directives. This is because the purpose of a play is to map a selection of hosts to tasks. 
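
A sketch of a playbook with two plays mapping different host groups to their own tasks; the group and package names are illustrative:

```yaml
- name: Configure the web tier
  hosts: webservers
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

- name: Configure the database tier
  hosts: dbservers
  tasks:
    - name: Install PostgreSQL
      ansible.builtin.package:
        name: postgresql-server
        state: present
```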

Summary: Ansible Architecture

Ansible, an open-source automation tool, has gained immense popularity in IT infrastructure management. Its architecture plays a crucial role in enabling efficient and streamlined automation processes. In this blog post, we explored the intricacies of Ansible architecture, understanding its key components and how they work together to deliver powerful automation capabilities.

Ansible Overview

Ansible is designed to simplify the automation of complex tasks, allowing system administrators to manage infrastructure effectively. Its architecture revolves around a simple yet powerful concept: declarative configuration management. By defining a system’s desired state, Ansible takes care of the necessary actions to achieve that state. This approach eliminates the need for manual intervention and reduces the risk of error.

Control Node

At the heart of Ansible architecture lies the control node. This is where the Ansible engine is installed, and automation tasks are executed. The control node is the central control point, orchestrating the entire automation process. It includes the Ansible command-line interface (CLI) and the inventory file, which contains information about the target systems.

Managed Nodes

Managed nodes are the systems that Ansible manages and automates. These can be servers, network devices, or other devices accessed remotely. Ansible connects to managed nodes using SSH or WinRM, depending on the underlying operating system. It executes modules on the managed nodes to perform tasks and gather information.

Playbooks

Playbooks are the heart and soul of Ansible. They are written in YAML format and define the automation tasks. Playbooks consist of one or more plays and tasks targeted at specific groups of managed nodes. Playbooks provide a high-level abstraction layer, allowing system administrators to express complex automation workflows in a human-readable format.

Modules

Modules are the building blocks of Ansible automation. They are small units of code that perform specific tasks on managed nodes. Ansible ships with a wide range of modules covering many aspects of system administration, network configuration, cloud management, and more. Modules can be executed individually or as part of a playbook, enabling granular control over the automation process.

Conclusion: Ansible architecture provides a robust framework for automating IT infrastructure management tasks. Its simplicity, scalability, and flexibility make it a popular choice among system administrators and DevOps teams. By understanding the key components of Ansible architecture, you can harness its power to streamline your automation workflows and drive operational efficiency.

OpenShift Networking Deep Dive

OpenShift SDN

OpenShift SDN

In the world of networking, Openshift SDN stands out as a powerful and innovative solution. With its cutting-edge features and flexible architecture, it has revolutionized the way we approach software-defined networking. In this blog post, we will dive deep into Openshift SDN, exploring its key components, benefits, and practical use cases.

Openshift SDN, short for Openshift Software-Defined Networking, is a robust platform that enables efficient and scalable networking in containerized environments. It leverages various technologies such as Open vSwitch and OpenFlow to provide a comprehensive networking solution for modern applications. By decoupling the network control plane from the data plane, Openshift SDN offers enhanced flexibility and agility.

a) Open vSwitch: At the heart of Openshift SDN lies Open vSwitch, a virtual switch that enables network virtualization and bridges the gap between physical and virtual networks. It allows for seamless communication between containers and facilitates the creation of logical networks with advanced features like VLAN tagging and QoS.

b) OpenFlow: Openshift SDN relies on the OpenFlow protocol to manage and control network flows. OpenFlow enables centralized management and programmability, allowing administrators to define and enforce network policies with ease. This dynamic control over network traffic ensures efficient resource utilization and improved network performance.

a) Enhanced Scalability: Openshift SDN provides a scalable networking solution that can handle the demands of modern applications. Its ability to dynamically provision and manage network resources ensures smooth scalability as your application needs grow.

b) Improved Security: With Openshift SDN, security becomes a top priority. The platform offers robust isolation between containers, preventing unauthorized access and potential breaches. Network policies can be defined and enforced at a granular level, ensuring strict control over traffic flow.

c) Simplified Operations: Openshift SDN simplifies network management by providing a centralized control plane. Administrators can easily configure, monitor, and troubleshoot the network, reducing operational complexities and minimizing downtime.

a) Microservices Architecture: Openshift SDN is particularly well-suited for microservices-based applications. Its flexible networking capabilities allow for seamless communication between microservices, enabling efficient scaling and load balancing.

b) Hybrid Cloud Environments: In hybrid cloud deployments, Openshift SDN acts as a bridge, connecting on-premises infrastructure with cloud resources. It ensures secure and reliable communication between different environments, providing a consistent networking experience.

Highlights: OpenShift SDN

The Role of SDN

OpenShift SDN (Software Defined Network) is a software-defined networking solution designed to make it easier for organizations to manage their network traffic in the cloud. It is a network overlay technology that enables distributed applications to communicate over public and private networks.

OpenShift SDN is based on the Open vSwitch (OVS) platform and provides a secure, reliable, and highly available layer 3 network overlay. With OpenShift SDN, users can define their network topologies, create virtual networks, and control traffic flows between virtual machines and containers.

One of the standout features of OpenShift SDN is its ability to support multiple network isolation policies. This is carried out with the use of a network plugin. This ensures that applications can run securely without interfering with one another. Additionally, OpenShift SDN integrates seamlessly with Kubernetes, leveraging its capabilities to manage network policies, services, and routes efficiently.

OpenShift Network Plugins:

OpenShift SDN uses a network plugin that provides a flexible and efficient way to connect containers across nodes in a cluster. It abstracts the underlying network complexities via the network overlay, enabling seamless communication between pods. OpenShift SDN is designed to meet the high demands of modern applications, providing secure and scalable networking solutions.

**OpenShift Networking SDN**

1. To start OpenShift networking SDN, we have the Route constructs to provide access to specific services from the outside world. So, there is a connection point between the Route and the service construct. First, the Route connects to the service; then, the service acts as a software load balancer to the correct pod or pods with your application.

2. Several different service types exist, with ClusterIP as the default. You may consider the service the first level of exposing applications, but services are unrelated to DNS name resolution. To make services accessible by FQDN, we use the OpenShift Route resource, which provides the DNS name.

3. The default service cluster IP addresses are from the OpenShift Dedicated internal network, which permits pods to access each other. Services are assigned an IP address and port pair that, when accessed, proxy to an appropriate backing pod.

4. By default, unsecured routes are configured and are, therefore, the easiest to configure. A secured route, however, offers security that keeps your connection private. Create secure HTTPS routes using the create route command and optionally supplying certificates and keys (PEM-format files that must be generated and signed separately).
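
Illustrative oc commands for an unsecured and an edge-terminated route; the service name, hostname, and certificate files are assumptions:

```bash
# Unsecured route backed by an existing service
oc expose service my-app

# Edge-terminated HTTPS route with a supplied certificate and key
oc create route edge my-app-secure --service=my-app \
  --cert=tls.crt --key=tls.key --hostname=app.example.com
```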

Personal Note: Application Exposure

– When considering OpenShift and how OpenShift networking SDN works, you need to fully understand how application exposure works and how to expose applications to the external world so that external clients can access your internal application.

– For most use cases, the applications running in Kubernetes pods’ containers (see Kubernetes networking 101) need to be exposed, and this is not done with the pod IP address, because pod IPs can be ephemeral.

– Instead, pod IP addresses are reserved for internal use cases. Application exposure is done with OpenShift Routes and OpenShift Services, and the construct used depends on the level of exposure needed.

Related: For pre-information, kindly visit the following:

  1. OpenShift Security Best Practices
  2. ACI Cisco
  3. DNS Security Solutions
  4. Container Networking
  5. OpenStack Architecture
  6. Kubernetes Security Best Practice

OpenShift SDN

Kubernetes Concepts

Kubernetes’ concept of a POD

As the smallest compute unit that can be defined, deployed, and managed, OpenShift leverages the Kubernetes concept of a pod: one or more containers deployed together on one host. A pod is the equivalent of a physical or virtual machine instance to a container. Containers within pods can share their local storage and networking, and each pod has its own IP address.

An individual pod has a lifecycle; it is defined, assigned to a node, and then runs until the container(s) exit or are removed for some other reason. Pods can be removed after exiting or retained to allow access to container logs, depending on policy and exit code.

In OpenShift, pod definitions are largely immutable; they cannot be modified while running. Changes are implemented by terminating existing pods and recreating them with modified configurations, base images, or both. Additionally, pods are expendable and do not maintain their state when recreated. In general, pods should not be managed directly by users but by higher-level controllers.

Kubernetes’ Concept of Services

Kubernetes services act as internal load balancers. They identify a set of replicated pods and proxy connections to them. While the service remains consistently available, backing pods can be added or removed arbitrarily, enabling everything that depends on them to refer to them at a consistent address. This is a key abstraction layer in the Kubernetes design. Additionally, the OpenShift Container Platform uses cluster IP addresses to allow pods to communicate with each other and access the internal network.

Note: The service can be assigned additional external IP and ingress IP addresses outside the cluster to allow external access. An external IP address can also be a virtual IP address that provides highly available access to the service.

Services are assigned IP addressing and port mappings, which proxy to an appropriate backing pod when accessed. Using a label selector, a service can find all containers running on a specific port that provides a particular network service. Like pods, services are REST objects. 

There are a couple of options for getting hands-on with OpenShift. You can download the CodeReady Containers for Linux, Microsoft, and MacOS or RedHat’s pre-built Sandbox Lab environment.

Configuring OpenShift Cluster

### Getting Started: Setting Up Your OpenShift Environment

Before diving into configuration, it’s essential to set up your OpenShift environment correctly. Start by installing the OpenShift CLI (oc) and setting up access to your cluster. For beginners, Red Hat offers OpenShift Online, which provides a managed cloud service to ease you into the ecosystem. Ensure your local development environment is compatible and includes the necessary tools such as Docker and Kubernetes.

### Configuring Your OpenShift Cluster: Step-by-Step

1. **Networking:** Begin by configuring the cluster networking. OpenShift uses a Software-Defined Network (SDN) to manage communication between pods. Choose a suitable network plugin, like OVN-Kubernetes or Calico, and configure the network policies to secure your application traffic.

2. **Storage:** Proper storage configuration is vital for stateful applications. OpenShift supports various storage solutions, including Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). Evaluate your storage needs and select appropriate storage classes to ensure data persistence and redundancy.

3. **Security:** OpenShift offers robust security features. Configure role-based access control (RBAC) to restrict user permissions and define security contexts for your pods. Implement network security policies and enable cluster auditing to monitor and respond to potential threats.

4. **Resource Management:** Efficient resource allocation ensures optimal performance. Configure resource quotas and limits to prevent overconsumption. Utilize Horizontal Pod Autoscalers (HPAs) to automatically adjust the number of pods based on current demand.
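
As an example of the resource-management step above, here is a minimal ResourceQuota sketch; the namespace and limits are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-project        # hypothetical project
spec:
  hard:
    pods: "20"
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```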

### Advanced Configuration: Enhancing Your Cluster

For those looking to further optimize their OpenShift cluster, consider implementing advanced configurations:

– **Monitoring and Logging:** Deploy tools like Prometheus and Grafana for monitoring, and Elasticsearch, Fluentd, and Kibana (EFK) stack for logging. These tools offer insight into cluster performance and help troubleshoot issues quickly.

– **CI/CD Integration:** Integrate OpenShift with your existing CI/CD pipeline using tools like Jenkins or GitLab. This integration allows for automated builds, tests, and deployments, enhancing your development workflow.

– **Custom Operators:** Leverage OpenShift’s Operator Framework to automate the management of complex applications. Operators can significantly reduce manual intervention and streamline application lifecycle management.

OpenShift Networking Deep Dive

### The Basics of OpenShift Networking

OpenShift networking is built on the foundation of Kubernetes, but with enhancements that cater to enterprise needs. At its core, OpenShift uses an overlay network to manage communication between pods across nodes. This network ensures seamless connectivity and scalability, allowing applications to function efficiently in a distributed environment. By default, OpenShift uses Open vSwitch (OVS) as its software-defined networking (SDN) solution, which provides flexibility and control over network traffic.

### Key Components of OpenShift Networking

Understanding the components that make up OpenShift networking is essential for effective network management. The primary components include:

1. **Pods and Services:** Pods are the smallest deployable units in OpenShift, and services provide a stable endpoint for accessing these pods.

2. **Ingress and Egress Traffic:** OpenShift manages ingress traffic via routes, which define how external users access the services. Egress traffic, on the other hand, involves controlling the outbound connections from the pods to external services.

3. **Network Policies:** These policies define the rules for communication between pods. By implementing network policies, you can enhance security by restricting the flow of traffic between different parts of your application.
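
A minimal NetworkPolicy sketch that restricts ingress to the web pods to traffic from the same namespace; the labels and namespace are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: my-project
spec:
  podSelector:
    matchLabels:
      app: web                 # hypothetical pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # any pod in the same namespace
```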

### Advanced Networking Features

OpenShift offers advanced networking features that cater to specific enterprise needs. Some of these features include:

– **Multi-Cluster Networking:** OpenShift supports the deployment of applications across multiple clusters, enabling organizations to achieve high availability and disaster recovery.

– **Service Mesh Integration:** With the rise of microservices, service mesh solutions like Istio are integrated into OpenShift to manage service-to-service communication, providing observability, traffic management, and security.

– **Custom DNS and Load Balancing:** OpenShift allows for custom DNS configurations and integrates with various load balancers to distribute traffic efficiently across nodes.

Service Discovery and DNS

Applications depend on each other to deliver information to users. These relationships are complex to manage in an application spanning multiple independently scalable pods. So, we don’t access applications by pod IP: these IP addresses change over time, and tracking them is not a scalable solution.

To make this easier, OpenShift deploys DNS when the cluster is deployed and makes it available on the pod network. DNS in OpenShift allows pods to discover the resources in the OpenShift SDN.

Layered Approach to DNS

DNS in OpenShift takes a layered approach. Originally, DNS in Kubernetes was used for service discovery; that problem was solved a long time ago, and DNS was the answer for service discovery back then and still is. Service discovery means an application or service inside the cluster can reference a service by name, not an IP address.

The deployed pods represent microservices. A Kubernetes service points to these pods and discovers them by DNS name, so the service is transparent. The internal DNS manages this in Kubernetes; originally it was SkyDNS, then KubeDNS, and now it is CoreDNS.

The DNS Operator

The DNS Operator runs the DNS services and uses CoreDNS. The pods use the internal CoreDNS server for DNS resolution, and each pod’s DNS name server is automatically set to CoreDNS. OpenShift provides its internal DNS, implemented via CoreDNS and dnsmasq, for service discovery. dnsmasq is a lightweight DNS forwarder.

The DNS Operator has several roles:

    1. It creates the default cluster DNS name cluster.local
    2. Assigns DNS names to namespaces. The namespace is part of the FQDN.
    3. Assigns DNS names to services. So, both the service and namespace are part of the FQDN name.
OpenShift DNS Operator
Diagram: OpenShift DNS Operator. Source OpenShift Docs.
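
The resulting in-cluster name pattern looks like the following; the service and namespace names are illustrative:

```text
<service>.<namespace>.svc.cluster.local
my-db.my-project.svc.cluster.local   # resolves to the ClusterIP of "my-db" in "my-project"
```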

OpenShift SDN and the DNS processes

**The Controller Nodes**

The OpenShift cluster network has several components. First, we have the controller nodes; there are multiple controller nodes in a cluster, and they redirect traffic to the pods. We run a router on each controller node and use CoreDNS. In front of the Kubernetes cluster sits a hardware load balancer. Then, we have external DNS, which is outside of the cluster.

This external DNS has a wildcard domain; thus, external DNS is resolved to the frontend hardware load balancer through the wildcard. So, users who want to access a service issue the request and contact external DNS for name resolution.

Then, external DNS resolves the wildcard domain to the load balancer, which load balances to the different control nodes. On these control nodes, we can reach the route and service.

**OpenShift and DNS: Wildcard DNS**

OpenShift has an internal DNS server that is reachable only by pods. To make a service available by name to the outside, we need an external DNS server configured with a wildcard DNS record. The wildcard record resolves all names created under the cluster domain to the OpenShift load balancer.

This OpenShift load balancer provides a frontend to the control nodes, which run the ingress controllers and are part of the cluster. They have access to internal resources and are part of the internal cluster.

**OpenShift Ingress Operators**

For this to work, we need to use the OpenShift Operators. The Ingress Operator implements the IngressController API and enables external access to OpenShift Container Platform cluster services. It does this by deploying one or more HAProxy ingress controllers to handle the routing side.

You can use the Ingress Operator to route traffic by specifying the OpenShift Container Platform Route construct. You may also have heard of the Kubernetes Ingress resource. Both are similar, but the OpenShift Route can have additional security features and supports use cases such as split traffic for blue-green deployments.

The OpenShift Route Construct and Encryption

The OpenShift Container Platform route provides traffic to services in the cluster. Routes also offer advanced features that might not be supported by standard Kubernetes Ingress Controllers, such as TLS re-encryption, TLS passthrough, and split traffic for blue-green deployments.

In Kubernetes terms, we use an Ingress, which exposes services to the external world. In OpenShift, however, it is best practice to use a Route, OpenShift's alternative to Ingress.

Consider three pods, each with a different IP address. To access these pods, we need a Service. The Service provides load balancing and distributes traffic to the pods using a load-balancing algorithm, round robin by default.

The Service is an internal component. In OpenShift, Routes provide a URL for a Service so it can be reached from the outside world. The URL created by the Route points to the Service, and the Service points to the pods. In the Kubernetes world, Ingress fills this role rather than Routes.
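
As a rough sketch, and assuming a Service named myservice already exists, a Route can be created from the CLI; the hostname shown is purely illustrative:

```bash
# Expose the service with a plain HTTP route (hostname is a placeholder)
oc expose service myservice --hostname=myapp.apps.example.com
# Or create an edge-terminated TLS route in front of the same service
oc create route edge myservice-tls --service=myservice --hostname=myapp.apps.example.com
oc get routes
```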

Different types of services

The main Service types are:

  • ClusterIP: The Service is exposed on an IP address internal to the cluster. This is useful for a microservices design where the frontend connects to the backend without exposing the service externally. ClusterIP is the default type, which gives the Service a cluster-wide virtual IP address.
  • NodePort: This service type exposes a port on each node's IP address, much like port forwarding on the physical node. The node port dynamically maps a port on the node to the internal cluster pods: external users connect to the port on the node, the traffic is forwarded to the Service, and the Service load-balances it across the pods.
  • LoadBalancer: A service type typically found in public cloud environments, where the cloud provider provisions an external load balancer that forwards traffic to the Service.
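
To make the list concrete, here is a minimal sketch of a NodePort Service; the name, labels, and port numbers are assumptions for illustration only:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport        # hypothetical name
spec:
  type: NodePort            # change to ClusterIP (the default) or LoadBalancer as needed
  selector:
    app: web                # assumes pods labeled app=web
  ports:
    - port: 80              # the Service's cluster-internal port
      targetPort: 8080      # the container port on the pods
      nodePort: 30080       # port opened on every node (must fall in the NodePort range)
EOF
```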

Forming the network topology

OpenShift SDN networking

New pod creation: OpenShift networking SDN

As new pods are created on a host, the local OpenShift software-defined network (SDN) allocates and assigns an IP address from the cluster network subnet assigned to that node and connects the pod's veth interface to a port on the br0 switch. It does this through the OpenShift OVS integration, which programs rules via the OVS bridge. At the same time, the OpenShift SDN injects new OpenFlow entries into the OVSDB of br0 to route traffic addressed to the newly allocated IP address to the correct OVS port connecting the pod.
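
If you want to see this wiring for yourself, the standard OVS utilities can show it; a hedged sketch, assuming shell access to an OpenShift SDN node where the OVS tools are available:

```bash
# List the br0 bridge and the veth ports attached to pods
ovs-vsctl show
# Dump the OpenFlow rules the SDN programmed (OpenShift SDN's br0 speaks OpenFlow 1.3)
ovs-ofctl -O OpenFlow13 dump-flows br0
```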

Pod network: 10.128.0.0/14

The pod network defaults to the 10.128.0.0/14 IP address block. Each node in the cluster is assigned a /23 CIDR IP address range from the pod network block, which means that, by default, each application node in OpenShift can accommodate a maximum of 512 pods.
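
The arithmetic behind those figures is straightforward; a quick sketch:

```bash
# A /23 block carved out of a /14 network:
echo $(( 2 ** (23 - 14) ))   # 512 -> number of per-node /23 subnets available
echo $(( 2 ** (32 - 23) ))   # 512 -> IP addresses (and therefore maximum pods) per node
```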

The OpenShift SDN manages how these IP address ranges are allocated to each application node. The primary CNI plugin, the essence of SDN for OpenShift, establishes the cluster-wide network and configures the overlay network using OVS.

OpenShift CNI SDN Plugin

OVS is used in your OpenShift cluster as the communications backbone for your deployed pods, handling traffic in and out of every pod and in and out of the OpenShift cluster. OVS runs as a service on each node in the cluster. The primary CNI SDN plugin implements network policies using Open vSwitch flow rules, which dictate which packets are allowed or denied.

Configuring OpenShift Networking SDN

When you deploy OpenShift, the default configuration for the pod network’s topology is a single flat network. Every pod in every project can communicate without restrictions. OpenShift SDN uses a plugin architecture that provides different network topologies. Depending on your network and security requirements, you can choose a plugin that matches your desired topology. Currently, three OpenShift SDN plugins can be enabled in the OpenShift configuration without significantly changing your cluster.

OpenShift SDN default CNI network provider

OpenShift Container Platform uses a software-defined networking (SDN) approach to provide a unified cluster network that enables communication between pods across the OpenShift Container Platform cluster. The OpenShift SDN establishes and maintains this pod network, configuring an overlay network using Open vSwitch (OVS).

OpenShift SDN modes:

OpenShift SDN provides three SDN modes for configuring the pod network.

  1. ovs-subnet: Enabled by default. Creates a flat pod network, allowing all project pods to communicate.
  2. ovs-multitenant: Separates the pods by project. Applications deployed in a project can only communicate with pods deployed in the same project.
  3. ovs-networkpolicy: Provides fine-grained ingress and egress rules for applications. It can be more complex to configure than the other two.

    • OpenShift ovs-subnet

The OpenShift ovs-subnet plugin is the original OpenShift SDN plugin. It provides basic connectivity for the pods and is often described as a "flat" pod network because there are no filters or restrictions: every pod can communicate with every other pod and Service in the cluster. This flat topology across all pods in all projects lets every deployed application communicate.

    • OpenShift ovs-multitenant

With OpenShift ovs-multitenant plugin, each project receives a unique VXLAN ID known as a Virtual Network ID (VNID). All the pods and services of an OpenShift Project are assigned to the corresponding VNID. So now we have segmentation based on the VNID. Doing this maintains project-level traffic isolation, meaning that Pods and Services of one Project can only communicate with Pods and Services in the same project. There is no way for Pods or Services from one project to send traffic to another. The ovs-multitenant plugin is perfect if just having projects separated is enough.

Unique across projects

Unlike the ovs-subnet plugin, which passes all traffic across all pods, this plugin assigns the same VNID to all pods within a project and keeps the VNIDs unique across projects. It also sets up flow rules on the br0 bridge to ensure that traffic is only allowed between pods with the same VNID.

VNID for each Project

When the ovs-multitenant plugin is enabled, each project is assigned a VNID. The VNID for each Project is maintained in the etcd database on the OpenShift master node. When a pod is created, its linked veth interface is associated with its Project’s VNID, and OpenFlow rules are made to ensure it can communicate only with pods in the same project.

    • The ovs-network policy plugin 

The ovs-multitenant plugin cannot control access at a more granular level. This is where the ovs-networkpolicy plugin steps in, adding more configuration power and letting you create custom NetworkPolicy objects. As a result, the ovs-networkpolicy plugin provides fine-grained access control for individual applications, regardless of their project. By expressing isolation through NetworkPolicy objects, you can tailor the topology to your requirements.

This is the Kubernetes NetworkPolicy model: you label or tag your applications and then define a network policy that allows or denies connectivity between them. Network policy mode lets projects configure their own isolation policies using NetworkPolicy objects, and it is the default mode in OpenShift Container Platform 4.8.
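
A minimal sketch of such a policy, using hypothetical app=frontend and app=backend labels and an assumed backend port of 8080:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # hypothetical policy name
spec:
  podSelector:
    matchLabels:
      app: backend                  # the policy applies to backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080                # assumed backend port
EOF
```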

  • OpenShift OVN Kubernetes CNI network provider

OpenShift Container Platform uses a virtualized network for pod and service networks. The OVN-Kubernetes Container Network Interface (CNI) plugin is a network provider for the default cluster network. OVN-Kubernetes is based on the Open Virtual Network (OVN) and provides an overlay-based networking implementation. A cluster that uses the OVN-Kubernetes network provider also runs Open vSwitch (OVS) on each node. OVN configures OVS on each node to implement the declared network configuration.

OVN-Kubernetes features

The OVN-Kubernetes Container Network Interface (CNI) cluster network provider implements the following features:

  • Uses OVN (Open Virtual Network) to manage network traffic flows. OVN is a community-developed, vendor-agnostic network virtualization solution.
  • Implements Kubernetes network policy support, including ingress and egress rules.
  • It uses the Geneve (Generic Network Virtualization Encapsulation) protocol rather than VXLAN to create an overlay network between nodes.

Summary: OpenShift SDN

OpenShift SDN, short for Software-Defined Networking, is a revolutionary technology that has transformed the way we think about network management in the world of containerization. In this blog post, we delved deep into the intricacies of OpenShift SDN and explored its various components, benefits, and use cases. So, fasten your seatbelts as we embark on this exciting journey!

Understanding OpenShift SDN

OpenShift SDN is a networking plugin for the OpenShift Container Platform that provides a robust and scalable network infrastructure for containerized applications. It leverages the power of Kubernetes and overlays network connectivity on top of existing physical infrastructure. OpenShift SDN offers unparalleled flexibility, agility, and automation by decoupling the network from the underlying infrastructure.

Key Components of OpenShift SDN

To comprehend the inner workings of OpenShift SDN, let’s explore its key components:

1. Open vSwitch: Open vSwitch is a virtual switch that forms the backbone of OpenShift SDN. It enables the creation of logical networks and provides advanced features like load balancing, firewalling, and traffic shaping.

2. SDN Controller: The SDN controller is responsible for managing and orchestrating the network infrastructure. It acts as the brain of OpenShift SDN, making intelligent decisions regarding network policies, routing, and traffic management.

3. Network Overlays: OpenShift SDN utilizes network overlays to create virtual networks on top of the physical infrastructure. These overlays enable seamless communication between containers running on different hosts and ensure isolation and security.

Benefits of OpenShift SDN

OpenShift SDN brings a plethora of benefits to containerized environments. Some of the notable advantages include:

1. Simplified Network Management: With OpenShift SDN, network management becomes a breeze. It abstracts the complexities of the underlying infrastructure, allowing administrators to focus on higher-level tasks and reducing operational overhead.

2. Scalability and Elasticity: OpenShift SDN is highly scalable and elastic, making it suitable for dynamic containerized environments. It can easily accommodate the addition or removal of containers and adapt to changing network demands.

3. Enhanced Security: OpenShift SDN provides enhanced security for containerized applications by leveraging network overlays and advanced security policies. It ensures isolation between different containers and enforces fine-grained access controls.

Use Cases for OpenShift SDN

OpenShift SDN finds numerous use cases across various industries. Some prominent examples include:

1. Microservices Architecture: OpenShift SDN seamlessly integrates with microservices architectures, enabling efficient communication between different services and ensuring optimal performance.

2. Multi-Cluster Deployments: OpenShift SDN is well-suited for multi-cluster deployments, where containers are distributed across multiple clusters. It simplifies network management and enables seamless inter-cluster communication.

In conclusion, OpenShift SDN is a game-changer in the world of container networking. Its software-defined approach, coupled with advanced features and benefits, empowers organizations to build scalable, secure, and resilient containerized environments. Whether you are deploying microservices or managing multi-cluster setups, OpenShift SDN has got you covered. So, embrace the power of OpenShift SDN and unlock new possibilities for your containerized applications!

Chaos Engineering

Chaos Engineering Kubernetes

Chaos Engineering Kubernetes

In the ever-evolving world of Kubernetes, maintaining system reliability is of paramount importance. Chaos Engineering has emerged as a powerful tool to proactively identify and address weaknesses in distributed systems. In this blog post, we will explore the concept of Chaos Engineering and its application in the Kubernetes ecosystem, focusing on the various metric types that play a crucial role in this process.

Chaos Engineering is a discipline that involves intentionally injecting failures and disturbances into a system to uncover vulnerabilities and enhance its resilience. By simulating real-world scenarios, Chaos Engineering allows organizations to identify potential weaknesses and develop robust strategies to mitigate failures. In the context of Kubernetes, Chaos Engineering provides a proactive approach to detect and address potential issues before they impact critical services.

Metrics are indispensable in Chaos Engineering as they provide crucial insights into the behavior and performance of a system. In Kubernetes, various metric types are utilized to measure different aspects of system reliability. These include latency metrics, error rate metrics, resource utilization metrics, and many more. Each metric type offers unique perspectives on system performance and aids in identifying areas that require improvement.

Latency metrics play a vital role in Chaos Engineering as they help measure the response time of services within a Kubernetes cluster. By intentionally increasing latency, organizations can gauge the impact on overall system performance and identify potential bottlenecks. This enables them to optimize resource allocation, enhance scalability, and improve the overall user experience.

Error rate metrics provide valuable insights into the stability and resiliency of a Kubernetes system. By intentionally introducing errors, Chaos Engineering practitioners can analyze the system's response and measure the impact on end-users. This allows organizations to proactively identify and fix vulnerabilities, ensuring uninterrupted service delivery and customer satisfaction.

Resource utilization metrics enable organizations to manage and optimize the allocation of resources within a Kubernetes cluster. By simulating resource-intensive scenarios, Chaos Engineering practitioners can evaluate the impact on system performance and resource allocation. This helps in identifying potential inefficiencies, optimizing resource utilization, and ensuring the scalability and stability of the system.

Chaos Engineering has emerged as a crucial practice in the Kubernetes ecosystem, empowering organizations to enhance system reliability and resilience. By leveraging various metric types such as latency metrics, error rate metrics, and resource utilization metrics, organizations can proactively identify weaknesses and implement effective strategies to mitigate failures. Embracing Chaos Engineering in Kubernetes enables organizations to build robust, scalable, and reliable systems that can withstand unexpected challenges.

Highlights: Chaos Engineering Kubernetes

The Principles of Chaos Engineering

Chaos Engineering is grounded in several core principles. First, it is about embracing failure as a learning opportunity. By simulating failures, teams can understand how their systems react under stress. Another principle is to conduct experiments in a controlled environment, ensuring any disruptions do not affect end users. Lastly, Chaos Engineering emphasizes continuous learning and improvement, turning insights from experiments into actionable changes.

**Tools and Techniques**

Implementing chaos engineering requires a robust set of tools and techniques to simulate real-world disruptions effectively. Some popular tools in the chaos engineering toolkit include Chaos Monkey, Gremlin, and Litmus, each offering unique features to test system resilience. These tools enable engineers to automate experiments, analyze results, and derive actionable insights. By using these advanced tools, organizations can run experiments at scale and gain confidence in their systems’ ability to handle unpredictable events.

**Why Chaos Engineering Matters**

Modern systems are complex and often distributed across multiple platforms and services. This complexity makes it difficult to predict how systems will behave under unexpected conditions. Chaos Engineering allows teams to identify hidden vulnerabilities, improve system reliability, and enhance user experience. Companies like Netflix and Amazon have successfully used Chaos Engineering to maintain their competitive edge by ensuring their services remain robust and available.

**The Traditional Application**

When considering Chaos Engineering Kubernetes, we must start from the beginning. Not too long ago, applications ran in a single private data center, potentially two data centers for high availability. These data centers were on-premises, and all components were housed internally. Life was easy, and troubleshooting and monitoring any issues could be done by a single team, if not a single person, with predefined dashboards. Failures were known, there was a capacity planning strategy that did not change much, and you could carry out a standard dropped packet test.

**A Static Infrastructure**

The network and infrastructure had fixed perimeters and were pretty static. There weren’t many changes to the stack, for example, daily. Agility was at an all-time low, but that did not matter for the environments in which the application and infrastructure were housed. However, nowadays, we are in a completely different environment with many more moving parts with an increasing need to support a reliable distributed system.

Complexity is at an all-time high, and business agility is critical. Now, we have distributed applications with components and services located in many different places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. So, in this land of complexity, we must find system reliability. A reliable system is one that you can trust to behave as expected, even when individual parts fail.

Implementing Chaos Engineering

Implementing Chaos Engineering involves several steps. First, start by defining a “steady state” of your system’s normal operations. Next, develop hypotheses about potential failure points and design experiments to test these. Use tools like Chaos Monkey or Gremlin to introduce controlled disruptions and observe the outcomes. Finally, analyze the results and apply any necessary changes to improve system resilience.

Challenges and Considerations

While Chaos Engineering offers many benefits, it also presents challenges. Resistance from stakeholders who are wary of intentional disruptions can be a hurdle. Additionally, ensuring that experiments are conducted safely and do not impact customers is crucial. To overcome these challenges, it’s important to foster a culture of transparency, continuous learning, and collaboration among teams.

Benefits of Chaos Engineering in Kubernetes:

1. Enhanced Reliability: By subjecting Kubernetes deployments to controlled failures, Chaos Engineering helps organizations identify weak points and vulnerabilities, enabling them to build more resilient systems that can withstand unforeseen events.

2. Improved Incident Response: Chaos Engineering allows organizations to test and refine their incident response processes by simulating real-world failures. This helps teams understand how to quickly detect and mitigate potential issues, reducing downtime and improving the overall incident response capabilities.

3. Cost Optimization: Chaos engineering can help optimize resource utilization within a Kubernetes cluster by identifying and addressing performance bottlenecks and inefficient resource allocation. This, in turn, leads to cost savings and improved efficiency.

Personal Note: Today’s standard explanation for Chaos Engineering is “The facilitation of experiments to uncover systemic weaknesses.” The following is true for Chaos Engineering.

  1. Begin by defining “steady state” as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will persist in both the control and experimental groups.
  3. Introduce variables that mirror real-world events, such as servers that crash, hard drives that malfunction, severed network connections, etc.
  4. Then, as a final step, try to disprove the hypothesis by looking for a steady-state difference between the control and experimental groups.

Chaos Engineering Scenarios in Kubernetes:

Chaos Engineering in Kubernetes involves deliberately introducing failures into the system to observe how it behaves and recovers. This proactive approach enables teams to understand the system’s response to unexpected disruptions, whether it’s a pod failure, network latency, or a node shutdown.

**Implementing Chaos Experiments in Kubernetes**

To harness the full potential of Chaos Engineering, it’s essential to implement chaos experiments thoughtfully. Start by identifying critical components of your Kubernetes cluster that need testing. Tools like LitmusChaos and Chaos Mesh provide a framework for defining and executing chaos experiments. These experiments simulate disruptions such as pod deletions, CPU stress, and network partitioning, allowing teams to evaluate system behavior and improve fault tolerance.

**Chaos Engineering Kubernetes**

1. Pod Failures: Simulating failures of individual pods within a Kubernetes cluster allows organizations to evaluate how the system responds to such events. By randomly terminating pods, Chaos Engineering can help ensure that the system can handle pod failures gracefully, redistributing workload and maintaining high availability.

2. Network Partitioning: Introducing network partitioning scenarios can help assess the resilience of a Kubernetes cluster. By isolating specific nodes or network segments, Chaos Engineering enables organizations to test how the group reacts to network disruptions and evaluate the effectiveness of load balancing and failover mechanisms.

3. Resource Starvation: Chaos Engineering can simulate resource scarcity scenarios by intentionally consuming excessive resources, such as CPU or memory, within a Kubernetes cluster. This allows organizations to identify potential performance bottlenecks and optimize resource allocation strategies.

Understanding GKE-Native Monitoring

GKE-Native Monitoring is a comprehensive solution designed for Google Kubernetes Engine (GKE). It empowers developers and operators with real-time insights into the health and performance of their Kubernetes clusters, pods, and containers. Leveraging the power of Prometheus, GKE-Native Monitoring offers a wide range of monitoring metrics, such as CPU and memory utilization, network traffic, and custom application metrics.

GKE-Native Monitoring: Features

GKE-Native Monitoring boasts several key features that enhance observability and troubleshooting capabilities. One notable feature is the ability to create custom dashboards, enabling users to visualize and analyze critical metrics tailored to their needs. GKE-Native Monitoring integrates seamlessly with Stackdriver, Google Cloud’s unified monitoring and logging platform, allowing for centralized management and alerting across multiple GKE clusters.

The Rise of Chaos Engineering

Infrastructure is becoming increasingly complex, and let’s face it, a lot can go wrong. It’s imperative to have a global view of all the infrastructure components and a good understanding of the application’s performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and trying to validate the health of each piece manually is hard.

With these new environments, especially cloud-native at scale, complexity is at its highest, and many more things can go wrong. For this reason, you must prepare as much as possible so the impact on users is minimal.

So, the dynamic deployment patterns you get with frameworks like Kubernetes allow you to build better applications. But you need to be able to examine the environment and see if it is working as expected. Most importantly, to prepare effectively, you need to implement a solid strategy for monitoring in production environments.

Chaos Engineering Kubernetes

For this, you need to understand practices and tools like Chaos Engineering and how they can improve the reliability of the overall system. Chaos Engineering is the ability to perform tests in a controlled way. Essentially, we intentionally break things to learn how to build more resilient systems.

So, we are injecting faults in a controlled way to make the overall application more resilient by injecting various issues and faults. It comes down to a trade-off and your willingness to accept it. There is a considerable trade-off with distributed computing. You have to monitor efficiently, have performance management, and, more importantly, accurately test the distributed system in a controlled manner. 

Service Mesh Chaos Engineering

A Service Mesh is one option for implementing Chaos Engineering. You can also implement Chaos Engineering with Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates tests in the Kubernetes environment. The Chaos Mesh project offers a rich selection of experiment types, such as Pod lifecycle tests, network tests, Linux kernel and I/O tests, and many other stress tests.
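
As a hedged sketch of what a Chaos Mesh experiment can look like, the PodChaos resource below randomly kills one pod matching a hypothetical app=web label in a hypothetical demo namespace, assuming Chaos Mesh is installed in the chaos-testing namespace:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo           # hypothetical experiment name
  namespace: chaos-testing      # assumes Chaos Mesh is installed here
spec:
  action: pod-kill              # terminate the selected pod
  mode: one                     # affect a single pod from the selection
  selector:
    namespaces:
      - demo                    # hypothetical target namespace
    labelSelectors:
      app: web                  # hypothetical target label
EOF
```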

Implementing practices like Chaos Engineering will help you understand and manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems. 

Cloud Service Mesh:

### The Role of Cloud Service Mesh in Modern Architecture

Traditional monolithic applications are being replaced by microservices architectures, which break down applications into smaller, independent services. While this shift offers numerous benefits such as scalability and flexibility, it also introduces complexities in managing the interactions between these services. This is where a Cloud Service Mesh comes into play. It provides a structured method to handle service discovery, load balancing, failure recovery, metrics, and monitoring without adding heavy burdens to the developers.

### Google’s Pioneering Efforts in Cloud Service Mesh

Google has been at the forefront of cloud innovation, and its contributions to the development of Cloud Service Mesh technologies are no exception. Through its open-source project Istio, Google has provided a robust framework for managing microservices.

Istio simplifies the observability, security, and management of microservices, making it easier for enterprises to adopt and scale their cloud-native applications. By integrating Istio with their Google Kubernetes Engine (GKE), Google has made it seamless for organizations to deploy and manage their service meshes.

### Enhancing Reliability with Chaos Engineering

One of the most intriguing aspects of utilizing a Cloud Service Mesh is its ability to enhance system reliability through Chaos Engineering. This practice involves intentionally injecting faults into the system to test its resilience and identify potential weaknesses.

By simulating real-world failures, organizations can better prepare for unexpected issues and ensure their services remain robust under pressure. The granular control provided by a Cloud Service Mesh makes it an ideal platform for implementing Chaos Engineering practices, helping to create more resilient and dependable cloud environments.

Related: Before you proceed to the details of Chaos Engineering, you may find the following useful:

  1. Service Level Objectives (slos)
  2. Kubernetes Networking 101
  3. Kubernetes Security Best Practice
  4. Network Traffic Engineering
  5. Reliability In Distributed System
  6. Distributed Systems Observability

Chaos Engineering Kubernetes

Monitoring & Troubleshooting 

Beyond the Complexity Horizon

Therefore, monitoring and troubleshooting are much more demanding, as everything is interconnected, making it difficult for a single person in one team to understand what is happening entirely. The edge of the network and application boundary surpasses one location and team. Enterprise systems have gone beyond the complexity horizon, and you can’t understand every bit of every single system.

Even if you are a developer closely involved with the system and truly understand the nuts and bolts of the application and its code, no one can understand every bit of every single system. So, finding the correct information is essential, but once you find it, you have to give it to those who can fix it. Monitoring is not just about finding out what is wrong; it needs to alert, and those alerts need to be actionable.

Troubleshooting: Chaos engineering kubernetes

– Chaos Engineering aims to improve a system’s reliability by ensuring it can withstand turbulent conditions. Chaos Engineering makes Kubernetes more secure. So, if you are adopting Kubernetes, you should adopt Chaos Engineering as an integral part of your monitoring and troubleshooting strategy.

– Firstly, we can pinpoint the application errors and understand, at best, how these errors arose. This could be anything from badly ordered scripts on a web page to a database query with bad SQL calls, or even unoptimized code-level issues.

– Or there could be something more fundamental going on. It is common to have issues with how something is packaged into a container: you might pull in the incorrect libraries or even use a debug version of the container image.

– Or there could be nothing wrong with the packaging and containerization of the container; it is all about where the container is being deployed. There could be something wrong with the infrastructure, either a physical or logical problem—incorrect configuration or a hardware fault somewhere in the application path.

**Non-ephemeral and ephemeral services**

With the introduction of containers and microservices observability, monitoring solutions need to manage both non-ephemeral and ephemeral services. We are collecting data for applications that consist of many different services.

So when it comes to container monitoring and performing chaos engineering kubernetes tests, we need to understand the nature of the full application that runs on top. Everything is dynamic by nature, so you need monitoring and troubleshooting in place that can handle this dynamic and transient behavior. When monitoring a containerized infrastructure, you should consider the following.

A: Container Lifespan: Containers have a short lifespan; they are provisioned and commissioned based on demand. This is compared to VMs or bare-metal workloads, which generally have a longer lifespan. As a generic guideline, containers have an average lifespan of 2.5 days, while traditional and cloud-based VMs have an average lifespan of 23 days. Containers can move, and they do move frequently.

One day, we could have workload A on cluster host A, and the next day or even on the same day, the same cluster host could be hosting Application workload B. Therefore, different types of impacts could depend on the time of day.

B: Containers are Temporary: Containers are dynamically provisioned for specific use cases temporarily. For example, we could have a new container based on a specific image. New network connections, storage, and any integrations to other services that make the application work will be set up for that container. All of this is done dynamically and can be done temporarily.

Example: Orchestration with Docker Swarm

C: Different monitoring levels: In a Kubernetes environment, there are many monitoring levels. The components that make up the Kubernetes deployment will affect application performance. We have nodes, pods, and application containers. We also monitor at different levels, such as the VM, storage, and microservice levels.

D: Microservices change fast and often: Microservices consist of constantly evolving apps. New microservices are added, and existing ones are decommissioned quickly. So, what does this mean to usage patterns? This will result in different usage patterns on the infrastructure. If everything is often changing, it can be hard to derive the baseline and build a topology map unless you have something automatic in place. 

E: Metric overload: We now have loads of metrics, including additional metrics for the different containers and infrastructure levels. We must consider metrics for the nodes, cluster components, cluster add-on, application runtime, and custom application metrics. This is compared to a traditional application stack, where we use metrics for components such as the operating system and the application. 

Metric Explosion

Note: Prometheus Scraping Metrics

In the traditional world, we didn’t have to be concerned with the additional components such as an orchestrator or the dynamic nature of many containers. With a container cluster, we must consider metrics from the operating system, application, orchestrator, and containers.  We refer to this as a metric explosion. So now we have loads of metrics that need to be gathered. There are also different ways to pull or scrape these metrics.

Prometheus is the expected choice in the world of Kubernetes and uses a very scalable pull approach to gather those metrics from HTTP endpoints, either through Prometheus client libraries or exporters.

Prometheus YAML file
Diagram: Prometheus YAML file
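
Since the diagram above shows a Prometheus YAML file, here is a rough sketch of what such a scrape configuration can look like for Kubernetes; the annotation convention shown is common practice rather than a requirement:

```bash
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                    # discover scrape targets from the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep                 # scrape only pods annotated prometheus.io/scrape: "true"
        regex: "true"
EOF
```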

**A key point: What happens to visibility**  

So, we need complete visibility now more than ever—not just for single components but visibility at a holistic level. Therefore, we need to monitor a lot more data points than we had to in the past. We need to monitor the application servers, Pods and containers, clusters running the containers, the network for service/pod/cluster communication, and the host OS.

All of the monitoring data needs to be in a central place so trends can be seen and different queries to the data can be acted on. Correlating local logs would be challenging in a sizeable multi-tier application with docker containers. We can use Log forwarders or Log shippers, such as FluentD or Logstash, to transform and ship logs to a backend such as Elasticsearch.

**A key point: New avenues for monitoring**

Containers are the norm for managing workloads and adapting quickly to new markets. Therefore, new avenues have opened up for monitoring these environments. We have, for example, AppDynamics, and Elasticsearch as part of the ELK stack, along with the various log shippers that can provide a welcome layer of unification. We also have Prometheus to gather metrics; keep in mind that Prometheus works only in the land of metrics. There are different ways to visualize all this data, such as Grafana and Kibana.

A: – Microservices Complexity

With the wave towards microservices, we get the benefits of scalability and business continuity, but managing them is very complex. The monolith is much easier to manage and monitor. Also, as separate components, they don’t need to be written in the same language or toolkits, so you can mix and match different technologies.

This approach offers a lot of flexibility, but it can increase latency, and the many additional moving parts increase complexity.

We have, for example, reverse proxies, load balancers, firewalls, and other infrastructure support services. What used to be method calls or interprocess calls within the monolith host now go over the network and are susceptible to deviations in latency. 

B: – Debugging Microservices

With the monolith, the application is simply running in a single process, and it is relatively easy to debug. Many traditional tooling and code instrumentation technologies have been built, assuming you have the idea of a single process. However, with microservices, we have a completely different approach with a distributed application.

Now, your application has multiple processes running in other places. The core challenge is trying to debug microservices applications.

So much of the tooling we have today has been built for traditional monolithic applications. So, there are new monitoring tools for these new applications, but there is a steep learning curve and a high barrier to entry. New tools and technologies such as distributed tracing and chaos engineering kubernetes are not the simplest to pick up on day one.

C: – Automation and monitoring

Automation comes into play with the new environment. With automation, we can do periodic checks not just on the functionality of the underlying components but also implement health checks of how the application performs. All can be automated for specific intervals or in reaction to certain events.

With the rise of complex systems and microservices, it is more important than ever to have real-time performance monitoring and metrics that tell you how the systems behave. For example, what is the usual RTT, and how long do transactions take under normal conditions?

Summary: Chaos Engineering Kubernetes

Chaos Engineering has emerged as a robust methodology to identify weaknesses and vulnerabilities in complex systems proactively. In the realm of Kubernetes, where orchestration and scalability are paramount, Chaos Engineering can play a crucial role in ensuring the resilience and reliability of your applications. In this blog post, we explored the concept of Chaos Engineering, its relevance to Kubernetes, and how you can harness its power to improve the robustness of your Kubernetes deployments.

Understanding Chaos Engineering

Chaos Engineering is a disciplined approach to experimenting and testing systems under realistic failure conditions. By injecting controlled failures into a system, Chaos Engineering aims to uncover weaknesses, validate assumptions, and build confidence in the system’s ability to handle unexpected events. This methodology is particularly valuable in distributed and complex systems like Kubernetes, where failures can have a widespread impact.

Chaos Engineering Principles

Certain principles should be followed to effectively practice chaos engineering. These include defining a steady-state hypothesis, identifying meaningful metrics to measure impact, designing experiments with well-defined blast radius, automating the chaos experiments, and continually evolving the experiments based on learnings. By adhering to these principles, Chaos Engineering becomes a systematic and iterative process that enhances the resilience of your Kubernetes deployments.

Chaos Engineering Tools for Kubernetes

Several tools and frameworks have been developed specifically for Chaos Engineering in Kubernetes. One notable example is LitmusChaos, an open-source framework that provides a wide range of chaos experiments for Kubernetes. With LitmusChaos, you can simulate various failure scenarios, such as pod failures, network disruptions, and resource exhaustion, to validate the resilience of your Kubernetes clusters and applications.

Best Practices for Chaos Engineering in Kubernetes

To get the most out of Chaos Engineering in Kubernetes, it is important to follow some best practices. These include starting with small, controlled experiments, involving all stakeholders in the chaos engineering process, gradually increasing the complexity of experiments, monitoring and analyzing the impact of chaos experiments, and using chaos engineering as a continuous improvement practice. By embracing these best practices, you can effectively leverage Chaos Engineering to build resilient and robust Kubernetes deployments.

Conclusion:

Chaos Engineering is a powerful approach to ensure the reliability and resilience of Kubernetes deployments. By systematically injecting controlled failures, Chaos Engineering helps uncover vulnerabilities and strengthens the overall system. With the availability of specialized tools and adherence to best practices, Kubernetes users can harness the full potential of Chaos Engineering to build more reliable, scalable, and resilient applications.

Docker security

Docker Container Security: Building a Sandbox

Docker Container Security

In the vast ocean of software development, Docker has emerged as a powerful tool for containerization. With its ability to package applications and their dependencies into self-contained units, Docker offers enhanced portability and scalability. However, as with any technology, security considerations are paramount. In this blog post, we will explore the importance of Docker container security and provide practical tips to safeguard your ship from potential vulnerabilities.

Containerization revolutionized software deployment, but it also introduced new security challenges. Docker containers share the host's operating system kernel, which means that a vulnerability within one container can potentially impact others. It's crucial to grasp the fundamentals of Docker container security to mitigate these risks effectively.

To sail smoothly in the realm of Docker container security, following best practices is crucial. We will delve into key recommendations such as minimizing the attack surface, implementing secure images, enforcing access controls, and regular vulnerability scanning. These practices will fortify the defenses of your containers and keep your ship afloat.

Docker provides various security tools and features to bolster your containerized applications. We will explore tools like Docker Content Trust, which ensures the integrity and authenticity of images, and Docker Security Scanning, a service that identifies vulnerabilities in your container images. By harnessing these tools, you can add an extra layer of protection to your Docker environment.

Container orchestration platforms such as Kubernetes have become the de facto choice for managing containerized applications. In this section, we will discuss how to incorporate security measures into your orchestration workflows. Topics include securing Kubernetes clusters, implementing network policies, and leveraging secrets management for sensitive data.

As you navigate the vast seas of Docker container security, remember that safeguarding your ship requires a proactive approach. By understanding the principles, following best practices, utilizing security tools, and orchestrating securely, you can confidently sail towards a secure Docker environment. May your containers stay afloat, your applications remain protected, and your voyage be prosperous!

Highlights: Docker Container Security

Understanding Docker Security

Docker provides numerous security features, such as isolated containers, resource limitations, and image signing. However, misconfigurations or overlooking certain best practices can leave Docker environments vulnerable to potential threats. Understanding the fundamentals of Docker security is essential to ensure a strong security foundation.

Containerization offers process-level isolation, which enhances security by preventing application conflicts and limiting the potential impact of vulnerabilities. Docker achieves this isolation through the use of namespaces and control groups.

**Docker Container Security: Best Practices and Strategies**

### Understanding Docker and Its Popularity

Docker has revolutionized the way we think about software deployment and scalability. By packaging applications and their dependencies into containers, developers can ensure consistency across various environments. This portability is a primary reason for Docker’s widespread adoption in the tech industry. However, with great power comes great responsibility, and securing these containers is crucial for maintaining the integrity and reliability of applications.

### The Importance of Container Security

As the use of Docker containers grows, so does the potential attack surface. Containers are often deployed in large numbers, and any vulnerability can lead to significant breaches. Understanding the importance of container security is the first step in defending your applications. Ensuring that your containers are secure helps protect sensitive data, maintain application integrity, and uphold your organization’s reputation.

### Best Practices for Securing Docker Containers

1. **Use Official and Trusted Images**: Always start with images from trusted sources, such as Docker Hub’s official repository. These images are maintained and updated regularly, reducing the risk of vulnerabilities.

2. **Regularly Update and Patch**: Just as with any software, keeping your container images up to date is crucial. Regular updates and patches can fix known vulnerabilities and improve overall security.

3. **Limit Container Privileges**: Containers should run with the least privileges necessary. Avoid running containers as the root user and apply the principle of least privilege to minimize potential damage from an attack.

4. **Implement Network Segmentation**: Isolate your containers by using Docker’s network features. This segmentation limits communication to only what is necessary, reducing the risk of lateral movement by attackers.

5. **Monitor and Log Activity**: Continuous monitoring and logging of container activity can help detect unusual patterns or potential breaches. Tools like Docker Security Scanning and third-party solutions can provide insights into container behavior.

Example: Container segmentation

Containers can be part of two networks. Consider the bridge network a standard switch, except that it is virtual: anything attached to it can communicate. So, if a container has two virtual Ethernet cards connected to two different bridges, the container is in two networks.

inspecting container networks
Diagram: Inspecting container networks
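
A short sketch of that idea with the Docker CLI, using hypothetical network and container names and nginx as a stand-in image:

```bash
docker network create frontend-net
docker network create backend-net
docker run -d --name api --network frontend-net nginx     # container starts on one bridge
docker network connect backend-net api                    # attach a second virtual NIC
docker inspect --format '{{json .NetworkSettings.Networks}}' api   # now lists both networks
```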

Personal Note: Docker Security Vendors

In addition to the Docker security best practices that we will discuss soon, leveraging modern tools and technologies can significantly enhance Docker container security. Security solutions such as Aqua Security, Sysdig Secure, and Twistlock offer advanced features like vulnerability scanning, runtime defense, and compliance checks. Incorporating these tools into your security strategy can provide an added layer of protection against sophisticated threats.

Docker Security Key Points

-Building your Docker images on a solid foundation is crucial for container security. Employing secure image practices includes regularly updating base images, minimizing the software installed within the container, and scanning images for vulnerabilities using tools like Docker Security Scanning.

-Controlling access to Docker containers is vital in maintaining their security. This involves utilizing user namespaces, restricting container capabilities, and employing role-based access control (RBAC) to limit privileges and prevent unauthorized access.

-Keeping a close eye on container activities is essential for identifying potential security breaches. Implementing centralized logging and monitoring solutions helps track and analyze container events, enabling timely detection and response to any suspicious or malicious activities.

Docker offers numerous security features, such as isolation, resource management, and process restrictions. However, these features alone may not provide sufficient protection against sophisticated attacks. That’s where SELinux steps in.

A. SELinux: Security Framework

SELinux is a security framework integrated into the Linux kernel. It provides an additional layer of access control policies, enforcing mandatory access controls (MAC) beyond the traditional discretionary access controls (DAC). By leveraging SELinux, administrators can define fine-grained policies that govern processes’ behavior and limit their access to critical resources.

Docker has built-in support for SELinux, allowing users to enable and enforce SELinux policies on containers and their underlying hosts. By enforcing SELinux policies, containers are further isolated from the host system, reducing the potential impact of a compromised container on the overall infrastructure.
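
On an SELinux-enabled host such as RHEL or Fedora, label support can be switched on in the Docker daemon configuration; a minimal sketch (note that this overwrites any existing daemon.json, so merge by hand on a real host):

```bash
# Enable SELinux labeling in the Docker daemon (assumes an SELinux-enforcing host)
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "selinux-enabled": true
}
EOF
sudo systemctl restart docker
docker info --format '{{.SecurityOptions}}'   # should now include name=selinux
```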

B. Namespaces and Control Groups

The building blocks for Docker security options and for implementing Docker security best practices, such as the kernel primitives, have been around for a long time, so they are not all new when considering their security. However, the container itself is not a kernel construct; it is an abstraction built from features of the host operating system kernel. For Docker container security and for building a Docker sandbox, those kernel primitives are the namespaces and control groups that make the container abstraction possible.

-Namespaces provide processes with isolated instances of various system resources. They allow for resource virtualization, ensuring that processes within one namespace cannot interfere with processes in another. Namespaces provide separation at different levels, such as mount points, network interfaces, process IDs, and more. By doing so, they enable enhanced resource utilization and security.

-Control groups, called cgroups, allow for resource limitation, prioritization, and monitoring. They allow administrators to allocate resources, such as CPU, memory, disk I/O, and network bandwidth, to specific groups of processes. Control groups ensure fair distribution of resources among different applications or users, preventing resource hogging and improving overall system performance.

Namespaces and control groups offer a wide range of applications and benefits. In containerization technologies like Docker, namespaces provide the foundation for creating isolated environments, allowing multiple containers to run on a single host without interference. Control groups, on the other hand, enable fine-grained resource control and optimization within these containers, ensuring fair resource allocation and preventing resource contention.

C. The Role of Kernel Primitives

To build a Docker sandbox, Docker uses control groups to constrain a workload's consumption of host resources. As a result, Docker lets you apply resource controls to these container workloads quickly. Fortunately, much of the control group complexity is hidden behind the Docker API, making containers and container networking much easier to use.

Then we have namespaces, which control what a container can see. A namespace allows us to take an OS with all its resources, such as filesystems, and carve it into virtual operating systems called containers. Namespaces act as visibility boundaries, and there are several different namespaces.
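
A quick sketch of both primitives in action, using nginx as a stand-in image and a hypothetical container name:

```bash
# Control groups: cap the container at 256 MB of memory and half a CPU
docker run -d --name demo --memory 256m --cpus 0.5 nginx
docker inspect --format 'mem={{.HostConfig.Memory}} nanocpus={{.HostConfig.NanoCpus}}' demo
# Namespaces: list the isolated namespaces created for the container's main process
PID=$(docker inspect --format '{{.State.Pid}}' demo)
sudo lsns -p "$PID"
```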

Example Techniques: Docker Bench Security

Docker Bench is an open-source tool developed by Docker, Inc. It aims to provide a standardized way of assessing Docker security configurations. By running Docker Bench, we can identify security weaknesses, misconfigurations, and potential vulnerabilities in our Docker setup. The tool provides a comprehensive checklist based on industry best practices, including security recommendations from Docker themselves.

To enhance Docker security using Docker Bench, we must follow a few simple steps. First, we need to install Docker Bench on our host machine. Once installed, we can execute the tool to assess our Docker environment. Docker Bench will evaluate various aspects, such as Docker daemon configuration, host security settings, container configurations, and more. It will then provide a detailed report highlighting areas requiring attention.
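
Running it is straightforward; a sketch, assuming git and sudo access on the Docker host:

```bash
git clone https://github.com/docker/docker-bench-security.git
cd docker-bench-security
sudo sh docker-bench-security.sh   # runs the CIS-based checks against the local Docker host
```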

Privilege Escalation: The Importance of a Sandbox

During an attack, your initial entry point into a Linux system is via a low-privileged account, which provides you with a low-privileged shell. To obtain root-level access, you need to escalate your privileges. This is generally done by starting with enumeration.

Sometimes, your target machine may have misconfigurations that you can leverage to escalate your privileges. Here, we will look for misconfigurations, particularly those that abuse the SUID (Set User ID) permission.

The SUID permission allows low-privileged users to run an executable with the file system permissions of its owner. For example, if an executable is owned by root and has the SUID bit set, it runs with root privileges regardless of which user invokes it.

Note: A quick way to find all executables with SUID permission is to execute the command labeled Number 1 in the screenshot below. I’m just running Ubuntu on a VM. To illustrate how the SUID permission can be abused for privilege escalation, you will use the found executable labeled number 2.

Privilege Escalation

Analysis:

    • After setting the SUID permission, re-run the command to find all executables. You will see /usr/bin/find now appears in the list.
    • Since find has the SUID permission, you can leverage it to execute commands in the root context. 
    • You should now see the contents of the /etc/shadow file. This file is not visible without root permissions. You can leverage additional commands to execute more tasks and gain a high-privilege backdoor from here.
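
For reference, the commands typically used in this kind of lab exercise look like the sketch below; treat it as an illustration for a test VM you own, matching the screenshot described above:

```bash
# 1. Enumerate SUID executables on the host
find / -perm -4000 -type f 2>/dev/null
# 2. If find itself carries the SUID bit (the deliberate misconfiguration in this lab),
#    the commands it executes inherit root's effective UID and can read root-only files
find /etc/shadow -exec cat {} \;
```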

privilege attacks

To mitigate the risks associated with running containers as root, organizations should adopt the best practices covered below. The following example shows running a container as an unprivileged user rather than root; here, we use user ID 1500. Notice how we cannot access the /etc/shadow password file, which requires root privileges to open.
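
A minimal sketch of that test, assuming a generic ubuntu image:

```bash
# Run as UID 1500 instead of root; the unprivileged user cannot read root-only files
docker run --rm --user 1500 ubuntu cat /etc/shadow
# -> cat: /etc/shadow: Permission denied
```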

container security

Proactive Measures to Mitigate Privilege Escalation:

To safeguard against privilege escalation attacks, individuals and organizations should consider implementing the following measures:

1. Regular Software Updates:

Keeping operating systems, applications, and software up to date with the latest security patches helps mitigate vulnerabilities that can be exploited for privilege escalation.

2. Strong Access Controls:

Implementing robust access control mechanisms, such as the principle of least privilege, helps limit user privileges to the minimum level necessary for their tasks, reducing the potential impact of privilege escalation attacks.

3. Multi-factor Authentication:

Enforcing multi-factor authentication adds an extra layer of security, making it more difficult for attackers to gain unauthorized access even if they possess stolen credentials.

4. Security Audits and Penetration Testing:

Regular security audits and penetration testing can identify vulnerabilities and potential privilege escalation paths, allowing proactive remediation before attackers can exploit them.

**Containerized Processes**

Containers are often referred to as “containerized processes.” Essentially, a container is a Linux process running on a host machine. However, the process has a limited view of the host and can access only a subtree of the filesystem. Therefore, it is best to think of a container as a process with a restricted view.

Namespaces provide that limited view, and control groups provide the resource restrictions. From the inside, the container looks similar to a VM, with isolated processes, networking, and filesystem access. From the outside, however, it looks like a normal process running on the host machine.

**Container Security** 

One of the leading security flaws to point out when building a Docker sandbox is that containers, by default, run as root. Notice in the example below that we run a tool on the Docker host, called Docker Bench, that performs an initial security scan. Remember that running containers as root comes with inherent security risks that organizations must consider carefully. Here are some key concerns:
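
If you want to try Docker Bench yourself, one common way to run it (check the project's README for the current invocation) is to clone the repository and execute the script on the Docker host:

```bash
# Fetch and run Docker Bench for Security against the local Docker host.
git clone https://github.com/docker/docker-bench-security.git
cd docker-bench-security
sudo sh docker-bench-security.sh
```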

Note:

1. Exploitation of Vulnerabilities: Running containers as root increases the potential impact of vulnerabilities within the container. If an attacker gains access to a container running as root, they can potentially escalate their privileges and compromise the host system.

2. Escaping Container Isolation: Containers rely on a combination of kernel namespaces, cgroups, and other isolation mechanisms to provide separation from the host and other containers. Running containers as root increases the risk of an attacker exploiting a vulnerability within the container to escape this isolation and gain unauthorized access to the host system.

3. Unauthorized System Access: If a container running as root is compromised, the attacker may gain full access to the underlying host system, potentially leading to unauthorized system modifications, data breaches, or even the compromise of other containers running on the same host.

Containers running as root

Ensuring Kubernetes Security

As with any technology, security is paramount when it comes to Kubernetes. Securing Kubernetes clusters involves multiple layers, starting with the control plane. It’s crucial to limit access to the Kubernetes API server by implementing authentication and authorization mechanisms. Role-Based Access Control (RBAC) is a vital feature that helps define permissions for users and applications, ensuring that only authorized entities can perform specific actions.
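
As a rough illustration of RBAC, here is a minimal sketch that grants read-only access to pods; the "demo" namespace and the "dev-user" subject are placeholders, not values from this post:

```bash
# Create a namespaced Role allowing read-only access to pods, bound to a hypothetical user.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: demo
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: demo
subjects:
  - kind: User
    name: dev-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
EOF
```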

Network policies also play a critical role in securing Kubernetes environments. By restricting communication between pods and services, network policies help mitigate the risk of unauthorized access. Additionally, regular audits and monitoring are essential to detect and respond to potential threats in real-time.

Kubernetes network policy
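
To make the idea concrete, here is a minimal sketch of a default-deny ingress policy; the namespace name is a placeholder:

```bash
# Deny all incoming traffic to every pod in the "demo" namespace unless another policy allows it.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: demo
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
```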

**Best Practices for Managing Kubernetes Clusters**

Effective management of Kubernetes clusters requires adherence to certain best practices. First and foremost, it’s important to keep all Kubernetes components up to date. Regular updates not only provide new features but also address security vulnerabilities.

Another best practice is to implement resource requests and limits for containers. By defining these parameters, you can ensure that applications have the necessary resources to function optimally while preventing any single application from monopolizing the cluster’s resources.
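
A minimal sketch of what such requests and limits look like in a pod spec; the image and values are illustrative:

```bash
# Request a baseline of CPU/memory and cap the maximum the container may consume.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
EOF
```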

Lastly, consider using namespaces to organize resources within a cluster. Namespaces provide a way to divide cluster resources between multiple users or teams, improving resource allocation and management.

Related: For additional pre-information, you may find the following helpful.

  1. Container Based Virtualization
  2. Remote Browser Isolation
  3. Docker Default Networking 101
  4. Kubernetes Network Namespace
  5. Merchant Silicon
  6. Kubernetes Networking 101

Docker Container Security

Understanding Containers

For a long time, big web-scale players have used container technologies to address the weaknesses of the VM model. In the container model, the container is analogous to the VM. However, a significant difference is that containers do not require their own full-blown OS. Instead, all containers running on a single host share that host's OS.

This frees up many system resources, such as CPU, RAM, and storage. Containers are also fast to start and ultra-portable. Consequently, moving containers with their application workloads from your laptop to the cloud, and then to VMs or bare metal in your data center, is a breeze.

Docker Container Diagram
Diagram: Docker Container. Source Docker.

Sandbox containers

– Sandbox containers are a virtualization technology that provides a secure environment for applications and services to run in. They are lightweight, isolated environments that run applications and services safely without impacting the underlying host.

– This virtualization technology enables rapid application deployment while providing a secure environment to isolate, monitor, and control access to data and resources. Sandbox containers are becoming increasingly popular as they offer an easy, cost-effective way to securely deploy and manage applications and services.

– Sandbox containers can also be used for testing, providing a safe and isolated environment for running experiments. In addition, they are highly scalable and make it quick and easy to deploy applications across multiple machines, which suits large-scale projects well. The following figures provide information that is generic to sandbox containers.

Benefits of Using a Docker Sandbox:

1. Replicating Production Environment: A Docker sandbox allows developers to create a replica of the production environment. This ensures the application runs consistently across different environments, reducing the chances of unexpected issues when the code is deployed.

2. Isolated Development Environment: With Docker, developers can create a self-contained environment with all the necessary dependencies. This eliminates the hassle of manually setting up development environments and ensures team consistency.

3. Fast and Easy Testing: Docker sandboxes simplify testing applications in different scenarios. By creating multiple sandboxes, developers can test various configurations, such as different operating systems, libraries, or databases, without interfering with each other.

Docker Sandbox
Diagram: Docker Sandbox.

Understanding the Risks: Docker containers present unique security challenges that must be addressed. We will delve into the potential risks associated with containerization, including container breakouts, image vulnerabilities, and compromised host systems.

Implementing Container Isolation: One fundamental aspect of securing Docker containers is isolating them from each other and the host system. We will explore techniques such as namespace and cgroup isolation and use security profiles to strengthen container isolation and prevent unauthorized access.

Regular Image Updates and Vulnerability Scanning: Keeping your Docker images up to date is vital for maintaining a secure container environment. We will discuss the importance of regularly updating base images and utilizing vulnerability scanning tools to identify and patch potential security vulnerabilities in your container images.

Container Runtime Security: The container runtime environment plays a significant role in container security. We will explore runtime security measures such as seccomp profiles, AppArmor, and SELinux to enforce fine-grained access controls and reduce the attack surface of your Docker containers.

Monitoring and Auditing Container Activities

Effective monitoring and auditing mechanisms are essential for promptly detecting and responding to security incidents. We will explore tools and techniques for monitoring container activities, logging container events, and implementing intrusion detection systems specific to Docker containers.

container attack vectors
Diagram: Container attack vectors. Source Adriancitu

**Step-by-Step Guide to Building a Docker Sandbox**

Step 1: Install Docker:

The first step is to install Docker on your machine. Docker provides installation packages for various operating systems, including Windows, macOS, and Linux. Visit the official Docker website and follow the installation instructions specific to your operating system.

Step 2: Set Up Dockerfile:

A Dockerfile is a text file that contains instructions for building a Docker image. Create a new file named “Dockerfile” and define the base image, dependencies, and any necessary configuration. This file serves as the blueprint for your Docker sandbox.

Step 3: Build the Docker Image:

Once the Dockerfile is ready, you can build the Docker image by running the “docker build” command. This command reads the instructions from the Dockerfile and creates a reusable image that can be used to run containers.

Step 4: Create a Docker Container:

Once you have built the Docker image, you can create a container based on it. Containers are instances of Docker images that can be started, stopped, and managed. Use the “docker run” command to create a container from the image you built.

Step 5: Configure the Sandbox:

Customize the Docker container to match your requirements. This may include installing additional software, setting environment variables, or exposing ports for communication. Use the Docker container’s terminal to make these configurations.

Step 6: Test and Iterate:

Once the sandbox is configured, you can test your application within the Docker environment. Use the container’s terminal to execute commands, run tests, and verify that your application behaves as expected. If necessary, make further adjustments to the container’s configuration and iterate until you achieve the desired results.
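
Putting steps 2 through 6 together, here is a minimal sketch; the image name, base image, and packages are placeholders rather than anything specific to this guide:

```bash
# Step 2: write a minimal Dockerfile (contents are illustrative).
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
CMD ["bash"]
EOF

# Step 3: build the image from the Dockerfile in the current directory.
docker build -t my-sandbox:latest .

# Step 4: create an interactive container from the image.
docker run -it --rm --name sandbox my-sandbox:latest

# Steps 5-6: from the container's shell, install extra tools, set environment
# variables, run your tests, then adjust the Dockerfile and rebuild as needed.
```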

**Docker Container Security: Building a Docker Sandbox**

So, we have some foundational Docker container security that has been around for some time. On the Linux side, features such as namespaces, the control groups we have just mentioned, secure computing (seccomp), AppArmor, and SELinux provide isolation and resource protection. Consider these security technologies the first layer of security, the one closest to the workload. From there, we can add further layers of protection, creating a defense-in-depth strategy.

Container Sandbox
Diagram: Building a Sandbox. Source Aqua

How to create a Docker sandbox environment

As a first layer in creating a Docker sandbox, you must consider the available security module templates. Several security modules can be implemented to enable fine-grained access control over system resources, hardening your containerized environment. More than likely, your distribution ships with security module templates for Docker containers, and you can use these out of the box for some use cases.

However, you may need to tailor the out-of-the-box default templates for other use cases. Templates for secure computing (seccomp), AppArmor, and SELinux will be available. These templates, combined with Dockerfile and workload best practices, give you an extra safety net.

Docker container security and protection

Containers run as root by default.

The first thing to consider when starting with Docker container security is that containers run as root by default and share the kernel of the host OS. They rely on the boundaries created by namespaces for isolation and on control groups to prevent one container from consuming resources excessively. This helps avoid problems like the noisy neighbor, where one application uses up all the resources on a system and prevents other applications from performing adequately on the same host.

In the early days of containers, protection started with namespaces and control groups, and it was not perfect. For example, they cannot prevent all interference through resources that the operating system kernel does not manage.

So, we need to move to a higher abstraction layer with container images. A container image encapsulates your application code along with any dependencies, third-party packages, and libraries. Images are built assets representing all the files needed to run our application on top of the Linux kernel. In addition, images are used to create containers, so we can add further Docker container security here.

Docker Container Security
Diagram: Docker Container Security. Rootless mode. Source Aquasec.

Security concerns. Image and supply chain

To run a container, we need to pull images. Images are pulled locally or from remote registries, and vulnerabilities can creep in here. Your hosts connected to the registry may be secure, but that does not mean the image you are pulling is secure. Traditional security appliances are blind to malware and other image vulnerabilities, as they look for different kinds of signatures. There are several security concerns here.

Users can pull bloated images from untrusted registries, or images containing malware. As a result, we need to consider container threats both at runtime and in the supply chain for adequate container security.

Scanning Docker images during the CI stage provides a quick and short feedback loop on security as images are built. You want to discover unsecured images well before you deploy them and enable developers to fix them quickly rather than wait until issues are found in production.

You should also avoid unsecured images in your testing/staging environments, as they could also be targets for attack. For example, we have image scanning from Aqua, and image assurance can be implemented in several CI/CD tools, including the Codefresh CI/CD platform.

Container image security
Diagram: Container image security. Source Aqua.

Note:

1. Principle of Least Privilege: Avoid running containers as root whenever possible. Instead, use non-root users within the container and assign only the necessary privileges required for the application to function correctly.

2. User Namespace Mapping: Utilize user namespace mapping to map non-root users inside the container to a different user outside the container. This helps provide an additional layer of isolation and restricts the impact of any potential compromise.

3. Secure Image Sources: Ensure container images from trusted sources are obtained in your environment. Regularly update and patch container images to minimize the risk of known vulnerabilities.

4. Container Runtime Security: Implement runtime security measures such as container runtime security policies, secure configuration practices, and regular vulnerability scanning to detect and prevent potential security breaches.

Security concerns: Container breakouts

The container process is visible from the host. Therefore, if a bad actor gains access to the host with the correct privileges, they can compromise all the containers on that host. If one application can read the memory that belongs to your application, it can access your data, so you need to ensure that applications are safely isolated from each other. If applications run on separate physical machines, one cannot access another's memory. From a security perspective, physical isolation is the strongest form, but it is often not possible.

If a host gets compromised, all containers running on the host are potentially compromised, too, especially if the attacker gains root or elevates their privileges, such as a member of the Docker Group.

Your host must be locked down and secured, making container breakouts difficult. Also, remember that it’s hard to orchestrate a container breakout. Still, it is not hard to misconfigure a container with additional or excessive privileges that make a container breakout easy.

Docker Container Security
Diagram: Docker container security. Source Aqua.

The role of the Kernel: Potential attack vector

The Kernel manages its userspace processes and assigns memory to each process. So, it’s up to the Kernel to ensure that one application can’t access the memory allocated to another. The Kernel is hardened and battle-tested, but it is complex, and complexity is the number one enemy of good security. You cannot rule out a bug in how the Kernel manages memory; an attacker could exploit that bug to access the memory of other applications.

**Hypervisor: Better isolation? Kernel attack surface**

So, does the hypervisor give you better isolation than a kernel gives its processes? The critical point is that a kernel is complex and constantly evolving; besides managing memory and device access, it does much more, whereas the hypervisor has a narrower, more specific role. As a result, hypervisors are smaller and more straightforward than whole Linux kernels.

What happens if you compare the lines of code in the Linux kernel to those of an open-source hypervisor? Less code means less complexity, resulting in a smaller attack surface, and a smaller attack surface reduces the likelihood of a bad actor finding an exploitable flaw.

With a shared kernel, userspace processes have some visibility of each other. For example, you can run specific CLI commands and see the processes running on the same machine, and with the correct permissions, you can access information about those processes.

This is a fundamental difference between containers and VMs, and many consider containers the weaker of the two in isolation. With a VM, you can't see one machine's processes from another. The fact that containers share a kernel means they have weaker isolation than VMs. For this reason, from a security perspective, you can place containers inside VMs.

Docker Security Best Practices – Goal1: Strengthen isolation: Namespaces

One of the main building blocks of containers is a Linux construct called the namespace, which provides a security layer for your applications running inside containers. By putting a process in a namespace, you can limit what that process can see. A namespace fools a process into believing it has exclusive access to a set of resources; in reality, other processes in their own namespaces access similar resources within their own isolated environments, and the resources ultimately belong to the host system (a short demonstration follows below).
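
A quick way to see this restricted view in practice, assuming Docker and the util-linux unshare tool are available:

```bash
# Inside a container, the PID namespace hides the host's processes: ps sees only PID 1 and itself.
docker run --rm busybox ps

# The same idea with plain Linux tooling: unshare creates new PID and mount namespaces,
# so the process tree visible inside the new shell is tiny compared with the host's.
sudo unshare --pid --fork --mount-proc /bin/bash -c 'ps -ef'
```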

Docker Security Best Practices – Goal2: Strengthen isolation: Access control

Access control is about managing who can access what on a system. Linux inherited its Discretionary Access Control (DAC) features from Unix. Unfortunately, they are constrained, and there are only a few ways to control access to objects. If you want a more fine-grained approach, there is Mandatory Access Control (MAC), which is policy-driven and granular for many object types.

We have a few solutions for MAC. SELinux was merged into the kernel in 2003 and AppArmor in 2010. These are the most popular in the Linux domain, and both are implemented as modules via the LSM (Linux Security Modules) framework.

SELinux was created by the National Security Agency (NSA) to protect systems and was integrated into the Linux kernel. It is a Linux kernel security module that provides access controls, integrity controls, and role-based access control (RBAC).

Docker Security Best Practices – Goal3: Strengthen isolation: AppArmor

AppArmor applies access control on an application-by-application basis. To use it, you associate an AppArmor security profile with each program. For containers, Docker loads a default profile; keep in mind that this applies to containers, not to the Docker daemon itself. The default profile is called docker-default, and Docker describes it as moderately protective while providing broad application compatibility.

So, instantiating a container uses the docker-default policy unless you override it with the --security-opt flag. This policy is crafted for the general use case and is applied to all container workloads if the host has AppArmor enabled (a CLI sketch follows below).
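
A short sketch of how the profile is applied and overridden on the CLI; "my-profile" is a placeholder for a profile you have already loaded on the host:

```bash
# On a host with AppArmor enabled, Docker applies docker-default automatically;
# reading the process's AppArmor attribute typically shows "docker-default (enforce)".
docker run --rm ubuntu cat /proc/self/attr/current

# Override the profile for a container with one you have already loaded on the host.
docker run --rm --security-opt apparmor=my-profile ubuntu true
```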

Docker Security Best Practices – Goal4: Strengthen isolation: Control groups

Containers should not starve other containers of memory or other host resources. We can use control groups to limit the resources available to different Linux processes. Control groups govern host resources and are essential for fending off denial-of-service attacks. If a process is allowed to consume unlimited memory, for example, it can starve other processes on the same host of that resource.

This could happen inadvertently through a memory leak in the application, or maliciously through a resource exhaustion attack that takes advantage of such a leak. By default, a container can fork as many processes (PIDs) as the maximum configured for the host kernel.

Left unchecked, this is a significant avenue for denial of service. A container should be limited, via the CLI, to the number of processes it actually requires. The PID control group subsystem caps the number of processes allowed within a control group, preventing fork-bomb attacks (a CLI sketch follows below).
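
A minimal sketch of applying such limits from the Docker CLI; the values are illustrative, not recommendations:

```bash
# Limit memory and CPU so one container cannot starve its neighbours.
docker run --rm --memory=256m --cpus=0.5 ubuntu echo "memory and CPU capped"

# Cap the number of processes the container may create, blunting fork-bomb style DoS.
docker run --rm --pids-limit=100 ubuntu echo "pids capped"
```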

Docker Security Best Practices – Goal5: Strengthen isolation: Highlighting system calls 

System calls run in kernel space, at the highest privilege level, alongside the kernel and device drivers, while a user application runs in user space with fewer privileges. When an application running in user space needs to carry out a task such as cloning a process, it does so via the kernel, which performs the operation on behalf of the userspace process. This interface represents an attack surface for a bad actor to probe.

Docker Security Best Practices – Goal6: Security standpoint: Limit the system calls

So, you want to limit the system calls available to an application. If a process is compromised, it may invoke system calls it would not ordinarily use, which could lead to further compromise. You should aim to remove system calls that are not required and so reduce the available attack surface, lowering the risk to the containerized workloads.

Docker Security Best Practices – Goal7: Secure Computing Mode

Secure Computing Mode (seccomp) is a Linux kernel feature that restricts the actions available within containers. There are over 300 syscalls in the Linux system call interface, and your container is unlikely to need access to all of them. For instance, if you don't want containers to change kernel modules, they do not need to call create_module, delete_module, or init_module. Seccomp profiles are applied to a process and determine whether a given system call is permitted; here, we can allowlist or blocklist a set of system calls.

The default seccomp profile sets the Kernel’s action when a container process attempts to execute a system call. An allowed action specifies an unconditional list of permitted system calls.

For Docker container security, the default seccomp profile blocks over 40 syscalls without ill effects on most containerized applications. You may want to tailor this to suit your security needs, restrict it further, and limit your container to a smaller set of syscalls. Ideally, each application has a seccomp profile that permits precisely the syscalls it needs to function, following the security principle of least privilege (a CLI sketch follows below).
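
A brief sketch of how seccomp profiles are selected on the Docker CLI; profile.json is a placeholder for a profile you author for your own workload:

```bash
# Docker applies its default seccomp profile automatically when the kernel supports seccomp.
docker run --rm ubuntu echo "default seccomp profile in effect"

# Supply a tightened custom profile instead of the default.
docker run --rm --security-opt seccomp=profile.json ubuntu echo "custom profile"

# Disabling seccomp is possible but widens the attack surface; shown only to illustrate the flag.
docker run --rm --security-opt seccomp=unconfined ubuntu echo "no seccomp filtering"
```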

Summary: Docker Container Security

Docker containers have become a cornerstone of modern application development and deployment in today’s digital landscape. With their numerous advantages, it is imperative to address the importance of container security. This blog post delved into the critical aspects of Docker container security and provided valuable insights to help you safeguard your containers effectively.

Understanding the Basics of Container Security

Containerization brings convenience and scalability, but it also introduces unique security challenges. This section will highlight the fundamental concepts of container security, including container isolation, image vulnerabilities, and the shared kernel model.

Best Practices for Securing Docker Containers

To fortify your Docker environment, it is crucial to implement a set of best practices. This section will explore various security measures, such as image hardening, the least privilege principle, and container runtime security. We will also discuss the significance of regular updates and vulnerability scanning.

Securing Container Networks and Communication

Container networking ensures secure communication between containers and the outside world. This section delves into strategies such as network segmentation, container firewalls, and secure communication protocols. Additionally, we touch upon the importance of monitoring network traffic for potential intrusions.

Container Image Security Scanning

The integrity of container images is of utmost importance. This section highlights the significance of image-scanning tools and techniques to identify and mitigate vulnerabilities. We explored popular image-scanning tools and discussed how to integrate them seamlessly into your container workflow.

Managing Access Control and Authentication

Controlling access to your Docker environment is critical for maintaining security. This section covered essential strategies for managing user access, implementing role-based access control (RBAC), and enforcing robust authentication mechanisms. We will also touch upon the concept of secrets management and protecting sensitive data within containers.

Conclusion:

In conclusion, securing your Docker containers is a multifaceted endeavor that requires a proactive approach. By understanding the basics of container security, implementing best practices, ensuring container networks, scanning container images, and managing access control effectively, you can significantly enhance the security posture of your Docker environment. Remember, container security is ongoing, and vigilance is critical to mitigating potential risks and vulnerabilities.

Prometheus monitoring example

Prometheus Metric Types

Prometheus Metric Types

Prometheus, the popular open-source monitoring and alerting toolkit, offers a wide range of metric types to collect and analyze data. In this blog post, we will delve into the various Prometheus metric types, their features, and use cases. Whether you're a beginner or an experienced Prometheus user, this guide will provide you with a solid understanding of the different metric types at your disposal.

Counter Metrics: Counter metrics in Prometheus represent a monotonically increasing value. They are commonly used to track the number of events or occurrences over time. A counter only resets to zero when the exposing process restarts, not on each scrape. These metrics are perfect for measuring things like HTTP requests or the number of messages processed by a system.

Gauge Metrics: Unlike counters, gauge metrics in Prometheus can increase or decrease arbitrarily. They are ideal for capturing values that fluctuate over time, such as CPU usage or memory usage. Gauges provide a snapshot of the current state of a particular metric and are widely used for monitoring resource utilization.

Histogram Metrics: Histogram metrics offer insights into the distribution of values over a given range. They allow you to measure things like request durations or response sizes and provide detailed statistical information. Histograms automatically generate buckets to track observations falling within specific value ranges, enabling you to analyze data distribution effectively.

Summary Metrics: Similar to histograms, summary metrics provide statistical information about observed values. However, summaries are specifically designed to calculate percentiles. They are useful for measuring things like request latency, where you want to understand the distribution of response times across different percentiles. Summaries are lightweight and maintain a small sample size to efficiently calculate percentiles.

Highlights: Prometheus Metric Types

Open-Source Monitoring 

Prometheus is an open-source monitoring and alerting toolkit that was developed initially at SoundCloud. It provides a flexible and powerful platform for monitoring various components of your infrastructure, including servers, containers, and applications. Its robust data model, extensive metrics collection capabilities, and adaptable query language make it a preferred choice for many organizations.

It operates based on a time series database model, collecting and storing metrics from various targets, including servers, applications, or any other component that generates data. Prometheus monitoring offers a versatile set of metric types that cater to different monitoring needs.

Prometheus has become indispensable in the world of cloud-native and containerized applications. It efficiently collects and stores metrics as time series data, offering powerful querying capabilities. But to harness Prometheus’s full potential, it’s crucial to understand its metric types, as they are the foundation of effective monitoring and alerting.

**Use Cases and Benefits**

Prometheus monitoring’s diverse metric types offer immense benefits across various use cases. Let’s explore some notable examples:

1. Infrastructure Monitoring: Prometheus can monitor server resource utilization, network traffic, and system health by utilizing counters and gauges. This allows for proactive system maintenance and efficient capacity planning.

2. Application Performance Monitoring (APM): Histograms and summaries enable detailed analysis of application performance metrics, including response times, latency distributions, and error rates. This helps identify bottlenecks and optimize application performance.

3. Alerting and Incident Management: Prometheus monitoring’s flexible alerting system can trigger notifications based on specific metric thresholds or patterns. This enables rapid incident response and proactive issue resolution. Let’s take a closer look at some of the prominent metric types:

**Four Prometheus Metric Types**

1. Counter: Counters are monotonically increasing metrics used to track the number of occurrences of an event. They are handy for measuring things like request counts or error rates.

2. Gauge: Gauges represent a specific value at a particular point in time. They are ideal for tracking metrics that can fluctuate, such as CPU or memory usage.

3. Histogram: Histograms provide insights into the statistical distribution of a set of values. They divide observations into configurable buckets, allowing for analysis of data distribution and quantiles.

4. Summary: Similar to histograms, summaries provide statistical insights. However, they calculate quantiles on the client side instead of the server side, making them more suitable for monitoring long-running tasks or services.

**Choosing the Right Metric Type**

Selecting the appropriate metric type is vital for accurate data representation and analysis. Understanding the use case of each type helps in leveraging their strengths:

– Use **counters** for monotonically increasing values.

– Opt for **gauges** when tracking fluctuating values.

– Choose **histograms** for bucketed distribution analysis.

– Select **summaries** when quantile information is needed.

By aligning your metrics with the right type, you ensure that your monitoring system is both efficient and insightful.

Prometheus Scraping Metrics

Scraping HTTP Endpoints

Prometheus collects these metrics by scraping HTTP endpoints that expose metrics, known as a pull model. Those endpoints can be exposed natively by the monitored component or via exporters built by the community. Prometheus also provides client libraries in a wide range of programming languages that you can use to instrument your own code.

Prometheus can also scrape metrics in the OpenMetrics format. Metrics are exposed over HTTP using a text-based format (the most widely used) or a more robust and efficient protocol buffer format. The text format is human-readable, so you can open it in a browser or retrieve exposed metrics using a tool like curl (see the sketch below).
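
As a quick illustration of the pull model, you can fetch a metrics endpoint by hand; this sketch assumes a Node Exporter listening on its default port 9100, and the sample output is illustrative:

```bash
# Pull metrics from an exporter endpoint manually, exactly as Prometheus does over HTTP.
curl -s http://localhost:9100/metrics | head -n 5
# Typical text-format output looks like:
# # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# # TYPE node_cpu_seconds_total counter
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
```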

Prometheus Data Types

Prometheus’ metric model supports only four metric types, and the distinction between them exists only in the client libraries; the exposition format represents all metric types using one or more underlying Prometheus data types.

Each sample includes a metric name, labels, and a floating-point value; an agent or monitoring backend (Prometheus, for example) adds the timestamp when scraping the metrics. PromQL, short for Prometheus Query Language, is the primary way to query metrics within Prometheus. You can display the result of an expression as a graph or export it using the HTTP API.

PromQL works with three Prometheus data types: scalars, range vectors, and instant vectors. It also uses strings, but only as literals (a query sketch follows below).
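
To see these data types in action, you can send PromQL expressions to the HTTP API; this sketch assumes a Prometheus server on localhost:9090:

```bash
# `up` returns an instant vector: one current sample per scraped target.
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'

# Selecting a range, e.g. [5m], produces a range vector; wrapping it in rate()
# turns it back into an instant vector of per-second rates.
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(prometheus_http_requests_total[5m])'
```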

Related: Before you proceed, you may find the following posts helpful:

  1. Observability vs Monitoring
  2. Microservices Observability

Prometheus Metric Types

For Prometheus metric types, we want as many metrics as possible. These need to be stored to follow trends, understand what has been happening from a historical view, and better predict any issues. So, there are several parts to a Prometheus monitoring solution; we must collect the metrics, known as scraping, store them, and then analyze them. In addition, we need to consider storage security, compliance, and regulatory concerns for distributed systems observability.

Monitoring the correct metrics is key; having metrics lets you see how the system performs. The Prometheus metric types represent raw measurements of resource usage, which can help you plan upgrades and tell you how many resources are being used.

To be clear, there are two kinds of “types” in Prometheus: the metric types of metrics themselves and the data types of PromQL expressions.

Prometheus has four metric types

  1. Counters
  2. Gauges
  3. Histograms 
  4. Summaries 

PromQL subsequently has four data types:

  1. Floats (mostly scalars) 
  2. Range vectors
  3. Instant vectors
  4. Time (though it’s often not counted in this category)

Let me briefly explain them:

Prometheus Metric Types Best Practices:

Now that we have explored the different metric types offered by Prometheus, let's discuss some best practices for using them effectively:

1. Choose the appropriate metric type for your use case. Consider each type’s characteristics and relevance to your monitoring needs.

2. Avoid combining unrelated metrics into a single metric type. This can lead to confusion and make it harder to analyze the data.

3. Use labels to add additional dimensions to your metrics. Labels allow you to differentiate between instances or environments, providing more context to your monitoring data.

Prometheus Pull Approach

It’s easy to monitor a Kubernetes cluster with the pull model because of service discovery and shared network access within the cluster. However, watching a dynamic fleet of virtual machines, AWS Fargate containers, or Lambda functions with Prometheus is hard. How come?

Identifying metrics endpoints for scraping is difficult; network security policies may restrict access to those endpoints. Prometheus Agent Mode was released at the end of 2021 to solve some of these problems. This mode collects metrics and sends them to a monitoring backend via the remote write protocol.
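
A minimal sketch of running Prometheus in Agent Mode; the remote write URL is a placeholder, and the feature flag should be checked against your Prometheus version's documentation:

```bash
# Minimal agent configuration: scrape locally, forward everything via remote_write.
cat > prometheus-agent.yml <<'EOF'
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

remote_write:
  - url: "https://metrics-backend.example.com/api/v1/write"
EOF

# Start Prometheus in Agent Mode (feature flag introduced around v2.32).
prometheus --enable-feature=agent --config.file=prometheus-agent.yml
```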

Prometheus target labels
Diagram: Prometheus Service Discovery and Labels

A) Prometheus Service Discovery

Prometheus discovers targets to scrape via service discovery. These can be your own instrumented applications or third-party applications you scrape via an exporter. The scraped data is stored, and you can use it in dashboards using PromQL or send alerts to the Alertmanager, which converts them into pages, emails, and other notifications. Metrics do not magically spring forth from applications; someone has to add the instrumentation that produces them.

B) Default Prometheus configuration.

In the following lab guide, you will see Prometheus's default configuration. I have run cat prometheus.yml, and you can see that Prometheus is scraping itself for metrics. Prometheus is configured through a single YAML file.

The Prometheus YAML file serves as the configuration file for Prometheus and defines the various aspects of the monitoring setup. It consists of a set of key-value pairs organized in a hierarchical structure, instructing Prometheus on what to scrape, how to scrape it, and how to handle the collected data.

Prometheus YAML file
Diagram: Prometheus YAML file

C) Global Configuration:

This section defines global configurations applicable to the entire Prometheus setup. Key elements in this section include the scrape_interval (specifying the time interval between scrapes) and the evaluation_interval (determining how frequently Prometheus evaluates recording and alerting rules).

D) Scrape Configuration:

Prometheus collects metrics from various targets using scrape configurations. In this section, users can define scrape jobs, specifying the target endpoints, scrape intervals, and other related parameters. Each scrape configuration corresponds to a specific target, such as an application, database, or server.

E) Alerting Rules:

Prometheus allows users to define alerting rules to trigger notifications based on specific conditions. The rules themselves live in separate rule files, which the main YAML file references via the rule_files section. Users can define rules for metric thresholds, anomalies, or any other conditions their monitoring needs require (a configuration sketch follows below).
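
Pulling these pieces together, here is a minimal configuration sketch; the job names, targets, file names, and the alert itself are placeholders:

```bash
# A minimal prometheus.yml combining global settings, scrape jobs, and a rule file reference.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert_rules.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
EOF

# A matching rule file with a single illustrative alert.
cat > alert_rules.yml <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF
```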

Core Metric Types

Core Metric Type
Diagram: Core Metric Type. Source is Sentinelone.

Direct Instrumentation with Client Libraries

Client Libraries
Diagram: Direct Instrumentation. Source is Sentinelone.

Indirect Instrumentation with Exporters

Indirect Instrumentation
Diagram: Indirect Instrumentation with Exporters. Source is Sentinelone.

You must remember that Prometheus is only used to collect and explore metrics. Because it uses a time series data model, data is identified by a metric name and key-value label pairs. Prometheus cannot be used for logging or event-driven architectures that require tracking of individual events. Notice that none of the examples or use cases in the previous sections relate to logs.

Prometheus is suitable for storing CPU usage, latency requests, error rates, or networking bandwidth.

Moreover, Prometheus makes trade-offs with the data it collects. It will prefer to provide 99.99% accurate data (per their documentation) rather than degrade performance or break monitoring systems. You should avoid using it for data that must be perfectly exact, such as a bank account balance, because you may lose some samples.

Starting with Prometheus Metric Types

Metrics can be applied to various components, are a unit of measurement for evaluating an item, and are consistently measured. Examples of common measurements include CPU utilization, memory utilization, and interface utilization. These are numbers about how your resources are performing.

For the metrics side of things, we have runtime, infrastructure, and application metrics, including Prometheus Exporters, response codes, and time-to-serve data. We also have CI/CD pipeline metrics such as build time and failures. Let’s discuss these in more detail.

**The process of exposition**

Exposition is the process of making metrics available to Prometheus. Exposition to Prometheus is done over HTTP. Usually, metrics are exposed under the /metrics path, and a client library handles the request. Prometheus supports both the Prometheus text format and OpenMetrics.

Using a client library is recommended if a suitable one exists for your language; if not, you can produce the exposition format by hand, which is easier with the Prometheus text format. Most libraries support both OpenMetrics and the Prometheus text format.

When metrics are defined, they are usually registered with the default registry. If one of the libraries you depend on has Prometheus instrumentation, you benefit from it automatically. Some users prefer explicitly passing a registry down from the main function; in that case, every library between your application's main function and the Prometheus instrumentation has to be aware of it, which assumes that all libraries in the dependency chain care about instrumentation and agree on which instrumentation libraries to use.

**Detailing: Prometheus Metric Types**

The Prometheus client libraries offer four core metric types. These are currently only differentiated in the client libraries (to enable APIs tailored to the usage of the specific types) and in the wire protocol. The Prometheus server does not yet use the type information and flattens all data into untyped time series. This may change in the future.

a) Counters

The first metric type we’ll explore is the Counter. Counters are monotonically increasing values, meaning they only increase over time. These metrics often measure the number of requests served or the total number of events processed. A counter resets to zero when the instrumented process restarts, and PromQL functions such as rate() account for these resets.

Counter metrics are used for measurements that only ever increase; since they are cumulative, their value can only go up, with the exception that it resets to zero when the process restarts. A counter's raw value is rarely useful on its own; it is usually used to compute the delta or rate of change between two timestamps.

Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead, use a gauge.

See the client library usage documentation for counters for implementation details.

b) Gauges

Next up, we have Gauges. Gauges represent a single numerical value that can arbitrarily go up or down. They are commonly used to track metrics like CPU usage or memory consumption. Unlike counters, gauges hold their value until explicitly changed by the exporter.

Gauge metrics measure values that can increase or decrease. This metric type is more familiar since the actual value is meaningful without additional processing. A gauge is, for instance, a metric that measures temperature, CPU or memory usage, or queue size.

See the client library usage documentation for gauges for implementation details.

c) Histograms

Histograms measure the distribution of values in a given dataset. Prometheus histograms are divided into configurable buckets, allowing you to track metrics such as response times or request sizes. Each bucket captures the number of observations falling into a specific range, making it easier to analyze data distribution.

A histogram represents a distribution of measurements. For example, request durations or response sizes are often measured with them. A histogram counts how many measurements fall into each bucket based on the entire range of measurements.

See the client library usage documentation for histograms for implementation details.

Prometheus Metric Types
Diagram: Prometheus Metric Types. Source timescale.

NOTE: Beginning with Prometheus v2.40, there is experimental support for native histograms. A native histogram requires only one time series, which includes a dynamic number of buckets along with the sum and count of observations. Native histograms allow much higher resolution at a fraction of the cost. Detailed documentation will follow once native histograms are closer to becoming stable.

d) Summary

The Summary metric type is similar to histograms but focuses on calculating quantiles over a sliding time window. This makes it helpful in measuring request durations or API response times. Summaries provide information about the distribution of values through configurable quantiles, together with a running sum and count of observations.

See the client library usage documentation for summaries for implementation details.

**Summary vs. Histogram**

A critical distinction between summaries and histograms is where the work happens. Summaries calculate quantiles on the client from a stream of observations, while histograms only maintain simple per-bucket counters. Summaries are therefore more expensive on the client but give precise quantiles for a single instance, whereas histograms are cheaper and can be aggregated across instances (an exposition-format sketch follows below).
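
To make the four types tangible, this is roughly how each one appears in the text exposition format when you fetch a /metrics endpoint; the port, metric names, and values are illustrative:

```bash
# Fetch a /metrics endpoint and note how each metric type is rendered.
curl -s http://localhost:8000/metrics
# # TYPE http_requests_total counter
# http_requests_total{method="get",code="200"} 1027
#
# # TYPE queue_size gauge
# queue_size 42
#
# # TYPE request_duration_seconds histogram
# request_duration_seconds_bucket{le="0.1"} 240
# request_duration_seconds_bucket{le="0.5"} 290
# request_duration_seconds_bucket{le="+Inf"} 300
# request_duration_seconds_sum 53.4
# request_duration_seconds_count 300
#
# # TYPE rpc_duration_seconds summary
# rpc_duration_seconds{quantile="0.5"} 0.012
# rpc_duration_seconds{quantile="0.9"} 0.044
# rpc_duration_seconds_sum 17.1
# rpc_duration_seconds_count 1400
```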

  • Untyped

The untyped metric type is a catch-all category used when no other metric type fits the collected data. It represents a value that can change over time but does not fall into any specific category. Untyped metrics are helpful when you need flexibility and simplicity, but they lack the more advanced analysis capabilities of the other metric types.

Highlighting Prometheus Monitoring

Auto Scaling Observability

Previously, Heapster was the monitoring solution that came out of the box with Kubernetes. We now have Prometheus as the de facto standard monitoring system for Kubernetes clusters, which brings many benefits. Firstly, Prometheus monitoring scales with a pull approach and with Prometheus federation options. The challenge with a push model is that if we run microservices at scale and every service pushes metrics out to a metrics server, the monitoring traffic can flood the network.

Also, with a push-based approach, you may need to scale up instead of out, which can be costly. We may have many different systems we want to monitor, so the metrics content will differ across systems and components, but Prometheus collects and exposes them all in the same way. This provides a welcome layer of unification for the different systems in the network.

Prometheus Monitoring
Diagram: Prometheus Monitoring. Source is Opcito

Exporters and Client Libraries

With Prometheus monitoring, you can get metrics from the systems you want to monitor using pre-built exporters and custom client libraries. Prometheus works very well with Docker and Kubernetes but can also work outside the container networking world with non-cloud native applications using exporters.

You can monitor your entire stack with a wide range of exporters and client libraries. For cloud-native applications, we install the client library and gather custom application and runtime metrics. Installing instrumentation code in the application lets us see the custom metrics that matter most to us.

What are Prometheus Exporters?

Prometheus exporters are software components that expose metrics in a format that Prometheus can scrape. They act as intermediaries between Prometheus and the systems or applications being monitored.

These exporters collect metrics from various sources, including databases, web servers, cloud platforms, and custom applications. By providing a standardized interface, exporters allow Prometheus to collect and store metrics from diverse sources uniformly.

Prometheus exporters work by exposing a web server that serves metrics in a format known as the Prometheus exposition format. This format consists of simple text-based data, usually in the form of key-value pairs. The exporters periodically expose an HTTP endpoint that Prometheus scrapes to collect the metrics. The scraped metrics are then stored in the Prometheus time-series database, where they can be queried and visualized.
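
As a small example of the exporter pattern, you could run the Node Exporter from its published image and inspect the endpoint Prometheus would scrape; the container name is arbitrary:

```bash
# Run the Node Exporter and check the endpoint Prometheus would pull from (default port 9100).
docker run -d --name node-exporter -p 9100:9100 prom/node-exporter
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 3
```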

Types of Prometheus Exporters:

Prometheus exporters come in various forms, each designed to collect metrics from specific systems or applications. Some popular exporters include:

1. Node Exporter collects system-level metrics such as CPU, memory, disk, and network statistics from Linux or Unix-based systems.

2. Blackbox Exporter: Allows monitoring of network endpoints by performing HTTP, TCP, ICMP, and DNS probes.

3. MySQL Exporter: Collects metrics from MySQL databases, providing insights into query execution time, connection statistics, and more.

4. Redis Exporter: This tool gathers metrics from Redis, a popular in-memory data structure store. It enables monitoring key performance indicators such as memory usage, latency, and throughput.

What are Prometheus Client Libraries?

Prometheus provides client libraries in various programming languages, including, but not limited to, Go, Java, Python, Ruby, and JavaScript. These libraries bridge your application and Prometheus, allowing you to collect and expose relevant metrics effortlessly.

1. Go Library:

Prometheus’ Go client library, known as “client_golang,” is widely used due to its simplicity and robustness. It provides a straightforward API to instrument your Go applications, making it easy to expose custom metrics, collect and record them, and serve them via an HTTP endpoint. You can seamlessly integrate Prometheus monitoring into your Go-based projects with the Go client library.

2. Java Library:

Prometheus offers a dedicated client library for Java-based applications called “client_java.” This library allows you to instrument your Java code using simple annotations, making it effortless to expose metrics. The Java library simplifies monitoring Java applications with features like automatic metric exposition via HTTP endpoints and support for popular Java frameworks like Spring Boot and Micronaut.

3. Python Library:

Prometheus’ Python client library, known as “client_python,” provides a Pythonic way to instrument your Python applications. With its intuitive API, you can easily expose custom metrics, collect and record them, and serve them via an HTTP endpoint. The Python library also supports popular web frameworks like Flask and Django, making it a convenient choice for developers.

4. Ruby Library:

For Ruby developers, Prometheus provides a client library called “client_ruby.” This library allows you to instrument your Ruby applications effortlessly. It offers features like metric collection, recording, and exposing metrics via an HTTP endpoint. The Ruby library integrates well with popular Ruby frameworks like Ruby on Rails, enabling seamless monitoring of Ruby applications.

5. JavaScript Library:

Prometheus’ JavaScript client library, known as “client_js,” enables web developers to instrument their JavaScript-based applications. This library lets you easily expose custom metrics, collect and record them, and serve them via an HTTP endpoint. The JavaScript library is compatible with both browser-based JavaScript and Node.js, making it versatile for monitoring frontend and backend applications.

Additional Metric Types:

  • Prometheus Metric type: Runtime metrics

Runtime metrics are statistics collected by the operating system and the application host. These include CPU usage, memory load, and web server requests. For example, this could be CPU and memory usage from Tomcat and the JVM for a Java app.

  • Prometheus Metric type: Infrastructure metrics

For infrastructure metrics, we examine CPU utilization, latency, bandwidth, memory, and temperature. These metrics should be collected over a long period and applied to infrastructure such as networking equipment, hypervisors, and host-based systems.

  • Prometheus Metric type: Application metrics

Then, we have application metrics: custom statistics relevant only to the application, not the infrastructure. This may include the number of API calls made during a particular period.

These are easy to gather for web-based applications, where status codes provide immediate information. For example, an HTTP status code of 200 is good, while 400 or above indicates an issue.

  • Prometheus Metric type: Time to first byte

Another important metric is how long a web server takes to start responding: time to first byte (TTFB). TTFB is the time between the browser requesting a page and receiving the first byte of the response from the server. If this metric exceeds its usual range, you may need caching, faster storage, or a better CPU (a curl example follows below).

Take the example of a content delivery network (CDN); what is a good time to first byte? On average, a TTFB under 100 ms is fantastic, anything between 200 and 500 ms is standard, and anything between 500 ms and 1 second is less than ideal. Anything greater than 1 second should likely be investigated further.
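
You can measure TTFB from the command line with curl's timing variables; the URL below is a placeholder:

```bash
# time_starttransfer approximates TTFB; time_total is the full transfer time.
curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' https://example.com/
```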

  • Prometheus Metric type: CI/CD pipeline metrics

For the CI/CD Pipeline metrics, we want to measure the time to do the static code analysis. Next, we want to count the number of errors while running the pipeline. Finally, we want to measure the build time and build failures. These metrics include the time it takes to build an application, the time it takes to complete tests, and how often builds fail.  

  • Prometheus Metric type: Docker metrics

Docker metrics come from the Docker platform. These may include container health checks, the number of online and offline nodes in a cluster, and the number of containers and actions. These container actions may be stopped, paused, or run. So, we have built-in metrics provided by Docker to give additional visibility to the running containers. When running containers in production, monitoring their runtime metrics, such as CPU and memory usage, is essential.

Docker Metrics

Metrics from the Docker platform are essential when Docker is stopping and starting applications for you. You shouldn't look at one metric type without the other; for example, if you look only at the application metrics, you see just half of the puzzle and may miss the problem.

For instance, if one of your applications is performing poorly and the Docker platform is constantly spinning up new containers, you would not see that from the application metrics alone; your application and runtime metrics may appear to be within their normal thresholds. Combining them with the Docker platform metrics reveals the container stats, including a spike in container creation.

Exposing application metrics to Prometheus

Application metrics give you additional information. Unlike runtime metrics you get for free, you need to record what you care about explicitly. Here, we have client libraries that Prometheus offers. All the major languages have a Prometheus client library, which provides the metrics endpoint. The client library makes application metrics available to Prometheus, giving you a very high level of visibility into your application.

With Prometheus client libraries, you can see what is happening inside the application. So, we have both Prometheus exporters and Prometheus client libraries that allow Prometheus to monitor everything.  

Exposing Docker Metrics to Prometheus

The Docker Engine interacts with all clients and collects and exports metrics; when you build a Docker image, for example, the Engine records a metric. We need insight into the Docker platform, and we can expose Docker metrics to Prometheus: the Docker Engine has a built-in mechanism to export metrics in Prometheus format, covering the Engine itself, containers, and images (a configuration sketch follows below).
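
A minimal sketch of enabling the engine's metrics endpoint; the address and port follow the commonly documented example, and older engine versions may additionally require experimental features to be enabled:

```bash
# Add the metrics address to the daemon configuration (merge into your existing
# /etc/docker/daemon.json rather than overwriting it if you already have one).
echo '{ "metrics-addr": "127.0.0.1:9323" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# Confirm engine, build, and container metrics are served in Prometheus text format.
curl -s http://127.0.0.1:9323/metrics | head
```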

Docker metric types: Three types

Docker metrics fall into three areas: the Engine, builds, and containers.

    1. Engine metrics give you information about the host, such as the CPU count, the OS version, and the build of the Docker Engine.
    2. Build metrics provide information such as the number of builds triggered, canceled, and failed.
    3. Container metrics show the number of containers stopped and paused and the number of health checks fired and failed.

Wrap up: Prometheus monitoring.

So, for Prometheus monitoring, we have Prometheus exporters that can gather metrics for, say, a Linux server, and application metrics exposed natively via a client library.

Both provide an HTTP endpoint that returns metrics in the standard Prometheus format. Once the HTTP endpoint is up and running on the application (legacy or cloud-native), Prometheus will scrape (collect) the metrics using dynamic or static target discovery.

So we have Exporters that can add metrics to systems that don’t have Prometheus support. We also have Prometheus client libraries that can provide Prometheus support in the application. These client libraries can provide out-of-the-box runtime and custom metrics relevant to the applications.

Summary: Prometheus Metric Types

Regarding monitoring applications, Prometheus stands out as a powerful tool. One of its key features is the ability to work with different metric types, each serving a specific purpose. In this blog post, we dive into the world of Prometheus metric types and understand how they can enhance your application monitoring strategy.

Counter Metrics

Counter metrics are used to track the number of occurrences of a particular event. They start at 0 and can only increase over time. These metrics are ideal for measuring total requests served, successful logins, or errors encountered. By monitoring counter metrics, you can gain valuable insights into your application’s overall health and usage.

Gauge Metrics

Gauge metrics, unlike counter metrics, can increase or decrease over time. They often measure things that fluctuate, such as CPU usage, memory consumption, or the number of active connections. With gauge metrics, you can have a real-time view of the current state of your application and identify any potential bottlenecks or anomalies.

Histogram Metrics

Histogram metrics provide a way to measure the distribution of values over a given period of time. They are commonly used to capture things like response times or request sizes. By collecting histogram metrics, you can analyze the distribution of these values and gain insights into the performance characteristics of your application.

Summary Metrics

Summary metrics are similar to histograms but provide more advanced features. They calculate percentiles, allowing you to measure the 90th percentile response time. Additionally, summary metrics provide a rolling time window for observations, making them more suitable for long-term analysis of your application’s performance.

In conclusion, understanding the different Prometheus metric types is critical to effectively monitoring your applications. By utilizing counter metrics, gauge metrics, histogram metrics, and summary metrics, you can gain valuable insights into the behavior and performance of your application. Whether tracking events, monitoring resource utilization, or analyzing response times, Prometheus metric types provide the flexibility and power to ensure your applications run smoothly.