WAN Design Requirements

DMVPN

DMVPN

Cisco DMVPN builds on the virtual private network (VPN) concept, which provides private connectivity over a public network such as the Internet. DMVPN takes this concept further by allowing multiple VPNs to be deployed over a shared infrastructure in a manageable and scalable way.

This shared infrastructure, or “DMVPN network,” enables each VPN to connect to the other VPNs without needing expensive dedicated connections or complex configurations.

DMVPN Explained: DMVPN creates a virtual network built on top of the existing infrastructure. This virtual network consists of “tunnels” between various endpoints, such as corporate networks, branch offices, or remote users, and it allows secure communication between these endpoints regardless of their geographic location. Because this virtual network runs on top of an underlying transport (the underlay), DMVPN is an overlay solution.

Hub Router: The hub router serves as the central point of connectivity in a DMVPN deployment. It acts as a central hub for all the spoke routers, enabling secure communication between them. The hub router is responsible for managing the dynamic IPsec tunnels and facilitating efficient routing.

Spoke Routers: Spoke routers are the remote endpoints in a DMVPN network. They establish IPsec tunnels with the hub router to securely transmit data. Spoke routers are typically located in branch offices or connected to remote workers' devices. They dynamically establish tunnels based on network requirements, ensuring optimal routing.

Next-Hop Resolution Protocol (NHRP): NHRP is a critical component of DMVPN that aids in dynamic IPsec tunnel establishment. It assists spoke routers in resolving the next-hop addresses for establishing tunnels with other spoke routers or the hub router. NHRP maintains a mapping database that allows efficient routing and simplifies network configuration.

Scalability: DMVPN offers excellent scalability, making it suitable for organizations with expanding networks. As new branch offices or remote workers join the network, DMVPN dynamically establishes tunnels without the need for manual configuration. This scalability eliminates the complexities associated with traditional point-to-point VPN solutions.

Cost Efficiency: By utilizing DMVPN, organizations can leverage affordable public network infrastructures instead of costly dedicated connections. DMVPN makes efficient use of bandwidth, reducing operational costs while providing secure and reliable connectivity.

Flexibility:DMVPN provides flexibility in terms of network design and management. It supports different routing protocols, allowing seamless integration with existing network infrastructure. Additionally, DMVPN supports various transport technologies, including MPLS, broadband, and cellular, enabling organizations to choose the most suitable option for their needs.

Highlights: DMVPN

VPN-based security solutions

VPN-based security solutions are increasingly popular and have proven effective and secure technology for protecting sensitive data traversing insecure channel mediums, such as the Internet.

Traditional IPsec-based site-to-site, hub-to-spoke VPN deployment models do not scale well and are adequate only for small- and medium-sized networks. As demand for IPsec-based VPN implementation grows, organizations with large-scale enterprise networks require scalable and dynamic IPsec solutions that interconnect sites across the Internet with reduced latency while optimizing network performance and bandwidth utilization.

Scaling traditional IPsec VPN

Dynamic Multipoint VPN (DMVPN) technology scales IPsec VPN networks by offering a large-scale deployment model that allows the network to expand and realize its full potential. In addition, DMVPN offers scalability that enables zero-touch deployment models.

Diagram: IPsec Tunnel

Encryption is supported through IPsec, making DMVPN a popular choice for connecting sites over regular Internet connections. It is a great backup or alternative to private networks such as MPLS VPN. A related Cisco option is FlexVPN, an IKEv2-based alternative to DMVPN.

Routing Technique

DMVPN (Dynamic Multipoint VPN) is a technique for building a VPN with multiple sites without statically configuring every device. It is a “hub and spoke” network in which the spokes can also communicate with each other directly without going through the hub.

 

Related: For pre-information, you may find the following posts helpful.

  1. VPNOverview
  2. Dynamic Workload Scaling
  3. DMVPN Phases
  4. IPSec Fault Tolerance
  5. Dead Peer Detection
  6. Network Overlays
  7. IDS IPS Azure
  8. SD WAN SASE
  9. Network Traffic Engineering

DMVPN

Key DMVPN Discussion Points:


  • DMVPN Introduction.

  • Challenging landscape with standard VPN security technologies.

  • Technical details on how to approach implementing a DMVPN network.

  • Types of DMVPN phases.

  • DMVPN components and features.

Back To Basics: DMVPN

Cisco DMVPN

♦ DMVPN Components Involved

The DMVPN solution consists of a combination of existing technologies so that sites can learn about each other and create dynamic VPNs. Therefore, efficiently designing and implementing a Cisco DMVPN network requires thoroughly understanding these components, their interactions, and how they all come together to create a DMVPN network.

These technologies may seem complex, and this post aims to simplify them. First, we mentioned that DMVPN has different components, which are the building blocks of a DMVPN network. These include Generic Routing Encapsulation (GRE), the Next Hop Resolution Protocol (NHRP), and IPsec.

The Dynamic Multipoint VPN (DMVPN) feature allows users to better scale large and small IP Security (IPsec) Virtual Private Networks (VPNs) by combining generic routing encapsulation (GRE) tunnels, IPsec encryption, and Next Hop Resolution Protocol (NHRP).

Each of these components needs a base configuration for DMVPN to work. Once the base configuration is in place, we have a variety of show and debug commands to troubleshoot a DMVPN network to ensure smooth operations.

There are four pieces to DMVPN (a minimal configuration sketch follows this list):

  • Multipoint GRE (mGRE)
  • NHRP (Next Hop Resolution Protocol)
  • Routing (RIP, EIGRP, OSPF, BGP, etc.)
  • IPsec (not required but recommended)
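
To make these four pieces concrete, here is a minimal sketch of a hub tunnel interface that ties them together. The addresses, interface names, and the IPsec profile name are illustrative only and would be adapted to your own design:

    ! Hub tunnel interface: mGRE + NHRP + routing, with optional IPsec protection
    interface Tunnel0
     ip address 192.168.100.1 255.255.255.0
     tunnel source GigabitEthernet0/0
     ! mGRE: one interface serves every spoke, no static destination
     tunnel mode gre multipoint
     ! NHRP: spokes register their underlay addresses here
     ip nhrp network-id 1
     ip nhrp map multicast dynamic
     ! IPsec (optional but recommended)
     tunnel protection ipsec profile DMVPN-PROF
    !
    ! Routing protocol running over the overlay
    router eigrp 100
     network 192.168.100.0 0.0.0.255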

Cisco DMVPN Components

Main DMVPN Components

Dynamic VPN

  • Multipoint GRE and Point-to-Point GRE

  • NHRP (Next Hop Resolution Protocol)

  • Routing (RIP, EIGRP, OSPF, BGP, etc.)

  • IPsec (not required but recommended)

1st Lab Guide: Displaying the DMVPN configuration

DMVPN Network

The following screenshot is from a DMVPN network built in Cisco Modeling Labs. We have R1 as the hub and R2 and R3 as the spokes. The command show dmvpn displays that we have two spoke routers. Notice the “D” attribute: it means the spoke entries have been learned dynamically, which is the essence of DMVPN.

The spokes are learned through the Next Hop Resolution Protocol. Because this is a nonbroadcast multiaccess (NBMA) network, we must use a protocol other than the Address Resolution Protocol (ARP).

Note:

  1. As you can see in the tunnel configuration of one of the spokes, we have a static mapping for the hub with the command ip nhrp nhs 192.168.100.1. We also have point-to-point GRE tunnels on the spokes with the command tunnel destination 172.17.11.2.
  2. Therefore, we are running DMVPN Phase 1. Later phases (2 and 3) use mGRE on the spokes. More on this later.
Diagram: DMVPN Configuration.
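
For reference, a DMVPN Phase 1 spoke along the lines of the lab above might look roughly like the sketch below. The hub addresses (192.168.100.1 on the overlay, 172.17.11.2 on the underlay) come from the note; the spoke's own addressing and interface names are illustrative:

    interface Tunnel0
     ! Spoke overlay address (illustrative)
     ip address 192.168.100.2 255.255.255.0
     tunnel source GigabitEthernet0/0
     ! Point-to-point GRE to the hub's underlay address = Phase 1
     tunnel destination 172.17.11.2
     ip nhrp network-id 1
     ! The hub (192.168.100.1) is the next-hop server; map it to its underlay address
     ip nhrp nhs 192.168.100.1
     ip nhrp map 192.168.100.1 172.17.11.2
     ip nhrp map multicast 172.17.11.2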

Key DMVPN components include:

●   Multipoint GRE (mGRE) tunnel interface: Allows a single GRE interface to support multiple IPsec tunnels, simplifying the size and complexity of the configuration. Standard point-to-point GRE tunnels are used in the earlier versions or phases of DMVPN.

●   Dynamic discovery of IPsec tunnel endpoints and crypto profiles: Eliminates the need to configure static crypto maps defining every pair of IPsec peers, further simplifying the configuration.

●   NHRP: Allows spokes to be deployed with dynamically assigned public IP addresses (i.e., behind an ISP’s router). The hub maintains an NHRP database of the public interface addresses of each spoke. Each spoke registers its actual address when it boots; when it needs to build direct tunnels with other spokes, it queries the NHRP database for the real addresses of the destination spokes.

Diagram: DMVPN explained. Source: TechTarget.

DMVPN Explained

Overlay Networking

A Cisco DMVPN network consists of one or more virtual overlay networks. Such a virtual network is called an overlay because it depends on an underlying transport called the underlay network. The underlay forwards the traffic flowing through the overlay. With a protocol analyzer you can observe the overlay traffic inside the underlay, but left to its defaults, the underlay network has no real visibility into the overlay network it carries.

Routers at the company’s sites are the endpoints of the tunnels that form the overlay network; for example, a WAN edge router such as a Cisco ISR configured for DMVPN. The underlay, which is likely out of your control, is made up of an array of service provider equipment such as routers, switches, firewalls, and load balancers.

The following diagram displays the different overlay solutions. VXLAN is common in the data center, while GRE is used across the WAN. DMVPN uses GRE.

Diagram: Virtual overlay solutions.

2nd Lab Guide: VXLAN overlay

Overlay Networking

While DMVPN does not use VXLAN as its overlay protocol, it is helpful to look at VXLAN for background and reference. VXLAN is a network overlay technology that provides a scalable and flexible solution for creating virtualized networks.

It enables the creation of logical Layer 2 networks over an existing Layer 3 infrastructure, allowing organizations to extend their networks across data centers and virtualized environments. In the following example, we create a Layer 2 overlay over a Layer 3 core. A significant difference between DMVPN’s use of GRE as the overlay and the use of VXLAN is the VNI.

Note:

  1. One critical component of VXLAN is the Virtual Network Identifier (VNI). In this blog post, we will explore the details of VXLAN VNI and its significance in modern network architectures.
  2. VNI is a 24-bit identifier that uniquely identifies a VXLAN network. It allows multiple VXLAN networks to coexist over the same physical network infrastructure. Each VNI represents a separate Layer 2 network, enabling the isolation and segmentation of traffic between different virtual networks.

Below, you can see the VNI used and the peers that have been created. VXLAN also works in multicast mode.

Diagram: Overlay Networking with VXLAN.

DMVPN Overlay Networking

Creating an overlay network


  • To create an overlay network, one needs a tunneling technique, such as Multipoint GRE or Point-to-Point GRE.

  • GRE tunnels are the most widely used option for external connectivity.

  • VXLAN is used internally within the data center.

  • GRE tunnels run over an IP-based network, inserting an IP and GRE header on top of the original protocol packet.

DMVPN: Creating an overlay network

The overlay network does not magically appear. To create one, we need a tunneling technique, and many tunneling technologies can be used to form the overlay network. The Generic Routing Encapsulation (GRE) tunnel is the most widely used for external connectivity, while VXLAN is used for internal connectivity within the data center.

GRE is the tunneling technique that DMVPN adopts. A GRE tunnel can carry various protocols over an IP-based network. It works by inserting an IP and GRE header on top of the original protocol packet, creating a new GRE/IP packet.

GRE over IPsec

The resulting GRE/IP packet uses a source/destination pair routable over the underlying infrastructure. The GRE/IP header is the outer header, and the original protocol header is the inner header.

♦ Is GRE over IPsec a tunneling protocol? 

GRE is a tunneling protocol that can transport multicast, broadcast, and non-IP packets such as IPX. IPsec is an encryption protocol that can only transport unicast packets, not multicast or broadcast. Hence, we wrap the traffic in GRE first and then encrypt it with IPsec, which is called GRE over IPsec.
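
To visualize the wrapping order, this is how a packet is built up with GRE over IPsec when transport mode is used (the mode used later in this post):

    Original packet:            [ IP | payload ]
    After GRE encapsulation:    [ outer IP | GRE | IP | payload ]
    After IPsec (transport):    [ outer IP | ESP | GRE | IP | payload ]   (GRE and everything inside it is encrypted)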

3rd Lab Guide: Displaying the DMVPN configuration

DMVPN Configuration

We are using a different DMVPN lab setup than before. R11 is the hub of the DMVPN network, and we have only one spoke, R12. In the DMVPN configuration, the tunnel interface uses GRE encapsulation (the default for a tunnel interface). This is the overlay network.

We are currently using standard point-to-point GRE and not multipoint GRE. We know this because we have explicitly set the tunnel destination with the command tunnel destination 172.16.31.2. This is fine for a small network with a few spokes; for larger networks, however, we need mGRE to take full advantage of the dynamic nature of DMVPN.

Note:

  1. As for routing protocols, we run EIGRP over the tunnel (GRE) interface. Since we have only one EIGRP neighbor, we don’t need to worry about split horizon. Before we move on, one key point: running a traceroute from R11 to R12 shows only one hop.
  2. This is because the original packet’s TTL is carried inside the GRE encapsulation. So, no matter how many physical or virtual devices sit in the underlay path between R11 and R12, the traceroute always shows one hop across the overlay, i.e., the GRE tunnel.
Diagram: DMVPN Configuration.

Multipoint GRE. What is mGRE? 

An alternative to configuring multiple point-to-point GRE tunnels is to use multipoint GRE tunnels to provide the desired connectivity. Multipoint GRE (mGRE) tunnels are similar in construction to point-to-point GRE tunnels except for the tunnel destination command: instead of declaring a static destination, no destination is configured and the tunnel mode gre multipoint command is issued.
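
As a minimal sketch (addresses and interface names are illustrative), converting a point-to-point GRE tunnel to mGRE is simply a matter of removing the static destination and changing the tunnel mode:

    interface Tunnel0
     ip address 192.168.100.2 255.255.255.0
     tunnel source GigabitEthernet0/0
     ! No "tunnel destination" here; NHRP resolves destinations instead
     tunnel mode gre multipoint
     ip nhrp network-id 1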

How does one remote site know what destination to set for the GRE/IP packet created by the tunnel interface? The easy answer is that it can’t on its own. The site can only glean the destination address with the help of an additional protocol. The next component used to create a DMVPN network is the Next Hop Resolution Protocol (NHRP). 

Essentially, mGRE provides a single GRE interface on each router that allows multiple destinations. This interface can support multiple IPsec tunnels and reduces the overall size of the DMVPN configuration. However, if two branch routers need to tunnel traffic, mGRE by itself does not know which IP addresses to use.

The Next Hop Resolution Protocol (NHRP) is used to solve this issue. The following diagram depicts the functionality of mGRE in DMVPN technology.

what is mgre
Diagram: What is mGRE? Source is Stucknactive

Next Hop Resolution Protocol (NHRP)

The Next Hop Resolution Protocol (NHRP) is a networking protocol designed to facilitate efficient and reliable communication between two nodes on a network. It does this by providing a way for one node to discover the IP address of another node on the same network.

The primary role of NHRP is to allow a node to resolve the address of another node that it cannot discover through a broadcast mechanism such as ARP. This is done by querying an NHRP server, which maintains a mapping of the nodes on the network. When a node sends a request to the NHRP server, the server returns the address of the destination node.

NHRP was initially designed to allow routers connected to non-broadcast multiple-access (NBMA) networks to discover the proper next-hop mappings to communicate. It is specified in RFC 2332. NBMA networks faced a similar issue as mGRE tunnels. 

Diagram: Cisco DMVPN and NHRP. Source: Network Direction.

NHRP allows spokes to be deployed with dynamically assigned IP addresses and still be reached through the central DMVPN hub. One branch router uses this protocol to find the public IP address of another branch router. NHRP follows a server-client model: one router functions as the NHRP server while the other routers are NHRP clients. In the multipoint GRE/DMVPN topology, the hub router is the NHRP server, and all other routers are the spokes (clients).

Each client registers with the server and reports its public IP address, which the server tracks in its cache. Then, through a process that involves registration and resolution requests from the client routers and resolution replies from the server router, traffic is enabled between various routers in the DMVPN.
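
On Cisco IOS, this registration and resolution activity can be inspected with a handful of commands; the list below is a short sketch rather than an exhaustive reference:

    ! NHRP cache: static and dynamic tunnel-to-NBMA mappings
    show ip nhrp
    ! Status of the configured next-hop server(s), as seen from a spoke
    show ip nhrp nhs detail
    ! Per-peer DMVPN state; the "D" attribute marks dynamically learned entries
    show dmvpn
    ! Registration and resolution requests/replies in real time
    debug nhrp packet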

4th Lab Guide: Displaying the DMVPN configuration

NHC and NHS Design

The following DMVPN configuration shows a couple of new topics. DMVPN works with a Next Hop Server (NHS) and Next Hop Client (NHC) design; the hub is the NHS. You can see this explicitly configured on the spokes, and this configuration needs to be on the spokes rather than on the hub.

The hub configuration is meant to be more dynamic. Also, if you recall, we are running EIGRP over the GRE tunnel. There are two important points here: first, we must consider split horizon because we now have two spokes; second, we need to use the “multicast” keyword.

Note:

  1. This is because EIGRP uses multicast hello messages to form neighbor relationships. If we were running BGP over the tunnel interface instead of EIGRP, we would not need the multicast keyword, as BGP does not use multicast. The full command is ip nhrp nhs 192.168.100.11 nbma 172.16.11.1 multicast.
  2. On the spoke, we are telling the router that R11 is the NHS, mapping its tunnel address 192.168.100.11 to the NBMA address 172.16.11.1, and allowing multicast traffic; in other words, we are creating a multicast mapping entry.


5th Lab Guide: DMVPN over IPsec

Securing DMVPN

In the following screenshot, we have DMVPN operating over IPsec. I have connected the hub and two spokes to an unmanaged switch to simulate the WAN environment. Neither the WAN nor DMVPN provides encryption by default; however, since you will probably run DMVPN with the Internet as the underlying network, it is wise to encrypt your tunnels.

In this network, we are running RIP v2 as the routing protocol. Remember that you must turn off split horizon at the hub site. IPsec has phases 1 and 2 (don’t confuse them with the DMVPN phases). Firstly, we need an ISAKMP policy that matches on all our routers. Then, for Phase 2, we require a transform set on each router that tells the router what encryption/hashing to use and whether we want tunnel or transport mode.

Note:

  1. I used ESP with AES as the encryption algorithm for this configuration and SHA for hashing. The mode matters: since GRE already provides the tunnel, we can run IPsec in transport mode; tunnel mode would only add extra overhead, which is unnecessary. (A configuration sketch follows this lab.)
  2. The primary test here was to run a ping between the spokes. When the ping works, behind the scenes our two spoke routers have established an IPsec tunnel. You can see the security association below:
Diagram: DMVPN over IPsec.
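
As referenced in the note above, a sketch of that IPsec configuration might look like the following. The policy number, names, and the pre-shared key are placeholders, and the same ISAKMP policy and transform set would be configured on every router:

    ! IKE phase 1 policy (must match on all routers); the pre-shared key is a placeholder
    crypto isakmp policy 10
     encryption aes
     hash sha
     authentication pre-share
     group 2
    crypto isakmp key MY-SECRET address 0.0.0.0 0.0.0.0
    !
    ! IPsec phase 2: transform set in transport mode, since GRE already provides the tunnel
    crypto ipsec transform-set DMVPN-TS esp-aes esp-sha-hmac
     mode transport
    !
    crypto ipsec profile DMVPN-PROF
     set transform-set DMVPN-TS
    !
    ! Apply the profile to the GRE/mGRE tunnel interface
    interface Tunnel0
     tunnel protection ipsec profile DMVPN-PROF

Tunnel mode would also work, but as the note points out, it only adds another IP header of overhead on top of the GRE encapsulation.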

IPsec Tunnels

An IPsec tunnel is a secure connection between two or more devices over an untrusted network using a set of cryptographic security protocols. The most common type of IPsec tunnel is the site-to-site tunnel, which connects two sites or networks. It allows two remote sites to communicate securely and exchange traffic between them. Another type of IPsec tunnel is the remote-access tunnel, which allows a remote user to connect to the corporate network securely.

When setting up an IPsec tunnel, several parameters must be configured, such as the authentication method, encryption algorithm, and tunnel or transport mode. The Internet Key Exchange (IKE) protocol is used to negotiate authentication and encryption keys between the peers.

Diagram: IPsec VPN. Source: Wikimedia.

IPsec Tunnel Endpoint Discovery 

Tunnel Endpoint Discovery (TED) allows routers to discover IPsec endpoints automatically, so static crypto maps do not have to be configured between individual IPsec tunnel endpoints. In addition, TED allows endpoints or peers to dynamically and proactively initiate the negotiation of IPsec tunnels to discover unknown peers.

These remote peers do not need to have TED configured to be discovered by inbound TED probes. That is, VPN devices that receive TED probes on interfaces that are not configured for TED can still negotiate a dynamically initiated tunnel using TED.

DMVPN Checkpoint 

Main DMVPN Points To Consider

  • Dynamic Multipoint VPN (DMVPN) technology is used for scaling IPsec VPN networks.

  • The DMVPN solution consists of a combination of existing technologies.

  • The overlay network does not magically appear. To create an overlay network, we need a tunneling technique. 

  • Once the virtual tunnel is fully functional, the routers need a way to direct traffic through their tunnels. Dynamic routing protocols are excellent choices for this.

  • mGRE features a single GRE interface on each router with the possibility of multiple destinations.

  • The Next Hop Resolution Protocol (NHRP) is a networking protocol designed to facilitate efficient and reliable communication between two nodes on a network.

  • An IPsec tunnel is a secure connection between two or more devices over an untrusted network using a set of cryptographic security protocols. DMVPN is not secure by default.

Continue Reading


DMVPN and Routing protocols 

Routing protocols enable the DMVPN to find routes between different endpoints efficiently and effectively. Therefore, choosing the right routing protocol is essential to building a scalable and stable DMVPN. One option is to use Open Shortest Path First (OSPF) as the interior routing protocol. However, OSPF is best suited for small-scale DMVPN deployments. 

The Enhanced Interior Gateway Routing Protocol (EIGRP) or Border Gateway Protocol (BGP) is more suitable for large-scale implementations. EIGRP is not restricted by the topology limitations of a link-state protocol and is easier to deploy and scale in a DMVPN topology. BGP can scale to many peers and routes, and it puts less strain on the routers compared to other routing protocols.

DMVPN supports various routing protocols that enable efficient communication between network devices. In this section, we will explore three popular DMVPN routing protocols: Enhanced Interior Gateway Routing Protocol (EIGRP), Open Shortest Path First (OSPF), and Border Gateway Protocol (BGP). We will examine their characteristics, advantages, and use cases, allowing network administrators to make informed decisions when implementing DMVPN.

EIGRP: The Dynamic Routing Powerhouse

EIGRP is a distance vector routing protocol widely used in DMVPN deployments. This section will provide an in-depth look at EIGRP, discussing its features such as fast convergence, load balancing, and scalability. Furthermore, we will highlight best practices for configuring EIGRP in a DMVPN environment, optimizing network performance and reliability.
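
As a hedged illustration of two of those best practices (the AS number and addresses are illustrative), the hub's tunnel interface is typically where split horizon and, in Phase 2 designs, next-hop-self are adjusted so that spoke routes are re-advertised correctly:

    interface Tunnel0
     ! Hub: re-advertise routes learned from one spoke to the other spokes
     no ip split-horizon eigrp 100
     ! Phase 2 designs: keep the originating spoke as the next hop for spoke-to-spoke traffic
     no ip next-hop-self eigrp 100
    !
    router eigrp 100
     network 192.168.100.0 0.0.0.255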

OSPF: Scalable and Flexible Routing

OSPF is a link-state routing protocol that offers excellent scalability and flexibility in DMVPN networks. This section will explore OSPF’s key attributes, including its hierarchical design, area types, and route summarization capabilities. We will also discuss considerations for deploying OSPF in a DMVPN environment, ensuring seamless connectivity and effective network management.

BGP: Extending DMVPN to the Internet

BGP, a path vector routing protocol, connects DMVPN networks to the global Internet. This section will focus on BGP’s unique characteristics, such as its autonomous system (AS) concept, policy-based routing, and route reflectors. We will also address the challenges and best practices of integrating BGP into DMVPN architectures.

6th Lab Guide: DMVPN Phase 1 and OSPF

DMVPN Routing

OSPF is not the best solution for DMVPN. Because it’s a link-state protocol, each spoke router must have the complete LSDB for the DMVPN area. Since we use a single subnet on the multipoint GRE interfaces, all spoke routers must be in the same area.

This is no problem with a few routers, but it doesn’t scale well when you have dozens or hundreds. Most spoke routers are probably low-end devices at branch offices that don’t like all the LSA flooding that OSPF might do within the area. One way to reduce the number of prefixes in the DMVPN network is to use a stub or totally stubby area.

Note:

  1. The example below shows OSPF running between the hub and the two spokes. The OSPF network type can be viewed on the hub along with the status of DMVPN. Take note of the next hop on the spoke router when I run show ip route ospf.
  2. Each router has learned the networks on the different loopback interfaces. The next hop value is preserved when we use the broadcast network type.

You have now seen how the OSPF broadcast network type behaves in DMVPN Phase 1. As you can see, the routing tables carry every prefix. All traffic goes through the hub, so our spoke routers don’t need to know everything. Unfortunately, it’s impossible to summarize within an area. However, we can reduce the number of routes by changing the DMVPN area into a stub or totally stubby area.

Unlike the broadcast network type, point-to-point and point-to-multipoint network types do not preserve the spokes’ next-hop IP addresses.

7th Lab Guide: DMVPN Phase 2 with OSPF

DMVPN Routing

In the following example, we have DMVPN Phase 2 running with OSPF. We are using the broadcast network type. However, the following OSPF network types are all candidates for DMVPN Phase 2.

  • point-to-point
  • broadcast
  • non-broadcast
  • point-to-multipoint
  • point-to-multipoint non-broadcast

Below, all routers have learned the networks on each other’s loopback interfaces. Look closely at the next hop IP addresses for the 2.2.2.2/32 and 3.3.3.3/32 entries. This looks good; these are the IP addresses of the spoke routers. You can also see that 1.1.1.1/32 is an inter-area route. This is good; we can summarize networks “behind” the hub towards the spoke routers if we want to.

When running OSPF for DMVPN phase 2, you only have two choices if you want direct spoke-to-spoke communication: broadcast and non-broadcast. Let me give you an overview:

  • Point-to-point: This will not work since we use multipoint GRE interfaces.
  • Broadcast: This network type is your best choice, and we are using it in the example above. It provides automatic neighbor discovery and correct next-hop addresses. Make sure the spoke routers can’t become DR or BDR by setting ip ospf priority 0 on the spokes (see the sketch after this list).
  • Non-broadcast: similar to broadcast, but you have to configure static neighbors.
  • Point-to-multipoint: Don’t use this for DMVPN phase 2 since the hub changes the next hop address; you won’t have direct spoke-to-spoke communication.
  • Point-to-multipoint non-broadcast: same story as point-to-multipoint, but you must also configure static neighbors.
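
A minimal sketch of that broadcast setup, with illustrative interface numbers: the hub keeps a high priority so it wins the DR election, while every spoke is set to priority 0.

    ! Hub tunnel interface: allowed to win the DR election
    interface Tunnel0
     ip ospf network broadcast
     ip ospf priority 255
    !
    ! Spoke tunnel interfaces: never DR/BDR
    interface Tunnel0
     ip ospf network broadcast
     ip ospf priority 0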

 

DMVPN Deployment Scenarios: 

Cisco DMVPN can be deployed in two ways:

  1. Hub-and-spoke deployment model
  2. Spoke-to-spoke deployment model

Hub-and-spoke deployment model: In this traditional topology, remote sites, which are the spokes, are aggregated into a headend VPN device. The headend VPN location would be at the corporate headquarters, known as the hub. 

Traffic from any remote site to other remote sites would need to pass through the headend device. Cisco DMVPN supports dynamic routing, QoS, and IP Multicast while significantly reducing the configuration effort. 

Spoke-to-spoke deployment model: Cisco DMVPN allows the creation of a full-mesh VPN, in which traditional hub-and-spoke connectivity is supplemented by dynamically created IPsec tunnels directly between the spokes. 

With direct spoke-to-spoke tunnels, traffic between remote sites does not need to traverse the hub; this eliminates additional delays and conserves WAN bandwidth while improving performance. 

Spoke-to-spoke capability is supported in a single-hub or multi-hub environment. Multihub deployments provide increased spoke-to-spoke resiliency and redundancy.  

DMVPN Designs

The word phase is almost always connected to discussions on DMVPN design. DMVPN phase refers to the version of DMVPN implemented in a DMVPN design. As mentioned above, we can have two deployment models, each of which can be mapped to a DMVPN Phase.

Cisco DMVPN was rolled out in stages as the solution became more widely adopted, with each stage addressing performance issues and adding improved features. There are three main phases for DMVPN:

  • Phase 1 – Hub-and-spoke
  • Phase 2 – Spoke-initiated spoke-to-spoke tunnels
  • Phase 3 – Spoke-to-spoke tunnels triggered by hub NHRP redirects
Diagram: What is DMVPN? Source: Lira.

The differences between the DMVPN phases relate to routing efficiency and the ability to create spoke-to-spoke tunnels. We started with DMVPN Phase 1, which supported only hub-to-spoke tunnels. This limited scalability because there was no direct spoke-to-spoke communication: the spokes could communicate with one another, but the traffic had to traverse the hub.

Then came DMVPN Phase 2, which supports dynamic spoke-to-spoke tunnels. These tunnels are initially brought up by passing traffic via the hub. Later, Cisco developed DMVPN Phase 3, which optimized how spoke-to-spoke communication happens and how the tunnels are built.

Dynamic multipoint virtual private networks began simply as what is best described as hub-and-spoke topologies. The primary tool for creating these VPNs combines Multipoint Generic Routing Encapsulation (mGRE) connections employed on the hub with traditional Point-to-Point (P2P) GRE tunnels on the spoke devices.

In this initial deployment methodology, known as a Phase 1 DMVPN, the spokes can only join the hub and communicate with one another through the hub. This phase does not use spoke-to-spoke tunnels. Instead, the spokes are configured with point-to-point GRE tunnels to the hub and register their logical (tunnel) IP address and non-broadcast multi-access (NBMA) address with the next-hop server (NHS) on the hub.

It is essential to keep in mind that there is a total of three phases, and each one can influence the following:

  1. Spoke-to-spoke traffic patterns
  2. Routing protocol design
  3. Scalability

DMVPN Design Options

The disadvantage of a single hub router is that it’s a single point of failure. Once your hub router fails, the entire DMVPN network is gone.

We need another hub router to add redundancy to our DMVPN network. There are two options for this:

  1. Dual hub – Single Cloud
  2. Dual hub – Dual Cloud

With the single cloud option, we use a single DMVPN network but add a second hub. The spoke routers will use only one multipoint GRE interface, and we configure the second hub as a next-hop server. The dual cloud option also has two hubs, but we will use two DMVPN networks, meaning all spoke routers will get a second multipoint GRE interface.

Understanding DMVPN Dual Hub Single Cloud:

DMVPN dual hub single cloud is a network architecture that provides redundancy and high availability by utilizing two hub devices connected to a single cloud. The cloud can be an internet-based infrastructure or a private WAN. This configuration ensures the network remains operational even if one hub fails, as the other hub takes over the traffic routing responsibilities.

Benefits of DMVPN Dual Hub Single Cloud:

1. Redundancy: With dual hubs, organizations can ensure network availability even during hub device failures. This redundancy minimizes downtime and maximizes productivity.

2. Load Balancing: DMVPN dual hub single cloud allows for efficient load balancing between the two hubs. Traffic can be distributed evenly, optimizing bandwidth utilization and enhancing network performance.

3. Scalability: The architecture is highly scalable, allowing organizations to easily add new sites without reconfiguring the entire network. New sites can be connected to either hub, providing flexibility and ease of expansion.

4. Simplified Management: DMVPN dual hub single cloud simplifies network management by centralizing control and reducing the complexity of VPN configurations. Changes and updates can be made at the hub level, ensuring consistent policies across all connected sites.

The disadvantage is that we have limited control over routing. Since we use a single multipoint GRE interface, making the spoke routers prefer one hub over another is challenging.

8th Lab Guide: DMVPN Dual Hub Single Cloud

DMVPN Advanced Configuration

Below is a DMVPN network with two hubs and two spoke routers. Hub1 will be the primary hub, and hub2 will be the secondary hub. We use a single DMVPN network; each router only has one multipoint GRE interface. On top of that, we have R1 at the main site, where we use the 10.10.10.0/24 subnet. Behind R1, we have a loopback interface with IP address 1.1.1.1/32.

Note:

  1. The two hub routers and spoke routers are connected to the Internet. Usually, you would connect the two hub routers to different ISPs. To keep it simple, I combined all routers into the 192.168.1.0/24 subnet, represented as an unmanaged switch in the lab below.
  2. Each spoke router has a loopback interface with an IP address. The DMVPN network will use subnet 172.16.1.0/24, where hub1 will be the primary hub. Spoke routers will register themselves with both hub routers.
Diagram: Dual Hub Single Cloud.
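
On a spoke in this dual hub, single cloud design, the key point is that a single mGRE tunnel interface carries NHRP mappings for both hubs. A hedged sketch follows; the spoke's own address and the hubs' overlay/underlay addresses are illustrative, drawn from the 172.16.1.0/24 and 192.168.1.0/24 subnets mentioned above:

    interface Tunnel0
     ! One mGRE interface, one DMVPN subnet (172.16.1.0/24)
     ip address 172.16.1.3 255.255.255.0
     tunnel source GigabitEthernet0/0
     tunnel mode gre multipoint
     ip nhrp network-id 1
     ! Hub1 (primary); overlay and underlay addresses are illustrative
     ip nhrp nhs 172.16.1.1
     ip nhrp map 172.16.1.1 192.168.1.1
     ip nhrp map multicast 192.168.1.1
     ! Hub2 (secondary)
     ip nhrp nhs 172.16.1.2
     ip nhrp map 172.16.1.2 192.168.1.2
     ip nhrp map multicast 192.168.1.2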

Summary of DMVPN Phases

Phase 1—Hub-to-Spoke Designs: Phase 1 was the first design introduced for hub-to-spoke implementation, where spoke-to-spoke traffic would traverse via the hub. Phase 1 also introduced daisy chaining of identical hubs for scaling the network, thereby providing Server Load Balancing (SLB) capability to increase the CPU power.

Phase 2—Spoke-to-Spoke Designs: Phase 2 design introduced the ability for dynamic spoke-to-spoke tunnels without traffic going through the hub, intersite communication bypassing the hub, thereby providing greater scalability and better traffic control.

In Phase 2 network design, each DMVPN network is independent of other DMVPN networks, causing spoke-to-spoke traffic from different regions to traverse the regional hubs without going through the central hub.

Phase 3—Hierarchical (Tree-Based) Designs: Phase 3 extended Phase 2 design with the capability to establish dynamic and direct spoke-to-spoke tunnels from different DMVPN networks across multiple regions. In Phase 3, all regional DMVPN networks are bound to form a single hierarchical (tree-based) DMVPN network, including the central hubs.

As a result, spoke-to-spoke traffic from different regions can establish direct tunnels with each other, thereby bypassing both the regional and main hubs.

Diagram: DMVPN network and phases explained.

DMVPN Architecture

Design recommendation


  • To create an overlay network, one needs a tunneling technique, such as Multipoint GRE or Point-to-Point GRE.

  • Dynamic routing protocols are typically required in all but the smallest deployments or wherever static routing is not manageable or optimal.

  • QoS: Mandatory to ensure performance and quality of voice, video, and real-time data applications.

DMVPN Design recommendation

Which deployment model can you use? The 80:20 traffic rule can be used to determine which model to use:

  1. If 80 percent or more of the traffic from the spokes is directed into the hub network itself, deploy the hub-and-spoke model.
  2. Consider the spoke-to-spoke model if more than 20 percent of the traffic is meant for other spokes.

The hub-and-spoke model is usually preferred for networks with a high volume of IP Multicast traffic.

Architecture

Medium-sized and large-scale site-to-site VPN deployments require support for advanced IP network services such as:

● IP Multicast: Required for efficient and scalable one-to-many (i.e., Internet broadcast) and many-to-many (i.e., conferencing) communications and commonly needed by voice, video, and specific data applications

● Dynamic routing protocols: Typically required in all but the smallest deployments or wherever static routing is not manageable or optimal

● QoS: Mandatory to ensure performance and quality of voice, video, and real-time data applications

Traditionally, supporting these services required combining IPsec with tunneling protocols such as Generic Routing Encapsulation (GRE), which introduced an overlay network, making it complex to set up and manage and limiting the solution’s scalability.

Indeed, traditional IPsec only supports IP Unicast, making deploying applications that involve one-to-many and many-to-many communications inefficient. Cisco DMVPN combines GRE tunneling and IPsec encryption with Next-Hop Resolution Protocol (NHRP) routing to meet these requirements while reducing the administrative burden. 

How DMVPN Works

How DMVPN Works


  •  Each spoke establishes a permanent tunnel to the hub. IPsec is optional.

  • Each spoke registers its actual address as a client to the NHRP server on the hub

  • When a spoke requires that packets be sent to a destination subnet on another spoke, it queries the NHRP server for the real (outside) addresses of other spoke.

  • After the originating spoke learns the peer address of the target spoke, it initiates a dynamic IPsec tunnel to the target spoke.

  • The spoke-to-spoke tunnels are established on demand whenever traffic is sent between the spokes.


How DMVPN Works

DMVPN builds a dynamic tunnel overlay network.

• Initially, each spoke establishes a permanent IPsec tunnel to the hub. (At this stage, spokes do not establish tunnels with other spokes within the network.) The hub address should be static and known by all of the spokes.

• Each spoke registers its actual address as a client to the NHRP server on the hub. The NHRP server maintains an NHRP database of the public interface addresses for each spoke.

• When a spoke requires that packets be sent to a destination (private) subnet on another spoke, it queries the NHRP server for the real (outside) addresses of the other spoke’s destination to build direct tunnels.

• The NHRP server looks up the corresponding destination spoke in its NHRP database and replies with the real address of the target router. This means the dynamic routing protocol does not have to discover the route to the correct spoke. (Dynamic routing adjacencies are established only between the spokes and the hub.)

• After the originating spoke learns the peer address of the target spoke, it initiates a dynamic IPsec tunnel to the target spoke.

• Integrating the multipoint GRE (mGRE) interface, NHRP, and IPsec establishes a direct dynamic spoke-to-spoke tunnel over the DMVPN network.

The spoke-to-spoke tunnels are established on demand whenever traffic is sent between the spokes. After that, packets can bypass the hub and use the spoke-to-spoke tunnel directly. 

Feature Design of Dynamic Multipoint VPN 

The Dynamic Multipoint VPN (DMVPN) feature combines GRE tunnels, IPsec encryption, and NHRP routing to provide users with ease of configuration via crypto profiles—which override the requirement for defining static crypto maps—and dynamic discovery of tunnel endpoints. 

This feature relies on two Cisco-enhanced standard technologies, NHRP and mGRE, and operates as follows: 

  • NHRP is a client-server protocol where the hub is the server and the spokes are the clients. The hub maintains an NHRP database of each spoke’s public interface addresses. Each spoke registers its real address when it boots and queries the NHRP database for the real addresses of the destination spokes to build direct tunnels. 
  • mGRE Tunnel Interface – Allows a single GRE interface to support multiple IPsec tunnels and simplifies the size and complexity of the configuration.
  • Each spoke has a permanent IPsec tunnel to the hub, not to the other spokes within the network. Each spoke registers as a client of the NHRP server. 
  • When a spoke needs to send a packet to a destination (private) subnet on another spoke, it queries the NHRP server for the real (outside) address of the destination (target) spoke. 
  • After the originating spoke “learns” the peer address of the target spoke, a dynamic IPsec tunnel can be initiated into the target spoke. 
  • The spoke-to-spoke tunnel is built over the multipoint GRE interface. 
  • The spoke-to-spoke links are established on demand whenever there is traffic between the spokes. After that, packets can bypass the hub and use the spoke-to-spoke tunnel.
Diagram: Cisco DMVPN features. Source: Cisco.

Cisco DMVPN Solution Architecture

DMVPN allows IPsec VPN networks to scale hub-to-spoke and spoke-to-spoke designs better, optimizing performance and reducing communication latency between sites.

DMVPN offers a wide range of benefits, including the following:

• The capability to build dynamic hub-to-spoke and spoke-to-spoke IPsec tunnels

• Optimized network performance

• Reduced latency for real-time applications

• Reduced router configuration on the hub that provides the capability to dynamically add multiple spoke tunnels without touching the hub configuration

• Automatic triggering of IPsec encryption by GRE tunnel source and destination, assuring zero packet loss

• Support for spoke routers with dynamic physical interface IP addresses (for example, DSL and cable connections)

• The capability to establish dynamic and direct spoke-to-spoke IPsec tunnels for communication between sites without having the traffic go through the hub; that is, intersite communication bypassing the hub

• Support for dynamic routing protocols running over the DMVPN tunnels

• Support for multicast traffic from hub to spokes

• Support for VPN Routing and Forwarding (VRF) integration extended in multiprotocol label switching (MPLS) networks

• Self-healing capability maximizing VPN tunnel uptime by rerouting around network link failures

• Load-balancing capability offering increased performance by transparently terminating VPN connections to multiple headend VPN devices

Network availability over a secure channel is critical when designing scalable IPsec VPN solutions, especially as networks become more geographically distributed. The DMVPN solution architecture is by far the most effective and scalable solution available.

Summary: DMVPN

One technology that has gained significant attention and revolutionized the way networks are connected is DMVPN (Dynamic Multipoint Virtual Private Network). In this blog post, we delved into the depths of DMVPN, exploring its architecture, benefits, and use cases.

Section 1: Understanding DMVPN

DMVPN, at its core, is a scalable and efficient solution for providing secure and dynamic connectivity between multiple sites over a public network infrastructure. It combines the best features of traditional VPNs and multipoint GRE tunnels, resulting in a flexible and cost-effective network solution.

Section 2: The Architecture of DMVPN

The architecture of DMVPN involves three main components: the hub router, the spoke routers, and the underlying routing protocol. The hub router acts as a central point for the network, while the spoke routers establish secure tunnels with the hub. These tunnels are dynamically built using multipoint GRE, allowing efficient data transmission.

Section 3: Benefits of DMVPN

3.1 Enhanced Scalability: DMVPN provides a scalable solution, allowing for easy addition or removal of spokes without complex configurations. This flexibility is particularly useful in dynamic network environments.

3.2 Cost Efficiency: Using existing public network infrastructure, DMVPN eliminates the need for costly dedicated lines or leased circuits. This significantly reduces operational expenses and makes it an attractive option for organizations of all sizes.

3.3 Simplified Management: With DMVPN, network administrators can centrally manage the network infrastructure through the hub router. This centralized control simplifies configuration, monitoring, and troubleshooting tasks.

Section 4: Use Cases of DMVPN

4.1 Branch Office Connectivity: DMVPN is ideal for connecting branch offices to a central headquarters. It provides secure and reliable communication while minimizing the complexity and cost associated with traditional WAN solutions.

4.2 Mobile or Remote Workforce: DMVPN offers a secure and efficient solution for connecting remote employees to the corporate network in today’s mobile-centric work environment. Whether it’s sales representatives or telecommuters, DMVPN ensures seamless connectivity regardless of the location.

Conclusion:

DMVPN has emerged as a game-changer in the world of network connectivity. Its scalable architecture, cost efficiency, and simplified management make it an attractive option for organizations seeking to enhance their network infrastructure. Whether connecting branch offices or enabling a mobile workforce, DMVPN provides a robust and secure solution. Embracing the power of DMVPN can revolutionize network connectivity, opening doors to a more connected and efficient future.

eBOOK on SASE Capabilities

eBOOK – SASE Capabilities

In the following ebook, we will address the key points:

  1. Challenging Landscape
  2. The rise of SASE based on new requirements
  3. SASE definition
  4. Core SASE capabilities
  5. Final recommendations

 

 

Preliminary Information: Useful Links to Relevant Content

For pre-information, you may find the following links useful:

 

Secure Access Service Edge (SASE) is a service designed to provide secure access to cloud applications, data, and infrastructure from anywhere. It allows organizations to securely deploy applications and services from the cloud while managing users and devices from a single platform. As a result, SASE simplifies the IT landscape and reduces the cost and complexity of managing security for cloud applications and services.

SASE provides a unified security platform allowing organizations to connect users and applications securely without managing multiple security solutions. It offers secure access to cloud applications, data, and infrastructure with a single set of policies, regardless of the user’s physical location. It also enables organizations to monitor and control all user activities with the same set of security policies.

SASE also helps organizations reduce the risk of data breaches and malicious actors by providing visibility into user activity and access. In addition, it offers end-to-end encryption, secure authentication, and secure access control. It also includes threat detection, advanced analytics, and data loss prevention.

SASE allows organizations to scale their security infrastructure quickly and easily, providing them with a unified security platform that can be used to connect users and applications from anywhere securely. With SASE, organizations can quickly and securely deploy applications and services from the cloud while managing users and devices from a single platform.

 

 

 

 


OpenShift Networking

OpenShift Networking

OpenShift, developed by Red Hat, is a leading container platform that enables organizations to streamline their application development and deployment processes. With its robust networking capabilities, OpenShift provides a secure and scalable environment for running containerized applications. This blog post will explore the critical aspects of OpenShift networking and how it can benefit your organization.

OpenShift networking is built on top of Kubernetes networking and extends its capabilities to provide a flexible and scalable networking solution for containerized applications. It offers various networking options to meet the diverse needs of different organizations.

Load balancing and service discovery are essential aspects of OpenShift networking. In this section, we will explore how OpenShift handles load balancing across pods using services. We will discuss the various load balancing algorithms available and highlight the importance of service discovery in ensuring seamless communication between microservices within an OpenShift cluster.

OpenShift offers different networking models to suit diverse deployment scenarios. We will explore the three main models: Overlay Networking, Host Networking, and VXLAN Networking. Each model has its advantages and considerations, and we'll highlight the use cases where they shine.

OpenShift provides several advanced networking features that enhance performance, security, and flexibility. We'll dive into topics like Network Policies, Service Mesh, Ingress Controllers, and Load Balancing. Understanding and utilizing these features will empower you to optimize your OpenShift networking environment.


Highlights: OpenShift Networking

OpenShift relies heavily on a networking stack at two layers:

  1. The underlying network topology is determined by the physical network equipment (or by the virtual network infrastructure when OpenShift is deployed in a virtual environment). OpenShift does not control this level, which provides connectivity to OpenShift masters and nodes.
  2. OpenShift SDN plugin determines the virtual network topology. At this level, applications are connected, and external access is provided.

The network topology in OpenShift

OpenShift uses an overlay network based on VXLAN to enable containers to communicate with each other. Layer 2 (Ethernet) frames can be transferred across Layer 3 (IP) networks using the Virtual Extensible Local Area Network (VXLAN) protocol.

Whether the communication is limited to pods within the same project or completely unrestricted depends on the SDN plugin being used. The network topology remains the same regardless of which plugin is used.

SDN plugins

OpenShift makes its internal SDN plugins available out-of-the-box and for integration with third-party SDN frameworks. The following are three built-in plugins that are available in OpenShift:

  1. ovs-subnet
  2. ovs-multitenant
  3. ovs-networkpolicy
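
As a hedged sketch for OpenShift 3.x (the file location and exact option names may vary between versions), the plugin is selected through the networkConfig stanza of the master and node configuration files, for example:

    # /etc/origin/master/master-config.yaml (OpenShift 3.x, illustrative excerpt)
    networkConfig:
      # one of: redhat/openshift-ovs-subnet, redhat/openshift-ovs-multitenant,
      #         redhat/openshift-ovs-networkpolicy
      networkPluginName: redhat/openshift-ovs-multitenant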

Related: Before you proceed, you may find the following posts helpful for some pre-information

  1. Kubernetes Networking 101 
  2. Kubernetes Security Best Practice 
  3. Internet of Things Theory
  4. OpenStack Neutron
  5. Load Balancing
  6. ACI Cisco



OpenShift Networking

Key OpenShift Networking Discussion points:


  • Challenges with containers and endpoint reachability.

  • Challenges to Docker networking.

  • The Kubernetes networking model.

  • Basics of Pod networking.

  • OpenShift networking SDN modes.

Back to Basics: OpenShift Networking

Networking Overview

Each pod in Kubernetes is assigned an IP address from an internal network, which allows pods to communicate with each other. All containers within a pod behave as if they were on the same host. Giving each pod its own IP address means pods can be treated like physical hosts or virtual machines for port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

Linking pods together is unnecessary, and IP addresses shouldn’t be used to communicate directly between pods. Instead, create a service to interact with the pods.
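
As a generic, hedged example of that recommendation (the name, label, and ports are placeholders), a Service that fronts a set of backend pods selected by label might look like this:

    apiVersion: v1
    kind: Service
    metadata:
      name: backend              # becomes the stable DNS name backend.<namespace>.svc
    spec:
      selector:
        app: backend             # any pod carrying this label becomes an endpoint
      ports:
        - port: 8080             # port the Service exposes
          targetPort: 8080       # port the pods actually listen on

Clients then address "backend" rather than individual pod IPs, so pods can be restarted or rescheduled without breaking callers.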

OpenShift Container Platform DNS

To enable the frontend pods to communicate with the backend services when running multiple services, such as frontend and backend, environment variables are created for user names, service IPs, and more. To pick up the updated values for the service IP environment variable, the frontend pods must be recreated if the service is deleted and recreated. To ensure that the IP address for the backend service is generated correctly and that it can be passed to the frontend pods as an environment variable, the backend service must be created before any frontend pods.

Due to this, the OpenShift Container Platform has a built-in DNS, enabling the service to be reached by both the service DNS and the service IP/port. Split DNS is supported by the OpenShift Container Platform by running SkyDNS on the master, which answers DNS queries for services. By default, the master listens on port 53.

 

Networking Overview

1. Pod Networking:

In OpenShift, containers are encapsulated within pods, the smallest deployable units. Each pod has its own IP address and can communicate with other pods within the same project or across different projects. This enables seamless communication and collaboration between applications running on different pods.

2. Service Networking:

OpenShift introduces the concept of services, which act as stable endpoints for accessing pods. Services provide a layer of abstraction, allowing applications to communicate with each other without worrying about the underlying infrastructure. With service networking, you can easily expose your applications to the outside world and manage traffic efficiently.

3. Ingress and Egress:

OpenShift provides a robust routing infrastructure through its built-in Ingress Controller. It lets you define rules and policies for accessing your applications outside the cluster. To ensure seamless connectivity, you can easily configure routing paths, load balancing, SSL termination, and other advanced features.

4. Network Policies:

OpenShift enables fine-grained control over network traffic through network policies. You can define rules to allow or deny communication between pods based on their labels and namespaces. This helps in enforcing security measures and isolating sensitive workloads from unauthorized access.
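
As a hedged illustration using the standard Kubernetes NetworkPolicy format (namespace, labels, and port are placeholders), a policy that allows only frontend pods to reach backend pods on port 8080 could look like this:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-frontend-to-backend
    spec:
      podSelector:
        matchLabels:
          app: backend           # the pods this policy protects
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend  # only pods labeled app=frontend may connect
          ports:
            - protocol: TCP
              port: 8080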

5. Multi-Cluster Networking:

OpenShift allows you to connect multiple clusters, creating a unified networking fabric. This enables you to distribute your applications across different clusters, improving scalability and fault tolerance. You can easily manage and monitor your multi-cluster environment using OpenShift’s intuitive interface.

Benefits of OpenShift Networking:

– Scalability: OpenShift’s networking capabilities allow you to scale your applications horizontally by adding more pods or vertically by increasing the resources allocated to each pod.

– Security: With network policies and ingress/egress controls, you can enforce strict security measures and protect your applications from unauthorized access.

– High Availability: OpenShift’s multi-cluster networking enables you to distribute your applications across multiple clusters, ensuring high availability and resilience.

– Easy Management: OpenShift provides a user-friendly interface for managing and monitoring your networking configurations, making it easier for administrators to maintain and troubleshoot the network.

Key Challenges

We have several challenges with traditional data center networks that prove the inability to support today’s applications, such as microservices and containers. Therefore, we need a new set of networking technologies built into OpenShift SDN to deal adequately with today’s landscape changes.

Firstly, one of the main issues is that we have a tight coupling with all the networking and infrastructure components. With traditional data center networking, Layer 4 is coupled with the network topology at fixed network points and lacks the flexibility to support today’s containerized applications that are more agile than the traditional monolith application.

One of the main issues is that containers are short-lived and constantly spun down. Assets that support the application, such as IP addresses, firewalls, policies, and overlay networks that glue the connectivity, are continually recycled. These changes bring a lot of agility and business benefits, but there is an extensive comparison to a traditional network that is relatively static, where changes happen every few months.

OpenShift Networking

OpenShift is a powerful containerization platform that enables developers to build, deploy, and manage applications easily. One crucial aspect of OpenShift is its networking capabilities, which play a vital role in ensuring connectivity and communication between various components within the platform.

1. Overview of OpenShift Networking:
OpenShift networking provides a robust and scalable network infrastructure for applications running on the platform. It allows containers and pods to communicate with each other and external systems and services. The networking model in OpenShift is based on the Kubernetes networking model, providing a standardized and flexible approach.

2. Network Namespace Isolation:
OpenShift networking leverages network namespaces to achieve isolation between different projects, or namespaces, on the platform. Each project has its virtual network, ensuring that containers and pods within a project can communicate securely while isolated from other projects.

3. Service Discovery and OpenShift Load Balancer:
OpenShift networking provides service discovery and load-balancing mechanisms to facilitate communication between various components of an application. Services act as stable endpoints, allowing containers and pods to connect to them using DNS or environment variables. The built-in OpenShift load balancer ensures that traffic is distributed evenly across multiple instances of a service, improving scalability and reliability.
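
As a quick illustration of DNS-based service discovery, the minimal Python sketch below resolves a Service's cluster DNS name and opens a TCP connection to it. It assumes it is running inside a Pod in the cluster; the service name backend-api, the namespace my-project, and port 8080 are invented for the example.

```python
import socket

# Hypothetical Service name and namespace; inside the cluster, the DNS service
# resolves <service>.<namespace>.svc.cluster.local to the Service's ClusterIP.
SERVICE_DNS = "backend-api.my-project.svc.cluster.local"
SERVICE_PORT = 8080

# Resolve the stable Service address rather than any individual Pod IP.
addr_info = socket.getaddrinfo(SERVICE_DNS, SERVICE_PORT, proto=socket.IPPROTO_TCP)
ip, port = addr_info[0][4][:2]
print(f"Service {SERVICE_DNS} resolves to {ip}:{port}")

# Connect through the Service; the platform load-balances to a healthy backend Pod.
with socket.create_connection((ip, port), timeout=3) as conn:
    conn.sendall(b"GET /healthz HTTP/1.0\r\nHost: backend-api\r\n\r\n")
    print(conn.recv(1024).decode(errors="replace"))
```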

4. Ingress and Egress Network Policies:
OpenShift networking allows administrators to define ingress and egress network policies to control network traffic flow within the platform. Ingress policies specify rules for incoming traffic, allowing or denying access to specific services or pods. Egress policies, on the other hand, regulate outgoing traffic from pods, enabling administrators to restrict access to external systems or services.

5. Network Plugins and Providers:
OpenShift networking supports various network plugins and providers, allowing users to choose the networking solution that best fits their requirements. Some popular options include Open vSwitch (OVS), Flannel, Calico, and Multus. These plugins provide additional capabilities such as network isolation, advanced routing, and security features.

6. Network Monitoring and Troubleshooting:
OpenShift provides robust monitoring and troubleshooting tools to help administrators track network performance and resolve issues. The platform integrates with monitoring systems like Prometheus, allowing users to collect and analyze network metrics. Additionally, OpenShift provides logging and debugging features to aid in identifying and resolving network-related problems.

POD network

As a general rule, the same pod-to-pod communication model holds for all Kubernetes clusters: each Pod is assigned its own IP address. While Pods can communicate directly with each other by addressing those IP addresses, it is recommended that they use Services instead. A Service is a set of Pods reached through a single, fixed DNS name or IP address, and the majority of Kubernetes applications use Services to communicate. Since Pods can be restarted frequently, addressing them directly by name or IP is highly brittle; reaching another Pod through a Service is far more robust.

Simple pod-to-pod communication

The first thing to understand is how Pods communicate within Kubernetes. Kubernetes provides IP addresses for each Pod. IP addresses are used to communicate between pods at a very primitive level. Therefore, you can directly address another Pod using its IP address whenever needed.

From a networking perspective, a Pod behaves much like a virtual machine (VM): it has an IP address, exposes ports, and interacts with other Pods on the network via IP address and port.

What is the communication mechanism between the frontend pod and the backend pod? In a web application architecture, a front-end application is expected to talk to a backend, which could be an API or a database. In Kubernetes, the front and back end would be separated into two Pods.

The front end could be configured to communicate directly with the back end via its IP address. However, a front end would still need to know the backend’s IP address, which can be tricky when the Pod is restarted or moved to another node. Using a Service can make our solution less brittle.

Because the app still communicates with the API pods via the Service, which has a stable IP address, if the Pods die or need to be restarted, this won’t affect the app.

pod networking
Diagram: Pod networking. Source is tutorialworks

How do containers in the same Pod communicate?

Sometimes, you may need to run multiple containers in the same Pod. Containers in the same Pod share the same IP address, so localhost can be used to communicate between them. For example, a container in a Pod can use the address localhost:8080 to reach another container in the Pod listening on port 8080.

Because the IP address is shared and communication happens over localhost, two containers in the same Pod cannot listen on the same port. For instance, you can’t have two containers in the same Pod that both expose port 8080, so you must ensure the containers use different ports.
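
The following minimal Python sketch simulates the shared-localhost pattern in a single process: one thread plays the “app container” listening on 127.0.0.1:8080, and the main thread plays the sidecar calling localhost:8080. Port 8080 is just an example value.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"hello from the app container\n")


# "App container": bind to port 8080 in the Pod's shared network namespace.
server = HTTPServer(("127.0.0.1", 8080), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Sidecar container": no Service or Pod IP needed, localhost is shared.
with urllib.request.urlopen("http://localhost:8080/") as resp:
    print(resp.read().decode())

server.shutdown()
```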

In Kubernetes, pods can communicate with each other in a few different ways:

  1. Containers in the same Pod can connect over localhost, using the port number the other container exposes.
  2. A container in a Pod can connect to another Pod using its IP address. To find the IP address of a Pod, you can use oc get pods (a Python equivalent is sketched after this list).
  3. A container can connect to another Pod through a Service. A Service has an IP address and usually a DNS name, such as my-service.
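
Here is a rough Python equivalent of the oc get pods lookup, using the official Kubernetes Python client (which also works against OpenShift). The namespace my-project is an assumed example, and you need a kubeconfig or an in-cluster service account for it to run.

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a Pod
v1 = client.CoreV1Api()

# Print each Pod's name and its pod-network IP address.
for pod in v1.list_namespaced_pod(namespace="my-project").items:
    print(f"{pod.metadata.name:40s} {pod.status.pod_ip}")
```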

OpenShift and Pod Networking

When you initially deploy OpenShift, a private pod network is created. Each pod in your OpenShift cluster is assigned an IP address on the pod network, which is used to communicate with each pod across the cluster.

The pod network spans all nodes in your cluster and is extended to new application nodes as they are added. The pod network’s IP address range must not overlap with any network that OpenShift needs to communicate with: OpenShift’s internal routing follows the same rules as any network, and multiple destinations for the same IP address lead to confusion.

Endpoint Reachability

Then there is endpoint reachability. Not only have the endpoints changed, but so have the ways we reach them. The application stack previously had very few components, maybe just a cache, a web server, or a database. The most common network service simply allowed a source to reach an application endpoint, or to load balance across several endpoints using a load-balancing algorithm.

A simple round-robin or a load balancer that measured load was standard. Essentially, the sole purpose of the network was to provide endpoint reachability. However, changes inside the data center are driving networks and network services toward becoming more integrated with the application.

Nowadays, the network function exists no longer solely to satisfy endpoint reachability; it is fully integrated. In the case of Red Hat’s OpenShift, the network is represented as a Software-Defined Networking (SDN) layer. SDN means different things to different vendors. So, let me clarify in terms of OpenShift.

Highlighting software-defined network (SDN)

When you examine traditional networking devices, you see the control and forwarding planes, which are shared on a single device. The concept of SDN separates these two planes, i.e., the control and forwarding planes are decoupled. They can now reside on different devices, bringing many performance and management benefits.

The benefits of network integration and decoupling make it much easier for the applications to be divided into several microservice components driving the microservices culture of application architecture. You could say that SDN was a requirement for microservices.

software defined networking
Diagram: Software Defined Networking (SDN). Source is Opennetworking

Challenges to Docker Networking 

Port mapping and NAT

Docker containers have been around for a while, but their networking had significant drawbacks when they first came out. For example, Docker containers connect to a bridge on the node where the Docker daemon is running, which creates challenges of its own.

To allow network connectivity between those containers and any endpoint external to the node, we need to do some port mapping and Network Address Translation (NAT). This adds complexity.

Port Mapping and NAT have been around for ages. Introducing these networking functions will complicate container networking when running at scale. It is perfectly fine for 3 or 4 containers, but the production network will have many more endpoints to deal with. The origins of container networking are based on a simple architecture and primarily a single-host solution.

Docker at scale: The need for an orchestration layer

The core building blocks of containers, such as namespaces and control groups, are battle-tested. Although the docker engine manages containers by facilitating Linux Kernel resources, it’s limited to a single host operating system. Once you get past three hosts, networking is hard to manage. Everything needs to be spun up in a particular order, and consistent network connectivity and security, regardless of the mobility of the workloads, are also challenged.

Docker Default networking
Diagram: Docker Default networking

This led to an orchestration layer. Just as a container is an abstraction over the physical machine, the container orchestration framework is an abstraction over the network. This brings us to the Kubernetes networking model, which Openshift takes advantage of and enhances; for example, the OpenShift Route Construct exposes applications for external access.

We will be discussing OpenShift Routes and Kubernetes Services in just a moment.

Introduction to OpenShift

OpenShift Container Platform (formerly known as OpenShift Enterprise) or OCP is Red Hat’s offering for the on-premises private platform as a service (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution. The foundation of the OpenShift Container Platform is based on Kubernetes and, therefore, shares some of the same networking technology along with some enhancements.

Kubernetes is the leading container orchestration platform, and OpenShift builds on containers with Kubernetes as the orchestration layer. All of these elements lay upon an SDN layer that glues everything together. It is the role of SDN to create the cluster-wide network. And the glue that connects all the dots is the overlay network that operates over an underlay network. But first, let us address the Kubernetes Networking model of operation.

The Kubernetes model: Pod networking

As we discussed, the Kubernetes networking model was developed to simplify Docker container networking, which had drawbacks. It introduced the concept of Pod and Pod networking, allowing multiple containers inside a Pod to share an IP namespace. They can communicate with each other on IPC or localhost.

Nowadays, we are placing a single container into a pod, which acts as a boundary layer for any cluster parameters directly affecting the container. So, we run deployment against pods and not containers.

In OpenShift, we can assign networking and security parameters to Pods that will affect the container inside. When an app is deployed on the cluster, each Pod gets an IP assigned, and each Pod could have different applications.

For example, Pod 1 could host a web front end and Pod 2 a database, so the Pods need to communicate. For this, we need a network and IP addresses. By default, Kubernetes allocates an internal IP address to each Pod for the applications running within it. Pods and their containers can network, but clients outside the cluster cannot access internal cluster resources by default. With Pod networking, every Pod must be able to communicate with every other Pod in the cluster without Network Address Translation (NAT).

OpenShift Network Policy
Diagram: OpenShift Network Policy.

A typical service type: ClusterIP

The most common service type is ClusterIP. A ClusterIP is a persistent virtual IP address used for load-balancing traffic internal to the cluster. Services of this type cannot be accessed directly from outside the cluster; other service types exist for that requirement.

The ClusterIP service type handles East-West traffic, since the traffic originates from Pods running in the cluster and is destined for a service IP backed by Pods that also run in the cluster.

Then, to enable external access to the cluster, we need to expose the services that the Pod or Pods represent, and this is done with an Openshift Route that provides a URL. So, we have a service running in front of the pod or groups of pods. The default is for internal access only. Then, we have a URL-based route that gives the internal service external access.
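
To make the ClusterIP idea concrete, here is a hedged sketch that defines a ClusterIP Service in front of Pods labelled app=web using the Kubernetes Python client. The Service name, label selector, namespace, and ports are all illustrative assumptions.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="web-frontend"),
    spec=client.V1ServiceSpec(
        type="ClusterIP",                       # internal, East-West only
        selector={"app": "web"},                # Pods backing this Service
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

created = v1.create_namespaced_service(namespace="my-project", body=service)
print("ClusterIP assigned:", created.spec.cluster_ip)
```

On OpenShift, you would then typically give this internal Service external access with a Route, for example with oc expose service web-frontend.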

Openshift load balancer
Diagram: Openshift networking and clusterIP. Source is Redhat.

Using an OpenShift Load Balancer

Get Traffic into the Cluster

OpenShift Container Platform clusters can be accessed externally through an OpenShift load balancer service if you do not need a specific external IP address. The OpenShift load balancer allocates unique IP addresses from configured pools. Load balancers have a single edge router IP (which can be a virtual IP (VIP), but it is still a single machine for initial load balancing). How many OpenShift load balancers are there in OpenShift?

Two load balancers

The solution supports several load balancer configuration options:

  • Use the playbooks to configure two load balancers for highly available production deployments.
  • Use the playbooks to configure a single load balancer, which is helpful for proof-of-concept deployments.
  • Deploy the solution using your own OpenShift load balancer.

This process involves the following:

  1. The administrator performs the prerequisites.
  2. The developer creates a project and service if the service to be exposed does not exist.
  3. The developer exposes the service to create a route.
  4. The developer creates the Load Balancer Service (see the sketch after this list).
  5. The network administrator configures networking to the service.
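
As a sketch of steps 4 and 5, the snippet below polls a Service of type LoadBalancer until the platform allocates an external address from the configured pool. The Service name egress-lb and namespace my-project are assumptions for illustration.

```python
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for _ in range(30):
    svc = v1.read_namespaced_service(name="egress-lb", namespace="my-project")
    ingress = svc.status.load_balancer.ingress or []
    if ingress:
        # The platform has allocated an address (or hostname) for external clients.
        print("External address:", ingress[0].ip or ingress[0].hostname)
        break
    time.sleep(5)   # wait for the load balancer pool to allocate an address
else:
    print("No external address allocated yet")
```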

OpenShift load balancer: Different Openshift SDN networking modes

OpenShift security best practices  

Depending on your OpenShift SDN configuration, you can tailor the network topology in different ways. You can have free-for-all Pod connectivity, similar to a flat network, or something stricter with distinct security boundaries and restrictions. Free-for-all Pod connectivity between all projects might be fine for a lab environment.

Still, for production networks with multiple projects, you may need to tailor the network with segmentation, which can be done with one of the OpenShift SDN plugins. We will get to this in a moment.

OpenShift does this with an SDN layer that enhances Kubernetes networking by creating a virtual network across all the nodes. With the OpenShift SDN, this Pod network is established and maintained by the SDN itself, which configures an overlay network using Open vSwitch (OVS).

The OpenShift SDN plugin

We mentioned that you could tailor the virtual network topology to suit your networking requirements. The OpenShift SDN plugin and the SDN model you select can determine this. With the default OpenShift SDN, several modes are available.

This level of SDN mode you choose is concerned with managing connectivity between applications and providing external access to them.

Some modes are more fine-grained than others. How are all these plugins enabled? The Openshift Container Platform (OCP) networking relies on the Kubernetes CNI model while supporting several plugins by default and several commercial SDN implementations, including Cisco ACI. The native plugins rely on the virtual switch Open vSwitch and offer alternatives to providing segmentation using VXLAN, specifically the VNID or the Kubernetes Network Policy objects:

We have, for example:

        • ovs-subnet  
        • ovs-multitenant  
        • ovs-networkpolicy

Choosing the right plugin depends on your security and control goals. As SDN takes over networking, third-party vendors are developing programmable network solutions, and Red Hat has tightly integrated OpenShift with products from several of these providers. According to Red Hat, the following solutions are production-ready:

  1. Nokia Nuage
  2. Cisco Contiv
  3. Juniper Contrail
  4. Tigera Calico
  5. VMWare NSX-T

ovs-subnet plugin

After OpenShift is installed, this plugin is enabled by default. As a result, pods can connect across the entire cluster without limitations, so traffic can flow freely between them. This may be undesirable if security is a top priority in large multitenant environments.

ovs-multitenant plugin

Security is usually unimportant in PoCs and sandboxes but becomes paramount when large enterprises have diverse teams and project portfolios, especially when third parties develop specific applications. A multitenant plugin like ovs-multitenant is an excellent choice if simply separating projects is all you need.

This plugin sets up flow rules on the br0 bridge to ensure that only traffic between pods with the same VNID is permitted, unlike the ovs-subnet plugin, which passes all traffic across all pods. It also assigns the same VNID to all pods for each project, keeping them unique across projects.

ovs-networkpolicy plugin

While the ovs-multitenant plugin provides a simple and largely adequate means for managing access between projects, it does not allow granular control over access. In this case, the ovs-networkpolicy plugin can be used to create custom NetworkPolicy objects that, for example, apply restrictions to traffic egressing or entering the network.
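
Below is a hedged sketch of the kind of NetworkPolicy object this plugin enforces: it permits ingress to Pods labelled app=db on TCP/5432 only from Pods labelled role=frontend, implicitly denying other traffic selected by the policy. The labels, names, namespace, and port are invented for the example, and the object is created here with the Kubernetes Python client.

```python
from kubernetes import client, config

config.load_kube_config()
net_v1 = client.NetworkingV1Api()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-frontend-to-db"),
    spec=client.V1NetworkPolicySpec(
        # The policy applies to the database Pods in this namespace.
        pod_selector=client.V1LabelSelector(match_labels={"app": "db"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                # Only frontend Pods may connect, and only on TCP/5432.
                _from=[client.V1NetworkPolicyPeer(
                    pod_selector=client.V1LabelSelector(
                        match_labels={"role": "frontend"}))],
                ports=[client.V1NetworkPolicyPort(protocol="TCP", port=5432)],
            )
        ],
    ),
)

net_v1.create_namespaced_network_policy(namespace="my-project", body=policy)
```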

Egress routers

In OpenShift, routers direct ingress traffic from external clients to services, which then forward it to pods. OpenShift also offers a reverse type of router for forwarding egress traffic from pods out to external networks. These egress routers are implemented using Squid instead of HAProxy. Routers with egress capabilities can be helpful in the following situations:

  • Masking the different external resources used by several applications behind a single global resource. For example, applications may be built pulling dependencies from different mirrors, and collaboration between their development teams may be rather loose. So, instead of getting the teams to use the same mirror, an operations team can set up an egress router to intercept all traffic directed to those mirrors and redirect it to the same site.
  • Redirecting all suspicious requests for specific sites to an audit system for further analysis.

OpenShift supports the following types of egress routers:

  • redirect for redirecting traffic to a specific destination IP
  • http-proxy for proxying HTTP, HTTPS, and DNS traffic

Summary: OpenShift Networking

In the ever-evolving world of cloud computing, Openshift has emerged as a robust application development and deployment platform. One crucial aspect that makes it stand out is its networking capabilities. In this blog post, we delved into the intricacies of Openshift networking, exploring its key components, features, and benefits.

Section 1: Understanding Openshift Networking Fundamentals

Openshift networking operates on a robust and flexible architecture that enables efficient communication between various components within a cluster. It utilizes a combination of software-defined networking (SDN) and network overlays to create a scalable and resilient network infrastructure.

Section 2: Exploring Networking Models in Openshift

Openshift offers different networking models to suit various deployment scenarios. The most common models include Single-Stack Networking, Dual-Stack Networking, and Multus CNI. Each model has its advantages and considerations, allowing administrators to choose the most suitable option for their specific requirements.

Section 3: Deep Dive into Openshift SDN

At the core of Openshift networking lies the Software-Defined Networking (SDN) solution. It provides the necessary tools and mechanisms to manage network traffic, implement security policies, and enable efficient communication between Pods and Services. We will explore the inner workings of Openshift SDN, including its components like the SDN controller, virtual Ethernet bridges, and IP routing.

Section 4: Network Policies in Openshift

To ensure secure and controlled communication between Pods, Openshift implements Network Policies. These policies define rules and regulations for network traffic, allowing administrators to enforce fine-grained access controls and segmentation. We will discuss the concept of Network Policies, their syntax, and practical examples to showcase their effectiveness.

Conclusion:

Openshift’s networking capabilities play a crucial role in enabling seamless communication and connectivity within a cluster. By understanding the fundamentals, exploring different networking models, and harnessing the power of SDN and Network Policies, administrators can leverage Openshift’s networking features to build robust and scalable applications.

In conclusion, Openshift networking opens up a world of possibilities for developers and administrators, empowering them to create resilient and interconnected environments. By diving deep into its intricacies, one can unlock the full potential of Openshift networking and maximize the efficiency of their applications.

Auto Scaling Observability

Observability in the context of autoscaling is a crucial aspect of managing and optimizing the scalability and efficiency of modern applications. This blog post will delve into autoscaling observability and its significance in today’s dynamic and rapidly evolving technological landscape.

 

Highlights: Auto Scaling Observability

  • The Role of the Metric

What is a metric? Metrics are good for the “known.” When it comes to auto-scaling observability and auto-scaling metrics, one needs to understand the downfall of the metric. A metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are disposable and cheap and have a predictable storage footprint.

A metric is a numerical representation of a system state over the recorded time interval and can tell you if a particular resource is over or underutilized at a specific moment. For example, CPU utilization might be at 75% right now.

  • Prometheus Pull Approach

There are many tools for gathering metrics, such as Prometheus, along with several techniques for collecting them, such as the PUSH and PULL approaches. There are pros and cons to each method, but Prometheus metric types and its PULL approach are prevalent in the market. Remember, however, that Prometheus is solely a metrics-based monitoring solution, so full observability and controllability require more than metrics alone. For additional information on Monitoring and Observability and their differences, visit this post on observability vs monitoring.
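
As a small illustration of the PULL model, the sketch below uses the prometheus_client library to expose a /metrics endpoint that a Prometheus server can scrape on its own schedule. The metric names, port, and values are illustrative assumptions.

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

cpu_utilization = Gauge("app_cpu_utilization_percent",
                        "Current CPU utilization of the service")
requests_total = Counter("app_requests_total",
                         "Total requests handled by the service")

# Prometheus pulls from http://<host>:8000/metrics at its scrape interval.
start_http_server(8000)

while True:
    cpu_utilization.set(random.uniform(20, 90))   # e.g., 75% right now
    requests_total.inc()
    time.sleep(5)
```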

 

Related: Before you proceed, you may find the following helpful

  1. Load Balancing
  2. Microservices Observability
  3. Network Functions
  4. Distributed Systems Observability

 



Auto Scaling Metrics

Key Auto Scaling Observability Discussion points:


  • Metrics are good for "known" issues. 

  • Challenges and issues around metrics for monitoring.

  • Observability considerations.

  • No need to predict.

  • Used for unknown / unknown failure modes.

 

Back to basics with Auto Scaling Observability

Understanding Autoscaling

Before we dive into observability, let’s briefly explore the concept of autoscaling. Autoscaling refers to the ability of an application or infrastructure to adjust its resources based on demand automatically. It enables organizations to handle fluctuating workloads and optimize resource allocation efficiently.

Observability, in the context of autoscaling, refers to gaining insights into an autoscaling system’s performance, health, and efficiency. It involves collecting, analyzing, and visualizing relevant data to understand the behavior and patterns of the application and infrastructure. Organizations can make informed decisions to optimize autoscaling algorithms, resource allocation, and overall system performance through observability.

Observability

Main Auto Scaling Observability Components

Auto Scaling Observability

  • Metrics and Monitoring

  • Logging and Tracing

  • Alerting and Thresholds

  • Numerous Tools and Platforms

Critical Components of Autoscaling Observability

To achieve effective autoscaling observability, several critical components come into play. These include:

Metrics and Monitoring: Gathering and monitoring key metrics such as CPU utilization, response times, request rates, and error rates are fundamental for understanding the performance of the application and infrastructure.

Logging and Tracing: Logging captures detailed information about events and transactions within the system, while tracing provides insights into the flow of requests across various components. Both logging and tracing contribute to a comprehensive understanding of system behavior.

Alerting and Thresholds: Setting up appropriate alerts and thresholds based on predefined criteria ensures timely notifications when specific conditions are met. This allows teams to react quickly to performance degradation or scaling events.
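
Here is a toy sketch of threshold-based alerting: the alert fires only when a metric stays above a threshold for several consecutive samples, which avoids paging on a single spike. The threshold, window size, and sample values are invented for illustration.

```python
from collections import deque

THRESHOLD = 80.0   # percent CPU, an assumed example value
WINDOW = 3         # consecutive samples required before alerting
recent = deque(maxlen=WINDOW)


def observe(cpu_percent: float) -> None:
    recent.append(cpu_percent)
    if len(recent) == WINDOW and all(v > THRESHOLD for v in recent):
        print(f"ALERT: CPU above {THRESHOLD}% for {WINDOW} samples: {list(recent)}")


for sample in [65.0, 83.0, 91.0, 88.0, 86.0, 70.0]:
    observe(sample)
```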

Tools and Technologies for Autoscaling Observability

A wide range of tools and technologies are available to facilitate autoscaling observability. Prominent examples include Prometheus, Grafana, Elasticsearch, Kibana, and CloudWatch. These tools provide robust monitoring, visualization, and analysis capabilities, enabling organizations to gain deep insights into their autoscaling systems.

The first component of observability is the set of channels that convey observations to the observer. There are three channels: logs, traces, and metrics. These channels are common to all areas of observability, including data observability.

  • Logs

Logs are the most typical channel and take several forms (e.g., lines of free text or JSON). Logs are intended to encapsulate information about an event.

  • Traces

Traces allow you to do what logs don’t—reconnect the dots of a process. Because traces represent the link between all events of the same process, they allow the whole context to be derived from logs efficiently. Each pair of events, an operation, is a span that can be distributed across multiple servers.

  • Metrics

Finally, we have metrics. Every system state has some component that can be represented with numbers, and these numbers change as the state changes. Metrics give an observer factual information and also allow mathematical methods to be applied to derive insight from even a large number of metrics (e.g., the CPU load, the number of open files, the average number of rows, the minimum date).

 

Auto scaling observability
Auto scaling observability: Metric Overload

 

Auto Scaling Observability

Metrics: Resource Utilization Only

So, metrics tell us about resource utilization. Within a Kubernetes environment, these metrics are used for auto-healing and auto-scheduling purposes. When it comes to metrics, monitoring performs several functions. First, it can collect, aggregate, and analyze metrics to sift through known patterns that indicate troubling trends.

The critical point here is that it sifts through known patterns. Then, based on a known event, metrics trigger alerts that notify us when further investigation is needed. Finally, on top of all of this, we have dashboards that display the metrics data trends, adapted for visual consumption.

These monitoring systems work well for identifying previously encountered, known failures but don’t help as much with the unknown. Unknown failures are the norm today with distributed systems and complex system interactions.

Metrics are suitable for dashboards, but there won’t be a predefined dashboard for unknowns as it can’t track something it does not know about. Using metrics and dashboards like this is a very reactive approach. Yet, it’s an approach widely accepted as the norm. Monitoring is a reactive approach best suited for detecting known problems and previously identified patterns. 

Metrics and intermittent problems?

So, the metrics can help you when the microservice is healthy or unhealthy within a microservices environment. Still, a metric will have difficulty telling you if a microservices function takes a long time to complete or if there is an intermittent problem with an upstream or downstream dependency. So, we need different tools to gather this type of information.

We have an issue with auto-scaling metrics because they only look at individual microservices with a given set of attributes. So, they don’t give you a holistic view of the problem. For example, the application stack now exists in numerous locations and location types; we need a holistic viewpoint.

And a metric does not give this. For example, metrics are used to track simplistic system states that might indicate a service may be running poorly or may be a leading indicator or an early warning signal. However, while those measures are easy to collect, they don’t turn out to be proper measures for triggering alerts.

Auto-scaling metrics: Issues with dashboards: Useful only for a few metrics

So, these metrics are gathered and stored in time-series databases, and we have several dashboards to display them. When these dashboards were first built, there weren’t many system metrics to worry about; you could have gotten away with 20 or so dashboards, but that was about it. As a result, it was easy to see the critical data anyone should know about for any given service. Moreover, those systems were simple and did not have many moving parts. This contrasts with modern services, which typically collect so many metrics that fitting them all into the same dashboard is impossible.

Auto-scaling metrics: Issues with aggregate metrics

So, we must find ways to fit all the metrics into a few dashboards. Here, the metrics are often pre-aggregated and averaged. The issue is that the aggregate values no longer provide meaningful visibility, even when we have filters and drill-downs. We are forced to predeclare the conditions we expect to see in the future.

This is where we fall back on instinct, past experience, and gut feeling. Remember the network and software hero? Avoid aggregation and averaging within the metrics store. Percentiles, on the other hand, offer a much richer view. Keep in mind, however, that they require the raw data.
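
The quick sketch below, using only the Python standard library, shows why averaged aggregates hide problems while percentiles computed from raw data expose them: the mean of the made-up latencies looks healthy, while p99 reveals the pathological request.

```python
import random
import statistics

random.seed(7)
# 99 fast requests plus one pathological outlier (values in milliseconds).
latencies_ms = [random.uniform(10, 20) for _ in range(99)] + [950.0]

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)      # needs the raw samples
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
# The averaged value looks healthy (~24 ms) while p99 exposes the slow request.
```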

Auto Scaling Observability: Any Question

Auto-scaling observability takes an entirely different approach, favoring exploratory methods to find problems. Essentially, those operating observability systems don’t sit back and wait for an alert or for something to happen. Instead, they are always actively looking, asking arbitrary questions of the observability system.

Observability tools should gather rich telemetry for every possible event, capturing the full content of every request, and then be able to store and query it. In addition, these new auto-scaling observability tools are specifically designed to query against high-cardinality data. High cardinality allows you to interrogate your event data in any arbitrary way you see fit: you can ask any question about your system and inspect its corresponding state.
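
A toy sketch of the idea: capture one wide, structured event per request with high-cardinality fields, then ask an arbitrary, ad-hoc question of that data after the fact. All field names and values here are invented.

```python
# One wide event per request, with high-cardinality fields such as user ID,
# build version, region, and latency.
events = [
    {"endpoint": "/checkout", "user_id": "u-10493", "build": "2024.06.1",
     "region": "eu-west-1", "duration_ms": 1840, "status": 500},
    {"endpoint": "/checkout", "user_id": "u-22871", "build": "2024.06.1",
     "region": "us-east-1", "duration_ms": 95, "status": 200},
    {"endpoint": "/search", "user_id": "u-10493", "build": "2024.05.9",
     "region": "eu-west-1", "duration_ms": 120, "status": 200},
]

# An arbitrary question no dashboard was pre-built for:
# "Which slow or failing checkout requests came from build 2024.06.1?"
answer = [e for e in events
          if e["endpoint"] == "/checkout"
          and e["build"] == "2024.06.1"
          and (e["duration_ms"] > 1000 or e["status"] >= 500)]

for e in answer:
    print(e)
```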

 

Key Auto Scaling Observability Considerations

No predictions in advance.

Due to the nature of modern software systems, you want to understand any inner state and services without anticipating or predicting them in advance. For this, we need to gain valuable telemetry and use some new tools and technological capabilities to gather and interrogate this data once it has been collected. Telemetry needs to be constantly gathered in flexible ways to debug issues without predicting how failures may occur. 

The conditions affecting infrastructure health change infrequently and are relatively easy to monitor. In addition, we have several well-established predictive practices, such as capacity planning, and the ability to remediate automatically, e.g., auto-scaling in a Kubernetes environment, all of which can be used to tackle these types of known issues.

Auto Scaling Observability
Diagram: Auto Scaling Observability and Observability tools.

 

Because infrastructure is relatively predictable and slow to change, the aggregated metrics approach works perfectly for monitoring and alerting on infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated signals help you see when capacity limits or known error conditions of underlying systems are being reached.

So, metrics-based systems work well for infrastructure problems that don’t change much but fall dramatically short in complex distributed systems. You should opt for an observability and controllability platform for these systems. 

 

Summary: Understanding Autoscaling

Autoscaling is a mechanism that automatically adjusts the number of computing resources allocated to an application based on its demand. By dynamically scaling resources up or down, autoscaling enables organizations to handle fluctuating workloads efficiently. However, to truly harness the power of autoscaling, it is crucial to have robust observability in place.

Section 1: The Role of Observability in Autoscaling

Observability is the ability to gain insights into the internal state of a system based on its external outputs. Observability plays a pivotal role in understanding the system’s behavior, identifying bottlenecks, and making informed scaling decisions when it comes to autoscaling. It provides visibility into key metrics like CPU utilization, memory usage, and network traffic. With observability, you can make data-driven decisions and ensure optimal resource allocation.

Section 2: Monitoring and Metrics

To achieve effective autoscaling observability, comprehensive monitoring is essential. Monitoring tools collect various metrics, such as response times, error rates, and resource utilization, to provide a holistic view of your infrastructure. These metrics can be analyzed to identify patterns, detect anomalies, and trigger autoscaling actions when necessary. You can proactively address performance issues and optimize resource utilization by monitoring and analyzing metrics.

Section 3: Logging and Tracing

In addition to monitoring, logging and tracing are critical components of autoscaling observability. Logging captures detailed information about system events, errors, and activities, enabling you to troubleshoot issues and gain insights into system behavior. Tracing helps you understand the flow of requests across different services. Together, logging and tracing provide a granular view of your application’s performance, aiding autoscaling decisions and ensuring smooth operation.

Section 4: Automation and Alerting

To truly master autoscaling observability, automation and alerting mechanisms are vital. By setting up automated processes, you can configure thresholds and triggers that initiate autoscaling actions based on predefined conditions. This allows for proactive scaling, ensuring your system is constantly optimized for performance. Additionally, timely alerts can notify you of critical events or anomalies, enabling you to take immediate action and maintain the desired scalability.

Conclusion:

Autoscaling observability is the key to unlocking the true potential of autoscaling. By understanding the behavior of your system through comprehensive monitoring, logging, and tracing, you can make informed decisions and ensure optimal resource allocation. With automation and alerting mechanisms in place, you can proactively respond to changing demands and maintain high efficiency. Embrace autoscaling observability and take your infrastructure management to new heights!

 

ACI Networks

In today’s fast-paced digital landscape, reliable and efficient network connectivity is crucial for businesses of all sizes. As technology advances, traditional network infrastructures often struggle to meet growing demands. However, a game-changing solution is transforming how companies operate and communicate – ACI Networks.

ACI, or application-centric infrastructure, is a cutting-edge networking architecture focusing on application requirements rather than traditional network infrastructure. It provides a holistic and programmable approach to network management, enabling businesses to achieve unprecedented agility, scalability, and security. By leveraging software-defined networking (SDN) principles, ACI networks centralize control, simplify network operations, and enhance overall performance.

 

Highlights: ACI Networks

  • The Traditional Data Center 

Firstly, Cisco data center designs were traditionally built on hierarchical topologies. This is often referred to as the traditional data center: a three-tier structure with an access layer, an aggregation layer, and a core layer. Historically, this design enabled substantial predictability because aggregation switch blocks simplified the spanning-tree topology. In addition, the need for scalability often pushed this design toward modularity, which increased predictability further.

  • The Challenges

However, although predictability increased, the main challenge inherent in the three-tier model is that it is difficult to scale. As the number of endpoints grows and workloads need to move between segments, we are forced to span Layer 2 across the data center. This is a significant difference between the traditional data center and the ACI data center.

 

Related: For pre-information, you may find the following post helpful:

  1. Data Center Security 

 



ACI Networks


Key ACI Networks Discussion Points:


  • Design around issues with Spanning Tree Protocol.

  • Layer 2 all the way to the Core.

  • Routing at the Access layer.

  • The changes from ECMP.

  • ACI networks and normalization.

  • Leaf and Spine designs.

 

Back to basics: ACI Networks

Critical Benefits of ACI Networks

Cisco ACI 

Main ACI Networks Components

ACI Networks

  • Enhanced Scalability and Flexibility

  • Simplified Network Operations

  • Enhanced Security

  • Data Center and Network Virtualization

Enhanced Scalability and Flexibility:

One of the critical advantages of ACI networks is their ability to scale and adapt to changing business needs. Traditional networks often struggle to accommodate rapid growth or dynamic workloads, leading to performance bottlenecks. ACI networks, on the other hand, offer seamless scalability and flexibility, allowing businesses to quickly scale up or down as required without compromising performance or security.

Simplified Network Operations:

Gone are the days of manual network configurations and time-consuming troubleshooting. ACI networks introduce a centralized management approach, where policies and structures can be defined and automated across the entire network infrastructure. This simplifies network operations, reduces human errors, and enables IT teams to focus on strategic initiatives rather than mundane tasks.

Enhanced Security:

In today’s threat landscape, network security is paramount. ACI networks integrate security as a foundational element rather than an afterthought. With ACI’s microsegmentation capabilities, businesses can create granular security policies and isolate workloads, effectively containing potential threats and minimizing the impact of security breaches. This approach ensures that critical data and applications remain protected despite evolving cyber threats.

Real-World Use Cases of ACI Networks

Data Centers and Cloud Environments:

ACI networks have revolutionized data center and cloud environments, enabling businesses to achieve unprecedented agility and efficiency. By providing a unified management platform, ACI networks simplify data center operations, enhance workload mobility, and optimize resource utilization. Furthermore, ACI’s seamless integration with cloud platforms ensures consistent network policies and security across hybrid and multi-cloud environments.

Network Virtualization and Automation:

ACI networks are a game-changer for network virtualization and automation. By abstracting network functionality from physical hardware, ACI enables businesses to create virtual networks, provision services on-demand, and automate network operations. Streamlining network deployments accelerates service delivery, reduces costs, and improves overall performance.

 

The Traditional Data Center

Our journey towards ACI started in the early 1990s, looking at the most traditional and well-known two- or three-layer network architecture. This Core/Aggregation/Access design was generally used and recommended for campus enterprise networks.

At that time and in that environment, it delivered sufficient quality for typical client-server types of applications. The traditional design taken from campus networks was based on Layer 2 connectivity between all network parts, segmentation was implemented using VLANs, and the loop-free topology relied on the Spanning Tree Protocol (STP).

Scaling such an architecture implies the growth of broadcast and failure domains, which is anything but beneficial for the resulting performance and stability. For instance, picture each STP Topology Change Notification (TCN) message causing MAC table aging across the whole data center for a particular VLAN, followed by excessive BUM (Broadcast, Unknown Unicast, Multicast) traffic flooding until all MACs are relearned.

 

Designing around STP

Before we delve into the Cisco ACI overview, let us first address some basics of STP design. The traditional Cisco data center design often leads to poor network design and human error. You don’t want a Layer 2 segment stretched across the data center unless you have the proper controls.

Although modularization is still desired in networks today, the general trend has been to move away from this design type that evolves around spanning tree to a more flexible and scalable solution with VXLAN and other similar Layer 3 overlay technologies. In addition, the Layer 3 overlay technologies bring a lot of network agility, which is vital to business success.

Agility refers to making changes, deploying services, and supporting the business at its desired speed. This means different things to different organizations. For example, a network team can be considered agile if it can deploy network services in a matter of weeks.

In others, it could mean that business units in a company should be able to get applications to production or scale core services on demand through automation with Ansible CLI or Ansible Tower.

Regardless of how you define agility, there is little disagreement with the idea that network agility is vital to business success. The problem is that network agility has traditionally been hard to achieve until now with the ACI data center. Let’s recap some of the leading Cisco data center design transitions to understand fully.

 

Cisco data center design
Diagram: Cisco data center design transformation.

 

Cisco ACI Overview: The Need for ACI Networks

Layer 2 to the Core

The traditional data center has gone through several transitions on its way to SDN. Firstly, we had Layer 2 to the core: from the access layer to the core, everything ran at Layer 2 rather than Layer 3. A design like this would, for example, trunk all VLANs to the core, and for redundancy, you would manually prune VLANs from the different trunk links.

The challenge with this Layer 2 to the core approach is that it relies on the Spanning Tree Protocol, so redundant links are blocked. As a result, we don’t get the total available bandwidth, leading to performance degradation and wasted resources. Another challenge is relying on Spanning Tree convergence to repair the topology after a change.

Data center design and stability: Layer 2 to the core

  • Layer 2 to the Core layer
  • STP blocks redundant links
  • Manual pruning of VLANs
  • STP for topology changes

Spanning Tree Protocol does have timers to limit the convergence and can be tuned for better performance. Still, we rely on the convergence from Spanning Tree Protocol to fix the topology, but Spanning Tree Protocol was never meant to be a routing protocol.

Protocols operating higher up in the stack are designed to react to topology changes far more efficiently. STP, by contrast, is not an optimized control plane protocol, and this significantly hinders the traditional data center. You could relate this to how VLANs have transitioned into a security feature, even though their original purpose was performance.

Routing to Access Layer

To overcome these challenges and build stable data center networks, the Layer 3 boundary is pushed further toward the network’s edge. Layer 3 networks can use the advances in routing protocols to handle failures and link redundancy much more efficiently.

This is a lot more efficient than the Spanning Tree Protocol, which should never have played that role in the first place. So we moved to routing at the access layer. With this design, we can eliminate the Spanning Tree Protocol toward the core and instead run Equal Cost MultiPath (ECMP) from the access to the core.

We can run ECMP because we are now routing at Layer 3 from the access to the core layer instead of running STP, which blocks redundant links. Equal-cost multipath (ECMP) routes offer a simple way to share the network load by distributing traffic onto multiple paths.

ECMP is typically applied only to entire flows or sets of flows. Destination address, source address, transport level ports, and payload protocol may characterize a flow in this respect.

Data center design and stability: Layer 3 to the core

  • Layer 3 to the Core layer
  • Routing protocol stability
  • Automatic routing convergence
  • Efficient design

  • A Key Point: Equal Cost MultiPath (ECMP)

Equal Cost MultiPath (ECMP) brings many advantages; firstly, ECMP gives us total bandwidth with equal-cost links. As we are routing, we no longer have to block redundant links to prevent loops at Layer 2. However, we still have Layer 2 in the network design and Layer 2 on the access layer; therefore, parts of the network will still rely on the Spanning Tree Protocol, which converges when there is a change in the topology.

So we may have Layer 3 from the access to the core, but we still have Layer 2 connections at the edge and rely on STP to block redundant links to prevent loops. Another potential drawback is that having smaller Layer 2 domains can limit where the application can reside in the data center network, which drives more of a need to transition from the traditional data center design.

 

data center network design
Diagram: Data center network design: Equal cost multipath.

 

The Layer 2 domain available to the applications could be limited to a single server rack connected to one ToR switch, or to two ToR switches for redundancy, with a Layer 2 interlink between the two to pass Layer 2 traffic.

These designs are not optimal, as you must pin down where your applications can be placed, putting the brakes on agility. As a result, there was another critical Cisco data center design transition: the introduction of overlay data center designs.

 

Cisco ACI Overview

Cisco data center design: The rise of virtualization

Virtualization is the creation of a virtual, rather than physical, version of something such as an operating system (OS), a server, a storage device, or network resources. It uses software that simulates hardware functionality to create a virtual system and was initially developed during the mainframe era.

With virtualization, a virtual machine can exist on any host. As a result, Layer 2 had to be extended to every switch. This was problematic for larger networks, as the core switch had to learn every MAC address for every flow that traversed it. To overcome this and take advantage of the convergence and stability of Layer 3 networks, overlay networks became the choice for data center networking, along with control plane technologies such as EVPN.

VXLAN
Diagram: Changing the VNI

Overlay networking with VXLAN

VXLAN is an encapsulation protocol that provides data center connectivity using tunneling to stretch Layer 2 connections over an underlying Layer 3 network. VXLAN is the most commonly used protocol in data centers to create a virtual overlay solution that sits on top of the physical network, enabling virtual networks. The VXLAN protocol supports the virtualization of the data center network while addressing the needs of multi-tenant data centers by providing the necessary segmentation on a large scale.

Here, we are encapsulating traffic into a VXLAN header and forwarding between VXLAN tunnel endpoints, known as the VTEPs. With overlay networking, we have the overlay and the underlay concept. By encapsulating the traffic into the overlay VXLAN, we now use the underlay, which in the ACI is provided by IS-IS, to provide the Layer 3 stability and redundant paths using Equal Cost Multipathing (ECMP) along with the fast convergence of routing protocols.
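
To make the encapsulation concrete, here is a minimal Python sketch of the VXLAN header defined in RFC 7348: an 8-byte header carrying a 24-bit VNI (with the I flag set) is prepended to the original Layer 2 frame, and the result is carried in UDP (destination port 4789) between VTEPs. The VNI value and the placeholder frame bytes are arbitrary.

```python
import struct

VXLAN_UDP_PORT = 4789   # well-known VXLAN destination port


def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Byte 0: flags (I bit = 0x08), bytes 1-3: reserved,
    # bytes 4-6: 24-bit VNI, byte 7: reserved.
    header = struct.pack("!B3s3sB",
                         0x08,                      # flags: I bit set (VNI valid)
                         b"\x00\x00\x00",           # reserved
                         vni.to_bytes(3, "big"),    # 24-bit VNI
                         0)                         # reserved
    return header + inner_frame


inner_frame = bytes(64)             # placeholder for an original Ethernet frame
packet = vxlan_encapsulate(inner_frame, vni=10010)
print(len(packet), packet[:8].hex())   # 8-byte VXLAN header + original frame
```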

 


 

The Cisco Data Center Design Transition

The Cisco data center design has gone through several stages when you think about it. First, we started with Spanning Tree, moved to Spanning Tree with vPCs, and then replaced Spanning Tree with FabricPath, which is a MAC-in-MAC encapsulation.

Then we moved to VXLAN, a MAC-in-IP encapsulation. Today, VXLAN is the de facto overlay protocol for data center networking. The Cisco ACI uses an enhanced version of VXLAN to implement both Layer 2 and Layer 3 forwarding with a unified control plane. Replacing Spanning Tree with VXLAN and its MAC-in-IP encapsulation was a welcome milestone for data center networking.

VXLAN multicast mode
Diagram: VXLAN multicast mode

 

Cisco ACI Overview: Introduction to the ACI Networks

The base of the ACI network is the Cisco Application Centric Infrastructure Fabric (ACI)—the Cisco SDN solution for the data center. Cisco has taken a different approach from the centralized control plane SDN approach with other vendors and has created a scalable data center solution that can be extended to multiple on-premises, public, and private cloud locations.

An ACI network has many components, including Cisco Nexus 9000 Series switches and the APIC controller, running in a leaf-and-spine ACI fabric mode. These components form the building blocks of the ACI, supporting a dynamic, integrated physical and virtual infrastructure.

The Cisco ACI version

Before Cisco ACI 4.1, the Cisco ACI fabric allowed only a two-tier (spine-and-leaf switch) topology. Each leaf switch is connected to every spine switch in the network with no interconnection between leaf switches or spine switches.

Starting from Cisco ACI 4.1, the Cisco ACI fabric allows a multitier (three-tier) fabric and two tiers of leaf switches, which provides the capability for vertical expansion of the Cisco ACI fabric. This is useful to migrate a traditional three-tier architecture of core aggregation access that has been a standard design model for many enterprise networks and is still required today.

ACI fabric Details
Diagram: Cisco ACI fabric Details

The APIC Controller

From the management perspective, ACI networks are driven by the Cisco Application Policy Infrastructure Controller (APIC), whose database works as a cluster. The APIC is the centralized control point; everything you want to configure, you configure in the APIC.

Consider the APIC to be the brains of the ACI fabric, serving as the single source of truth for configuration within the fabric. The APIC controller is a policy engine and holds the defined policy, which tells the other elements in the ACI fabric what to do. This database allows you to manage the network as a single entity.

In summary, the APIC is the infrastructure controller and is the main architectural component of the Cisco ACI solution. It is the unified point of automation and management for the Cisco ACI fabric, policy enforcement, and health monitoring. The APIC is not involved in data plane forwarding.

data center layout
Diagram: Data center layout: The Cisco APIC controller

 

The APIC represents the management plane, allowing the system to maintain the control and data planes in the network. The APIC is not a control plane device, nor does it sit in the data traffic path. Remember that the APIC controller can crash and the fabric will still forward traffic. The ACI solution is not a centralized control plane SDN approach; the ACI is a distributed fabric with independent control planes on all fabric switches.

 

Cisco Data Center Design: The Leaf and Spine 

Leaf-spine is a two-layer data center network topology for data centers that experience more east-west network traffic than north-south traffic. The topology comprises leaf switches (servers and storage connect) and spine switches (to which leaf switches connect).

In this two-tier Clos architecture, every lower-tier switch (leaf layer) is connected to each top-tier switch (Spine layer) in a full-mesh topology. The leaf layer consists of access switches connecting to devices like servers.

The Spine layer is the network’s backbone and interconnects all Leaf switches. Every Leaf switch connects to every spine switch in the fabric. The path is randomly chosen, so the traffic load is evenly distributed among the top-tier switches. Therefore, if one of the top-tier switches fails, it would only slightly degrade performance throughout the data center.

SDN data center
Diagram: Cisco ACI fabric checking.

Unlike the traditional Cisco data center design, the ACI data center operates with a Leaf and Spine architecture. Traffic sent from an end host enters the fabric through a device known, in the ACI data center, as a Leaf.

We also have the Spine devices that are Layer 3 routers with no unique hardware dependencies. In a primary Leaf and Spine fabric, every Leaf is connected to every Spine. Any endpoint in the fabric is always the same distance regarding hops and latency from every other internal endpoint.

The ACI Spine switches are Clos intermediary switches with many vital functions. Firstly, they exchange routing updates with leaf switches via Intermediate System-to-Intermediate System (IS-IS) and rapidly forward packets between them. They provide endpoint lookup services to leaf switches through the Council of Oracle Protocol (COOP). They also handle route reflection to the leaf switches using Multiprotocol BGP (MP-BGP).

Cisco ACI Overview
Diagram: Cisco ACI Overview.

The Leaf switches are the ingress/egress points for traffic into and out of the ACI fabric. In addition, they are the connectivity points for the various endpoints that the Cisco ACI supports. The leaf switches provide end-host connectivity.

The spines act as a fast, non-blocking Layer 3 forwarding plane that supports Equal Cost Multipathing (ECMP) between any two endpoints in the fabric and uses overlay protocols such as VXLAN under the hood. VXLAN enables any workload to exist anywhere in the fabric. Using VXLAN, we can now have workloads anywhere in the fabric without introducing too much complexity.

ACI data center and ACI networks

This is a significant improvement to data center networking. We can now have physical or virtual workloads in the same logical layer 2 domain, even running Layer 3 down to each ToR switch. The ACI data center is a scalable solution as the underlay is specifically built to be scalable as more links are added to the topology and resilient when links in the fabric are brought down due to, for example, maintenance or failure. 

 

ACI Networks: The Normalization event

VXLAN is an industry-standard protocol that extends Layer 2 segments over Layer 3 infrastructure to build Layer 2 overlay logical networks. The ACI infrastructure Layer 2 domains reside in the overlay, with isolated broadcast and failure bridge domains. This approach allows the data center network to grow without risking creating too large a failure domain. All traffic in the ACI fabric is normalized as VXLAN packets.

ACI encapsulates external VLAN, VXLAN, and NVGRE packets in a VXLAN packet at the ingress. This is known as ACI encapsulation normalization. As a result, the forwarding in the ACI data center fabric is not limited to or constrained by the encapsulation type or overlay network. If necessary, the ACI bridge domain forwarding policy can be defined to provide standard VLAN behavior where required.

Cisco ACI overview with making traffic ACI-compatible

As a final note in this Cisco ACI overview, let us address the normalization process. When traffic hits the leaf, there is a normalization event. Normalization takes the traffic arriving from the servers and makes it ACI-compatible; essentially, traffic sent from the servers is given a VXLAN ID so it can be carried across the ACI fabric.

Traffic is normalized, encapsulated with a VXLAN header, and routed across the ACI fabric to the destination Leaf, where the destination endpoint is. This is, in a nutshell, how the ACI Leaf and Spine work. We have a set of leaf switches that connect to the workloads and the spines that connect to the Leaf.

VXLAN is the overlay protocol that carries data traffic across the ACI data center fabric. A key point of this type of architecture is that the Layer 3 boundary is moved to the leaf, which brings a lot of value to data center design. The boundary makes more sense here because routing and encapsulation happen at the leaf instead of having to traverse a separate core layer.

In conclusion, ACI networks are revolutionizing how businesses connect and operate in the digital age. With their focus on application-centric infrastructure, ACI networks offer enhanced scalability, simplified network operations, and top-notch security. By leveraging ACI networks, businesses can unleash the full potential of their network infrastructure, ensuring seamless connectivity and staying ahead in today’s competitive landscape.

 

Summary: Understanding ACI Networks

ACI networks, short for application-centric infrastructure networks, represent a software-driven approach to networking that brings automation, agility, and simplicity to network operations. Unlike traditional networks that rely on manual configurations, ACI networks leverage policy-based automation, enabling organizations to manage and scale their network infrastructure efficiently. By abstracting network policies from the underlying physical infrastructure, ACI networks empower businesses to adapt to changing requirements quickly.

Section 1: The Building Blocks of ACI Networks

At the core of ACI networks lie two fundamental components: the Application Policy Infrastructure Controller (APIC) and the Nexus switches. The APIC is the central orchestrator, providing a unified view of the entire network fabric. It enables administrators to define policies and automate network provisioning, reducing human error and increasing operational efficiency. On the other hand, the Nexus switches form the backbone of the network, delivering high-performance connectivity and supporting advanced features such as micro-segmentation and traffic engineering.

Section 2: Key Benefits of ACI Networks

ACI networks offer many benefits that revolutionize connectivity for organizations of all sizes. Firstly, the automation capabilities of ACI networks streamline network management, reducing the time and effort required to provision, configure, and troubleshoot network infrastructure. This allows IT teams to focus on strategic initiatives and innovation rather than being bogged down by mundane tasks.

Secondly, ACI networks enhance security by implementing micro-segmentation. By dividing the network into smaller segments and applying specific security policies to each, ACI networks minimize the risk of lateral movement in case of a breach, protecting critical assets and sensitive information.

Lastly, ACI networks provide unparalleled scalability and agility. With their dynamic and flexible nature, businesses can quickly adapt their network infrastructure to accommodate changing requirements and rapidly deploy new services or applications. This agility enables organizations to stay ahead in today’s fast-paced digital landscape.

Conclusion: In conclusion, ACI networks are revolutionizing connectivity by offering a software-driven, automated, and secure approach to network management. By leveraging the power of ACI networks, businesses can unlock new levels of efficiency, scalability, and agility, enabling them to thrive in the digital era. Whether streamlining operations, fortifying security, or embracing innovation, ACI networks are paving the way toward a connected future.

 

Cisco ACI Overview

Cisco ACI

Service Level Objectives (SLOs): Customer-centric view

 

 

Service Level Objectives (SLOs)

In today’s fast-paced digital world, businesses heavily rely on various software applications and online services to ensure smooth operations and deliver value to their customers. However, the increasing complexity of these systems often poses challenges in terms of reliability, availability, and performance. This is where Service Level Objectives (SLOs) come into play. In this blog post, we will delve into the concept of SLOs and explore their significance in achieving service excellence.

Service Level Objectives, or SLOs, are measurable targets defining the desired performance level, availability, and service reliability. They are critical to Service Level Agreements (SLAs) between service providers and customers. SLOs help set clear expectations and enable businesses to monitor, measure, and improve their service delivery based on agreed-upon metrics.

 

Highlights: Service Level Objectives (SLOs)

Site Reliability Engineering (SRE) teams have tools such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets that can guide them on the road to building a reliable system with the customer viewpoint as the metric. These tools form the basis for reliability in distributed systems and are the core building blocks of a reliable stack that assists with baseline engineering. The first thing you need to understand is what is expected of the service, which introduces the area of service-level management and its components.

  • The Role of Service-Level Management

The core concepts of service-level management are Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). The common indicators used are availability, latency, duration, and efficiency. Monitoring these indicators to catch problems before your SLO is violated is critical; they are the cornerstone of a good SRE practice.

    • SLI: Service Level Indicator: a well-defined measure of “successful enough.” It is a quantifiable measurement of whether a given user interaction was good enough: did it meet the user’s expectation? Did the web page load within a specific time? This allows you to categorize a given interaction as good or bad.
    • SLO: Service Level Objective: a top-line target for the fraction of interactions that must be successful (a short sketch follows this list).
    • SLA: Service Level Agreement: the consequences of missing the objective; it is more of a legal construct. 
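
As a minimal sketch with made-up numbers, the following Python snippet shows the relationship between these terms: the SLI is the fraction of interactions that were good enough, the SLO is the target that fraction must meet, and whatever the SLO leaves over is the error budget.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of interactions that met the user's expectation."""
    return good_events / total_events

def slo_status(sli: float, slo_target: float) -> dict:
    """Compare the measured SLI to the SLO and report how much error budget is spent."""
    error_budget = 1.0 - slo_target               # allowed fraction of bad events
    budget_spent = (1.0 - sli) / error_budget if error_budget else float("inf")
    return {"sli": sli, "slo_met": sli >= slo_target, "budget_spent": round(budget_spent, 3)}

# Example: 999,200 good page loads out of 1,000,000 against a 99.9% SLO.
sli = availability_sli(999_200, 1_000_000)
print(slo_status(sli, 0.999))   # {'sli': 0.9992, 'slo_met': True, 'budget_spent': 0.8}
```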

 

For pre-information, you may find the following helpful:

  1. Starting Observability
  2. Distributed Firewalls
  3. Network Traffic Engineering
  4. Brownfield Network Automation

 



Service Level Objectives

Key Service Level Objectives (SLOs) Discussion points:


  • Required for baseline engineering. 

  • Components of a Reliable system.

  • Chaos Engineering.

  • The issue with static thresholds.

  • How to approach Reliability.

 

  • A key point: Video on System Reliability and SLOs

The following video discusses the importance of distributed systems observability and the need to fully understand such systems with practices like Chaos Engineering and Site Reliability Engineering (SRE). In addition, it revisits the problems with monitoring and static thresholds.

 

Site Reliability Engineering | Observability

 

  • A key point: Back to basics with Service Level Objectives

Site Reliability Engineering (SRE)

Pioneered by Google to make large-scale systems more scalable and reliable, SRE has become one of today’s most valuable software innovation opportunities. SRE is a concrete, opinionated implementation of the DevOps philosophy whose main goal is to create scalable and highly reliable software systems.

According to Benjamin Treynor Sloss, the founder of Google’s Site Reliability Team, “SRE is what happens when a software engineer is tasked with what used to be called operations.”

System Reliability Meaning
Diagram: System reliability meaning.

 

 

So, reliability is not so much a feature as a practice that must be prioritized and considered from the very beginning, not something added later on, for example once a system or service is already in production. Reliability is the most essential property of any system, and it is not a feature that a vendor can sell you.

So if someone tries to sell you an add-on solution called Reliability, don’t buy it, especially if they offer 100% reliability. Nothing can be 100% reliable all the time. If you strive for 100% reliability, you will miss out on opportunities to innovate, experiment, and take the risks that help you build better products and services. 

Why are SLOs Important?

SLOs play a vital role in ensuring customer satisfaction and meeting business objectives. Here are a few reasons why SLOs are essential:

1. Accountability: SLOs provide a framework for holding service providers accountable for meeting the promised service levels. They establish a baseline for evaluating the performance and quality of the service.

2. Customer Experience: By setting SLOs, businesses can align their service offerings with customer expectations. This helps deliver a superior customer experience, foster customer loyalty, and gain a competitive edge in the market.

3. Performance Monitoring and Improvement: SLOs enable businesses to monitor their services’ performance and identify improvement areas continuously. Regularly tracking SLO metrics allows for proactive measures and optimizations to enhance service reliability and availability.

Critical Elements of SLOs:

To effectively implement SLOs, it is essential to consider the following key elements:

1. Metrics: SLOs should be based on relevant, measurable metrics that accurately reflect the desired service performance. Standard metrics include response time, uptime percentage, error rate, and throughput.

2. Targets: SLOs must define specific targets for each metric, considering customer expectations, industry standards, and business requirements. Targets should be achievable yet challenging enough to drive continuous improvement.

3. Monitoring and Alerting: Establishing robust monitoring and alerting mechanisms allows businesses to track the performance of their services in real time. This enables timely intervention and remediation in case of deviations from the defined SLOs.

4. Communication: Effective communication with customers is crucial to ensure transparency and manage expectations. Businesses should communicate SLOs, including the metrics, targets, and potential limitations, to foster trust and maintain a healthy customer-provider relationship.

 

Components of a Reliable System

Distributed system

To build reliable systems that can tolerate various failures, the system needs to be distributed so that a problem in one location doesn’t stop the entire service. You need to build a system that can handle, for example, a node dying, and that still performs adequately under a particular load.

To create a reliable system, you need to understand it fully and know what happens when the different components that make up the system reach certain thresholds. This is where practices such as Chaos Engineering, for example on Kubernetes, can help you.

 

Chaos Engineering 

We can have practices like Chaos Engineering that can confirm your expectations, give you confidence in your system at different levels, and prove you can have certain tolerance levels to Reliability. Chaos Engineering allows you to find weaknesses and vulnerabilities in complex systems. It is an important task that can be automated into your CI/CD pipelines.

So you can run various Chaos Engineering verifications before you reach production, and these tests, such as load and latency tests, can all be automated with little or no human interaction. Site Reliability Engineering (SRE) teams often use Chaos Engineering to improve resilience, and it should be part of your software development and deployment process.  

 

  • A key point: Video on Starting a Chaos Engineering Project

This educational tutorial begins with guidance on how applications have changed from the monolithic style to the microservices-based approach and how this has affected failures. It then shows how this can be addressed by knowing exactly how your application and infrastructure perform under stress and what their breaking points are.

 

Chaos Engineering: How to Start A Project

 

It’s All About Perception: Customer-Centric View

Reliability is all about perception. If users consider your service unreliable, you will lose their trust because the perceived quality of the service is poor, so it’s important to provide as much consistency in your services as possible. Some outages are OK and expected, but they can’t happen all the time or last for long durations.

Users expect occasional outages, but not long ones. Perception is everything; if the user thinks you are unreliable, you are. Therefore, you need a customer-centric view, and customer satisfaction is a critical metric to measure.

This is where the key components of service-level management, such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), come into play. There is a balance to find between velocity and stability: you can’t stop innovation, but you can’t take too many risks. An error budget, together with Site Reliability Engineering (SRE) principles, will help you here. 

 

User experience and static thresholds.

User experience means different things to different sets of users. We now have a model where different users of a service may be routed through the system in different ways, using different components, producing experiences that can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over.

With complex microservices and many software interactions, we see unpredictable failures that we have never encountered before, often referred to as unknown unknowns. We should have a small number of alerts triggered only by symptoms that directly impact user experience, not because some threshold was reached.

If your Pod network reaches a certain threshold, that tells you nothing about user experience. You can’t rely on static thresholds anymore, as they have no relationship to customer satisfaction.

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying this as it usually has predefined dashboards looking for something that has happened before.

This brings us back to the challenges with traditional metrics-based monitoring; we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience. However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.

 

How to Approach Reliability 

New tools and technologies

We have new tools, such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing instruments our system so we can figure out where the time is being spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
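
As a minimal sketch of what that instrumentation can look like, the snippet below uses the OpenTelemetry Python API and SDK (the opentelemetry-api and opentelemetry-sdk packages) to create nested spans and export them to the console; the service name, span names, and attribute are assumptions for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints finished spans to the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("checkout-service")        # hypothetical service name

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.id", "cart-123")        # illustrative attribute
    with tracer.start_as_current_span("charge-card"):
        pass                                          # the slow work would live here
```

In a real deployment the console exporter would be swapped for an OTLP exporter pointing at a tracing backend, but the instrumentation calls stay the same.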

We have already touched on Service Level Objectives, Service Level Indicators, and Error Budgets. You want to know why and how something happened, not just when it happened, because merely reacting to events is not looking at the system from the customer’s perspective.

We need to understand if we are meeting Service Level Agreement (SLA) by gathering the number and frequency of the outages and any performance issues. Service Level Objectives (SLO) and Service Level Indicators (SLI) can assist you with measurements. 

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) do more than assist you with measurements. They offer a tool for achieving better system reliability and form the base of the reliability stack. SLIs and SLOs help us approach reliability differently and give us a path to building a reliable system.

So now we have the tools and a discipline to use them within. Can you recall what that discipline is? It is Site Reliability Engineering (SRE).

System Reliability Formula
Diagram: System Reliability Formula.

 

  • SLO-Based approach to reliability

If you’re too reliable all the time, you’re also missing out on some of the fundamental features that SLO-based approaches give you. The main area you will miss is the freedom to do what you want, test, and innovate. If you’re too reliable, you’re missing out on opportunities to experiment, perform chaos engineering, ship features quicker than before, or even introduce structured downtime to see how your dependencies react.

To learn a system, you need to break it. So if you are 100% reliable, you can’t touch your system, so you will never truly learn and understand your system. You want to give your users a good experience, but you’ll run out of resources in various ways if you try to ensure this good experience happens 100% of the time. SLOs let you pick a target that lives between those two worlds.
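
A small worked example helps here. Assuming a 30-day window, the Python sketch below converts an SLO target into the minutes of unavailability you are allowed to “spend” on experiments, chaos tests, or risky releases; the targets shown are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of 'allowed' unavailability in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} SLO -> {error_budget_minutes(target):.1f} min/month of budget")
# 99.00% SLO -> 432.0 min/month, 99.90% -> 43.2 min/month, 99.99% -> 4.3 min/month
```

The tighter the target, the less room there is to take risks, which is exactly the trade-off the paragraph above describes.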

 

  • Balance velocity and stability

You can’t just have reliability by itself; you must also have new features and innovation. Therefore, you need to find a balance between velocity and stability and weigh reliability against the other features you have or are proposing to offer. Give users access to a system with a fantastic feature that doesn’t work reliably, and the users who have a choice will leave.

So Site Reliability Engineering is the framework for balancing velocity and stability. So how do you know what level of Reliability you need to provide your customer? This all goes back to the business needs that reflect the customer’s expectations. So with SRE, we have a customer-centric approach.

The primary source of outages is change, even planned change. Changes come in many forms: pushing new features, applying security patches, deploying new hardware, and scaling up to meet customer demand. All of these significantly impact a 100% reliability target. 

If nothing ever changed in the physical or logical infrastructure or the other components, we would have no new bugs; we could freeze our current user base and never have to scale the system. In reality, this will not happen. There will always be changes, so you must find a balance.

Conclusion:

In conclusion, Service Level Objectives (SLOs) are a cornerstone for delivering reliable and high-quality services in today’s technology-driven world. By setting measurable targets, businesses can align their service performance with customer expectations, drive continuous improvement, and ultimately enhance customer satisfaction. Implementing and monitoring SLOs allows businesses to proactively address issues, optimize service delivery, and stay ahead of the competition. By embracing SLOs, businesses can pave the way for successful service delivery and long-term growth.

 

Auto Scaling Observability

Observability vs Monitoring

Observability vs Monitoring

In today's fast-paced digital landscape, where complex systems and applications drive businesses, it's crucial to have a clear understanding of observability and monitoring. These two terms are often used interchangeably, but they represent distinct concepts in the realm of system management and troubleshooting. In this blog post, we will delve into the differences between observability and monitoring, shedding light on their unique features and benefits.

What is Observability? Observability refers to the ability to gain insight into the internal state of a system through its external outputs. It focuses on understanding the behavior and performance of a system from an external perspective, without requiring deep knowledge of its internal workings. Observability provides a holistic view of the system, enabling comprehensive analysis and troubleshooting.

The Essence of Monitoring: Monitoring, on the other hand, involves the systematic collection and analysis of various metrics and data points within a system. It primarily focuses on tracking predefined performance indicators, such as CPU usage, memory utilization, and network latency. Monitoring provides real-time data and alerts to ensure that system health is maintained and potential issues are promptly identified.

Data Collection and Analysis: Observability emphasizes comprehensive data collection and analysis, aiming to capture the entire system's behavior, including its interactions, dependencies, and emergent properties. Monitoring, however, focuses on specific metrics and predefined thresholds, often using predefined agents, plugins, or monitoring tools.

Contextual Understanding: Observability aims to provide a contextual understanding of the system's behavior, allowing engineers to trace the flow of data and understand the cause and effect of different components. Monitoring, while offering real-time insights, lacks the contextual understanding provided by observability.

Reactive vs Proactive: Monitoring is primarily reactive, alerting engineers when predefined thresholds are exceeded or when specific events occur. Observability, on the other hand, enables a proactive approach, empowering engineers to explore and investigate the system's behavior even before issues arise.

In conclusion, observability and monitoring are both crucial elements in system management, but they have distinct focuses and approaches. Observability provides a holistic and contextual understanding of the system's behavior, allowing for comprehensive analysis and proactive troubleshooting. Monitoring, on the other hand, offers real-time data and alerts based on predefined metrics, ensuring system health is maintained. Understanding the differences between these two concepts is vital for effectively managing and optimizing complex systems.

Highlights: Observability vs Monitoring

Observability: The First Steps

The first step towards achieving modern observability is to gather metrics, traces, and logs. From the collected data points, observability aims to generate valuable outcomes for decision-making. The decision-making process goes beyond resolving problems as they arise. Next-generation observability goes beyond application remediation, focusing on creating business value to help companies achieve their operational goals. This decision-making process can be enhanced by incorporating user experience, topology, and security data.

Observability Platform

A full-stack observability platform monitors every host in your environment. Depending on the technologies used, an average of around 500 metrics is generated per computational node. AWS, Azure, Kubernetes, and VMware Tanzu are some of the platforms from which observability tooling collects key performance metrics for services and real-user-monitored applications. 

Within a microservices environment, there can be dozens, if not hundreds, of microservices calling one another. Distributed tracing can help you understand how the different services connect and how your requests flow through them. 

The three pillars of observability form a strong foundation for making data-driven decisions, but there are opportunities to extend observability. User experience and security details must be considered to gain a deeper understanding. A holistic, context-driven approach to advanced observability enables proactively addressing potential problems before they arise.

The Role of Monitoring

To understand the difference between observability and monitoring, we need first to discuss the role of monitoring. Monitoring is the evaluation that helps identify the most practical and efficient use of resources. So, the big question I put to you is what to monitor. This is the first step to preparing a monitoring strategy.

You can ask yourself a couple of questions to fully understand if monitoring is enough or if you need to move to an observability platform. Firstly, you should consider what you should be monitoring, why you should be monitoring it, and how you should be monitoring it. 

Options: Open source or commercial

Knowing this lets you move on to the different tools and platforms available. Some of these tools are open source, others commercial. When evaluating them, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos break agility in every area of technology.

For pre-information, you may find the following posts helpful:

  1. Microservices Observability
  2. Auto Scaling Observability
  3. Network Visibility
  4. WAN Monitoring
  5. Distributed Systems Observability
  6. Prometheus Monitoring
  7. Correlate Disparate Data Points
  8. Segment Routing



Monitoring vs Observability

Key Observability vs Monitoring Discussion points:


  • The difference between Monitoring vs Observability. 

  • Google’s four Golden signals.

  • The role of metrics, logs and alerts.

  • The need for Observability.

  • Observability and Monitoring working together.

Back to Basics with Observability vs Monitoring

Monitoring and distributed systems

By utilizing distributed architectures, the cloud native ecosystem allows organizations to build scalable, resilient, and novel software architectures. However, the ever-changing nature of distributed systems means that previous approaches to monitoring can no longer keep up. The introduction of containers made the cloud flexible and empowered distributed systems.

Nevertheless, the ever-changing nature of these systems can cause them to fail in many ways. Distributed systems are inherently complex, and, as systems theorist Richard Cook notes, “Complex systems are intrinsically hazardous systems.”

Cloud-native systems require a new approach to monitoring, one that is open-source compatible, scalable, reliable, and able to control massive data growth. However, cloud-native monitoring can’t exist in a vacuum: it needs to be part of a broader observability strategy.

Observability vs Monitoring
Diagram: Observability vs monitoring.

Key Features of Observability:

1. High-dimensional data collection: Observability involves collecting a wide variety of data from different system layers, including metrics, logs, traces, and events. This comprehensive data collection provides a holistic view of the system’s behavior.

2. Distributed tracing: Observability allows tracing requests as they flow through a distributed system, enabling engineers to understand the path and identify performance bottlenecks or errors.

3. Contextual understanding: Observability emphasizes capturing contextual information alongside the data, enabling teams to correlate events and understand the impact of changes or incidents.

Benefits of Observability:

1. Faster troubleshooting: By providing detailed insights into system behavior, observability helps teams quickly identify and resolve issues, minimizing downtime and improving system reliability.

2. Proactive monitoring: Observability allows teams to detect potential problems before they become critical, enabling proactive measures to prevent service disruptions.

3. Improved collaboration: With observability, different teams, such as developers, operations, and support, can have a shared understanding of the system’s behavior, leading to improved collaboration and faster incident response.

Monitoring:

On the other hand, monitoring focuses on collecting and analyzing metrics to assess the health and performance of a system. It involves setting up predefined thresholds or rules and generating alerts based on specific conditions.

Key Features of Monitoring:

1. Metric-driven analysis: Monitoring relies on predefined metrics collected and analyzed to measure system performance, such as CPU usage, memory consumption, response time, or error rates.

2. Alerting and notifications: Monitoring systems generate alerts and notifications when predefined thresholds or rules are violated, enabling teams to take immediate action.

3. Historical analysis: Monitoring systems provide historical data, allowing teams to analyze trends, identify patterns, and make informed decisions based on past performance.

Benefits of Monitoring:

1. Performance optimization: Monitoring helps identify performance bottlenecks and inefficiencies within a system, enabling teams to optimize resources and improve overall system performance.

2. Capacity planning: By monitoring resource utilization and workload patterns, teams can accurately plan for future growth and ensure sufficient resources are available to meet demand.

3. Compliance and SLA enforcement: Monitoring systems help organizations meet compliance requirements and enforce service level agreements (SLAs) by tracking and reporting on key metrics.

Observability and Monitoring: A Unified Approach:

While observability and monitoring differ in their approaches and focus, they are not mutually exclusive. When used together, they complement each other and provide a more comprehensive understanding of system behavior.

Observability enables teams to gain deep insights into system behavior, understand complex interactions, and troubleshoot issues effectively. Conversely, monitoring provides a systematic approach to tracking predefined metrics, generating alerts, and ensuring the system meets performance requirements.

Combining observability and monitoring can help organizations create a robust system monitoring and management strategy. This integrated approach empowers teams to detect, diagnose, and resolve issues quickly, improving system reliability, performance, and customer satisfaction.

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environment, which will be done with several tools. This will let you know what is affecting your application performance and infrastructure. A good starting point is Google’s Four Golden Signals: latency, traffic, errors, and saturation. The four most important metrics to keep track of are: 

      1. Latency: How long it takes to serve a request
      2. Traffic: The number of requests being made.
      3. Errors: The rate of failing requests. 
      4. Saturation: How utilized the service is.

Now that we have some guidance on what to monitor, let’s apply it to Kubernetes. For a frontend web service that is part of a tiered application, we would be looking at the following (a small instrumentation sketch follows the list):

      1. How many requests is the front end processing at a particular point in time?
      2. How many 500 errors are users of the service receiving? 
      3. Is the request load saturating the service?
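
As a rough sketch of instrumenting those signals, assuming the prometheus_client Python package, the example below exposes a request counter labeled by status code (traffic and errors), a latency histogram, and an in-flight gauge (saturation) for a hypothetical frontend handler.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("frontend_requests_total", "Total requests served", ["code"])
LATENCY = Histogram("frontend_request_seconds", "Request latency in seconds")
INFLIGHT = Gauge("frontend_inflight_requests", "Requests currently being served")

def handle_request():
    """Simulated request handler that records all four golden signals."""
    INFLIGHT.inc()
    with LATENCY.time():                                  # latency
        time.sleep(random.uniform(0.01, 0.2))             # pretend work
        code = "500" if random.random() < 0.02 else "200" # errors vs. successes
    REQUESTS.labels(code=code).inc()                      # traffic and errors
    INFLIGHT.dec()                                        # saturation proxy

if __name__ == "__main__":
    start_http_server(8000)   # metrics scraped from http://localhost:8000/metrics
    while True:
        handle_request()
```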

We already know that monitoring is a form of evaluation that helps identify the most practical and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. Within this, we have metrics, logs, and alerts. Each has a different role and purpose.

Monitoring: The role of metrics

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values instead of unstructured text, such as documents and web pages. Metric data is typically also a time series, where values or measures are recorded over some time. 

Available bandwidth and latency are examples of such metrics. Understanding baseline values is essential. Without a baseline, you will not know if something is happening outside the norm.

What are the average baseline values for bandwidth and latency metrics? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? This may change over different days, weeks, and months.

If you notice a rise in these values during normal operations, this would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Remember that these values should not be gathered as a once-off but can be gathered over time to understand your application and its underlying infrastructure better.
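
A minimal sketch of the baseline idea, using synthetic latency samples: instead of a fixed threshold, compute the mean and standard deviation of recent values and flag only samples that deviate markedly from that baseline.

```python
from statistics import mean, stdev

def is_abnormal(history: list[float], new_value: float, sigmas: float = 3.0) -> bool:
    """Flag a sample that sits more than `sigmas` standard deviations from the baseline."""
    baseline, spread = mean(history), stdev(history)
    return abs(new_value - baseline) > sigmas * spread

# Latency samples (ms) gathered during normal operation form the baseline.
latency_history = [42, 45, 44, 47, 43, 46, 44, 45, 48, 44]
print(is_abnormal(latency_history, 46))    # False: within the normal range
print(is_abnormal(latency_history, 120))   # True: a deviation worth investigating
```

A production system would keep the baseline rolling over days or weeks so it tracks the normal daily and weekly fluctuations described above.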

Monitoring: The role of logs

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about events, which is important for troubleshooting or discovering the root cause of the events. Logs will have a lot more detail than metrics, so you will need some way to parse the logs or use a log shipper.

A typical log shipper will take these logs from the standard out in a Docker container and ship them to a backend for processing.

Fluentd and Logstash are common log shippers, each with pros and cons. The shipper collects the logs and sends them to a backend, which could be the ELK stack (Elasticsearch, Logstash, Kibana). Using this approach, you can enrich the logs before sending them to the backend; for example, you can add GeoIP information, which gives the logs richer context that can help you troubleshoot.
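
The enrichment step can be pictured with a small Python sketch; the GeoIP table, field names, and pipeline tag below are invented for illustration, and a real shipper would consult a GeoIP database instead.

```python
import json

# Hypothetical GeoIP table; a real shipper would query a GeoIP database.
GEOIP = {"203.0.113.7": "DE", "198.51.100.9": "US"}

def enrich(raw_line: str) -> str:
    """Parse a JSON log line, add GeoIP context, and return the enriched record."""
    record = json.loads(raw_line)
    record["geoip_country"] = GEOIP.get(record.get("client_ip"), "unknown")
    record["pipeline"] = "frontend-logs"          # tag the source for the backend
    return json.dumps(record)

line = '{"ts": "2024-01-01T12:00:00Z", "client_ip": "203.0.113.7", "status": 500}'
print(enrich(line))   # enriched record, ready to ship to Elasticsearch or similar
```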

Monitoring: The role of alerting

Then we have alerting, and you need to strike a balance between how you monitor and what you alert on. Alerting is never perfect, and getting the right alerting strategy in place takes time; it is not a simple day-one installation and requires significant effort and cross-team collaboration.

Alerting on too much causes alert fatigue. We are all too familiar with the problems alert fatigue can bring and the tensions it can create between departments.

To minimize this, consider Service Level Objectives (SLOs) for alerts. SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they are the foundation of a reliability stack. Also consider your alert thresholds and evaluation windows: if these are too short, you will get a lot of false positives. 

Monitoring is not enough.

Even with all of these in place, monitoring is not enough. Due to the sheer complexity of today’s landscape, you need to think differently about the tools you use and how you use the intelligence and data you receive from them to resolve issues before they become incidents. Monitoring by itself is not enough.

A monitoring tool is just a tool; it probably does not cross technical domains, and different groups of users will administer each tool without a holistic view. The tools alone take you only halfway through the journey. What also needs to be addressed is the culture and the traditional way of working in silos, because a siloed environment undermines the monitoring strategy you want to implement. This is where an observability platform comes in.

Observability vs Monitoring

When it comes to observability vs. monitoring, we know that monitoring can detect problems and tell you when a system is down; when your system is up, monitoring doesn’t care. Monitoring only cares when there is a problem, and the problem has to happen before monitoring takes action. It is very reactive: if everything is working, monitoring doesn’t care.

On the other hand, we have an observability platform, which is a more proactive practice. It’s about what and how your system and services are doing. Observability lets you improve your insight into how complex systems work and quickly get to the root cause of any problem, known or unknown.

Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without first predicting. This is a proactive approach.

The pillars of observability

This is achieved by combining logs, metrics, and traces. So, we need data collection, storage, and analysis across these domains while also being able to perform alerting on what matters most. Let’s say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app.

The Observability platform pulls context from different sources of information, such as logs, metrics, events, and traces, into one central context. Distributed tracing adds a lot of value here.

Also, when everything is placed into one context, you can quickly switch between the necessary views to troubleshoot the root cause. Viewing these telemetry sources with one single pane of glass is an excellent key component of any observability system. 

Distributed Tracing in Microservices
Diagram: Distributed tracing in microservices.

Monitoring: known unknowns / Observability: unknown unknowns

Monitoring automatically reports whether known failure conditions are occurring or are about to occur. In other words, it is optimized for reporting on known failure modes, the so-called known unknowns. In contrast, observability is centered on discovering whether, and why, previously unknown failure modes are occurring; in other words, on finding the unknown unknowns.

The monitoring-based approach of metrics and dashboards is an investigative practice that relies on humans’ experience and intuition to detect and understand system issues. This is okay for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in unpredictable ways.

With modern applications, the complexity and scale of their underlying systems quickly make that approach unattainable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex. You don’t need to react to a hunch or have intimate system knowledge to generate a hunch.

Monitoring vs Observability: Working together?

Monitoring helps engineers understand infrastructure concerns, while observability helps engineers understand software concerns. So, Observability and Monitoring can work together. First, the infrastructure does not change too often, and when it fails, it will fail more predictably. So, we can use monitoring here.

Compare this to software system states, which change daily and are unpredictable; observability fits that purpose. The conditions that affect infrastructure health change infrequently and are relatively straightforward to predict. We also have several well-established practices, such as capacity planning and automatic remediation (e.g., auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues. 

Monitoring and infrastructure problems

Due to its relatively predictable and slowly changing nature, the aggregated metrics approach monitors and alerts perfectly for infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached.

For the software, however, we need access to high-cardinality fields, such as a user ID or a shopping cart ID. Code that is well instrumented for observability allows you to answer complex questions that are easy to miss when examining only aggregate performance.

Observability and monitoring are essential practices in modern software development and operations. While observability focuses on understanding system behavior through comprehensive data collection and analysis, monitoring uses predefined metrics to assess performance and generate alerts. By leveraging both approaches, organizations can gain a holistic view of their systems, enabling proactive measures, faster troubleshooting, and optimal performance. Embracing observability and monitoring as complementary practices can pave the way for more reliable, scalable, and efficient systems in the digital era.

 

Summary: Observability vs Monitoring

As technology advances rapidly, understanding and managing complex systems becomes increasingly important. Two terms that often arise in this context are observability and monitoring. While they may seem interchangeable, they represent distinct approaches to gaining insights into system performance. In this blog post, we delved into observability and monitoring, exploring their differences, benefits, and how they can work together to provide a comprehensive understanding of system behavior.

Section 1: Understanding Monitoring

Monitoring is a well-established practice in the world of technology. It involves collecting and analyzing data from various sources to ensure the smooth functioning of a system. Monitoring typically focuses on key performance indicators (KPIs) such as response time, error rates, and resource utilization. Organizations can proactively identify and resolve issues by tracking these metrics, ensuring optimal system performance.

Section 2: Unveiling Observability

Observability takes a more holistic approach compared to monitoring. It emphasizes understanding the internal state of a system by leveraging real-time data and contextual information. Unlike monitoring, which focuses on predefined metrics, observability aims to provide a clear picture of how a system behaves under different conditions. It achieves this by capturing fine-grained telemetry data, including logs, traces, and metrics, which can be analyzed to uncover patterns, anomalies, and root causes of issues.

Section 3: The Benefits of Observability

One of the key advantages of observability is its ability to handle unexpected scenarios and unknown unknowns. Capturing detailed data about system behavior enables teams to investigate issues retroactively, even those that were not anticipated during the design phase. Additionally, observability allows for better collaboration between different teams, as the shared visibility into system internals facilitates more effective troubleshooting and faster incident resolution.

Section 4: Synergy between Observability and Monitoring

While observability and monitoring are distinct concepts, they are not mutually exclusive. They can complement each other to provide a comprehensive understanding of system performance. Monitoring can provide high-level insights into system health and performance trends, while observability can dive deeper into specific issues and offer a more granular view. By combining these approaches, organizations can achieve a proactive and reactive system management approach, ensuring stability and resilience.

Conclusion:

Observability and monitoring are two powerful tools in the arsenal of system management. While monitoring focuses on predefined metrics, observability takes a broader and more dynamic approach, capturing fine-grained data to gain deeper insights into system behavior. By embracing observability and monitoring, organizations can unlock a comprehensive understanding of their systems, enabling them to proactively address issues, optimize performance, and deliver exceptional user experiences.

OpenShift Security Context Constraints

OpenShift Security Best Practices

OpenShift Security Best Practices

In today's digital landscape, security is of utmost importance. This is particularly true for organizations utilizing OpenShift, a powerful container platform. In this blog post, we will explore the best practices for OpenShift security, ensuring that your deployments are protected from potential threats.

Container Security: Containerization has revolutionized application deployment, but it also introduces unique security considerations. By implementing container security best practices, you can mitigate risks and safeguard your OpenShift environment. We will delve into topics such as image security, vulnerability scanning, and secure container configurations.

Access Control: Controlling access to your OpenShift cluster is vital for maintaining a secure environment. We will discuss the importance of strong authentication mechanisms, implementing role-based access control (RBAC), and regularly reviewing and updating user permissions. These measures will help prevent unauthorized access and potential data breaches.

Network Security: Securing the network infrastructure is crucial to protect your OpenShift deployments. We will explore topics such as network segmentation, implementing firewall rules, and utilizing secure network protocols. By following these practices, you can create a robust network security framework for your OpenShift environment.

Monitoring and Logging: Effective monitoring and logging are essential for detecting and responding to security incidents promptly. We will discuss the importance of implementing comprehensive logging mechanisms, utilizing monitoring tools, and establishing alerting systems. These practices will enable you to proactively identify potential security threats and take necessary actions to mitigate them. Regular Updates and Patching: Keeping your OpenShift environment up to date with the latest patches and updates is vital for maintaining security. We will emphasize the significance of regular patching and provide tips for streamlining the update process. By staying current with security patches, you can address vulnerabilities and protect your OpenShift deployments.

In conclusion, securing your OpenShift environment requires a multi-faceted approach that encompasses container security, access control, network security, monitoring, and regular updates. By implementing the best practices discussed in this blog post, you can fortify your OpenShift deployments and ensure a robust security posture. Protecting your applications and data is a continuous effort, and staying vigilant is key in the ever-evolving landscape of cybersecurity.

Highlights: OpenShift Security Best Practices

Understanding Cluster Access

To begin our journey, let’s establish a solid understanding of cluster access. Cluster access refers to the process of authenticating and authorizing users or entities to interact with an Openshift cluster. It involves managing user identities, permissions, and secure communication channels.

Implementing Multi-Factor Authentication (MFA)

Multi-factor authentication adds an extra layer of security by requiring users to provide multiple forms of identification. By enabling MFA within your Openshift cluster, you can significantly reduce the risk of unauthorized access. This section will outline the steps to configure and enforce MFA for enhanced cluster access security.

Role-Based Access Control (RBAC)

RBAC is a crucial component of Openshift security, allowing administrators to define and manage user permissions at a granular level. We will explore the concept of RBAC and its practical implementation within an Openshift cluster. Discover how to define roles, assign permissions, and effectively control access to various resources.
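
To illustrate the RBAC idea in general terms, this is a simplified sketch rather than the OpenShift implementation: roles bundle allowed verbs on resources, bindings attach roles to users, and every request is checked against those bindings. The role names and users are hypothetical.

```python
# Simplified RBAC model: roles grant verbs on resources, bindings attach roles to users.
ROLES = {
    "viewer":    {("get", "pods"), ("list", "pods")},
    "developer": {("get", "pods"), ("list", "pods"), ("create", "deployments")},
}
BINDINGS = {"alice": ["developer"], "bob": ["viewer"]}   # hypothetical users

def allowed(user: str, verb: str, resource: str) -> bool:
    """Return True only if one of the user's bound roles grants the verb on the resource."""
    return any((verb, resource) in ROLES[role] for role in BINDINGS.get(user, []))

print(allowed("alice", "create", "deployments"))  # True
print(allowed("bob", "create", "deployments"))    # False: least privilege in action
```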

Secure Communication Channels

Establishing secure communication channels is vital to protect data transmitted between cluster components. In this section, we will discuss the utilization of Transport Layer Security (TLS) certificates to encrypt communication and prevent eavesdropping or tampering. Learn how to generate and manage TLS certificates within your Openshift environment.

Continuous Monitoring and Auditing

Maintaining a robust security posture involves constantly monitoring and auditing cluster access activities. Through integrating monitoring tools and auditing mechanisms, administrators can detect and respond to potential security breaches promptly. Uncover the best practices for implementing a comprehensive monitoring and auditing strategy within your Openshift cluster.

Threat modelling

A threat model maps out the likelihood and impact of potential threats to your system. It is essential to think about and evaluate the risks to your platform, just as your security team does.

OpenShift clusters are no different, so keep that in mind when hardening your cluster. Using the kubeadmin user is probably fine if you are running CodeReady Containers on your laptop to learn OpenShift; strict access control and RBAC rules are essential for your company’s production clusters exposed to the internet.

If you model threats beforehand, you will be able to explain what you did to protect your infrastructure and why you may not have taken certain other actions.

Stricter Security than Kubernetes

OpenShift has stricter security policies than Kubernetes. For instance, running a container as root is forbidden by default, and OpenShift offers a secure-by-default posture. Out of the box, Kubernetes does not provide a complete authentication solution, so developers must wire up bearer tokens and other authentication procedures themselves.

OpenShift provides a range of security features, including role-based access control (RBAC), image scanning, and container isolation, that help ensure the safety of containerized applications.

Related: For useful pre-information on OpenShift basics, you may find the following posts helpful:

  1. OpenShift Networking
  2. Kubernetes Security Best Practice
  3. Container Networking
  4. Identity Security
  5. Docker Container Security
  6. Load Balancing



OpenShift Security.

Key OpenShift Security Best Practices discussion points:


  • The traditional fixed stack security architecture. 

  • Microservices: Many different entry points.

  • Docker container attack vectors.

  • Security Context Constraints.

  • OpenShift network security.

Back to Basics: Starting OpenShift Security with OpenShift Best Practices

Generic: Securing containerized environments

Securing containerized environments is considerably different from securing the traditional monolithic application because of the inherent nature of the microservices architecture. We went from one to many, and there is a clear difference in attack surface and entry points. So, there is much to consider for OpenShift network security and OpenShift security best practices, including many Docker security options.

The application stack previously had very few components, maybe just a cache, web server, and database separated and protected by a context firewall. The most common network service allows a source to reach an application, and the sole purpose of the network is to provide endpoint reachability.

As a result, the monolithic application has few entry points, such as ports 80 and 443. Not every monolithic component is exposed to external access and must accept requests directly, so we designed our networks around these facts. The following diagram provides information on the threats you must consider for container security.

container security
Diagram: Container security. Source Neuvector

1. Secure Authentication and Authorization: One of the fundamental aspects of OpenShift security is ensuring that only authorized users have access to the platform. Implementing robust authentication mechanisms, such as multifactor authentication (MFA) or integrating with existing identity management systems, is crucial to prevent unauthorized access. Additionally, defining fine-grained access controls and role-based access control (RBAC) policies will help enforce the principle of least privilege.

2. Container Image Security: OpenShift leverages containerization technology, which brings its security considerations. Using trusted container images from reputable sources and regularly updating them to include the latest security patches is essential. Implementing image scanning tools to detect vulnerabilities and malware within container images is also recommended. Furthermore, restricting privileged containers and enforcing resource limits will help mitigate potential security risks.

3. Network Security: OpenShift supports network isolation through software-defined networking (SDN). It is crucial to configure network policies to restrict communication between different components and namespaces, thus preventing lateral movement and unauthorized access. Implementing secure communication protocols, such as Transport Layer Security (TLS), between services and enforcing encryption for data in transit will further enhance network security.

4. Monitoring and Logging: A robust monitoring and logging strategy is essential for promptly detecting and responding to security incidents. OpenShift provides built-in monitoring capabilities, such as Prometheus and Grafana, which can be leveraged to monitor system health, resource usage, and potential security threats. Additionally, enabling centralized logging and auditing of OpenShift components will help identify and investigate security events.

5. Regular Vulnerability Assessments and Penetration Testing: To ensure the ongoing security of your OpenShift environment, it is crucial to conduct regular vulnerability assessments and penetration testing. These activities will help identify any weaknesses or vulnerabilities within the platform and its associated applications. Addressing these vulnerabilities promptly will minimize the risk of potential attacks and data breaches.

OpenShift Security

OpenShift delivers all the tools you need to run software on top of it with SRE paradigms, from a monitoring platform to an integrated CI/CD system that you can use to monitor and run both the software deployed to the OpenShift cluster and the cluster itself. So, the cluster and the workload that runs in it need to be secured. 

From a security standpoint, OpenShift provides robust encryption controls to protect sensitive data, including platform secrets and application configuration data. In addition, OpenShift optionally utilizes FIPS 140-2 Level 1 compliant encryption modules to meet security standards for U.S. federal departments.

This post highlights OpenShift security and provides security best practices and considerations when planning and operating your OpenShift cluster. These will give you a starting point. However, as clusters, as well as bad actors, are ever-evolving, it is of significant importance to revise the steps you took.

Central security architecture

Therefore, we often see security enforcement in a fixed, central place in the network infrastructure. This could be, for example, a significant security stack consisting of several security appliances, often referred to as a kludge of devices. As a result, the individual components within the application need not worry about carrying out any security checks, as these occur centrally on their behalf.

On the other hand, with the common microservices architecture, those internal components are specifically designed to operate independently and accept requests alone, which brings considerable benefits to scaling and deploying pipelines.

However, each component may now have entry points and accept external connections. Therefore, they need to be concerned with security individually and not rely on a central security stack to do this for them.

OpenShift Security Guide
Diagram: OpenShift security best practices.

The different container attack vectors 

These changes have considerable consequences for security and for how you approach your OpenShift security best practices. The security principles still apply: we are still concerned with reducing the blast radius, least privilege, and so on. However, they must be applied from a different perspective and to multiple new components in a layered approach. Security is never done in isolation.

So, as the number of entry points to the system increases, the attack surface broadens, leading us to several docker container security attack vectors not seen with the monolithic. We have, for example, attacks on the Host, images, supply chain, and container runtime. There is also a considerable increase in the rate of change for these types of environments; an old joke says that a secure application is an application stack with no changes.

So when you change, you can open the door to a bad actor. Today’s application varies considerably a few times daily for an agile stack. We have unit and security tests and other safety tests that can reduce mistakes, but no matter how much preparation you do, there is a chance of a breach whenever there is a change.

So, we have environmental changes that affect security, along with some alarming technical defaults in how containers run, such as running as root with a surprisingly broad set of capabilities and privileges. The following image displays attack vectors that are linked explicitly to containers.

container attack vectors
Diagram: Container attack vectors. Source Adriancitu

Challenges with Securing Containers

  • Containers running as root

So, as you know, containers run as root by default, share the kernel of the host OS, and the container process is visible from the host. This in itself is a considerable security risk when a container compromise occurs. If a security vulnerability in the container runtime allows a container escape, an application running as root inside the container can become root on the underlying host.

Likewise, if a bad actor gains access to the host with the correct privileges, they can compromise all of the host's containers.

Risky Configuration

Containers often run with excessive privileges and capabilities—much more than they need to do their job efficiently. As a result, we need to consider what privileges the container has and whether it runs with any unnecessary capabilities it does not need.

Some of the capabilities a container may have are defaults that fall under risky configurations and should be avoided. In particular, keep an eye on CAP_SYS_ADMIN; this flag grants access to an extensive range of privileged activities.

The container has isolation boundaries by default through namespaces and control groups (when configured correctly). However, granting a container excessive capabilities weakens the isolation between the container, the host, and the other containers on that host; it essentially removes or dissolves the container's ring fence.
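
As a minimal, hedged sketch of what dropping unneeded privileges looks like in practice, the following Python dict mirrors the securityContext stanza of a Kubernetes/OpenShift container spec. The container name, image, and the NET_BIND_SERVICE add-back are illustrative assumptions, not a prescription for your workload.

```python
import json

# Minimal sketch: a container spec that drops all Linux capabilities and
# refuses to run as root, expressed as a plain Python dict mirroring the
# securityContext stanza of a pod manifest. Name and image are placeholders.
container = {
    "name": "web",
    "image": "registry.example.com/web:1.0",
    "securityContext": {
        "runAsNonRoot": True,                # refuse to start if the image expects UID 0
        "allowPrivilegeEscalation": False,   # block setuid-style escalation
        "capabilities": {
            "drop": ["ALL"],                 # start from zero capabilities
            "add": ["NET_BIND_SERVICE"],     # illustrative: add back only what is truly needed
        },
    },
}

print(json.dumps(container, indent=2))
```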

Starting OpenShift Security Best Practices

OpenShift overcomes many of the default security risks that come with running containers, and it does much of this out of the box. If you want further information on securing an OpenShift cluster, kindly check out my course for Pluralsight on OpenShift Security and OpenShift Network Security.

OpenShift Container Platform (formerly known as OpenShift Enterprise) or OCP is Red Hat’s offering for the on-premises private platform (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution.

The foundation of the OpenShift Container Platform, and of OpenShift network security, is Kubernetes, so it shares much of the same networking technology along with some enhancements. However, Kubernetes is a complex beast, and securing a cluster with vanilla Kubernetes alone takes considerable effort. OpenShift does an excellent job of wrapping Kubernetes in a layer of security, such as Security Context Constraints (SCCs), that gives your cluster a good security base.

Security Context Constraints

By default, OpenShift prevents the cluster container from accessing protected functions. These functions—Linux features such as shared file systems, root access, and some core capabilities such as the KILL command—can affect other containers running in the same Linux kernel, so the cluster limits access.

Most cloud-native applications work fine with these limitations, but some (especially stateful workloads) need greater access. Applications that require these functions can still use them but need the cluster’s permission.

The application’s security context specifies the permissions that the application needs, while the cluster’s security context constraints specify the permissions that the cluster allows. An SC with an SCC enables an application to request access while limiting the access that the cluster will grant.

What are security contexts and security context constraints?

A pod configures a container’s access with permissions requested in the pod’s security context and approved by the cluster’s security context constraints:

A security context (SC), defined in a pod, enables a deployer to specify a container's permissions to access protected functions. When the pod creates the container, it configures it to allow these permissions and block all others. The cluster will only deploy the pod if the permissions it requests are permitted by a corresponding SCC.

A security context constraint (SCC), defined in a cluster, enables an administrator to control pod permissions, which govern containers' access to protected Linux functions. Similarly to how role-based access control (RBAC) manages users' access to a cluster's resources, an SCC manages pods' access to Linux functions.

By default, in OpenShift v4.10 and earlier, a pod is assigned an SCC named restricted, which blocks access to protected functions; in OpenShift v4.11 and later, the restricted-v2 SCC is used by default. For an application to access protected functions, an SCC that allows them must be made available to the pod.

While an SCC grants access to protected functions, each pod needing access must request it. To request access to the functions its application needs, a pod specifies those permissions in the security context field of the pod manifest. The manifest also specifies the service account that should be able to grant this access.

When the manifest is deployed, the cluster associates the pod with the service account associated with the SCC. For the cluster to deploy the pod, the SCC must grant the permissions that the pod requests.

One way to envision this relationship is to think of the SCC as a lock that protects Linux functions, while the manifest is the key. The pod is allowed to deploy only if the key fits.
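
To make the lock-and-key analogy concrete, here is a minimal, hedged sketch of the "key" side: a pod manifest, expressed as a Python dict, that requests its permissions in a security context and names the service account the cluster will match against an SCC. The namespace, service account, image, and names are placeholder assumptions.

```python
import json

# Sketch of the "key" side of the analogy: a pod manifest that requests its
# permissions in a security context and names the service account the cluster
# matches against an SCC. All names and the image are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "payments-api", "namespace": "payments"},
    "spec": {
        "serviceAccountName": "payments-sa",  # the SCC is made available to this account
        "containers": [{
            "name": "api",
            "image": "registry.example.com/payments-api:1.2",
            "securityContext": {
                "runAsNonRoot": True,
                "readOnlyRootFilesystem": True,
                "allowPrivilegeEscalation": False,
            },
        }],
    },
}

# The cluster deploys this pod only if an SCC available to payments-sa
# grants every permission requested above.
print(json.dumps(pod, indent=2))
```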

Security Context Constraints
Diagram: Security Context Constraints. Source is IBM

A final note: Security context constraint

When your application is deployed to OpenShift in a virtual data center design, the default security model enforces that it runs under an assigned Unix user ID that is unique to the project it is deployed to. This prevents images from running as the Unix root user: the user ID a container runs as is assigned based on the project it is running in.

Containers cannot run as the root user by default—a big win for security. SCC also allows you to set different restrictions and security configurations for PODs.

So, instead of allowing your image to run as the root, which is a considerable security risk, you should run as an arbitrary user by specifying an unprivileged USER, setting the appropriate permissions on files and directories, and configuring your application to listen on unprivileged ports.

OpenShift Security Context Constraints
Diagram: OpenShift security context constraints.

OpenShift Network Security: SCC defaults access

Security context constraints let you drop privileges by default, which is essential and still the best practice. Red Hat OpenShift security context constraints (SCCs) ensure that no privileged containers run on OpenShift worker nodes by default—another big win for security. Access to the host network and host process IDs are denied by default. Users with the required permissions can adjust the default SCC policies to be more permissive.

When thinking about SCCs, consider the SCC admission controller as restricting pod access, similar to how RBAC restricts user access. To control the behavior of pods, we have security context constraints (SCCs): cluster-level resources that define what resources pods can access and provide additional control.


Restricted security context constraints (SCCs)

A few SCCs are available by default, and you may have heard of the restricted SCC. By default, all pods, except those for builds and deployments, use the default service account, which is assigned the restricted SCC. This doesn't allow privileged containers, that is, containers running as the root user or listening on privileged ports (ports below 1024). An SCC can be used to manage the following:

    1. Privileged mode: This setting allows or denies a container running in privileged mode. As you know, privileged mode bypasses restrictions such as control groups, Linux capabilities, and secure computing profiles.
    2. Privilege escalation: This setting enables or disables privilege escalation inside the container (the allowPrivilegeEscalation flags).
    3. Linux capabilities: This setting allows the addition or removal of specific Linux capabilities.
    4. Seccomp profile: This setting determines which secure computing profiles are used in a pod.
    5. Read-only root file system: This setting makes the root file system read-only.

The goal is to assign the fewest possible capabilities for a pod to function fully. This least-privileged model ensures that pods can’t perform tasks on the system that aren’t related to their application’s proper function. The default value for the privileged option is False; setting the privileged option to True is the same as giving the pod the capabilities of the root user on the system. Although doing so shouldn’t be common practice, privileged pods can be helpful under certain circumstances. 
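
The sketch below shows roughly what a custom, least-privilege SCC object could look like, expressed as a Python dict. The field names follow the security.openshift.io/v1 SecurityContextConstraints resource as I understand it, and the metadata name is a placeholder; verify the exact fields against the restricted SCC on your own cluster (for example, with oc get scc restricted -o yaml) before relying on them.

```python
import json

# Hedged sketch of a custom, least-privilege SCC modelled loosely on the
# restricted profile. Field names follow the security.openshift.io/v1
# SecurityContextConstraints resource as I understand it; verify against
# your cluster (e.g. `oc get scc restricted -o yaml`) before using.
scc = {
    "apiVersion": "security.openshift.io/v1",
    "kind": "SecurityContextConstraints",
    "metadata": {"name": "app-restricted"},     # placeholder name
    "allowPrivilegedContainer": False,          # no privileged pods
    "allowPrivilegeEscalation": False,
    "requiredDropCapabilities": ["ALL"],        # drop everything by default
    "allowedCapabilities": [],                  # least privilege: add nothing back
    "readOnlyRootFilesystem": True,
    "runAsUser": {"type": "MustRunAsRange"},    # project-assigned, non-root UID
    "seLinuxContext": {"type": "MustRunAs"},
    "users": [],                                # grant to service accounts, not broadly
    "groups": [],
}

print(json.dumps(scc, indent=2))
```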

OpenShift Network Security: Authentication

The term authentication refers to the process of validating one’s identity. Usually, users aren’t created in OpenShift but are provided by an external entity, such as the LDAP server or GitHub. The only part where OpenShift steps in is authorization—determining roles and permissions for a user.

OpenShift supports integration with various identity management solutions in corporate environments, such as FreeIPA/Identity Management, Active Directory, GitHub, Gitlab, OpenStack Keystone, and OpenID.

OpenShift Network Security: Users and identities

A user is any human actor who can request the OpenShift API to access resources and perform actions. Users are typically created in an external identity provider, usually a corporate identity management solution such as Lightweight Directory Access Protocol (LDAP) or Active Directory.

To support multiple identity providers, OpenShift relies on the concept of identities as a bridge between users and identity providers. By default, a new user and identity are created upon first login. There are four ways to map users to identities, determined by the identity provider's mapping method.

Service accounts

Service accounts allow us to control API access without sharing users' credentials. Pods and other non-human actors use them to perform various actions, and they are the central vehicle by which their access to resources is managed. By default, three service accounts are created in each project: builder, deployer, and default.

Authorization and role-based access control

Authorization in OpenShift is built around the following concepts:

Rules: Sets of actions allowed to be performed on specific resources.
Roles: Collections of rules that can be applied to a user according to a specific user profile. Roles can be used at either the cluster or the project level.
Role bindings: Associations between users/groups and roles. A given user or group can be associated with multiple roles.

If pre-defined roles aren’t sufficient, you can always create custom roles with just the specific rules you need.

Summary: OpenShift Security Best Practices

In the ever-evolving technological landscape, ensuring the security of your applications is of utmost importance. OpenShift, a powerful containerization platform, offers robust security features to protect your applications and data. This blog post explored some essential OpenShift security best practices to help you fortify your applications and safeguard sensitive information.

Section 1: Understand OpenShift Security Model

OpenShift follows a layered security model that provides multiple levels of protection. It is crucial to understand this model to implement adequate security measures. From authentication and authorization mechanisms to network policies and secure container configurations, OpenShift offers a comprehensive security framework.

Section 2: Implement Strong Authentication Mechanisms

Authentication is the first line of defense against unauthorized access. OpenShift supports various authentication methods, including username/password, token-based, and integration with external authentication providers like LDAP or Active Directory. Implementing robust authentication mechanisms ensures that only trusted users can access your applications and resources.

Section 3: Apply Fine-Grained Authorization Policies

Authorization plays a vital role in controlling users’ actions within the OpenShift environment. You can limit privileges to specific users or groups by defining fine-grained access control policies. OpenShift’s Role-Based Access Control (RBAC) allows you to assign roles with different levels of permissions, ensuring that each user has appropriate access rights.

Section 4: Secure Container Configurations

Containers are at the heart of OpenShift deployments; securing them is crucial for protecting your applications. Employing best practices such as using trusted container images, regularly updating base images, and restricting container capabilities can significantly reduce the risk of vulnerabilities. OpenShift’s security context constraints enable you to define and enforce security policies for containers, ensuring they run with the minimum required privileges.

Section 5: Enforce Network Policies

OpenShift provides network policies that enable you to define traffic flow rules between your application’s different components. By implementing network policies, you can control inbound and outbound traffic, restrict access to specific ports or IP ranges, and isolate sensitive components. This helps prevent unauthorized communication and protects your applications from potential attacks.
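
Below is a minimal sketch of this idea expressed as Kubernetes NetworkPolicy manifests, written as Python dicts: a default-deny ingress policy for a namespace plus a narrow allow rule. The namespace, labels, and port are assumptions for illustration.

```python
import json

# Sketch of a default-deny ingress policy plus a narrow allow rule: only pods
# labelled app=frontend may reach the payments pods on TCP 8443.
# Namespace, labels, and port are illustrative assumptions.
deny_all_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-ingress", "namespace": "payments"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
}

allow_frontend = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-frontend-to-payments", "namespace": "payments"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "payments"}},
        "ingress": [{
            "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
            "ports": [{"protocol": "TCP", "port": 8443}],
        }],
    },
}

print(json.dumps([deny_all_ingress, allow_frontend], indent=2))
```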

Conclusion:

Securing your applications on OpenShift requires a multi-faceted approach, encompassing various layers of protection. By understanding the OpenShift security model, implementing strong authentication and authorization mechanisms, securing container configurations, and enforcing network policies, you can enhance the overall security posture of your applications. Stay vigilant, keep up with the latest security updates, and regularly assess your security measures to mitigate potential risks effectively.

System Observability

Distributed Systems Observability

Distributed Systems Observability

In the realm of modern technology, distributed systems have become the backbone of numerous applications and services. However, the increasing complexity of such systems poses significant challenges when it comes to monitoring and understanding their behavior. This is where observability steps in, offering a comprehensive solution to gain insights into the intricate workings of distributed systems. In this blog post, we will embark on a captivating journey into the realm of distributed systems observability, exploring its key concepts, tools, and benefits.

Observability, as a concept, enables us to gain deep insights into the internal state of a system based on its external outputs. When it comes to distributed systems, observability takes on a whole new level of complexity. It encompasses the ability to effectively monitor, debug, and analyze the behavior of interconnected components across a distributed architecture. By employing various techniques and tools, observability allows us to gain a holistic understanding of the system's performance, bottlenecks, and potential issues.

To achieve observability in distributed systems, it is crucial to focus on three interconnected components: logs, metrics, and traces.

Logs provide a chronological record of events and activities within the system, offering valuable insights into what has occurred. By analyzing logs, engineers can identify anomalies, track down errors, and troubleshoot issues effectively.

Metrics, on the other hand, provide quantitative measurements of the system's performance and behavior. They offer a rich source of data that can be analyzed to gain a deeper understanding of resource utilization, response times, and overall system health.

Traces enable the visualization and analysis of transactions as they traverse through the distributed system. By capturing the flow of requests and their associated metadata, traces allow engineers to identify bottlenecks, latency issues, and performance optimizations.

In the ever-evolving landscape of distributed systems observability, a plethora of tools and frameworks have emerged to simplify the process. Prominent examples include:

1. Prometheus: A powerful open-source monitoring and alerting system that excels in collecting and storing metrics from distributed environments.

2. Jaeger: An end-to-end distributed tracing system that enables the visualization and analysis of transaction flows across complex systems.

3. ELK Stack: A comprehensive combination of Elasticsearch, Logstash, and Kibana, which collectively offer powerful log management, analysis, and visualization capabilities.

4. Grafana: A widely-used open-source platform for creating rich and interactive dashboards, allowing engineers to visualize metrics and logs in real-time.

The adoption of observability in distributed systems brings forth a multitude of benefits. It empowers engineers and DevOps teams to proactively detect and diagnose issues, leading to faster troubleshooting and reduced downtime. Observability also aids in capacity planning, resource optimization, and identifying performance bottlenecks. Moreover, it facilitates collaboration between teams by providing a shared understanding of the system's behavior and enabling effective communication.

Conclusion: In the ever-evolving landscape of distributed systems, observability plays a pivotal role in unraveling the complexity and gaining insights into system behavior. By leveraging the power of logs, metrics, and traces, along with robust tools and frameworks, engineers can navigate the intricate world of distributed systems with confidence. Embracing observability empowers organizations to build resilient, high-performing systems that can withstand the challenges of today's digital landscape.

Highlights: Distributed Systems Observability

The Role of Distributed Systems

Several decades ago, only a handful of mission-critical services worldwide were required to meet the availability and reliability requirements of today’s always-on applications and APIs. In response to user demand, every application must be built to scale nearly instantly to accommodate the potential for rapid, viral growth. Almost every app built today—whether a mobile app for consumers or a backend payment system—must meet these constraints and requirements.

Inherently, distributed systems are more reliable due to their distributed nature. When appropriately designed software engineers build these systems, they can benefit from more scalable organizational models. There is, however, a price to pay for these advantages. Designing, building, and debugging these distributed systems can be challenging. A reliable distributed system requires significantly more engineering skills than a single-machine application, such as a mobile app or a web frontend. Regardless, distributed systems are becoming increasingly important. There is a corresponding need for tools, patterns, and practices to build them.

Multicloud environments are increasingly distributed.

As digital transformation accelerates, organizations adopt multicloud environments to drive secure innovation and achieve speed, scale, and agility. As a result, technology stacks are becoming increasingly complex and scalable. Today, even the simplest digital transaction is supported by an array of cloud-native services and platforms delivered by a variety of providers. In order to improve user experience and resilience, IT and security teams must monitor and manage their applications.

Fragmented Monitoring Tools

Fragmented monitoring tools and manual analytics strategies challenge IT and security teams. The lack of a single source of truth and real-time insight makes it increasingly difficult for these teams to access the answers they need to accelerate innovation and optimize digital services. To gain insight, they must manually query data from various monitoring tools and piece together different sources of information. This complex and time-consuming process distracts team members from driving innovation and creating new value for the business and customers. In addition, many teams monitor only their mission-critical applications due to the effort involved in managing all these tools, platforms, and dashboards. The result is a multitude of blind spots across the technology stack, which makes it harder for teams to gain insights.

Kubernetes is Complex

Understanding how Kubernetes adds to the complexity of technology stacks is imperative. In the drive toward modern technology stacks, it is the platform of choice for organizations refactoring their applications for the cloud-native world. Through dynamic resource provisioning, Kubernetes architectures can quickly scale services to new users and increase efficiency.

However, the constant changes in cloud environments make it difficult for IT and security teams to maintain visibility into them. These teams cannot provide observability in their Kubernetes environments by manually configuring various traditional monitoring tools. As a result, they are often unable to gain real-time insights to improve user experience, optimize costs, and strengthen security. Due to this visibility challenge, many organizations are delaying moving more of their mission-critical services to Kubernetes.

The Role of Megatrends

A considerable drive in innovation has spawned several megatrends that affect how we manage and view our network infrastructure and that create the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of a monolith, where everything is generally housed internally, we must now manage and operate many services and dependencies in multiple locations, aka microservices observability. The megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

Shift in Control

There has also been a shift in the point of control. As we move towards new technologies, many of the loosely coupled services or infrastructures your services depend on are not under your control. The edge of control has been pushed out, creating different network and security perimeters. These perimeters are now closer to the workload than to a central security stack. Therefore, the workloads themselves must be concerned with security.

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions



Distributed Systems Observability.

Key Distributed Systems Observability points:


  • We no longer have predictable failures.

  • The different demands on networks.

  • The issues with the metric-based approach.

  • Static thresholds and alerting.

  • The 3 pillars of Observability.

Back to Basics with Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, would have applied to just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

System Observability Design
Diagram: Systems Observability design.

The Key Components of Observability:

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring:

Monitoring involves the continuous collection and analysis of system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.
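
As a small, hedged illustration of metric instrumentation, the sketch below uses the Python prometheus_client library to expose a request counter and a latency histogram that a Prometheus server could scrape. The metric names, labels, and port are illustrative choices, not a required convention.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Request counter and latency histogram; names and labels are illustrative.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request(path: str) -> None:
    with LATENCY.time():                        # record how long the handler took
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
    REQUESTS.labels(path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)                     # metrics exposed at :8000/metrics
    while True:
        handle_request("/checkout")
```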

2. Logging:

Logging involves recording events, activities, and errors within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting.
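
The following minimal sketch shows one common practice here: emitting structured (JSON) log lines with the Python standard library so they can be centralized and queried by field. The service name and the request_id field are assumptions for illustration.

```python
import json
import logging

# Emit one JSON object per log line so a central store can index and query
# individual fields. Service name and request_id are illustrative.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "payments-api",
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charge accepted", extra={"request_id": "req-123"})
```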

3. Tracing:

Tracing involves capturing the flow of requests and interactions between different distributed system components. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact.

Benefits of Observability in Distributed Systems:

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting:

Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization:

By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management:

Observability facilitates monitoring system changes and their impact on overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps maintain system stability and avoid unexpected issues.

How This Affects Failures

The primary issue I have seen with my clients is that application failures are no longer predictable; dynamic systems can fail creatively, challenging existing monitoring solutions and, more importantly, the practices that support them. We have many partial failures that are not just unexpected but have never been seen before. For example, recall the network hero.

The network hero

The network hero is someone who knows every part of the network and has seen every failure at least once. These people are no longer enough in today's world; we need proper observability instead. When I was working as an engineer, we would have plenty of failures, but more than likely we would have seen them before, and there was a system in place to fix the error. Today's environment is much different.

We can no longer rely on simply seeing UP or DOWN, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer's perspective. If your POD runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customer's perspective at what matters to them. Content Delivery Networks (CDNs) were among the first to recognize this and to measure what matters most to the customer.

Distributed Systems Observability

The different demands

So, the new, modern, and complex distributed systems place very different demands on your infrastructure and the people who manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and, therefore, slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is, therefore, unavailable.
    • The microservice could be fine, but there could be slow-running database queries.

In short, we have a lot of partial failures.

Therefore, we can no longer predict

The big shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and good system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But, they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization and more than likely managed by other people trying to monitor a very dispersed application with multiple components and services in various places. 

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known, predictable failure modes. So, we have predefined thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems are either over or under the predictable thresholds that were previously set. Then, we can set alerts, and we hope that these alerts are actionable. This is only useful for variants of predictable failure modes.

Traditional metrics and monitoring tools can tell you about performance spikes or notify you that a problem has occurred. But they don't let you dig into the source of the issue, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it harder to get to the root cause in a reasonable timeframe.

Traditional style metrics systems

With traditional metrics systems, you had to define custom metrics, which were always defined upfront. This approach prevents us from asking new questions about problems later, because the questions to ask had to be known in advance.

Then, we set performance thresholds, pronounced them “good” or “bad,” and checked and re-checked those thresholds, tweaking them over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. Always observe instead of waiting for problems, such as a certain threshold being reached, before acting.

System Observability Analysis
Diagram: System Observability analysis.

Metrics: Lack of connective event

Metrics do not retain the connective event, so you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is actually happening.

An example of this could be an odd number of running threads on one component, which might indicate garbage collection is in progress or that slow response times are imminent in an upstream service.

User experience and static thresholds

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in different ways, using various components and having experiences that can vary widely. We also know now that services no longer tend to break in the same few predictable ways over and over.

Alerts should be few and should be triggered only by symptoms that directly impact user experience, not simply because a threshold was reached.

The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static monitoring thresholds can’t reflect impacts on user experience. They lack context and are too coarse.
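
A tiny sketch of the difference in mindset: rather than alerting on a fixed resource threshold, measure the share of requests that meet a latency target the user actually cares about, and alert when that share drops below an objective. The sample latencies, target, and objective below are made up for illustration.

```python
# Recent request latencies in milliseconds (made-up sample data).
latency_ms = [120, 95, 300, 110, 2400, 130, 105, 90, 5000, 115]

TARGET_MS = 250    # what "fast enough" means to the user
OBJECTIVE = 0.95   # at least 95% of requests should meet the target

good = sum(1 for v in latency_ms if v <= TARGET_MS)
ratio = good / len(latency_ms)

if ratio < OBJECTIVE:
    print(f"ALERT: only {ratio:.0%} of requests met the {TARGET_MS} ms target")
else:
    print(f"OK: {ratio:.0%} of requests met the target")
```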

The Need For Distributed Systems Observability

Observability and reliability in distributed systems are a practice. Rather than just focusing on a tool that does logging, metrics, or alerting, observability is all about how you approach problems, and for this, you need to look at your culture. You could say that observability is a cultural practice that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint and want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers, physical or virtual, and the network, and what data looks like both in transit and at rest.

What level of observation is needed to ensure everything is performing as it should? What should you look at to obtain this level of detail?

Monitoring is knowing the data points and the entities from which we gather information. Observability, on the other hand, is putting all of that data together: monitoring collects the data, while observability assembles it into one single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is getting the data and feeding it into the systems. A vital part of an observability toolkit is service level objectives (SLOs).

The three pillars of distributed systems observability

There are three pillars of systems observability: metrics, traces, and logging. Defining or viewing observability as just these pillars is an oversimplification, but you do need them in place. Observability is all about connecting the dots from each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in service request executions. As a result, it doesn't matter if services have complex dependencies. You could say that the complexity of dynamic systems is abstracted away with distributed tracing.

Use Case: Challenges without tracing.

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

Distributed tracing: A winning formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them challenging to debug unless their relationships are clearly understood.

In distributed systems, observability is vital in ensuring complex architectures’ stability, performance, and reliability. Monitoring, logging, and tracing provide engineers with the tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.

Summary: Distributed Systems Observability

In the vast landscape of distributed systems, observability is crucial in ensuring their reliable and efficient functioning. This blog post aims to delve into the critical components of distributed systems observability and shed light on their significance.

Telemetry

Telemetry forms the foundation of observability in distributed systems. It involves collecting, processing, and analyzing various metrics, logs, and traces. By monitoring and measuring these data points, developers gain valuable insights into the performance and behavior of their distributed systems.

Logging

Logging is an essential component of observability, providing a detailed record of events and activities within a distributed system. It captures important information such as errors, warnings, and informational messages, which aids in troubleshooting and debugging. Properly implemented logging mechanisms enable developers to identify and resolve issues promptly.

Metrics

Metrics are quantifiable measurements that provide a high-level view of the health and performance of a distributed system. They offer valuable insights into resource utilization, throughput, latency, error rates, and other critical indicators. By monitoring and analyzing metrics, developers can proactively identify bottlenecks, optimize performance, and ensure the smooth operation of their systems.

Tracing

Tracing allows developers to understand the flow and behavior of requests as they traverse through a distributed system. It provides detailed information about the path a request takes, including the various services and components it interacts with. Tracing is instrumental in diagnosing and resolving performance issues, as it highlights potential latency hotspots and bottlenecks.

Alerting and Visualization

Alerting mechanisms and visualization tools are vital for effective observability in distributed systems. Alerts notify developers when certain predefined thresholds or conditions are met, enabling them to take timely action. Visualization tools provide intuitive and comprehensive representations of system metrics, logs, and traces, making identifying patterns, trends, and anomalies easier.

Conclusion

In conclusion, the key components of distributed systems observability, namely telemetry, logging, metrics, tracing, alerting, and visualization, form a comprehensive toolkit for monitoring and understanding the intricacies of such systems. By leveraging these components effectively, developers can ensure their distributed systems’ reliability, performance, and scalability.

Reliability in Distributed Systems

Reliability In Distributed System

Reliability In Distributed System

Distributed systems have become an integral part of our modern technological landscape. Whether it's cloud computing, internet banking, or online shopping, these systems play a crucial role in providing seamless services to users worldwide. However, as distributed systems grow in complexity, ensuring their reliability becomes increasingly challenging.

In this blog post, we will explore the concept of reliability in distributed systems and discuss various techniques to achieve fault-tolerant operations.

Reliability in distributed systems refers to the ability of the system to consistently function as intended, even in the presence of hardware failures, network partitions, and other unforeseen events. To achieve reliability, system designers employ various techniques, such as redundancy, replication, and fault tolerance, to minimize the impact of failures and ensure continuous service availability.

Highlights: Reliability In Distributed System

Shift in Landscape

When considering reliability in a distributed system, considerable shifts in the technology landscape have forced us to examine how we operate and run our systems and networks. The introduction of various cloud platforms, their services, and containers, along with the complexity of managing distributed systems observability and microservices observability, has unveiled significant gaps in our current technologies, not to mention flaws in the operational practices around them.

Existing Static Tools

This has caused a knee-jerk reaction and a welcome drive in innovation around system reliability. Yet some of the technologies and tools used to manage these innovations have not kept pace; many have stayed relatively static in a dynamic environment. So, we have static tools used in a dynamic environment, which creates friction for reliability in distributed systems and drives the need for more efficient network visibility.

Understanding the Complexity

Distributed systems are inherently complex, with multiple components across different machines or networks. This complexity introduces challenges like network latency, hardware failures, and communication bottlenecks. Understanding the intricate nature of distributed systems is crucial to devising reliable solutions.

Redundancy and Replication

One critical approach to enhancing reliability in distributed systems is redundancy and replication. The system becomes more fault-tolerant by duplicating critical components or data across multiple nodes. This ensures the system can function seamlessly even if one component fails, minimizing the risk of complete failure.

Consistency and Consensus Algorithms

Maintaining consistency in distributed systems is a significant challenge due to the possibility of concurrent updates and network delays. Consensus algorithms, such as the Paxos or Raft algorithms, are vital in achieving consistency by ensuring agreement among distributed nodes. These algorithms enable reliable decision-making and guarantee that all nodes reach a consensus state.

Reliability in distributed systems

Monitoring and Failure Detection

To ensure reliability, it is essential to have robust monitoring mechanisms in place. Monitoring tools can track system performance, resource utilization, and network health. Additionally, implementing efficient failure detection mechanisms allows for prompt identification of faulty components, enabling proactive measures to mitigate their impact on the overall system.
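
As a simple, hedged sketch of failure detection, the snippet below implements a naive heartbeat check: every node is expected to report within a timeout, and any node silent for longer is flagged as suspect. The node names and timeout value are arbitrary examples, and real detectors (gossip, phi-accrual) are considerably more sophisticated.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is suspected

last_seen: dict[str, float] = {}

def record_heartbeat(node: str) -> None:
    """Called whenever a node reports in."""
    last_seen[node] = time.monotonic()

def suspected_nodes() -> list[str]:
    """Nodes that have not reported within the timeout."""
    now = time.monotonic()
    return [n for n, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

record_heartbeat("node-a")
record_heartbeat("node-b")
time.sleep(0.1)
print("suspected:", suspected_nodes())  # empty unless a node has gone quiet
```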

Load Balancing and Scalability

Load balancing is crucial in distributing the workload evenly across nodes in a distributed system. It ensures that no single node is overwhelmed, reducing the risk of system instability. Furthermore, designing systems with scalability in mind allows for seamless expansion as the workload grows, ensuring that reliability is maintained even during periods of high demand.
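
A minimal sketch of the simplest balancing strategy, round-robin selection over a pool of backends; the addresses are placeholders, and production balancers typically add health checks and weighting on top of this idea.

```python
import itertools

# Placeholder backend addresses.
backends = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]
rotation = itertools.cycle(backends)

def pick_backend() -> str:
    """Return the next backend in strict rotation."""
    return next(rotation)

for _ in range(6):
    print("route request to", pick_backend())
```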

Related: Before you proceed, you may find the following post helpful:

  1. Distributed Firewalls
  2. SD WAN Static Network Based

 



Reliability In Distributed Systems


Key Reliability in Distributed System Discussion Points:


  • Complexity managing distributed systems.

  • Static tools in a dynamic environment.

  • Observability vs Monitoring.

  • Creative failures and black holes.

  • SRE teams and service level objectives.

  • New tools: Distributed tracing.

 

Back to Basics: Reliability in Distributed Systems

Understanding Distributed Systems

Distributed systems refer to a network of interconnected computers that communicate and coordinate their actions to achieve a common goal. Unlike traditional centralized systems, where a single entity controls all components, distributed systems distribute tasks and data across multiple nodes. This decentralized approach enables enhanced scalability, fault tolerance, and resource utilization.

Key Components of Distributed Systems

To comprehend the inner workings of distributed systems, we must familiarize ourselves with their key components. These components include nodes, communication channels, protocols, and distributed file systems. Nodes represent individual machines or devices within the network; communication channels facilitate data transmission, protocols ensure reliable communication, and distributed file systems enable data storage across multiple nodes.

Distributed vs centralized

 

Distributed Systems Use Cases

Distributed systems are used in many modern applications. Mobile and web applications with high traffic are distributed systems. Web browsers or mobile applications serve as clients in a client-server environment. As a result, the server becomes its own distributed system. The modern web server follows a multi-tier system pattern. Requests are delegated to several server logic nodes via a load balancer.

Kubernetes is popular among distributed systems since it enables containers to be combined into a distributed system. Kubernetes orchestrates network communication between the distributed system nodes and handles dynamic horizontal and vertical scaling of the nodes. 

Cryptocurrencies like Bitcoin and Ethereum are also distributed systems that are peer-to-peer. The currency ledger is replicated at every node in a cryptocurrency network. To bootstrap, a currency node connects to other nodes and downloads its full ledger copy. Additionally, cryptocurrency wallets use JSON RPC to communicate with the ledger nodes.

Challenges in Distributed Systems

While distributed systems offer numerous advantages, they also pose various challenges. One significant challenge is achieving consensus among distributed nodes. Ensuring that all nodes agree on a particular value or decision can be complex, especially in the presence of failures or network partitions. Additionally, maintaining data consistency across distributed nodes and mitigating issues related to concurrency control requires careful design and implementation.

Example: Distributed System of Microservices

Microservices are one type of distributed system since they decompose an application into individual components. A microservice architecture, for example, may have services corresponding to business features (payments, users, products, etc.), with each component handling the corresponding business logic. Multiple redundant copies of the services will then be available, so there is no single point of failure.

microservices

Example: Distributed Tracing

Using distributed tracing, you can profile or monitor the results of requests across a distributed system. Distributed systems can be challenging to monitor since each node generates its logs and metrics. To get a complete view of a distributed system, it is necessary to aggregate these separate node metrics holistically. 

A distributed system generally doesn’t access its entire set of nodes but rather a path through those nodes. With distributed tracing, teams can analyze and monitor commonly accessed paths through a distributed system. Distributed tracing is installed on each system node, allowing teams to query the system for information on node health and performance.

Benefits and Applications

Despite the challenges, distributed systems offer a wide array of benefits. One notable advantage is enhanced fault tolerance. Distributing tasks and data across multiple nodes improves system reliability, as a single point of failure does not bring down the entire system. Additionally, distributed systems enable improved scalability, accommodating growing demands by adding more nodes to the network. The applications of distributed systems are vast, ranging from cloud computing and large-scale data processing to peer-to-peer networks and distributed databases.

 

Distributed Systems: The Challenge

Distributed systems are required to implement the reliability, agility, and scale expected of modern computer programs. Distributed systems are applications composed of many different components running on many different machines. Containers are the foundational building block, and groups of containers co-located on a single device comprise the atomic elements of distributed system patterns.

Distributed System Observability

The significant shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. We need to consider new practices and technologies, with dedicated platform teams, to enable a new era of system reliability in a distributed system, along with the practices of observability, which are a step up from the traditional monitoring of static infrastructure: observability vs monitoring.

 

Knowledge Check: Distributed Systems Architecture

Client-Server Architecture

A client-server architecture has two primary responsibilities. The client presents user interfaces and is then connected to the server via a network. The server handles business logic and state management. Unless the server is redundant, a client-server architecture can quickly degrade into a centralized architecture. A truly distributed client-server setup will consist of multiple server nodes that distribute client connections. In modern client-server architectures, clients connect to encapsulated distributed systems on the server.

Multi-tier Architecture

Multi-tier architectures are extensions of client-server architectures. Multi-tier architectures decompose servers into further granular nodes, which decouple additional backend server responsibilities like data processing and data management. By processing long-running jobs asynchronously, these additional nodes free up the remaining backend nodes to focus on responding to client requests and interacting with the data store.

Peer-to-Peer Architecture

Peer-to-peer distributed systems contain complete instances of applications on each node. There is no separation between presentation and data processing at the node level. A node consists of a presentation layer and a data handling layer. Peer nodes may contain the entire state data of the system. 

Peer-to-peer systems have a great deal of redundancy. Peer-to-peer nodes discover and connect to other peers when they are initiated and brought online, thereby synchronizing their local state with the system’s. As a result, a peer-to-peer network won’t be disrupted by the failure of one node, and the system as a whole will persist.

Service-orientated Architecture

A service-oriented architecture (SOA) is a precursor to microservices. Microservices differ from SOA primarily in their node scope, which is at the feature level. Each microservice node encapsulates a specific set of business logic, such as payment processing. Multiple nodes of business logic interface with independent databases in a microservice architecture. In contrast, SOA nodes encapsulate an entire application or enterprise division. Database systems are typically included within the service boundary of SOA nodes.

Because of their benefits, microservices have become more popular than SOA. The small service nodes provide functionality that teams can reuse through microservices. The advantages of microservices include greater robustness and a more extraordinary ability for vertical and horizontal scaling to be dynamic.

 

Reliability in Distributed Systems: Components

Redundancy and Replication:

Redundancy and replication are two fundamental concepts distributed systems use to enhance reliability. Redundancy involves duplicating critical system components, such as servers, storage devices, or network links, so the redundant component can seamlessly take over if one fails. Replication, on the other hand, involves creating multiple copies of data across different nodes in a system, enabling efficient data access and fault tolerance. By incorporating redundancy and replication, distributed systems can continue to operate even when individual components fail.

Fault Tolerance:

Fault tolerance is a crucial aspect of achieving reliability in distributed systems. It involves designing systems to operate correctly even when one or more components encounter failures. Several techniques, such as error detection, recovery, and prevention mechanisms, are employed to achieve fault tolerance.

Error Detection:

Error detection techniques, such as checksums, hashing, and cyclic redundancy checks (CRC), identify errors or data corruption during transmission or storage. By verifying data integrity, these techniques help identify and mitigate potential failures in distributed systems.
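
As a small illustration of checksum-based error detection, the sketch below uses CRC32 from the Python standard library: the sender attaches a checksum to the payload, and the receiver recomputes and compares it. The payload format is invented for the example.

```python
import zlib

def send(payload: bytes) -> tuple[bytes, int]:
    """Sender attaches a CRC32 checksum to the payload."""
    return payload, zlib.crc32(payload)

def receive(payload: bytes, checksum: int) -> bytes:
    """Receiver recomputes the checksum and rejects corrupted data."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    return payload

data, crc = send(b"transfer:42.00:acct-123")
receive(data, crc)                            # passes
try:
    receive(b"transfer:92.00:acct-123", crc)  # simulated flipped byte
except ValueError as err:
    print(err)
```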

Error Recovery:

Error recovery mechanisms, such as checkpointing and rollback recovery, aim to restore the system to a consistent state after a failure. Checkpointing involves periodically saving the system’s state and data, allowing recovery to a previously known good state in case of failures. On the other hand, rollback recovery involves undoing the effects of failed operations and returning the system to a consistent state.

Error Prevention:

To enhance reliability, distributed systems employ error prevention techniques, such as redundancy elimination, consensus algorithms, and load balancing. Redundancy elimination reduces unnecessary duplication of data or computation, thereby reducing the chances of errors. Consensus algorithms ensure that all nodes in a distributed system agree on a shared state despite failures or message delays. Load balancing techniques distribute computational tasks evenly across multiple nodes to prevent overloading and potential shortcomings.

 

Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they look to capture and investigate signals in isolation. The monitoring systems work in a siloed environment, similar to that of developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” that are common in modern distributed systems. This often leads to disruptions of services. So you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today lack predictability—certainly not enough predictability to rely on static thresholds, alerts, and old monitoring tools. If something is fixed, it can be automated, and we have static events, such as in Kubernetes, a POD reaching a limit.

Then, a replica set introduces another pod on a different node if specific parameters are met, such as Kubernetes Labels and Node Selectors. However, this is only a tiny piece of the failure puzzle in a distributed environment.  Today, we have what’s known as partial failures and systems that fail in very creative ways.

 

Reliability In Distributed System: Creative ways to fail

So, we know that some of these failures are quickly predicted, and actions are taken. For example, if this Kubernetes POD node reaches a specific utilization, we can automatically reschedule PODs on a different node to stay within our known scale limits.

Predictable failures can be automated in Kubernetes and with any infrastructure. An Ansible script is useful when these events occur. However, we have much more to deal with than POD scaling; we have many partial and complicated failures known as black holes.

 

In today’s world of partial failures

Microservices applications are distributed and susceptible to many external factors. On the other hand, if you examine the traditional monolithic application style, all the functions reside in the same process. It was either switched ON or OFF; not much happened in between. So, if there was a failure in the process, the application as a whole would fail. The results were binary, usually either UP or DOWN.

With some essential monitoring, this was easy to detect, and failures were predictable. There was no such thing as a partial failure. In a monolith application, all application functions are within the same process. A significant benefit of these monoliths is that you don’t have partial failures.

However, in a cloud-native world, where we have broken the old monolith into a microservices-based application, a client request can go through multiple hops of microservices, and we can have several problems to deal with.

There is a lack of connectivity between the different domains. Many monitoring tools and knowledge will be tied to each domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is a critical metric to care about.

 

System reliability: Today, you have no way to predict

So, the new, modern, and complex distributed systems place very different demands on your infrastructure—considerably different from the simple three-tier application, where everything is generally housed in one location.  We can’t predict anything anymore, which breaks traditional monitoring approaches.

When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcomed strategy.

 

Blackholes: Strange failure modes

When considering a distributed system, many things can happen. A service or an entire region can disappear, or vanish for a few seconds or milliseconds and then reappear. When we hit these strange failure modes, we say traffic is going into a black hole: anything that goes into it disappears. Peculiar failure modes like these are unexpected and surprising.

There is certainly nothing predictable about strange failure modes. So, what happens when your banking transactions are in a black hole? What if your banking balance is displayed incorrectly or if you make a transfer to an external account and it does not show up? 

 

Highlighting Site Reliability Engineering (SRE) and Observability

Site reliability engineering (SRE) and observability practices are needed to manage these kinds of unpredictable and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems.

You need to be proactive and fix these issues up front, getting ahead of the curve so the incidents never occur. In practice, this shift usually happens in the wake of a massive incident, which acts as a teachable moment and provides the impetus to invest in a Chaos Engineering project.

 

New tools and technologies: Distributed tracing

We also have new tools, such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here, you can use distributed tracing with OpenTelemetry. Tracing helps us instrument our system and figure out where the time is being spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
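As a minimal sketch of what OpenTelemetry instrumentation can look like in Python (assuming the opentelemetry-sdk package is installed; the service name, span names, and attribute are illustrative), the example below creates nested spans for a request that calls a downstream service and exports them to the console so you can see where the time went.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative name

def call_payment_service() -> None:
    # Child span: time spent in the downstream dependency shows up here.
    with tracer.start_as_current_span("payment-service.charge"):
        time.sleep(0.05)  # stand-in for a remote call

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", "12345")  # illustrative attribute
    call_payment_service()
```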


 

SLA, SLI, SLO, and Error Budgets

We don’t just want to know that something has happened and then react to an event; that view says nothing about the customer’s perspective. We need to understand whether we are meeting our SLA by gathering the number and frequency of outages and any performance issues.

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist you with these measurements. They not only help you measure but also provide a tool for achieving better reliability, forming the base of the reliability stack.
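To make the SLO and error-budget relationship concrete, here is a small Python sketch (the 99.9% target and 30-day window are example values) that converts an availability SLO into an error budget in minutes and checks how much of it has been burned.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime for the window, given an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget left after the observed downtime."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target (example)
    print(f"Budget: {error_budget_minutes(slo):.1f} min")       # ~43.2 min per 30 days
    print(f"Remaining: {budget_remaining(slo, 12.0):.1f} min")  # after 12 min of outages
```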

 

Summary: Reliability In Distributed System

In modern technology, distributed systems have become the backbone of numerous applications and services. These systems, consisting of interconnected nodes, provide scalability, fault tolerance, and improved performance. However, maintaining reliability in such distributed environments is a challenging endeavor. This blog post explored the key aspects and strategies for ensuring reliability in distributed systems.

Section 1: Understanding the Challenges

Distributed systems face a myriad of challenges that can impact their reliability. These challenges include network failures, node failures, message delays, and data inconsistencies. These aspects can introduce vulnerabilities that may disrupt system operations and compromise reliability.

Section 2: Replication for Resilience

One of the fundamental techniques for enhancing reliability in distributed systems is data replication. By replicating data across multiple nodes, system resilience is improved. Replication increases fault tolerance and enables load balancing and localized data access. However, managing consistency and synchronization among the replicated copies is crucial to maintaining reliability.

Section 3: Consensus Protocols

Consensus protocols play a vital role in achieving reliability in distributed systems. These protocols enable nodes to agree on a shared state despite failures or network partitions. Popular consensus algorithms such as Paxos and Raft ensure that distributed nodes reach a consensus, making them resilient against failures and maintaining system reliability.

Section 4: Fault Detection and Recovery

Detecting faults in a distributed system is crucial for maintaining reliability. Techniques like heartbeat monitoring, failure detectors, and health checks aid in identifying faulty nodes or network failures. Once a fault is detected, recovery mechanisms such as automatic restarts, replica synchronization, or reconfigurations can be employed to restore system reliability.
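The heartbeat idea mentioned above can be sketched in a few lines of Python: each node periodically reports in, and any node that has not been heard from within a timeout is flagged as suspect. This is a toy, single-process illustration (node names and the timeout are arbitrary), not a production failure detector.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is suspect

class HeartbeatMonitor:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record that we heard from this node just now."""
        self.last_seen[node] = time.monotonic()

    def suspected_failures(self) -> list[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

if __name__ == "__main__":
    monitor = HeartbeatMonitor()
    monitor.heartbeat("node-a")
    monitor.heartbeat("node-b")
    time.sleep(1)
    monitor.heartbeat("node-a")          # node-b stops reporting
    print(monitor.suspected_failures())  # empty until node-b's timeout expires
```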

Section 5: Load Balancing and Scalability

Reliability in distributed systems can also be enhanced through load balancing and scalability. By distributing the workload evenly among nodes and dynamically scaling resources, the system can handle varying demands and prevent bottlenecks. Load-balancing algorithms and auto-scaling mechanisms contribute to overall system reliability.

Conclusion:

In the world of distributed systems, reliability is a paramount concern. By understanding the challenges, employing replication techniques, utilizing consensus protocols, implementing fault detection and recovery mechanisms, and focusing on load balancing and scalability, we can embark on a journey of resilience. Ensuring reliability in distributed systems requires careful planning, robust architectures, and continuous monitoring. By addressing these aspects, we can build distributed systems that are truly reliable, empowering businesses and users alike.

Chaos Engineering

Baseline Engineering

Baseline Engineering

In today's fast-paced digital landscape, network performance plays a vital role in ensuring seamless connectivity and efficient operations. Network baseline engineering is a powerful technique that allows organizations to establish a solid foundation for optimizing network performance, identifying anomalies, and planning for future scalability. In this blog post, we will explore the ins and outs of network baseline engineering and its significant benefits.

Network baseline engineering is the process of establishing a benchmark or reference point for network performance metrics. By monitoring and analyzing network traffic patterns, bandwidth utilization, latency, and other key parameters over a specific period, organizations can create a baseline that represents the normal behavior of their network. This baseline becomes a crucial reference for detecting deviations, troubleshooting issues, and capacity planning.

Proactive Issue Detection: One of the primary advantages of network baseline engineering is the ability to proactively detect and address network issues. By comparing real-time network performance against the established baseline, anomalies and deviations can be quickly identified. This allows network administrators to take immediate action to resolve potential problems before they escalate and impact user experience.

Improved Performance Optimization: With a solid network baseline in place, organizations can gain valuable insights into network performance patterns. This information can be leveraged to fine-tune configurations, optimize resource allocation, and enhance overall network efficiency. By understanding the normal behavior of the network, administrators can make informed decisions to improve performance and provide a seamless user experience.

Data Collection: The first step in network baseline engineering is collecting relevant data, including network traffic statistics, bandwidth usage, application performance, and other performance metrics. This data can be obtained from network monitoring tools, SNMP agents, flow analyzers, and other network monitoring solutions.

Data Analysis and Baseline Creation: Once the data is collected, it needs to be analyzed to identify patterns, trends, and normal behavior. Statistical analysis techniques, such as mean, median, and standard deviation, can be applied to determine the baseline values for various performance parameters. This process may involve using specialized software or network monitoring platforms.
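A minimal sketch of that statistical step in Python, using the standard library's statistics module on a hypothetical set of latency samples: it computes the mean, median, and standard deviation that would form the baseline values.

```python
import statistics

# Hypothetical latency samples (ms) collected over the baseline window.
latency_ms = [22, 25, 21, 24, 23, 27, 22, 26, 24, 23]

baseline = {
    "mean": statistics.mean(latency_ms),
    "median": statistics.median(latency_ms),
    "stdev": statistics.stdev(latency_ms),
}

print(baseline)
# e.g. {'mean': 23.7, 'median': 23.5, 'stdev': ~1.9} -- these become the reference values
```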

Maintaining and Updating the Network Baseline: Networks are dynamic environments, and their behavior can change over time due to various factors such as increased user demands, infrastructure upgrades, or new applications. It is essential to regularly review and update the network baseline to reflect these changes accurately. By periodically reevaluating the baseline, organizations can ensure its relevance and effectiveness in capturing the network's current behavior.

Conclusion: Network baseline engineering is a fundamental practice that empowers organizations to better understand, optimize, and maintain their network infrastructure. By establishing a reliable baseline, organizations can proactively detect issues, enhance performance, and make informed decisions for future network expansion. Embracing network baseline engineering sets the stage for a robust and resilient network that supports the ever-growing demands of the digital age.

Highlights: Baseline Engineering

Traditional Network Infrastructure

Baseline engineering was easy in the past; applications ran in a single private data center, or perhaps two for high availability. There may have been some satellite PoPs, but generally, everything was housed in a few locations. These data centers were on-premises, and all components were housed internally. As a result, troubleshooting, monitoring, and baselining issues was relatively easy. The network and infrastructure were fairly static, the network and security perimeters were known, and the stack did not change much from day to day.

Distributed Applications

However, nowadays, we are in a completely different environment. We have distributed applications with components and services located in many different places and types of environments, on-premises and in the cloud, with dependencies on both local and remote services. We span multiple sites and accommodate multiple workload types.

Compared to the monolith, today’s applications have many different entry points to the external world. All of this calls for the practice of baseline engineering and chaos engineering on Kubernetes so you can fully understand your infrastructure and scaling issues.

The Role of Network Baselining

Network baselining involves capturing and analyzing network traffic data to establish a benchmark or baseline for normal network behavior. This baseline represents the typical performance metrics of the network under regular conditions. It encompasses various parameters such as bandwidth utilization, latency, packet loss, and throughput. By monitoring these metrics over time, administrators can identify patterns, trends, and anomalies, enabling them to make informed decisions about network optimization and troubleshooting.

Before you proceed, you may find the following posts helpful:

  1. Network Traffic Engineering
  2. Low Latency Network Design
  3. Transport SDN
  4. Load Balancing
  5. What is OpenFlow
  6. Observability vs Monitoring
  7. Kubernetes Security Best Practice

 



Baseline Engineering


Key Baseline Engineering Discussion Points:


  • Monitoring was easy in the past.

  • How to start a baseline engineering project.

  • Distributed components and latency.

  • Chaos Engineering Kubernetes.

Back to basics with baseline engineering

Chaos Engineering

Chaos engineering is a methodology of experimenting on a software system to build confidence in the system’s capability to withstand turbulent environments in production. It is an essential part of the DevOps philosophy, allowing teams to experiment with their system’s behavior in a safe and controlled manner.

This type of baseline engineering allows teams to identify weaknesses in their software architecture, such as potential bottlenecks or single points of failure, and take proactive measures to address them. By injecting faults into the system and measuring the effects, teams gain insights into system behavior that can be used to improve system resilience.

Finally, chaos Engineering teaches you to develop and execute controlled experiments that uncover hidden problems. For instance, you may need to inject system-shaking failures that disrupt system calls, networking, APIs, and Kubernetes-based microservices infrastructures.

Chaos engineering is defined as “the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.” In other words, it’s a software testing method that concentrates on finding evidence of problems before users experience them.
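To show the flavour of a controlled experiment, here is a hedged Python sketch: a decorator that injects latency or an exception into a call with a configurable probability. Real chaos tooling typically injects faults at the infrastructure level, but the principle of deliberately introducing a failure and observing the result is the same; the failure rate and function below are illustrative.

```python
import random
import time
from functools import wraps

def inject_fault(failure_rate: float = 0.1, extra_latency_s: float = 0.5):
    """Wrap a function so that a fraction of calls fail or slow down."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                raise RuntimeError("chaos: injected failure")
            if roll < failure_rate * 2:
                time.sleep(extra_latency_s)  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.2)
def get_balance(account_id: str) -> int:
    return 100  # stand-in for a real service call

if __name__ == "__main__":
    for _ in range(5):
        try:
            print(get_balance("acct-1"))
        except RuntimeError as err:
            print(f"experiment observed: {err}")
```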


 

Network Baselining

Network baselining involves measuring the network’s performance at different times. This includes measuring throughput, latency, and other performance metrics, as well as the network’s configuration. It is important to note that performance metrics can vary greatly depending on the type of network being used. This is why it is essential to establish a baseline for the network to be used as a reference point for comparison.

Network baselining is integral to network management as it allows organizations to identify and address potential issues before they become more serious. Organizations can be alerted to potential problems by baselining the network’s performance. This can help organizations avoid costly downtime and ensure their networks run at peak performance.

Diagram: Network Baselining. Source: DNSstuff.

 

 

The Importance of Network Baselining:

Network baselining provides several benefits for network administrators and organizations:

1. Performance Optimization: Baselining helps identify bottlenecks, inefficiencies, and abnormal behavior within the network infrastructure. Administrators can optimize network resources, improve performance, and ensure a smoother user experience by understanding the baseline.

2. Security Enhancement: Baselining also plays a crucial role in detecting and mitigating security threats. By comparing current network behavior against the established baseline, administrators can identify unusual or malicious activities, such as abnormal traffic patterns or unauthorized access attempts.

3. Capacity Planning: Understanding network baselines enables administrators to forecast future capacity requirements accurately. By analyzing historical data, they can determine when and where network upgrades or expansions may be necessary, ensuring consistent performance as the network grows.

Establishing a Network Baseline:

To establish an accurate network baseline, administrators follow a systematic approach:

1. Data Collection: Network traffic data is collected using specialized monitoring tools like network analyzers or packet sniffers. These tools capture and analyze network packets, providing detailed insights into performance metrics.

2. Duration: Baseline data should ideally be collected over an extended period, typically from a few days to a few weeks. This ensures the baseline accounts for variations due to different network usage patterns.

3. Normalizing Factors: Administrators consider various factors impacting network performance, such as peak usage hours, seasonal variations, and specific application requirements. Normalizing the data can establish a more accurate baseline that reflects typical network behavior.

4. Analysis and Documentation: Once the baseline data is collected, administrators analyze the metrics to identify patterns and trends. This analysis helps establish thresholds for acceptable performance and highlights any deviations that may require attention. Documentation of the baseline and related analysis is crucial for future reference and comparison.
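Building on the baseline values above, a hedged sketch of the threshold step: a live sample is flagged as a deviation when it falls more than a chosen number of standard deviations from the baseline mean. Three standard deviations is a common starting point, but the right multiplier depends on your traffic; the values below are illustrative.

```python
def is_deviation(sample: float, mean: float, stdev: float, k: float = 3.0) -> bool:
    """Flag samples more than k standard deviations from the baseline mean."""
    return abs(sample - mean) > k * stdev

# Example baseline values (could come from the statistics step earlier).
baseline_mean, baseline_stdev = 23.7, 1.9

for latency in (24.1, 25.0, 41.5):
    status = "DEVIATION" if is_deviation(latency, baseline_mean, baseline_stdev) else "ok"
    print(f"{latency:>5.1f} ms -> {status}")
# 41.5 ms is well outside mean + 3*stdev (~29.4 ms) and would trigger an alert
```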

Network Baselining: A Lot Can Go Wrong

Infrastructure is becoming increasingly complex, and let’s face it, a lot can go wrong. It’s imperative to have a global view of all the infrastructure components and a good understanding of the application’s performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and it is hard to validate the health of each piece manually.  

Therefore, monitoring and troubleshooting are much more complex, especially as everything is interconnected, making it difficult for a single person in one team to understand what is happening entirely. Nothing is static anymore; things are moving around all the time. This is why it is even more important to focus on the patterns and to be able to see the path of the issue efficiently.

Some modern applications could be in multiple clouds and different location types simultaneously. As a result, there are numerous data points to consider. If any of these segments are slightly overloaded, the sum of each overloaded segment results in poor performance on the application level. 

What does this mean for latency?

Distributed computing has many components and services, and those components are often far apart. This contrasts with a monolith, where all parts sit in one location. As a result of the distributed nature of modern applications, latency can add up. We have both network latency and application latency, and the network latency is typically several orders of magnitude larger.

As a result, you need to minimize the number of round trips and reduce any unneeded communication to an absolute minimum. When communication across the network is required, it is better to gather data together into bigger packets that are more efficient to transfer. Also, consider using different buffer sizes, both small and large, which will have varying effects on the dropped packet test.
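A rough sketch of the batching idea in Python: instead of sending one request per item and paying a round trip each time, items are buffered and flushed as a single larger payload. The RTT figure and flush size are illustrative, but the arithmetic shows why fewer, larger transfers win.

```python
RTT_MS = 40        # illustrative round-trip time per network call
BATCH_SIZE = 50    # flush once this many items have accumulated

class BatchingSender:
    def __init__(self) -> None:
        self.buffer: list[dict] = []
        self.round_trips = 0

    def send(self, item: dict) -> None:
        self.buffer.append(item)
        if len(self.buffer) >= BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.round_trips += 1   # one RTT covers the whole batch
            self.buffer.clear()

if __name__ == "__main__":
    items = [{"id": i} for i in range(500)]
    sender = BatchingSender()
    for item in items:
        sender.send(item)
    sender.flush()
    print(f"Batched:   {sender.round_trips} round trips, ~{sender.round_trips * RTT_MS} ms")
    print(f"Unbatched: {len(items)} round trips, ~{len(items) * RTT_MS} ms")
```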

Diagram: Dropped Packet Test and Packet Loss.

With the monolith, the application simply runs in a single process, and it is relatively easy to debug. Much traditional tooling and code instrumentation was built on the assumption of a single process. The core challenge is debugging microservices applications, because so much of today’s tooling was built for traditional monolithic applications. There are new monitoring tools for these new applications, but they bring a steep learning curve and a high barrier to entry.

A new approach: Network baselining and Baseline engineering

For this, you need to understand practices like Chaos Engineering, along with service level objectives (SLOs), and how they can improve the reliability of the overall system. Chaos Engineering is a baseline engineering practice that allows tests to be performed in a controlled way. Essentially, we intentionally break things to learn how to build more resilient systems.

So, we inject faults in a controlled way to make the overall application more resilient. Implementing practices like Chaos Engineering will help you understand and manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.

A final note on baselines: Don’t forget them!!

Creating a good baseline is a critical factor. You need to understand how things work under normal circumstances. A baseline is a fixed point of reference used for comparison purposes. You usually need to know how long it takes from application start to a successful login, and how long the essential services take before there are any issues or heavy load. Baselines are critical to monitoring.

It’s like security: if you can’t see it, you can’t protect it. The same assumption applies here. Establish a good baseline and, if you can, automate it fully. Tests need to be carried out against the baseline on an ongoing basis. You need to test constantly to see how long it takes users to use your services. Without baseline data, estimating any changes or demonstrating progress is difficult.

Network baselining is a critical practice for maintaining optimal network performance and security. Administrators can proactively monitor, analyze, and optimize their networks by establishing a baseline. This approach enables them to promptly identify and address performance issues, enhance security measures, and plan for future capacity requirements. Organizations can ensure a reliable and efficient network infrastructure that supports their business objectives by investing time and effort in network baselining.

 

Summary: Baseline Engineering

Maintaining stability and performance is crucial in the fast-paced world of technology, where networks are the backbone of modern communication. This blog post will delve into the art of Network Baseline Engineering, uncovering its significance, methods, and benefits—strap in as we embark on a journey to understand and master this essential aspect of network management.

Section 1: What is Network Baseline Engineering?

Network Baseline Engineering is a process that involves establishing a benchmark or baseline for network performance, allowing for effective monitoring, troubleshooting, and optimization. Administrators can identify patterns, trends, and anomalies by capturing and analyzing network data over a certain period.

Section 2: The Importance of Network Baseline Engineering

A stable network is vital for seamless operations, preventing downtime, and ensuring user satisfaction. Network Baseline Engineering helps understand normal network behavior, crucial for detecting deviations, security threats, and performance issues. It enables proactive measures, reducing the impact of potential disruptions.

Section 3: Establishing a Baseline

Administrators need to consider various factors to create an accurate network baseline. These include defining key performance indicators (KPIs), selecting appropriate tools for data collection, and determining the time frame for capturing network data. Proper planning and execution are essential to ensure data accuracy and reliability.

Section 4: Analyzing and Interpreting Network Data

Once network data is collected, the real work begins. Skilled analysts leverage specialized tools to analyze the data, identify patterns, and establish baseline performance metrics. This step requires expertise in statistical analysis and a deep understanding of network protocols and traffic patterns.

Section 5: Benefits of Network Baseline Engineering

Network Baseline Engineering offers numerous benefits. It enables administrators to promptly detect and resolve performance issues, optimize network resources, and enhance overall network security. Organizations can make informed decisions, plan capacity upgrades, and ensure a smooth user experience by having a clear picture of normal network behavior.

Conclusion:

Network Baseline Engineering is the foundation for maintaining network stability and performance. By establishing a benchmark and continuously monitoring network behavior, organizations can proactively address issues, optimize resources, and enhance overall network security. Embrace the power of Network Baseline Engineering and unlock the full potential of your network infrastructure.

Docker network security

Docker Security Options

Docker Security Options

In the ever-evolving world of containerization, Docker has emerged as a leading platform for deploying and managing applications. As the popularity of Docker continues to grow, so does the importance of securing your containers and protecting your valuable data. In this blog post, we will delve into various Docker security options and strategies to help you fortify your container environment.

Docker brings numerous benefits, but it also introduces unique security challenges. We will explore common Docker security risks such as container breakout, unauthorized access, and image vulnerabilities. By understanding these risks, you can better grasp the significance of implementing robust security measures.

To mitigate potential vulnerabilities, it is crucial to follow Docker security best practices. We will share essential recommendations, including the importance of regularly updating Docker, utilizing strong access controls, and implementing image scanning tools. By adopting these practices, you can significantly enhance the security posture of your Docker environment.

Fortunately, the Docker ecosystem offers a range of security tools to assist in safeguarding your containers. We will delve into popular tools like Docker Security Scanning, Notary, and AppArmor. Each tool serves a specific purpose, whether it's vulnerability detection, image signing, or enforcing container isolation. By leveraging these tools effectively, you can bolster your Docker security framework.

Network security is a critical aspect of any container environment. We will explore Docker networking concepts, including bridge networks, overlay networks, and network segmentation. Additionally, we will discuss the importance of implementing firewalls, network policies, and encryption to protect your containerized applications.

The container runtime plays a crucial role in ensuring the security of your containers. We will examine container runtimes like Docker Engine and containerd, highlighting their security features and best practices for configuration. Understanding these runtime security aspects will empower you to make informed decisions to protect your containers.

Conclusion: Securing your Docker environment is not a one-time task, but an ongoing effort. By understanding the risks, implementing best practices, leveraging security tools, and focusing on network and runtime security, you can mitigate potential vulnerabilities and safeguard your containers effectively. Remember, a proactive approach to Docker security is key in today's ever-evolving threat landscape.

Highlights: Docker Security Options

Containers share the kernel of the Linux host, which boosts their performance and keeps them lightweight. That sharing is also the most significant security risk Linux containers pose: namespaces do not extend to every part of the kernel, which is the main reason for this concern.

Because cgroups and standard namespaces provide some necessary isolation from the host’s core resources, containerized applications are more secure than noncontainerized applications. However, containers should not be used as a replacement for good security practices. You should run all your containers as you would run an application on a production system: if your application would run as a nonprivileged user on a server, the same should apply inside the container.

Docker Attack Surface

So you are currently in the Virtual Machine world and considering transitioning to a containerized environment. You want to smoothen your application pipeline and gain the benefits of a Docker containerized environment. But you have heard from many that the containers are insecure and are concerned about Docker network security. There is a Docker attack surface to be concerned about.

For example, containers run as root by default and have many capabilities that may worry you. Yes, the containerized environment brings many benefits, and for some application stacks containers are the only way to go. However, along with those benefits comes a new attack surface, which forces you to examine Docker security options. The following post will discuss security issues, a container security video to help you get started, and an example of Docker escape techniques.

New Attacks and New Components

Containers are secure by themselves, and the kernel is pretty much battle-tested. A container escape is hard to orchestrate unless a misconfiguration results in excessive privileges. So, even though the bad actors’ intent stays the same, we must mitigate a range of new attacks and protect new components.

To combat these, you need to be aware of the most common Docker network security options and follow the recommended practices for Docker container security. A platform approach is also recommended, and OpenShift is a robust platform for securing and operating your containerized environment.

For pre-information, you may find the following posts helpful: 

  1. OpenShift Security Best Practices
  2. Docker Default Networking 101
  3. What Is BGP Protocol in Networking
  4. Container Based Virtualization
  5. Hands On Kubernetes

 



Docker Security Options


Key Docker Network Security Discussion Points:


  • Docker network security.

  • Docker attack surface.

  • Container security video.

  • Securing Docker containers.

  • Docker escape techniques.

Back to basics with Docker Security Options

Docker Security

To use Docker safely in production and development, you must be aware of potential security issues and the primary tools and techniques for securing container-based systems. Your system’s defenses should also consist of multiple layers.

For example, your containers will most likely run in VMs so that if a container breakout occurs, another level of defense can prevent the attacker from getting to the host or other containers. Monitoring systems should be in place to alert admins in the case of unusual behavior. Finally, firewalls should restrict network access to containers, limiting the external attack surface.

Diagram: Docker container security supply chain.

Container Isolation:

One of the key security features of Docker is container isolation, which ensures that each container runs in its own isolated environment. By utilizing Linux kernel features such as namespaces and cgroups, Docker effectively isolates containers from each other and the host system, mitigating the risk of unauthorized access or interference between containers.

Image Vulnerability Scanning:

To ensure the security of Docker images, it is crucial to scan them for vulnerabilities regularly. Docker Security Scanning is an automated service that helps identify known security issues in your containers’ base images and dependencies. By leveraging this feature, you can proactively address vulnerabilities and apply necessary patches, reducing the risk of potential exploits.

Docker Content Trust:

Docker Content Trust is a security feature that allows you to verify the authenticity and integrity of images you pull from Docker registries. By enabling this feature, Docker ensures that only signed and verified images are used, preventing the execution of untrusted or tampered images. This provides an additional layer of protection against malicious or compromised containers.
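Docker Content Trust is switched on with the DOCKER_CONTENT_TRUST environment variable. The hedged Python sketch below wraps a docker pull with that variable set via subprocess (it assumes the Docker CLI is installed, and the image name is just an example); with trust enabled, an unsigned or tampered image fails the pull instead of running.

```python
import os
import subprocess

def pull_with_content_trust(image: str) -> int:
    """Pull an image with Docker Content Trust enforced; unsigned images are rejected."""
    env = os.environ.copy()
    env["DOCKER_CONTENT_TRUST"] = "1"   # enforce signature verification
    result = subprocess.run(["docker", "pull", image], env=env)
    return result.returncode

if __name__ == "__main__":
    # Example image; replace with your own signed image.
    code = pull_with_content_trust("nginx:latest")
    print("pull succeeded" if code == 0 else "pull rejected or failed")
```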

Role-Based Access Control (RBAC):

Controlling access to Docker resources is critical to maintaining a secure environment. Docker Enterprise Edition (EE) offers Role-Based Access Control (RBAC), which allows you to define granular access controls for users and teams. By assigning appropriate roles and permissions, you can restrict access to sensitive operations and ensure that only authorized individuals can manage Docker resources.

Network Segmentation:

Docker provides various networking options to facilitate communication between containers and the outside world. Implementing network segmentation techniques, such as bridge or overlay networks, helps isolate containers and restrict unnecessary network access. By carefully configuring the network settings, you can minimize the attack surface and protect your containers from potential network-based threats.
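As a hedged illustration of segmentation with a user-defined bridge network, the Python sketch below shells out to the Docker CLI (assumed installed; the network name and the application image are examples): containers attached to the backend network can reach each other by name, while containers on other networks cannot.

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Isolated bridge network for backend components only.
    run(["docker", "network", "create", "--driver", "bridge", "backend-net"])

    # Database is reachable only from containers on backend-net.
    run(["docker", "run", "-d", "--name", "db", "--network", "backend-net", "redis:7"])

    # API container joins the same segment and can resolve "db" by name.
    run(["docker", "run", "-d", "--name", "api", "--network", "backend-net",
         "-p", "8080:8080", "my-api:latest"])  # hypothetical application image
```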

Container Runtime Security:

In addition to securing the container environment, it is equally important to focus on the security of the container runtime. Docker supports different container runtimes, such as Docker Engine and containerd. Regularly updating these runtimes to the latest stable versions ensures that you benefit from the latest security patches and bug fixes, reducing the risk of potential vulnerabilities.

Docker Attack Surface

Often, the tools and appliances in place are entirely blind to containers. The tools look at a running process and think: if the process is secure, then I’m secure. One of my clients ran a container from a Dockerfile and pulled an insecure image. The onsite tools did not know what an image was and could not scan it.

As a result, we had malware right in the network’s core, a little bit too close to the database server for my liking. 

Yes, we call containers a fancy process, and I’m to blame here, too, but we need to consider what is around the container to secure it fully. For a container to function, it needs the support of the infrastructure around it, such as the CI/CD pipeline and supply chain.

To improve your security posture, you must consider all the infrastructures. If you are looking for quick security tips on Docker network security, this course I created for Pluralsight may help you with Docker security options.


Ineffective Traditional Tools

The containers are not like traditional workloads. We can run an entire application with all its dependencies with a single command. The legacy security tools and processes often assume largely static operations and must be adjusted to adapt to the rate of change in containerized environments. With non-cloud-native data centers, Layer 4 is coupled with the network topology at fixed network points and lacks the flexibility to support containerized applications.

There is often only inter-zone filtering, and east-west traffic may go unchecked. A container changes the perimeter, moving it right to the workload. Just look at a microservices architecture: it has many more entry points than a monolithic application.


Docker container networking

When considering container networking, we are a world apart from the monolith. Containers are short-lived and constantly spun down, and assets such as servers, IP addresses, firewalls, drives, and overlay networks are recycled to optimize utilization and enhance agility. Traditional perimeters designed around IP address-based security controls lag behind in a containerized environment.

Rapidly changing container infrastructure rules and signature-based controls can’t keep up with a containerized environment. Securing hyper-dynamic container infrastructure using traditional networks ​​and endpoint controls won’t work. For this reason, you should adopt purpose-built tools and techniques for a containerized environment.

The Need for Observability

Not only do you need to implement good Docker security options, but you also need to concern yourself with the recent observability tools. So, we need proper observability of the state of security and the practices used in the containerization environment, and we need to automate this as much as possible—not just the development but also the security testing, container scanning, and monitoring.

You are only as secure as the containers you have running. You need observability across your systems and applications, and you need to act proactively on those findings. It is not something you can buy; it is a cultural change. You want to know how the application interacts with the server, how the network behaves with the application, and what data transfer looks like in transit and in a steady state.

Data points to observe on a single platform: logs, metrics, and traces.
What level of observation do you need so you know that everything is performing as it should? There are several challenges to securing a containerized environment. Containerized technologies are dynamic and complex and require a new approach that can handle the agility and scale of today’s landscape. There are initial security concerns that you must understand before you get started with container security. This will help you explore a better starting strategy.

Docker attack surface: Container attack vectors 

We must consider a different threat model and understand how security principles such as least privilege and in-depth defense apply to Docker security options. With Docker containers, we have a completely different way of running applications and, as a result, a different set of risks to deal with.

Instructions are built into Dockerfiles, which run applications differently from a normal workload. With the right access, a bad actor could put anything in a Dockerfile; without guard rails that understand containers, that is a real threat.

Therefore, we must examine new network and security models, as old tools and methods won’t meet these demands. A new network and security model requires you to mitigate a new set of attack vectors. Bad actors’ intent stays the same, and they are not going away anytime soon, but a misconfigured container environment gives them a different and potentially easier attack surface.

I would consider the container attack surface pretty significant; if not locked down, bad actors will have many default tools at their disposal. For example, we have image vulnerabilities, access control exploits, container escapes, privilege escalation, application code exploits, attacks on the docker host, and all the docker components.

Docker security options: A final security note

Containers by themselves are secure, and the kernel is pretty much battle-tested. You will not often encounter kernel compromises, but they happen occasionally. A container escape is hard to orchestrate unless a misconfiguration results in excessive privileges. From a security standpoint, you should steer clear of granting container capabilities that provide excessive privileges.

Minimise container capabilities: Reduce the attack surface.

If you minimize the container’s capabilities, you are stripping down the container’s functionality to a bare minimum. And we mentioned this in the container security video. Therefore, the attack surface is limited, and the attack vector available to the attacker is minimized. 

You also want to keep an eye on CAP_SYS_ADMIN. This flag grants access to an extensive range of privileged activities. Containers run with many other capabilities by default that can cause havoc.
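A hedged example of stripping capabilities with the Docker CLI, wrapped in Python for consistency with the other sketches: everything is dropped with --cap-drop ALL, then only what the workload genuinely needs is added back, and CAP_SYS_ADMIN is notably absent. The image name is hypothetical, and the exact set of capabilities a real image needs will vary.

```python
import subprocess

# Drop every capability, then add back only what this workload genuinely needs.
# The image name is hypothetical; the required capability set depends on the image.
cmd = [
    "docker", "run", "-d", "--name", "web",
    "--cap-drop", "ALL",                 # start from zero capabilities
    "--cap-add", "NET_BIND_SERVICE",     # e.g. allow binding to port 80
    "my-web:latest",                     # hypothetical, purpose-built image
]

subprocess.run(cmd, check=True)
```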

As Docker continues to gain popularity, understanding and implementing proper security measures is essential to safeguarding your containers and infrastructure. By leveraging the security options discussed in this blog post, you can mitigate risks, protect against potential threats, and ensure the integrity and confidentiality of your applications. Stay vigilant, stay secure, and embrace the power of Docker while keeping your containers safe.

 

Summary: Docker Security Options

With the growing popularity of containerization, Docker has become a leading platform for deploying and managing applications. However, as with any technology, security should be a top priority. In this blog post, we delved into various Docker security options that can help you safeguard your containers and ensure the integrity of your applications.

Section 1: Understanding Docker Security

Before we discuss the specific security options, let’s establish a foundational understanding of Docker security. We’ll explore the concept of container isolation, Docker vulnerabilities, and potential risks associated with containerized environments.

Section 2: Docker Security Best Practices

It’s crucial to follow Docker security best practices to mitigate security risks. This section will outline critical recommendations, including limiting container privileges, using secure base images, and implementing container scanning and vulnerability assessment tools.

Section 3: Docker Content Trust

Docker Content Trust, also known as Docker Notary, is a security feature that ensures the authenticity and integrity of Docker images. We’ll explore how it works, how to enable it, and the benefits it provides in preventing image tampering and unauthorized modifications.

Section 4: Docker Network Security

Securing Docker networks is essential to protect against unauthorized access and potential attacks. In this section, we’ll discuss network segmentation, Docker network security models, and techniques such as network policies and firewalls to enhance the security of your containerized applications.

Section 5: Container Runtime Security

The container runtime plays a critical role in Docker security. We’ll examine different container runtimes, such as Docker Engine and containerd, and explore features like seccomp, AppArmor, and SELinux that can help enforce fine-grained security policies and restrict container capabilities.

Conclusion:

In this blog post, we have explored various Docker security options that can empower you to protect your containers and fortify your applications against potential threats. By understanding Docker security fundamentals, following best practices, leveraging Docker Content Trust, securing Docker networks, and utilizing container runtime security features, you can enhance the overall security posture of your containerized environment. As you continue your journey with Docker, remember to prioritize security and stay vigilant in adopting the latest security measures to safeguard your valuable assets.