WAN Design Requirements

DMVPN

Cisco DMVPN builds on the virtual private network (VPN) concept, which provides private connectivity over a public network such as the Internet. DMVPN takes this concept further by allowing multiple VPN tunnels to be deployed over a shared infrastructure in a manageable and scalable way.

This shared infrastructure, or “DMVPN network,” enables each VPN to connect to the other VPNs without needing expensive dedicated connections or complex configurations.

DMVPN Explained: DMVPN creates a virtual network built on top of the existing infrastructure. This virtual network consists of “tunnels” between various endpoints, such as corporate networks, branch offices, or remote users, and it allows secure communication between these endpoints regardless of their geographic location. Because these tunnels ride on an underlying transport (the underlay), DMVPN is an overlay solution.

Hub Router: The hub router serves as the central point of connectivity in a DMVPN deployment, enabling secure communication between all the spoke routers. The hub router is responsible for managing the dynamic IPsec tunnels and facilitating efficient routing.

Spoke Routers: Spoke routers are the remote endpoints in a DMVPN network. They establish IPsec tunnels with the hub router to securely transmit data. Spoke routers are typically located in branch offices or connected to remote workers' devices. They dynamically establish tunnels based on network requirements, ensuring optimal routing.

Next-Hop Resolution Protocol (NHRP): NHRP is a critical component of DMVPN that aids in dynamic IPsec tunnel establishment. It assists spoke routers in resolving the next-hop addresses for establishing tunnels with other spoke routers or the hub router. NHRP maintains a mapping database that allows efficient routing and simplifies network configuration.

Scalability: DMVPN offers excellent scalability, making it suitable for organizations with expanding networks. As new branch offices or remote workers join the network, DMVPN dynamically establishes tunnels without the need for manual configuration. This scalability eliminates the complexities associated with traditional point-to-point VPN solutions.

Cost Efficiency: By utilizing DMVPN, organizations can leverage affordable public network infrastructures instead of costly dedicated connections. DMVPN makes efficient use of bandwidth, reducing operational costs while providing secure and reliable connectivity.

Flexibility: DMVPN provides flexibility in terms of network design and management. It supports different routing protocols, allowing seamless integration with existing network infrastructure. Additionally, DMVPN supports various transport technologies, including MPLS, broadband, and cellular, enabling organizations to choose the most suitable option for their needs.

Highlights: DMVPN

VPN-based security solutions

VPN-based security solutions are increasingly popular and have proven to be an effective and secure technology for protecting sensitive data traversing insecure transport networks, such as the Internet.

Traditional IPsec-based site-to-site and hub-to-spoke VPN deployment models do not scale well and are adequate only for small- and medium-sized networks. As demand for IPsec-based VPN implementation grows, organizations with large-scale enterprise networks require scalable and dynamic IPsec solutions that interconnect sites across the Internet with reduced latency while optimizing network performance and bandwidth utilization.

Scaling traditional IPsec VPN

Dynamic Multipoint VPN (DMVPN) technology scales IPsec VPN networks by offering a large-scale deployment model that allows the network to expand and realize its full potential. In addition, DMVPN offers scalability that enables zero-touch deployment models.

Diagram: IPsec Tunnel

Encryption is supported through IPsec, making DMVPN a popular choice for connecting different sites using regular Internet connections. It’s a great backup or alternative to private networks like MPLS VPN. A related Cisco solution is FlexVPN, which is often considered an alternative to DMVPN.

Routing Technique

DMVPN (Dynamic Multipoint VPN) is a routing technique for building a VPN network with multiple sites without configuring all devices statically. It’s a “hub and spoke” network in which the spokes can communicate directly without going through the hub.

Advanced:

DMVPN Phase 2

DMVPN Phase 2 is an enhanced version of the initial DMVPN implementation. Its key addition is support for direct spoke-to-spoke tunnels: the spokes use multipoint GRE interfaces and rely on the Next Hop Resolution Protocol (NHRP) to dynamically map the participating devices’ virtual (tunnel) IP addresses to their physical addresses. This dynamic mapping allows for efficient and scalable communication within the DMVPN network.

Resolutions triggered by the NHRP

Learning the required mapping information through NHRP resolution is what creates a dynamic spoke-to-spoke tunnel. How does a spoke know how to perform such a task? Spoke-to-spoke tunnels were first introduced in Phase 2 as an enhancement to DMVPN Phase 1. Phase 2 handed responsibility for NHRP resolution requests to each spoke individually, which means that a spoke initiated an NHRP resolution request when it determined a packet needed a spoke-to-spoke tunnel.

Cisco Express Forwarding (CEF) would assist the spoke in making this decision based on information contained in its routing table.

Related: For pre-information, you may find the following posts helpful.

  1. VPN Overview
  2. Dynamic Workload Scaling
  3. DMVPN Phases
  4. IPSec Fault Tolerance
  5. Dead Peer Detection
  6. Network Overlays
  7. IDS IPS Azure
  8. SD WAN SASE
  9. Network Traffic Engineering

DMVPN

Cisco DMVPN

♦ DMVPN Components Involved

The DMVPN solution consists of a combination of existing technologies so that sites can learn about each other and create dynamic VPNs. Therefore, efficiently designing and implementing a Cisco DMVPN network requires thoroughly understanding these components, their interactions, and how they all come together to create a DMVPN network.

These technologies may seem complex, and this post aims to simplify them. First, we mentioned that DMVPN has different components, which are the building blocks of a DMVPN network. These include Generic Routing Encapsulation (GRE), the Next Hop Resolution Protocol (NHRP), and IPsec.

The Dynamic Multipoint VPN (DMVPN) feature allows users to better scale large and small IP Security (IPsec) Virtual Private Networks (VPNs) by combining generic routing encapsulation (GRE) tunnels, IPsec encryption, and Next Hop Resolution Protocol (NHRP).

Each of these components needs a base configuration for DMVPN to work. Once the base configuration is in place, we have a variety of show and debug commands to troubleshoot a DMVPN network to ensure smooth operations.
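
As a quick reference, the following show commands map to those building blocks. This is a minimal, generic checklist rather than a complete troubleshooting workflow, and the output will depend on your own topology.

! NHRP registrations and the state of each tunnel entry
show dmvpn
show ip nhrp
! IKE and IPsec security associations (only if IPsec protection is applied)
show crypto isakmp sa
show crypto ipsec sa
! Routes learned from the routing protocol running over the tunnel
show ip route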

There are four pieces to DMVPN:

  • Multipoint GRE (mGRE)
  • NHRP (Next Hop Resolution Protocol)
  • Routing (RIP, EIGRP, OSPF, BGP, etc.)
  • IPsec (not required but recommended)

Cisco DMVPN Components

Main DMVPN Components

Dynamic VPN

  • Multipoint GRE and Point to Point GRE

  • NHRP (Next Hop Resolution Protocol)

  • Routing (RIP, EIGRP, OSPF, BGP, etc.)

  • IPsec (not required but recommended)

1st Lab Guide: Displaying the DMVPN configuration

DMVPN Network

The following screenshot is from a DMVPN network built in Cisco Modeling Labs. We have R1 as the hub and R2 and R3 as the spokes. The command show dmvpn displays that we have two spoke routers. Notice the “D” attribute: it means the spokes have been learned dynamically, which is the essence of DMVPN.

The spokes are learned via the Next Hop Resolution Protocol (NHRP). Because this is a nonbroadcast multiaccess (NBMA) network, we must use a protocol other than the Address Resolution Protocol (ARP).

Note:

  1. As you can see in the tunnel configuration of one of the spokes, we have a static mapping for the hub with the command ip nhrp nhs 192.168.100.1. We also have point-to-point GRE tunnels on the spokes with the command tunnel destination 172.17.11.2.
  2. Therefore, we are running DMVPN Phase 1. The later phases (2 and 3) use mGRE on the spokes. More on this later. (A spoke configuration sketch follows the diagram below.)
Diagram: DMVPN Configuration.
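
For context, a minimal Phase 1 spoke tunnel along the lines of this lab might look like the sketch below. The hub addresses (192.168.100.1 on the tunnel, 172.17.11.2 as the NBMA address) come from the lab; the spoke tunnel address, the NHRP network-id, and the interface name are assumptions for illustration only.

! Phase 1 spoke sketch - addresses, interface names, and network-id are assumed placeholders
interface Tunnel0
 ip address 192.168.100.2 255.255.255.0
 ip nhrp network-id 1
 ip nhrp map 192.168.100.1 172.17.11.2
 ip nhrp nhs 192.168.100.1
 tunnel source GigabitEthernet0/0
 ! a static tunnel destination means point-to-point GRE, i.e., Phase 1
 tunnel destination 172.17.11.2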

Key DMVPN components include:

●   Multipoint GRE (mGRE) tunnel interface: Allows a single GRE interface to support multiple IPsec tunnels, simplifying the size and complexity of the configuration. Standard point-to-point GRE tunnels are used in the earlier versions or phases of DMVPN.

●   Dynamic discovery of IPsec tunnel endpoints and crypto profiles: Eliminates the need to configure static crypto maps defining every pair of IPsec peers, further simplifying the configuration.

●   NHRP: Allows spokes to be deployed with dynamically assigned public IP addresses (i.e., behind an ISP’s router). The hub maintains an NHRP database of the public interface addresses of each spoke. Each spoke registers its actual address when it boots; when it needs to build direct tunnels with other spokes, it queries the NHRP database for the real addresses of the destination spokes.

Diagram: DMVPN explained. Source is TechTarget

DMVPN Explained

Overlay Networking

A Cisco DMVPN network consists of one or more overlay virtual networks. Such a virtual network is called an overlay network because it depends on an underlying transport called the underlay network. The underlay network forwards the traffic flowing through the overlay network. With a protocol analyzer, you can observe the overlay traffic crossing the underlay; however, left to its defaults, the underlay simply forwards the encapsulated packets and has no real visibility into the overlay network.

We will have routers at the company’s sites acting as the endpoints of the tunnels that form the overlay network; for example, a WAN edge router running Cisco IOS or IOS-XE configured for DMVPN. The underlay, which is likely out of your control, consists of an array of service provider equipment such as routers, switches, firewalls, and load balancers.

The following diagram displays the different overlay solutions. VXLAN is common in the data center, while GRE is used across the WAN. DMVPN uses GRE.

Diagram: Virtual overlay solutions.

2nd Lab Guide: VXLAN overlay

Overlay Networking

While DMVPN does not use VXLAN as its overlay protocol, it is helpful to look at VXLAN for background and reference. VXLAN is a network overlay technology that provides a scalable and flexible solution for creating virtualized networks.

It enables the creation of logical Layer 2 networks over an existing Layer 3 infrastructure, allowing organizations to extend their networks across data centers and virtualized environments. In the following example, we create a Layer 2 overlay over a Layer 3 core. A significant difference between DMVPN’s use of GRE as the overlay and the use of VXLAN is the VNI.

Note:

  1. One critical component of VXLAN is the Virtual Network Identifier (VNI). In this blog post, we will explore the details of VXLAN VNI and its significance in modern network architectures.
  2. VNI is a 24-bit identifier that uniquely identifies a VXLAN network. It allows multiple VXLAN networks to coexist over the same physical network infrastructure. Each VNI represents a separate Layer 2 network, enabling the isolation and segmentation of traffic between different virtual networks.

Below, you can see the VNI used and the peers that have been created. VXLAN also works in multicast mode.

Diagram: Overlay Networking with VXLAN

DMVPN Overlay Networking

Creating an overlay network


  • To create an overlay network, one needs a tunneling technique, such as multipoint GRE or point-to-point GRE.

  • The GRE tunnel is the most widely used option for external connectivity.

  • VXLAN is typically used internally within the data center.

  • A GRE tunnel supports tunneling over an IP-based network and works by inserting an IP and GRE header on top of the original protocol packet.

DMVPN: Creating an overlay network

The overlay network does not magically appear. To create one, we need a tunneling technique. Many tunneling technologies can be used to form the overlay network. The Generic Routing Encapsulation (GRE) tunnel is the most widely used for external connectivity, while VXLAN is used for connectivity internal to the data center.

GRE is the tunneling technique that DMVPN adopts. A GRE tunnel can support tunneling for various protocols over an IP-based network. It works by inserting an IP and GRE header on top of the original protocol packet, creating a new GRE/IP packet.

GRE over IPsec

The resulting GRE/IP packet uses a source/destination pair routable over the underlying infrastructure. The GRE/IP header is the outer header, and the original protocol header is the inner header.

♦ Is GRE over IPsec a tunneling protocol? 

GRE is a tunneling protocol that can transport multicast, broadcast, and non-IP packets such as IPX. IPsec is an encryption protocol that can only transport unicast packets, not multicast or broadcast. Hence, we wrap the traffic in GRE first and then encrypt it with IPsec, which is called GRE over IPsec.

3rd Lab Guide: Displaying the DMVPN configuration

DMVPN Configuration

We are using a different DMVPN lab setup than before. R11 is the hub of the DMVPN network, and we have only one spoke, R12. In the DMVPN configuration, the tunnel interface shows “Encapsulation TUNNEL”; this is the overlay network, and we are using GRE.

We are currently using standard point-to-point GRE rather than multipoint GRE. We know this because we have explicitly set the tunnel destination with the command tunnel destination 172.16.31.2. This is fine for a small network with a few spokes; however, for larger networks we need to use mGRE to take full advantage of the dynamic nature of DMVPN.

Note:

  1. As for the routing protocols, we run EIGRP over the tunnel (GRE) interface. We only have one EIGRP neighbor, so we don’t need to worry about split horizon. Before we move on, one key point is that running a traceroute from R11 to R12 shows only one hop (an EIGRP sketch follows the diagram below).
  2. This is because the original packet’s TTL is carried inside the GRE tunnel. So, no matter how many physical or virtual devices sit in the path (the underlay network) between R11 and R12, it will always show as one hop because of the overlay network, i.e., GRE.
Diagram: DMVPN Configuration
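
As a rough sketch of the routing piece in this lab, classic-mode EIGRP simply runs over the tunnel subnet and whatever LAN sits behind each router. The AS number and the prefixes below are assumptions for illustration; with a single spoke, as noted above, split horizon is not a concern, but the hub-side command for multi-spoke designs is shown as a comment.

! Classic-mode EIGRP over the tunnel - AS number and prefixes are assumed placeholders
router eigrp 100
 network 192.168.100.0 0.0.0.255
 network 10.1.1.0 0.0.0.255
!
! On a hub serving several spokes, split horizon is disabled on the tunnel interface:
! interface Tunnel0
!  no ip split-horizon eigrp 100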

Multipoint GRE. What is mGRE? 

An alternative to configuring multiple point-to-point GRE tunnels is to use multipoint GRE tunnels to provide the desired connectivity. Multipoint GRE (mGRE) tunnels are constructed like point-to-point GRE tunnels except for the tunnel destination command: instead of declaring a static destination, no destination is declared at all, and the tunnel mode gre multipoint command is issued.
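
A tunnel interface built this way might look like the following sketch, for example on the hub. The addresses, interface name, and NHRP network-id are placeholders rather than values from this article.

! mGRE tunnel sketch - addresses, interface names, and network-id are assumed placeholders
interface Tunnel0
 ip address 192.168.100.1 255.255.255.0
 ip nhrp network-id 1
 tunnel source GigabitEthernet0/0
 ! no "tunnel destination" is configured; multipoint mode replaces it
 tunnel mode gre multipoint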

How does one remote site know what destination to set for the GRE/IP packet created by the tunnel interface? The easy answer is that it can’t on its own. The site can only glean the destination address with the help of an additional protocol. The next component used to create a DMVPN network is the Next Hop Resolution Protocol (NHRP). 

Essentially, mGRE features a single GRE interface on each router that allows multiple destinations. This one interface can carry multiple IPsec tunnels and reduces the overall size of the DMVPN configuration. However, if two branch routers need to tunnel traffic directly, mGRE on its own does not know which IP addresses to use.

The Next Hop Resolution Protocol (NHRP) is used to solve this issue. The following diagram depicts the functionality of mGRE in DMVPN technology.

Diagram: What is mGRE? Source is Stucknactive

Next Hop Resolution Protocol (NHRP)

The Next Hop Resolution Protocol (NHRP) is a networking protocol designed to facilitate efficient and reliable communication between two nodes on a network. It does this by providing a way for one node to discover the IP address of another node on the same network.

The primary role of NHRP is to allow a node to resolve the address of another node that it cannot reach directly over the underlying NBMA network. This is done by querying an NHRP server, which maintains a mapping of the nodes on the network. When a node sends a request to the NHRP server, the server returns the address of the destination node.

NHRP was initially designed to allow routers connected to non-broadcast multiple-access (NBMA) networks to discover the proper next-hop mappings to communicate. It is specified in RFC 2332. NBMA networks faced a similar issue as mGRE tunnels. 

Diagram: Cisco DMVPN and NHRP. The source is network direction.

NHRP allows spokes to be deployed with dynamically assigned IP addresses and still be reachable from the central DMVPN hub. One branch router needs this protocol to find the public IP address of another branch router. NHRP uses a “server-client” model, where one router functions as the NHRP server while the other routers are the NHRP clients. In the multipoint GRE/DMVPN topology, the hub router is the NHRP server, and all other routers are the spokes.

Each client registers with the server and reports its public IP address, which the server tracks in its cache. Then, through a process that involves registration and resolution requests from the client routers and resolution replies from the server router, traffic is enabled between various routers in the DMVPN.

4th Lab Guide: Displaying the DMVPN configuration

NHC and NHS Design

The following DMVPN configuration shows a couple of new topics. DMVPN works with an NHS and NHC design; the hub is the NHS (Next Hop Server). You can see this explicitly configured on the spokes, and this configuration needs to be on the spokes rather than the hubs.

The hub configuration is meant to be more dynamic. Also, recall that we are running EIGRP over the GRE tunnel. There are two important points here: firstly, we must consider split horizon because we have two spokes; secondly, we need to use the “multicast” keyword.

Note:

  1. This is because EIGRP uses multicast HELLO messages to form neighbor relationships. If we were running BGP over the tunnel interface instead of EIGRP, we would not need the multicast keyword, as BGP does not use multicast. The full command is: ip nhrp nhs 192.168.100.11 nbma 172.16.11.1 multicast (a configuration sketch follows below).
  2. On the spoke, we are telling the router that R11 is the NHS, to map its tunnel interface address of 192.168.100.11 to the NBMA address 172.16.11.1, and to allow multicast traffic to it; in other words, we are creating a multicast mapping entry.

Diagram: DMVPN Configuration
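
Pulling the note together, the NHS-related lines on a spoke tunnel interface might look like the sketch below. The 192.168.100.11/172.16.11.1 pair comes from the note; the interface name and network-id are assumptions, and the older split syntax is shown in comments for comparison.

! Spoke NHS sketch - interface name and network-id are assumed placeholders
interface Tunnel0
 ip nhrp network-id 1
 ip nhrp nhs 192.168.100.11 nbma 172.16.11.1 multicast
!
! Equivalent older, split syntax:
! ip nhrp map 192.168.100.11 172.16.11.1
! ip nhrp map multicast 172.16.11.1
! ip nhrp nhs 192.168.100.11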

5th Lab Guide: DMVPN over IPsec

Securing DMVPN

In the following screenshot, we have DMVPN operating over IPsec. I have connected the hub and two spokes to an unmanaged switch to simulate the WAN environment. Neither the WAN nor DMVPN provides encryption by default. However, since you will probably use DMVPN with the Internet as the underlying network, it is wise to encrypt your tunnels.

In this network, we are running RIPv2 as the routing protocol. Remember that you must turn off split horizon at the hub site. IPsec has its own phases 1 and 2 (don’t confuse them with the DMVPN phases). First, we need an ISAKMP policy that matches on all our routers. Then, for phase 2, we require a transform set on each router that tells the router what encryption/hashing to use and whether we want tunnel or transport mode.

Note:

  1. For this configuration, I used ESP with AES as the encryption algorithm and SHA for hashing. The mode is important: since GRE already provides the tunnel, we can use transport mode. Using tunnel mode would add even more overhead, which is unnecessary (a crypto configuration sketch follows the diagram below).
  2. The primary test here was to run a ping between the spokes. When the ping works, behind the scenes our two spoke routers establish an IPsec tunnel. You can see the security association below:
Diagram: DMVPN over IPsec
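
For reference, a minimal crypto sketch matching the note (ESP with AES, SHA hashing, transport mode) is shown below. The pre-shared key, the any-peer wildcard, and the transform-set and profile names are placeholders, not values from the lab.

! ISAKMP (IKE phase 1) policy - must match on all routers; key and names are placeholders
crypto isakmp policy 10
 encryption aes
 hash sha
 authentication pre-share
 group 2
crypto isakmp key DMVPN_KEY address 0.0.0.0 0.0.0.0
!
! IPsec phase 2: ESP with AES and SHA, transport mode because GRE already provides the tunnel
crypto ipsec transform-set TS esp-aes esp-sha-hmac
 mode transport
!
crypto ipsec profile DMVPN_PROFILE
 set transform-set TS
!
! Apply the protection to the GRE/mGRE tunnel interface
interface Tunnel0
 tunnel protection ipsec profile DMVPN_PROFILE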

IPsec Tunnels

An IPsec tunnel is a secure connection between two or more devices over an untrusted network using a set of cryptographic security protocols. The most common type of IPsec tunnel is the site-to-site tunnel, which connects two sites or networks. It allows two remote sites to communicate securely and exchange traffic between them. Another type of IPsec tunnel is the remote-access tunnel, which allows a remote user to connect to the corporate network securely.

When setting up an IPsec tunnel, several parameters, such as authentication method, encryption algorithm, and tunnel mode, must be configured. Depending on the organization’s needs, additional security protocols, such as Internet Key Exchange (IKE), can also be used for further authentication and encryption.

Diagram: IPsec VPN. Source Wikimedia.

IPsec Tunnel Endpoint Discovery 

Tunnel Endpoint Discovery (TED) allows routers to discover IPsec endpoints automatically, so static crypto maps do not need to be configured between individual IPsec tunnel endpoints. In addition, TED allows endpoints or peers to dynamically and proactively initiate the negotiation of IPsec tunnels to discover unknown peers.

These remote peers do not need to have TED configured to be discovered by inbound TED probes. In other words, VPN devices that receive TED probes on interfaces that are not configured for TED can still negotiate a dynamically initiated tunnel using TED.

DMVPN Checkpoint 

Main DMVPN Points To Consider

  • Dynamic Multipoint VPN (DMVPN) technology is used for scaling IPsec VPN networks.

  • The DMVPN solution consists of a combination of existing technologies.

  • The overlay network does not magically appear. To create an overlay network, we need a tunneling technique. 

  • Once the virtual tunnel is fully functional, the routers need a way to direct traffic through their tunnels. Dynamic routing protocols are excellent choices for this.

  • mGRE features a single GRE interface on each router with the possibility of multiple destinations.

  • The Next Hop Resolution Protocol (NHRP) is a networking protocol designed to facilitate efficient and reliable communication between two nodes on a network.

  • An IPsec tunnel is a secure connection between two or more devices over an untrusted network using a set of cryptographic security protocols. DMVPN is not secure by default.

Continue Reading


DMVPN and Routing protocols 

Routing protocols enable the DMVPN to find routes between different endpoints efficiently and effectively. Therefore, choosing the right routing protocol is essential to building a scalable and stable DMVPN. One option is to use Open Shortest Path First (OSPF) as the interior routing protocol. However, OSPF is best suited for small-scale DMVPN deployments. 

The Enhanced Interior Gateway Routing Protocol (EIGRP) or Border Gateway Protocol (BGP) is more suitable for large-scale implementations. EIGRP is not restricted by the topology limitations of a link-state protocol and is easier to deploy and scale in a DMVPN topology. BGP can scale to many peers and routes, and it puts less strain on the routers compared to other routing protocols.

DMVPN supports various routing protocols that enable efficient communication between network devices. In this section, we will explore three popular DMVPN routing protocols: Enhanced Interior Gateway Routing Protocol (EIGRP), Open Shortest Path First (OSPF), and Border Gateway Protocol (BGP). We will examine their characteristics, advantages, and use cases, allowing network administrators to make informed decisions when implementing DMVPN.

EIGRP: The Dynamic Routing Powerhouse

EIGRP is a distance vector routing protocol widely used in DMVPN deployments. This section will provide an in-depth look at EIGRP, discussing its features such as fast convergence, load balancing, and scalability. Furthermore, we will highlight best practices for configuring EIGRP in a DMVPN environment, optimizing network performance and reliability.

OSPF: Scalable and Flexible Routing

OSPF is a link-state routing protocol that offers excellent scalability and flexibility in DMVPN networks. This section will explore OSPF’s key attributes, including its hierarchical design, area types, and route summarization capabilities. We will also discuss considerations for deploying OSPF in a DMVPN environment, ensuring seamless connectivity and effective network management.

BGP: Extending DMVPN to the Internet

BGP, a path vector routing protocol, connects DMVPN networks to the global Internet. This section will focus on BGP’s unique characteristics, such as its autonomous system (AS) concept, policy-based routing, and route reflectors. We will also address the challenges and best practices of integrating BGP into DMVPN architectures.

6th Lab Guide: DMVPN Phase 1 and OSPF

DMVPN Routing

OSPF is not the best solution for DMVPN. Because it’s a link-state protocol, each spoke router must have the complete LSDB for the DMVPN area. Since we use a single subnet on the multipoint GRE interfaces, all spoke routers must be in the same area.

This is no problem with a few routers, but it doesn’t scale well when you have dozens or hundreds. Most spoke routers are probably low-end devices at branch offices that don’t cope well with all the LSA flooding that OSPF might do within the area. One way to reduce the number of prefixes in the DMVPN network is to use a stub or totally stubby area.
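
If the DMVPN routers sit in their own dedicated area, making that area a stub (or totally stubby) area is a small change per router; the process ID and area number below are assumptions for illustration.

! Stub area sketch - process ID and area number are assumed placeholders
router ospf 1
 area 1 stub
!
! Totally stubby variant, where the no-summary keyword is added on the ABR:
! router ospf 1
!  area 1 stub no-summary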

Note:

  1. The example below shows we are running OSPF between the hub and the two spokes. The OSPF network type can be viewed on the hub along with the status of DMVPN. Take note of the next hop on the spoke router when I run show ip route ospf.
  2. Each router has learned the networks on the different loopback interfaces. The next hop value is preserved when we use the broadcast network type.

You have now seen how OSPF broadcast network types behave on DMVPN Phase 1. As you can see, there are quite a few entries in the routing tables. All traffic goes through the hub, so our spoke routers don’t need to know everything. Unfortunately, it’s impossible to summarize within the area. However, we can reduce the number of routes by changing the DMVPN area into a stub or totally stubby area.

Unlike the broadcast network type, point-to-point and point-to-multipoint network types do not preserve the spokes’ next-hop IP addresses.

7th Lab Guide: DMVPN Phase 2 with OSPF

DMVPN Routing

In the following example, we have DMVPN Phase 2 running with OSPF. We are using the broadcast network type. However, the following OSPF network types are all potential options for DMVPN Phase 2.

  • point-to-point
  • broadcast
  • non-broadcast
  • point-to-multipoint
  • point-to-multipoint non-broadcast

Below, all routers have learned the networks on each other’s loopback interfaces. Look closely at the next hop IP addresses for the 2.2.2.2/32 and 3.3.3.3/32 entries. This looks good; these are the IP addresses of the spoke routers. You can also see that 1.1.1.1/32 is an inter-area route. This is good; we can summarize networks “behind” the hub towards the spoke routers if we want to.

When running OSPF for DMVPN phase 2, you only have two choices if you want direct spoke-to-spoke communication: broadcast and non-broadcast. Let me give you an overview:

  • Point-to-point: This will not work since we use multipoint GRE interfaces.
  • Broadcast: This network type is your best choice, and it is what we are using in the example above. It provides automatic neighbor discovery and preserves the correct next-hop addresses. Make sure that the spoke routers can’t become DR or BDR; here, we can use the priority command and set it to 0 on the spokes (see the configuration sketch after this list).
  • Non-broadcast: similar to broadcast, but you have to configure static neighbors.
  • Point-to-multipoint: Don’t use this for DMVPN phase 2 since the hub changes the next hop address; you won’t have direct spoke-to-spoke communication.
  • Point-to-multipoint non-broadcast: same story as point-to-multipoint, but you must also configure static neighbors.
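
Assuming the broadcast network type is chosen, the relevant per-interface commands are sketched below; the interface name is a placeholder, and the hub priority value is simply any non-zero number.

! Hub tunnel interface - allowed to win the DR election (priority value is an assumption)
interface Tunnel0
 ip ospf network broadcast
 ip ospf priority 255
!
! Spoke tunnel interfaces - priority 0 so they can never become DR/BDR
interface Tunnel0
 ip ospf network broadcast
 ip ospf priority 0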

DMVPN Deployment Scenarios: 

Cisco DMVPN can be deployed in two ways:

  1. Hub-and-spoke deployment model
  2. Spoke-to-spoke deployment model

Hub-and-spoke deployment model: In this traditional topology, remote sites, which are the spokes, are aggregated into a headend VPN device. The headend VPN location would be at the corporate headquarters, known as the hub. 

Traffic from any remote site to other remote sites would need to pass through the headend device. Cisco DMVPN supports dynamic routing, QoS, and IP Multicast while significantly reducing the configuration effort. 

Spoke-to-spoke deployment model: Cisco DMVPN allows the creation of a full-mesh VPN, in which traditional hub-and-spoke connectivity is supplemented by dynamically created IPsec tunnels directly between the spokes. 

With direct spoke-to-spoke tunnels, traffic between remote sites does not need to traverse the hub; this eliminates additional delays and conserves WAN bandwidth while improving performance. 

Spoke-to-spoke capability is supported in a single-hub or multi-hub environment. Multihub deployments provide increased spoke-to-spoke resiliency and redundancy.  

DMVPN Designs

The word phase is almost always connected to discussions on DMVPN design. DMVPN phase refers to the version of DMVPN implemented in a DMVPN design. As mentioned above, we can have two deployment models, each of which can be mapped to a DMVPN Phase.

Cisco DMVPN was rolled out in different stages as the solution became more widely adopted, addressing performance issues and adding features along the way. There are three main phases for DMVPN:

  • Phase 1 – Hub-and-spoke
  • Phase 2 – Spoke-initiated spoke-to-spoke tunnels
  • Phase 3 – Hub-initiated spoke-to-spoke tunnels
Diagram: What is DMVPN? Source is Lira

The differences between the DMVPN phases relate to routing efficiency and the ability to create spoke-to-spoke tunnels. We started with DMVPN Phase 1, which only supported hub-to-spoke tunnels. This limited scalability because there was no direct spoke-to-spoke communication: the spokes could communicate with one another, but the traffic had to traverse the hub.

Then, we went to DMVPN Phase 2 to support spoke-to-spoke communication with dynamic tunnels. These tunnels were initially brought up by passing traffic via the hub. Later, Cisco developed DMVPN Phase 3, which optimized how spoke-to-spoke communication happens and how the tunnels are built.

Dynamic multipoint virtual private networks began simply as what is best described as hub-and-spoke topologies. The primary tool for creating these VPNs combines Multipoint Generic Routing Encapsulation (mGRE) connections employed on the hub with traditional Point-to-Point (P2P) GRE tunnels on the spoke devices.

In this initial deployment methodology, known as a Phase 1 DMVPN, the spokes can only join the hub and communicate with one another through the hub. This phase does not use spoke-to-spoke tunnels. Instead, the spokes are configured for point-to-point GRE to the hub and register the mapping of their logical (tunnel) IP address to their non-broadcast multi-access (NBMA) address with the next-hop server (NHS) hub.

It is essential to keep in mind that there is a total of three phases, and each one can influence the following:

  1. Spoke-to-spoke traffic patterns
  2. Routing protocol design
  3. Scalability

DMVPN Design Options

The disadvantage of a single hub router is that it’s a single point of failure. Once your hub router fails, the entire DMVPN network is gone.

We need another hub router to add redundancy to our DMVPN network. There are two options for this:

  1. Dual hub – Single Cloud
  2. Dual hub – Dual Cloud

With the single cloud option, we use a single DMVPN network but add a second hub. The spoke routers will use only one multipoint GRE interface, and we configure the second hub as a next-hop server. The dual cloud option also has two hubs, but we will use two DMVPN networks, meaning all spoke routers will get a second multipoint GRE interface.

Understanding DMVPN Dual Hub Single Cloud:

DMVPN dual hub single cloud is a network architecture that provides redundancy and high availability by utilizing two hub devices connected to a single cloud. The cloud can be an internet-based infrastructure or a private WAN. This configuration ensures the network remains operational even if one hub fails, as the other hub takes over the traffic routing responsibilities.

Benefits of DMVPN Dual Hub Single Cloud:

1. Redundancy: With dual hubs, organizations can ensure network availability even during hub device failures. This redundancy minimizes downtime and maximizes productivity.

2. Load Balancing: DMVPN dual hub single cloud allows for efficient load balancing between the two hubs. Traffic can be distributed evenly, optimizing bandwidth utilization and enhancing network performance.

3. Scalability: The architecture is highly scalable, allowing organizations to easily add new sites without reconfiguring the entire network. New sites can be connected to either hub, providing flexibility and ease of expansion.

4. Simplified Management: DMVPN dual hub single cloud simplifies network management by centralizing control and reducing the complexity of VPN configurations. Changes and updates can be made at the hub level, ensuring consistent policies across all connected sites.

The disadvantage is that we have limited control over routing. Since we use a single multipoint GRE interface, making the spoke routers prefer one hub over another is challenging.

8th Lab Guide: DMVPN Dual Hub Single Cloud

DMVPN Advanced Configuration

Below is a DMVPN network with two hubs and two spoke routers. Hub1 will be the primary hub, and hub2 will be the secondary hub. We use a single DMVPN network; each router has only one multipoint GRE interface. On top of that, we have R1 at the main site, where we use the 10.10.10.0/24 subnet. Behind R1, we have a loopback interface with IP address 1.1.1.1/32 (a spoke configuration sketch follows the diagram below).

Note:

  1. The two hub routers and spoke routers are connected to the Internet. Usually, you would connect the two hub routers to different ISPs. To keep it simple, I combined all routers into the 192.168.1.0/24 subnet, represented as an unmanaged switch in the lab below.
  2. Each spoke router has a loopback interface with an IP address. The DMVPN network will use subnet 172.16.1.0/24, where hub1 will be the primary hub. Spoke routers will register themselves with both hub routers.
Diagram: Dual Hub Single Cloud
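
On a spoke in this single-cloud design, both hubs are simply listed as next-hop servers on the one multipoint GRE interface. The sketch below follows the lab’s addressing plan (172.16.1.0/24 for the tunnel, 192.168.1.0/24 for the underlay), but the specific host addresses and interface name are assumptions.

! Dual-hub spoke sketch - host addresses and interface name are assumed placeholders
interface Tunnel0
 ip address 172.16.1.3 255.255.255.0
 ip nhrp network-id 1
 ip nhrp nhs 172.16.1.1 nbma 192.168.1.1 multicast
 ip nhrp nhs 172.16.1.2 nbma 192.168.1.2 multicast
 tunnel source GigabitEthernet0/1
 tunnel mode gre multipoint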

Summary of DMVPN Phases

Phase 1—Hub-to-Spoke Designs: Phase 1 was the first design introduced for hub-to-spoke implementation, where spoke-to-spoke traffic would traverse via the hub. Phase 1 also introduced daisy chaining of identical hubs for scaling the network, thereby providing Server Load Balancing (SLB) capability to increase the CPU power.

Phase 2—Spoke-to-Spoke Designs: Phase 2 design introduced the ability for dynamic spoke-to-spoke tunnels without traffic going through the hub, intersite communication bypassing the hub, thereby providing greater scalability and better traffic control.

In Phase 2 network design, each DMVPN network is independent of other DMVPN networks, causing spoke-to-spoke traffic from different regions to traverse the regional hubs without going through the central hub.

Phase 3—Hierarchical (Tree-Based) Designs: Phase 3 extended Phase 2 design with the capability to establish dynamic and direct spoke-to-spoke tunnels from different DMVPN networks across multiple regions. In Phase 3, all regional DMVPN networks are bound to form a single hierarchical (tree-based) DMVPN network, including the central hubs.

As a result, spoke-to-spoke traffic from different regions can establish direct tunnels with each other, thereby bypassing both the regional and main hubs.
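
Although the exact commands are beyond the scope of this summary, Phase 3’s hub-assisted spoke-to-spoke behaviour is typically enabled with the NHRP redirect/shortcut pair, sketched below as a minimal example on the tunnel interfaces.

! Hub tunnel interface - signal that a better spoke-to-spoke path exists
interface Tunnel0
 ip nhrp redirect
!
! Spoke tunnel interfaces - accept the redirect and install the shortcut
interface Tunnel0
 ip nhrp shortcut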

Diagram: DMVPN network and phases explained. Source is blog

DMVPN Architecture

Design recommendation


  • To create an overlay network, one needs a tunneling technique, such as multipoint GRE or point-to-point GRE.

  • Dynamic routing protocols are typically required in all but the smallest deployments or wherever static routing is not manageable or optimal.

  • QoS: Mandatory to ensure performance and quality of voice, video, and real-time data applications.

DMVPN Design recommendation

Which deployment model can you use? The 80:20 traffic rule can be used to determine which model to use:

  1. If 80 percent or more of the traffic from the spokes is directed into the hub network itself, deploy the hub-and-spoke model.
  2. Consider the spoke-to-spoke model if more than 20 percent of the traffic is meant for other spokes.

The hub-and-spoke model is usually preferred for networks with a high volume of IP Multicast traffic.

Architecture

Medium-sized and large-scale site-to-site VPN deployments require support for advanced IP network services such as:

● IP Multicast: Required for efficient and scalable one-to-many (i.e., Internet broadcast) and many-to-many (i.e., conferencing) communications and commonly needed by voice, video, and specific data applications

● Dynamic routing protocols: Typically required in all but the smallest deployments or wherever static routing is not manageable or optimal

● QoS: Mandatory to ensure performance and quality of voice, video, and real-time data applications

Traditionally, supporting these services required tunneling IPsec inside protocols such as Generic Route Encapsulation (GRE), which introduced an overlay network, making it complex to set up and manage and limiting the solution’s scalability.

Indeed, traditional IPsec only supports IP Unicast, making deploying applications that involve one-to-many and many-to-many communications inefficient. Cisco DMVPN combines GRE tunneling and IPsec encryption with Next-Hop Resolution Protocol (NHRP) routing to meet these requirements while reducing the administrative burden. 

How DMVPN Works

How DMVPN Works


  •  Each spoke establishes a permanent tunnel to the hub. IPsec is optional.

  • Each spoke registers its actual address as a client to the NHRP server on the hub

  • When a spoke requires that packets be sent to a destination subnet on another spoke, it queries the NHRP server for the real (outside) address of the other spoke.

  • After the originating spoke learns the peer address of the target spoke, it initiates a dynamic IPsec tunnel to the target spoke.

  • The spoke-to-spoke tunnels are established on demand whenever traffic is sent between the spokes.


How DMVPN Works

DMVPN builds a dynamic tunnel overlay network.

• Initially, each spoke establishes a permanent IPsec tunnel to the hub. (At this stage, spokes do not establish tunnels with other spokes within the network.) The hub address should be static and known by all of the spokes.

• Each spoke registers its actual address as a client to the NHRP server on the hub. The NHRP server maintains an NHRP database of the public interface addresses for each spoke.

• When a spoke requires that packets be sent to a destination (private) subnet on another spoke, it queries the NHRP server for the real (outside) addresses of the other spoke’s destination to build direct tunnels.

• The NHRP server looks up the corresponding destination spoke in its NHRP database and replies with the real address of the target router. NHRP is needed here because the dynamic routing protocols cannot discover the real address of the remote spoke on their own. (Dynamic routing adjacencies are established only from spoke to hub.)

• After the originating spoke learns the peer address of the target spoke, it initiates a dynamic IPsec tunnel to the target spoke.

• Integrating the multipoint GRE (mGRE) interface, NHRP, and IPsec establishes a direct dynamic spoke-to-spoke tunnel over the DMVPN network.

The spoke-to-spoke tunnels are established on demand whenever traffic is sent between the spokes. After that, packets can bypass the hub and use the spoke-to-spoke tunnel directly. 

Feature Design of Dynamic Multipoint VPN 

The Dynamic Multipoint VPN (DMVPN) feature combines GRE tunnels, IPsec encryption, and NHRP routing to provide users with ease of configuration via crypto profiles—which override the requirement for defining static crypto maps—and dynamic discovery of tunnel endpoints. 

This feature relies on the following two Cisco-enhanced standard technologies: 

  • NHRP is a client-server protocol where the hub is the server and the spokes are the clients. The hub maintains an NHRP database of each spoke’s public interface addresses. Each spoke registers its real address when it boots and queries the NHRP database for the real addresses of the destination spokes to build direct tunnels. 
  • mGRE Tunnel Interface –Allows a single GRE interface to support multiple IPsec tunnels and simplifies the size and complexity of the configuration.
  • Each spoke has a permanent IPsec tunnel to the hub, not to the other spokes within the network. Each spoke registers as a client of the NHRP server. 
  • When a spoke needs to send a packet to a destination (private) subnet on another spoke, it queries the NHRP server for the real (outside) address of the destination (target) spoke. 
  • After the originating spoke “learns” the peer address of the target spoke, a dynamic IPsec tunnel can be initiated into the target spoke. 
  • The spoke-to-spoke tunnel is built over the multipoint GRE interface. 
  • The spoke-to-spoke links are established on demand whenever there is traffic between the spokes. After that, packets can bypass the hub and use the spoke-to-spoke tunnel.
Diagram: Cisco DMVPN features. The source is Cisco.

Cisco DMVPN Solution Architecture

DMVPN allows IPsec VPN networks to scale hub-to-spoke and spoke-to-spoke designs better, optimizing performance and reducing communication latency between sites.

DMVPN offers a wide range of benefits, including the following:

• The capability to build dynamic hub-to-spoke and spoke-to-spoke IPsec tunnels

• Optimized network performance

• Reduced latency for real-time applications

• Reduced router configuration on the hub that provides the capability to dynamically add multiple spoke tunnels without touching the hub configuration

• Automatic triggering of IPsec encryption by GRE tunnel source and destination, assuring zero packet loss

• Support for spoke routers with dynamic physical interface IP addresses (for example, DSL and cable connections)

• The capability to establish dynamic and direct spoke-to-spoke IPsec tunnels for communication between sites without having the traffic go through the hub; that is, intersite communication bypassing the hub

• Support for dynamic routing protocols running over the DMVPN tunnels

• Support for multicast traffic from hub to spokes

• Support for VPN Routing and Forwarding (VRF) integration extended in multiprotocol label switching (MPLS) networks

• Self-healing capability maximizing VPN tunnel uptime by rerouting around network link failures

• Load-balancing capability offering increased performance by transparently terminating VPN connections to multiple headend VPN devices

Network availability over a secure channel is critical when designing scalable IPsec VPN solutions, especially as networks become more geographically distributed. The DMVPN solution architecture is by far the most effective and scalable solution available.

Summary: DMVPN

One technology that has gained significant attention and revolutionized the way networks are connected is DMVPN (Dynamic Multipoint Virtual Private Network). In this blog post, we delved into the depths of DMVPN, exploring its architecture, benefits, and use cases.

Section 1: Understanding DMVPN

DMVPN, at its core, is a scalable and efficient solution for providing secure and dynamic connectivity between multiple sites over a public network infrastructure. It combines the best features of traditional VPNs and multipoint GRE tunnels, resulting in a flexible and cost-effective network solution.

Section 2: The Architecture of DMVPN

The architecture of DMVPN involves three main components: the hub router, the spoke routers, and the underlying routing protocol. The hub router acts as a central point for the network, while the spoke routers establish secure tunnels with the hub. These tunnels are dynamically built using multipoint GRE, allowing efficient data transmission.

Section 3: Benefits of DMVPN

3.1 Enhanced Scalability: DMVPN provides a scalable solution, allowing for easy addition or removal of spokes without complex configurations. This flexibility is particularly useful in dynamic network environments.

3.2 Cost Efficiency: Using existing public network infrastructure, DMVPN eliminates the need for costly dedicated lines or leased circuits. This significantly reduces operational expenses and makes it an attractive option for organizations of all sizes.

3.3 Simplified Management: With DMVPN, network administrators can centrally manage the network infrastructure through the hub router. This centralized control simplifies configuration, monitoring, and troubleshooting tasks.

Section 4: Use Cases of DMVPN

4.1 Branch Office Connectivity: DMVPN is ideal for connecting branch offices to a central headquarters. It provides secure and reliable communication while minimizing the complexity and cost associated with traditional WAN solutions.

4.2 Mobile or Remote Workforce: DMVPN offers a secure and efficient solution for connecting remote employees to the corporate network in today’s mobile-centric work environment. Whether it’s sales representatives or telecommuters, DMVPN ensures seamless connectivity regardless of the location.

Conclusion:

DMVPN has emerged as a game-changer in the world of network connectivity. Its scalable architecture, cost efficiency, and simplified management make it an attractive option for organizations seeking to enhance their network infrastructure. Whether connecting branch offices or enabling a mobile workforce, DMVPN provides a robust and secure solution. Embracing the power of DMVPN can revolutionize network connectivity, opening doors to a more connected and efficient future.

eBOOK on SASE Capabilities

eBOOK – SASE Capabilities

In the following ebook, we will address the key points:

  1. Challenging Landscape
  2. The rise of SASE based on new requirements
  3. SASE definition
  4. Core SASE capabilities
  5. Final recommendations

 

 


Secure Access Service Edge (SASE) is a service designed to provide secure access to cloud applications, data, and infrastructure from anywhere. It allows organizations to securely deploy applications and services from the cloud while managing users and devices from a single platform. As a result, SASE simplifies the IT landscape and reduces the cost and complexity of managing security for cloud applications and services.

SASE provides a unified security platform allowing organizations to connect users and applications securely without managing multiple security solutions. It offers secure access to cloud applications, data, and infrastructure with a single set of policies, regardless of the user’s physical location. It also enables organizations to monitor and control all user activities with the same set of security policies.

SASE also helps organizations reduce the risk of data breaches and malicious actors by providing visibility into user activity and access. In addition, it offers end-to-end encryption, secure authentication, and secure access control. It also includes threat detection, advanced analytics, and data loss prevention.

SASE allows organizations to scale their security infrastructure quickly and easily, providing them with a unified security platform that can be used to connect users and applications from anywhere securely. With SASE, organizations can quickly and securely deploy applications and services from the cloud while managing users and devices from a single platform.

 

 

 

 


OpenShift Networking

OpenShift, developed by Red Hat, is a leading container platform that enables organizations to streamline their application development and deployment processes. With its robust networking capabilities, OpenShift provides a secure and scalable environment for running containerized applications. This blog post will explore the critical aspects of OpenShift networking and how it can benefit your organization.

OpenShift networking is built on top of Kubernetes networking and extends its capabilities to provide a flexible and scalable networking solution for containerized applications. It offers various networking options to meet the diverse needs of different organizations.

Load balancing and service discovery are essential aspects of Openshift networking. In this section, we will explore how Openshift handles load balancing across pods using services. We will discuss the various load balancing algorithms available and highlight the importance of service discovery in ensuring seamless communication between microservices within an Openshift cluster.

Openshift offers different networking models to suit diverse deployment scenarios. We will explore the three main models: Overlay Networking, Host Networking, and VxLAN Networking. Each model has its advantages and considerations, and we'll highlight the use cases where they shine.

Openshift provides several advanced networking features that enhance performance, security, and flexibility. We'll dive into topics like Network Policies, Service Mesh, Ingress Controllers, and Load Balancing. Understanding and utilizing these features will empower you to optimize your Openshift networking environment.

Highlights: OpenShift Networking

Overview – OpenShift Networking

OpenShift networking provides a robust and scalable network infrastructure for applications running on the platform. It allows containers and pods to communicate with each other and external systems and services. The networking model in OpenShift is based on the Kubernetes networking model, providing a standardized and flexible approach.

Software-Defined Networking (SDN) is a crucial component of OpenShift Networking, enabling dynamic, programmatically efficient network configuration. SDN abstracts the traditional networking hardware, providing a flexible network fabric that can adapt to the needs of your applications. SDN operates within OpenShift, enabling improved scalability, enhanced security, and simplified network management.

Key Initial Considerations:

A. OpenShift Container Platform

OpenShift Container Platform (formerly known as OpenShift Enterprise) or OCP is Red Hat’s offering for the on-premises private platform as a service (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution. The foundation of the OpenShift Container Platform is based on Kubernetes and, therefore, shares some of the same networking technology along with some enhancements.

B. Kubernetes: Orchestration Layer

Kubernetes is the leading container orchestration platform, and OpenShift builds on containers with Kubernetes as the orchestration layer. All of these elements sit on an SDN layer that glues everything together by providing an abstraction layer. The SDN creates the cluster-wide network, and the glue that connects all the dots is the overlay network that operates over an underlay network.

C. OpenShift Networking & Plugins

When it comes to container orchestration, networking plays a pivotal role in ensuring connectivity and communication between containers, pods, and services. Openshift Networking provides a robust framework that enables efficient and secure networking within the Openshift cluster. By leveraging various networking plugins and technologies, it facilitates seamless communication between applications and allows load balancing, service discovery, and more.

The Architecture of OpenShift Networking

OpenShift’s networking model is built upon the foundation of Kubernetes, but it takes things a step further with its own enhancements bringing better networking and security capabilities than the default Kubernetes model. The architecture consists of several key components, including the OpenShift SDN (Software Defined Networking), network policies, and service mesh capabilities.

The OpenShift SDN abstracts the underlying network infrastructure, allowing developers to focus on application logic rather than network configurations. This abstraction enables greater flexibility and simplifies the deployment process.

**Network Policies: Securing Your Cluster**

One of the standout features of OpenShift networking is the implementation of network policies. These policies allow administrators to define how pods communicate with each other and with the outside world. By leveraging network policies, teams can enforce security boundaries, ensuring that only authorized traffic is allowed to flow within the cluster. This is particularly crucial in multi-tenant environments where different teams might share the same OpenShift cluster but require isolated communication channels.

**Service Mesh: Enhancing Microservices Communication**

As organizations increasingly adopt microservices architectures, managing service-to-service communication becomes a challenge. OpenShift addresses this with its service mesh capabilities, primarily through the integration of Istio. A service mesh adds an additional layer of security, observability, and resilience to your microservices. It provides features like advanced traffic management, circuit breaking, and telemetry collection, all of which are essential for maintaining a healthy and efficient microservices ecosystem.

**Navigating Through OpenShift’s Route and Ingress**

Routing is another fundamental aspect of OpenShift networking. OpenShift’s routing layer allows developers to expose their applications to external users. This can be achieved through both routes and ingress resources. Routes are specific to OpenShift and provide a simple way to map external URLs to services within the cluster. On the other hand, ingress resources, which are part of Kubernetes, offer more advanced configurations and are particularly useful when you need to define complex routing rules.

Example Technology: Cloud Service Mesh

 

Key Considerations: OpenShift Networking

1. Network Namespace Isolation:
OpenShift networking leverages network namespaces to achieve isolation between different projects, or namespaces, on the platform. Each project has its virtual network, ensuring that containers and pods within a project can communicate securely while isolated from other projects.

2. Service Discovery and OpenShift Load Balancer:
OpenShift networking provides service discovery and load-balancing mechanisms to facilitate communication between various application components. Services act as stable endpoints, allowing containers and pods to connect to them using DNS or environmental variables. The built-in OpenShift load balancer ensures that traffic is distributed evenly across multiple instances of a service, improving scalability and reliability.

3. Ingress and Egress Network Policies:
OpenShift networking allows administrators to define ingress and egress network policies to control network traffic flow within the platform. Ingress policies specify rules for incoming traffic, allowing or denying access to specific services or pods. Egress policies, on the other hand, regulate outgoing traffic from pods, enabling administrators to restrict access to external systems or services.

4. Network Plugins and Providers:
OpenShift networking supports various network plugins and providers, allowing users to choose the networking solution that best fits their requirements. Some popular options include Open vSwitch (OVS), Flannel, Calico, and Multus. These plugins provide additional capabilities such as network isolation, advanced routing, and security features.

5. Network Monitoring and Troubleshooting:
OpenShift provides robust monitoring and troubleshooting tools to help administrators track network performance and resolve issues. The platform integrates with monitoring systems like Prometheus, allowing users to collect and analyze network metrics. Additionally, OpenShift provides logging and debugging features to aid in identifying and resolving network-related problems.

Example Technology: Prometheus Pull Approach 


OpenShift Networking Features

OpenShift Networking offers many features that empower developers and administrators to build and manage scalable applications. Some notable features include:

1. SDN Integration: Openshift seamlessly integrates with Software-Defined Networking (SDN) solutions, allowing for flexible network configurations and efficient traffic routing.

2. Multi-tenancy Support: With Openshift Networking, you can create isolated network zones, enabling multiple teams or projects to coexist within the same cluster while maintaining secure communication.

3. Service Load Balancing: Openshift Networking incorporates built-in load balancing capabilities, distributing incoming traffic across multiple instances of a service, thus ensuring high availability and optimal performance.

**Key Challenges: Traditional data center**

Several challenges with traditional data center networks show that they cannot support today’s applications, such as microservices and containers. Therefore, we need a new set of networking technologies, built into OpenShift SDN, to deal adequately with today’s landscape changes.

– Firstly, there is tight coupling between the networking and infrastructure components. In traditional data center networking, Layer 4 services are coupled to the network topology at fixed network points and lack the flexibility to support today’s containerized applications, which are far more agile than traditional monolithic applications.

– Secondly, containers are short-lived and constantly spun up and down. The assets that support the application, such as IP addresses, firewalls, policies, and the overlay networks that glue the connectivity together, are continually recycled. These changes bring a lot of agility and business benefits, but they stand in stark contrast to a traditional network, which is relatively static and where changes happen every few months.

OpenShift networking – Two layers

  1. When OpenShift is deployed in a virtual environment, the underlying network topology is determined directly by the physical network equipment. OpenShift does not control this level; it simply provides connectivity to the OpenShift masters and nodes.
  2. OpenShift SDN plugin determines the virtual network topology. At this level, applications are connected, and external access is provided.

OpenShift uses an overlay network based on VXLAN to enable containers to communicate with each other. Layer 2 (Ethernet) frames can be transferred across Layer 3 (IP) networks using the Virtual Extensible Local Area Network (VXLAN) protocol.

Whether the communication is limited to pods within the same project or completely unrestricted depends on the SDN plugin being used. The network topology remains the same regardless of which plugin is used. OpenShift ships its internal SDN plugins out of the box and also supports integration with third-party SDN frameworks.

Example Overlay Technology: VXLAN

VXLAN unicast mode

OpenShift Networking Plugins

### Understanding OpenShift Networking Plugins

OpenShift networking plugins are essential for managing network traffic within a cluster. They determine how pods communicate with each other and with external networks. The choice of networking plugin can significantly impact the performance and security of your applications. OpenShift supports several plugins, each with distinct features, allowing administrators to tailor their network configuration to specific needs.

### Popular Networking Plugins in OpenShift

1. **OVN-Kubernetes**: This plugin is widely used for its scalability and support for network policies. It provides a flexible, secure, and scalable network solution that is ideal for complex deployments. OVN-Kubernetes integrates seamlessly with Kubernetes, offering enhanced network isolation and simplified network management.

2. **Calico**: Known for its simplicity and efficiency, Calico provides high-performance network connectivity. It excels in environments that require fine-grained network policies and is particularly popular in microservices architectures. Calico’s support for IPv6 and integration with Kubernetes NetworkPolicy API makes it a versatile choice.

3. **Flannel**: A simpler alternative, Flannel is suitable for users who prioritize ease of setup and basic networking functionality. It uses a flat network model, which is straightforward to configure and manage, making it a good choice for smaller clusters or those in the early stages of development.

### Choosing the Right Plugin

Selecting the appropriate networking plugin for your OpenShift environment depends on several factors, including your organization’s security requirements, scalability needs, and existing infrastructure. For instance, if your primary concern is network isolation and security, OVN-Kubernetes might be the best fit. On the other hand, if you need a lightweight solution with minimal overhead, Flannel could be more suitable.

### Implementing and Managing Networking Plugins

Once you’ve chosen a networking plugin, the next step is implementation. This involves configuring the plugin within your OpenShift cluster and ensuring it aligns with your network policies. Regular monitoring and management are crucial to maintaining optimal performance and security. Utilizing OpenShift’s built-in tools and dashboards can greatly aid in this process, providing visibility into network traffic and potential bottlenecks.

Related: Before you proceed, you may find the following posts helpful for some pre-information

  1. Kubernetes Networking 101 
  2. Kubernetes Security Best Practice 
  3. Internet of Things Theory
  4. OpenStack Neutron
  5. Load Balancing
  6. ACI Cisco

OpenShift Networking

Networking Overview

1. Pod Networking: In OpenShift, containers are encapsulated within pods, the smallest deployable units. Each pod has its own IP address and can communicate with other pods within the same project or across different projects. This enables seamless communication and collaboration between applications running on different pods.

2. Service Networking: OpenShift introduces the concept of services, which act as stable endpoints for accessing pods. Services provide a layer of abstraction, allowing applications to communicate with each other without worrying about the underlying infrastructure. With service networking, you can easily expose your applications to the outside world and manage traffic efficiently.

3. Ingress and Egress: OpenShift provides a robust routing infrastructure through its built-in Ingress Controller. It lets you define rules and policies for accessing your applications outside the cluster. To ensure seamless connectivity, you can easily configure routing paths, load balancing, SSL termination, and other advanced features.

4. Network Policies: OpenShift enables fine-grained control over network traffic through network policies. You can define rules to allow or deny communication between pods based on their labels and namespaces. This helps enforce security measures and isolate sensitive workloads from unauthorized access.

5. Multi-Cluster Networking: OpenShift allows you to connect multiple clusters, creating a unified networking fabric. This enables you to distribute your applications across different clusters, improving scalability and fault tolerance. OpenShift’s intuitive interface makes managing and monitoring your multi-cluster environment easy.

6. Network Policies and Security: One key aspect of Openshift Networking is its support for network policies. These policies allow administrators to define fine-grained rules and access controls, ensuring secure communication between different components within the cluster. With Openshift Networking, organizations can enforce policies restricting traffic flow, implementing encryption mechanisms, and safeguarding sensitive data from unauthorized access.

7. Service Discovery and Load Balancing: Service discovery and load balancing are crucial for maintaining high availability and optimal performance in a dynamic container environment. Openshift Networking offers robust mechanisms for service discovery, allowing containers to locate and communicate with one another seamlessly. Additionally, it provides load-balancing capabilities to distribute incoming traffic efficiently across multiple instances, ensuring optimal resource utilization and preventing bottlenecks.

8. Network Plugin Options: Openshift Networking offers a range of network plugin options, allowing organizations to select the most suitable solution for their specific requirements. Whether the native OpenShift SDN (Software-Defined Networking) plugin, the popular Flannel plugin, or the versatile Calico plugin, Openshift provides flexibility and compatibility with different network architectures and setups.

Example Network Policy Technology: GKE Kubernetes

Kubernetes network policy
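As a hedged sketch of the network policy objects referenced above, the manifest below allows ingress to pods labelled app=backend only from pods labelled app=frontend on TCP/8080; the names, labels, and port are assumptions for the example.

```yaml
# Kubernetes NetworkPolicy sketch: only pods labelled app=frontend may reach
# pods labelled app=backend on TCP/8080. Labels and port are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend            # the policy applies to these pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only traffic from these pods is allowed
      ports:
        - protocol: TCP
          port: 8080
```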

POD Networking

Each pod in Kubernetes is assigned an IP address from an internal network that allows pods to communicate with each other. By doing this, all containers within the pod behave as if they were on the same host. The IP address of each pod enables pods to be treated like physical hosts or virtual machines.

The pod model simplifies port allocation, networking, naming, service discovery, load balancing, application configuration, and migration. Pods do not need to be linked together, and IP addresses should not be used to communicate directly between pods. Instead, create a Service and interact with the pods through it.

OpenShift Container Platform DNS

When running multiple services, such as a frontend and a backend, environment variables are created for user names, service IPs, and more so that the frontend pods can communicate with the backend service. If the service is deleted and recreated, the frontend pods must be recreated to pick up the updated value of the service IP environment variable.

Furthermore, the backend service must be created before any frontend pods so that its IP address is generated and can be passed to the frontend pods as an environment variable.

Due to this, the OpenShift Container Platform has a built-in DNS, enabling the service to be reached by both the service DNS and the service IP/port. Split DNS is supported by the OpenShift Container Platform by running SkyDNS on the master, which answers DNS queries for services. By default, the master listens on port 53.

POD Network & POD Communication

As a general rule, the pod-to-pod communication model holds for all Kubernetes clusters: an IP address is assigned to each Pod. While pods can communicate directly with each other by addressing their IP addresses, it is recommended that they use Services instead. A Service fronts a set of Pods and is accessed through a single, fixed DNS name or IP address. The majority of Kubernetes applications use Services to communicate. Since Pods can be restarted frequently, addressing them directly by name or IP is highly brittle. Instead, reach another pod through a Service.

Simple pod-to-pod communication

The first thing to understand is how Pods communicate within Kubernetes. Kubernetes provides IP addresses for each Pod. IP addresses are used to communicate between pods at a very primitive level. Therefore, you can directly address another Pod using its IP address whenever needed.

A Pod behaves much like a virtual machine (VM): it has an IP address, exposes ports, and interacts with other Pods on the network via IP address and port, just as VMs interact with each other.

What is the communication mechanism between the front-end pod and the back-end pod? In a web application architecture, a front-end application is expected to talk to a backend, an API, or a database. In Kubernetes, the front and back end would be separated into two Pods.

The front end could be configured to communicate directly with the back end via its IP address. However, a front end would still need to know the backend’s IP address, which can be tricky when the Pod is restarted or moved to another node. Using a Service can make our solution less brittle.

Because the app still communicates with the API pods via the Service, which has a stable IP address, if the Pods die or need to be restarted, this won’t affect the app.

Diagram: Pod networking. Source: tutorialworks.

How do containers in the same Pod communicate?

Sometimes, you may need to run multiple containers in the same Pod. Containers in the same Pod share the Pod’s IP address, so localhost can be used to communicate between them. For example, a container in a pod can use the address localhost:8080 to communicate with another container in the Pod on port 8080.

Because the IP address is shared and communication occurs on localhost, two containers in the same Pod cannot listen on the same port. For instance, you wouldn’t be able to have two containers in the same Pod that both expose port 8080; you must ensure that they use different ports.

In Kubernetes, pods can communicate with each other in a few different ways:

  1. Containers in the same Pod can connect to each other over localhost, using the port number the other container exposes.
  2. A container in a Pod can connect to another Pod using its IP address. To find the IP address of a pod, you can use oc get pods.
  3. A container can connect to another Pod through a Service. A Service has a stable IP address and usually a DNS name.
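To make point 1 concrete, the sketch below defines a Pod with two containers that share the Pod’s IP address and therefore must listen on different ports; the images and port numbers are illustrative assumptions.

```yaml
# Two containers in one Pod share the network namespace: they talk over
# localhost and must listen on different ports. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
    - name: web
      image: nginx:1.25            # serves on port 80
      ports:
        - containerPort: 80
    - name: metrics-sidecar
      image: prom/statsd-exporter  # listens on 9102; reachable from "web" at localhost:9102
      ports:
        - containerPort: 9102
```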

OpenShift and Pod Networking

When you initially deploy OpenShift, a private pod network is created. Each pod in your OpenShift cluster is assigned an IP address on the pod network, which is used to communicate with each pod across the cluster.

The pod network spans all nodes in your cluster and is extended to new application nodes as they are added. The pod network’s IP range must not be in use by any network that OpenShift needs to communicate with: OpenShift’s internal network routing follows the same rules as any other network, and multiple destinations for the same IP address lead to confusion.

**Endpoint Reachability**

Endpoint reachability has also evolved. Not only have the endpoints changed, but so have the ways we reach them. The application stack previously had very few components, maybe just a cache, a web server, or a database. The most common network service simply allowed a source to reach an application endpoint, or to load balance across several endpoints using a load-balancing algorithm.

A simple round-robin, or a load balancer that measured load, was standard. Essentially, the sole purpose of the network was to provide endpoint reachability. However, changes inside the data center are driving networks and network services toward becoming more integrated with the application.

Nowadays, the network function exists no longer solely to satisfy endpoint reachability; it is fully integrated. In the case of Red Hat’s OpenShift, the network is represented as a Software-Defined Networking (SDN) layer. SDN means different things to different vendors. So, let me clarify in terms of OpenShift.

Highlighting software-defined network (SDN)

When you examine traditional networking devices, you see the control and forwarding planes shared on a single device. The concept of SDN separates these two planes, i.e., the control and forwarding planes are decoupled. They can now reside on different devices, bringing many performance and management benefits.

The benefits of network integration and decoupling make it much easier for the applications to be divided into several microservice components driving the microservices culture of application architecture. You could say that SDN was a requirement for microservices.

Diagram: Software-Defined Networking (SDN). Source: Opennetworking.

Challenges to Docker Networking 

Port mapping and NAT

Docker containers have been around for a while, but their networking had significant drawbacks when they first came out. By default, containers connect to a bridge on the node where the Docker daemon is running.

To allow network connectivity between those containers and any endpoint external to the node, we need to do port mapping and Network Address Translation (NAT), which adds complexity.

Port Mapping and NAT have been around for ages. Introducing these networking functions will complicate container networking when running at scale. It is perfectly fine for 3 or 4 containers, but the production network will have many more endpoints. The origins of container networking are based on a simple architecture and primarily a single-host solution.

Docker at scale: Orchestration layer

The core building blocks of containers, such as namespaces and control groups, are battle-tested. However, although the Docker engine manages containers by orchestrating Linux kernel resources, it is limited to a single host operating system. Once you get past three hosts, networking is hard to manage: everything needs to be spun up in a particular order, and consistent network connectivity and security become difficult to guarantee regardless of workload mobility.

Diagram: Docker default networking.

This led to an orchestration layer. Just as a container is an abstraction over the physical machine, the container orchestration framework is an abstraction over the network. This brings us to the Kubernetes networking model, which Openshift takes advantage of and enhances; for example, the OpenShift Route Construct exposes applications for external access.

The Kubernetes model: Pod networking

As we discussed, the Kubernetes networking model was developed to simplify Docker container networking, which had drawbacks. It introduced the concept of Pod and Pod networking, allowing multiple containers inside a Pod to share an IP namespace. They can communicate with each other on IPC or localhost.

Nowadays, we typically place a single container into a pod, which acts as a boundary layer for any cluster parameters that directly affect the container. So, we run deployments against pods rather than containers.

In OpenShift, we can assign networking and security parameters to Pods that will affect the container inside. When an app is deployed on the cluster, each Pod gets an IP assigned, and each Pod could have different applications.

For example, Pod 1 could host a web front end and Pod 2 a database, so the Pods need to communicate. For this, we need a network and IP addresses. By default, Kubernetes allocates an internal IP address to each Pod for the applications running within it. Pods and their containers can network with each other, but clients outside the cluster cannot access internal cluster resources by default. With Pod networking, every Pod must be able to communicate with every other Pod in the cluster without Network Address Translation (NAT).

A typical service type: ClusterIP

The most common service type is ClusterIP. A ClusterIP is a persistent virtual IP address used for load-balancing traffic internal to the cluster. Services of this type cannot be accessed directly from outside the cluster; there are other service types for that requirement.

The ClusterIP service type is considered East-West traffic, since it originates from Pods running in the cluster and is destined for a service IP backed by Pods that also run in the cluster.

Then, to enable external access to the cluster, we need to expose the services that the Pod or Pods represent, and this is done with an Openshift Route that provides a URL. So, we have a service in front of the pod or groups of pods. The default is for internal access only. Then, we have a URL-based route that gives the internal service external access.
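A minimal sketch of such an internal service might look like the manifest below; ClusterIP is the default type, and the selector, names, and ports are assumptions. A Route, as sketched earlier, can then be layered on top of the service to provide URL-based external access.

```yaml
# ClusterIP Service sketch: a stable virtual IP load-balancing to the pods
# selected by app=api. Selector, names, and ports are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP        # default; reachable only inside the cluster
  selector:
    app: api             # pods backing this Service
  ports:
    - port: 80           # port exposed on the service IP
      targetPort: 8080   # container port on the pods
```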

Diagram: OpenShift networking and ClusterIP. Source: Red Hat.

Using an OpenShift Load Balancer

Get Traffic into the Cluster

If you do not need a specific external IP address, OpenShift Container Platform clusters can be accessed externally through an OpenShift load balancer service. The OpenShift load balancer allocates unique IP addresses from configured pools. Load balancers have a single edge router IP (which can be a virtual IP (VIP), but it is still a single machine for initial load balancing). How many OpenShift load balancers are there in OpenShift?

Two load balancers

Two load-balancer configuration options are supported: use the installation playbooks to configure two load balancers for highly available production deployments, or configure a single load balancer, which is useful for proof-of-concept deployments. Either way, the solution is deployed behind your OpenShift load balancer.

This process involves the following:

  1. The administrator performs the prerequisites.
  2. The developer creates a project and service if the service to be exposed does not exist.
  3. The developer exposes the service to create a route.
  4. The developer creates the load balancer service.
  5. The network administrator configures networking to the service.

OpenShift load balancer: Different Openshift SDN networking modes

OpenShift security best practices  

So, depending on your OpenShift SDN configuration, you can tailor the network topology differently. You can have free-for-all Pod connectivity, similar to a flat network, or something stricter, with different security boundaries and restrictions. Free-for-all Pod connectivity between all projects might be fine for a lab environment.

Still, you may need to tailor the network with segmentation for production networks with multiple projects, which can be done with one of the OpenShift SDN plugins. We will get to this in a moment.

OpenShift networking does this with an SDN layer that enhances Kubernetes networking, creating a virtual network across all the nodes using the Open vSwitch standard. With the OpenShift SDN, this Pod network is established and maintained by the OpenShift SDN, which configures an overlay network using Open vSwitch (OVS).

The OpenShift SDN plugin

We mentioned that you could tailor the virtual network topology to suit your networking requirements. The OpenShift SDN plugin and the SDN model you select can determine this. With the default OpenShift SDN, several modes are available.

The SDN mode you choose determines how connectivity between applications is managed and how external access to them is provided.

Some modes are more fine-grained than others. How are all these plugins enabled? The Openshift Container Platform (OCP) networking relies on the Kubernetes CNI model while supporting several plugins by default and several commercial SDN implementations, including Cisco ACI.

The native plugins rely on the Open vSwitch virtual switch and offer alternative ways of providing segmentation over VXLAN, using either the VNID or Kubernetes NetworkPolicy objects.

We have, for example:

        • ovs-subnet

        • ovs-multitenant

        • ovs-networkpolicy

Choosing the right plugin depends on your security and control goals. As SDNs take over networking, third-party vendors are developing programmable network solutions, and Red Hat has tightly integrated OpenShift with products from several of these providers. According to Red Hat, the following solutions are production-ready:

  1. Nokia Nuage
  2. Cisco Contiv
  3. Juniper Contrail
  4. Tigera Calico
  5. VMWare NSX-T

  • ovs-subnet plugin

After OpenShift is installed, this plugin is enabled by default. As a result, pods can be connected across the entire cluster without limitations so traffic can flow freely between them. This may be undesirable if security is a top priority in large multitenant environments. 

  • ovs-multitenant plugin

Security is usually unimportant in PoCs and sandboxes but becomes paramount when large enterprises have diverse teams and project portfolios, especially when third parties develop specific applications. A multitenant plugin like ovs-multitenant is an excellent choice if simply separating projects is all you need.

This plugin sets up flow rules on the br0 bridge so that only traffic between pods with the same VNID is permitted, unlike the ovs-subnet plugin, which passes all traffic across all pods. It assigns the same VNID to all pods in a project, and VNIDs are unique across projects.

  • ovs-networkpolicy plugin

While the ovs-multitenant plugin provides a simple and largely adequate means for managing access between projects, it does not allow granular control over access. In this case, the ovs-networkpolicy plugin can be used to create custom NetworkPolicy objects that, for example, apply restrictions to traffic egressing or entering the network.

  • Egress routers

In OpenShift, routers direct ingress traffic from external clients to services, which then forward it to pods. OpenShift also offers a reverse type of router for forwarding egress traffic from pods to external networks. These egress routers are implemented using Squid instead of HAProxy. Routers with egress capabilities can be helpful in the following situations:

– Masking different external resources used by several applications behind a single global resource. For example, applications may be built by pulling dependencies from different mirrors, and collaboration between their development teams may be rather loose. Instead of getting every team to use the same mirror, an operations team can set up an egress router to intercept all traffic directed to those mirrors and redirect it to the same site.

– Redirecting all suspicious requests for specific sites to an audit system for further analysis.

OpenShift supports the following types of egress routers:

  • redirect for redirecting traffic to a specific destination IP
  • http-proxy for proxying HTTP, HTTPS, and DNS traffic
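The exact manifest depends on the OpenShift release and SDN plugin in use, but a redirect-mode egress router pod has traditionally looked roughly like the sketch below. The image paths, IP addresses, and annotation are assumptions drawn from the classic OpenShift SDN egress router and should be verified against the documentation for your version.

```yaml
# Rough sketch of a legacy OpenShift SDN egress router in redirect mode.
# Image paths, addresses, and annotation are assumptions; verify against
# your OpenShift version's documentation before use.
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-example
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"  # attach a macvlan interface
spec:
  initContainers:
    - name: egress-router-setup
      image: registry.redhat.io/openshift4/ose-egress-router  # assumed image path
      securityContext:
        privileged: true
      env:
        - name: EGRESS_SOURCE
          value: "192.0.2.10/24"   # external source IP owned by the router pod
        - name: EGRESS_GATEWAY
          value: "192.0.2.1"       # gateway on the external network
        - name: EGRESS_DESTINATION
          value: "203.0.113.25"    # traffic is redirected to this destination
        - name: EGRESS_ROUTER_MODE
          value: init
  containers:
    - name: egress-router-wait
      image: registry.redhat.io/openshift4/ose-pod  # assumed pause image
```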

Summary: OpenShift Networking

In the ever-evolving world of cloud computing, Openshift has emerged as a robust application development and deployment platform. One crucial aspect that makes it stand out is its networking capabilities. In this blog post, we delved into the intricacies of Openshift networking, exploring its key components, features, and benefits.

Understanding Openshift Networking Fundamentals

Openshift networking operates on a robust and flexible architecture that enables efficient communication between various components within a cluster. It utilizes a combination of software-defined networking (SDN) and network overlays to create a scalable and resilient network infrastructure.

Exploring Networking Models in Openshift

Openshift offers different networking models to suit various deployment scenarios. The most common models include Single-Stacked Networking, Dual-Stacked Networking, and Multus CNI. Each model has advantages and considerations, allowing administrators to choose the most suitable option for their specific requirements.

Deep Dive into Openshift SDN

At the core of Openshift networking lies the Software-Defined Networking (SDN) solution. It provides the necessary tools and mechanisms to manage network traffic, implement security policies, and enable efficient communication between Pods and Services. We will explore the inner workings of Openshift SDN, including its components like the SDN controller, virtual Ethernet bridges, and IP routing.

Network Policies in Openshift

To ensure secure and controlled communication between Pods, Openshift implements Network Policies. These policies define rules and regulations for network traffic, allowing administrators to enforce fine-grained access controls and segmentation. We will discuss the concept of Network Policies, their syntax, and practical examples to showcase their effectiveness.

Conclusion: Openshift’s networking capabilities play a crucial role in enabling seamless communication and connectivity within a cluster. By understanding the fundamentals, exploring different networking models, and harnessing the power of SDN and Network Policies, administrators can leverage Openshift’s networking features to build robust and scalable applications.

In conclusion, Openshift networking opens up a world of possibilities for developers and administrators, empowering them to create resilient and interconnected environments. By diving deep into its intricacies, one can unlock the full potential of Openshift networking and maximize the efficiency of their applications.

Auto Scaling Observability

In today's digital landscape, where applications and systems are becoming increasingly complex and dynamic, the need for efficient auto scaling observability has never been more critical. This blog post will delve into the fascinating world of auto scaling observability, exploring its importance, key components, and best practices. Let's embark on this journey together!

Auto scaling observability is the practice of monitoring and gathering data about the performance, health, and behavior of an application or system as it dynamically scales. It enables organizations to gain deep insights into their infrastructure, ensuring optimal performance, resource allocation, and cost-efficiency.

Data Collection and Monitoring: The foundation of auto scaling observability lies in effectively collecting and monitoring data from various sources, including metrics, logs, and traces. This allows for real-time visibility into the system's behavior and performance.

Metrics and Alerting: Metrics play a crucial role in understanding the health and performance of an application or system. By defining relevant metrics and setting up proactive alerts, organizations can quickly identify anomalies and take necessary actions.

Logs and Log Analysis: Logs provide a wealth of information about the internal workings of an application or system. Leveraging log analysis tools and techniques enables organizations to detect errors, troubleshoot issues, and gain valuable insights for optimization.

Distributed Tracing: In complex distributed systems, tracing requests across various services becomes essential. Distributed tracing enables end-to-end visibility, helping organizations identify bottlenecks, latency issues, and optimize system performance.

Define Clear Observability Goals: Before implementing auto scaling observability, it's crucial to define clear goals and objectives. This will ensure that the selected tools and strategies align with the organization's specific needs and requirements.

Choose the Right Monitoring Tools: There is a plethora of monitoring tools available in the market, each with its own strengths and features. Consider factors such as scalability, ease of integration, and customization options when selecting the right tool for your auto scaling observability needs.

Implement Robust Alerting Mechanisms: Setting up effective alerts is vital for timely responses to critical events. Define meaningful thresholds and ensure that alerts are routed to the appropriate teams or individuals for prompt action.

Embrace Automation: Auto scaling observability thrives on automation. Leverage automation tools and frameworks to streamline data collection, analysis, and alerting processes. This will save time, reduce human error, and enable faster decision-making.

Auto scaling observability has emerged as a crucial aspect of managing modern applications and systems. By embracing the art of auto scaling observability, organizations can unlock the power of data insights, optimize performance, and enhance overall system reliability. So, take the first step towards a more observant future, and witness the transformation it brings to your digital landscape.

Highlights: Auto-scaling Observability

Understanding Auto-Scaling Observability

1. Auto-scaling observability refers to a system’s ability to automatically adjust its resources based on real-time monitoring and analysis. It combines two essential components: auto-scaling, which dynamically allocates resources, and observability, which provides insights into the system’s behavior and performance.

2. By leveraging advanced monitoring tools and intelligent algorithms, auto-scaling observability enables organizations to optimize resource allocation and respond swiftly to changing demands.

3. Auto scaling observability involves the continuous monitoring and analysis of system metrics to automate the scaling of resources. This process allows organizations to automatically increase or decrease their computing power based on current demand, ensuring that applications run smoothly without over-provisioning or incurring unnecessary costs.

Auto-Scaling – Key Points:

– Enhanced Scalability and Performance: Auto-scaling observability allows systems to scale resources up or down based on actual usage patterns. This ensures that the system can handle peak loads efficiently without overprovisioning resources during periods of low demand. Organizations can avoid costly downtime by dynamically adjusting resources and ensuring optimal performance during sudden traffic spikes.

– Cost Optimization: With auto-scaling observability, businesses can significantly reduce infrastructure costs. Organizations can avoid unnecessary idle resource expenditures by accurately provisioning resources based on real-time data. This cost optimization approach ensures that companies only pay for the resources required, resulting in considerable savings.

– Improved Fault Tolerance: Auto-scaling observability is crucial in enhancing system resilience. Organizations can promptly identify and address potential issues by continuously monitoring the system’s health. In case of anomalies or failures, the system can automatically scale resources or trigger alerts for immediate remediation. This proactive approach minimizes the impact of failures and enhances the system’s overall fault tolerance.

Auto-scaling Example: Scaling with Docker Swarm
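Swarm itself does not auto-scale; replica counts are declared in a stack file and adjusted, manually or by an external controller, with docker service scale. The stack file below is a minimal sketch with illustrative names and values.

```yaml
# docker-compose/stack sketch for Swarm: the web service starts with 3 replicas.
# Swarm does not auto-scale on its own; an operator or external controller
# adjusts replicas, e.g. "docker service scale mystack_web=6".
version: "3.8"
services:
  web:
    image: nginx:1.25
    deploy:
      replicas: 3          # baseline replica count
      resources:
        limits:
          cpus: "0.50"     # per-task CPU ceiling
          memory: 256M
    ports:
      - "80:80"
```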

Auto Scaling – Key Components:

To fully harness the power of auto scaling observability, it’s important to understand its key components. These include metrics collection, alerting, and automated responses. Let’s delve deeper into each of these elements:

1. **Metrics Collection:** Gathering data from various sources is the foundation of observability. This involves collecting CPU usage, memory utilization, network traffic, and other vital metrics. With comprehensive data, organizations can better understand their infrastructure’s behavior and make informed scaling decisions.

2. **Alerting:** Once data is collected, it’s essential to set up alerts for any anomalies or thresholds that are breached. Alerting enables teams to respond swiftly to potential issues, minimizing downtime and maintaining application performance.

3. **Automated Responses:** The ultimate goal of observability is to automate responses to fluctuating demands. By employing pre-defined rules and machine learning algorithms, businesses can ensure that resources are scaled up or down automatically, optimizing both performance and cost.

**The Role of the Metric**

What is a metric? Metrics are good for the known. Regarding auto-scaling observability, one must understand a metric’s limitations. A metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are disposable and cheap, and they have a predictable storage footprint.

A metric is a numerical representation of a system state over a recorded time interval. It can tell you if a particular resource is over or underutilized at a specific moment. For example, CPU utilization might be at 75% right now.

Implementing Auto-Scaling Observability

Choosing the Right Monitoring Tools: To effectively implement auto-scaling observability, organizations must select appropriate monitoring tools that provide real-time insights into system performance, resource utilization, and user behavior. These tools should offer robust analytics capabilities and seamless integration with auto-scaling platforms.

Defining Metrics and Thresholds: Accurate metrics and thresholds are critical for successful auto-scaling observability. Organizations must identify key performance indicators (KPIs) that align with their business objectives and set appropriate thresholds for scaling actions. For example, CPU utilization, response time, and error rates are standard metrics for auto-scaling decisions.

Automating Scaling Actions: Organizations should automate scaling actions based on predefined rules to fully leverage the benefits of auto-scaling observability. By integrating monitoring tools, auto-scaling platforms, and orchestration frameworks, businesses can ensure that resource allocation adjustments are performed seamlessly and without human intervention.

Service Mesh & Auto-Scaling

Service mesh acts as a dedicated infrastructure layer for managing service-to-service communications. It provides a suite of capabilities, including traffic management, security, and, most importantly, observability. By integrating a service mesh, such as Istio or Linkerd, into your auto scaling environment, you gain granular visibility into your microservices architecture. This includes detailed metrics, tracing, and logging, enabling you to monitor traffic patterns, latency, and error rates with precision.

### Implementing Service Mesh for Optimal Observability

Deploying a service mesh involves several considerations to maximize its observability benefits. Start by identifying the microservices that will benefit most from enhanced observability. Next, configure the service mesh to collect and process telemetry data effectively. Ensure your observability stack—comprising metrics, logs, and traces—is equipped to handle the data influx. Finally, leverage the insights gained to optimize your auto scaling strategy, ensuring minimal downtime and optimal performance.

### What is a Cloud Service Mesh?

A cloud service mesh is a dedicated infrastructure layer that manages service-to-service communication within a distributed application. It decouples the networking logic from the application code, enabling developers to focus on core functionality without worrying about the complexities of inter-service communication. Service meshes provide features like load balancing, service discovery, and security policies, making them indispensable for modern cloud-native applications.

### Key Benefits of Service Mesh

#### Simplified Networking

One of the primary benefits of a service mesh is the simplification of networking within a microservices architecture. By abstracting the communication logic, service meshes make it easier to manage and scale applications. Developers can implement features like retries, timeouts, and circuit breakers without modifying their application code.

#### Enhanced Security

Service meshes provide robust security features, including mutual TLS (mTLS) for service-to-service encryption and authentication. This ensures that communication between services is secure by default, reducing the risk of data breaches and unauthorized access.

#### Traffic Management

With a service mesh, you can intelligently route traffic between services based on various criteria such as load, service version, or geographic location. This level of control enables canary deployments, blue-green deployments, and A/B testing, making it easier to roll out new features with minimal risk.
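As a hedged example of this kind of traffic control, an Istio VirtualService can split traffic between two versions of a service by weight, which is the basis of a canary rollout. The host and subset names below are assumptions and presume a matching DestinationRule that defines the subsets.

```yaml
# Istio VirtualService sketch: 90% of traffic goes to subset v1, 10% to the
# canary subset v2. Host and subset names are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
    - reviews            # in-mesh service name
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2   # canary version
          weight: 10
```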

### The Role of Observability

Observability is the ability to measure the internal states of a system based on the outputs it produces. In the context of a service mesh, observability involves collecting and analyzing metrics, logs, and traces to gain insight into the performance and behavior of the services.

### Why Observability Matters

Without proper observability, managing a service mesh can become a daunting task. Observability allows you to monitor the health of your services, detect anomalies, and troubleshoot issues in real-time. It provides the visibility needed to ensure that your service mesh is functioning as intended and that any problems are quickly identified and resolved.

### Tools and Techniques

Several tools can enhance observability in a service mesh, such as Prometheus for metrics, Jaeger for tracing, and Fluentd for logging. Combining these tools provides a comprehensive view of your service mesh’s performance and health, enabling proactive maintenance and quicker issue resolution.

Gaining Visibility with Google Ops Agent

**The Role of Google Ops Agent in Observability**

Google Ops Agent is a unified agent that simplifies the process of collecting telemetry data from your cloud environment. Its ability to seamlessly integrate with Google Cloud’s operations suite makes it an essential tool for businesses looking to enhance their auto-scaling observability. By providing detailed insights into CPU usage, memory utilization, and network traffic, Google Ops Agent ensures that your infrastructure is both monitored and optimized in real-time.

**Benefits of Enhanced Observability**

With Google Ops Agent in place, businesses gain a clearer picture of how their auto-scaling systems are functioning. This enhanced observability translates to several benefits: improved system reliability, faster troubleshooting, and better resource allocation. By actively monitoring metrics and logs, organizations can preemptively address issues before they escalate into significant problems. Furthermore, insights derived from this data can inform future scaling strategies and infrastructure investments.

**Example: Understanding Ops Agent**

Ops Agent is a lightweight and efficient monitoring agent developed by Google Cloud. It enables you to collect crucial metrics and logs from your Compute Engine instances, providing valuable insights into their performance and health. By leveraging Ops Agent, you can proactively detect issues, troubleshoot problems, and optimize the utilization of your instances.

To begin monitoring your Compute Engine instances with Ops Agent, install it on your virtual machines. The installation process is straightforward and can be done using package managers like apt or yum. Once installed, Ops Agent seamlessly integrates with Google Cloud Monitoring, allowing you to access and analyze the gathered data.

After installing Ops Agent, it is essential to configure the monitoring metrics that you want to collect. Ops Agent supports many metrics, including CPU usage, memory utilization, disk I/O, and network traffic. By tailoring the metrics collection to your specific needs, you can efficiently monitor the performance of your Compute Engine instances and identify any anomalies or bottlenecks.
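Configuration details vary by agent version, but the Ops Agent reads a YAML file (commonly /etc/google-cloud-ops-agent/config.yaml) in which metrics and logging pipelines are declared. The snippet below is a rough sketch of that documented structure; the receiver names, interval, and log path are assumptions to verify against Google’s documentation.

```yaml
# Rough sketch of an Ops Agent config.yaml: collect host metrics every 60s
# and tail an illustrative application log. Verify field names against the
# Google Cloud Ops Agent documentation for your agent version.
metrics:
  receivers:
    hostmetrics:
      type: hostmetrics
      collection_interval: 60s
  service:
    pipelines:
      default_pipeline:
        receivers: [hostmetrics]
logging:
  receivers:
    app_log:
      type: files
      include_paths:
        - /var/log/myapp/*.log   # illustrative application log path
  service:
    pipelines:
      default_pipeline:
        receivers: [app_log]
```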

What is GKE-Native Monitoring?

GKE-Native Monitoring is a powerful monitoring solution provided by Google Cloud Platform (GCP) designed explicitly for GKE clusters. It leverages the capabilities of Prometheus and Stackdriver, offering a unified monitoring experience within the GCP ecosystem. With GKE-Native Monitoring, users can effortlessly collect, visualize, and analyze metrics and logs related to their GKE clusters, enabling them to make data-driven decisions and proactively address any issues that may arise.

GKE-Native Monitoring offers a range of features that enhance observability and simplify monitoring workflows. Some notable features include:

1. Automatic Metric Collection: GKE-Native Monitoring automatically collects a rich set of metrics from every GKE cluster, including CPU and memory utilization, network traffic, and application-specific metrics. This eliminates the need for manual configuration and ensures comprehensive monitoring out of the box.

2. Custom Metrics and Alerts: Users can define custom metrics and alerts tailored to their applications and business requirements. This empowers them to monitor critical aspects of their clusters and receive notifications when predefined thresholds are crossed, enabling timely actions and proactive troubleshooting.

3. Integration with Stackdriver Logging: GKE-Native Monitoring integrates with Stackdriver Logging, allowing users to correlate log data with metrics. By combining logs and metrics, users can gain a holistic view of their application’s behavior and quickly identify the root causes of any issues.

Kubernetes Autoscaling

Kubernetes auto scaling involves dynamically adjusting the number of running pods in a cluster based on current demand. This ensures that applications remain responsive while optimizing resource utilization. With the right configuration, auto scaling can help maintain performance during traffic spikes and reduce costs during low-traffic periods.

**Benefits of Implementing Auto Scaling**

The implementation of Kubernetes auto scaling brings a multitude of benefits to organizations. Firstly, it enhances resource efficiency by ensuring that applications use only the necessary resources, reducing wastage and lowering operational costs. Secondly, it improves application performance and reliability by adapting to traffic fluctuations, ensuring a consistent user experience even during peak demand. Moreover, auto scaling supports rapid scaling for new deployments, enabling businesses to respond swiftly to market changes without manual intervention.

**Challenges and Best Practices**

While Kubernetes auto scaling offers significant advantages, it also presents challenges that organizations need to navigate. One common issue is configuring the right scaling metrics and thresholds to avoid over-provisioning or under-provisioning resources. It’s crucial to thoroughly test and monitor these settings in a staging environment before deploying them to production. Additionally, consider using custom metrics that align closely with your application’s performance indicators for more accurate scaling decisions. Regularly reviewing and updating your scaling policies ensures they remain effective as application workloads evolve.

### Types of Auto Scaling in Kubernetes

Kubernetes offers several types of auto scaling, each designed to address different aspects of resource management. The most commonly used are Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and Cluster Autoscaler.

1. **Horizontal Pod Autoscaler (HPA):** HPA adjusts the number of pod replicas in a deployment or replication controller based on observed CPU utilization or other select metrics. It’s ideal for applications with varying workloads, ensuring that resources are available when needed and conserved when demand is low.

2. **Vertical Pod Autoscaler (VPA):** VPA automatically adjusts the CPU and memory requests and limits for containers within pods. This ensures that each pod has the right amount of resources, preventing over-provisioning and under-provisioning.

3. **Cluster Autoscaler:** This tool automatically adjusts the size of the Kubernetes cluster so that all pods have a place to run. It adds nodes when pods are unschedulable due to resource shortages and removes nodes when they’re underutilized.

### Configuring Auto Scaling in Kubernetes

To leverage Kubernetes auto scaling effectively, you’ll need to configure it to meet your application’s specific needs. The process typically involves setting up metrics and thresholds that trigger scaling actions.

For HPA, you’ll define the target CPU utilization or other custom metrics that the autoscaler should monitor. VPA requires setting up recommendations for resource requests and limits. Finally, the Cluster Autoscaler needs to be linked with your cloud provider to manage node scaling efficiently.
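As a minimal sketch of the HPA case, the manifest below targets 70% average CPU utilization for an assumed Deployment named web, scaling between 2 and 10 replicas; the names and thresholds are illustrative.

```yaml
# HorizontalPodAutoscaler sketch: scale the "web" Deployment between 2 and 10
# replicas to hold average CPU utilization around 70%. Names are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```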

It’s crucial to regularly monitor and adjust these configurations to ensure optimal performance, as application demands can evolve over time.

### Best Practices for Kubernetes Auto Scaling

Implementing auto scaling in Kubernetes is not a set-it-and-forget-it task. Here are some best practices to consider:

– **Understand Your Workloads:** Analyze your application’s workload patterns to choose the right type of auto scaling and set appropriate thresholds.

– **Use Custom Metrics:** While CPU and memory are common metrics, consider using application-specific metrics to drive more accurate scaling decisions.

– **Test and Monitor:** Regularly test your auto scaling configurations in a non-production environment. Continuous monitoring and logging are essential to catch and resolve issues early.

You can dynamically scale up or down any architecture component through autoscaling.

An example of a good use of autoscaling is as follows:

Consider a website whose load surges at the end of the day: you need additional web servers to handle the spike, but those servers cannot sit idle for the rest of the day, especially if you are paying a cloud provider and want to optimize costs. Autoscaling lets you increase the number of components during a spike and scale back down during normal periods.

Example: Prometheus Pull Approach

Many tools can gather metrics, such as Prometheus, and there are several techniques for collecting them, notably the PUSH and PULL approaches, each with pros and cons. Prometheus metric types and its PULL approach are prevalent in the market. However, if you want full observability and controllability, remember that Prometheus is solely a metrics-based monitoring solution. For additional information on monitoring and observability and their differences, see the post on observability vs monitoring.

These autoscalers rely on the Kubernetes metrics server to scale Kubernetes objects up or down.

Adopting Auto-scaling

Autoscaling is a mechanism that automatically adjusts the number of computing resources allocated to an application based on its demand. By dynamically scaling resources up or down, autoscaling enables organizations to handle fluctuating workloads efficiently. However, robust observability is crucial to harness the power of autoscaling truly.

The Role of Observability in Autoscaling

Observability is the ability to gain insights into a system’s internal state based on its external outputs. It plays a pivotal role in understanding the system’s behavior, identifying bottlenecks, and making informed scaling decisions regarding autoscaling. It provides visibility into key metrics like CPU utilization, memory usage, and network traffic. With observability, you can make data-driven decisions and ensure optimal resource allocation.

 Monitoring and Metrics

To achieve effective autoscaling observability, comprehensive monitoring is essential. Monitoring tools collect various metrics, such as response times, error rates, and resource utilization, to provide a holistic view of your infrastructure. These metrics can be analyzed to identify patterns, detect anomalies, and trigger autoscaling actions when necessary. You can proactively address performance issues and optimize resource utilization by monitoring and analyzing metrics.

Logging and Tracing

In addition to monitoring, logging, and tracing are critical components of autoscaling observability. Logging captures detailed information about system events, errors, and activities, enabling you to troubleshoot issues and gain insights into system behavior. Tracing helps you understand the flow of requests across different services. Logging and tracing provide a granular view of your application’s performance, aiding in autoscaling decisions and ensuring smooth operation.

Automation and Alerting

Automation and alerting mechanisms are vital to mastering autoscaling observability. You can configure thresholds and triggers that initiate autoscaling actions based on predefined conditions by setting up automated processes. This allows for proactive scaling, ensuring your system is constantly optimized for performance. Additionally, timely alerts can notify you of critical events or anomalies, enabling you to take immediate action and maintain the desired scalability.

Autoscaling observability is the key to unlocking its true potential. By understanding your system’s behavior through comprehensive monitoring, logging, and tracing, you can make informed decisions and ensure optimal resource allocation. With automation and alerting mechanisms, you can proactively respond to changing demands and maintain high efficiency. Embrace autoscaling observability and take your infrastructure management to new heights.

Managed Instance Groups

### Auto Scaling: Adapting to Your Needs

One of the standout features of managed instance groups is auto scaling. With auto scaling, your infrastructure can dynamically adjust to the current demand. This ensures that your applications have the necessary resources without overspending. By setting up policies based on CPU usage, requests per second, or custom metrics, MIGs can efficiently allocate resources, keeping your applications responsive and your costs under control.

### Observability: Keeping a Close Watch

Observability is key in maintaining the health of your cloud infrastructure. Google Cloud’s managed instance groups provide comprehensive monitoring tools that give you insights into the performance and stability of your instances. By leveraging metrics, logs, and traces, you can detect anomalies, optimize performance, and ensure your applications run smoothly. This proactive approach to monitoring allows you to address potential issues before they impact your services.

### Integration with Google Cloud

Managed instance groups seamlessly integrate with various Google Cloud services, enhancing their utility and flexibility. From load balancing to deploying containerized applications with Google Kubernetes Engine, MIGs work in tandem with Google’s ecosystem to provide a cohesive and powerful cloud solution. This integration not only simplifies management but also boosts the scalability and reliability of your applications.

Managed Instance Group

Related: Before you proceed, you may find the following helpful

  1. Load Balancing
  2. Microservices Observability
  3. Network Functions
  4. Distributed Systems Observability

Autoscaling Observability

Understanding Autoscaling

– Before we discuss observability, let’s briefly explore the concept of autoscaling. Autoscaling refers to the ability of an application or infrastructure to automatically adjust its resources based on demand. It enables organizations to handle fluctuating workloads and optimize resource allocation efficiently.

– Observability, in the context of autoscaling, refers to gaining insights into an autoscaling system’s performance, health, and efficiency. It involves collecting, analyzing, and visualizing relevant data to understand the application and infrastructure’s behavior and patterns.

– Through observability, organizations can make informed decisions to optimize autoscaling algorithms, resource allocation, and overall system performance. To achieve effective autoscaling observability, several critical components come into play. These include:

A. Metrics and Monitoring: Gathering and monitoring key metrics such as CPU utilization, response times, request rates, and error rates is fundamental for understanding the application and infrastructure’s performance.

B. Logging and Tracing: Logging captures detailed information about events and transactions within the system, while tracing provides insights into the flow of requests across various components. Both logging and tracing contribute to a comprehensive understanding of system behavior.

C. Alerting and Thresholds: Setting up appropriate alerts and thresholds based on predefined criteria ensures timely notifications when specific conditions are met.  
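As a hedged illustration of such a threshold, a Prometheus alerting rule can fire when a condition holds for a sustained period; the metric name, job label, and threshold below are assumptions for the example.

```yaml
# Prometheus alerting rule sketch: fire when average CPU utilization for an
# illustrative job stays above 80% for 10 minutes. Metric names are assumptions.
groups:
  - name: autoscaling-observability
    rules:
      - alert: HighCpuUtilization
        expr: avg(rate(process_cpu_seconds_total{job="api"}[5m])) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high CPU on job=api; check autoscaler behaviour"
```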

Tools and Technologies for Autoscaling Observability

A wide range of tools and technologies are available to facilitate autoscaling observability. Prominent examples include Prometheus, Grafana, Elasticsearch, Kibana, and CloudWatch. These tools provide robust monitoring, visualization, and analysis capabilities, enabling organizations to gain deep insights into their autoscaling systems.

The first component of observability is the channels that convey observations to the observer. There are three channels: logs, traces, and metrics. These channels are common to all areas of observability, including data observability.

1. Logs: Logs are the most typical channel and take several forms (e.g., a line of free text or JSON). They are intended to encapsulate information about an event.

2. Traces: Traces allow you to do what logs cannot: reconnect the dots of a process. Because traces represent the link between all events of the same process, they allow the whole context to be derived from logs efficiently. Each pair of events, an operation, is a span that can be distributed across multiple servers.

3. Metrics: Finally, we have metrics. Every system state has some component that can be represented with numbers, and these numbers change as the state changes.

Understanding VPC Flow Logs

VPC Flow Logs capture information about the IP traffic going in and out of Virtual Private Clouds (VPCs) within Google Cloud. Enabling VPC Flow Logs allows you to gain visibility into network traffic at the subnet level, thereby facilitating network troubleshooting, security analysis, and performance monitoring.

Once the VPC Flow Logs are enabled and data starts flowing in, it’s time to tap into the potential of Google Cloud Logging. Using the appropriate filters and queries, you can sift through the vast amount of log data and extract meaningful insights. Whether it’s identifying suspicious traffic patterns, monitoring network performance metrics, or investigating security incidents, Google Cloud Logging provides a robust set of tools to facilitate these analyses.
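As a rough sketch of querying those flow records programmatically, the snippet below uses the google-cloud-logging Python client with a filter on the VPC Flow Logs log name. Treat the filter string, credentials setup, and field access as assumptions to verify against your own project and the current client documentation.

```python
from google.cloud import logging

client = logging.Client()  # uses Application Default Credentials

# VPC Flow Logs are written against the gce_subnetwork resource type
flow_filter = (
    'resource.type="gce_subnetwork" '
    'AND logName:"compute.googleapis.com%2Fvpc_flows"'
)

count = 0
for entry in client.list_entries(filter_=flow_filter):
    flow = entry.payload or {}               # structured flow record (dict-like)
    conn = flow.get("connection", {})
    print(conn.get("src_ip"), "->", conn.get("dest_ip"),
          "bytes:", flow.get("bytes_sent"))
    count += 1
    if count >= 20:                          # sample a handful of records
        break
```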

Auto Scaling Observability

**Metrics: Resource Utilization Only**

– Metrics help us understand resource utilization. In a Kubernetes environment, these metrics are used for auto-healing and auto-scheduling. Monitoring performs several functions when it comes to metrics. First, it can collect, aggregate, and analyze metrics to identify known patterns that indicate troubling trends.

– The critical point here is that it sifts through known patterns. Then, based on a known event, metrics trigger alerts that notify us when further investigation is needed. Finally, we have dashboards that display the metrics data trends adapted for visual consumption.

– These monitoring systems work well for identifying previously encountered failures, but they don’t help as much with the unknown. Unknown failures are the norm today, with distributed systems and complex system interactions.

– Metrics are suitable for dashboards, but there won’t be a predefined dashboard for unknowns, as it can’t track something it does not know about. Using metrics and dashboards like this is a reactive approach, yet it’s widely accepted as the norm. Monitoring is a reactive approach best suited for detecting known problems and previously identified patterns. 

**Metrics and intermittent problems**

– The metrics can help you determine whether a microservice is healthy or unhealthy within a microservices environment. Still, a metric will have difficulty telling you whether a microservice’s function takes a long time to complete or whether there is an intermittent problem with an upstream or downstream dependency. So, we need different tools to gather this type of information.

– We have an issue with auto-scaling metrics because they only look at individual microservices with a given set of attributes, so they don’t give you a holistic view of the problem. The application stack now exists in numerous locations and location types, and we need a holistic viewpoint.

– A metric does not give us this. Metrics track simplistic system states that may indicate a service is running poorly or may act as a leading indicator or early warning signal. However, while those measures are easy to collect, they don’t turn out to be good measures for triggering alerts.

Latency & Cloud Trace

Latency, in the context of applications, refers to the time it takes for a request to travel from the user to the server and back. It is influenced by various factors, such as network delays, server processing time, and database queries. Understanding latency is essential for developers to identify bottlenecks and optimize their applications for better performance.

Google Cloud Trace is a powerful tool provided by Google Cloud Platform that allows developers to analyze and diagnose application latency issues. By integrating Cloud Trace into their applications, developers can gain valuable insights into their code’s performance and identify areas for improvement.

Developers need to capture traces to analyze application latency effectively. Traces provide a detailed record of a request’s execution path, allowing developers to pinpoint the exact areas where latency occurs. With Cloud Trace, developers can easily capture and visualize traces in a user-friendly interface.
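A minimal way to start producing such traces is with the OpenTelemetry Python SDK, sketched below. It prints spans to the console; swapping the exporter for a Google Cloud Trace exporter package would send the same spans to Cloud Trace. The span names and simulated work are assumptions for illustration.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider with a console exporter for local inspection
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")   # hypothetical service name

with tracer.start_as_current_span("handle_request"):
    with tracer.start_as_current_span("query_database"):
        time.sleep(0.05)                        # latency shows up in this child span
    with tracer.start_as_current_span("call_payment_api"):
        time.sleep(0.12)
```

Each nested span records where the time in the request was spent, which is exactly the execution path view described above.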

Auto-scaling metrics: Issues with dashboards

Useful only for a few metrics

So, these metrics are gathered and stored in time-series databases, and we have several dashboards to display them. When these dashboards were first built, there weren’t many system metrics to worry about. You could have gotten away with 20 or so dashboards, but that was about it.

As a result, it was easy to see the critical data anyone should know about for any given service. Moreover, those systems were simple and did not have many moving parts. This contrasts with modern services that typically collect so many metrics that fitting them into the same dashboard is impossible.

Issues with aggregate metrics

So, we must find ways to fit all the metrics into a few dashboards. Here, the metrics are often pre-aggregated and averaged. However, the issue is that the aggregate values no longer provide meaningful visibility, even when we have filters and drill-downs. Therefore, we need to predeclare the conditions we expect to see in the future.

This is where we fall back on instinctual practices based on past experiences and rely on gut feeling. Remember the network and software hero? It is better to avoid aggregation and averaging within the metrics store. Percentiles, on the other hand, offer a richer view. Keep in mind, however, that they require raw data.
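The difference between averages and percentiles is easy to see with a few raw samples. The latency numbers below are made up purely to show why the percentile needs the raw data that pre-aggregation throws away.

```python
import statistics

# 95 fast requests and 5 very slow ones (illustrative values, in milliseconds)
latencies = [50] * 95 + [2000] * 5

mean = statistics.mean(latencies)                   # ~148 ms: looks acceptable
p95 = statistics.quantiles(latencies, n=100)[94]    # ~1900 ms: exposes the tail

print(f"mean = {mean:.0f} ms, p95 = {p95:.0f} ms")
# Once samples are pre-averaged in the metrics store, the p95 can no longer
# be recovered, which is why percentiles need the raw data.
```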

**Auto Scaling Observability: Ask Any Question**

A: ) Auto-scaling observability takes an entirely different approach. Its practitioners strive for exploratory methods to find problems. Essentially, those operating observability systems don’t sit back and wait for an alert or for something to happen. Instead, they are always actively looking, asking ad hoc questions of the observability system.

B: ) Observability tools should gather rich telemetry for every possible event, capture the full content of every request, and then be able to store and query it. In addition, these new auto-scaling observability tools are specifically designed to query against high-cardinality data. High cardinality allows you to interrogate your event data in any arbitrary way you see fit: you can ask any question about your system and inspect its corresponding state.

**No predictions in advance**

C: ) Due to the nature of modern software systems, you want to understand any inner state and services without anticipating or predicting them in advance. For this, we need to gain valuable telemetry and use some new tools and technological capabilities to gather and interrogate this data once it has been collected. Telemetry needs to be constantly gathered in flexible ways to debug issues without predicting how failures may occur. 

D: ) Conditions affecting infrastructure health change infrequently and are relatively straightforward to monitor. In addition, we have several well-established practices for prediction, such as capacity planning, and for automatic remediation, e.g., auto-scaling in a Kubernetes environment. All of these can be used to tackle these types of known issues.

E: ) Because infrastructure is relatively predictable and slowly changing, the aggregated metrics approach monitors and alerts perfectly well for infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated signals help you see when capacity limits or known error conditions of underlying systems are being reached.

F: ) So, metrics-based systems work well for infrastructure problems that don’t change much but fall dramatically short in complex distributed systems. For these systems, you should opt for an observability and controllability platform. 

Summary: Autoscaling Observability

Auto-scaling has revolutionized how we manage cloud resources, allowing us to adjust capacity dynamically based on demand. However, ensuring optimal performance and efficiency in auto-scaling environments requires proper observability. In this blog post, we explored the importance of auto-scaling observability and how it can enhance the overall effectiveness of your infrastructure.

Understanding Auto Scaling Observability

To truly grasp the significance of auto scaling observability, we must first understand what it entails. Auto-scaling observability refers to monitoring and gaining insights into the behavior and performance of auto-scaling groups and their associated resources. It involves collecting and analyzing various metrics, logs, and events to gain a comprehensive view of your infrastructure’s health and performance.

Key Metrics for Auto-Scaling Observability

When it comes to auto scaling observability, specific metrics play a crucial role in assessing the efficiency and performance of your infrastructure. Metrics like average CPU utilization, network throughput, and request latency can provide valuable insights into resource utilization, bottlenecks, and overall system health. Monitoring these metrics enables you to make informed decisions about scaling actions and resource allocation.

Implementing Effective Monitoring and Alerting

To achieve optimal auto-scaling observability, robust monitoring and alerting mechanisms are essential. This involves setting up monitoring tools that can collect and analyze relevant metrics in real time. Configuring intelligent alerting systems can notify you of any anomalies or issues requiring immediate attention. By proactively monitoring and alerting, you can identify and address potential problems before they impact your system’s performance.

Leveraging Logging and Tracing for Deep Insights

Besides metrics, logging and tracing provide deeper insights into the behavior and interactions within your auto-scaling environment. By capturing detailed logs and tracing requests across various services, you can gain visibility into the data flow and identify potential bottlenecks or errors. Proper logging and tracing practices can help troubleshoot issues, optimize performance, and enhance the overall reliability of your infrastructure.

Scaling Observability for Greater Efficiency

Auto-scaling observability is not a one-time setup; it requires continuous refinement and scaling alongside your infrastructure. As your system evolves, adapting your observability practices to match the changing demands is crucial. This may involve configuring additional monitoring tools, fine-tuning alert thresholds, or expanding log retention. By scaling observability in parallel with your infrastructure, you can ensure its efficiency and effectiveness in the long run.

Conclusion

In conclusion, auto scaling observability is critical to managing and optimizing cloud resources. Investing in proper monitoring, alerting, logging, and tracing practices can unlock the full potential of auto scaling. Improved observability leads to enhanced efficiency, performance, and reliability of your infrastructure, ultimately enabling you to provide better experiences to your users.

ACI Networks

ACI Networks

ACI networks, short for Application Centric Infrastructure networks, have emerged as a game-changer in the realm of connectivity. With their innovative approach to networking architecture, ACI networks have opened up new possibilities for businesses and organizations of all sizes. In this blog post, we will explore the key features and benefits of ACI networks, delve into the underlying technology, and highlight real-world use cases that showcase their transformative potential.

ACI networks are built on the principle of application-centricity, where the network infrastructure is designed to align with the needs of the applications running on it. By focusing on applications rather than traditional network components, ACI networks offer improved scalability, agility, and automation. They enable organizations to seamlessly manage and optimize their network resources, resulting in enhanced performance and efficiency.

1. Policy-Based Automation: ACI networks allow administrators to define policies that automatically govern network behavior, eliminating the need for manual configurations and reducing human errors. This policy-driven approach simplifies network management and accelerates application deployment.

2. Scalability and Flexibility: ACI networks are designed to scale and adapt to changing business requirements seamlessly. Whether it's expanding your network infrastructure or integrating with cloud environments, ACI networks offer the flexibility to accommodate growth and evolving needs.

3. Enhanced Security: ACI networks incorporate advanced security measures, such as microsegmentation and end-to-end encryption, to protect critical assets and sensitive data. These built-in security features provide organizations with peace of mind in an increasingly interconnected digital landscape.

ACI networks have found application across various industries and sectors. Let's explore a few real-world use cases that highlight their versatility:

1. Data Centers: ACI networks have revolutionized data center management by simplifying network operations, increasing agility, and streamlining service delivery. With ACI networks, data center administrators can provision resources rapidly, automate workflows, and ensure optimal performance for critical applications.

2. Multi-Cloud Environments: In today's multi-cloud era, ACI networks facilitate seamless connectivity and consistent policies across different cloud platforms. They enable organizations to build a unified network fabric, simplifying workload migration, enhancing visibility, and ensuring consistent security policies.

3. Financial Services: The financial services industry demands robust and secure networks. ACI networks provide the necessary foundation for high-frequency trading, real-time analytics, and secure transactions. They offer low-latency connectivity, improved compliance, and simplified network management for financial institutions.

ACI networks have emerged as a transformative force in the world of networking, offering a host of benefits such as policy-based automation, scalability, flexibility, and enhanced security. With their application-centric approach, ACI networks enable organizations to unlock new levels of efficiency, agility, and performance. Whether in data centers, multi-cloud environments, or the financial services industry, ACI networks are paving the way for a more connected and empowered future.

Highlights: ACI Networks

ACI Main Components

- APIC controllers and the underlay network infrastructure are the main components of ACI. Due to specialized forwarding chips, hardware-based underlay switching in ACI has a significant advantage over software-only solutions.

- As a result of Cisco’s own ASIC development, ACI has many advanced features, including security policy enforcement, microsegmentation, dynamic policy-based redirection (allowing external L4-L7 service devices to be inserted into the data path), and detailed flow analytics, in addition to performance and flexibility.

- ACI underlays require Nexus 9000 switches exclusively. There are Nexus 9500 modular switches and Nexus 9300 fixed 1U to 2U models available. Certain models and line cards handle the spine function in the ACI fabric, others handle the leaf role, and some can serve both functions at the same time. There is no restriction on combining different leaf switch models within the same fabric.

Cisco Data Center Design

**The rise of virtualization**

Virtualization is creating a virtual — rather than actual — version of something, such as an operating system (OS), a server, a storage device, or network resources. Virtualization uses software that simulates hardware functionality to create a virtual system.

Virtualization was initially developed during the mainframe era. With virtualization, a virtual machine can exist on any host, so Layer 2 had to be extended to every switch.

This was problematic for larger networks, as the core switch had to learn every MAC address for every flow that traversed it. To overcome this and take advantage of the convergence and stability of Layer 3 networks, overlay networks became the choice for data center networking, along with the introduction of control plane technologies such as EVPN.

**The Cisco Data Center Design Transition**

The Cisco data center design has gone through several stages. First, we started with the Spanning Tree, moved to the Spanning Tree with vPCs, and then replaced the Spanning Tree with FabricPath. FabricPath is what is known as a MAC-in-MAC Encapsulation.

Today, in the data center, VXLAN is the de facto overlay protocol for data center networking. The Cisco ACI uses an enhanced version of VXLAN to implement both Layer 2 and Layer 3 forwarding with a unified control plane. Replacing Spanning Tree with VXLAN, where we have a MAC-in-IP encapsulation, was a welcome milestone for data center networking.

**Overlay networking with VXLAN**

VXLAN is an encapsulation protocol that provides data center connectivity using tunneling to stretch Layer 2 connections over an underlying Layer 3 network. VXLAN is the most commonly used protocol in data centers to create a virtual overlay solution that sits on top of the physical network, enabling virtual networks. The VXLAN protocol supports the virtualization of the data center network while addressing the needs of multi-tenant data centers by providing the necessary segmentation on a large scale.

Here, we are encapsulating traffic into a VXLAN header and forwarding between VXLAN tunnel endpoints, known as the VTEPs. With overlay networking, we have the concepts of the overlay and the underlay. By encapsulating the traffic into the overlay VXLAN, we now use the underlay, which in the ACI is provided by IS-IS, to provide Layer 3 stability and redundant paths using Equal Cost Multipathing (ECMP), along with the fast convergence of routing protocols.

Example: Point to Point GRE

Cisco ACI Overview

Introduction to the ACI Networks

The base of the ACI network is the Cisco Application Centric Infrastructure Fabric (ACI), the Cisco SDN solution for the data center. Cisco has taken a different path from the centralized control-plane SDN approach of other vendors and has created a scalable data center solution that can be extended to multiple on-premises, public, and private cloud locations.

The ACI networks have many components, including Cisco Nexus 9000 Series switches with the APIC Controller running in the spine leaf architecture ACI fabric mode. These components form the building blocks of the ACI, supporting a dynamic integrated physical and virtual infrastructure.

Enhanced Scalability and Flexibility:

One of the critical advantages of ACI networks is their ability to scale and adapt to changing business needs. Traditional networks often struggle to accommodate rapid growth or dynamic workloads, leading to performance bottlenecks. ACI networks, on the other hand, offer seamless scalability and flexibility, allowing businesses to quickly scale up or down as required without compromising performance or security.

Simplified Network Operations:

Gone are the days of manual network configurations and time-consuming troubleshooting. ACI networks introduce a centralized management approach, where policies and structures can be defined and automated across the entire network infrastructure. This simplifies network operations, reduces human errors, and enables IT teams to focus on strategic initiatives rather than mundane tasks.

Enhanced Security:

Network security is paramount in today’s threat landscape. ACI networks integrate security as a foundational element rather than an afterthought. With ACI’s microsegmentation capabilities, businesses can create granular security policies and isolate workloads, effectively containing potential threats and minimizing the impact of security breaches. This approach ensures that critical data and applications remain protected despite evolving cyber threats.

**Real-World Use Cases of ACI Networks**

  • Data Centers and Cloud Environments:

ACI networks have revolutionized data center and cloud environments, enabling businesses to achieve unprecedented agility and efficiency. By providing a unified management platform, ACI networks simplify data center operations, enhance workload mobility, and optimize resource utilization. Furthermore, ACI’s seamless integration with cloud platforms ensures consistent network policies and security across hybrid and multi-cloud environments.

  • Network Virtualization and Automation:

ACI networks are a game-changer for network virtualization and automation. By abstracting network functionality from physical hardware, ACI enables businesses to create virtual networks, provision services on-demand, and automate network operations. Streamlining network deployments accelerates service delivery, reduces costs, and improves overall performance.

Recap: Traditional Data Center 

Firstly, the Cisco data center design traditionally built our networks based on hierarchical data center topologies. This is often referred to as the traditional data center, which has a three-tier structure with an access layer, an aggregation layer, and a core layer. Historically, this design enabled substantial predictability because aggregation switch blocks simplified the spanning-tree topology. In addition, the need for scalability often pushed this design into modularity, which increased predictability further.

Recap: The Challenges

However, although we increased predictability, the main challenge inherent in the three-tier models is that they were difficult to scale. As the number of endpoints increases and the need to move between segments increases, we need to span layer 2. This is a significant difference between the traditional and the ACI data centers.

Related: For pre-information, you may find the following post helpful:

  1. Data Center Security 

ACI Networks

The Journey to ACI

Our journey towards ACI started in the early 1990s, when we examined the most traditional and well-known two- or three-tier network architecture. This Core/Aggregation/Access design was generally used and recommended for campus enterprise networks.

Layer 2 Connectivity:

At that time and in that environment, it delivered sufficient quality for typical client-server types of applications. The traditional design taken from campus networks was based on Layer 2 connectivity between all network parts, segmentation was implemented using VLANs, and the loop-free topology relied on the Spanning Tree Protocol (STP).

STP Limitations:

Scaling such an architecture implies growing broadcast and failure domains, which does not benefit the resulting performance and stability. For instance, picture each STP Topology Change Notification (TCN) message causing MAC tables to age out across the whole data center for a particular VLAN, followed by excessive BUM (Broadcast, Unknown Unicast, Multicast) traffic flooding until all MACs are relearned.

Diagram: Spanning Tree root switch and STP port states

**Designing around STP**

Before we delve into the Cisco ACI overview, let us first address some basics around STP design. The traditional Cisco data center design often leads to poor network design and human error. You don’t want a Layer 2 segment spanning the data center unless you have the proper controls.

Although modularization is still desired in networks today, the general trend has been to move away from this design type, which evolves around a spanning tree, to a more flexible and scalable solution with VXLAN and other similar Layer 3 overlay technologies. In addition, the Layer 3 overlay technologies bring a lot of network agility, which is vital to business success.

VXLAN overlay

Agility refers to making changes, deploying services, and supporting the business at its desired speed. This means different things to different organizations. For example, a network team can be considered agile if it can deploy network services in a matter of weeks.

In others, it could mean that business units in a company should be able to get applications to production or scale core services on demand through automation with Ansible CLI or Ansible Tower.

Regardless of how you define agility, there is little disagreement with the idea that network agility is vital to business success. The problem is that network agility has traditionally been hard to achieve until now with the ACI data center. Let’s recap some of the leading Cisco data center design transitions to understand fully.

**Challenge: – Layer 2 to the Core**

The traditional data center has gone through several transitions. Firstly, we had Layer 2 to the core: from the access layer to the core, we ran Layer 2, not Layer 3. A design like this would, for example, trunk all VLANs to the core, and for redundancy, you would manually prune VLANs from the different trunk links.

Our challenge with this approach of having Layer 2 to the core is that it relies on the Spanning Tree Protocol. Therefore, redundant links are blocked. As a result, we don’t have the total bandwidth, leading to performance degradation and resource waste. Another challenge is that we rely on STP convergence to repair the topology after a change.

Data Center Design and Stability – Layer 2 to the Core Layer:

  • STP blocks redundant links
  • Manual pruning of VLANs
  • STP for topology changes
  • Efficient design

Spanning Tree Protocol does have timers to limit convergence time and can be tuned for better performance. Still, we rely on Spanning Tree Protocol convergence to repair the topology, and Spanning Tree was never meant to be a routing protocol.

Protocols operating higher up in the stack are designed and optimized to react to changes in the topology. STP, however, is not an optimized control plane protocol, which significantly hinders the traditional data center. You could relate this to how VLANs have transitioned into a security feature, even though their original purpose was performance.

**Required: – Routing to Access Layer**

To overcome these challenges and build stable data center networks, the Layer 3 boundary is pushed further to the network’s edge. Layer 3 networks can use the advances in routing protocols to handle failures and link redundancy much more efficiently.

It is a lot more efficient than Spanning Tree Protocol, which should never have been there in the first place. Then we had routing at the access. With this design, we can eliminate the Spanning Tree Protocol to the core and then run Equal Cost MultiPath (ECMP) from the access to the core.

We can run ECMP because we are now Layer 3 routing from the access to the core layer instead of running STP, which blocks redundant links. Equal-cost multipath (ECMP) routes offer a simple way to share the network load by distributing traffic onto multiple paths.

ECMP is typically applied only to entire flows or sets of flows. Destination address, source address, transport-level ports, and payload protocol may characterize a flow in this respect.
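A toy sketch of that idea: hash the flow’s five-tuple and use the result to pick one of the equal-cost paths, so every packet of a flow follows the same path (preserving ordering) while different flows spread across the fabric. Real switches do this in hardware with their own hash functions; the code below is only illustrative.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, protocol, src_port, dst_port, paths):
    """Select an equal-cost path by hashing the flow's five-tuple."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
# Every packet of this flow hashes to the same spine; other flows spread out
print(ecmp_next_hop("10.0.1.10", "10.0.2.20", "tcp", 49152, 443, spines))
```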

Data Center Design and Stability – Layer 3 to the Core Layer:

  • Routing protocol stability
  • Automatic routing convergence
  • STP for topology changes
  • Efficient design

**Key Point: – Equal Cost MultiPath (ECMP)**

Equal-cost Multipath (ECMP) has many advantages. First, ECMP gives us total bandwidth with equal-cost links. As we are routing, we no longer have to block redundant links to prevent loops at Layer 2. However, we still have Layer 2 in the network design and Layer 2 on the access layer; therefore, parts of the network will still rely on the Spanning Tree Protocol, which converges when there is a change in the topology.

So we may have Layer 3 from the access to the core, but we still have Layer 2 connections at the edge and rely on STP to block redundant links to prevent loops. Another potential drawback is that having smaller Layer 2 domains can limit where the application can reside in the data center network, which drives more of a need to transition from the traditional data center design.

The Layer 2 domain that the applications may use could be limited to a single server rack connected to one ToR or two ToR for redundancy with a Layer 2 interlink between the two ToR switches to pass the Layer 2 traffic.

These designs are not optimal, as you must decide in advance where your applications are placed, which limits agility. As a result, another critical Cisco data center design transition was the introduction of overlay data center designs.

**The Cisco ACI version**

Before Cisco ACI 4.1, the Cisco ACI fabric allowed only a two-tier (spine-and-leaf switch) topology. Each leaf switch is connected to every spine switch in the network, and there is no interconnection between leaf switches or spine switches.

Starting from Cisco ACI 4.1, the Cisco ACI fabric allows a multitier (three-tier) fabric and two tiers of leaf switches, which provides the capability for vertical expansion of the Cisco ACI fabric. This is useful for migrating a traditional three-tier architecture of core aggregation access that has been a standard design model for many enterprise networks and is still required today.

ACI fabric Details
Diagram: Cisco ACI fabric Details

The APIC Controller:

The ACI networks are driven by the Cisco Application Policy Infrastructure Controller ( APIC) database, which works in a cluster from the management perspective. The APIC is the centralized control point; you can configure everything in the APIC.

Consider the APIC to be the brains of the ACI fabric; it serves as the single source of truth for configuration within the fabric. The APIC controller is a policy engine and holds the defined policy, which tells the other elements in the ACI fabric what to do. This database allows you to manage the network as a single entity.

In summary, the APIC is the infrastructure controller and is the main architectural component of the Cisco ACI solution. It is the unified point of automation and management for the Cisco ACI fabric, policy enforcement, and health monitoring. The APIC is not involved in data plane forwarding.
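As a rough illustration of this policy-driven model, the sketch below uses Python’s requests library to authenticate to an APIC and post a tenant object through its REST API. The endpoint paths and the fvTenant class name follow the commonly documented APIC object model, but treat the hostname, credentials, and exact payloads as assumptions to adapt and verify against your own environment.

```python
import requests

APIC = "https://apic.example.com"          # hypothetical APIC address
session = requests.Session()

# 1. Authenticate: the APIC returns a session cookie on successful login
login = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}
session.post(f"{APIC}/api/aaaLogin.json", json=login, verify=False)

# 2. Push declarative intent: create (or update) a tenant object
tenant = {"fvTenant": {"attributes": {"name": "demo-tenant"}}}
resp = session.post(f"{APIC}/api/mo/uni.json", json=tenant, verify=False)
print(resp.status_code, resp.text[:120])
```

In practice, teams usually drive this through the APIC GUI or automation tooling rather than raw REST calls, but the declarative object model underneath is the same idea.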

data center layout
Diagram: Data center layout: The Cisco APIC controller

The APIC represents the management plane, allowing the system to maintain the control and data planes in the network. The APIC is not the control plane device, nor does it sit in the data traffic path. Remember that the APIC controller can crash, and the fabric will still forward traffic. The ACI solution is not an SDN centralized control plane approach. The ACI is a distributed fabric with independent control planes on all fabric switches.

The Leaf and Spine 

Leaf-spine is a two-layer data center network topology for data centers that experience more east-west network traffic than north-south traffic. The topology comprises leaf switches (to which servers and storage connect) and spine switches (to which leaf switches connect).

In this two-tier Clos architecture, every lower-tier switch (leaf layer) is connected to each top-tier switch (Spine layer) in a full-mesh topology. The leaf layer consists of access switches connecting to devices like servers.

The Spine layer is the network’s backbone and interconnects all Leaf switches. Every Leaf switch connects to every spine switch in the fabric. The path is randomly chosen, so the traffic load is evenly distributed among the top-tier switches. Therefore, if one of the top-tier switches fails, it would only slightly degrade performance throughout the data center.

SDN data center
Diagram: Cisco ACI fabric checking.

Unlike the traditional Cisco data center design, the ACI data center operates with a Leaf and Spine architecture. Traffic sent from an end host now enters the fabric through a Leaf device.

We also have the Spine devices, which are Layer 3 routers with no unique hardware dependencies. In a basic Leaf and Spine fabric, every Leaf is connected to every Spine. Any endpoint in the fabric is always the same distance, in terms of hops and latency, from every other internal endpoint.

The ACI Spine switches are Clos intermediary switches with many vital functions. Firstly, they exchange routing updates with leaf switches via Intermediate System-to-Intermediate System (IS-IS) and rapidly forward packets between them. They also provide endpoint lookup services to leaf switches through the Council of Oracle Protocol (COOP) and handle route reflection to the leaf switches using Multiprotocol BGP (MP-BGP).

Cisco ACI Overview
Diagram: Cisco ACI Overview.

The Leaf switches are the ingress/egress points for traffic into and out of the ACI fabric. They also provide end-host connectivity and are the connectivity points for the various endpoints that the Cisco ACI supports.

The spines act as a fast, non-blocking Layer 3 forwarding plane that supports Equal Cost Multipathing (ECMP) between any two endpoints in the fabric and uses overlay protocols such as VXLAN under the hood. VXLAN enables any workload to exist anywhere in the fabric, so we can now have workloads anywhere in the fabric without introducing too much complexity.

Required: ACI data center and ACI networks

This is a significant improvement to data center networking. We can now have physical or virtual workloads in the same logical Layer 2 domain, even running Layer 3 down to each ToR switch. The ACI data center is a scalable solution as the underlay is specifically built to be scalable as more links are added to the topology and resilient when links in the fabric are brought down due to, for example, maintenance or failure. 

ACI Networks: The Normalization event

VXLAN is an industry-standard protocol that extends Layer 2 segments over Layer 3 infrastructure to build Layer 2 overlay logical networks. The ACI infrastructure Layer 2 domains reside in the overlay, with isolated broadcast and failure bridge domains. This approach allows the data center network to grow without risking creating too large a failure domain. All traffic in the ACI fabric is normalized as VXLAN packets.

**Encapsulation Process**

ACI encapsulates external VLAN, VXLAN, and NVGRE packets in a VXLAN packet at the ingress. This is known as ACI encapsulation normalization. As a result, the forwarding in the ACI data center fabric is not limited to or constrained by the encapsulation type or overlay network. If necessary, the ACI bridge domain forwarding policy can be defined to provide standard VLAN behavior where required.
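To make the normalization idea more tangible, the sketch below builds the 8-byte VXLAN header described in RFC 7348 and prepends it to an inner Ethernet frame. The outer IP/UDP headers that a real VTEP would add are omitted for brevity, and the sample VNI and dummy frame are assumptions.

```python
import struct

VXLAN_UDP_PORT = 4789      # IANA-assigned destination port for VXLAN
VNI_PRESENT_FLAG = 0x08    # "I" flag: a valid VNI follows

def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend an RFC 7348 VXLAN header (flags, reserved bits, 24-bit VNI)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    header = struct.pack("!BBHI", VNI_PRESENT_FLAG, 0, 0, vni << 8)
    return header + inner_frame

inner = b"\x00" * 64                       # placeholder inner Ethernet frame
packet = vxlan_encapsulate(vni=10010, inner_frame=inner)
print(len(packet))                         # 8-byte header + 64-byte frame = 72
```

Whatever encapsulation arrives at the leaf, the fabric carries it forward in this normalized VXLAN form until it is decapsulated at the egress leaf.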

**Making traffic ACI-compatible**

As a final note in this Cisco ACI overview, let us address the normalization process. When traffic hits the Leaf, there is a normalization event. The normalization takes traffic from the servers to the ACI, making it ACI-compatible. Essentially, we are giving traffic sent from the servers a VXLAN ID to be sent across the ACI fabric.

Traffic is normalized, encapsulated with a VXLAN header, and routed across the ACI fabric to the destination Leaf, where the destination endpoint is. This is, in a nutshell, how the ACI Leaf and Spine work. We have a set of leaf switches that connect to the workloads and the spines that connect to the Leaf.

**VXLAN: Overlay Protocol** 

VXLAN is the overlay protocol that carries data traffic across the ACI data center fabric. A key point of this type of architecture is that the Layer 3 boundary is moved to the Leaf. This brings a lot of value and benefits to data center design: we can route and encapsulate at the leaf layer without having to carry traffic up to a core layer.

ACI networks are revolutionizing how businesses connect and operate in the digital age. Focusing on application-centric infrastructure, they offer enhanced scalability, simplified network operations, and top-notch security. By leveraging ACI networks, businesses can unleash the full potential of their network infrastructure, ensuring seamless connectivity and staying ahead in today’s competitive landscape.

Summary: ACI Networks

Application Centric Infrastructure (ACI) has emerged as a game changer in the ever-evolving networking landscape. It revolutionizes the way networks are designed, deployed, and managed. This blog post will delve into ACI networking, exploring its key features, benefits, and considerations.

Understanding ACI Networking

ACI networking is a holistic approach that combines software-defined networking (SDN) and policy-driven automation. It provides a centralized platform where physical and virtual networks seamlessly coexist. By decoupling network control from the underlying infrastructure, ACI brings unprecedented flexibility and agility to network administrators.

Key Features of ACI Networking

ACI networking offers rich features that empower organizations to build scalable, secure, and intelligent networks. Some of the key features include:

– Application Policy Infrastructure Controller (APIC): The brains behind ACI, APIC allows administrators to define network policies and automate network provisioning.

– Application Network Profiles: These profiles capture applications’ unique requirements, enabling granular control and policy enforcement at the application level.

– Fabric Extenders: These devices extend the fabric and connect endpoints to the ACI infrastructure.

– Microsegmentation: ACI enables microsegmentation, which provides enhanced security by isolating workloads and preventing lateral movement within the network.

Benefits of ACI Networking

The adoption of ACI networking brings numerous benefits to organizations of all sizes. Some key advantages include:

– Simplified Management: ACI’s centralized management platform streamlines network operations, reducing complexity and improving efficiency.

– Enhanced Security: With microsegmentation and policy-based enforcement, ACI strengthens network security, protecting against threats and unauthorized access.

– Scalability and Flexibility: ACI’s scalable architecture and programmable nature allow organizations to adapt to evolving business needs and scale their networks effortlessly.

Considerations for Implementing ACI Networking

While ACI networking offers compelling advantages, a successful implementation requires careful planning and consideration. Some important factors to consider include:

– Infrastructure readiness: Ensure your network infrastructure is compatible with ACI and meets the requirements.

– Training and expertise: Invest in training your IT team to understand and leverage the full potential of ACI networking.

– Migration strategy: If transitioning from traditional networking to ACI, develop a well-defined migration strategy to minimize disruptions.

Conclusion:

ACI networking represents a paradigm shift in network architecture, enabling organizations to achieve greater agility, security, and scalability. By embracing ACI, businesses can unlock the power of automation, simplify network management, and future-proof their infrastructure. As networking continues to evolve, ACI stands at the forefront, paving the way for a more efficient and intelligent network ecosystem.


Service Level Objectives (SLOs): Customer-centric view

Service Level Objectives (SLOs)

In today's rapidly evolving digital landscape, businesses are increasingly reliant on technology to deliver their products and services. To ensure a seamless user experience, organizations set specific performance targets known as Service Level Objectives (SLOs). This blog post will delve into the importance of SLOs, their key components, and best practices for optimizing performance.

SLOs are measurable goals that define the level of service a system should provide. They encompass various metrics such as availability, latency, and error rates. By setting SLOs, businesses establish performance benchmarks that align with user expectations and business objectives.

Metrics: Identify the specific metrics that align with your service and user expectations. For instance, a video streaming platform might consider buffer time and playback quality as critical metrics.

Targets: Set realistic and attainable targets for each metric. These targets should strike a balance between meeting user expectations and resource utilization.

Time Windows: Define the time windows over which SLOs are measured. This ensures that performance is evaluated consistently and provides a meaningful representation of the user experience.

Monitoring and Alerting: Implement robust monitoring and alerting systems to track performance in real-time. This allows proactive identification and resolution of potential issues before they impact users.

Capacity Planning: Conduct thorough capacity planning to ensure that resources are adequately provisioned to meet SLO targets. This involves assessing current usage patterns, forecasting future demand, and scaling infrastructure accordingly.

Continuous Improvement: Regularly evaluate and refine SLOs based on user feedback and evolving business needs. Embrace a culture of continuous improvement to drive ongoing enhancements in performance.

In today's competitive landscape, meeting user expectations for performance is paramount. Service Level Objectives (SLOs) serve as a crucial tool for organizations to define, measure, and optimize performance. By understanding the key components of SLOs and implementing best practices, businesses can deliver exceptional user experiences, enhance customer satisfaction, and achieve their performance goals.


Highlights: Service Level Objectives (SLOs)

Understanding Service Level Objectives

A ) Service Level Objectives, or SLOs, refer to predefined goals that outline the level of service a company aims to provide to its customers. They act as measurable targets that help organizations assess and improve service quality. To ensure consistent and reliable service delivery, SLOs define various metrics and performance indicators, such as response time, uptime, error rates, etc.

B ) Implementing SLOs offers numerous benefits for both businesses and customers. First, they provide a clear framework for service expectations, giving customers a transparent understanding of what to anticipate. Second, SLOs enable companies to align their internal goals with customer needs, fostering a customer-centric approach. Additionally, SLOs are crucial in driving accountability and continuous improvement within organizations, leading to enhanced operational efficiency and customer satisfaction.

C ) Setting Effective SLOs: To set effective SLOs, organizations need to consider several key factors. First, they must identify the critical metrics that directly impact customer experience. These could include response time, resolution time, or system availability. Second, SLOs should be realistic and achievable, taking into account the organization’s capabilities and resources. Moreover, SLOs should be regularly reviewed and adjusted to align with evolving customer expectations and business objectives.

D ) The Significance of SLOs in Business Success: Service Level Objectives are more than a theoretical construct; they are a vital tool for achieving business success. Organizations can enhance customer satisfaction, loyalty, and retention by defining clear service goals. This, in turn, leads to positive word-of-mouth, increased customer acquisition, and, ultimately, revenue growth. SLOs also enable businesses to identify and address service gaps proactively, mitigating potential issues before they escalate and impact the overall customer experience.

**Service Level Objectives**

Key Note: With complex systems, we need to start thinking differently than we have in the past to make sure our services are reliable. It might be your responsibility to maintain a globally distributed service with thousands of moving parts, or it might just be to keep a few virtual machines running. No matter how far removed humans are from those components, they almost certainly rely on them at some point, so you also need to consider the service from the perspective of its human users and their needs.

SLOs are percentages you use to help drive decision-making, while SLAs are promises to customers that include compensation in case you do not meet your targets. Violations of your SLO generate data you use to evaluate the reliability of your service. Whenever you violate an SLO, you can choose to take action.

I reiterate that SLOs are not contracts; they are objectives. You are free to update or change your targets at any time. Things in the world will change, and the way your service operates may change with them.

**The Mechanics of Managed Instance Groups**

At its core, a managed instance group is a collection of VM instances that are treated as a single entity. This allows for seamless scaling, load balancing, and automated updates. By utilizing Google Cloud’s MIGs, you can configure these groups to dynamically adjust the number of instances in response to changes in demand, ensuring optimal performance and cost-efficiency. This flexibility is crucial for businesses that experience fluctuating workloads, as it allows for resources to be used judiciously, meeting service level objectives without unnecessary expenditure.

**Achieving Service Level Objectives with MIGs**

Service level objectives (SLOs) are critical benchmarks that dictate the expected performance and availability of your applications. Managed instance groups help you meet these SLOs by offering auto-healing capabilities, ensuring that any unhealthy instances are automatically replaced. This minimizes downtime and maintains the reliability of your services. Additionally, Google Cloud’s integration with load balancing allows for efficient distribution of traffic, further enhancing your application’s performance and adherence to SLOs.

**Advanced Features for Enhanced Management**

Google Cloud’s managed instance groups come equipped with advanced features that elevate their utility. With instance templates, you can standardize the configuration of your VM instances, ensuring consistency across your deployments. The use of regional managed instance groups provides additional resilience, as instances can be spread across multiple zones, safeguarding against potential outages in a single zone. These features collectively empower you to build robust, fault-tolerant applications tailored to your specific needs.

Managed Instance Group

Why are SLOs Important?

SLOs play a vital role in ensuring customer satisfaction and meeting business objectives. Here are a few reasons why SLOs are essential:

1. Accountability: SLOs provide a framework for holding service providers accountable for meeting the promised service levels. They establish a baseline for evaluating the performance and quality of the service.

2. Customer Experience: By setting SLOs, businesses can align their service offerings with customer expectations. This helps deliver a superior customer experience, foster customer loyalty, and gain a competitive edge in the market.

3. Performance Monitoring and Improvement: SLOs enable businesses to monitor their services’ performance and continuously identify improvement areas. Regularly tracking SLO metrics allows for proactive measures and optimizations to enhance service reliability and availability.

Critical Elements of SLOs:

To effectively implement SLOs, it is essential to consider the following key elements:

1. Metrics: SLOs should be based on relevant, measurable metrics that accurately reflect the desired service performance. Standard metrics include response time, uptime percentage, error rate, and throughput.

2. Targets: SLOs must define specific targets for each metric, considering customer expectations, industry standards, and business requirements. Targets should be achievable yet challenging enough to drive continuous improvement.

3. Monitoring and Alerting: Establishing robust monitoring and alerting mechanisms allows businesses to track the performance of their services in real time. This enables timely intervention and remediation in case of deviations from the defined SLOs (a small burn-rate sketch follows this list).

4. Communication: Effective communication with customers is crucial to ensure transparency and manage expectations. Businesses should communicate SLOs, including the metrics, targets, and potential limitations, to foster trust and maintain a healthy customer-provider relationship.
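For the monitoring and alerting element, many SRE teams alert on how fast the error budget is being consumed rather than on raw error counts. The sketch below shows that burn-rate idea; the request counters are made up, and the 14.4 and 3 multipliers echo commonly published multiwindow burn-rate examples, so treat them as assumptions to tune for your own SLO windows.

```python
def burn_rate(errors, requests, slo_target=0.999):
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    observed_error_ratio = errors / requests
    allowed_error_ratio = 1 - slo_target
    return observed_error_ratio / allowed_error_ratio

# Counts observed over the last hour (illustrative numbers)
rate = burn_rate(errors=180, requests=120_000)

# A common pattern: page on a fast burn over a short window,
# open a ticket on a slower, sustained burn.
if rate >= 14.4:
    print(f"PAGE: burn rate {rate:.1f}x")
elif rate >= 3:
    print(f"TICKET: burn rate {rate:.1f}x")
else:
    print(f"OK: burn rate {rate:.1f}x")
```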

**The Value of SRE Teams**

Site Reliability Engineering (SRE) teams have tools such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets that can guide them on the road to building a reliable system with the customer viewpoint as the metric. These tools form the basis for reliability in distributed systems and are the core building blocks of a reliable stack, assisting with baseline reliability engineering. The first thing you need to understand is what is expected of the service. This introduces the area of service-level management and its components.

**The Role of Service-Level Management**

The core concepts of service level management are Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs). The common indicators used are availability, latency, duration, and efficiency. Monitoring these indicators to catch problems before your SLO is violated is critical. These are the cornerstone of developing a good SRE practice.

    • SLI: Service Level Indicator: a well-defined measure of “successful enough.” It is a quantifiable measurement of whether a given user interaction was good enough. Did it meet the user’s expectations? Did a web page load within a specific time? This allows you to categorize a given interaction as good or bad.
    • SLO: Service Level Objective: a top-line target for the fraction of successful interactions.
    • SLA: Service Level Agreement: the consequences. It is more of a legal construct, with compensation attached if you miss your targets (a small calculation sketch follows this list).
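Putting these concepts together, a request-based availability SLI and its error budget reduce to a small calculation; the request counts and the 99.9% target below are illustrative.

```python
def reliability_report(total_requests, good_requests, slo_target=0.999):
    """Return the SLI and the fraction of error budget still remaining."""
    sli = good_requests / total_requests                   # "successful enough" ratio
    allowed_failures = total_requests * (1 - slo_target)   # the error budget
    actual_failures = total_requests - good_requests
    budget_remaining = 1 - (actual_failures / allowed_failures)
    return sli, budget_remaining

sli, budget = reliability_report(total_requests=1_000_000, good_requests=999_400)
print(f"SLI = {sli:.4%}, error budget remaining = {budget:.0%}")
# SLI = 99.9400%: 600 failed requests against a budget of 1,000
# leaves 40% of the budget for risky changes this period.
```

When the remaining budget approaches zero, the SRE practice is to slow down risky releases; when plenty remains, that budget can be spent on experiments and faster shipping.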

Related: For pre-information, you may find the following helpful:

  1. Starting Observability
  2. Distributed Firewalls
  3. Network Traffic Engineering
  4. Brownfield Network Automation 

Service Level Objectives (SLOs)

Site Reliability Engineering (SRE)

Google pioneered SRE to create more scalable and reliable large-scale systems. SRE has become one of today’s most valuable software innovation opportunities. It is a concrete, opinionated implementation of the DevOps philosophy. The main goal is to create scalable and highly reliable software systems.

According to Benjamin Treynor Sloss, the founder of Google’s Site Reliability Team, “SRE is what happens when a software engineer is tasked with what used to be called operations.”

So, reliability is not so much a feature as a practice that must be prioritized and considered from the very beginning. It should not be added later, for example, when a system or service is in production. Reliability is the essential feature of any system, and it’s not a feature that a vendor can sell you.

Personal Note:

So, if someone tries to sell you an add-on solution called Reliability, don’t buy it, especially if they offer 100% reliability. Nothing can be 100% reliable all the time. If you strive for 100% reliability, you will miss out on opportunities to innovate, experiment, and take the risks that can help you build better products and services.


Components of a Reliable System

### Distributed systems

At its core, a distributed system is a network of independent computers that work together to achieve a common goal. These systems can be spread across multiple locations and connected through communication networks. The architecture of distributed systems can vary widely, ranging from client-server models to peer-to-peer networks. One of the key features of distributed systems is their ability to provide redundancy and fault tolerance, ensuring that if one component fails, the system as a whole continues to function.

### Building Reliable Systems

To build reliable systems that can tolerate various failures, the system needs to be distributed so that a problem in one location doesn’t mean your entire service stops operating. So you need to build a system that can handle, for example, a node dying or perform adequately with a particular load.

To create a reliable system, you need to understand it fully and know what happens when the different components that make up the system reach certain thresholds. This is where practices such as Chaos Engineering on Kubernetes can help you.

### Chaos Engineering 

We can have practices like Chaos Engineering that can confirm your expectations, give you confidence in your system at different levels, and prove you can have certain tolerance levels to Reliability. Chaos Engineering allows you to find weaknesses and vulnerabilities in complex systems. It is an important task that can be automated into your CI/CD pipelines.

You can have various Chaos Engineering verifications before you reach production. These tests, such as load and Latency tests, can all be automated with little or no human interaction. Site Reliability Engineering (SRE) teams often use Chaos Engineering to improve resilience, which must be part of your software development/deployment process.  
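As one way to automate such checks, the toy decorator below injects random latency and failures into a dependency call so a CI job can verify that the calling service still behaves acceptably. Dedicated chaos tooling or service-mesh fault injection would replace this in practice, and all names here are hypothetical.

```python
import functools
import random
import time

def chaos(latency_ms=200, error_rate=0.05, enabled=True):
    """Wrap a function to inject random latency and failures (toy fault injection)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if enabled:
                time.sleep(random.uniform(0, latency_ms) / 1000.0)
                if random.random() < error_rate:
                    raise RuntimeError("chaos: injected dependency failure")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_ms=300, error_rate=0.10)
def call_inventory_service():
    return {"status": "ok"}

# A CI chaos check: the caller must tolerate slow or failing dependencies
try:
    call_inventory_service()
except RuntimeError:
    pass  # the calling code's retry/fallback path would be exercised here
```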

### Integrating Chaos Engineering with Service Mesh

Integrating chaos engineering with a service mesh brings numerous benefits to organizations striving for resilience and reliability. Firstly, it enhances fault tolerance by exposing vulnerabilities before they become critical issues. Secondly, it provides a deeper understanding of system behavior under duress, enabling teams to optimize service performance and reliability. Lastly, it fosters a culture of experimentation and learning, encouraging teams to continuously improve and innovate.

Perception: Customer-Centric View

Reliability is all about perception. If users consider your service unreliable, you will lose their trust because of that poor perception, so it’s important to provide consistency in your services as much as possible. For example, it’s OK to have some outages. Outages are expected, but you can’t have them all the time or for long durations.

Users expect outages at some point in time, but not for long. Perception is everything; if the user thinks you are unreliable, you are. Therefore, you need a customer-centric view, and customer satisfaction is a critical metric to measure.

This is where the critical components of service management, such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), come into play. You need to find a balance between velocity and stability. You can’t stop innovation, but you can’t take too many risks either. An Error Budget will help you apply Site Reliability Engineering (SRE) principles.

User Experience and Static Thresholds

User experience means different things to different groups of users. We now have a model where different service users may be routed through the system in other ways, using various components and providing experiences that can vary widely. We also know that the services no longer tend to break in the same few predictable ways over and over.

With complex microservices and many software interactions, we have many unpredictable failures that we have never seen before. These are often referred to as black holes. We should have a few alerts triggered by only focusing on symptoms that directly impact user experience and not because a threshold was reached.

Example: Issues with Static Thresholds

1. If your Pod network reaches a certain threshold, this does not tell you anything about user experience. You can’t rely on static thresholds anymore, as they have no relationship to customer satisfaction.

2. If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short here, as it usually has predefined dashboards that look for something that has happened before.

3. This brings us back to the challenges with traditional metrics-based monitoring; we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience. However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.

How to Approach Reliability 

**New tools and technologies**

– 1: We have new tools, such as distributed tracing. What is the best way to find the bottleneck when the system becomes slow? Here, you can use distributed tracing and OpenTelemetry. Tracing instruments the system so we can see where time is being spent, and it works across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting the system and producing those traces.

– 2: We have already touched on Service Level Objectives, Service Level Indicators, and error budgets. You want to know why and how something happened, not just that it happened, so you are no longer simply reacting to events without looking from the customer's perspective.

– 3: We need to understand whether we are meeting the Service Level Agreement (SLA) by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist with these measurements.

– 4: SLOs and SLIs do more than measure; they offer a tool for better system reliability and form the base of the Reliability Stack. They help us interact with reliability differently and provide a path for building a reliable system.

So now we have the tools and a discipline within which to use them. Can you recall what that discipline is? The discipline is Site Reliability Engineering (SRE).

Example: Distributed Tracing with Cloud Trace
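A minimal sketch of what that instrumentation can look like in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages are installed and Application Default Credentials are configured; the service and span names are illustrative.

```python
# Sketch: instrument a service with OpenTelemetry and export spans to
# Google Cloud Trace. Assumes opentelemetry-sdk and
# opentelemetry-exporter-gcp-trace are installed and ADC is configured.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")   # hypothetical service name

def handle_request(order_id: str) -> None:
    # Each nested span shows up in Cloud Trace, revealing where time is spent.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("query_inventory"):
            pass  # downstream call would go here

handle_request("order-1234")
```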

**SLO-based approach to reliability**

If you’re too reliable all the time, you’re also missing out on some of the fundamental features that SLO-based approaches give you. The main area you will miss is the freedom to do what you want, test, and innovate. If you’re too reliable, you’re missing out on opportunities to experiment, perform chaos engineering, ship features quicker than before, or even introduce structured downtime to see how your dependencies react.

To learn a system, you need to break it. So, if you are 100% reliable, you can’t touch your system, so you will never truly learn and understand your system. You want to give your users a good experience, but you’ll run out of resources in various ways if you try to ensure this good experience happens 100% of the time. SLOs let you pick a target that lives between those two worlds.

**Balance velocity and stability**

You can't have only reliability; you also need new features and innovation, so you must find a balance between velocity and stability. Reliability has to be weighed against the features you offer and plan to offer. A system with a fantastic feature that doesn't work is of little value, and users who have a choice will leave.

Site Reliability Engineering is the framework for balancing velocity and stability. How do you know what level of Reliability you need to provide your customer? This all goes back to the business needs that reflect the customer’s expectations. With SRE, we have a customer-centric approach.

The primary source of outages is change, even planned change. Change comes in many forms: pushing new features, applying security patches, deploying new hardware, and scaling up to meet customer demand. All of these work against you if you strive for a 100% reliability target.

There will always be changes

If nothing ever changed in the physical or logical infrastructure, we would introduce no new bugs; we could freeze the current user base and never have to scale the system. In reality, that will not happen. There will always be changes, so you need to find a balance.

Service Level Objectives (SLOs) are a cornerstone for delivering reliable and high-quality services in today’s technology-driven world. By setting measurable targets, businesses can align their service performance with customer expectations, drive continuous improvement, and ultimately enhance customer satisfaction. 

Implementing and monitoring SLOs allows companies to proactively address issues, optimize service delivery, and stay ahead of the competition. By embracing SLOs, companies can pave the way for successful service delivery and long-term growth.

Summary: Service Level Objectives (SLOs)

Service Level Objectives (SLOs) ensure optimal performance and efficiency in any service-based industry. In this blog post, we explored SLOs’ significance, components, and best practices. Whether you are a business owner, a service provider, or simply curious about SLOs, this guide will provide valuable insights to enhance your understanding.

What are Service Level Objectives?

Service Level Objectives (SLOs) are measurable targets that define the service quality and performance a service provider aims to achieve. They serve as a benchmark for evaluating the effectiveness and efficiency of service delivery. By setting specific SLOs, organizations can ensure they meet their customer’s expectations while maintaining high performance.

Components of Service-Level Objectives

To construct effective SLOs, it is essential to consider several key components; a short worked sketch follows this list. These include:

a) Metrics: Identify the specific metrics that will be used to measure performance. This could include response time, uptime, error rates, or other relevant indicators.

b) Targets: Set realistic and achievable targets for each metric. It is essential to strike a balance between ambitious goals and practicality.

c) Timeframes: Define the timeframes for meeting the targets. Depending on the nature of the service, this could be measured daily, weekly, monthly, or yearly.
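Putting the three components together, the sketch below models an SLO as a metric, a target, and a timeframe, with a simple compliance check; the names and numbers are illustrative.

```python
# Sketch: an SLO as metric + target + timeframe, with a compliance check.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str            # which metric/SLI this objective applies to
    target: float        # e.g. 0.95 == 95%
    window_days: int     # the timeframe the target is measured over

    def is_met(self, good_events: int, total_events: int) -> bool:
        sli = good_events / total_events if total_events else 1.0
        return sli >= self.target

latency_slo = SLO(name="requests-under-300ms", target=0.95, window_days=28)
print(latency_slo.is_met(good_events=962_000, total_events=1_000_000))  # True
```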

Best Practices for Setting SLOs

Setting effective SLOs requires careful consideration and adherence to best practices. Here are a few essential tips to keep in mind:

a) Collaboration: Involve stakeholders from different departments to comprehensively understand the service requirements and expectations.

b) Data-Driven Approach: Base your SLOs on reliable and accurate data. Analyze historical performance, customer feedback, and industry benchmarks to set realistic targets.

c) Continuous Monitoring: Regularly monitor and assess performance against the defined SLOs. This allows for the timely identification of deviations and facilitates prompt remedial actions.

Implementing SLOs for Success

Implementing SLOs successfully involves a systematic approach and a focus on continuous improvement. Consider the following steps:

a) Clear Communication: Ensure that all relevant teams and individuals are aware of the SLOs and their significance. Transparency and open communication are crucial to fostering a culture of accountability.

b) Regular Reporting: Establish a reporting mechanism to track progress towards the SLOs. This allows for the identification of trends, areas for improvement, and informed decision-making.

c) Iterative Refinement: SLOs should not be set in stone. They should be reviewed and refined regularly based on evolving customer expectations, industry trends, and internal capabilities.

In conclusion, Service Level Objectives (SLOs) are indispensable tools for driving performance and efficiency in service-based industries. By defining clear targets, monitoring performance, and implementing best practices, organizations can optimize their service delivery and enhance customer satisfaction. Embracing SLOs as a strategic framework empowers businesses to excel in an increasingly competitive landscape.

Observability vs Monitoring

In today's fast-paced digital landscape, where complex systems and applications drive businesses, it's crucial to have a clear understanding of observability and monitoring. These two terms are often used interchangeably, but they represent distinct concepts in the realm of system management and troubleshooting. In this blog post, we will delve into the differences between observability and monitoring, shedding light on their unique features and benefits.

What is Observability? Observability refers to the ability to gain insight into the internal state of a system through its external outputs. It focuses on understanding the behavior and performance of a system from an external perspective, without requiring deep knowledge of its internal workings. Observability provides a holistic view of the system, enabling comprehensive analysis and troubleshooting.

The Essence of Monitoring: Monitoring, on the other hand, involves the systematic collection and analysis of various metrics and data points within a system. It primarily focuses on tracking predefined performance indicators, such as CPU usage, memory utilization, and network latency. Monitoring provides real-time data and alerts to ensure that system health is maintained and potential issues are promptly identified.

Data Collection and Analysis: Observability emphasizes comprehensive data collection and analysis, aiming to capture the entire system's behavior, including its interactions, dependencies, and emergent properties. Monitoring, however, focuses on specific metrics and predefined thresholds, often using predefined agents, plugins, or monitoring tools.

Contextual Understanding: Observability aims to provide a contextual understanding of the system's behavior, allowing engineers to trace the flow of data and understand the cause and effect of different components. Monitoring, while offering real-time insights, lacks the contextual understanding provided by observability.

Reactive vs Proactive: Monitoring is primarily reactive, alerting engineers when predefined thresholds are exceeded or when specific events occur. Observability, on the other hand, enables a proactive approach, empowering engineers to explore and investigate the system's behavior even before issues arise.

Observability and monitoring are both crucial elements in system management, but they have distinct focuses and approaches. Observability provides a holistic and contextual understanding of the system's behavior, allowing for comprehensive analysis and proactive troubleshooting. Monitoring, on the other hand, offers real-time data and alerts based on predefined metrics, ensuring system health is maintained. Understanding the differences between these two concepts is vital for effectively managing and optimizing complex systems.

Highlights: Observability vs Monitoring

Understanding Monitoring & Observability

A: Understanding Monitoring: Monitoring collects data and metrics from a system to track its health and performance. It involves setting up various tools and agents that continuously observe and report on predefined parameters. These parameters include resource utilization, response times, and error rates. Monitoring provides real-time insights into the system’s behavior and helps identify potential issues or bottlenecks.

B: Unveiling Observability: Observability goes beyond traditional monitoring by understanding the system’s internal state and cause-effect relationships. It aims to provide a holistic view of the system’s behavior, even in unexpected scenarios. Observability encompasses three main pillars: logs, metrics, and traces. Logs capture detailed events and activities, metrics quantify system behavior over time, and traces provide end-to-end transaction monitoring. By combining these pillars, observability enables deep system introspection and efficient troubleshooting.

C: The Power of Contextual Insights: One of the key advantages of observability is its ability to provide contextual insights. Traditional monitoring may alert you when a specific metric exceeds a threshold, but it often lacks the necessary context to debug complex issues. With its comprehensive data collection and correlation capabilities, Observability allows engineers to understand the context surrounding a problem. Contextual insights help in root cause analysis, reducing mean time to resolution and improving overall system reliability.

D: The Role of Automation: Automation plays a crucial role in monitoring and observability. In monitoring, automation can help set up alerts, generate reports, and scale the monitoring infrastructure. On the other hand, observability requires automated instrumentation and data collection to handle the vast amount of information generated by modern systems. Automation enables engineers to focus on analyzing insights rather than spending excessive time on data collection and processing.

**Observability: The First Steps**

The first step towards achieving modern observability is to gather metrics, traces, and logs. From the collected data points, observability aims to generate valuable outcomes for decision-making. The decision-making process goes beyond resolving problems as they arise. Next-generation observability goes beyond application remediation, focusing on creating business value to help companies achieve their operational goals. This decision-making process can be enhanced by incorporating user experience, topology, and security data.

**Observability Platform**

A full-stack observability platform monitors every host in your environment. Depending on the technologies in use, an average of around 500 metrics can be generated per computational node. AWS, Azure, Kubernetes, and VMware Tanzu are some of the platforms from which observability tooling collects key performance metrics for services and real-user-monitored applications.

Within a microservices environment, dozens, if not hundreds, of microservices call one another. Distributed tracing can help you understand how the different services connect and how your requests flow. 

**Pillars of Observability**

The three pillars of observability form a strong foundation for making data-driven decisions, but there are opportunities to extend observability. User experience and security details must be considered to gain a deeper understanding. A holistic, context-driven approach to advanced observability enables proactively addressing potential problems before they arise.

**The Role of Monitoring**

To understand the difference between observability and monitoring, we need first to discuss the role of monitoring. Monitoring is the evaluation that helps identify the most practical and efficient use of resources. So, the big question I put to you is what to monitor. This is the first step to preparing a monitoring strategy.

To fully understand if monitoring is enough or if you need to move to an observability platform, ask yourself a couple of questions. Firstly, consider what you should be monitoring, why you should be monitoring it, and how you should be monitoring it. 

Observability & Service Mesh

What is a Cloud Service Mesh?

A Cloud Service Mesh is a design pattern that helps manage and secure microservices interactions. Essentially, it acts as a dedicated layer for controlling the network traffic between microservices. By introducing a service mesh, developers can offload much of the responsibility for service communication from the application code itself, making the entire system more resilient and easier to manage.

### Key Benefits of Implementing a Cloud Service Mesh

#### Enhanced Security

One of the primary advantages of a Cloud Service Mesh is the enhanced security it offers. With features like mutual TLS (mTLS) for encrypting communications between services, a service mesh ensures that data is protected as it travels through the network. This is particularly important in a multi-cloud or hybrid cloud environment where services might be spread across different platforms.

#### Improved Observability

Observability is another critical benefit. A Cloud Service Mesh provides granular insights into service performance, helping developers identify and troubleshoot issues quickly. Metrics, logs, and traces are collected systematically, offering a comprehensive view of the entire microservices ecosystem.

#### Traffic Management

Managing traffic between services becomes significantly easier with a Cloud Service Mesh. Features like load balancing, traffic splitting, and failover mechanisms are built-in, ensuring that service-to-service communication remains efficient and reliable. This is particularly beneficial for applications requiring high availability and low latency.

### Popular Cloud Service Mesh Solutions

Several solutions have emerged as leaders in the Cloud Service Mesh space. Istio, Linkerd, and Consul are among the most popular options, each offering unique features and benefits. Istio, for example, is known for its robust policy enforcement and telemetry capabilities, while Linkerd is praised for its simplicity and performance. Consul, on the other hand, excels in multi-cloud environments, providing seamless service discovery and configuration.

### Challenges and Considerations

While the benefits are compelling, implementing a Cloud Service Mesh is not without its challenges. Complexity can be a significant hurdle, particularly for organizations new to microservices architecture. The additional layer of infrastructure requires careful planning and management. Moreover, there is a learning curve associated with configuring and maintaining a service mesh, which can impact development timelines.

Example Product: Cisco AppDynamics

### Real-Time Monitoring: Keeping an Eye on Your Applications

One of the standout features of Cisco AppDynamics is its real-time monitoring capabilities. By continuously tracking the performance of your applications, AppDynamics provides instant insights into any issues that may arise. This allows businesses to quickly identify and address performance bottlenecks, ensuring that their applications remain responsive and reliable. Whether it’s tracking transaction times, monitoring server health, or keeping an eye on user interactions, Cisco AppDynamics provides a comprehensive view of your application’s performance.

### Advanced Analytics: Turning Data into Actionable Insights

Data is the lifeblood of modern businesses, and Cisco AppDynamics excels at turning raw data into actionable insights. With its advanced analytics engine, AppDynamics can identify patterns, trends, and anomalies in your application’s performance data. This empowers businesses to make informed decisions, optimize their applications, and proactively address potential issues before they impact users. From root cause analysis to predictive analytics, Cisco AppDynamics provides the tools you need to stay ahead of the curve.

### Comprehensive Diagnostics: Troubleshooting Made Easy

When performance issues do arise, Cisco AppDynamics makes troubleshooting a breeze. Its comprehensive diagnostics capabilities allow you to drill down into every aspect of your application’s performance. Whether it’s identifying slow database queries, pinpointing code-level issues, or tracking down problematic user interactions, AppDynamics provides the detailed information you need to resolve issues quickly and efficiently. This not only minimizes downtime but also ensures a seamless user experience.

### Enhancing User Experiences: The Ultimate Goal

At the end of the day, the ultimate goal of any application is to provide a positive user experience. Cisco AppDynamics helps businesses achieve this by ensuring that their applications are always performing at their best. By providing real-time monitoring, advanced analytics, and comprehensive diagnostics, AppDynamics enables businesses to deliver fast, reliable, and engaging applications that keep users coming back for more. In a competitive digital landscape, this can be the difference between success and failure.

Google Cloud Monitoring

Example: What is Ops Agent?

Ops Agent is a lightweight, flexible monitoring agent explicitly designed for Compute Engine instances. It allows you to collect and analyze essential metrics and logs from your virtual machines, providing valuable insights into your infrastructure’s health, performance, and security.

To start monitoring your Compute Engine instance with Ops Agent, you must install and configure it properly. The installation process is straightforward and can be done through the Google Cloud Console or the command line. Once installed, you can configure Ops Agent to collect specific metrics and logs based on your requirements.

Ops Agent offers a wide range of metrics and logs that can be collected and monitored. These include system-level metrics like CPU and memory usage, network traffic, disk I/O, and more. Additionally, Ops Agent allows you to gather application-specific metrics and logs, providing deep insights into the performance and behavior of your applications running on the Compute Engine instance.
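As a hedged sketch, the snippet below reads CPU utilization collected by the Ops Agent through the Cloud Monitoring API using the google-cloud-monitoring Python client. The project ID is hypothetical, and the metric type shown is the one the Ops Agent typically reports for host CPU, so verify it against your own environment.

```python
# Sketch: read Ops Agent CPU utilization via the Cloud Monitoring API.
# Assumes google-cloud-monitoring is installed and credentials are set up.
import time
from google.cloud import monitoring_v3

def recent_cpu_points(project_id: str, minutes: int = 10) -> None:
    client = monitoring_v3.MetricServiceClient()
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now)},
         "start_time": {"seconds": int(now - minutes * 60)}}
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            # Assumed Ops Agent metric type; confirm in your project.
            "filter": 'metric.type = "agent.googleapis.com/cpu/utilization"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        for point in series.points:
            print(series.resource.labels.get("instance_id"), point.value.double_value)

# recent_cpu_points("my-gcp-project")   # hypothetical project ID
```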

Options: Open source or commercial

Knowing this, you can move on to the different tools and platforms available. Some of these tools are open source, others commercial. When evaluating them, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos undermine agility in every area of technology.

Related: For pre-information, you may find the following posts helpful:

  1. Microservices Observability
  2. Auto Scaling Observability
  3. Network Visibility
  4. WAN Monitoring
  5. Distributed Systems Observability
  6. Prometheus Monitoring
  7. Correlate Disparate Data Points
  8. Segment Routing

Observability vs Monitoring

Monitoring and Distributed Systems

By utilizing distributed architectures, the cloud native ecosystem allows organizations to build scalable, resilient, and novel software architectures. However, the ever-changing nature of distributed systems means that previous approaches to monitoring can no longer keep up. The introduction of containers made the cloud flexible and empowered distributed systems.

Nevertheless, the ever-changing nature of these systems can cause them to fail in many ways. Distributed systems are inherently complex, and, as systems theorist Richard Cook notes, “Complex systems are intrinsically hazardous systems.”

Cloud-native systems require a new approach to monitoring, one that is open-source compatible, scalable, reliable, and able to control massive data growth. However, cloud-native monitoring can’t exist in a vacuum; it must be part of a broader observability strategy.

**Gaining Observability**

Key Features of Observability:

1. High-dimensional data collection: Observability involves collecting a wide variety of data from different system layers, including metrics, logs, traces, and events. This comprehensive data collection provides a holistic view of the system’s behavior.

2. Distributed tracing: Observability allows tracing requests as they flow through a distributed system, enabling engineers to understand the path and identify performance bottlenecks or errors.

3. Contextual understanding: Observability emphasizes capturing contextual information alongside the data, enabling teams to correlate events and understand the impact of changes or incidents.

Benefits of Observability:

1. Faster troubleshooting: By providing detailed insights into system behavior, observability helps teams quickly identify and resolve issues, minimizing downtime and improving system reliability.

2. Proactive monitoring: Observability allows teams to detect potential problems before they become critical, enabling proactive measures to prevent service disruptions.

3. Improved collaboration: With observability, different teams, such as developers, operations, and support, can have a shared understanding of the system’s behavior, leading to improved collaboration and faster incident response.

**Gaining Monitoring**

On the other hand, monitoring focuses on collecting and analyzing metrics to assess a system’s health and performance. It involves setting up predefined thresholds or rules and generating alerts based on specific conditions.

Key Features of Monitoring:

1. Metric-driven analysis: Monitoring relies on predefined metrics collected and analyzed to measure system performance, such as CPU usage, memory consumption, response time, or error rates.

2. Alerting and notifications: Monitoring systems generate alerts and notifications when predefined thresholds or rules are violated, enabling teams to take immediate action.

3. Historical analysis: Monitoring systems provide historical data, allowing teams to analyze trends, identify patterns, and make informed decisions based on past performance.

Benefits of Monitoring:

1. Performance optimization: Monitoring helps identify performance bottlenecks and inefficiencies within a system, enabling teams to optimize resources and improve overall system performance.

2. Capacity planning: By monitoring resource utilization and workload patterns, teams can accurately plan for future growth and ensure sufficient resources are available to meet demand.

3. Compliance and SLA enforcement: Monitoring systems help organizations meet compliance requirements and enforce service level agreements (SLAs) by tracking and reporting on key metrics.

Observability & Monitoring: A Unified Approach

While observability and monitoring differ in their approaches and focus, they are not mutually exclusive. When used together, they complement each other and provide a more comprehensive understanding of system behavior.

Observability enables teams to gain deep insights into system behavior, understand complex interactions, and troubleshoot issues effectively. Conversely, monitoring provides a systematic approach to tracking predefined metrics, generating alerts, and ensuring the system meets performance requirements.

Combining observability and monitoring can help organizations create a robust system monitoring and management strategy. This integrated approach empowers teams to quickly detect, diagnose, and resolve issues, improving system reliability, performance, and customer satisfaction.

Application Latency & Cloud Trace

A: – Latency, in simple terms, refers to the delay between sending a request and receiving a response. It can be caused by various factors, such as network congestion, server processing time, or inefficient code execution. Understanding the different components contributing to latency is essential for optimizing application performance.

B: – Google Cloud Trace is a powerful diagnostic tool provided by Google Cloud Platform. It allows developers to visualize and analyze latency data for their applications. By instrumenting code and capturing trace data, developers gain valuable insights into the performance bottlenecks and can take proactive measures to improve latency.

C: – To start capturing traces in your application, you need to integrate the Cloud Trace API into your codebase. Once integrated, Cloud Trace collects detailed latency data, including information about the various services and resources used to process a request. This data can then be visualized and analyzed through the user-friendly Cloud Trace interface.

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environment, which will involve several tools, so that you know what is affecting your application performance and infrastructure. A good starting point is Google's Four Golden Signals: latency, traffic, errors, and saturation. The four most important metrics to keep track of are:

      1. Latency: How long it takes to serve a request
      2. Traffic: The number of requests being made.
      3. Errors: The rate of failing requests. 
      4. Saturation: How utilized the service is.

Now that we have some guidance on what to monitor, let's apply it to Kubernetes, for example, to a frontend web service that is part of a tiered application. We would be looking at the following (a hedged sketch of pulling these signals from Prometheus follows the list):

      1. How many requests is the front end processing at a particular point in time?
      2. How many 500 errors are users of the service receiving?
      3. How saturated is the service?
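A hedged sketch of pulling the four golden signals from a Prometheus server's HTTP API; the endpoint and the PromQL metric names are assumptions based on common exporters, so substitute whatever your services actually expose.

```python
# Sketch: query the four golden signals from Prometheus' HTTP API.
# The endpoint and metric names are assumptions; adjust to your exporters.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # assumed endpoint

QUERIES = {
    "latency_p99_seconds": 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    "traffic_rps":         'sum(rate(http_requests_total[5m]))',
    "error_ratio":         'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
    "saturation_cpu":      'avg(rate(container_cpu_usage_seconds_total[5m]))',
}

def golden_signals() -> dict:
    results = {}
    for name, expr in QUERIES.items():
        resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10)
        resp.raise_for_status()
        data = resp.json()["data"]["result"]
        results[name] = float(data[0]["value"][1]) if data else None
    return results

print(golden_signals())
```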

We already know that monitoring is a form of evaluation that helps identify the most practical and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. Within this, we have metrics, logs, and alerts. Each has a different role and purpose.

**Monitoring: The role of metrics**

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values instead of unstructured text, such as documents and web pages. Metric data is typically also a time series, where values or measures are recorded over some time. 

Available bandwidth and latency are examples of such metrics. Understanding baseline values is essential. Without a baseline, you will not know if something is happening outside the norm.

Note: Average Baselines

What are the average baseline values for bandwidth and latency metrics? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? This may change over different days, weeks, and months.

If you notice a rise in these values during normal operations, this would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Remember that these values should not be gathered as a once-off but can be gathered over time to understand your application and its underlying infrastructure better.
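A simple sketch of turning gathered samples into a baseline and flagging readings that drift outside it; real systems would use longer windows and account for seasonality, and the sample values are illustrative.

```python
# Sketch: build a baseline from historical samples and flag outliers.
from statistics import mean, stdev

def baseline(history: list[float]) -> tuple[float, float]:
    return mean(history), stdev(history)

def is_abnormal(value: float, history: list[float], sigmas: float = 3.0) -> bool:
    avg, sd = baseline(history)
    return abs(value - avg) > sigmas * sd

latency_ms_history = [42, 45, 40, 44, 43, 41, 46, 44]   # illustrative samples
print(is_abnormal(95, latency_ms_history))               # True: investigate
```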

**Monitoring: The role of logs**

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about events, which is important for troubleshooting or discovering the root cause of the events. Logs will have much more detail than metrics, so you will need some way to parse the logs or use a log shipper.

A typical log shipper takes these logs from the standard output (stdout) of a Docker container and ships them to a backend for processing.

Note: Example Log Shipper

Fluentd and Logstash are common log shippers, each with its pros and cons. Either can collect container logs and send them to a backend, which could be the ELK stack (Elasticsearch, Logstash, Kibana). With this approach, you can enrich the logs before sending them to the backend; for example, you can add GeoIP information, giving the logs richer context that helps you troubleshoot.
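The sketch below shows the kind of enrichment step a shipper performs before forwarding to a backend: it parses a container's JSON log line and attaches extra context. The GeoIP lookup is a stand-in for a real database such as MaxMind, and the field names are illustrative.

```python
# Sketch: parse a container JSON log line and enrich it before forwarding.
import json

def fake_geoip(ip: str) -> dict:
    # Placeholder: a real pipeline would query a GeoIP database (e.g. MaxMind).
    return {"country": "IE", "city": "Dublin"} if ip.startswith("81.") else {}

def enrich(raw_line: str) -> dict:
    event = json.loads(raw_line)
    event["geoip"] = fake_geoip(event.get("client_ip", ""))
    return event

line = '{"time": "2024-05-01T10:00:00Z", "client_ip": "81.17.2.9", "msg": "GET /cart 200"}'
print(enrich(line))
```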

Understanding VPC Flow Logs

VPC Flow Logs is a feature provided by Google Cloud that captures network traffic metadata within a Virtual Private Cloud (VPC) network. This metadata includes source and destination IP addresses, protocol, port, and more. By enabling VPC Flow Logs, administrators can gain visibility into the network traffic patterns and better understand the communication flow within their infrastructure.

We can leverage data visualization tools to make the analysis more visually appealing and easier to comprehend. Google Cloud provides various options for creating interactive and informative dashboards, such as Data Studio and Cloud Datalab. These dashboards can display network traffic trends, highlight critical metrics, and aid in identifying patterns or anomalies that might require further investigation.
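Once flow logs are exported (for example to BigQuery or Cloud Storage), simple aggregations reveal traffic patterns. The sketch below totals bytes per source/destination pair; the record field names are an assumption about the exported JSON shape, so adjust them to your export format.

```python
# Sketch: summarize exported VPC Flow Log records into "top talkers".
# Field names are assumed; adjust to your export format.
from collections import Counter

def top_talkers(records: list[dict], n: int = 5) -> list[tuple[str, int]]:
    counts = Counter()
    for rec in records:
        conn = rec.get("connection", {})
        pair = f'{conn.get("src_ip")} -> {conn.get("dest_ip")}:{conn.get("dest_port")}'
        counts[pair] += int(rec.get("bytes_sent", 0))
    return counts.most_common(n)

sample = [
    {"connection": {"src_ip": "10.0.0.5", "dest_ip": "10.0.1.9", "dest_port": 443}, "bytes_sent": 12000},
    {"connection": {"src_ip": "10.0.0.5", "dest_ip": "10.0.1.9", "dest_port": 443}, "bytes_sent": 8000},
]
print(top_talkers(sample))
```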

**Monitoring: The role of alerting**

Then we have alerting, where you need to balance how you monitor with what you alert on. Alerting is never perfect, and getting the right alerting strategy in place takes time. It is not a simple day-one installation; it requires sustained effort and cross-team collaboration.

You know that alerting on too much can cause alert fatigue. We are all too familiar with the problems alert fatigue can bring and the tensions it can create in departments.

To minimize this, base alerts on Service Level Objectives (SLOs). SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they form the foundation of a reliability stack. You also need to consider alert thresholds and evaluation windows carefully: if the windows are too short, you will get many false positives.
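One common SRE pattern along these lines is multiwindow burn-rate alerting: rather than paging on a raw threshold, you page on how quickly the error budget is being consumed, requiring both a fast and a slow window to agree. A minimal sketch with illustrative numbers:

```python
# Sketch: multiwindow burn-rate alerting. Numbers are illustrative.
def burn_rate(error_ratio: float, slo: float) -> float:
    # 1.0 means the budget is being consumed exactly at the sustainable rate.
    return error_ratio / (1 - slo)

def should_page(err_5m: float, err_1h: float, slo: float = 0.999) -> bool:
    # Both windows must agree, which filters out short-lived blips.
    return burn_rate(err_5m, slo) > 14 and burn_rate(err_1h, slo) > 14

print(should_page(err_5m=0.02, err_1h=0.018))   # True: budget burning fast
```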

Monitoring is not enough.

Even with all of this in place, monitoring is not enough. Given the sheer complexity of today's landscape, you need to think differently about the tools you use and how you turn the data and intelligence they produce into action, so that issues are resolved before they become incidents.

A monitoring tool is just a tool; it probably does not cross technical domains, and different groups of users administer each tool without a holistic view. Tools alone take you only halfway through the journey. You also need to address culture and the traditional way of working in silos, because a siloed environment undermines the monitoring strategy you want to implement. This is where an observability platform comes in.

The Foundation of GKE-Native Monitoring

GKE-Native Monitoring builds upon the robust foundation of Prometheus and Stackdriver, providing a seamless integration that simplifies observability within GKE clusters. By harnessing the strengths of these industry-leading monitoring solutions, GKE-Native Monitoring offers a robust and comprehensive monitoring experience.

Under the umbrella of GKE-Native Monitoring, users gain access to a rich set of features designed to enable fine-grained visibility and control. These include customizable dashboards, real-time metrics, alerts, and horizontal pod autoscaling. With these tools, developers and operators can easily monitor the health, performance, and resource utilization of their GKE clusters.

Observability vs Monitoring

When it comes to observability vs. monitoring, monitoring can detect problems and tell you when a system is down; when the system is up, monitoring has nothing to say. A problem has to occur before monitoring takes action, which makes it a very reactive practice.

On the other hand, we have an observability platform, which is a more proactive practice. It’s about what and how your system and services are doing. Observability lets you improve your insight into how complex systems work and quickly get to the root cause of any problem, known or unknown.

Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without first predicting. This is a proactive approach.

## Pillars of Observability

This is achieved by combining logs, metrics, and traces. So, we need data collection, storage, and analysis across these domains while also being able to perform alerting on what matters most. Let’s say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app.

The Observability platform pulls context from different sources of information, such as logs, metrics, events, and traces, into one central context. Distributed tracing adds a lot of value here.

Also, when everything is placed into one context, you can quickly switch between the necessary views to troubleshoot the root cause. Viewing these telemetry sources with one single pane of glass is an excellent key component of any observability system. 

## Known Unknowns vs Unknown Unknowns

Monitoring automatically reports whether known failure conditions are occurring or are about to occur. In other words, it is optimized for reporting on unknown conditions about known failure modes, which are referred to as known unknowns. In contrast, Observability is centered around discovering if and why previously unknown failure modes may be occurring, in other words, to find unknown unknowns.

The monitoring-based approach of metrics and dashboards is an investigative practice that relies on humans’ experience and intuition to detect and understand system issues. This is okay for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in unpredictable ways.

With modern applications, the complexity and scale of their underlying systems quickly make that approach unattainable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex. You don’t need to react to a hunch or have intimate system knowledge to generate a hunch.

Monitoring vs Observability: Working together?

Monitoring helps engineers understand infrastructure concerns, while observability helps engineers understand software concerns. So, Observability and Monitoring can work together. First, the infrastructure does not change too often, and when it fails, it will fail more predictably. So, we can use monitoring here.

This is compared to software system states that change daily and are unpredictable. Observability fits this purpose. The conditions that affect infrastructure health change infrequently and are relatively more straightforward to predict. 

We have several well-established practices to expect, such as capacity planning and the ability to remediate automatically (e.g., auto-scaling in a Kubernetes environment). All of these can be used to tackle these types of known issues. 

Monitoring and infrastructure problems

Because infrastructure is relatively predictable and changes slowly, an aggregated-metrics approach to monitoring and alerting suits infrastructure problems well. Here, a metrics-based system works: metrics and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached.

Now, we need to look at monitoring the Software and have access to high-cardinality fields. These may include the user ID or a shopping cart ID. Code that is well-instrumented for Observability allows you to answer complex questions that are easy to miss when examining aggregate performance.

Observability and monitoring are essential practices in modern software development and operations. While observability focuses on understanding system behaviour through comprehensive data collection and analysis, monitoring uses predefined metrics to assess performance and generate alerts.

By leveraging both approaches, organizations can gain a holistic view of their systems, enabling proactive measures, faster troubleshooting, and optimal performance. Embracing observability and monitoring as complementary practices can pave the way for more reliable, scalable, and efficient systems in the digital era.

Summary: Observability vs Monitoring

As technology advances rapidly, understanding and managing complex systems becomes increasingly important. Two terms that often arise in this context are observability and monitoring. While they may seem interchangeable, they represent distinct approaches to gaining insights into system performance. In this blog post, we delved into observability and monitoring, exploring their differences, benefits, and how they can work together to provide a comprehensive understanding of system behavior.

Understanding Monitoring

Monitoring is a well-established practice in the world of technology. It involves collecting and analyzing data from various sources to ensure the smooth functioning of a system. Monitoring typically focuses on key performance indicators (KPIs) such as response time, error rates, and resource utilization. Organizations can proactively identify and resolve issues by tracking these metrics, ensuring optimal system performance.

Unveiling Observability

Observability takes a more holistic approach compared to monitoring. It emphasizes understanding the internal state of a system by leveraging real-time data and contextual information. Unlike monitoring, which focuses on predefined metrics, observability aims to provide a clear picture of how a system behaves under different conditions. It achieves this by capturing fine-grained telemetry data, including logs, traces, and metrics, which can be analyzed to uncover patterns, anomalies, and root causes of issues.

The Benefits of Observability

One of the key advantages of observability is its ability to handle unexpected scenarios and unknown unknowns. Capturing detailed data about system behavior enables teams to investigate issues retroactively, even those that were not anticipated during the design phase. Additionally, observability allows for better collaboration between different teams, as the shared visibility into system internals facilitates more effective troubleshooting and faster incident resolution.

Synergy between Observability and Monitoring

While observability and monitoring are distinct concepts, they are not mutually exclusive. They can complement each other to provide a comprehensive understanding of system performance. Monitoring can provide high-level insights into system health and performance trends, while observability can dive deeper into specific issues and offer a more granular view. By combining these approaches, organizations can achieve a proactive and reactive system management approach, ensuring stability and resilience.

Conclusion:

Observability and monitoring are two powerful tools in the arsenal of system management. While monitoring focuses on predefined metrics, observability takes a broader and more dynamic approach, capturing fine-grained data to gain deeper insights into system behavior. By embracing observability and monitoring, organizations can unlock a comprehensive understanding of their systems, enabling them to proactively address issues, optimize performance, and deliver exceptional user experiences.

OpenShift Security Context Constraints

OpenShift Security Best Practices

In today's digital landscape, security is of utmost importance. This is particularly true for organizations utilizing OpenShift, a powerful container platform. In this blog post, we will explore the best practices for OpenShift security, ensuring that your deployments are protected from potential threats.

Container Security: Containerization has revolutionized application deployment, but it also introduces unique security considerations. By implementing container security best practices, you can mitigate risks and safeguard your OpenShift environment. We will delve into topics such as image security, vulnerability scanning, and secure container configurations.

Access Control: Controlling access to your OpenShift cluster is vital for maintaining a secure environment. We will discuss the importance of strong authentication mechanisms, implementing role-based access control (RBAC), and regularly reviewing and updating user permissions. These measures will help prevent unauthorized access and potential data breaches.

Network Security: Securing the network infrastructure is crucial to protect your OpenShift deployments. We will explore topics such as network segmentation, implementing firewall rules, and utilizing secure network protocols. By following these practices, you can create a robust network security framework for your OpenShift environment.

Monitoring and Logging: Effective monitoring and logging are essential for detecting and responding to security incidents promptly. We will discuss the importance of implementing comprehensive logging mechanisms, utilizing monitoring tools, and establishing alerting systems. These practices will enable you to proactively identify potential security threats and take necessary actions to mitigate them.

Regular Updates and Patching: Keeping your OpenShift environment up to date with the latest patches and updates is vital for maintaining security. We will emphasize the significance of regular patching and provide tips for streamlining the update process. By staying current with security patches, you can address vulnerabilities and protect your OpenShift deployments.

Securing your OpenShift environment requires a multi-faceted approach that encompasses container security, access control, network security, monitoring, and regular updates. By implementing the best practices discussed in this blog post, you can fortify your OpenShift deployments and ensure a robust security posture. Protecting your applications and data is a continuous effort, and staying vigilant is key in the ever-evolving landscape of cybersecurity.

Highlights: OpenShift Security Best Practices

OpenShift Security 

**Understanding the Security Landscape**

Before diving into specific security measures, it’s essential to understand the overall security landscape of OpenShift. OpenShift is built on top of Kubernetes, inheriting its security features while also providing additional layers of protection. These include role-based access control (RBAC), network policies, and security context constraints (SCCs). Understanding how these components interact is crucial for building a secure environment.

**Implementing Role-Based Access Control**

One of the foundational elements of OpenShift security is Role-Based Access Control (RBAC). RBAC allows administrators to define what actions users and service accounts can perform within the cluster. By assigning roles and permissions carefully, you can ensure that each user only has access to the resources and operations necessary for their role, minimizing the risk of unauthorized actions or data breaches.

**Network Policies for Enhanced Security**

Network policies are another vital aspect of securing an OpenShift environment. These policies determine how pods within a cluster can communicate with each other and with external resources. By creating strict network policies, you can isolate sensitive workloads, control traffic flow, and prevent unauthorized access, thus reducing the attack surface of your applications.
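As a hedged illustration, the sketch below uses the official Kubernetes Python client to flag namespaces that have no NetworkPolicy at all, since such namespaces typically allow all pod-to-pod traffic by default. It assumes a kubeconfig (or in-cluster credentials) with permission to list these resources.

```python
# Sketch: flag namespaces with no NetworkPolicy using the Kubernetes client.
from kubernetes import client, config

def namespaces_without_policies() -> list[str]:
    config.load_kube_config()                       # or config.load_incluster_config()
    core = client.CoreV1Api()
    net = client.NetworkingV1Api()
    covered = {p.metadata.namespace
               for p in net.list_network_policy_for_all_namespaces().items}
    all_ns = {ns.metadata.name for ns in core.list_namespace().items}
    return sorted(all_ns - covered)

if __name__ == "__main__":
    print(namespaces_without_policies())
```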

**Security Context Constraints and Pod Security**

Security Context Constraints (SCCs) are used to define the permissions and access controls for pods. By setting up SCCs, you can restrict the capabilities of pods, such as preventing them from running as a root user. This minimizes the risk of privilege escalation attacks and helps maintain the integrity of your applications. Regularly reviewing and updating SCCs is a best practice for maintaining a secure OpenShift environment.

OpenShift Security Best Practices

1- Implementing Strong Authentication: One of the fundamental aspects of OpenShift security is ensuring robust authentication mechanisms are in place. Utilize features like OpenShift’s built-in OAuth server or integrate with external authentication providers such as LDAP or Active Directory. Enforce multi-factor authentication for added security and regularly review and update access controls to restrict unauthorized access.

2- Container Image Security: Container images play a vital role in OpenShift deployments, and their security should not be overlooked. Follow these best practices: regularly update base images to patch security vulnerabilities, scan images for known vulnerabilities using tools like Clair or Anchor, utilize trusted registries for image storage, and implement image signing and verification to ensure image integrity.

3- Network Segmentation and Policies: Proper network segmentation is crucial to isolate workloads and minimize the impact of potential security breaches. Leverage OpenShift’s network policies to define and enforce communication rules between pods and projects. Implement ingress and egress filtering to control traffic flow and restrict access to sensitive resources. Regularly review and update network policies to align with your evolving security requirements.

4- Logging and Monitoring: Comprehensive logging and monitoring are essential for effectively detecting and responding to security incidents. Enable centralized logging by leveraging OpenShift’s logging infrastructure or integrating with external log management solutions. Implement robust monitoring tools to track resource usage, detect abnormal behavior, and set up alerts for security-related events. Regularly review and analyze logs to identify potential security threats and take proactive measures.

Understanding Cluster Access

To begin our journey, let’s establish a solid understanding of cluster access. Cluster access refers to authenticating and authorizing users or entities to interact with an Openshift cluster. It involves managing user identities, permissions, and secure communication channels.

  • Implementing Multi-Factor Authentication (MFA)

Multi-factor authentication (MFA) requires users to provide multiple forms of identification, adding an extra layer of security. Enabling MFA within your Openshift cluster can significantly reduce the risk of unauthorized access. This section will outline the steps to configure and enforce MFA for enhanced cluster access security.

  • Role-Based Access Control (RBAC)

RBAC is a crucial component of Openshift security, allowing administrators to define and manage user permissions at a granular level. We will explore the concept of RBAC and its practical implementation within an Openshift cluster. Discover how to define roles, assign permissions, and effectively control access to various resources.

  • Secure Communication Channels

Establishing secure communication channels is vital to protect data transmitted between cluster components. In this section, we will discuss the utilization of Transport Layer Security (TLS) certificates to encrypt communication and prevent eavesdropping or tampering. Learn how to generate and manage TLS certificates within your Openshift environment.

  • Continuous Monitoring and Auditing

Maintaining a robust security posture involves constantly monitoring and auditing cluster access activities. Through integrating monitoring tools and auditing mechanisms, administrators can detect and respond to potential security breaches promptly. Uncover the best practices for implementing a comprehensive monitoring and auditing strategy within your Openshift cluster.

  • Threat modelling

A threat model maps out the likelihood and impact of potential threats to your system. Your security team is busy evaluating the risks to your platform, so it is essential to think about and evaluate them.

OpenShift clusters are no different, so keep that in mind when hardening your cluster. Using the kubeadmin user is probably fine if you are running CodeReady Containers on your laptop to learn OpenShift. Tight access control and RBAC rules, however, are essential for your company's production clusters exposed to the internet.

If you model threats beforehand, you can explain what you did to protect your infrastructure and why you may not have taken specific other actions.

**Stricter Security than Kubernetes**

OpenShift ships with stricter default security policies than Kubernetes. For instance, running a container as root is forbidden by default, and the platform aims to be secure by default. Kubernetes, by contrast, does not ship with its own identity provider, so authentication is typically delegated to external systems, and operators must wire up bearer tokens or other authentication mechanisms themselves.

OpenShift provides a range of security features, including role-based access control (RBAC), image scanning, and container isolation, that help ensure the safety of containerized applications.

Related: For useful pre-information on OpenShift basics, you may find the following posts helpful:

  1. OpenShift Networking
  2. Kubernetes Security Best Practice
  3. Container Networking
  4. Identity Security
  5. Docker Container Security
  6. Load Balancing

OpenShift Security Best Practices

Securing containerized environments

Securing containerized environments is considerably different from securing the traditional monolithic application because of the inherent nature of the microservices architecture. We went from one to many, and there is a clear difference in attack surface and entry points. So, there is much to consider for OpenShift network security and OpenShift security best practices, including many Docker security options.

The application stack previously had very few components, maybe just a cache, web server, and database separated and protected by a context firewall. The most common network service allows a source to reach an application, and the sole purpose of the network is to provide endpoint reachability.

As a result, the monolithic application has few entry points, such as ports 80 and 443. Not every monolithic component is exposed to external access and must accept requests directly, so we designed our networks around these facts. The following diagram provides information on the threats you must consider for container security.

Diagram: Container security. Source: Neuvector.

Container Security Best Practices

1. Secure Authentication and Authorization: One of the fundamental aspects of OpenShift security is ensuring that only authorized users have access to the platform. Implementing robust authentication mechanisms, such as multifactor authentication (MFA) or integrating with existing identity management systems, is crucial to prevent unauthorized access. Additionally, defining fine-grained access controls and role-based access control (RBAC) policies will help enforce the principle of least privilege.

2. Container Image Security: OpenShift leverages containerization technology, which brings its security considerations. It is essential to use trusted container images from reputable sources and regularly update them to include the latest security patches. Implementing image scanning tools to detect vulnerabilities and malware within container images is also recommended. Furthermore, restricting privileged containers and enforcing resource limits will help mitigate potential security risks.

3. Network Security: OpenShift supports network isolation through software-defined networking (SDN). It is crucial to configure network policies to restrict communication between different components and namespaces, thus preventing lateral movement and unauthorized access. Implementing secure communication protocols, such as Transport Layer Security (TLS), between services and enforcing encryption for data in transit will further enhance network security.

4. Monitoring and Logging: A robust monitoring and logging strategy is essential for promptly detecting and responding to security incidents. OpenShift provides built-in monitoring capabilities, such as Prometheus and Grafana, which can be leveraged to monitor system health, resource usage, and potential security threats. Additionally, enabling centralized logging and auditing of OpenShift components will help identify and investigate security events.

5. Regular Vulnerability Assessments and Penetration Testing: To ensure the ongoing security of your OpenShift environment, it is crucial to conduct regular vulnerability assessments and penetration testing. These activities will help identify any weaknesses or vulnerabilities within the platform and its associated applications. Addressing these vulnerabilities promptly will minimize the risk of potential attacks and data breaches.

**OpenShift Security**

OpenShift delivers all the tools you need to run software on top of it with SRE paradigms, from a monitoring platform to an integrated CI/CD system that you can use to monitor and run both the software deployed to the OpenShift cluster and the cluster itself. So, the cluster and the workload that runs in it need to be secured. 

From a security standpoint, OpenShift provides robust encryption controls to protect sensitive data, including platform secrets and application configuration data. In addition, OpenShift optionally utilizes FIPS 140-2 Level 1 compliant encryption modules to meet security standards for U.S. federal departments.

This post highlights OpenShift security and provides security best practices and considerations when planning and operating your OpenShift cluster. These will give you a starting point. However, as clusters and bad actors are ever-evolving, it is important to revise the steps you took.

**Central security architecture**

Therefore, we often see security enforced at a fixed, central place in the network infrastructure. This could be, for example, a large security stack consisting of several security appliances, sometimes referred to as a kludge of devices. As a result, the individual components within the application do not need to carry out security checks themselves, as these occur centrally on their behalf.

On the other hand, with the common microservices architecture, those internal components are specifically designed to operate independently and accept requests alone, which brings considerable benefits to scaling and deploying pipelines.

However, each component may now have entry points and accept external connections. Therefore, they need to be concerned with security individually and not rely on a central security stack to do this for them.

**The different container attack vectors** 

These changes have considerable consequences for security and how you approach your OpenShift security best practices. The security principles still apply, and we still are concerned with reducing the blast radius, least privileges, etc. Still, they must be used from a different perspective and to multiple new components in a layered approach. Security is never done in isolation.

So, as the number of entry points to the system increases, the attack surface broadens, leading to several Docker container security attack vectors not seen with monolithic applications. We have, for example, attacks on the host, images, the supply chain, and the container runtime. There is also a considerable increase in the rate of change in these environments; an old joke says that a secure application is an application stack with no changes.

Open The Door To Bad Actors

So when you change something, you can open the door to a bad actor. Today's applications change several times a day in an agile stack. Unit tests, security tests, and other safety checks can reduce mistakes, but no matter how much preparation you do, there is a chance of a breach whenever there is a change.

On top of these environmental changes, there are some alarming technical defaults in how containers run, such as running as root and with a disturbing number of capabilities and privileges. The following image displays attack vectors that are linked explicitly to containers.

Diagram: Container attack vectors. Source: Adriancitu

Challenges with Securing Containers

  • Containers running as root

As you know, containers run as root by default and share the kernel of the host OS. The container process is visible from the host, which is a considerable security risk when a container compromise occurs. If a security vulnerability in the container runtime allows a container escape, an application running as root inside the container can become root on the underlying host.

Therefore, if a bad actor gains access to the host with the right privileges, they can compromise all of the host's containers.

  • Risky Configuration

Containers often run with excessive privileges and capabilities, much more than they need to do their job. As a result, we need to consider what privileges a container has and whether it runs with any capabilities it does not need.

Some of a container's default capabilities fall under risky configurations and should be avoided. In particular, keep an eye on the CAP_SYS_ADMIN flag, which grants access to an extensive range of privileged activities.

  • Weakened container isolation

The container has isolation boundaries by default through namespaces and control groups (when configured correctly). However, granting excessive capabilities weakens the isolation between the container, its host, and other containers on the same host. This essentially removes or dissolves the container's ring-fence.

Starting OpenShift Security Best Practices

Then we have security with OpenShift, which overcomes many of the default security risks of running containers, and does much of this out of the box. If you want further information on securing an OpenShift cluster, kindly check out my course for Pluralsight on OpenShift Security and OpenShift Network Security.

OpenShift Container Platform (formerly known as OpenShift Enterprise), or OCP, is Red Hat's on-premises private platform as a service (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution.

The foundation of the OpenShift Container Platform, and of OpenShift network security, is Kubernetes, so it shares much of the same networking technology along with some enhancements. However, Kubernetes is a complex beast, and securing a cluster with Kubernetes alone can be a challenge. OpenShift does an excellent job of wrapping Kubernetes in a layer of security, such as Security Context Constraints (SCCs), which give your cluster a good security base.

**Security Context Constraints**

By default, OpenShift prevents containers in the cluster from accessing protected functions. These functions—Linux features such as shared file systems, root access, and some core capabilities such as KILL—can affect other containers running in the same Linux kernel, so the cluster limits access to them.

Most cloud-native applications work fine with these limitations, but some (especially stateful workloads) need greater access. Applications that require these functions can still use them but need the cluster’s permission.

The application’s security context specifies the permissions that the application needs, while the cluster’s security context constraints specify the permissions that the cluster allows. An SC with an SCC enables an application to request access while limiting the access that the cluster will grant.

What are security contexts and security context constraints?

A pod configures a container’s access with permissions requested in the pod’s security context and approved by the cluster’s security context constraints:

- A security context (SC), defined in a pod, enables a deployer to specify a container's permissions to access protected functions. When the pod creates the container, it configures it to allow these permissions and block all others. The cluster will only deploy the pod if the permissions it requests are permitted by a corresponding SCC.

- A security context constraint (SCC), defined in a cluster, enables an administrator to control pod permissions, which manage containers' access to protected Linux functions. Similarly to how role-based access control (RBAC) manages users' access to a cluster's resources, an SCC manages pods' access to Linux functions.

By default, a pod is assigned an SCC named restricted, which blocks access to protected functions, in OpenShift v4.10 or earlier. In OpenShift v4.11 and later, the restricted-v2 SCC is used by default instead. For an application to access protected functions, the cluster must make available to the pod an SCC that allows them.

SCC grants access to protected functions

While an SCC grants access to protected functions, each pod needing access must request it. To request access to the functions its application needs, a pod specifies those permissions in the security context field of the pod manifest. The manifest also specifies the service account that should be able to grant this access.

When the manifest is deployed, the cluster associates the pod with the service account associated with the SCC. For the cluster to deploy the pod, the SCC must grant the permissions that the pod requests.

One way to envision this relationship is to think of the SCC as a lock that protects Linux functions, while the manifest is the key. The pod is allowed to deploy only if the key fits.
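
To make this concrete, here is a minimal sketch, using the Kubernetes Python client, of a pod manifest that requests only unprivileged access in its security context and names the service account the SCC would be associated with. The namespace, service account, and image names are hypothetical placeholders, not values from any particular cluster.

```python
# Minimal sketch (Python kubernetes client): a pod that requests only
# unprivileged access in its security context. The "demo" namespace,
# "restricted-sa" service account, and image are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo-app", namespace="demo"),
    spec=client.V1PodSpec(
        service_account_name="restricted-sa",  # account the SCC is bound to
        containers=[
            client.V1Container(
                name="app",
                image="registry.example.com/demo-app:latest",
                ports=[client.V1ContainerPort(container_port=8080)],  # unprivileged port
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,
                    allow_privilege_escalation=False,
                    read_only_root_filesystem=True,
                    capabilities=client.V1Capabilities(drop=["ALL"]),
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="demo", body=pod)
```

If the restricted (or restricted-v2) SCC associated with that service account permits these requests, the pod deploys; if the manifest asked for more, such as privileged mode, admission would be denied.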

Diagram: Security context constraints. Source: IBM

A final note: Security context constraint

When your application is deployed to OpenShift in a virtual data center design, the default security model enforces that it runs using an assigned Unix user ID unique to the project in which it is deployed. This prevents images from running as the Unix root user. When hosting an application using OpenShift, the user ID that a container runs as is assigned based on the project it is running in.

Containers cannot run as the root user by default—a big win for security. SCCs also allow you to set different restrictions and security configurations for pods.

So, instead of allowing your image to run as the root, which is a considerable security risk, you should run as an arbitrary user by specifying an unprivileged USER, setting the appropriate permissions on files and directories, and configuring your application to listen on unprivileged ports.

OpenShift Network Security

SCC default access:

Security context constraints let you drop privileges by default, which is essential and still the best practice. Red Hat OpenShift security context constraints (SCCs) ensure that no privileged containers run on OpenShift worker nodes by default—another big win for security. Access to the host network and host process IDs is denied by default. Users with the required permissions can adjust the default SCC policies to be more permissive.

So, when considering SCCs, think of SCC admission control as restricting pod access, similar to how RBAC restricts user access. To control the behavior of pods, we have security context constraints (SCCs): cluster-level resources that define what resources pods can access and provide additional control.

Restricted security context constraints (SCCs):

A few SCCs are available by default, and you may have heard of the restricted SCC. By default, all pods, except those for builds and deployments, use a default service account assigned the restricted SCC, which doesn't allow privileged containers, that is, containers running as the root user or listening on privileged ports (ports below 1024). SCCs can be used to manage the following:

    1. Privileged mode: allows or denies running a container in privileged mode. Privileged mode bypasses restrictions such as control groups, Linux capabilities, and secure computing profiles.
    2. Privilege escalation: enables or disables privilege escalation inside the container (all privilege escalation flags).
    3. Linux capabilities: allows the addition or removal of specific Linux capabilities.
    4. Seccomp profile: controls which secure computing profiles are used in a pod.
    5. Read-only root filesystem: makes the root filesystem read-only.

The goal is to assign the fewest possible capabilities for a pod to function fully. This least-privileged model ensures that pods can’t perform tasks on the system that aren’t related to their application’s proper function. The default value for the privileged option is False; setting the privileged option to True is the same as giving the pod the capabilities of the root user on the system. Although doing so shouldn’t be common practice, privileged pods can be helpful under certain circumstances. 

OpenShift Network Security: Authentication

Authentication refers to the process of validating one’s identity. Usually, users aren’t created in OpenShift but are provided by an external entity, such as the LDAP server or GitHub. The only part where OpenShift steps in is authorization—determining roles and permissions for a user.

OpenShift supports integration with various identity management solutions in corporate environments, such as FreeIPA/Identity Management, Active Directory, GitHub, Gitlab, OpenStack Keystone, and OpenID.

OpenShift Network Security: Users and identities

A user is any human actor who can request the OpenShift API to access resources and perform actions. Users are typically created in an external identity provider, usually a corporate identity management solution such as Lightweight Directory Access Protocol (LDAP) or Active Directory.

To support multiple identity providers, OpenShift relies on the concept of identities as a bridge between users and identity providers. By default, a new user and identity are created upon first login. There are four ways to map users to identities (claim, lookup, generate, and add), configured through the identity provider's mapping method.

OpenShift Network Security: Service accounts

Service accounts allow us to control API access without sharing users' credentials. Pods and other non-human actors use them to perform various actions, and they are the central vehicle by which their access to resources is managed. By default, three service accounts are created in each project: builder, deployer, and default.

OpenShift Network Security:  Authorization and role-based access control

Authorization in OpenShift is built around the following concepts:

Rules: sets of actions allowed to be performed on specific resources.
Roles: collections of rules that can be applied to a user according to a specific profile. They can be used at either the cluster or the project level.
Role bindings: associations between users or groups and roles. A given user or group can be associated with multiple roles.

If pre-defined roles aren’t sufficient, you can always create custom roles with just the specific rules you need.
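
As a rough illustration of rules and roles, the sketch below uses the Kubernetes Python client to create a custom role containing a single read-only rule for pods; it could then be bound to a user or group with a role binding. The namespace and role name are hypothetical.

```python
# Minimal sketch (Python kubernetes client): a custom Role with one
# read-only rule for pods. The "demo" namespace and role name are
# hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace="demo"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],              # "" is the core API group
            resources=["pods"],
            verbs=["get", "list", "watch"],
        )
    ],
)

client.RbacAuthorizationV1Api().create_namespaced_role(namespace="demo", body=role)
```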

Summary: OpenShift Security Best Practices

In the ever-evolving technological landscape, ensuring the security of your applications is of utmost importance. OpenShift, a powerful containerization platform, offers robust security features to protect your applications and data. This blog post explored some essential OpenShift security best practices to help you fortify your applications and safeguard sensitive information.

Understand the OpenShift Security Model

OpenShift follows a layered security model that provides multiple levels of protection. Understanding this model is crucial to implementing adequate security measures. From authentication and authorization mechanisms to network policies and secure container configurations, OpenShift offers a comprehensive security framework.

Implement Strong Authentication Mechanisms

Authentication is the first line of defense against unauthorized access. OpenShift supports various authentication methods, including username/password, token-based, and integration with external authentication providers like LDAP or Active Directory. Implementing robust authentication mechanisms ensures that only trusted users can access your applications and resources.

Apply Fine-Grained Authorization Policies

Authorization is vital in controlling users’ actions within the OpenShift environment. By defining fine-grained access control policies, you can limit privileges to specific users or groups. OpenShift’s Role-Based Access Control (RBAC) allows you to assign roles with different levels of permissions, ensuring that each user has appropriate access rights.

Secure Container Configurations

Containers are at the heart of OpenShift deployments; securing them is crucial for protecting your applications. Employing best practices such as using trusted container images, regularly updating base images, and restricting container capabilities can significantly reduce the risk of vulnerabilities. OpenShift’s security context constraints enable you to define and enforce security policies for containers, ensuring they run with the minimum required privileges.

Enforce Network Policies

OpenShift provides network policies that enable you to define traffic flow rules between your application’s different components. By implementing network policies, you can control inbound and outbound traffic, restrict access to specific ports or IP ranges, and isolate sensitive components. This helps prevent unauthorized communication and protects your applications from potential attacks.
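
As an illustration of the idea, the following sketch uses the Kubernetes Python client to create a NetworkPolicy that only admits ingress traffic to backend pods from pods labelled app=frontend on port 8080. The namespace, labels, and port are hypothetical placeholders.

```python
# Minimal sketch (Python kubernetes client): a NetworkPolicy that allows
# ingress to "backend" pods only from pods labelled app=frontend on
# port 8080. Namespace and label values are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-frontend-to-backend", namespace="demo"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "backend"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(match_labels={"app": "frontend"})
                    )
                ],
                ports=[client.V1NetworkPolicyPort(port=8080, protocol="TCP")],
            )
        ],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(namespace="demo", body=policy)
```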

Conclusion:

Securing your applications on OpenShift requires a multi-faceted approach, encompassing various layers of protection. By understanding the OpenShift security model, implementing strong authentication and authorization mechanisms, securing container configurations, and enforcing network policies, you can enhance the overall security posture of your applications. Stay vigilant, keep up with the latest security updates, and regularly assess your security measures to mitigate potential risks effectively.

System Observability

Distributed Systems Observability

Distributed Systems Observability

In the realm of modern technology, distributed systems have become the backbone of numerous applications and services. However, the increasing complexity of such systems poses significant challenges when it comes to monitoring and understanding their behavior. This is where observability steps in, offering a comprehensive solution to gain insights into the intricate workings of distributed systems. In this blog post, we will embark on a captivating journey into the realm of distributed systems observability, exploring its key concepts, tools, and benefits.

Observability, as a concept, enables us to gain deep insights into the internal state of a system based on its external outputs. When it comes to distributed systems, observability takes on a whole new level of complexity. It encompasses the ability to effectively monitor, debug, and analyze the behavior of interconnected components across a distributed architecture. By employing various techniques and tools, observability allows us to gain a holistic understanding of the system's performance, bottlenecks, and potential issues.

To achieve observability in distributed systems, it is crucial to focus on three interconnected components: logs, metrics, and traces.

Logs: Logs provide a chronological record of events and activities within the system, offering valuable insights into what has occurred. By analyzing logs, engineers can identify anomalies, track down errors, and troubleshoot issues effectively.

Metrics: Metrics, on the other hand, provide quantitative measurements of the system's performance and behavior. They offer a rich source of data that can be analyzed to gain a deeper understanding of resource utilization, response times, and overall system health.

Traces: Traces enable the visualization and analysis of transactions as they traverse through the distributed system. By capturing the flow of requests and their associated metadata, traces allow engineers to identify bottlenecks, latency issues, and performance optimizations.

In the ever-evolving landscape of distributed systems observability, a plethora of tools and frameworks have emerged to simplify the process. Prominent examples include:

1. Prometheus: A powerful open-source monitoring and alerting system that excels in collecting and storing metrics from distributed environments.

2. Jaeger: An end-to-end distributed tracing system that enables the visualization and analysis of transaction flows across complex systems.

3. ELK Stack: A comprehensive combination of Elasticsearch, Logstash, and Kibana, which collectively offer powerful log management, analysis, and visualization capabilities.

4. Grafana: A widely-used open-source platform for creating rich and interactive dashboards, allowing engineers to visualize metrics and logs in real-time.

The adoption of observability in distributed systems brings forth a multitude of benefits. It empowers engineers and DevOps teams to proactively detect and diagnose issues, leading to faster troubleshooting and reduced downtime. Observability also aids in capacity planning, resource optimization, and identifying performance bottlenecks. Moreover, it facilitates collaboration between teams by providing a shared understanding of the system's behavior and enabling effective communication.

In the ever-evolving landscape of distributed systems, observability plays a pivotal role in unraveling the complexity and gaining insights into system behavior. By leveraging the power of logs, metrics, and traces, along with robust tools and frameworks, engineers can navigate the intricate world of distributed systems with confidence. Embracing observability empowers organizations to build resilient, high-performing systems that can withstand the challenges of today's digital landscape.

Highlights: Distributed Systems Observability

The Role of Distributed Systems

A – Several decades ago, only a handful of mission-critical services worldwide were required to meet the availability and reliability requirements of today’s always-on applications and APIs. In response to user demand, every application must be built to scale nearly instantly to accommodate the potential for rapid, viral growth. Almost every app built today—whether a mobile app for consumers or a backend payment system—must meet these constraints and requirements.

B – Inherently, distributed systems are more reliable due to their distributed nature. When software engineers design these systems appropriately, they can also benefit from more scalable organizational models. There is, however, a price to pay for these advantages.

C – Designing, building, and debugging these distributed systems can be challenging. A reliable distributed system requires significantly more engineering skills than a single-machine application, such as a mobile app or a web frontend. Regardless, distributed systems are becoming increasingly important. There is a corresponding need for tools, patterns, and practices to build them.

D – As digital transformation accelerates, organizations adopt multicloud environments to drive secure innovation and achieve speed, scale, and agility. As a result, technology stacks are becoming increasingly complex and scalable. Today, even the most straightforward digital transaction is supported by an array of cloud-native services and platforms delivered by various providers. To improve user experience and resilience, IT and security teams must monitor and manage their applications.

**Key Components of Observability**

Observability in distributed systems typically relies on three pillars: logs, metrics, and traces. Logs provide detailed records of events within the system, offering context for debugging issues. Metrics offer quantitative data, such as CPU usage and request rates, allowing teams to monitor system health and performance over time. Traces enable the tracking of requests as they move through the system, helping to pinpoint where latency or failures occur. Together, these components create a comprehensive picture of the system’s state and behavior.
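
As a small example of the metrics pillar, the sketch below uses the prometheus_client Python library to expose a request counter and a latency histogram on a scrape endpoint. The metric names, labels, and port are illustrative assumptions, not a prescribed naming scheme.

```python
# Minimal sketch of the metrics pillar with prometheus_client: a request
# counter and a latency histogram exposed for Prometheus to scrape.
# Metric names and the port are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("demo_requests_total", "Total requests handled", ["endpoint", "status"])
LATENCY = Histogram("demo_request_latency_seconds", "Request latency in seconds", ["endpoint"])

def handle_request(endpoint: str) -> None:
    with LATENCY.labels(endpoint=endpoint).time():   # records elapsed time on exit
        time.sleep(random.uniform(0.01, 0.2))        # stand-in for real work
    REQUESTS.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)                          # metrics served at :8000/metrics
    while True:
        handle_request("/checkout")
```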

**Challenges in Achieving Observability**

While observability is essential, achieving it in distributed systems is not without its challenges. The sheer volume of data generated by these systems can be overwhelming. Additionally, correlating data from disparate sources to form a cohesive narrative requires sophisticated tools and techniques. Moreover, ensuring that observability doesn’t introduce too much overhead or affect system performance is a delicate balancing act. Organizations must invest in the right infrastructure and expertise to tackle these challenges effectively.

**Best Practices for Enhancing Observability**

To maximize observability in distributed systems, organizations should adopt several best practices. Firstly, they should implement centralized logging and monitoring solutions that can aggregate data from all system components. Secondly, leveraging open standards like OpenTelemetry can facilitate consistent data collection and integration with various tools. Thirdly, incorporating automated alerting and anomaly detection can help teams proactively address issues before they impact users. Lastly, fostering a culture of collaboration between development and operations teams can ensure that observability is an ongoing, shared responsibility.

Cloud Service Mesh

### What is a Cloud Service Mesh?

A Cloud Service Mesh is a dedicated infrastructure layer that facilitates service-to-service communication in a microservices architecture. It abstracts the complex communication patterns between services into a manageable, secure, and observable framework. By deploying a service mesh, organizations can effectively manage the interactions of their microservices, ensuring seamless connectivity, security, and resilience.

### Key Benefits of Implementing a Cloud Service Mesh

1. **Enhanced Security**: A service mesh provides robust security features such as mutual TLS authentication, which encrypts communications between services. This ensures that data remains secure and tamper-proof as it travels across the network.

2. **Traffic Management**: With a service mesh, you can implement sophisticated traffic management policies, including load balancing, circuit breaking, and retries. This leads to improved performance and reliability of your distributed systems.

3. **Observability**: One of the standout features of a service mesh is its ability to provide deep observability into the interactions between services. Metrics, logs, and traces are collected and analyzed, offering invaluable insights into system health and performance.

### Enhancing Observability in Distributed Systems

Observability is a key concern in managing distributed systems. With the proliferation of microservices, tracking and understanding service interactions can become overwhelmingly complex. A Cloud Service Mesh addresses this challenge by offering comprehensive observability features:

– **Metrics Collection**: Collects real-time metrics on service performance, latency, error rates, and more.

– **Distributed Tracing**: Enables tracing of requests as they propagate through multiple services, helping identify bottlenecks and performance issues.

– **Centralized Logging**: Aggregates logs from various services, providing a unified view for easier troubleshooting and analysis.

These capabilities empower teams to detect issues early, optimize performance, and ensure the reliability of their applications.

### Real-World Applications and Use Cases

Several organizations have successfully implemented Cloud Service Meshes to transform their operations. For instance, financial institutions use service meshes to secure sensitive transactions, while e-commerce platforms leverage them to manage high traffic volumes during peak shopping seasons. By providing a robust framework for service communication, a service mesh enhances scalability, reliability, and security across industries.

Google's Ops Agent

Ops Agent is a lightweight agent that runs on your Compute Engine instances, collecting and forwarding metrics and logs to Google Cloud Monitoring and Logging. By installing Ops Agent on your instances, you gain real-time visibility into your Compute Engine’s performance and behavior.

To start monitoring your Compute Engine, you must install Ops Agent on your instances. The installation process is straightforward and can be done manually or through automation tools like Cloud Deployment Manager or Terraform. Once installed, the Ops Agent will automatically begin collecting metrics and logs from your Compute Engine.

Ops Agent allows you to customize the metrics and logs you want to monitor for your Compute Engine. Various options are available, allowing you to choose specific metrics and logs relevant to your application or system. By configuring metrics and logs, you can gain deeper insights and track the performance of critical components.

**Challenge: Fragmented Monitoring Tools**

Fragmented monitoring tools and manual analytics strategies challenge IT and security teams. The lack of a single source of truth and real-time insight makes it increasingly difficult for these teams to access the answers they need to accelerate innovation and optimize digital services. To gain insight, they must manually query data from various monitoring tools and piece together different sources of information.

This complex and time-consuming process distracts team members from driving innovation and creating new value for the business and customers. In addition, many teams monitor only their mission-critical applications due to the effort involved in managing all these tools, platforms, and dashboards. The result is a multitude of blind spots across the technology stack, which makes it harder for teams to gain insights.

**Challenge: Kubernetes is Complex**

Understanding how Kubernetes adds to the complexity of technology stacks is imperative. In the drive toward modern technology stacks, it is the platform of choice for organizations refactoring their applications for the cloud-native world. Through dynamic resource provisioning, Kubernetes architectures can quickly scale services to new users and increase efficiency.

However, the constant changes in cloud environments make it difficult for IT and security teams to maintain visibility into them. To provide observability in their Kubernetes environments, these teams cannot manually configure various traditional monitoring tools. The result is that they are often unable to gain real-time insights to improve user experience, optimize costs, and strengthen security. Due to this visibility challenge, many organizations are delaying moving more mission-critical services to Kubernetes.

GKE-Native Monitoring

The Basics of GKE-Native Monitoring

GKE-Native Monitoring is a comprehensive monitoring solution provided by Google Cloud Platform (GCP) designed explicitly for GKE clusters. It offers deep insights into your applications’ performance and behavior, allowing you to proactively detect and resolve issues. With GKE-Native Monitoring, you can easily collect and analyze metrics, monitor logs, and set up alerts to ensure the reliability and availability of your applications.

One of the critical features of GKE-Native Monitoring is its ability to collect and analyze metrics from your GKE clusters. It provides preconfigured dashboards that display essential metrics such as CPU usage, memory utilization, and network traffic. Additionally, you can create custom dashboards tailored to your specific requirements, allowing you to better understand your application’s performance and resource consumption.

The Role of Megatrends

We have seen a considerable drive of innovation that has spawned several megatrends, affecting how we manage and view our network infrastructure and creating the need for distributed systems observability. We have seen the decomposition of everything from one to many.

Instead of the monolith, where everything is generally housed internally, we now have many services and dependencies in multiple locations (microservices observability) that must be managed and operated. These megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different systems observability tools and network visibility practices.

Shift in Control

There has also been a shift in the point of control. As we move towards new technologies, many of the loosely coupled services or infrastructure your services depend on are not under your control. The edge of control has been pushed out, creating different network and security perimeters. These perimeters are now closer to the workload than a central security stack, so the workloads themselves must be concerned with security.

Example Product: Cisco AppDynamics

### What is Cisco AppDynamics?

Cisco AppDynamics is an application performance management (APM) solution designed to provide real-time visibility into the performance of your applications. It helps IT professionals identify bottlenecks, diagnose issues, and optimize performance, ensuring a seamless user experience. With its powerful analytics, you can gain deep insights into your application stack, from the user interface to the backend infrastructure.

### Key Features and Capabilities

#### Real-Time Monitoring

One of the standout features of Cisco AppDynamics is its ability to monitor applications in real-time. This allows IT teams to detect and resolve issues as they occur, minimizing downtime and ensuring a smooth user experience. Real-time monitoring covers everything from user interactions to server performance, providing a comprehensive view of your application’s health.

#### End-User Experience Monitoring

Understanding how users interact with your application is crucial for delivering a high-quality experience. Cisco AppDynamics offers end-user experience monitoring, which tracks user sessions and interactions. This data helps you identify any pain points or performance issues that may be affecting user satisfaction.

#### Business Transaction Monitoring

Cisco AppDynamics takes a unique approach to monitoring by focusing on business transactions. By tracking the performance of individual transactions, you can gain a clearer understanding of how different parts of your application are performing. This level of granularity allows for more targeted optimizations and quicker issue resolution.

### Benefits of Using Cisco AppDynamics

#### Improved Application Performance

With its comprehensive monitoring and diagnostic capabilities, Cisco AppDynamics helps you identify and resolve performance issues quickly. This leads to faster load times, fewer errors, and an overall improved user experience.

#### Enhanced Operational Efficiency

By automating many of the monitoring and diagnostic processes, Cisco AppDynamics reduces the workload on your IT team. This allows them to focus on more strategic initiatives, driving greater value for your business.

#### Better Decision Making

The insights provided by Cisco AppDynamics enable better decision-making at all levels of your organization. Whether you’re looking to optimize resource allocation or plan for future growth, the data and analytics provided can inform your strategies and drive better outcomes.

### Integrations and Flexibility

Cisco AppDynamics offers seamless integrations with a wide range of third-party tools and platforms. Whether you’re using cloud services like AWS and Azure or CI/CD tools like Jenkins and GitHub, AppDynamics can integrate into your existing workflows, providing a unified view of your application’s performance.

For pre-information, you may find the following posts helpful:

  1. Observability vs Monitoring 
  2. Prometheus Monitoring
  3. Network Functions

Distributed Systems Observability

Distributed Systems

Today’s world of always-on applications and APIs has availability and reliability requirements that, only a few decades ago, applied to just a handful of mission-critical services around the globe. Likewise, the potential for rapid, viral service growth means that every application has to be built to scale nearly instantly in response to user demand.

Finally, these constraints and requirements mean that almost every application made—whether a consumer mobile app or a backend payments application—needs to be a distributed system. A distributed system is an environment where different components are spread across multiple computers on a network. These devices split up the work, harmonizing their efforts to complete the job more efficiently than if a single device had been responsible.

**The Key Components of Observability**

Observability in distributed systems is achieved through three main components: monitoring, logging, and tracing.

1. Monitoring: Monitoring involves continuously collecting and analyzing system metrics and performance indicators. It provides real-time visibility into the health and performance of the distributed system. By monitoring various metrics such as CPU usage, memory consumption, network traffic, and response times, engineers can proactively identify anomalies and make informed decisions to optimize system performance.

2. Logging: Logging involves recording events, activities, and errors within the distributed system. Log data provides a historical record that can be analyzed to understand system behavior and debug issues. Distributed systems generate vast amounts of log data, and effective log management practices, such as centralized log storage and log aggregation, are crucial for efficient troubleshooting (a short logging sketch follows this list).

3. Tracing: Tracing involves capturing the flow of requests and interactions between different distributed system components. It allows engineers to trace the journey of a specific request and identify potential bottlenecks or performance issues. Tracing is particularly useful in complex distributed architectures where multiple services interact.
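
Here is the short logging sketch referred to above: structured, JSON-formatted log records written to stdout so a centralized pipeline (such as the ELK stack mentioned earlier) can aggregate and query them. The service name, field names, and trace_id correlation are hypothetical choices, not a prescribed schema.

```python
# A short sketch of the logging pillar: structured JSON log records that
# a centralized pipeline can index and query. Field names and the
# "checkout" service name are hypothetical.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout",                          # hypothetical service name
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # correlate with traces
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"trace_id": "abc123"})
```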

**Benefits of Observability in Distributed Systems**

Adopting observability practices in distributed systems offers several benefits:

1. Enhanced Troubleshooting: Observability enables engineers to quickly identify and resolve issues by providing detailed insights into system behavior. With real-time monitoring, log analysis, and tracing capabilities, engineers can pinpoint the root cause of problems and take appropriate actions, minimizing downtime and improving system reliability.

2. Performance Optimization: By closely monitoring system metrics, engineers can identify performance bottlenecks and optimize system resources. Observability allows for proactive capacity planning and efficient resource allocation, ensuring optimal performance even under high loads.

3. Efficient Change Management: Observability facilitates monitoring system changes and their impact on overall performance. Engineers can track changes in metrics and easily identify any deviations or anomalies caused by updates or configuration changes. This helps maintain system stability and avoid unexpected issues.

What are VPC Flow Logs?

VPC Flow Logs is a feature offered by Google Cloud that captures and records network traffic information within Virtual Private Cloud (VPC) networks. Each network flow is logged, providing a comprehensive view of the traffic traversing the network. These logs include valuable information such as source and destination IP addresses, ports, protocol, and packet counts.

Once the VPC Flow Logs are enabled and data is being recorded, we can start leveraging the power of analysis. Google Cloud provides several tools and services for analyzing VPC Flow Logs. One such tool is BigQuery, a scalable and flexible data warehouse. By exporting VPC Flow Logs to BigQuery, we can perform complex queries, visualize traffic patterns, and detect anomalies using industry-standard SQL queries.
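
As a hedged sketch of that workflow, the snippet below uses the BigQuery Python client to find the top talkers in exported flow logs. The project, dataset, and table names, as well as the jsonPayload field paths, are assumptions that depend on how your log sink was configured.

```python
# Minimal sketch: querying exported VPC Flow Logs with the BigQuery
# Python client. The dataset and table names are hypothetical; they
# depend on how the log export sink was configured in your project.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  jsonPayload.connection.src_ip  AS src_ip,
  jsonPayload.connection.dest_ip AS dest_ip,
  COUNT(*)                       AS flow_count
FROM `my-project.flow_logs.compute_googleapis_com_vpc_flows`   -- hypothetical table
GROUP BY src_ip, dest_ip
ORDER BY flow_count DESC
LIMIT 10
"""

for row in client.query(sql).result():     # runs the query and iterates results
    print(row.src_ip, row.dest_ip, row.flow_count)
```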

**How This Affects Failures**

The primary issue I have seen with my clients is that application failures are no longer predictable, and dynamic systems can fail in creative ways, challenging existing monitoring solutions and, more importantly, the practices that support them. We see a lot of partial failures that are not just unexpected but previously unknown or never seen before. For example, if you recall, we have the network hero.

**The network Hero**

The network hero is someone who knows every part of the network and has seen every failure at least once. These people are less helpful in today's world; we need proper observability instead. When I was working as an engineer, we would have plenty of failures, but more than likely we would have seen them before, and there was a system in place to fix the error. Today's environment is much different.

We can no longer rely on simply seeing UP or DOWN, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds considers the customer's perspective. If your pod runs at 80% CPU, does that mean the customer is unhappy?

When monitoring, you should look from your customers' perspective at what matters to them. Content delivery networks (CDNs) were among the first to realize this and to measure what matters most to the customer.

Distributed Systems Observability

The different demands

So, the new, modern, and complex distributed systems place very different demands on your infrastructure and the people who manage it. For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and therefore be slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice could be fine, but there could be slow-running database queries.
    • In short, we have a lot of partial failures.

Consequently, we can no longer predict

The significant shift we see with software platforms is that they evolve much quicker than the products and paradigms we use to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams and sound system observability. We can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring.

I’m not saying that these monitoring tools are not doing what you want them to do. But they work in siloed environments, and there is a lack of connectivity: monitoring tools in different parts of the organization, more than likely managed by different people, trying to monitor a very dispersed application with multiple components and services in various places.

Relying On Known Failures

Metric-Based Approach

A metrics-based monitoring approach relies on having previously encountered known failure modes; it depends on known failures and predictable failure modes. So we set predictable thresholds beyond which the system is considered to be behaving abnormally.

Monitoring can detect when these systems go over or under the previously set thresholds. Then we can set alerts, and we hope that these alerts are actionable. This is only useful for variants of predictable failure modes.

Traditional metrics and monitoring tools can tell you about performance spikes or that a problem has occurred. But they don’t let you dig into the source of the issues, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it harder to get to the root cause in a reasonable timeframe.

Google Cloud Trace

Example: Application Latency & Cloud Trace

Before we discuss Cloud Trace’s specifics, let’s establish a clear understanding of application latency. Latency refers to the time delay between a user’s action or request and the corresponding response from the application. It includes network latency, server processing time, and database query execution time. By comprehending the different factors contributing to latency, developers can proactively optimize their applications for improved performance.

Google Cloud Trace is a powerful diagnostic tool offered by Google Cloud Platform (GCP) that enables developers to identify and analyze application latency bottlenecks. It provides detailed insights into the flow of requests and events within an application, allowing developers to pinpoint areas of concern and optimize accordingly. Cloud Trace integrates seamlessly with other GCP services and provides a comprehensive view of latency across various components of an application stack.

Traditional style metrics systems

With traditional metrics systems, you had to define custom metrics, and they were always defined upfront. This approach prevents us from asking new questions when problems arise: you had to determine the questions to ask before the fact.

Then we set performance thresholds, pronounced them “good” or “bad,” and checked and re-checked those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to predict how a system can fail. Always observe instead of waiting for problems, such as reaching a certain threshold, before acting.

**Metrics: Lack of connective event**

The metrics did not retain the connecting events, so you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is actually happening.

An example of this could be an odd number of running threads on one component, which might indicate garbage collection is in progress or that slow response times are imminent in an upstream service.

**User experience and static thresholds**

User experience means different things to different sets of users. We now have a model where different service users may be routed through the system in other ways, using various components and providing experiences that can vary widely. We also know now that the services no longer tend to break in the same few predictable ways over and over.  

We should have only a few alerts, triggered by symptoms that directly impact user experience, not merely because a threshold was reached.

The Challenge: Can’t reliably indicate any issues with user experience

If you use static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in this regard. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience.

However, modern systems change shape dynamically under different workloads. Static monitoring thresholds can’t reflect impacts on user experience. They lack context and are too coarse.

Required: Distributed Systems Observability

Systems observability and reliability in distributed systems are practices. Rather than just focusing on a tool that logs, measures, or alerts, observability is all about how you approach problems, and for this, you need to look at your culture. So you could say that observability is a cultural practice that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past.

Nowadays, we need a different viewpoint and want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual) and the network, and what the data looks like in transit and at rest.

Levels of Abstraction

What level of observation is needed to ensure everything performs as it should? What should you look at to obtain this level of detail?

Monitoring is knowing the data points and the entities from which we gather information. Observability, on the other hand, is putting all the data together. So monitoring is collecting data, and observability is putting it together in one single pane of glass. Observability is observing the different patterns and deviations from the baseline; monitoring is getting the data into the systems. A vital part of an observability toolkit is service level objectives (SLOs).
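
To make the SLO idea tangible, here is a tiny worked example of error-budget arithmetic: a 99.9% availability objective over a 30-day window leaves roughly 43 minutes of allowable downtime. The target and measured values are illustrative only.

```python
# A small worked example of the SLO idea: given an availability target,
# how much "error budget" (allowed downtime) remains in a 30-day window?
# The target and measured values are illustrative only.
SLO_TARGET = 0.999             # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
measured_downtime_minutes = 12.0      # hypothetical downtime observed so far

remaining = error_budget_minutes - measured_downtime_minutes
print(f"Total error budget: {error_budget_minutes:.1f} min")   # ~43.2 min
print(f"Remaining budget:   {remaining:.1f} min")              # ~31.2 min
```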

**Preference: Distributed Tracing**

We have three pillars of Systems Observability. There are Metrics, Traces, and Logging. So, defining or viewing Observability as having these pillars is an oversimplification. But for Observability, you need these in place. Observability is all about connecting the dots from each of these pillars.

If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in service request executions. As a result, it doesn’t matter if services have complex dependencies. You could say that the complexity of the Dynamic systems is abstracted with distributed tracing.

**Use Case: Challenges without tracing**

For example, latency can stack up if a downstream database service experiences performance bottlenecks, resulting in high end-to-end latency. When latency is detected three or four layers upstream, it can be complicated to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

**Distributed tracing: A winning formula**

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them challenging to debug unless their relationships are clearly understood.
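
As a minimal sketch of distributed tracing, the snippet below uses the OpenTelemetry Python SDK to create two nested spans standing in for a request that calls a downstream database. The span names and attributes are hypothetical, and a real deployment would export to a backend such as Jaeger rather than the console.

```python
# Minimal sketch of distributed tracing with the OpenTelemetry Python SDK:
# two nested spans standing in for a request that calls a downstream
# database. Span and attribute names are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_checkout") as span:
    span.set_attribute("http.method", "POST")
    with tracer.start_as_current_span("query_orders_db") as child:
        child.set_attribute("db.system", "postgresql")
        # the downstream query would run here; its latency is attributed to this span
```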

In distributed systems, observability is vital in ensuring complex architectures’ stability, performance, and reliability. Monitoring, logging, and tracing provide engineers with the tools to understand system behavior, troubleshoot issues, and optimize performance. By adopting observability practices, organizations can effectively manage their distributed systems and provide seamless and reliable services to their users.

Summary: Distributed Systems Observability

In the vast landscape of distributed systems, observability is crucial in ensuring reliable and efficient functioning. This blog post delves into the critical components of distributed systems observability and sheds light on their significance.

Telemetry

Telemetry forms the foundation of observability in distributed systems. It involves collecting, processing, and analyzing various metrics, logs, and traces. By monitoring and measuring these data points, developers gain valuable insights into the performance and behavior of their distributed systems.

Logging

Logging is an essential component of observability, providing a detailed record of events and activities within a distributed system. It captures important information such as errors, warnings, and informational messages, which aids in troubleshooting and debugging. Properly implemented logging mechanisms enable developers to identify and resolve issues promptly.

Metrics

Metrics are quantifiable measurements that provide a high-level view of the health and performance of a distributed system. They offer valuable insights into resource utilization, throughput, latency, error rates, and other critical indicators. By monitoring and analyzing metrics, developers can proactively identify bottlenecks, optimize performance, and ensure the smooth operation of their systems.

Tracing

Tracing allows developers to understand the flow and behavior of requests as they traverse through a distributed system. It provides detailed information about the path a request takes, including the various services and components it interacts with. Tracing is instrumental in diagnosing and resolving performance issues, as it highlights potential latency hotspots and bottlenecks.

Alerting and Visualization

Alerting mechanisms and visualization tools are vital for effective observability in distributed systems. Alerts notify developers when certain predefined thresholds or conditions are met, enabling them to take timely action. Visualization tools provide intuitive and comprehensive representations of system metrics, logs, and traces, making identifying patterns, trends, and anomalies easier.

Conclusion

In conclusion, the key components of distributed systems observability, namely telemetry, logging, metrics, tracing, alerting, and visualization, form a comprehensive toolkit for monitoring and understanding the intricacies of such systems. By leveraging these components effectively, developers can ensure their distributed systems’ reliability, performance, and scalability.

Reliability in Distributed Systems

Reliability in Distributed Systems

Reliability in Distributed Systems

Distributed systems have become an integral part of our modern technological landscape. Whether it's cloud computing, internet banking, or online shopping, these systems play a crucial role in providing seamless services to users worldwide. However, as distributed systems grow in complexity, ensuring their reliability becomes increasingly challenging.

In this blog post, we will explore the concept of reliability in distributed systems and discuss various techniques to achieve fault-tolerant operations.

Reliability in distributed systems refers to the ability of the system to consistently function as intended, even in the presence of hardware failures, network partitions, and other unforeseen events. To achieve reliability, system designers employ various techniques, such as redundancy, replication, and fault tolerance, to minimize the impact of failures and ensure continuous service availability.
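
One common fault-tolerance building block behind those techniques is retrying a flaky remote call with exponential backoff and jitter, so transient failures are absorbed rather than surfaced to users. The sketch below is illustrative; the operation being retried is a hypothetical placeholder.

```python
# Illustrative fault-tolerance building block: retry a flaky remote call
# with exponential backoff and jitter. The operation passed in is a
# hypothetical placeholder for any remote call.
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.2):
    """Retry `operation` on exception, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                     # give up after the last attempt
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))  # add jitter to avoid thundering herds

# Usage: call_with_retries(lambda: fetch_order("1234"))   # fetch_order is hypothetical
```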

Highlights: Reliability In Distributed System

Understanding Distributed Systems

At their core, distributed systems consist of multiple interconnected nodes working together to achieve a common goal. These nodes can be geographically dispersed and communicate through various protocols. Understanding the structure and behavior of distributed systems is crucial before exploring reliability measures.

To grasp the inner workings of distributed systems, it’s essential to familiarize ourselves with their key components. These include communication protocols, consensus algorithms, fault tolerance mechanisms, and distributed data storage. Each component plays a crucial role in ensuring the reliability and efficiency of distributed systems.

– Challenges and Risks: Reliability in distributed systems faces several challenges and risks due to their inherent nature. Network failures, node crashes, message delays, and data inconsistency are common issues compromising system reliability. Furthermore, the complexity of these systems amplifies the difficulty of diagnosing and resolving failures promptly.

– Replication and Redundancy: To mitigate the risks associated with distributed systems, replication and redundancy techniques are employed. Replicating data and functionalities across multiple nodes ensures fault tolerance and enhances reliability. The system can continue to operate with redundant components even if specific nodes fail.

– Consistency and Coordination: Maintaining data consistency is crucial in distributed systems. Distributed consensus protocols, such as the Paxos or Raft consensus algorithms, ensure that all nodes agree on the same state despite failures or network partitions. Coordinating actions among distributed nodes is essential to prevent conflicts and ensure reliable system behavior.

– Monitoring and Failure Detection: Continuous monitoring and failure detection mechanisms are essential for identifying and resolving issues promptly. Various monitoring tools and techniques, such as heartbeat protocols and health checks, can help detect failures and initiate recovery processes. Proactive monitoring and regular maintenance significantly contribute to the overall reliability of distributed systems.
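
As a toy sketch of heartbeat-based failure detection, the snippet below records the last time each node was heard from and flags anything silent for longer than a timeout as suspected failed. Node names and the timeout value are illustrative.

```python
# Toy sketch of heartbeat-style failure detection: record the last time
# each node was heard from and flag anything silent beyond a timeout.
# Node names and the timeout are illustrative.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout = timeout_seconds
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        """Record that a heartbeat arrived from `node`."""
        self.last_seen[node] = time.monotonic()

    def suspected_failed(self) -> list[str]:
        """Return nodes that have been silent longer than the timeout."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.heartbeat("node-a")
monitor.heartbeat("node-b")
# a monitoring loop would periodically call monitor.suspected_failed()
```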

**Shift in Landscape**

When considering reliability in a distributed system, considerable shifts in our environmental landscape have caused us to examine how we operate and run our systems and networks. We have had a mega change with the introduction of various cloud platforms and their services and containers.

In addition, the complexity of managing distributed systems observability and microservices observability has unveiled significant gaps in current practices and technologies, not to mention flaws in the operational practices around these technologies.

Managed Instance Groups (MIGs)

 

**Understanding the Basics of Managed Instance Groups**

Managed Instance Groups are collections of virtual machine (VM) instances that are treated as a single entity. They are designed to simplify the management of multiple instances by automating tasks like scaling, updating, and load balancing. With MIGs, you can ensure that your application has the right number of instances running at any given time, responding dynamically to changes in demand.

Google Cloud’s MIGs make it easy to deploy applications with high availability and reliability. By using templates, you can define the configuration for all instances in the group, ensuring consistency and reducing the potential for human error.

**Ensuring Reliability in Distributed Systems**

Reliability is a critical component of any distributed system, and Managed Instance Groups play a significant role in achieving it. By distributing workloads across multiple instances and regions, MIGs help prevent single points of failure. If an instance fails, the group automatically replaces it with a new one, minimizing downtime and ensuring continuous service availability.

Moreover, Google Cloud’s infrastructure ensures that your instances are backed by a robust network, providing low-latency access and high-speed connectivity. This further enhances the reliability of your applications and services, giving you peace of mind as you scale.

**Scaling with Ease and Flexibility**

One of the standout features of Managed Instance Groups is their ability to scale quickly and efficiently. Whether you’re dealing with sudden spikes in traffic or planning for steady growth, MIGs offer flexible scaling policies to meet your needs. You can scale based on CPU utilization, load balancing capacity, or even custom metrics, allowing for precise control over your application’s performance.
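
For intuition, the sketch below shows the kind of target-tracking calculation an autoscaler performs when scaling on CPU utilization: desired capacity grows in proportion to how far the observed metric sits above the target. This is an illustrative formula, not the MIG autoscaler's internal implementation.

```python
# Illustrative only: a target-tracking scaling calculation of the kind an
# autoscaler performs when scaling on CPU utilization. Not the MIG
# autoscaler's internal code, just the general formula.
import math

def desired_instances(current: int, observed_cpu: float, target_cpu: float,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    desired = math.ceil(current * observed_cpu / target_cpu)
    return max(min_instances, min(max_instances, desired))

# Example: 4 instances at 90% average CPU with a 60% target -> 6 instances.
print(desired_instances(current=4, observed_cpu=0.90, target_cpu=0.60))
```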

Google Cloud’s autoscaling capabilities mean you only pay for the resources you use, optimizing cost-efficiency while maintaining high performance. This flexibility makes MIGs an ideal choice for businesses looking to grow their cloud infrastructure without unnecessary expenditure.

**Integrating with Google Cloud Ecosystem**

Managed Instance Groups seamlessly integrate with other Google Cloud services, providing a cohesive ecosystem for your applications. They work in harmony with Cloud Load Balancing to distribute traffic efficiently and with Stackdriver for monitoring and logging, giving you comprehensive insights into your application’s performance.

By leveraging Google Cloud’s extensive suite of tools, you can build, deploy, and manage applications with greater agility and confidence. This integration streamlines operations and simplifies the complexities of managing a distributed system.

Managed Instance Group

Example Product: Cisco AppDynamics

### What is Cisco AppDynamics?

Cisco AppDynamics is an application performance management (APM) solution that offers deep insights into your application’s performance, user experience, and business impact. It helps IT teams detect, diagnose, and resolve issues quickly, ensuring a seamless digital experience for end-users. By leveraging machine learning and artificial intelligence, AppDynamics provides actionable insights to optimize your applications.

### Key Features of Cisco AppDynamics

#### Real-Time Performance Monitoring

With Cisco AppDynamics, you can monitor the performance of your applications in real-time. This feature allows you to detect anomalies and performance issues as they happen, ensuring that you can address them before they impact your users.

#### End-User Monitoring

Understanding how your users interact with your applications is crucial. AppDynamics offers end-user monitoring, which provides visibility into the user journey, from the front-end user interface to the back-end services. This helps you identify and resolve issues that directly affect user experience.

#### Business Transaction Monitoring

AppDynamics breaks down your application into business transactions, which are critical user interactions within your application. By monitoring these transactions, you can gain insights into how your application supports key business processes and identify areas for improvement.

#### AI-Powered Analytics

The platform’s AI-powered analytics enable you to predict and prevent performance issues before they occur. By analyzing historical data and identifying patterns, AppDynamics helps you proactively manage your application’s performance.

### Benefits of Using Cisco AppDynamics

#### Improved Application Performance

By continuously monitoring your application’s performance, AppDynamics helps you identify and resolve issues quickly, ensuring that your application runs smoothly and efficiently.

#### Enhanced User Experience

With end-user monitoring, you can gain insights into how users interact with your application and address any issues that may affect their experience. This leads to increased user satisfaction and retention.

#### Better Business Insights

Business transaction monitoring provides a clear understanding of how your application supports critical business processes. This helps you make data-driven decisions to optimize your application and drive business growth.

Monitoring GKE Environment

The Significance of Monitoring in GKE

Monitoring in GKE goes beyond simply monitoring resource utilization. It provides valuable insights into the health and performance of your Kubernetes clusters, nodes, and the applications running within them. By closely monitoring key metrics such as CPU usage, memory utilization, and network traffic, you gain a comprehensive understanding of your system’s behavior and can proactively address issues before they escalate.

GKE-Native Monitoring has many powerful features that simplify the monitoring process. One notable feature is integrating with Stackdriver, Google Cloud’s monitoring and observability platform.

With this integration, you can access a rich set of monitoring tools, including customizable dashboards, alerts, and logging capabilities, all designed explicitly for GKE deployments. Additionally, GKE-Native Monitoring seamlessly integrates with other Google Cloud services, enabling you to leverage advanced analytics and machine learning capabilities.

Challenges to Gaining Reliability

**Existing Static Tools**

These shifts have driven welcome innovation in system reliability. Yet some of the technologies and tools used to manage these environments have not kept pace. Many of these tools have stayed relatively static while our environments have become dynamic. Static tools in a dynamic environment create friction for reliability in distributed systems and drive the need for more efficient network visibility.

**Understanding the Complexity**

Distributed systems are inherently complex, with multiple components across different machines or networks. This complexity introduces challenges like network latency, hardware failures, and communication bottlenecks. Understanding the intricate nature of distributed systems is crucial to devising reliable solutions.

Gaining Reliability

**Redundancy and Replication**

One critical approach to enhancing reliability in distributed systems is redundancy and replication. By duplicating critical components or data across multiple nodes, the system becomes more fault-tolerant. This ensures the system can function seamlessly even if one component fails, minimizing the risk of complete failure.

**Consistency and Consensus Algorithms**

Maintaining consistency in distributed systems is a significant challenge due to the possibility of concurrent updates and network delays. Consensus algorithms, such as the Paxos or Raft algorithms, are vital in achieving consistency by ensuring agreement among distributed nodes. These algorithms enable reliable decision-making and guarantee that all nodes reach a consensus state.

**Monitoring and Failure Detection**

To ensure reliability, robust monitoring mechanisms are essential. Monitoring tools can track system performance, resource utilization, and network health. Additionally, implementing efficient failure detection mechanisms allows for prompt identification of faulty components, enabling proactive measures to mitigate their impact on the overall system.

**Load Balancing and Scalability**

Load balancing is crucial in distributing the workload evenly across nodes in a distributed system. It ensures that no single node is overwhelmed, reducing the risk of system instability. Furthermore, designing systems with scalability in mind allows for seamless expansion as the workload grows, ensuring that reliability is maintained even during periods of high demand.
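
As a small illustration of this workload-spreading idea, here is a minimal round-robin balancer in Python. The backend names are placeholders; production load balancers add health checks, weighting, and connection tracking on top of this basic rotation.

```python
import itertools

class RoundRobinBalancer:
    """Hand out backend nodes in rotation so no single node absorbs all requests."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
print([lb.pick() for _ in range(6)])
# ['node-a', 'node-b', 'node-c', 'node-a', 'node-b', 'node-c']
```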

Required: Distributed Tracing

Using distributed tracing, you can profile or monitor requests as they travel across a distributed system. Distributed systems can be challenging to monitor because each node generates its own logs and metrics. To get a complete view of a distributed system, these separate node metrics must be aggregated holistically.

A request to a distributed system generally doesn’t touch every node; it follows a path through a subset of them. With distributed tracing, teams can analyze and monitor the most commonly exercised paths through the system. Tracing instrumentation is installed on each node, allowing teams to query the system for information on node health and performance.
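
As a toy illustration of the idea, the sketch below shows how a single trace ID ties together spans emitted by different services along a request path. It is a simplified model, not a real tracing library; the field names and the in-memory collector are assumptions for the example.

```python
import time
import uuid

def start_span(name, trace_id=None, parent_id=None):
    """Open a span; reuse the caller's trace_id so every hop shares one trace."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "start": time.time(),
    }

def finish_span(span, collector):
    """Close a span and hand it to a collector (a real system exports to a backend)."""
    span["duration_ms"] = (time.time() - span["start"]) * 1000
    collector.append(span)

# Simulate a request crossing two services on its path through the system
collector = []
root = start_span("api-gateway")
child = start_span("payment-service", trace_id=root["trace_id"], parent_id=root["span_id"])
finish_span(child, collector)
finish_span(root, collector)
for span in collector:
    print(span["trace_id"], span["name"], round(span["duration_ms"], 3), "ms")
```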

Benefits: Distributed Tracing

Despite the challenges, distributed systems offer a wide array of benefits. One notable advantage is enhanced fault tolerance. Distributing tasks and data across multiple nodes improves system reliability, as a single point of failure does not bring down the entire system.

Additionally, distributed systems enable improved scalability, accommodating growing demands by adding more nodes to the network. The applications of distributed systems are vast, ranging from cloud computing and large-scale data processing to peer-to-peer networks and distributed databases.

Google Cloud Trace

Understanding Cloud Trace

Cloud Trace, an integral part of Google Cloud’s observability offerings, provides developers with a detailed view of their application’s performance. It enables tracing and analysis of requests as they flow through different components of a distributed system. By visualizing the latency, bottlenecks, and dependencies, Cloud Trace empowers developers to optimize their applications for better performance and user experience.

Cloud Trace offers a range of features to simplify the monitoring and troubleshooting process. With its distributed tracing capabilities, developers can gain insights into how requests traverse various services, identify latency issues, and pinpoint the root causes of performance bottlenecks. The integration with Google Cloud’s ecosystem allows seamless correlation between traces and other monitoring data, enabling a comprehensive view of application health.

Improve Performance & Reliability 

By leveraging Cloud Trace, developers can significantly improve the performance and reliability of their applications. The ability to pinpoint and resolve performance issues quickly translates into enhanced user satisfaction and higher productivity. Moreover, Cloud Trace enables proactive monitoring, ensuring that potential bottlenecks and inefficiencies are identified before they impact end-users. For organizations, this translates into cost savings, improved scalability, and better resource utilization.

Adopting Cloud Trace

Cloud Trace has been adopted by numerous organizations across various industries, with remarkable outcomes. From optimizing the response time of e-commerce platforms to enhancing the efficiency of complex microservices architectures, Cloud Trace has proven its worth in diagnosing performance issues and driving continuous improvement. The real-time visibility provided by Cloud Trace empowers organizations to make data-driven decisions and deliver exceptional user experiences.

Related: Before you proceed, you may find the following posts helpful:

  1. Distributed Firewalls
  2. SD WAN Static Network Based

Reliability In Distributed System

Adopting Distributed Systems

Distributed systems refer to a network of interconnected computers that communicate and coordinate their actions to achieve a common goal. Unlike traditional centralized systems, where a single entity controls all components, distributed systems distribute tasks and data across multiple nodes. This decentralized approach enables enhanced scalability, fault tolerance, and resource utilization.

**Key Components of Distributed Systems**

To comprehend the inner workings of distributed systems, we must familiarize ourselves with their key components. These components include nodes, communication channels, protocols, and distributed file systems. Nodes represent individual machines or devices within the network; communication channels facilitate data transmission; protocols ensure reliable communication; and distributed file systems enable data storage across multiple nodes.

Distributed vs centralized

**Distributed Systems Use Cases**

Many modern applications use distributed systems, including mobile and web applications with high traffic. Web browsers or mobile applications serve as clients in a client-server environment, and the server becomes its own distributed system. The modern web server follows a multi-tier system pattern. Requests are delegated to several server logic nodes via a load balancer.

Kubernetes is popular among distributed systems since it enables containers to be combined into a distributed system. Kubernetes orchestrates network communication between the distributed system nodes and handles dynamic horizontal and vertical scaling of the nodes. 

Cryptocurrencies like Bitcoin and Ethereum are also peer-to-peer distributed systems. The currency ledger is replicated at every node in a cryptocurrency network. To bootstrap, a currency node connects to other nodes and downloads its full ledger copy. Additionally, cryptocurrency wallets use JSON RPC to communicate with the ledger nodes.

Challenges in Distributed Systems

While distributed systems offer numerous advantages, they also pose various challenges. One significant challenge is achieving consensus among distributed nodes. Ensuring that all nodes agree on a particular value or decision can be complex, especially in the presence of failures or network partitions. Additionally, maintaining data consistency across distributed nodes and mitigating issues related to concurrency control requires careful design and implementation.

**Example: Distributed System of Microservices**

Microservices are one type of distributed system since they decompose an application into individual components. A microservice architecture, for example, may have services corresponding to business features (payments, users, products, etc.), with each element handling the corresponding business logic. Multiple redundant copies of the services will then be available, so there is no single point of failure.

**Distributed Systems: The Challenge**

Distributed systems are required to deliver the reliability, agility, and scale expected of modern computer programs. They are applications made up of many different components running on many different machines. Containers are the foundational building block, and groups of containers co-located on a single machine make up the atomic elements of distributed system patterns.

The significant shift we see with software platforms is that they evolve much more quickly than the products and paradigms we use to monitor them. We need to consider new practices and technologies, along with dedicated platform teams, to enable a new era of system reliability in distributed systems. This includes the practice of observability, which is a step up from the traditional monitoring of static infrastructure: Observability vs monitoring.

Knowledge Check: Distributed Systems Architecture

  • Client-Server Architecture

A client-server architecture has two primary responsibilities. The client presents user interfaces and is connected to the server via a network. The server handles business logic and state management. Unless the server is redundant, a client-server architecture can quickly degrade into a centralized architecture. A truly distributed client-server setup will consist of multiple server nodes that distribute client connections. In modern client-server architectures, clients connect to encapsulated distributed systems on the server.

  • Multi-tier Architecture

Multi-tier architectures are extensions of client-server architectures. Multi-tier architectures decompose servers into further granular nodes, decoupling additional backend server responsibilities like data processing and management. By processing long-running jobs asynchronously, these additional nodes free up the remaining backend nodes to focus on responding to client requests and interacting with the data store.

  • Peer-to-Peer Architecture

Peer-to-peer distributed systems contain complete instances of the application on each node. Presentation and data processing are not split across different nodes; each node contains both a presentation layer and a data-handling layer. Peer nodes may contain the entire system’s state data. 

Peer-to-peer systems have a great deal of redundancy. When initiated and brought online, peer-to-peer nodes discover and connect to other peers, synchronizing their local state with the system’s. As a result, a peer-to-peer network is not disrupted by the failure of a single node, and the system persists even as individual nodes come and go. 

  • Service-orientated Architecture

A service-oriented architecture (SOA) is a precursor to microservices. Microservices differ from SOA primarily in their node scope, which is at the feature level. Each microservice node encapsulates a specific set of business logic, such as payment processing—multiple nodes of business logic interface with independent databases in a microservice architecture. In contrast, SOA nodes encapsulate an entire application or enterprise division. Database systems are typically included within the service boundary of SOA nodes.

Because of their benefits, microservices have become more popular than SOA. The small service nodes provide functionality that teams can reuse across projects. The advantages of microservices include greater robustness and a far greater ability to scale horizontally and vertically on demand.

Reliability in Distributed Systems: Components

A – Redundancy and Replication:

Redundancy and replication are two fundamental concepts distributed systems use to enhance reliability. Redundancy involves duplicating critical system components, such as servers, storage devices, or network links, so the redundant component can seamlessly take over if one fails. Replication, on the other hand, involves creating multiple copies of data across different nodes in a system, enabling efficient data access and fault tolerance. By incorporating redundancy and replication, distributed systems can continue to operate even when individual components fail.
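
The sketch below illustrates the replication idea with a majority-quorum write: the write succeeds once most replicas acknowledge it, so a single failed replica does not block progress. The in-memory dictionaries stand in for replica nodes; real systems replace them with network calls and durable storage.

```python
def replicate_write(key, value, replicas, quorum=None):
    """Apply a write to every replica; succeed once a majority acknowledges."""
    quorum = quorum or (len(replicas) // 2 + 1)
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value      # stands in for a network call to a replica node
            acks += 1
        except Exception:
            continue                  # an unreachable replica does not block the write
    return acks >= quorum

replicas = [{}, {}, {}]               # three in-memory stand-ins for replica nodes
print(replicate_write("balance", 100, replicas))   # True: all three acknowledged
print(replicas)                                    # the value is now stored on every node
```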

B – Fault Tolerance:

Fault tolerance is a crucial aspect of achieving reliability in distributed systems. It involves designing systems to operate correctly even when one or more components encounter failures. Several techniques, such as error detection, recovery, and prevention mechanisms, are employed to achieve fault tolerance.

C – Error Detection:

Error detection techniques, such as checksums, hashing, and cyclic redundancy checks (CRC), identify errors or data corruption during transmission or storage. By verifying data integrity, these techniques help identify and mitigate potential failures in distributed systems.
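
A minimal example of these error-detection techniques using Python’s standard library: the sender ships a CRC-32 value and a SHA-256 digest alongside the payload, and the receiver recomputes both to verify integrity. The payload contents are illustrative.

```python
import hashlib
import zlib

payload = b"transfer:acct-42:amount:100.00"

# Sender computes integrity values and ships them with the data
crc = zlib.crc32(payload)
digest = hashlib.sha256(payload).hexdigest()

# Receiver recomputes and compares; a mismatch means corruption in transit or at rest
received = payload  # imagine this arrived over the network
assert zlib.crc32(received) == crc
assert hashlib.sha256(received).hexdigest() == digest
print("integrity check passed")
```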

D – Error Recovery:

Error recovery mechanisms, such as checkpointing and rollback recovery, aim to restore the system to a consistent state after a failure. Checkpointing involves periodically saving the system’s state and data, allowing recovery to a previously known good state in case of failures. On the other hand, rollback recovery involves undoing the effects of failed operations and returning the system to a consistent state.
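
Here is a small sketch of checkpoint-based recovery: state is written atomically to a checkpoint file, and after a failure the process rolls back to the last known-good state. The file name and state fields are assumptions for the example.

```python
import json
import os
import tempfile

CHECKPOINT = "state.checkpoint.json"

def save_checkpoint(state: dict) -> None:
    """Write the state to a temporary file, then atomically rename it, so a
    crash mid-write never corrupts the existing checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)   # atomic rename: either the old or new checkpoint survives

def restore_checkpoint() -> dict:
    """Roll back to the last known-good state after a failure."""
    if not os.path.exists(CHECKPOINT):
        return {}
    with open(CHECKPOINT) as f:
        return json.load(f)

save_checkpoint({"processed_offset": 1042})
print(restore_checkpoint())       # {'processed_offset': 1042}
```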

E – Error Prevention:

Distributed systems employ error prevention techniques, such as redundancy elimination, consensus algorithms, and load balancing, to enhance reliability. Redundancy elimination reduces unnecessary duplication of data or computation, thereby reducing the chances of errors. Consensus algorithms ensure that all nodes in a distributed system agree on a shared state despite failures or message delays. Load balancing techniques distribute computational tasks evenly across multiple nodes to prevent overloading and potential failures.

Challenges: Traditional Monitoring

**Lack of Connective Event**

If you examine traditional monitoring systems, they capture and investigate signals in isolation. They work in a siloed environment, similar to that of developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” that are common in modern distributed systems. This often leads to service disruptions. So, you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today lack predictability, certainly not enough predictability to rely on static thresholds, alerts, and old monitoring tools. If a condition is known and fixed, it can be automated; these are static events, such as a Kubernetes POD reaching a resource limit.

Then, a ReplicaSet schedules another POD on a different node if specific parameters are met, such as Kubernetes labels and node selectors. However, this is only a tiny piece of the failure puzzle in a distributed environment. Today, we have what are known as partial failures, and systems that fail in very creative ways.

Reliability In Distributed System: Creative ways to fail

So, we know that some of these failures are easily predicted and acted upon. For example, if a Kubernetes node reaches a specific utilization, we can automatically reschedule PODs onto a different node to stay within our known scale limits.

Predictable failures can be automated in Kubernetes and with any infrastructure. An Ansible script is useful when these events occur. However, we have much more to deal with than POD scaling; we have many partial and complicated failures known as black holes.

**In today’s world of partial failures**

Microservices applications are distributed and susceptible to many external factors. By contrast, in the traditional monolithic application style, all the functions reside in the same process. It was either switched on or off; not much happened in between. If there was a failure in the process, the application as a whole failed. The results were binary: up or down.

This was easy to detect with basic monitoring, and failures were predictable. There was no such thing as a partial failure: because all application functions live within the same process, a monolith simply does not fail partially.

However, in a cloud-native world, where we have broken the old monolith into a microservices-based application, a client request can go through multiple hops of microservices, and we can have several problems to deal with.

There is a lack of connectivity between the different domains. Many monitoring tools and knowledge will be tied to each domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction, which is a critical metric to care about.

**System reliability: Today, you have no way to predict**

So, the new, modern, and complex distributed systems place very different demands on your infrastructure—considerably different from the simple three-tier application, where everything is generally housed in one location.  We can’t predict anything anymore, which breaks traditional monitoring approaches.

When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcomed strategy.

**Blackholes: Strange failure modes**

When considering a distributed system, many things can happen. A service or region can disappear entirely, or vanish for a few seconds or milliseconds and then reappear. When we see these strange failure modes, we say the traffic has gone into a black hole: whatever enters it simply disappears. Peculiar failure modes are unexpected and surprising.

Strange failure modes are undoubtedly unpredictable. So, what happens when your banking transactions are in a black hole? What if your banking balance is displayed incorrectly or if you make a transfer to an external account and it does not show up? 

Site Reliability Engineering (SRE) and Observability

Site reliability engineering (SRE) and observability practices are needed to manage these types of unpredictable and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems.

You need to be proactive and fix these issues up front, getting ahead of the curve to prevent incidents from occurring. This shift usually happens in the wake of a massive incident, which acts as a teachable moment and provides the impetus to invest in a Chaos Engineering project. 

New tools and technologies:

1 – Distributed tracing

We have new tools, such as distributed tracing. So, what is the best way to find the bottleneck when the system becomes slow? Here, you can use distributed tracing and OpenTelemetry. Tracing instruments our system and shows where time is spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
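
As a hedged example, the snippet below shows the general shape of instrumenting code with the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages), exporting spans to the console. In practice you would swap the console exporter for one that ships spans to your tracing backend; the span names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Nested spans record where time is spent along the request path
with tracer.start_as_current_span("checkout"):
    with tracer.start_as_current_span("payment-service"):
        pass  # the traced work would happen here
```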

2 – SLA, SLI, SLO, and Error Budgets

We don’t just want to know that something has happened and then react to an event that tells us nothing about the customer’s perspective. We need to understand whether we are meeting our SLA by gathering the number and frequency of outages and any performance issues.

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) assist with these measurements. They not only help you quantify reliability but also give you a tool for improving it, forming the base of the reliability stack.
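
A short sketch of how an SLI and error budget fall out of an SLO: given a 99.9% availability objective, the budget is the fraction of requests allowed to fail, and the burn rate is how much of that budget has been consumed. The numbers are illustrative.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Report the availability SLI and how much of the error budget has been burned."""
    allowed_failures = total_requests * (1 - slo)        # the budget, expressed in requests
    burn = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "sli_availability": 1 - failed_requests / total_requests,
        "budget_requests": round(allowed_failures),
        "budget_consumed": round(burn, 2),               # 1.0 means the budget is exhausted
    }

# A 99.9% SLO over one million requests allows roughly 1,000 failures
print(error_budget(slo=0.999, total_requests=1_000_000, failed_requests=450))
# -> roughly: availability 0.99955, budget 1000 requests, 45% of the budget consumed
```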

Summary: Reliability In Distributed System

In modern technology, distributed systems have become the backbone of numerous applications and services. These systems, consisting of interconnected nodes, provide scalability, fault tolerance, and improved performance. However, maintaining reliability in such distributed environments is a challenging endeavor. This blog post explored the key aspects and strategies for ensuring reliability in distributed systems.

Understanding the Challenges

Distributed systems face a myriad of challenges that can impact their reliability. These challenges include network failures, node failures, message delays, and data inconsistencies. These aspects can introduce vulnerabilities that may disrupt system operations and compromise reliability.

Replication for Resilience

One fundamental technique for enhancing reliability in distributed systems is data replication. By replicating data across multiple nodes, system resilience is improved. Replication increases fault tolerance and enables load balancing and localized data access. However, managing consistency and synchronization among the replicated copies is crucial to maintaining reliability.

Consensus Protocols

Consensus protocols play a vital role in achieving reliability in distributed systems. These protocols enable nodes to agree on a shared state despite failures or network partitions. Popular consensus algorithms such as Paxos and Raft ensure that distributed nodes reach a consensus, making them resilient against failures and maintaining system reliability.

Fault Detection and Recovery

Detecting faults in a distributed system is crucial for maintaining reliability. Techniques like heartbeat monitoring, failure detectors, and health checks aid in identifying faulty nodes or network failures. Once a fault is detected, recovery mechanisms such as automatic restarts, replica synchronization, or reconfigurations can be employed to restore system reliability.
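
The following sketch shows a simple heartbeat-based failure detector of the kind described above: nodes report in periodically, and any node silent for longer than the timeout is flagged as failed. The timeout value and node names are assumptions; real detectors add suspicion levels and tolerance for network jitter.

```python
import time

class HeartbeatDetector:
    """Mark a node as failed if no heartbeat arrives within the timeout window."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def failed_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

detector = HeartbeatDetector(timeout_seconds=0.1)
detector.heartbeat("node-1")
detector.heartbeat("node-2")
time.sleep(0.2)
detector.heartbeat("node-2")          # node-2 keeps reporting in; node-1 goes quiet
print(detector.failed_nodes())        # ['node-1']
```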

Load Balancing and Scalability

Load balancing and scalability can also enhance reliability in distributed systems. By evenly distributing the workload among nodes and dynamically scaling resources, the system can handle varying demands and prevent bottlenecks. Load-balancing algorithms and auto-scaling mechanisms contribute to overall system reliability.

Conclusion:

In the world of distributed systems, reliability is a paramount concern. By understanding the challenges, employing replication techniques, utilizing consensus protocols, implementing fault detection and recovery mechanisms, and focusing on load balancing and scalability, we can embark on a journey of resilience. Reliability in distributed systems requires careful planning, robust architectures, and continuous monitoring. By addressing these aspects, we can build distributed systems that are truly reliable, empowering businesses and users alike.

Chaos Engineering

Baseline Engineering

Baseline Engineering

In today's fast-paced digital landscape, network performance plays a vital role in ensuring seamless connectivity and efficient operations. Network baseline engineering is a powerful technique that allows organizations to establish a solid foundation for optimizing network performance, identifying anomalies, and planning for future scalability. In this blog post, we will explore the ins and outs of network baseline engineering and its significant benefits.

Network baseline engineering is the process of establishing a benchmark or reference point for network performance metrics. By monitoring and analyzing network traffic patterns, bandwidth utilization, latency, and other key parameters over a specific period, organizations can create a baseline that represents the normal behavior of their network. This baseline becomes a crucial reference for detecting deviations, troubleshooting issues, and capacity planning.

Proactive Issue Detection: One of the primary advantages of network baseline engineering is the ability to proactively detect and address network issues. By comparing real-time network performance against the established baseline, anomalies and deviations can be quickly identified. This allows network administrators to take immediate action to resolve potential problems before they escalate and impact user experience.

Improved Performance Optimization: With a solid network baseline in place, organizations can gain valuable insights into network performance patterns. This information can be leveraged to fine-tune configurations, optimize resource allocation, and enhance overall network efficiency. By understanding the normal behavior of the network, administrators can make informed decisions to improve performance and provide a seamless user experience.

Data Collection: The first step in network baseline engineering is collecting relevant data, including network traffic statistics, bandwidth usage, application performance, and other performance metrics. This data can be obtained from network monitoring tools, SNMP agents, flow analyzers, and other network monitoring solutions.

Data Analysis and Baseline Creation: Once the data is collected, it needs to be analyzed to identify patterns, trends, and normal behavior. Statistical analysis techniques, such as mean, median, and standard deviation, can be applied to determine the baseline values for various performance parameters. This process may involve using specialized software or network monitoring platforms.
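
As a small illustration of this analysis step, the snippet below computes the mean, median, and standard deviation for a set of collected utilization samples and derives upper and lower thresholds at three standard deviations. The sample values and the three-sigma convention are assumptions for the example.

```python
import statistics

# Hourly bandwidth-utilization samples (Mbps) collected by the monitoring platform
samples = [132, 128, 140, 151, 133, 129, 138, 145, 136, 131, 127, 149]

baseline = {
    "mean": statistics.mean(samples),
    "median": statistics.median(samples),
    "stdev": statistics.stdev(samples),
}
# A common convention: flag anything beyond mean +/- 3 standard deviations
baseline["upper_threshold"] = baseline["mean"] + 3 * baseline["stdev"]
baseline["lower_threshold"] = baseline["mean"] - 3 * baseline["stdev"]
print(baseline)
```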

Maintaining and Updating the Network Baseline: Networks are dynamic environments, and their behavior can change over time due to various factors such as increased user demands, infrastructure upgrades, or new applications. It is essential to regularly review and update the network baseline to reflect these changes accurately. By periodically reevaluating the baseline, organizations can ensure its relevance and effectiveness in capturing the network's current behavior.

Network baseline engineering is a fundamental practice that empowers organizations to better understand, optimize, and maintain their network infrastructure. By establishing a reliable baseline, organizations can proactively detect issues, enhance performance, and make informed decisions for future network expansion. Embracing network baseline engineering sets the stage for a robust and resilient network that supports the ever-growing demands of the digital age.

Highlights: Baseline Engineering

1. Understanding Baseline Engineering

Baseline engineering serves as the bedrock for any engineering project. It involves creating a reference point or baseline from which all measurements, evaluations, and improvements are made. By establishing this starting point, engineers gain insights into the project’s progress, performance, and potential deviations from the original plan.

Baseline engineering follows a systematic and structured approach. It starts with defining project objectives and requirements, then data collection and analysis. This data provides a snapshot of the initial conditions and helps engineers set realistic targets and benchmarks. Through careful monitoring and periodic assessments, deviations from the baseline can be detected early, enabling timely corrective actions.

2. Traditional Network Infrastructure

Baseline Engineering was easy in the past; applications ran in single private data centers, potentially two data centers for high availability. There may have been some satellite PoPs, but generally, everything was housed in a few locations. These data centers were on-premises, and all components were housed internally. As a result, troubleshooting, monitoring, and baselining any issues was relatively easy. The network and infrastructure were pretty static, the network and security perimeters were known, and there weren’t many changes to the stack, for example, daily.

3. Distributed Applications

However, nowadays, we are in a completely different environment. We have distributed applications with components/services located in many other places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. We span multiple sites and accommodate multiple workload types.

In comparison to the monolith, today’s applications have many different types of entry points to the external world. All of this calls for the practices of Baseline Engineering and Chaos Engineering for Kubernetes, so you can fully understand your infrastructure and its scaling issues. 

Managed Instance Groups

**Introduction to Managed Instance Groups**

In the fast-evolving world of cloud computing, maintaining scalability, reliability, and efficiency is essential. Managed instance groups (MIGs) on Google Cloud offer an innovative solution to achieve these goals. Whether you’re a seasoned cloud engineer or new to the Google Cloud ecosystem, understanding MIGs can significantly enhance your infrastructure management capabilities.

**The Role of Managed Instance Groups in Baseline Engineering**

Baseline engineering focuses on establishing a stable foundation for software development and operations. Managed instance groups play a crucial role in this process by automating the deployment and scaling of virtual machines (VMs). By setting up MIGs, baseline engineering can achieve consistent performance, reduce manual intervention, and enhance the system’s resilience to changes in demand. This automation allows engineers to focus on optimizing and innovating rather than maintaining infrastructure.

**Automating Scalability and Load Balancing**

One of the standout features of managed instance groups is their ability to automate scalability and load balancing. As your application experiences varying levels of traffic, MIGs automatically adjust the number of instances to meet the demand. This capability ensures that your application remains responsive and cost-efficient, as you only use the resources you need. Additionally, Google Cloud’s load balancing solutions work seamlessly with MIGs, distributing traffic evenly across instances to maintain optimal performance.

**Achieving Reliability with Health Checks and Autohealing**

Reliability is a cornerstone of any cloud-based application, and managed instance groups provide robust mechanisms to ensure it. Through health checks, MIGs continuously monitor the status of your VMs. If an instance fails or becomes unhealthy, the autohealing feature kicks in, replacing it with a new, healthy instance. This proactive approach minimizes downtime and maintains service continuity, contributing to a more reliable application experience for end-users.

**Optimizing Costs with Managed Instance Groups**

Cost management is a critical consideration in cloud computing, and managed instance groups help optimize expenses. By automatically scaling the number of instances based on demand, MIGs eliminate the need for overprovisioning resources. This dynamic resource allocation ensures that you are only paying for the compute capacity you need. Moreover, when combined with Google Cloud’s pricing models, such as sustained use discounts and committed use contracts, MIGs allow for significant cost savings.

Managed Instance Group

Google Data Centers – Service Mesh

**What is Cloud Service Mesh?**

A cloud service mesh is a dedicated infrastructure layer designed to manage service-to-service communication within a microservices architecture. It provides a way to control how different parts of an application share data with one another. Essentially, it acts as a network of microservices that make up cloud applications and ensures that communication between services is secure, fast, and reliable.

**Benefits of Cloud Service Mesh**

1. **Enhanced Security**: One of the primary advantages of a cloud service mesh is the improved security it offers. By managing communication between services, it can enforce security policies, authenticate service requests, and encrypt data in transit. This reduces the risk of data breaches and unauthorized access.

2. **Increased Reliability**: Cloud service meshes enhance the reliability of service interactions. They provide load balancing, traffic routing, and failure recovery, ensuring that services remain available even in the face of failures. This is particularly crucial for applications that require high availability and resilience.

3. **Improved Observability**: With a cloud service mesh, engineering teams can gain greater visibility into the interactions between services. This includes monitoring performance metrics, logging, and tracing requests. Such observability helps in identifying and troubleshooting issues more efficiently, leading to faster resolution times.

**Impact on Baseline Engineering**

Baseline engineering involves establishing a standard level of performance, security, and reliability for an organization’s infrastructure. The introduction of a cloud service mesh has significantly impacted this field by providing a more robust foundation for managing microservices. Here’s how:

1. **Standardization**: Cloud service meshes help in standardizing the way services communicate, making it easier to maintain consistent performance across the board. This is especially important in complex systems with numerous interdependent services.

2. **Automation**: Many cloud service meshes come with built-in automation capabilities, such as automatic retries, circuit breaking, and service discovery. This reduces the manual effort required to manage service interactions, allowing engineers to focus on more strategic tasks.

3. **Scalability**: By managing service communication more effectively, cloud service meshes enable organizations to scale their applications more easily. This is crucial for businesses that experience varying levels of demand and need to adjust their resources accordingly.

Example Product: Cisco AppDynamics

### Key Features of Cisco AppDynamics

**1. Real-Time Monitoring and Analytics**

One of the standout features of Cisco AppDynamics is its ability to provide real-time monitoring and analytics. This allows businesses to gain instant visibility into application performance and user interactions. By leveraging real-time data, organizations can quickly identify performance bottlenecks, understand user behavior, and make informed decisions to enhance application efficiency.

**2. End-to-End Transaction Visibility**

With Cisco AppDynamics, you get a comprehensive view of end-to-end transactions, from user interactions to backend processes. This visibility helps in pinpointing the exact location of issues within the application stack, whether it’s in the code, database, or infrastructure. This holistic approach ensures that no problem goes unnoticed, enabling swift resolution and minimizing downtime.

**3. AI-Powered Anomaly Detection**

Cisco AppDynamics employs advanced AI and machine learning algorithms to detect anomalies in application performance. These intelligent insights help predict and prevent potential issues before they impact end users. By learning the normal behavior of your applications, the system can alert you to deviations that might signify underlying problems, allowing proactive intervention.

### Benefits of Implementing Cisco AppDynamics

**1. Enhanced User Experience**

By continuously monitoring application performance and user interactions, Cisco AppDynamics helps ensure a smooth and uninterrupted user experience. Instant alerts and detailed reports enable IT teams to address issues swiftly, reducing the likelihood of user dissatisfaction and churn.

**2. Improved Operational Efficiency**

Cisco AppDynamics automates many aspects of performance monitoring, freeing up valuable time for IT teams. This automation reduces the need for manual checks and troubleshooting, allowing teams to focus on strategic initiatives and innovation. The platform’s ability to integrate with various IT tools further streamlines operations and enhances overall efficiency.

**3. Data-Driven Decision Making**

The rich data and analytics provided by Cisco AppDynamics empower businesses to make data-driven decisions. Whether it’s optimizing application performance, planning for capacity, or enhancing security measures, the insights gained from AppDynamics drive informed strategies that align with business goals.

### Getting Started with Cisco AppDynamics

**1. Easy Deployment**

Cisco AppDynamics offers flexible deployment options, including on-premises, cloud, and hybrid environments. The straightforward installation process and intuitive user interface make it accessible even for teams with limited APM experience. Comprehensive documentation and support further ease the onboarding process.

**2. Customizable Dashboards**

Users can create customizable dashboards to monitor the metrics that matter most to their organization. These dashboards provide at-a-glance views of key performance indicators (KPIs), making it easy to track progress and identify areas for improvement. Custom alerts and reports ensure that critical information is always at your fingertips.

**3. Continuous Learning and Improvement**

Cisco AppDynamics encourages continuous learning and improvement through its robust training resources and community support. Regular updates and enhancements keep the platform aligned with the latest technological advancements, ensuring that your APM strategy evolves alongside your business needs.

What is GKE-Native Monitoring?

GKE-Native Monitoring is a comprehensive monitoring solution provided by Google Cloud, specifically designed for Kubernetes workloads running on GKE. It offers deep visibility into the performance and health of your clusters, allowing you to proactively identify and address issues before they impact your applications.

a) Automatic Cluster Monitoring: GKE-Native Monitoring automatically collects and visualizes key metrics from your GKE clusters, making it effortless to monitor your workloads’ overall health and resource utilization.

b) Customizable Dashboards: With GKE-Native Monitoring, you can create personalized dashboards tailored to your specific monitoring needs. Visualize metrics that matter most to you and gain actionable insights at a glance.

c) Alerting and Notifications: Use GKE-Native Monitoring’s robust alerting capabilities to stay informed about critical events and anomalies in your GKE clusters. Configure alerts based on thresholds and receive notifications through various channels to ensure prompt response and issue resolution.

The Role of Network Baselining

Network baselining involves capturing and analyzing network traffic data to establish a benchmark or baseline for normal network behavior. This baseline represents the typical performance metrics of the network under regular conditions. It encompasses various parameters such as bandwidth utilization, latency, packet loss, and throughput. By monitoring these metrics over time, administrators can identify patterns, trends, and anomalies, enabling them to make informed decisions about network optimization and troubleshooting.

Understanding TCP Performance Parameters

TCP, or Transmission Control Protocol, is a vital protocol that governs reliable data transmission over networks. Behind its seemingly simple operation lies a complex web of performance parameters that can significantly impact network efficiency, latency, and throughput. In this blog post, we will dive deep into TCP performance parameters, understanding their importance and how they influence network performance.

TCP performance parameters determine various aspects of the TCP protocol’s behavior. These parameters include window size, congestion control algorithms, maximum segment size (MSS), retransmission timeout (RTO), and many more. Each parameter plays a crucial role in shaping TCP’s performance characteristics, such as reliability, congestion avoidance, and flow control.

The Impact of Window Size: Window size, also known as the receive window, represents the amount of data a receiving host can accept before requiring acknowledgment from the sender. A larger window size allows more data to be in flight without waiting for acknowledgments, thereby improving throughput. However, an excessively large window size can lead to network congestion and increased latency. Finding the optimal window size requires careful consideration and tuning.

Congestion Control Algorithms: Congestion control algorithms, such as TCP Reno, Cubic, and New Reno, regulate the flow of data in TCP connections to avoid network congestion. These algorithms dynamically adjust parameters like the congestion window and the slow-start threshold based on various congestion indicators. Understanding the different congestion control algorithms and selecting the appropriate one for specific network conditions is crucial for achieving optimal performance.

Maximum Segment Size (MSS): The Maximum Segment Size (MSS) is the largest amount of data TCP can encapsulate within a single IP packet. It is derived from the underlying network’s Maximum Transmission Unit (MTU). A higher MSS can enhance throughput by reducing the overhead associated with packet headers, but it should not exceed the network’s MTU, to avoid fragmentation and the resulting performance degradation.

Retransmission Timeout (RTO): The Retransmission Timeout (RTO) is how long TCP waits for an acknowledgment before retransmitting a segment. Setting an appropriate RTO value is crucial to balancing reliability and responsiveness. A too-short RTO may result in unnecessary retransmissions and increased network load, while a too-long RTO can lead to higher latency and decreased throughput. Factors like network latency, jitter, and packet loss rate influence the optimal RTO configuration.
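
For a practical view of some of these parameters, the sketch below uses Python’s socket module to adjust the receive buffer (which bounds the advertised window), disable Nagle’s algorithm for latency-sensitive traffic, and read the MSS where the platform exposes TCP_MAXSEG. The exact values reported depend on the operating system, which may round or double the requested buffer size.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# The receive buffer bounds the advertised TCP window (the OS may adjust the value)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
print("receive buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

# TCP_NODELAY disables Nagle's algorithm for latency-sensitive, small-write workloads
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# On platforms that expose it (e.g. Linux), TCP_MAXSEG reports the MSS for the socket
if hasattr(socket, "TCP_MAXSEG"):
    print("mss:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))

sock.close()
```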

 

Before you proceed, you may find the following posts helpful:

  1. Network Traffic Engineering
  2. Low Latency Network Design
  3. Transport SDN
  4. Load Balancing
  5. What is OpenFlow
  6. Observability vs Monitoring
  7. Kubernetes Security Best Practice

 

Baseline Engineering

Chaos Engineering

Chaos engineering is a methodology for experimenting with a software system to build confidence in its capability to withstand turbulent environments in production. It is an essential part of the DevOps philosophy, allowing teams to experiment with their system’s behavior in a safe and controlled manner.

This type of baseline engineering allows teams to identify weaknesses in their software architecture, such as potential bottlenecks or single points of failure, and take proactive measures to address them. By injecting faults into the system and measuring the effects, teams gain insights into system behavior that can be used to improve system resilience.

Finally, Chaos Engineering teaches you to develop and execute controlled experiments that uncover hidden problems. For instance, you may need to inject system-shaking failures that disrupt system calls, networking, APIs, and Kubernetes-based microservices infrastructures.

Chaos engineering is “the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.” In other words, it’s a software testing method that concentrates on finding evidence of problems before users experience them.

Network Baselining

Network baselining involves measuring the network’s performance at different times. This includes measuring throughput, latency, and other performance metrics, as well as recording the network’s configuration. It is important to note that performance metrics can vary greatly depending on the type of network being used. This is why it is essential to establish a baseline for the network to serve as a reference point for comparison.

Network baselining is integral to network management. It allows organizations to identify and address potential issues before they become more serious. Organizations can be alerted to potential problems by analyzing the network’s performance. This can help organizations avoid costly downtime and ensure their networks run at peak performance.

Diagram: Network Baselining. Source: DNSstuff

**The Importance of Network Baselining**

Network baselining provides several benefits for network administrators and organizations:

1. Performance Optimization: Baselining helps identify bottlenecks, inefficiencies, and abnormal behavior within the network infrastructure. By understanding the baseline, administrators can optimize network resources, improve performance, and ensure a smoother user experience.

2. Security Enhancement: Baselining also plays a crucial role in detecting and mitigating security threats. Administrators can identify unusual or malicious activities by comparing current network behavior against the established baseline, such as abnormal traffic patterns or unauthorized access attempts.

3. Capacity Planning: Understanding network baselines enables administrators to forecast future capacity requirements accurately. By analyzing historical data, they can determine when and where network upgrades or expansions may be necessary, ensuring consistent performance as the network grows.

**Establishing a Network Baseline**

To establish an accurate network baseline, administrators follow a systematic approach:

1. Data Collection: Network traffic data is collected using specialized monitoring tools like network analyzers or packet sniffers. These tools capture and analyze network packets, providing detailed insights into performance metrics.

2. Duration: Baseline data should ideally be collected over an extended period, typically from a few days to a few weeks. This ensures the baseline accounts for variations due to different network usage patterns.

3. Normalizing Factors: Administrators consider various factors impacting network performance, such as peak usage hours, seasonal variations, and specific application requirements. Normalizing the data can establish a more accurate baseline that reflects typical network behavior.

4. Analysis and Documentation: Once the baseline data is collected, administrators analyze the metrics to identify patterns and trends. This analysis helps establish thresholds for acceptable performance and highlights any deviations that may require attention. Documentation of the baseline and related analysis is crucial for future reference and comparison.
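
Once thresholds exist, deviation checks become straightforward. The sketch below flags any measurement whose z-score against the documented baseline exceeds an agreed threshold; the baseline values and the three-sigma cut-off are illustrative assumptions.

```python
def deviates_from_baseline(value, mean, stdev, z_threshold=3.0):
    """Flag a measurement whose z-score exceeds the agreed threshold."""
    if stdev == 0:
        return value != mean
    z = abs(value - mean) / stdev
    return z > z_threshold

# Baseline established during the collection phase (e.g., latency in milliseconds)
baseline_mean, baseline_stdev = 42.0, 4.5
for sample in [44.0, 47.5, 71.0]:
    status = "anomaly" if deviates_from_baseline(sample, baseline_mean, baseline_stdev) else "normal"
    print(sample, status)   # 71.0 ms sits far outside the baseline and gets flagged
```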

Network Baselining: A Lot Can Go Wrong

Infrastructure is becoming increasingly complex, and let’s face it, a lot can go wrong. It’s imperative to have a global view of all the infrastructure components and a good understanding of the application’s performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and it is hard to validate the health of each piece manually.  

Therefore, monitoring and troubleshooting are much more complex, especially as everything is interconnected, making it difficult for a single person in one team to understand what is happening entirely. Nothing is static anymore; things are moving around all the time. This is why it is even more important to focus on the patterns and to be able to see the path of the issue efficiently.

Some modern applications could simultaneously be in multiple clouds and different location types, resulting in numerous data points to consider. If any of these segments are slightly overloaded, the sum of each overloaded segment results in poor performance on the application level. 

What does this mean for latency?

Distributed computing involves many components and services, and those components may be far apart. This contrasts with a monolith, with all parts in one location. Because modern applications are distributed, latency can add up. So, we have both network latency and application latency, and the network latency is typically several orders of magnitude larger.

As a result, you need to minimize the number of round-trip times and reduce any unneeded communication to an absolute minimum. When communication across the network is required, it’s better to gather as much data together as possible into bigger packets, which are more efficient to transfer. Also, consider using different buffer sizes, both small and large, which will have varying effects on dropped packets.

With the monolith, the application simply runs in a single process, and it is relatively easy to debug. Much traditional tooling and code instrumentation technology has been built on the assumption of a single process. The core challenge is debugging microservices applications: so much of the tooling we have today was built for traditional monolithic applications. There are new monitoring tools for these new applications, but they come with a steep learning curve and a high barrier to entry.

A new approach: Network baselining and Baseline engineering

For this, you need to understand practices like Chaos Engineering, along with service level objectives (SLOs), and how they can improve the reliability of the overall system. Chaos Engineering is a baseline engineering practice that allows tests to be performed in a controlled way. Essentially, we intentionally break things to learn how to build more resilient systems.

So, we inject faults in a controlled way to make the overall application more resilient. Implementing practices like Chaos Engineering will help you understand and manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.
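
A minimal sketch of controlled fault injection in Python: a decorator adds random latency and a configurable failure rate to a dependency call, so you can observe how the caller copes. The failure rate, latency bound, and function names are assumptions; dedicated chaos tooling injects faults at the infrastructure layer rather than in application code.

```python
import functools
import random
import time

def inject_faults(failure_rate=0.1, max_latency_s=0.5):
    """Wrap a call so a controlled fraction of invocations fail or slow down."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_latency_s))      # injected latency
            if random.random() < failure_rate:                # injected failure
                raise ConnectionError("chaos: injected dependency failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.2, max_latency_s=0.05)
def fetch_account_balance(account_id):
    return {"account": account_id, "balance": 100.0}

# Run the experiment and observe how the caller handles the injected faults
results = {"ok": 0, "failed": 0}
for _ in range(50):
    try:
        fetch_account_balance("acct-42")
        results["ok"] += 1
    except ConnectionError:
        results["failed"] += 1
print(results)
```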

**A final note on baselines: Don’t forget them**

Creating a good baseline is a critical factor. You need to understand how things work under normal circumstances. A baseline is a fixed point of reference used for comparison purposes. You usually need to know how long it takes from starting the application to the actual login, and how long the essential services take, before there are any issues or heavy load. Baselines are critical to monitoring.

It’s like security: if you can’t see it, you can’t protect it. The same assumption applies here. Aim for a good baseline and, if you can, have it fully automated. Tests need to be carried out against the baseline on an ongoing basis. You need to test constantly to see how long it takes users to use your services. Without baseline data, estimating any changes or demonstrating progress is difficult.

Network baselining is a critical practice for maintaining optimal network performance and security. By establishing a baseline, administrators can proactively monitor, analyze, and optimize their networks. This approach enables them to promptly identify and address performance issues, enhance security measures, and plan for future capacity requirements. Organizations can ensure a reliable and efficient network infrastructure that supports their business objectives by investing time and effort in network baselining.

Summary: Baseline Engineering

Maintaining stability and performance is crucial in the fast-paced world of technology, where networks are the backbone of modern communication. This blog post will delve into the art of Network Baseline Engineering, uncovering its significance, methods, and benefits—strap in as we embark on a journey to understand and master this essential aspect of network management.

Section 1: What is Network Baseline Engineering?

Network Baseline Engineering is a process that involves establishing a benchmark or baseline for network performance, allowing for effective monitoring, troubleshooting, and optimization. Administrators can identify patterns, trends, and anomalies by capturing and analyzing network data over a certain period.

Section 2: The Importance of Network Baseline Engineering

A stable network is vital for seamless operations, preventing downtime, and ensuring user satisfaction. Network Baseline Engineering helps administrators understand normal network behavior, which is crucial for detecting deviations, security threats, and performance issues. It enables proactive measures, reducing the impact of potential disruptions.

Section 3: Establishing a Baseline

Administrators need to consider various factors to create an accurate network baseline. These include defining key performance indicators (KPIs), selecting appropriate tools for data collection, and determining the time frame for capturing network data. Proper planning and execution are essential to ensure data accuracy and reliability.

Section 4: Analyzing and Interpreting Network Data

Once network data is collected, the real work begins. Skilled analysts leverage specialized tools to analyze the data, identify patterns, and establish baseline performance metrics. This step requires expertise in statistical analysis and a deep understanding of network protocols and traffic patterns.
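As a simplified illustration of this kind of analysis, the sketch below compares fresh measurements against a stored baseline and flags anything more than three standard deviations from the baseline mean. Production tooling is far more sophisticated; the latency values and threshold here are purely illustrative.

```python
import statistics

# Baseline samples collected during a known-good period (illustrative values, in ms).
baseline_latency_ms = [12.1, 11.8, 12.4, 13.0, 12.2, 11.9, 12.6, 12.3]

mean = statistics.mean(baseline_latency_ms)
stdev = statistics.stdev(baseline_latency_ms)

def is_anomalous(value_ms: float, threshold: float = 3.0) -> bool:
    """Flag a measurement more than `threshold` standard deviations from the baseline mean."""
    return abs(value_ms - mean) > threshold * stdev

# New measurements from the live network.
for sample in [12.5, 12.0, 19.7, 12.2]:
    status = "ANOMALY" if is_anomalous(sample) else "ok"
    print(f"{sample:6.1f} ms  {status}")
```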

Section 5: Benefits of Network Baseline Engineering

Network Baseline Engineering offers numerous benefits. It enables administrators to promptly detect and resolve performance issues, optimize network resources, and enhance overall network security. Organizations can make informed decisions, plan capacity upgrades, and ensure a smooth user experience by having a clear picture of normal network behavior.

Conclusion:

Network Baseline Engineering is the foundation for maintaining network stability and performance. By establishing a benchmark and continuously monitoring network behavior, organizations can proactively address issues, optimize resources, and enhance overall network security. Embrace the power of Network Baseline Engineering and unlock the full potential of your network infrastructure.

Docker network security

Docker Security Options

In the ever-evolving world of containerization, Docker has emerged as a leading platform for deploying and managing applications. As the popularity of Docker continues to grow, so does the importance of securing your containers and protecting your valuable data. In this blog post, we will delve into various Docker security options and strategies to help you fortify your container environment.

Docker brings numerous benefits, but it also introduces unique security challenges. We will explore common Docker security risks such as container breakout, unauthorized access, and image vulnerabilities. By understanding these risks, you can better grasp the significance of implementing robust security measures.

To mitigate potential vulnerabilities, it is crucial to follow Docker security best practices. We will share essential recommendations, including the importance of regularly updating Docker, utilizing strong access controls, and implementing image scanning tools. By adopting these practices, you can significantly enhance the security posture of your Docker environment.

Fortunately, the Docker ecosystem offers a range of security tools to assist in safeguarding your containers. We will delve into popular tools like Docker Security Scanning, Notary, and AppArmor. Each tool serves a specific purpose, whether it's vulnerability detection, image signing, or enforcing container isolation. By leveraging these tools effectively, you can bolster your Docker security framework.

Network security is a critical aspect of any container environment. We will explore Docker networking concepts, including bridge networks, overlay networks, and network segmentation. Additionally, we will discuss the importance of implementing firewalls, network policies, and encryption to protect your containerized applications.

The container runtime plays a crucial role in ensuring the security of your containers. We will examine container runtimes like Docker Engine and containerd, highlighting their security features and best practices for configuration. Understanding these runtime security aspects will empower you to make informed decisions to protect your containers.

Securing your Docker environment is not a one-time task, but an ongoing effort. By understanding the risks, implementing best practices, leveraging security tools, and focusing on network and runtime security, you can mitigate potential vulnerabilities and safeguard your containers effectively. Remember, a proactive approach to Docker security is key in today's ever-evolving threat landscape.

Highlights: Docker Security Options

Container Security

The fact that containers share the kernel of the Linux host boosts their performance and makes them lightweight. That shared kernel, however, is also where Linux containers pose their most significant security risk: namespaces do not cover every part of the kernel, which is the main reason for concern.

Because cgroups and standard namespaces provide some necessary isolation from the host’s core resources, containerized applications are more secure than noncontainerized applications. However, containers should not be treated as a replacement for good security practices. Run every container the way you would run an application on a production system, and if your application would run as a nonprivileged user on a server, run it as a nonprivileged user in the container too.

Docker Attack Surface

So you are currently in the virtual machine world and considering transitioning to a containerized environment. You want to streamline your application pipeline and gain the benefits of a Docker containerized environment. But you have heard from many that containers are insecure, and you are concerned about Docker network security. There is a Docker attack surface to be concerned about.

Example: Containers run as root

For example, containers run as root by default and carry many capabilities that should give you pause. Yes, a containerized environment brings many benefits, and for some application stacks containers are the only practical way to deploy. However, alongside those benefits comes a new attack surface, forcing you to examine Docker security options. The following post will discuss these security issues, point to a container security video to help you get started, and walk through an example of Docker escape techniques.

Understanding SELinux

SELinux, which stands for Security-Enhanced Linux, is a security framework built into the Linux kernel. It provides a powerful set of security policies and access controls to enforce fine-grained restrictions on processes and resources. By leveraging SELinux, administrators can define and enforce access rules, reducing the attack surface of Docker containers.

To use SELinux in a Docker environment, it is important to understand how the two integrate. Docker provides SELinux support through SELinux labels applied to Docker objects such as containers and volumes. These labels enforce SELinux policies and restrict container actions according to the rules you define.
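To make this concrete, the sketch below (Docker SDK for Python) starts a container with an explicit SELinux type label. It assumes an SELinux-enabled host with the container-selinux policy installed, where container_t is the usual default type; on such hosts the label is normally applied automatically, so passing it explicitly only matters when a custom policy defines its own type.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Run a container with an explicit SELinux type label. On most SELinux-enabled
# hosts this is the default anyway; a custom policy module could supply its own type.
container = client.containers.run(
    "nginx:alpine",
    detach=True,
    security_opt=["label=type:container_t"],
)

container.reload()
print(container.name, container.status)

# Labeling can also be disabled entirely, which is generally discouraged:
#   security_opt=["label=disable"]
```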

Advantages: SELinux

SELinux offers several benefits for Docker security. Firstly, it provides mandatory access controls, enabling administrators to define precisely what actions a container can perform. This prevents malicious containers from accessing sensitive resources or executing unauthorized commands. Secondly, SELinux helps mitigate container breakout attacks by isolating containers and limiting their interactions with the host system. Lastly, SELinux can help detect and prevent privilege escalation attempts within Docker containers.

**New Attacks and New Components**

Containers are secure by themselves, and the kernel is pretty much battle-tested. A container escape is hard to orchestrate unless a misconfiguration results in excessive privileges. So, even though the bad actors’ intent may stay the same, we must mitigate a range of new attacks and protect new components.

To combat these, you need to be aware of the most common Docker network security options and follow the recommended practices for Docker container security. A platform approach is also recommended, and OpenShift is a robust platform for securing and operating your containerized environment.

Understanding Docker Bench Security

Docker Bench Security is a script, built around the Center for Internet Security (CIS) Docker Benchmark, that automates the process of running security checks against Docker installations. It follows industry-standard best practices and provides a comprehensive report on potential vulnerabilities and misconfigurations.

Section 1: Installation and Configuration

To begin using Docker Bench Security, you first need to install it on your host system. The installation process is straightforward and well-documented. Once installed, you can configure the script to perform specific checks based on your needs and environment.

Section 2: Running Docker Bench

Running Docker Bench Security is as simple as executing a single command. The script will systematically analyze your Docker setup, checking for various security aspects such as host configuration, Docker daemon configuration, container runtime, networking, and more. It will generate a detailed report highlighting any security issues found.

The generated report from Docker Bench Security provides valuable insights into your Docker environment’s security posture. It categorizes the findings into different levels of severity, helping you prioritize and address the most critical vulnerabilities first. By understanding the report and taking necessary actions, you can significantly enhance the security of your Docker deployments.
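As a rough sketch, the checks can be run from the project's published container image and wrapped in a small script such as the one below. The mounts and flags shown are an approximation that varies between releases, so treat them as assumptions and confirm the current invocation in the project's README; the run needs access to the Docker socket and typically root privileges.

```python
import subprocess

# Approximate invocation of the containerized Docker Bench Security checks.
# Flags and read-only mounts differ between releases; consult the project README.
cmd = [
    "docker", "run", "--rm",
    "--net", "host", "--pid", "host", "--userns", "host",
    "--cap-add", "audit_control",
    "-v", "/var/run/docker.sock:/var/run/docker.sock:ro",
    "-v", "/etc:/etc:ro",
    "--label", "docker_bench_security",
    "docker/docker-bench-security",
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)

# The report marks findings as [PASS], [INFO], [NOTE], or [WARN]; count the warnings.
warnings = [line for line in result.stdout.splitlines() if "[WARN]" in line]
print(f"{len(warnings)} warnings to review")
```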

 

For pre-information, you may find the following posts helpful: 

  1. OpenShift Security Best Practices
  2. Docker Default Networking 101
  3. What Is BGP Protocol in Networking
  4. Container Based Virtualization
  5. Hands On Kubernetes

 

Docker Security Options

Docker Security

To use Docker safely in production and development, you must know potential security issues and the primary tools and techniques for securing container-based systems. Your system’s defenses should also consist of multiple layers.

For example, your containers will most likely run in VMs so that if a container breakout occurs, another level of defense can prevent the attacker from getting to the host or other containers. Monitoring systems should be in place to alert admins in the case of unusual behavior. Finally, firewalls should restrict network access to containers, limiting the external attack surface.

Container Isolation:

One of Docker’s key security features is container isolation, which ensures that each container runs in its own isolated environment. By utilizing Linux kernel features such as namespaces and cgroups, Docker effectively isolates containers from each other and the host system, mitigating the risk of unauthorized access or interference between containers.
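The cgroup half of that isolation is easy to demonstrate from the Docker SDK for Python. The sketch below starts a throwaway container with illustrative memory and process-count limits so that a runaway workload stays inside its own cgroup; the image, command, and limit values are arbitrary choices for the example.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Illustrative cgroup limits: cap memory and the number of processes the
# container may create, so a leaking or fork-bombing workload cannot exhaust the host.
container = client.containers.run(
    "alpine:3.19",
    command="sleep 300",
    detach=True,
    mem_limit="256m",   # memory cgroup limit
    pids_limit=100,     # pids cgroup limit
)

container.reload()
print(container.name, container.attrs["HostConfig"]["Memory"], "bytes memory limit")
```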

Image Vulnerability Scanning:

It is crucial to scan Docker images for vulnerabilities regularly to ensure their security. Docker Security Scanning is an automated service that helps identify known security issues in your containers’ base images and dependencies. By leveraging this feature, you can proactively address vulnerabilities and apply necessary patches, reducing the risk of potential exploits.
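Docker Security Scanning runs as a hosted service, but the same gate can be applied locally or in CI with an open-source scanner. The sketch below shells out to Trivy, a separate tool not mentioned above that must already be installed, and fails the build on high-severity findings; the image reference is a placeholder.

```python
import subprocess
import sys

IMAGE = "registry.example.com/myapp:1.2.3"  # placeholder image reference

# Trivy returns a non-zero exit code (via --exit-code) when findings at or above
# the requested severity exist, which makes it easy to gate a CI pipeline.
result = subprocess.run(
    ["trivy", "image", "--severity", "HIGH,CRITICAL", "--exit-code", "1", IMAGE]
)

if result.returncode != 0:
    print("High or critical vulnerabilities found; failing the build.")
    sys.exit(1)

print("Image passed the vulnerability gate.")
```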

Docker Content Trust:

Docker Content Trust is a security feature that allows you to verify the authenticity and integrity of images you pull from Docker registries. By enabling this feature, Docker ensures that only signed and verified images are used, preventing the execution of untrusted or tampered images. This provides an additional layer of protection against malicious or compromised containers.
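Content trust is enforced by the Docker client through the DOCKER_CONTENT_TRUST environment variable, so a thin wrapper like the sketch below is enough to ensure that only signed images are pulled. The image reference is a placeholder.

```python
import os
import subprocess

env = os.environ.copy()
env["DOCKER_CONTENT_TRUST"] = "1"  # require signature verification for this pull

# With content trust enabled, the pull is refused if the tag has no valid signature.
result = subprocess.run(
    ["docker", "pull", "registry.example.com/myapp:1.2.3"],  # placeholder image
    env=env,
)

if result.returncode != 0:
    raise SystemExit("Pull refused: the image is unsigned or its signature is invalid.")
```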

Role-Based Access Control (RBAC):

Controlling access to Docker resources is critical to maintaining a secure environment. Docker Enterprise Edition (EE) offers Role-Based Access Control (RBAC), which allows you to define granular access controls for users and teams. By assigning appropriate roles and permissions, you can restrict access to sensitive operations and ensure that only authorized individuals can manage Docker resources.

Network Segmentation:

Docker provides various networking options to facilitate communication between containers and the outside world. Implementing network segmentation techniques, such as bridge or overlay networks, helps isolate containers and restrict unnecessary network access. By carefully configuring the network settings, you can minimize the attack surface and protect your containers from potential network-based threats.
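A small sketch of this idea with the Docker SDK for Python is shown below: an internal bridge network keeps a cache container reachable from the application container but cut off from the outside world. The network name, images, and container names are illustrative.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# An "internal" bridge network has no route to the outside world, so anything
# attached to it is reachable only from other containers on the same network.
backend_net = client.networks.create("backend", driver="bridge", internal=True)

cache = client.containers.run(
    "redis:7-alpine",
    name="cache",
    detach=True,
    network="backend",
)

app = client.containers.run(
    "alpine:3.19",
    command="sleep 300",
    name="app",
    detach=True,
    network="backend",  # same segment as the cache, no published ports
)

print(backend_net.name, [c.name for c in (cache, app)])
```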

Container Runtime Security:

In addition to securing the container environment, it is equally important to focus on the security of the container runtime. Docker supports different container runtimes, such as Docker Engine and containerd. Regularly updating these runtimes to the latest stable versions ensures that you benefit from the latest security patches and bug fixes, reducing the risk of potential vulnerabilities.

**Docker Attack Surface**

Often, the tools and appliances in place are entirely blind to containers. The tools look at a running process and think: if the process is secure, then I’m safe. One of my clients ran a container whose Dockerfile pulled an insecure image. The onsite tools did not know what an image was and could not scan it.

As a result, we had malware right in the network’s core, a little too close to the database server for my liking. Yes, we often call containers a fancy process, and I’m to blame here too, but we need to consider what is around the container to secure it fully. For a container to function, it needs the support of the infrastructure around it, such as the CI/CD pipeline and supply chain.

To improve your security posture, you must consider all of that surrounding infrastructure. If you are looking for quick tips on Docker network security, the course I created for Pluralsight may help you with Docker security options.

Ineffective Traditional Tools

Containers are not like traditional workloads. With a single command, we can run an entire application with all its dependencies. Legacy security tools and processes often assume largely static operations and must be adjusted to the rate of change in containerized environments. In non-cloud-native data centers, Layer 4 controls are coupled to the network topology at fixed network points and lack the flexibility to support containerized applications.

There is often only inter-zone filtering, and east-west traffic may go unchecked. Containers change the perimeter, moving it right up to the workload. Just look at a microservices architecture: it has many more entry points than a monolithic application.

Docker container networking

When considering container networking, we are a world apart from the monolithic model. Containers are short-lived and constantly spun up and torn down, and assets such as servers, IP addresses, firewalls, drives, and overlay networks are recycled to optimize utilization and enhance agility. Traditional perimeters designed around IP address-based security controls lag behind in a containerized environment.

Static rules and signature-based controls can’t keep up with rapidly changing container infrastructure. Securing hyper-dynamic container infrastructure with traditional network and endpoint controls won’t work. For this reason, you should adopt purpose-built tools and techniques for a containerized environment.

**The Need for Observability**

Not only do you need to implement good Docker security options, but you also need to pay attention to modern observability tools. We need proper observability of the state of security and the practices used in the containerized environment, and we need to automate this as much as possible: not just the development, but also the security testing, container scanning, and monitoring.

You are only as secure as the containers you have running. You need observability across your systems and applications and must act proactively on what you find. It is not something you can buy; it is a cultural change. You want to know how the application behaves with the server, how the network behaves with the application, and what data transfer looks like both in transit and in a steady state.

What level of observability do you need to know that everything is performing as it should? There are several challenges to securing a containerized environment. Containerized technologies are dynamic and complex and require a new approach that can handle the agility and scale of today’s landscape. There are initial security concerns that you must understand before you get started with container security; they will help you shape a better starting strategy.

Docker attack surface: Container attack vectors 

We must consider a different threat model and understand how security principles such as least privilege and defense in depth apply to Docker security options. With Docker containers, we have a completely different way of running applications and, as a result, a different set of risks to deal with.

Instructions are built into Dockerfiles, which run applications differently from a normal workload. With the right level of access, a bad actor could put anything into a Dockerfile; without guard rails that understand containers, that is a threat.

Therefore, we must examine new network and security models, as old tools and methods won’t meet these demands. A new network and security model requires you to mitigate a new set of attack vectors. Bad actors’ intent stays the same, and they are not going away anytime soon, but a misconfigured container environment gives them a different and potentially easier attack surface.

I would consider the container attack surface significant; if the environment is not locked down, bad actors have plenty to work with by default. For example, we have image vulnerabilities, access control exploits, container escapes, privilege escalation, application code exploits, and attacks on the Docker host and all of the Docker components.

Docker security options: A final security note

Containers by themselves are secure, and the kernel is pretty much battle-tested. You will not often encounter kernel compromises, but they happen occasionally. A container escape is hard to orchestrate unless a misconfiguration results in excessive privileges. From a security standpoint, it is best to steer clear of container capabilities that grant excessive privileges.

Minimise container capabilities: Reduce the attack surface.

If you minimize the container’s capabilities, you are stripping down its functionality to a bare minimum—we mentioned this in the container security video. Therefore, the attack surface is limited, and the attack vector available to the attacker is minimized. 

You also want to keep an eye on CAP_SYS_ADMIN. This capability grants access to an extensive range of privileged activities. Containers run with many other capabilities by default that can cause havoc.
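One way to apply this with the Docker SDK for Python is sketched below: drop every capability and add back only what the workload is assumed to need, and never grant CAP_SYS_ADMIN. The image, command, and the single re-added capability are assumptions for illustration.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Start from zero capabilities, then add back only what this workload is assumed
# to need. CAP_SYS_ADMIN is deliberately never granted.
container = client.containers.run(
    "alpine:3.19",
    command="sleep 300",
    detach=True,
    cap_drop=["ALL"],
    cap_add=["NET_BIND_SERVICE"],  # assumed requirement; drop this too if unneeded
)

print(container.name, "started with a minimal capability set")
```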

As Docker continues to gain popularity, understanding and implementing proper security measures is essential to safeguarding your containers and infrastructure. By leveraging the security options discussed in this blog post, you can mitigate risks, protect against potential threats, and ensure the integrity and confidentiality of your applications. Stay vigilant, stay secure, and embrace the power of Docker while keeping your containers safe.

Summary: Docker Security Options

With the growing popularity of containerization, Docker has become a leading platform for deploying and managing applications. However, as with any technology, security should be a top priority. In this blog post, we delved into various Docker security options that can help you safeguard your containers and ensure the integrity of your applications.

Understanding Docker Security

Before we discuss the specific security options, let’s establish a foundational understanding of Docker security. We’ll explore the concept of container isolation, Docker vulnerabilities, and potential risks associated with containerized environments.

Docker Security Best Practices

To mitigate security risks, it’s crucial to follow Docker security best practices. This section will outline critical recommendations, including limiting container privileges, using secure base images, and implementing container scanning and vulnerability assessment tools.

Docker Content Trust

Docker Content Trust, built on the Notary project, is a security feature that ensures the authenticity and integrity of Docker images. We’ll explore how it works, how to enable it, and the benefits it provides in preventing image tampering and unauthorized modifications.

Docker Network Security

Securing Docker networks is essential to protect against unauthorized access and potential attacks. In this section, we’ll discuss network segmentation, Docker network security models, and techniques such as network policies and firewalls to enhance the security of your containerized applications.

Container Runtime Security

The container runtime plays a critical role in Docker security. We’ll examine different container runtimes, such as Docker Engine and containerd, and explore features like seccomp, AppArmor, and SELinux that can help enforce fine-grained security policies and restrict container capabilities.
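These mechanisms are all wired up through per-container security options. The sketch below (Docker SDK for Python) assumes an AppArmor-based host and a custom seccomp profile stored at an assumed path; note that the Engine API expects the seccomp profile's JSON content rather than a file path, which is why the file is read first.

```python
from pathlib import Path

import docker  # Docker SDK for Python

client = docker.from_env()

# The Engine API expects the seccomp profile JSON itself, not a path to it.
seccomp_profile = Path("/etc/docker/seccomp-restricted.json").read_text()  # assumed path

container = client.containers.run(
    "alpine:3.19",
    command="sleep 300",
    detach=True,
    security_opt=[
        f"seccomp={seccomp_profile}",   # custom syscall filter
        "apparmor=docker-default",      # the host's default AppArmor profile
        "no-new-privileges:true",       # block privilege escalation via setuid binaries
    ],
)

print(container.name, "running with seccomp, AppArmor, and no-new-privileges")
```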

Conclusion:

In this blog post, we have explored various Docker security options that can empower you to protect your containers and fortify your applications against potential threats. By understanding Docker security fundamentals, following best practices, leveraging Docker Content Trust, securing Docker networks, and utilizing container runtime security features, you can enhance the overall security posture of your containerized environment. As you continue your journey with Docker, remember to prioritize security and stay vigilant in adopting the latest security measures to safeguard your valuable assets.