
Zero Trust SASE

In today's digital age, where remote work and cloud-based applications are becoming the norm, traditional network security measures are no longer sufficient to protect sensitive data. Enter Zero Trust Secure Access Service Edge (SASE), a revolutionary approach that combines the principles of Zero Trust security with the flexibility and scalability of cloud-based architectures.

In this blog post, we will delve into the concept of Zero Trust SASE and explore its benefits and implications for the future of network security.

Zero Trust is a security model that operates on the principle of "never trust, always verify." It assumes that no user or device should be granted automatic trust within a network, whether inside or outside the perimeter. Instead, every user, device, and application must be continuously authenticated and authorized based on various contextual factors, such as user behavior, device health, and location.

SASE is a comprehensive security framework that combines networking and security capabilities into a single cloud-based service. It aims to simplify and unify network security by providing secure access to applications and data, regardless of the user's location or device.

SASE integrates various security functions, such as secure web gateways, cloud access security brokers, and data loss prevention, into a single service, reducing complexity and improving overall security posture.

Highlights: Zero Trust SASE

The Lag in Security 

Today’s digital transformation and strategy initiatives require speed and agility in IT. However, there is a lag, and that lag is security. Security can either hold these initiatives back or fail to align with the fluidity needed for agility. As a result, the organization’s security posture decreases, which poses a risk that needs to be managed. We have a lot to deal with, such as the rise in phishing attacks, mobile malware, fake public Wi-Fi networks, malicious apps, and data leaks.

The Role of New Security Requirements

These challenges have propelled new security requirements. One is the critical capability to continuously discover, assess, and adapt to ever-changing risk and trust levels. These capabilities are bundled into a Secure Access Service Edge (SASE) solution, combining the SASE definition and Zero Trust network design capabilities into one SASE architecture.

Understanding Zero Trust

Zero Trust is a security model that operates on the principle of never trusting any network or user by default. It emphasizes continuous verification and strict access control to mitigate potential threats. With Zero Trust, organizations adopt a granular approach to security, ensuring that every user, device, and application is authenticated and authorized before accessing any resources.


Introducing SASE

Secure Access Service Edge (SASE) is a cloud-based architecture that converges network and security services into a unified platform. SASE offers a holistic approach by integrating wide area networking (WAN) capabilities with security functions, providing organizations with a scalable and flexible solution. This convergence enables seamless connectivity and robust security across distributed networks, regardless of the user’s location or device.

The Powerful Features of Zero Trust SASE

    • Scalability and Flexibility:

Zero Trust SASE is designed to scale effortlessly, accommodating businesses’ evolving needs. The architecture can adapt without compromising security, whether expanding network infrastructure or adding new users. The flexibility of Zero Trust SASE allows organizations to seamlessly integrate new applications and services into their network while maintaining a solid security posture.

    • Unified Security and Networking:

One of Zero Trust SASE’s standout features is the convergence of security and networking services into a single platform. This integration eliminates the complexities associated with managing separate security and networking solutions. By consolidating these functions, organizations can achieve streamlined operations, reduced costs, and enhanced visibility across their network infrastructure.

    • Enhanced Threat Prevention:

Zero Trust SASE incorporates advanced threat prevention mechanisms to combat the ever-evolving threat landscape. With features like real-time monitoring, behavior analytics, and threat intelligence, organizations can proactively identify and mitigate potential risks. By leveraging Zero Trust principles alongside SASE capabilities, businesses can significantly enhance their security posture and protect against emerging threats.

Related: For pre-information, you may find the following helpful:

  1. SD-WAN SASE
  2. SASE Model
  3. SASE Solution
  4. Cisco Secure Firewall
  5. SASE Definition



SASE Architecture

Key Zero Trust SASE Discussion Points:


  • The rise of SASE.

  • Challenges to existing networking.

  • The misconception of Trust.

  • SASE definition and SASE architecture.

  • SASE requirements.

Back to Basics: Zero Trust SASE

The SASE Concept

Gartner coined the SASE concept after seeing a pattern emerge in cloud and SD-WAN projects where full security integration was needed. We now refer to SASE as a framework and a security best practice. SASE leverages multiple security services into a framework approach.

The idea of SASE was not far from what we already did, which was to integrate multiple security solutions into a stack that ensured a comprehensive, layered, secure access solution. By calling it a SASE framework, the approach to a complete solution somehow felt more focused than what the industry recognized as a best security practice.

SASE Meaning

Main SASE Definition Components

SASE – Secure Access Service Edge

  • Network as a Service (NaaS)

  • Security as a Service (SECaaS)

  • Zero-Trust Architecture

  • Cloud-Native Architecture

The Benefits of Zero Trust SASE:

1. Enhanced Security: Zero Trust SASE ensures that only authorized users and devices can access sensitive resources, minimizing the risk of data breaches and insider threats. Organizations can mitigate the impact of compromised credentials and unauthorized access attempts by continuously verifying user identities and device health.

2. Scalability and Flexibility: With Zero Trust SASE, organizations can scale their security infrastructure dynamically based on their needs. SASE solutions can adapt to changing network demands as cloud-based services, providing secure access to applications and data from anywhere, anytime, and on any device.

3. Simplified Management: By consolidating multiple security functions into a single service, Zero Trust SASE simplifies security management and reduces operational overhead. Organizations can centrally manage and enforce security policies across their entire network, eliminating the need for multiple-point solutions and reducing complexity.

4. Improved User Experience: Zero Trust SASE eliminates the need for traditional VPNs and complex access control mechanisms. Users can securely access applications and data directly from the cloud without backhauling traffic to a central location. This improves performance and user experience, especially for remote and mobile users.

The Rise of SASE

The rise of SASE goes hand in hand with a Zero Trust security strategy. Security infrastructure and decisions must become continuous and adaptive, not static, as was the basis of traditional security methods. Consequently, we must enable real-time decisions that balance risk, trust, and opportunity. As a result, security has moved beyond a simple access control list (ACL) and zone-based segmentation based on VLANs. In reality, there is no longer a single network point that acts as an anchor for security.

Diagram: Zero Trust SASE: Digital transformation and strategy.

Zero Trust SASE: SASE Architecture

Many current network security designs and technologies were not designed to handle all the traffic and security threats we face today. This has forced many to adopt multiple-point products to address the different requirements. Remember that for every point product, there is an architecture to deploy, a set of policies to configure, and a bunch of logs to analyze.

I find it hard to correlate logs across multiple point-product solutions used in different domains. For example, a different team may operate the secure web gateways (SWG) than the one operating the virtual private network (VPN) appliances. These teams may work in silos and be in different locations.

Challenges to existing networks

Many challenges to existing networks and infrastructure create big security holes and decrease security posture. In reality, several IT components give an entity more access than required. We have considerable security flaws from using IP addresses and static locations as security anchors; the virtual private network (VPN) and demilitarized zone (DMZ) architectures used to establish access are often configured to allow excessive implicit trust.

The issue with a DMZ

The DMZ is the neutral network between the Internet and your organization’s private network. It’s protected by a front-end firewall that limits Internet traffic to specific systems within its zone. The DMZ can have a significant impact on security if not appropriately protected. Remote access technologies such as VPN or RDP, often located in the DMZ, have become common targets of cyberattacks. One of the main issues I see with the DMZ is that the bad actors know it’s there. It may be secured, but it’s visible.

The issue with the VPN

In basic terms, a VPN provides an encrypted tunnel to a server and hides your IP address. However, the VPN does not secure users once they land on a network segment, and it is based on coarse-grained access control where the user has access to entire network segments and subnets. Traditionally, once you are on a segment, there is no intra-segment filtering. That means all users in that segment need the same security level and access to the same systems, but that is not always the case.


Overly permissive network access

VPNs generally provide broad, overly permissive network access, with only fundamental access control limits based on subnet ranges. In other words, the traditional VPN grants access and applies security based on IP subnets.
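To make this concrete, here is a minimal sketch, in Python, of the kind of coarse, subnet-based check a traditional VPN effectively performs. The subnet and addresses are hypothetical; the point is that the whole range is trusted, regardless of who the user is or the state of the device.

```python
import ipaddress

# A traditional VPN-style rule: anyone landing on this subnet is "trusted".
ALLOWED_SUBNET = ipaddress.ip_network("10.20.0.0/16")  # hypothetical VPN segment

def legacy_allow(source_ip: str) -> bool:
    """Coarse-grained check: access is granted to the whole subnet,
    regardless of user identity, device health, or context."""
    return ipaddress.ip_address(source_ip) in ALLOWED_SUBNET

print(legacy_allow("10.20.5.9"))     # True - even if the device is compromised
print(legacy_allow("198.51.100.7"))  # False - outside the range
```

Contrast this with the identity- and context-driven checks described later in this post.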

Diagram: Security infrastructure: The issues.

SASE Architecture and Misconception of Trust 

Much of non-zero-trust security architecture is based on trust, and bad actors abuse this trust. A SASE overview, on the other hand, includes zero trust networking and remote access as one of its components, which can adaptively offer the appropriate level of trust required at the time and nothing more.

It is like providing a narrow segmentation based on many contextual parameters continuously assessed for risk to ensure the users are who they are and that the entities, either internal or external to the network, are doing what they are supposed to do.

Removes excessive trust

A core feature of SASE and Zero Trust is that it removes the excessive trust once required to allow entities to connect and collaborate. Within a zero-trust environment, the implicit trust of traditional networks is replaced with explicit, identity-based trust and a default of denial. With an identity-based trust solution, we are not just looking at IP addresses to determine trust levels. After all, they are just binary values, deemed either a secure private address or a less trustworthy public one. This assumption is where all of our problems started. They are just ones and zeros.

Zero Trust concept: Proxy for trust

To improve your security posture, it would be best to stop relying primarily on IP addresses and network locations as a proxy for trust. We have been doing this for decades. There is minimal context in placing a policy with legacy constructs. To determine the trust of a requesting party, we need to examine multiple contextual aspects, not just IP addresses.

And the contextual aspects are continuously assessed for security posture. This is a much better way to manage risk and allows you to look at the entire picture before deciding to enter the network or access a resource.

Diagram: Zero Trust requirements: Lockdown of trust and access.

Challenging Environments

More outside than inside

The current environmental challenge is that more users, devices, applications, services, and data are located outside an enterprise than inside. This is driven by a rapid rise in remote working, especially in recent times, and by increased adoption of cloud-based services, particularly SaaS. These environmental changes have turned the enterprise network “inside out,” so the traditional perimeter we relied on is no longer effective.

Multi-cloud

Many organizations are also adopting multi-cloud. There are challenges in deploying and managing the native security offerings from multiple cloud service providers. Different providers have different management consoles and security capabilities that do not share or integrate policies. Although technologies exist to help with this, the cloud providers remain separate entities. To combat these environmental evolutions, we have made several attempts to secure our infrastructure.

SASE: The first attempt

Organizations have been adopting different security technologies to combat these changes and include them in their security stack. Many of these security technologies are cloud-based services, including the cloud-based secure web gateway (SWG), content delivery network (CDN), and web application firewall (WAF). A secure web gateway (SWG) protects users from web-based threats and applies and enforces acceptable corporate use policies.

A content delivery network (CDN) refers to a geographically distributed group of servers working together to deliver Internet content quickly. A WAF or web application firewall helps protect web applications by filtering and monitoring HTTP traffic between a web application and the Internet.

The data center is the center of the universe.

However, even with these welcomed additions to security, the general trend was that the data center is still the center of most enterprise networks and network security architectures. Let’s face it: These designs are becoming ineffective and cumbersome with the rise of cloud and mobile technology. Traffic patterns have changed considerably, and so has the application logic.

SASE: The second attempt

The next attempt was for a converged cloud-delivered secure access service edge (SASE) to accomplish this shift in the landscape. And that is what SASE architecture does. As you know, the SASE architecture relies on multiple contextual aspects to establish and adapt trust for application-level access.

It does not concern itself with large VLANs and broad-level access, nor does it assume that the data center is the center of the universe. Instead, the SASE architecture is often based on PoPs, where each PoP acts as the center of the universe.

The SASE definition and its components form a transformational architecture that can combat many of the challenges discussed. A SASE solution converges networking and security services into one unified, cloud-delivered solution that includes the following core SASE capabilities.

From the network side of things: SASE in networking

    1. Software-defined wide area network (SD-WAN)
    2. Virtual private network (VPN)
    3. Zero Trust Network (ZTN)
    4. Quality of service (QoS)
    5. Software-defined perimeter (SDP)

From the security side of things: SASE capabilities in security

    1. Firewall as a service (FWaaS)
    2. Domain Name System (DNS) security
    3. Threat prevention
    4. Secure web gateways
    5. Data loss prevention (DLP)
    6. Cloud access security broker (CASB)

Zero Trust SASE: What the SASE architecture changes

SASE changes the focal point to the identity of the user and device. With traditional network design, the on-premises data center is considered the center of the universe. SASE changes this architecture to match today’s environment and moves the perimeter to the actual user, the device, or, in some SASE designs, the PoP. This contrasts with traditional enterprise network and security architectures, where the internal data center is the focal point for access.

Diagram: SASE features.

VPN Security Scenario 

The limitations of traditional remote access VPNs

Remote access VPNs are primarily built to allow users outside the perimeter firewall to access resources inside the perimeter firewall. As a result, they often follow a hub-and-spoke architecture, with users connected by tunnels of various lengths depending on their distance from the data center. Traditional VPNs introduce a lot of complexity. For example, what do you do if you have multiple sites where users need to access applications? In this scenario, the cost of management would be high. 

Tunnels based on IP

What’s happening here is that the tunnel creates an extension between the client device and the application location. The tunnel is based on the IP addresses of the client device and the remote application. Now that there is IP connectivity between the client and the application, the network where the application is located is extended to the client.

However, the client might be sitting in a hotel room or at home. These locations may not be sufficiently protected and should be considered insecure. The traditional VPN has many issues to deal with: connections are user-initiated, and policy often permits split-tunnel VPN, where there is no inspection of Internet or cloud traffic.

SASE and VPN: A zero-trust VPN solution

A SASE solution encompasses VPN services and enhances them with the capabilities of a cloud-based infrastructure to route traffic. With SASE, the client connects to the SASE PoP, which carries out security checks and forwards the request to the application. A SASE design still allows clients to access the application, but they can only access that specific application and nothing more, like a stripped-down VLAN, an approach known as micro-segmentation.

Clients must pass security controls, and there is no broad-level access that is susceptible to lateral movement. Access control is based on an allowlist rather than the traditional blocklist rule. Also, other variables present in the request context are used instead of IP addresses as the client identifier. As a result, the application, not the network, is now the access path.

ZTNA remote access

So, no matter what type of VPN services you use, the SASE provides a unified cloud to connect to instead of backhauling to a VPN gateway—simplifying management and policy control. Well-established technologies such as VPN, secure web gateway, and firewall are being reviewed and reassessed in Zero Trust remote access solutions as organizations revisit approaches that have been in place for over a decade. 

A quick recommendation: SASE and SD-WAN

The value of SD-WAN is high. However, it also brings many challenges, including new security risks. In some of my consultancies, I have seen unreliable performance and increased complexity due to the need for multiple overlays. Also, these overlays need to terminate somewhere, and this will be at a hub site.

However, when combined with SASE, the SD-WAN edge devices can be connected to a cloud-based infrastructure rather than the physical SD-WAN hubs. This brings the value of interconnectivity between branch sites without the complexity of deploying or managing physical Hub sites.

Diagram: SASE in networking.

Zero Trust SASE: Vendor considerations

SASE features converge various individual components into one connected, cloud-delivered service, making it easy to control policies and behaviors. The SASE architecture is often based on a PoP design. When examining the SASE vendor, the vendor’s PoP layout should be geographically diverse, with worldwide entry and exit points.

Also, considerations should be made regarding the vendor’s edge/physical infrastructure providers or colocation facilities. We can change your security posture, but we can’t change the speed of light and the laws of physics.

SASE capabilities and route optimizations

Consider how the SASE vendor routes traffic in their PoP fabric. Route optimization should be performed at each PoP. Some route optimizations are for high availability, while others are for performance. Does the vendor offer cold-potato or hot-potato routing? The cold-potato routing means bringing the end-user device into the provider’s network as soon as possible. On the other hand, “hot-potato routing” means the end user’s traffic traverses more of the public Internet.

The Main Zero Trust SASE Architecture Requirements List

The following is a list of considerations to review when discussing SASE with your preferred cybersecurity vendor.

Diagram: Zero trust environment.

Zero Trust SASE requirements: Information hiding

Secure access service requires clients to be authenticated and authorized before accessing protected assets, regardless of whether the connection is inside or outside the network perimeter. Then, real-time encrypted connections are created between the requesting client and the protected asset. As a result, all SASE-protected servers and services are hidden from all unauthorized network queries and scan attempts.

You can’t attack what you can’t see.

The basis of network security started with limiting visibility – you cannot attack what you cannot see. Public and private IP address ranges were separated into different networks. This was the biggest mistake we ever made, as IP addresses are just binary, whether they are deemed public or private. If a host assigned a public address wanted to communicate with a host holding a private address, it would need to go through a network address translation (NAT) device and have a permit policy set.

Security based on the visibility

Network address translation maps one IP address space into another by modifying network address information in the IP header of packets while they are in transit across a traffic routing device. Limiting visibility this way works to a degree, but we cannot get away from the fact that a) if you have someone's IP address, you can reach them, and b) if a port is open, you can potentially connect to it. Therefore, the traditional security method can leave your network wide open to compromise, especially when bad actors have all the tools. After all, finding, downloading, and running a port scanning tool is not hard.

“Nmap,” short for Network Mapper, is the most widely used port scanning tool. Nmap works by checking a network for hosts and services. Once found, it sends packets to those hosts and services and interprets the responses, using the data to create a map of the network.
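To illustrate just how little effort that takes, here is a minimal TCP connect scan in plain Python. It is not Nmap, and the target address is hypothetical, but it shows the basic idea of probing for visible, open ports.

```python
import socket

def tcp_connect_scan(host: str, ports: range, timeout: float = 0.5) -> list:
    """Try a full TCP handshake against each port; any port that answers is visible."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connection succeeded
                open_ports.append(port)
    return open_ports

# Only scan hosts you own or are authorized to test (hypothetical address below).
print(tcp_connect_scan("192.0.2.10", range(20, 1025)))
```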

Example: Single Packet Authorization

Zero Trust network security is used for information and infrastructure hiding through lightweight protocols such as a single packet authorization (SPA). No internal IP addresses or DNS information is shown, creating an invisible network.

As a result, we have zero visibility and connectivity, only establishing connectivity after clients prove they can be trusted to allow legitimate traffic. Now, we can have various protected assets hidden regardless of location: on-premise, public or private clouds, a DMZ, or a server on the internal LAN, in keeping with today’s hybrid environment.

This approach mitigates denial-of-service attacks. Anything internet-facing is reachable on the public Internet and, therefore, susceptible to bandwidth and server denial-of-service attacks. The default-drop firewall is deployed, with no visible presence to unauthorized users. Only good packets are allowed.

Zero Trust SASE tools: Single packet authorization (SPA)

Single packet authorization (SPA) also allows for attack detection. If a host receives anything other than a valid SPA packet or similar construct, it views that packet as part of a threat. The first packet to a service must be a valid SPA packet or similar security construct.

If it receives another packet type, it views this as an attack, which is helpful for bad packet detection. Therefore, SPA can determine an attack based on a single malicious packet, a highly effective way to detect network-based attacks. Thus, external network and cross-domain attacks are detected.
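The following is a conceptual sketch of the SPA idea, not any particular vendor's implementation: a single UDP datagram carrying the client identity and a timestamp, sealed with an HMAC over a pre-shared key, and anything that does not validate is treated as an attack indicator. The key, addresses, and port below are hypothetical.

```python
import hashlib
import hmac
import json
import socket
import time

SHARED_KEY = b"out-of-band-provisioned-secret"  # hypothetical pre-shared key

def build_spa_packet(client_id: str) -> bytes:
    """Single authorization packet: identity + timestamp, sealed with an HMAC."""
    body = json.dumps({"client": client_id, "ts": int(time.time())}).encode()
    mac = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest().encode()
    return body + b"." + mac

def valid_spa_packet(packet: bytes, max_age: int = 30) -> bool:
    """Server side: anything that is not a fresh, correctly signed packet is hostile."""
    try:
        body, mac = packet.rsplit(b".", 1)
        expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest().encode()
        fresh = time.time() - json.loads(body)["ts"] < max_age
        return hmac.compare_digest(mac, expected) and fresh
    except Exception:
        return False  # malformed packet: log it as a possible attack

# Client side: fire one UDP datagram at the hidden service's SPA port.
packet = build_spa_packet("laptop-123")
socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, ("192.0.2.20", 62201))
```

Only after a packet like this validates would the controller open a connection path for the client; everything else is dropped by the default-deny firewall described below.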

Diagram: Single packet authorization (SPA).

Zero Trust SASE architecture requirements: Mutually encrypted connections

Transport Layer Security (TLS) is an encryption protocol that protects data as it moves between computers. When two computers send data, they agree to encrypt the information in a way they both understand. TLS was designed to provide mutual device authentication before enabling confidential communication over the public Internet.

However, the standard TLS configuration only validates that the client is connected to a trusted entity. Typical TLS deployments authenticate servers to clients, not clients to servers.

Mutually encrypted connections

SASE uses the full TLS standard to provide mutual, two-way cryptographic authentication. Mutual TLS provides this and goes one step further to authenticate the client. Mutual TLS connections are set up between all components in the SASE architecture.

Mutual Transport Layer Security (mTLS) is a process that establishes an encrypted TLS connection in which both parties use X.509 digital certificates to authenticate each other. mTLS can help mitigate the risk of moving services to the cloud and can help prevent malicious third parties from imitating genuine apps.

This offers robust device and user authentication, as connections from unauthorized users and devices are mitigated. Secondly, forged certificates, which are attacks aimed at credential theft, are disallowed. This will reduce impersonation attacks, where a bad actor can forge a certificate from a compromised certificate authority.
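As a minimal sketch of what mutual TLS looks like in practice, the following uses Python's standard ssl module on the server side. The certificate and key file paths are assumptions for illustration; a SASE platform would manage this for you.

```python
import socket
import ssl

# Server side: present our certificate AND require a valid client certificate.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")  # server identity
ctx.load_verify_locations(cafile="clients-ca.crt")                # CA that issues client certs
ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a trusted client cert

with socket.create_server(("0.0.0.0", 8443)) as srv:
    with ctx.wrap_socket(srv, server_side=True) as tls_srv:
        conn, addr = tls_srv.accept()  # unauthenticated clients never get this far
        print("authenticated client:", conn.getpeercert().get("subject"))
```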

Zero Trust SASE architecture requirements: Need-to-know access model

Thirdly, SASE employs a need-to-know access model. As a result, SASE permits the requesting client to view only the resources that are allowed to be appropriate to the assigned policy. Users are associated with their devices that are validated based on policy. Only connections to the specifically requested service are enabled, and no other connection is allowed to any other service. 

SASE provides additional information, such as who made the connection, from what device, and to what service. All these give you full visibility into all the established connections, which is pretty hard to do if you have an IP-based solution. So now we have a contextual aspect of determining the level of risk. As a result, it makes forensics easier. The SASE architecture only accepts good packets; bad packets can be analyzed and tracked for forensic activities.

A key point: Device validation

It also enforces device validation, which helps against threats from unauthorized devices. Not only can we examine the requesting user, we can also perform device validation. Device validation ensures that the machine is running on trusted hardware and is used by the appropriate user.

Finally, suppose a device does become compromised. In that case, there is a complete lockdown on lateral movements as a user is only allowed access to the resource it is authorized to. Or they could be placed into a sandbox zone where human approval must intervene and assess the situation.

Zero Trust SASE architecture requirements: Dynamic access control

Traditional firewalls are limited in scope because they cannot express or enforce rules based on identity information, which you can do with zero trust identity. Rather than attempting to model identity-centric control within the limitations of the 5-tuple, SASE can be used alongside traditional firewalls and take over the network access control enforcement that we try to do with conventional firewalls.

SASE deploys a dynamic firewall that starts with one rule – deny all. Then, requested communications are dynamically inserted into the firewall, providing an active firewall security policy instead of static configurations. For example, every packet hitting the firewall is inspected with single packet authorization (SPA) and then quickly verified against a connection request.

Diagram: Zero trust capabilities.

A key point: Dynamic firewall

Once established, the firewall is closed again. Therefore, the firewall is dynamically opened only for a specific period. The connections made are not seen by rogues outside the network or the user domain within the network.

This allows dynamic, membership-based enclaves that prevent network-based attacks. SASE dynamically binds users to devices, enabling those users to access protected resources by dynamically creating and removing firewall rules.

Access to protected resources is facilitated by dynamically creating and removing inbound and outbound access rules. Therefore, we now have more precise access control mechanisms and considerably reduced firewall rules.
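A pure-Python, in-memory model of this behavior is sketched below. It is not any vendor's firewall, just an illustration of a default-deny rule table where per-session allow rules are inserted after authorization and expire on their own.

```python
import time

class DynamicFirewall:
    """Default-deny table: allow rules exist only for authorized sessions and expire."""

    def __init__(self):
        self._allow = {}  # (src_ip, dst_ip, dst_port) -> expiry timestamp

    def open_pinhole(self, src_ip, dst_ip, dst_port, ttl=60):
        # Called only after the SPA packet / policy check has succeeded.
        self._allow[(src_ip, dst_ip, dst_port)] = time.time() + ttl

    def permits(self, src_ip, dst_ip, dst_port) -> bool:
        expiry = self._allow.get((src_ip, dst_ip, dst_port))
        if expiry is None or expiry < time.time():
            self._allow.pop((src_ip, dst_ip, dst_port), None)
            return False  # no rule, or rule expired: default deny
        return True

fw = DynamicFirewall()
print(fw.permits("10.1.1.5", "10.9.9.9", 443))        # False: deny all by default
fw.open_pinhole("10.1.1.5", "10.9.9.9", 443, ttl=30)  # inserted after authorization
print(fw.permits("10.1.1.5", "10.9.9.9", 443))        # True, but only for 30 seconds
```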

Zero Trust SASE architecture requirement: Micro perimeter

Traditional applications were grouped into VLANs whether they offered similar services or not. Everything on that VLAN was reachable. The VLAN was a performance construct to break up broadcast domains, but it was pushed into the security world and never meant to be there. 

Its prime use was to increase performance. However, it was used for security in what we know as traditional zone-based networking. The segments in zone-based networks are too large and often have different devices with different security levels and requirements.

Logical-access boundary

SASE enables this by creating a logical access boundary encompassing a user and an application or set of applications. And that is it—nothing more and nothing less. Therefore, we have many virtual micro perimeters specific to the business instead of the traditional main inside/outside perimeter. Virtual perimeters allow you to grant access to the particular application, not the underlying network or subnet.

Diagram: SASE and micro perimeters.

Reduce the attack surface.

The smaller micro-perimeters reduce the attack surface and limit the need for excessive access to all ports and protocols or all applications. These individualized “virtual perimeters” encompass only the user, the device, and the application. They are created specific to the session and then closed again when the session is over or if there is a change in the risk level, at which point the device or user needs to perform step-up authentication.

Software-defined perimeter (SDP)

Also, SASE only grants access to the specific application at an application layer. The SDP part of SASE now controls which devices and applications can access distinctive services at an application level. Permitted by a policy granted by the SDP part of SASE, machines can only access particular hosts and services and cannot access network segments and subnets.

Broad network access is eliminated, reducing the attack surface to an absolute minimum. SDP provides a fully encrypted application communication path. The application binding permits only authorized applications, so they can only communicate through the established encrypted tunnels, blocking all other applications from using them.

This creates a dynamic perimeter around the application, including connected users and devices. Furthermore, it offers a narrow access path—reducing the attack surface to an absolute minimum.

Zero Trust SASE architecture requirement: Identity-driven access control

Traditional network solutions provide coarse-grained network segmentation based on someone’s IP address. However, someone’s IP address is not a good security hook and does not provide much information about user identity. SASE enables the creation of microsegmentation based on user-defined controls, allowing a 1-to-1 mapping, unlike with a VLAN, where there is the potential to see everything within that VLAN.

Identity-aware access

SASE provides adaptive, identity-aware, precision access for those seeking more precise access and session control to applications on-premises and in the cloud. Access policies are primarily based on user, device, and application identities.

The policy is applied independently of the user’s physical location or the device’s IP address, except where policy prohibits it. This brings a lot more context to policy application. Therefore, if a bad actor gains access to one segment in the zone, they are prevented from compromising any other network resource.
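To illustrate the shift from IP-based rules to identity- and context-driven decisions, here is a hedged sketch in Python. The attributes, groups, and policy entries are hypothetical, and a real SASE policy engine would evaluate far more signals, continuously.

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    user: str
    groups: set
    device_compliant: bool  # e.g., disk encrypted, EDR agent healthy
    geo: str
    application: str

# Policy is keyed on application identity, not on subnets or IP addresses.
POLICY = {
    "payroll-app": {"group": "finance", "geo": {"IE", "UK"}, "require_compliant": True},
}

def authorize(ctx: RequestContext) -> bool:
    rule = POLICY.get(ctx.application)
    if rule is None:
        return False  # unknown application: default deny
    return (
        rule["group"] in ctx.groups
        and ctx.geo in rule["geo"]
        and (ctx.device_compliant or not rule["require_compliant"])
    )

request = RequestContext("alice", {"finance"}, True, "IE", "payroll-app")
print(authorize(request))  # True only while every contextual check passes
```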

Implications for the Future:

Zero Trust SASE represents the future of network security as organizations increasingly adopt cloud-based applications and embrace remote workforces. With the proliferation of IoT devices, edge computing, and hybrid cloud environments, traditional security models are no longer sufficient to protect critical assets.

Zero Trust SASE provides a holistic and adaptive approach to security, ensuring that organizations can defend against evolving threats and maintain a strong security posture in the digital era.

Summary: Zero Trust SASE

In today’s rapidly evolving digital landscape, where remote work and cloud-based applications have become the norm, traditional security measures are no longer sufficient. Enter Zero Trust Secure Access Service Edge (SASE), a revolutionary approach that combines network security and wide-area networking into a unified framework. In this blog post, we explored the concept of Zero Trust SASE and its implications for the future of cybersecurity.

Section 1: Understanding Zero Trust

Zero Trust is a security framework that operates under the principle of “never trust, always verify.” It assumes no user or device should be inherently trusted, regardless of location or network. Instead, Zero Trust focuses on continuously verifying and validating identity, access, and security parameters before granting any level of access.

Section 2: The Evolution of SASE

Secure Access Service Edge (SASE) represents a convergence of network security and wide-area networking capabilities. It combines security services, such as secure web gateways, firewall-as-a-service, and data loss prevention, with networking functionalities like software-defined wide-area networking (SD-WAN) and cloud-native architecture. SASE aims to provide comprehensive security and networking services in a unified, cloud-delivered model.

Section 3: The Benefits of Zero Trust SASE

a) Enhanced Security: Zero Trust SASE brings a holistic approach to security, ensuring that every user and device is continuously authenticated and authorized. This reduces the risk of unauthorized access and mitigates potential threats.

b) Improved Performance: By leveraging cloud-native architecture and SD-WAN capabilities, Zero Trust SASE optimizes network traffic, reduces latency, and enhances overall performance.

c) Simplified Management: With a unified security and networking framework, organizations can streamline their management processes, reduce complexity, and achieve better visibility and control over their entire network infrastructure.

Section 4: Implementing Zero Trust SASE

a) Comprehensive Assessment: Before adopting Zero Trust SASE, organizations should conduct a thorough assessment of their existing security and networking infrastructure, identify vulnerabilities, and define their security requirements.

b) Architecture Design: Organizations need to design a robust architecture that aligns with their specific needs and integrates Zero Trust principles into their existing systems. This may involve deploying virtualized security functions, adopting SD-WAN technologies, and leveraging cloud services.

c) Continuous Monitoring and Adaptation: Zero Trust SASE is an ongoing process that requires continuous monitoring, analysis, and adaptation to address emerging threats and evolving business needs. Regular security audits and updates are crucial to maintaining a solid security posture.

Conclusion:

Zero Trust SASE represents a paradigm shift in cybersecurity, providing a comprehensive and unified approach to secure access and network management. By embracing the principles of Zero Trust and leveraging the capabilities of SASE, organizations can enhance their security, improve performance, and simplify their network infrastructure. As the digital landscape continues to evolve, adopting Zero Trust SASE is not just an option—it’s a necessity to safeguard the future of our interconnected world.


Microservices Observability

Monitoring Microservices

In today's rapidly evolving software development landscape, microservices architecture has become popular for building scalable and resilient applications. However, as the complexity of these systems increases, so does the need for effective observability. In this blog post, we will explore the concept of microservices observability and why it is crucial to ensuring the stability and performance of modern software systems.

Microservices observability refers to the ability to gain insights into the behavior and performance of individual microservices, as well as the entire system as a whole. It involves collecting, analyzing, and visualizing data from various sources, such as logs, metrics, traces, and events, to comprehensively understand the system's health and performance.


Highlights: Monitoring Microservices

 

The Role of Microservices Monitoring

Microservices monitoring is suitable for known patterns that can be automated, while microservices observability is suitable for detecting unknown and creative failures. Microservices monitoring is a critical part of successfully managing a microservices architecture. It involves tracking each microservice’s performance to ensure there are no bottlenecks in the system and that the microservices are running optimally.

Components of Microservices Monitoring

Additionally, microservices monitoring can detect anomalies and provide insights into the microservices architecture. There are several critical components of microservices monitoring, including:

Metrics: This includes tracking metrics such as response time, throughput, and error rate. This information can be used to identify performance issues or bottlenecks.

Logging: This allows administrators to track requests, errors, and exceptions, providing deeper insight into the performance of the microservices architecture.

Tracing: Tracing provides a timeline of events within the system. This can be used to identify the source of issues or to track down errors.

Alerts: Alerts notify administrators when certain conditions are met. For example, administrators can be alerted if a service is down or performance is degrading.

Finally, it is essential to note that microservices monitoring is not just limited to tracking performance. It can also detect security vulnerabilities and provide insights into the architecture.

By leveraging microservices monitoring, organizations can ensure that their microservices architecture runs smoothly and that any issues are quickly identified and resolved. This can help ensure the organization’s applications remain reliable and secure.

Related: For pre-information, you will find the following posts helpful:

  1. Observability vs Monitoring
  2. Chaos Engineering Kubernetes
  3. Distributed System Observability
  4. ICMPv6

 



Microservices Monitoring

Key Microservices Observability Discussion Points:


  • The challenges with traditional monitoring.

  • Tools of the past, logs and metrics.

  • Why we need Observability.

  • The use of Distributed Tracing.

  • Observability pillars.

 

Back to Basics: Containers and Microservices

The challenges

Teams increasingly adopt new technologies as companies transform and modernize applications to leverage container- and microservices development. IT infrastructure monitoring has always been complex but is even more challenging with the changing software architecture and the new technology needed to support it. In addition, many of your existing monitoring tools may not fully support modern applications and frameworks, especially when you throw in serverless and hybrid IT. All of these create a considerable gap in the management of application health and performance.

Containers

Containers can wrap up an application into its isolated package—everything the application needs to run successfully as a process is executed within the container. Kubernetes is an open-source container management tool that delivers an abstraction layer over the container to manage the container fleets, leveraging REST APIs.

Container-based technologies affect infrastructure management services, like backup, patching, security, high availability, disaster recovery, etc. Therefore, we must establish other monitoring and management technologies for containerization and microservices architecture. Prometheus is an example of a container monitoring tool that comes up as a go-to open-source monitoring and alerting solution.

 

Diagram: Docker Container. Source: Docker.

Microservices

Microservices are an architectural approach to software development that enables teams to create, deploy, and manage applications quickly. Microservices allow greater flexibility, scalability, and maintainability than traditional monolithic applications.

The microservices approach is based on building independent services that communicate with each other over an API. Each service is responsible for a specific business capability, so a single application can comprise many different services. This makes it easy to scale individual components and replace them with newer versions without affecting the rest of the application.

Diagram: Microservices. The source is AVI networks

 

The Benefits of Microservices Observability:

Implementing a robust observability strategy brings several benefits to a microservices architecture:

1. Enhanced Debugging and Troubleshooting:

Microservices observability gives developers the tools and insights to identify and resolve issues quickly. By analyzing logs, metrics, and traces, teams can pinpoint the root causes of failures, reducing mean time to resolution (MTTR) and minimizing the impact on end-users.

2. Improved Performance and Scalability:

Observability enables teams to monitor the performance of individual microservices and identify areas for optimization. By analyzing metrics and tracing requests, developers can fine-tune service configurations, scale services appropriately, and ensure efficient resource utilization.

3. Proactive Issue Detection:

With comprehensive observability, teams can detect potential issues before they escalate into critical problems. By setting up alerts and monitoring key metrics, teams can proactively identify anomalies, performance degradation, or security threats, allowing for timely intervention and prevention of system-wide failures.

 

Video: Microservices vs. Observability

We will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are unknown and unpredictable. Therefore, a pre-defined monitoring dashboard will have difficulty keeping up with the rate of change and unknown failure modes. For this, we should look to have the practice of observability for software and monitoring for infrastructure.

Observability vs Monitoring

 

Microservices Monitoring and Observability

Containers, cloud platforms, scalable microservices, and the complexity of monitoring distributed systems have highlighted significant gaps in the microservices monitoring space, which has been static for some time. As a result, you must fully understand performance across the entire distributed and complex stack, including distributed traces across all microservices. To do this, you need a solution that can collect, process, and store the data used for monitoring. The data needs to cover several domains and then be combined and centralized for analysis.

This can be an all-in-one solution that bundles different components for application observability. A bundled solution would be, for example, an Application Performance Monitoring (APM) suite consisting of application performance monitoring tools, or a single platform such as Prometheus, which lives in a world of metrics only.

Application Performance Monitoring

Application performance monitoring typically involves tracking the response time of an application, the number of requests it can handle, and the amount of memory or other system resources it uses. This data can be used to identify any issues with application performance or scalability. Organizations can take corrective action by monitoring application performance to improve the user experience and ensure their applications run as efficiently as possible.

Application performance monitoring also helps organizations better understand their users by providing insight into how applications are used and how well they are performing. In addition, this data can be used to identify trends and patterns in user behavior, helping organizations decide how to optimize their applications for better user engagement and experience.

Diagram: Observability: Microservices development.

 

The Need for Microservices Observability

Today’s challenges

1. Obfuscation

When creating microservices, your application becomes more distributed, the coherence of failures decreases, and we live in a world of unpredictable failure modes. Also, the distance between cause and effect increases. For example, an outage at your cloud provider’s blob storage could cause huge cascading latency for everyone. In today’s environment, we have new cascading problems.

2. Inconsistency and highly independent

Distributed applications might be reliable, but the state of individual components can be much less consistent than in monolithic or non-distributed applications, which have elementary and well-known failure modes. In addition, each element of a distributed application is designed to be highly independent, and each component can be affected by different upstream and downstream components.

3. Decentralization

How do you look for service failures when a thousand copies of that service may run on hundreds of hosts? How do you correlate those failures so you can make sense of what’s going on?

 

Tools of the past: Logs and metrics

Traditionally, microservices monitoring has boiled down to two types of telemetry data: log data and time-series statistics. The time-series data is also known as metrics, because to make sense of a metric, you need to view it over a period of time.

However, as we broke the software into tiny, independently operated services and distributed those fragmented services, the logs and metrics we captured told you very little of what was happening to the critical path.

Understanding the critical path is most important, as this is what the customer experiences. Looking at a single stack trace or watching CPU and memory utilization on predefined graphs and dashboards is insufficient. As software scales, not just in depth but in breadth, telemetry data like logs and metrics alone don’t provide the clarity needed to quickly identify production problems.

Diagram: Monitoring Observability. Source: Bravengeek.

 

Introduction to Microservices Monitoring Categories

We have several different categories to consider. For microservices monitoring and observability, you must first address your infrastructure, such as your network devices, hypervisors, servers, and storage. Then, you should manage your application performance and health.

Then, you need to manage network quality and optimize it where possible. For each category, you must consider white box and black box monitoring and potentially introduce new tools such as Artificial Intelligence for IT operations (AIOps).

Preventive approach to microservices monitoring: AI and ML

When choosing microservices observability software, consider a preventive approach rather than the reactive one better suited to traditional environments. Preventive approaches to monitoring can use historical health and performance telemetry as an early warning, with the use of Artificial Intelligence (AI) and Machine Learning (ML) techniques.

White box monitoring offers more detail than black box monitoring, which tells you something is broken without telling you why. White box monitoring details the why, but you must ensure the data is easily consumable.

 

Diagram: White box monitoring and black box monitoring.

 

With predictable failures and known failure modes, black box microservices monitoring can help. Still, with the creative ways that applications and systems fail today, we need to examine the details of white-box microservices monitoring. Complex applications fail in unpredictable ways, often termed black holes.

Distributing your software presents new types of failure, and these systems can fail in creative ways and become more challenging to pin down. The service you’re responsible for may be receiving malformed or unexpected data from a source you don’t control because a team manages that service halfway across the globe.

White box monitoring: Exploring failures

White box monitoring relies on a different approach to black box monitoring. It uses a technique called Instrumentation that exposes details about the system’s internals to help you explore these black holes and better understand the creative mode in which applications fail today.

 

Microservices Observability: Techniques

Collection, storage, and analytics

Regardless of what you are monitoring, the infrastructure or the application service, monitoring requires three inputs, more than likely across three domains. We require:

    1. Data collection
    2. Storage, and 
    3. Analysis.

We need to look at metrics, traces, and logs for these three domains or, let’s say, components. Of these three, trace data is the most beneficial and an excellent way to isolate performance anomalies in distributed applications. Trace data falls under the bracket of distributed tracing, which enables flexible consumption of captured traces.

What you need to do: The four golden signals 

First, you must establish a baseline comprising the four golden signals – latency, traffic, errors, and saturation. The golden signals are good indicators of health and performance and apply to most components of your environment, such as the infrastructure, applications, microservices, and orchestration systems.
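As a hedged example of instrumenting the four golden signals, the following uses the Prometheus Python client (assuming the prometheus_client package is installed). The metric names and the simulated handler are illustrative only.

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS  = Counter("http_requests_total", "Traffic: total requests", ["path"])
ERRORS    = Counter("http_errors_total", "Errors: failed requests", ["path"])
LATENCY   = Histogram("http_request_seconds", "Latency: request duration", ["path"])
IN_FLIGHT = Gauge("http_in_flight_requests", "Saturation proxy: concurrent requests")

def handle(path: str) -> None:
    REQUESTS.labels(path).inc()
    IN_FLIGHT.inc()
    with LATENCY.labels(path).time():          # observe request duration
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:
            ERRORS.labels(path).inc()
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle("/api/items")
```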

 

Diagram: Application performance monitoring tools.

 

A quick recommendation: Alerts and SLIs

I recommend that this baseline be automated, along with automated alerts on deviations from the baseline. The problem is that you may alert on too much if you collect too much. Service Level Indicators (SLIs) can help you find what is better to alert on and what matters to the user experience.

A key point: Distributed tracing

Navigate real-time alerts

Leveraging distributed tracing for directed troubleshooting provides users with distributed tracing capabilities to dig deep when a performance-impacting event occurs. No matter where an issue arises in your environment, you can navigate from real-time alerts directly to application traces and correlate performance trends between infrastructure, Kubernetes, and your microservices. Distributed tracing is essential to monitoring, debugging, and optimizing distributed software architecture, such as microservices–especially in dynamic microservices architectures.

 

The Effect on Microservices: Microservices Monitoring

When considering a microservices application, many consider these microservices to be independent, but this is nothing more than an illusion. These microservices are highly interdependent, and a failure or slowdown in one service propagates across the stack of microservices.

A typical architecture may include a backend service, a front-end service, or maybe even a docker-compose file. So, at a minimum, several containers must communicate to carry out operations. 

For a simple microservices architecture, we would have a minimal distributed application setup, where a microservice serving static content sits at the front end while the heavy lifting is done by another service.

Monolith and microservices monitoring.

We have more components to monitor than we had in the monolithic world. With the traditional monolith, there were only two components to monitor: the application and the hosts.

Compared to the cloud-native world, we have containerized applications orchestrated by Kubernetes with multiple components requiring monitoring. We have, for example, the hosts, the Kubernetes platform itself, the Docker containers, and the containerized microservices.

Distributed systems have different demands.

Today, distributed systems are the norm, placing different demands on your infrastructure than the classic, three-tier application. Pinpointing issues in a microservices environment is more challenging than with a monolithic one, as requests traverse both between different layers of the stack and across multiple services. 

The Challenges: Microservices

The things we love about microservices are independence and idempotence, which make them difficult to understand, especially when things go wrong. As a result, these systems are often referred to as deep systems, not due to their width but their complexity.

We can no longer monitor their application by using a script to access the application over the network every few seconds, report any failures, or use a custom script to check the operating system to understand when a disk is running out of space.

Saturation is an important signal, but it’s just one of them. It quickly becomes unrealistic for a single human, or even a group, to understand enough of the services in the critical path of even a single request and continue maintaining it.

Node Affinity or Taints

Microservices-based applications are typically deployed on containers that are dynamic and transient. This leaves an unpredictable environment where the pods get deployed and run unless specific intent is expressed using affinity or taints. However, there can still be unpredictability with pod placement. The unpredictable nature of pod deployment and depth of configuration can lead to complex troubleshooting.
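For context, this is roughly how that placement intent is expressed in a pod spec. It is shown here as a Python dict for illustration, mirroring the Kubernetes affinity and tolerations fields; the label keys, values, and taint name are hypothetical.

```python
# Fragment of a pod spec expressing placement intent, written as a Python dict.
pod_spec = {
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {"matchExpressions": [
                        # Hypothetical node label: only schedule onto SSD-backed nodes.
                        {"key": "disktype", "operator": "In", "values": ["ssd"]}
                    ]}
                ]
            }
        }
    },
    "tolerations": [
        # Allows scheduling onto nodes tainted for a dedicated workload class.
        {"key": "dedicated", "operator": "Equal", "value": "observability", "effect": "NoSchedule"}
    ],
}
```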

 

The Beginnings of Distributed Tracing

Open Tracing

So, when you are ready to get started with distributed tracing, you will come across OpenTracing. OpenTracing is a set of standards that are exposed as frameworks. So, it’s a vendor-neutral API and Instrumentation for distributed tracing. 

OpenTracing does not give you the library itself; rather, it is a set of rules and conventions that other libraries can adopt, so you can use and swap different libraries and expect the same behavior.

Diagram: Distributed Tracing Example. Source is Simform

 

Microservices architecture example

Let’s examine an example with the Requests library for Python. Requests is an elegant and simple HTTP library for Python. The library talks HTTP and relies on specific standards; the standard here is HTTP. So, in Python, you make a “requests.get” call.

The underlying library implementation then performs a formal HTTP request using the GET method. The HTTP standards and specs lay the ground rules for what is expected from the client and the server.

OpenTracing

So, the OpenTracing project does the same thing. It sets out the ground rules for distributed tracing, regardless of the implementation and the language used. It has libraries available in several languages: Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, and C++.

For example, the OpenTracing API for Python gives you an implementation of OpenTracing to be used with Python. It is the set of standards for tracing with Python, and it provides examples of what the Instrumentation should look like and common ways to start a trace.
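Here is a small sketch of what that instrumentation can look like with the opentracing Python package. Unless a concrete tracer implementation is registered, global_tracer() returns a no-op tracer, but the instrumentation code is the same either way; the operation and tag names below are made up for illustration.

```python
import opentracing

tracer = opentracing.global_tracer()  # no-op unless a real tracer is registered

def fetch_user(user_id: int) -> None:
    with tracer.start_active_span("fetch-user") as scope:
        scope.span.set_tag("user.id", user_id)
        with tracer.start_active_span("db-query"):  # automatically a child span
            pass  # the real database call would go here
        scope.span.log_kv({"event": "user_fetched"})

fetch_user(42)
```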

 

Video: Distributed Tracing

We generally have two types of telemetry data: log data and time-series statistics. The time-series data is also known as metrics in a microservices environment. Metrics, for example, allow you to get an aggregate understanding of what’s happening to all instances of a given service.

Logs, on the other hand, provide highly fine-grained detail on a given service but have no built-in way to present that detail in the context of a request. Because of how distributed systems fail, you can’t use metrics and logs alone to discover and address all of your problems. We need a third piece of the puzzle: distributed tracing.

 

Distributed Tracing Explained

 

Connect the dots with distributed tracing

This is the big difference between tracing and logging. Tracing allows you to connect the dots from one end of the application to the other. If a request starts on the front end and you want to see how it behaves on the backend, tracing gives you that view: a trace and its connected child spans represent the whole journey.

Visual Representation with Jaeger

For the visual representation, you can use Jaeger. Jaeger is an open-source, end-to-end distributed tracing system that allows you to monitor and troubleshoot transactions in complex distributed systems.

It gives you a dashboard where you can interact with and search for traces. Jaeger addresses problems such as distributed transaction monitoring, performance and latency optimization, root cause analysis, service dependency analysis, and distributed context propagation. Jaeger has client libraries for different languages.

So, for example, if you are using Python, there will be client library features for Python. 
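A minimal initialization sketch with the jaeger_client Python library might look like the following; the service name and sampler settings are assumptions for a lab, not production guidance:

    from jaeger_client import Config

    config = Config(
        config={
            "sampler": {"type": "const", "param": 1},  # sample every trace (lab only)
            "logging": True,
        },
        service_name="checkout-service",
    )
    tracer = config.initialize_tracer()

    with tracer.start_active_span("place-order") as scope:
        scope.span.set_tag("order.id", "12345")

    tracer.close()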

OpenTelemetry

We also have OpenTelemetry, which is similar. It is described as an observability framework for cloud-native software and is available across several languages. It covers traces, metrics, and logs, so it does more than OpenTracing.
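A rough equivalent with the OpenTelemetry Python SDK, assuming the console exporter just for demonstration (span names are placeholders):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Wire up a tracer provider that prints finished spans to stdout.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("front-end-request"):
        with tracer.start_as_current_span("backend-call"):
            pass  # the child span nests automatically under the parent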

 

distributed tracing
Diagram: Distributed tracing and scalable microservices.

 

Introduction to Microservices Observability

Observability means that the internal states of a system can be inferred from its external outputs. The tools that make up an Observability system therefore help us understand the relationships between causes and effects in distributed systems.

The term Observability is borrowed from control theory. It suggests a holistic, data-centric view of microservices monitoring that adds exploration capabilities and the ability to identify unknown failures to the more traditional anomaly detection and notification mechanisms.

Goal: The ultimate goals of Observability are to:

  • Improve baseline performance
  • Restore baseline performance (after a regression)

By improving the baseline, you improve the user experience; for user-facing applications, performance often means request latency. Regressions in performance, including application outages, can result in a loss of revenue and negatively impact the brand. How much regression is acceptable comes down to user expectations: what is acceptable, and what is in the SLA?

Chaos engineering

Chaos Engineering tests help you understand your limits and discover new places where your system and applications can break. Chaos Engineering helps you get to know your system by introducing controlled experiments, which is invaluable when debugging microservices.

 

 Video: Chaos Engineering

This educational tutorial begins with guidance on how applications have changed from the monolithic style to the microservices-based approach and how this has affected failures. I will then introduce how this can be addressed by knowing exactly how your application and infrastructure perform under stress and where their breaking points are.

Chaos Engineering: How to Start A Project

 

Microservices Observability Pillars

To fully understand a system’s internal state, we need tools, some of which are old and some of which are new. These tools are known as the pillars of Observability: logs, metrics, and distributed tracing. They must be combined to understand internal behavior and fully satisfy the definition of observability.

Data must be collected continuously across all Observability domains to fully understand the symptoms and causes.

A key point: Massive amount of data

Remember that instrumentation potentially generates massive amounts of data, which can be challenging to store and analyze. You must collect, store, and analyze data across the metrics, traces, and logs domains, and then alert on what matters most in those domains, not just when an arbitrary threshold is met.

The role of metrics

A metric is familiar to most: it comprises a value, a timestamp, and metadata. Metrics are collections of statistics that need to be analyzed over time; a single instance of a metric is of limited value. Examples include request rate, average duration, and queue size. These values are usually captured as time series so that operators can see and understand changes to metrics over time.

Adding labels to metrics

To better understand metrics, we can add labels as key-value pairs. Labels add context to a data point; each label is a key-value pair indexed along with the metric as part of the ingestion process. With labels, metrics can be broken down into sub-metrics.

As we enter the world of labels and tags for metrics, we need to understand the effect this has on cardinality. Each indexed label value adds another time series, which comes at a storage and processing cost. Cardinality is how we measure the impact of labels on a metric store.
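For example, with the prometheus_client Python library (the metric and label names are purely illustrative), each distinct label combination becomes its own time series, which is exactly where cardinality costs appear:

    from prometheus_client import Counter

    # One metric, broken down into sub-metrics by its labels.
    REQUESTS = Counter(
        "http_requests_total",
        "Total HTTP requests handled",
        ["method", "endpoint"],
    )

    REQUESTS.labels(method="GET", endpoint="/checkout").inc()
    REQUESTS.labels(method="POST", endpoint="/checkout").inc()
    # Avoid labels with unbounded values (user IDs, request IDs): every new
    # value creates another time series that must be stored and indexed.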

observability
Diagram: Observability: The issue with metrics.

 

Aggregated metrics

The issue I continue to see is that metrics are typically aggregated only once a minute, or at best six to twelve times per minute. Metrics, however, should be aggregated and visualized within at most one minute, and ideally faster. Key questions are: what is the window across which values are aggregated, and how are the windows from different sources aligned?

A key point: The issues of Cardinality

Aggregated metrics give you an aggregate understanding of what’s happening to all instances of a given service, and even let you narrow your query to specific groups of services, but they fail to account for unbounded cardinality. Because of the problems “high cardinality” causes in a time-series storage engine, it is recommended to use labels rather than hierarchical naming for metrics.

 

Prometheus Monitoring and Prometheus Metric Types

Examples: Push and Pull

To collect metrics, you need either a push or a pull approach. A push agent transmits data upstream, usually on a scheduled basis, while a pull agent expects to be polled. Prometheus, with its several metric types, uses a pull approach on the server side, which tends to fit better into larger environments.

Prometheus does not use the term agent; instead, it has what are known as exporters. Exporters allow the Prometheus server to pull metrics from software that cannot be instrumented directly with the Prometheus client libraries.
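A minimal pull-style sketch with the Prometheus Python client library: the process exposes an HTTP endpoint, and the Prometheus server scrapes it on its own schedule (the port and metric here are assumptions):

    import random
    import time

    from prometheus_client import Gauge, start_http_server

    QUEUE_DEPTH = Gauge("work_queue_depth", "Items waiting in the work queue")

    if __name__ == "__main__":
        start_http_server(8000)  # metrics served at http://localhost:8000/metrics
        while True:
            QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real measurement
            time.sleep(5)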

Prometheus and Kubernetes

Prometheus is an open-source monitoring platform that originated at SoundCloud in 2012 and is now widely used to monitor Kubernetes. Its capabilities include metric collection, storage, data analysis, and visualization, and it is commonly paired with Grafana for the visualizations.

Storing Metrics

You can store metrics, which are time-series data, in a general-purpose relational database, but they are better kept in a repository optimized for storing and retrieving time-series data. There are several time-series storage options, such as Atlas, InfluxDB, and Prometheus. Prometheus is the one that stands out, but keep in mind that, as far as I’m aware, there is no commercial support and only limited professional services for Prometheus.

prometheus and grafana
Diagram: Prometheus and Grafana.

The Role of Logs

Then we have logs, which can be highly detailed. Unlike metrics, which have a fairly uniform format, logs can contain almost anything; however, logs do tell you why something is broken. Logs capture activity that can be printed to the screen or sent to a backend to be centrally stored and viewed.

There is very little standard structure to logs apart from a timestamp indicating when the event occurred. There is minimal log schema, and log structure depends on how the application uses logging and how developers create logs.

Emitting Logs

Logs are emitted by almost every entity: the basic infrastructure, network and storage devices, servers and compute nodes, operating systems, and application software. The variety of log sources, plus the several tools involved in transport and interpretation, make log collection a complex task. You should also assume that a large amount of log data will need to be stored.

Search engines such as Google have developed several techniques for searching extensive datasets using arbitrary queries, which have proved very efficient, and all of which can be applied to log data.

Logstash, Beats, and FluentD

Logstash is a cloud-scale ingestion tool and is part of the Elastic suite. However, there have been concerns about the performance and scalability of Logstash, which brings us to its lightweight counterpart, Beats. If you don’t need the sophisticated data manipulation and filtering of Logstash, you can use Beats. FluentD provides a unified logging layer: a way to aggregate logs from many different sources and distribute them to many destinations, with the ability to transform data along the way.

Storing Logs

Structured data such as logs and events is made of key-value pairs, any of which may be searched. This leads us to repositories called nonrelational or NoSQL databases, because storing logs is a different storage problem from storing metrics. Examples of key-value databases include Memcached and Redis.

However, simple key-value stores are not a good choice for log storage because indexing and searching them is inefficient. The ELK stack combines an indexing and searching engine (Elasticsearch), a collector (Logstash), and a visualization tool (Kibana), and has become the dominant storage mechanism for log and event data.
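As a rough sketch with the Elasticsearch Python client (8.x style calls; the index name, fields, and local URL are assumptions), a structured log event can be indexed and then searched by any of its key-value pairs:

    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index one structured log event.
    es.index(
        index="app-logs",
        document={
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": "checkout",
            "level": "error",
            "message": "payment provider timeout",
        },
    )

    # Search for all error-level events from that service.
    result = es.search(
        index="app-logs",
        query={"bool": {"must": [{"match": {"level": "error"}},
                                 {"match": {"service": "checkout"}}]}},
    )
    print(result["hits"]["total"])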

A key point: Analyze logs with AI

So, once you store the logs, they need to be analyzed and viewed. Here, you could, for example, use Splunk. Its data analysis capabilities range from security to AI for IT operations (AIOps). Kibana, which is part of the Elastic Stack, can also be used.

 

Introducing Distributed Tracing

Distributed tracing is used in microservices and other distributed applications because a single operation touches many services. It is a form of correlated logging that helps you gain visibility into the operation of a distributed software system: distributed tracing consists of collecting request data from the application and then analyzing and visualizing that data as traces.

Tracing data, in the form of spans, must be collected from the application, transmitted, and stored so that complete requests can be reconstructed. This is useful for performance profiling, debugging in production, and root cause analysis of failures and other incidents.

A key point: The value of distributed tracing

Distributed tracing allows you to understand what a particular service is doing as part of the whole, providing visibility into the operation of your microservice architecture. The trace data you generate can show the overall shape of your distributed system and the performance of individual services inside a single request.

distributed tracing
Diagram: Distributed tracing.

 

Distributed tracing components 

  1. What is a trace?

Consider your software in terms of requests. Each component of your software stack works in response to a request or a remote procedure call from another service. A trace encapsulates a single operation within the application, end to end, and is represented as a series of spans.

Each traceable unit of work within the operation generates a span. There are two ways to get trace data: it can be generated through instrumentation of your service processes or by transforming existing telemetry data into trace data.

  2. Introducing a span

We call each service’s work a span, as in the span of time it takes for the work to occur. Spans can be annotated with additional information, such as attributes, tags, or logs; this combination of metadata and events creates rich spans that unlock insights into the behavior of your service. The span data produced by each service is then forwarded to some external process, where it can be aggregated into a trace, analyzed, and stored for further insights.
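To stitch spans from different services into one trace, the span context has to travel with the request. The following is a hedged sketch using the OpenTracing inject/extract API (the header carrier and operation names are illustrative):

    import opentracing
    from opentracing.propagation import Format

    tracer = opentracing.global_tracer()

    # Service A: start a span and inject its context into outgoing HTTP headers.
    with tracer.start_active_span("frontend-request") as scope:
        headers = {}
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
        # headers now carry the trace and span IDs and travel with the downstream call

    # Service B: extract the context from the incoming headers and continue the trace.
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, headers)
    with tracer.start_active_span("backend-work", child_of=parent_ctx) as scope:
        scope.span.set_tag("component", "order-service")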

Summary: Monitoring Microservices

Monitoring microservices has become a critical aspect of maintaining the performance and reliability of modern applications. With the increasing adoption of microservices architecture, understanding how to monitor and manage these distributed systems effectively has become indispensable. In this blog post, we explored the key considerations and best practices for monitoring microservices.

Section 1: The Need for Comprehensive Monitoring

Microservices are highly distributed and decentralized, which poses unique challenges regarding monitoring. Traditional monolithic applications are easier to monitor, but microservices require a different approach. Understanding the need for comprehensive monitoring is the first step toward ensuring the reliability and performance of your microservices-based applications.

Section 2: Choosing the Right Monitoring Tools

This section will delve into the various monitoring tools available for monitoring microservices. From open-source solutions to commercial platforms, there is a wide range of options. We will discuss the critical criteria for selecting a monitoring tool: scalability, real-time visibility, alerting capabilities, and integration with existing systems.

Section 3: Defining Relevant Metrics

To effectively monitor microservices, it is essential to define relevant metrics that provide insights into the health and performance of individual services as well as the overall system. In this section, we will explore the key metrics to monitor, including response time, error rates, throughput, resource utilization, and latency. We will also discuss the importance of setting appropriate thresholds for these metrics to trigger timely alerts.

Section 4: Implementing Distributed Tracing

Distributed tracing plays a crucial role in understanding the flow of requests across microservices. By instrumenting your services with distributed tracing, you can gain visibility into the entire request journey and identify bottlenecks or performance issues. We will explore the benefits of distributed tracing and discuss popular tracing frameworks like Jaeger and Zipkin.

Section 5: Automating Monitoring and Alerting

Keeping up with the dynamic nature of microservices requires automation. This section will discuss the importance of automated monitoring and alerting processes. From automatically discovering new services to scaling monitoring infrastructure, automation plays a vital role in ensuring the effectiveness of your monitoring strategy.

Conclusion:

Monitoring microservices is a complex task, but with the right tools, metrics, and automation in place, it becomes manageable. By understanding the unique challenges of monitoring distributed systems, choosing appropriate monitoring tools, defining relevant metrics, implementing distributed tracing, and automating monitoring processes, you can stay ahead of potential issues and ensure optimal performance and reliability for your microservices-based applications.

Cisco ACI

Cisco ACI | ACI Infrastructure

Cisco ACI | ACI Infrastructure

Cisco ACI stands for Cisco Application Centric Infrastructure and is based on a spine-leaf architecture. It is a software-defined networking solution that provides a holistic approach to network management. ACI offers a centralized, policy-driven framework for managing and automating network infrastructure.

One of the critical features of ACI Cisco is its ability to create a virtualized network environment using the Application Network Profiles (ANPs) concept. ANPs allow administrators to define and manage network policies based on the requirements of specific applications.

This simplifies the deployment and management of applications, as network policies can be easily applied across the entire infrastructure.


Highlights: ACI Cisco

Example: ACI Networks

ACI Networks also introduces the concept of the Application Policy Infrastructure Controller (APIC), which acts as the central point of control for the network. The APIC allows administrators to define and enforce network policies, monitor performance, and troubleshoot issues.

In addition to network virtualization and policy management, ACI Cisco offers a range of other features. These include integrated security, intelligent workload placement, and seamless integration with other Cisco products and technologies.

COOP Protocol in ACI

The spine proxy receives mapping information (location and identity) via the Council of Oracle Protocol (COOP). Using Zero Message Queue (ZMQ), leaf switches forward endpoint address information to spine switches. As part of COOP, the spine nodes maintain a consistent copy of the endpoint address and location information and maintain the distributed hash table (DHT) database for mapping endpoint identity to location.

Micro-segmentation

Integrated security is achieved through micro-segmentation, which allows administrators to define fine-grained security policies at the application level. This helps to prevent the lateral movement of threats within the network and provides better protection against attacks.

Intelligent workload placement ensures that applications are placed in the most appropriate locations within the network based on their specific requirements. This improves application performance and resource utilization.

Related: For pre-information, you may find the following helpful:

  1. Data Center Security
  2. Data Center Topologies
  3. Dropped Packet Test
  4. DMVPN
  5. Stateful Inspection Firewall
  6. Cisco ACI Components



ACI Network



Key Cisco ACI Blog Discussion Points:


  • Operates over a Leaf and Spine design.

  • New ACI network components, e.g., Bridge Domains and Contracts.

  • Intelligence at the edge.

  • Overcomes many DC challenges.

  • VXLAN transport network.

  • Extend with Multi-Pod and Multi-Site.

 

ACI Components

Several key components make up the Cisco ACI architecture. By understanding these components, network administrators and IT professionals can harness the power of ACI to optimize their data center operations.

Cisco ACI Components

Main ACI Components

Cisco Application Centric Infrastructure (ACI) 

  • Application Policy Infrastructure Controller

  • Spine Switches

  • Leaf Switches

  • Application Network Profiles

  • Endpoint Groups 

1. Application Policy Infrastructure Controller (APIC):

The cornerstone of the Cisco ACI architecture is the Application Policy Infrastructure Controller (APIC). APIC is the central management and policy engine for the entire ACI fabric. It provides a single point of control, enabling administrators to define and enforce policies that govern the behavior of applications and services within the data center. APIC offers a user-friendly interface for policy configuration, monitoring, and troubleshooting, making it an essential component for managing the ACI fabric.
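Because the APIC exposes the fabric through a REST API, simple reads are possible from almost any language. The following is a minimal sketch using Python requests; the APIC address, credentials, and certificate handling are placeholders and not production guidance:

    import requests

    APIC = "https://apic.example.com"  # placeholder APIC address
    session = requests.Session()

    # Authenticate; the returned cookie is kept on the session object.
    login = session.post(
        f"{APIC}/api/aaaLogin.json",
        json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}},
        verify=False,  # lab only; use a trusted certificate in production
    )
    login.raise_for_status()

    # Class-level read of all tenants in the fabric.
    tenants = session.get(f"{APIC}/api/class/fvTenant.json", verify=False).json()
    for obj in tenants["imdata"]:
        print(obj["fvTenant"]["attributes"]["name"])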

2. Spine Switches:

Spine switches form the backbone of the ACI fabric. These high-performance switches provide connectivity between leaf switches and facilitate east-west traffic within the fabric. Spine switches operate at Layer 3 and use routing protocols to distribute traffic across the fabric efficiently. With the ability to handle massive amounts of data, spine switches ensure high-speed connectivity and optimal performance in the ACI environment.

3. Leaf Switches:

Leaf switches act as the access layer switches in the ACI fabric. They connect directly to the endpoints, such as servers, storage devices, and other network devices, and serve as the entry and exit points for traffic entering and leaving the fabric. Leaf switches provide Layer 2 connectivity for endpoint devices and Layer 3 connectivity for communication between endpoints within the fabric. They also play a crucial role in implementing policy enforcement and forwarding traffic based on predefined policies.

 

Lab Guide: IS-IS

Example: IS-IS

Under the covers, Cisco ACI runs IS-IS. The IS-IS routing protocol is an Interior Gateway Protocol (IGP) that enables routers within a network to exchange routing information and make informed decisions on the best path to forward packets. It is a Layer 3 routing protocol that runs directly over the data link layer (Layer 2) rather than over IP.

ISIS organizes routers into logical groups called areas, simplifying network management and improving scalability. It allows for hierarchical routing, reducing the overhead of exchanging routing information across large networks.

Note:

Below, we have four routers. R1 and R2 are in area 12, and R3 and R4 are in area 34. R1 and R3 are intra-area routers, so they will be configured as Level 1 routers. R2 and R4 form the backbone, so they will be configured as Level 1-2 routers.

Network administrators need to configure IS-IS parameters on each participating router to implement IS-IS. These parameters include the router’s IS-IS system ID, area assignments, and interface settings. IS-IS exchanges routing information using its own protocol data units (PDUs), carried directly over the data link layer rather than over IP or a separate transport protocol.

Routing Protocol
Diagram: Routing Protocol. ISIS.

4. Application Network Profiles (ANPs):

Application Network Profiles (ANPs) are a key component of the Cisco ACI policy model. ANPs define the policies and configurations required for specific applications or application groups. ANPs encapsulate all the necessary information, including network connectivity, quality of service (QoS) requirements, security policies, and service chaining.

By associating endpoints with ANPs, administrators can easily manage and enforce consistent policies across the ACI fabric, simplifying application deployment and ensuring compliance.

5. Endpoint Groups (EPGs):

Endpoint Groups (EPGs) are logical containers that group endpoints with similar network requirements. EPGs provide a way to define and enforce policies at a granular level—endpoints within an EPG share standard policies, such as security, QoS, and network connectivity.

This grouping allows administrators to apply policies consistently to specific endpoints, regardless of their physical location within the fabric. EPGs enable seamless application mobility and simplify policy enforcement within the ACI environment.

 

A specific ACI Cisco architecture

In some of the lab guides in this blog post, we are using hardware from a rack rental at Cloudmylabs. Remember that the ACI fabric is built on the Nexus 9000 product family.

The Cisco Nexus 9000 Series Switches are designed to meet the increasing demands of modern networks. With high-performance capabilities, these switches deliver exceptional speeds and low latency, ensuring smooth and uninterrupted data flow. They support high-density 10/25/40/100 Gigabit Ethernet interfaces, allowing businesses to scale and adapt to growing network requirements.

Enhanced Security

The Cisco Nexus 9000 Series Switches offer comprehensive security features to protect networks from evolving threats. They leverage Cisco TrustSec technology, which provides secure access control, segmentation, and policy enforcement. With integrated security features, businesses can mitigate risks and safeguard critical data, ensuring peace of mind.

Application Performance Optimization:

To meet the demands of modern applications, the Cisco Nexus 9000 Series Switches are equipped with advanced features that optimize application performance. These switches support Cisco Tetration Analytics, which provides deep insights into application behavior, enabling businesses to enhance performance, troubleshoot issues, and improve efficiency.

Diagram: The source is Cloudmylabs.

Cisco ACI Simulator

Below is a screenshot from the Cisco ACI simulator. At the start, you will be asked for fabric details. Remember that once you set the out-of-band management address for the APIC, you need to change the port group settings on the ESXi VM network. If you don’t change “Promiscuous mode, MAC address changes, and Forged Transmits,” you cannot access the UI from your desktop.

ACI fabric Details
Diagram: Cisco ACI fabric Details

Back to basics: Leaf and spine design

Leaf and Spine

Leaf and spine architecture is a network design methodology commonly used in data centers. It provides a scalable and resilient infrastructure that can handle the increasing demands of modern applications and services. The term “leaf and spine” refers to the physical and logical structure of the network.

In leaf and spine architecture, the network is divided into two main layers: the leaf and spine layers. The leaf layer consists of leaf switches connected to the servers or endpoints in the data center. These leaf switches act as the access points for the servers, providing high-bandwidth connectivity and low-latency communication.

The spine layer, on the other hand, consists of spine switches that connect the leaf switches. The spine switches provide high-speed and non-blocking interconnectivity between the leaf switches, forming a fully connected fabric. This allows for efficient and predictable traffic patterns, as any leaf switch can communicate directly with any other leaf switch through the spine layer.

 

 Lab Guide: ACI Cisco with leaf and spine.

The following lab guide has a leaf-and-spine ACI design with two leaf switches acting as the leaf layer where the workloads connect, and a spine connected to both leaves. When the ACI hardware installation is done, all spines and leaves are cabled and powered up. Once the basic configuration of the APIC is completed, the fabric discovery process starts.

Note: IFM process

In the discovery process, ACI uses the Intra-Fabric Messaging (IFM) process in which APIC and nodes exchange heartbeat messages.

The process used by the APIC to push policy to the fabric leaf nodes is called the IFM Process. ACI Fabric discovery is completed in three stages. The leaf node directly connected to the APIC is discovered in the first stage. The second discovery stage brings in the spines connected to that initial leaf where APIC was connected. The third stage involves discovering the cluster’s other leaf nodes and APICs.

The fabric membership diagram below shows the inventory, including serial number, Pod, Node ID, Model, Role, Fabric IP, and Status. Cisco ACI consists of the following hardware components: APIC Controller Spine Switches and Leaf Switches.

ACI fabric discovery
Diagram: ACI fabric discovery

Analysis:

Cisco ACI uses an overlay based on VXLAN to virtualize physical infrastructure. Like most overlays, this overlay requires the data path at the network’s edge to map from the tenant end-point address in the packet, otherwise referred to as its identifier, to the endpoint’s location, also known as its locator. This mapping occurs in a tunnel endpoint (TEP) function called VXLAN (VTEP).

The VTEP addresses are displayed in the INFRASTRUCTURE IP column. The TEP address pool 10.0.0.0/16 has been configured on the Cisco APIC using the initial setup dialog. The APIC assigns the TEP addresses to the fabric switches via DHCP, so the infrastructure IP addresses in your fabric will differ from the figure.

This configuration is perfectly valid for a lab but not for a production environment; the minimum physical fabric hardware for production is two spines, two leaves, and three APICs. In addition to discovering and configuring the fabric and applying the tenant design, the following functionality can be configured:

  • Routing at Layer 3

  • Connecting a legacy network at layer 2

  • Virtual Port Channels at Layer 2

A note about Border Leafs: ACI fabrics often use this designation along with “Compute Leafs” and “Storage Leafs.” These are merely naming conventions: the Border Leaf pair hosts connectivity to networks external to the fabric, while the Compute Leaf pair hosts server connectivity.

Note: The Link Layer Discovery Protocol (LLDP) is responsible for discovering directly adjacent neighbors. When run between the Cisco APIC and a leaf switch, it precedes three other processes: Tunnel endpoint (TEP) IP address assignment, node software upgrade (if necessary), and the intra-fabric messaging (IFM) process, which the Cisco APIC uses to push policy to the leaves.


Leaf and Spine: Traffic flows

The leaf and spine network topology is suitable for east-to-west network traffic and comprises leaf switches to which the workloads connect and spine switches to which the leaf switches connect. The spines have a simple role to play and are geared around performance, while all the intelligence is distributed to the edge of the network where the leaf layers sit.

This allows engineers to move away from managing individual devices and manage the data center architecture more efficiently with policy. In this model, the Application Policy Infrastructure Controller (APIC) controllers can correlate information from the entire fabric.

Understanding Leaf and Spine Traffic Flow

In a leaf and spine architecture, traffic flow follows a structured path. When a device connected to a leaf switch wants to communicate with another device, the traffic is routed through the spine switch to the destination leaf switch. This approach minimizes the hops required for data transmission and reduces latency. Additionally, traffic can be evenly distributed since every leaf switch is connected to every spine switch, preventing congestion and bottlenecks.

Lab guide on ACI Cisco with leaf and spine.

In the following lab guide, we continue to verify the ACI leaf and spine. We can run the command acidiag fnvread, a diagnostics tool that checks the ACI fabric. It is also recommended to check the LLDP and IS-IS adjacencies. With a leaf-and-spine design, leaf switches do not connect to each other, and we can see this in the LLDP and IS-IS adjacency information below.

ACI leaf and spine
Diagram: ACI leaf and spine
 

Advantages of Leaf and Spine Traffic Flow:

  • Improved Performance: Leaf and spine architecture ensures optimal performance by evenly distributing traffic and minimizing latency. This results in faster data transmission and improved response times for end-users.
  • Scalability: The leaf and spine architecture allows for easy scalability as additional leaf switches can be added without disrupting the existing network. This flexibility enables networks to adapt to changing requirements and handle increasing traffic loads.
  • High Availability: By providing multiple paths for traffic, a leaf and spine architecture ensures redundancy and fault tolerance. If one link fails, traffic can be rerouted through alternative paths, minimizing downtime and ensuring uninterrupted connectivity.


 

Leaf and Spine Switch Functions

Based on a two-tier (spine and leaf switches) or three-tier (spine switch, tier-1 leaf switch, and tier-2 leaf switch) architecture, Cisco ACI switches provide the following functions:

Leaf switches: 

What are Leaf Switches?

Leaf switches connect between end devices, servers, and the network fabric. They are typically deployed in leaf-spine network architecture, connecting directly to the spine switches. Leaf switches provide high-speed, low-latency connectivity to end devices within a data center network.

Functionalities of Leaf Switches:

1. Aggregation: Leaf switches aggregate traffic from multiple servers and send it to the spine switches for further distribution. This aggregation helps reduce the network’s complexity and enables efficient traffic flow.

2. High-density Port Connectivity: Leaf switches are designed to provide a high-density port connectivity environment, allowing multiple devices to connect simultaneously. This is crucial in data centers where numerous servers and devices must be interconnected.

These devices have ports connected to classic Ethernet devices, such as servers, firewalls, and routers. In addition, these leaf switches provide the VXLAN Tunnel Endpoint (VTEP) function at the edge of the fabric. In Cisco ACI terminology, IP addresses representing leaf switch VTEPs are called Physical Tunnel Endpoints (PTEPs). The leaf switches route or bridge tenant packets and apply network policies.

Spine switches:

What are Spine Switches?

Spine switches, also known as spine or core switches, are high-performance switches that form the backbone of a network. They play a vital role in data centers and large enterprise networks and facilitate the seamless data flow between various leaf switches.

These devices interconnect leaf switches. To build a Cisco ACI Multi-Pod fabric, they can also connect Cisco ACI pods to IP networks or WAN devices. Spine switches store the mapping entries between endpoints and VTEPs and act as the spine proxy for traffic whose destination a leaf does not know. Within a pod, every leaf switch connects to every spine switch.

No direct connection between tier-1 leaf switches, tier-2 leaf switches, or spine switches is allowed. If you incorrectly cable spine switches to each other or leaf switches in the same tier to each other, the interfaces will be disabled.

Cisco ACI Fabric
Diagram: Cisco ACI Fabric. Source Cisco Live.

 

Video 2: Demonstration on a leaf and spine data center design

The following tutorial will examine the leaf and spine data center architecture. We know this design is a considerable step from traditional DC design. As a use case, we will focus on how Cisco has adopted the leaf and spine design with its Cisco ACI product. We will address the components and how they form the Cisco ACI fabric.

Spine and Leaf Design: Cisco ACI

BGP Route Reflection

Under the cover, Cisco ACI works with BGP Route-Reflection. BGP Route Reflection creates a hierarchy of routers within the ACI fabric. At the top of the hierarchy is a Route-Reflector (RR), a central point for collecting routing information from other routers within the fabric. The RR then reflects this information to other routers, ensuring that every router in the network has a complete view of the routing table.

The ACI uses the MP-BGP protocol to distribute external network subnets or prefixes inside the ACI fabric. To create an MP-BGP route reflector, we select two spines to act as route reflectors; they then establish iBGP neighborships with all the other leaves.

BGP Route Reflection
Diagram: BGP Route Reflection

The ACI Cisco Architecture

Cisco ACI operates with several standard building blocks. These include Endpoint Groups (EPGs), used to classify and group similar workloads, along with Bridge Domains (BDs), VRFs, contract constructs, the COOP protocol, and micro-segmentation. With micro-segmentation in ACI, you get granular policy enforcement right down to the workload, anywhere in the network.

Unlike in a traditional network design, you don’t need to place certain workloads in specific VLANs or, in some cases, physical locations. The ACI can also incorporate devices separate from the fabric, such as a firewall, load balancer, or IPS/IDS, for additional security mechanisms. This enables dynamic service insertion of Layer 4 to Layer 7 services, giving a lot of flexibility with the redirect option and service graphs.

Cisco ACI key benefits:

  • Automation and consistency

  • Multi-cloud acceleration

  • Zero-trust security protection

  • Centralized management

  • Multi-site capabilities

The ACI Infrastructure

The Cisco ACI architecture is optimized to learn endpoints dynamically with its dynamic endpoint learning functionality, so endpoint learning happens in the data plane. Each leaf learns the endpoints connected to it locally, and the spines maintain a mapping database; this saves resources on the spines and optimizes data traffic forwarding. You no longer need to flood traffic, and if you want, you can turn off flooding in the ACI fabric altogether. On top of this, we have an overlay network.

As you know, the ACI network has both an overlay and a physical underlay; this would be a virtual underlay in the case of Cisco Cloud ACI. The ACI uses VXLAN, the overlay protocol that rides on top of a simple leaf and spine topology, with standards-based protocols such as IS-IS and BGP for route propagation. 

Video: BGP in the Data Center

In this whiteboard session, we will address the basics of BGP. A network exists specifically to serve the connectivity requirements of applications, and those applications serve business needs. These applications must run on stable networks, and stable networks are built from stable routing protocols.

Routing protocols are predefined rules used by the routers that interconnect your network to maintain communication between the source and the destination. These protocols help to find routes between two nodes on the computer network.

BGP in the Data Center

ACI Cisco and endpoints

In a traditional network, three tables are used to maintain the network addresses of external devices: a MAC address table for Layer 2 forwarding, a Routing Information Base (RIB) for Layer 3 forwarding, and an ARP table for the combination of IP addresses and MAC addresses. Cisco ACI, however, maintains this information differently, as shown below.

ACI Endpoint learning
Diagram: Endpoint Learning. Source Cisco.com

What is ACI Endpoint Learning?

ACI endpoint learning refers to discovering and monitoring the network endpoints within an ACI fabric. Endpoints include devices, virtual machines, physical servers, users, and applications. Network administrators can make informed decisions regarding network policies, security, and traffic optimization by gaining insights into these endpoints’ location, characteristics, and behavior.

How Does ACI Endpoint Learning Work?

ACI fabric leverages a distributed, controller-based architecture to facilitate endpoint learning. When an endpoint is connected to the fabric, ACI utilizes a variety of mechanisms to gather information about it. These mechanisms include Address Resolution Protocol (ARP) snooping, Link Layer Discovery Protocol (LLDP), and even integration with hypervisor-based systems.

Once an endpoint is detected, the ACI fabric records it in a comprehensive endpoint database and associates it with an Endpoint Group (EPG). This database contains vital information such as MAC addresses, IP addresses, VLANs, and associated policies. By continuously monitoring and updating this database, ACI ensures real-time visibility and control over the network endpoints.

Benefits of ACI Endpoint Learning:

1. Enhanced Security: With ACI endpoint learning, network administrators can enforce security policies by controlling traffic flow based on endpoint characteristics. Unauthorized or suspicious endpoints can be automatically detected and isolated, reducing the risk of data breaches and unauthorized access.

2. Simplified Network Operations: ACI’s endpoint learning eliminates the need for manual configuration of network policies and access control lists (ACLs). By dynamically learning the endpoints and their associated attributes, ACI enables automated policy enforcement, reducing human error and simplifying network management.

3. Efficient Traffic Optimization: ACI’s endpoint learning enables intelligent traffic steering by understanding the location and behavior of endpoints. This information allows for intelligent load balancing and traffic optimization, ensuring optimal performance and reducing congestion within the infrastructure.

Implementation Endpoint Learning Considerations:

To leverage the benefits of ACI endpoint learning, organizations need to consider a few key aspects:

1. Infrastructure Design: A well-designed ACI fabric with appropriate leaf and spine switches is crucial for efficient endpoint learning. Proper VLAN and subnet design should be implemented to ensure accurate endpoint identification and classification.

2. Endpoint Group (EPG) Definition: Defining and associating EPGs with appropriate policies is essential. EPGs help categorize endpoints based on their characteristics, allowing for granular policy enforcement and simplified management.

Diagram: ACI Endpoint Learning. The source is Cisco.

 

Forwarding Behavior. The COOP database

Local and remote endpoints are learned from the data plane, but remote endpoints are local caches. Cisco ACI’s fabric relies heavily on local endpoints for endpoint information. A leaf is responsible for reporting its local endpoints to the Council Of Oracle Protocol (COOP) database located on each spine switch, which implies that all endpoint information in the Cisco ACI fabric is stored there.

Each leaf does not need to know about all the remote endpoints to forward packets to the remote endpoints because this database is accessible. When a leaf does not know about a remote endpoint, it can still forward packets to spine switches. This forwarding behavior is called spine proxy.

Diagram: Endpoint Learning. The source is Cisco.

In a traditional network environment, switches rely on the Address Resolution Protocol (ARP) to map IP addresses to MAC addresses. However, this approach becomes inefficient as the network scales, resulting in increased network traffic and complexity. Cisco ACI addresses this challenge by utilizing local endpoint learning, a more intelligent and efficient method of mapping MAC addresses to IP addresses.

Diagram: Local and Remote endpoint learning. The source is Cisco.

ACI Cisco: The Main Features

We have a lot of changes right now that are impacting almost every aspect of IT. Applications are changing immensely, and we see their life cycles broken into smaller windows as the applications become less structured. In addition, containers and microservices are putting new requirements on the underlying infrastructure, such as the data centers they live in. This is one of the main reasons why a distributed system, including a data center, is better suited for this environment.

Distributed system/Intelligence at the edge

Like all networks, the Cisco ACI network still has a control and data plane. From the control and data plane perspective, the Cisco ACI architecture is still a distributed system. Each switch has intelligence and knows what it needs to do, which is one of the differences between ACI and traditional SDN approaches that try to centralize the control plane. If you try to centralize the control plane, you may hit scalability limits, not to mention creating a single point of failure and an avenue for bad actors to penetrate.

Cisco ACI Design
Diagram: Cisco ACI Design. Source Cisco Live.

MPLS overlay

In the following guide, we have an example of an MPLS overlay. Similar to that of Cisco ACI, an MPLS overlay pushes intelligence to the edge of the networks. MPLS overlay is a technique that enables the creation of virtual private networks (VPNs) over a shared IP infrastructure.

It involves encapsulating data packets with MPLS labels, allowing routers to forward traffic based on these labels rather than the traditional IP routing. This process enhances network efficiency, reduces complexity, and creates secure and isolated network segments.

Two PE nodes run BGP, while the P nodes that make up the core run only an IGP plus LDP. In the core, we have label-switched paths, which bring a lot of scalability.

MPLS overlay
Diagram: MPLS Overlay

Two large core devices

If we examine the traditional data center architecture, intelligence is often in two central devices. You could have two large core devices. What the network used to control and secure has changed dramatically with virtualization via hypervisors. We’re seeing faster change with containers and microservices being deployed more readily.

As a result, an overlay networking model is better suited. However, in a VXLAN overlay network, the intelligence is distributed across the leaf switch layer.

Therefore, distributed systems are better than centralized systems for scale, resilience, and security. By distributing the intelligence to the leaf layer, scalability is no longer determined by a single device but at the fabric level. There are still scale limits on each individual device, so overall scalability comes down to the network design.

A key point: Overlay networking

The Cisco ACI architecture provides an integrated Layer 2 and 3 VXLAN-based overlay networking capability to offload network encapsulation processing from the compute nodes onto the top-of-rack or ACI leaf switches. This architecture provides the flexibility of software overlay networking in conjunction with the performance and operational benefits of hardware-based networking. We will have a lab guide on overlay networking in just a moment.

ACI infrastructure
Diagram: ACI infrastructure.

ACI Cisco New Concepts

Networking in the Cisco ACI architecture differs from what you may use in traditional network designs. It’s not different because we use an entirely new set of protocols. ACI uses standards-based protocols such as BGP, VXLAN, and IS-IS. However, the new networking constructs inside the ACI fabric exist only to support policy.

ACI has been referred to as stateless architecture. As a result, the network devices have no application-specific configuration until a policy is defined stating how that application or traffic should be treated on the network.

This is a new and essential concept to grasp. With ACI, the network devices in the fabric have no application-specific configuration until there is a defined policy; no configuration is tied to a device. With a traditional configuration model, we carry a lot of configuration on a device even when nothing is using it, for example, ACL and QoS parameters that were configured but never referenced.

  • Cisco ACI: Stateless Architecture.

  • ACI Network: Standards-based protocols such as BGP.

  • ACI Network: New ACI network constructs.

  • ACI Fabric Constructs: EPGs and Contracts.

  • Cisco ACI Architecture: VXLAN distributed architecture.

  • Cisco ACI Fabric: No policy tied to devices.

The APIC controller

The APICs, the management plane where policy is defined, do not need to push policy to devices when nothing connected is using it. The APIC controller can see the entire fabric and has a holistic viewpoint.

Therefore, it can correlate configurations and integrate them with devices to help manage and maintain the security policy you define. We see every device on the fabric, physical or virtual, and can maintain policy consistency and, more importantly, recognize when policy needs to be enforced. 

APIC Controller
Diagram: APIC Controller. Source Cisco Live.

Endpoint groups (EPG)

We touched on this a moment ago. Groups or endpoint groups (EPGs) and contracts are core to the ACI. Because this is a zero-trust network by default, communication is blocked in hardware until a policy consisting of groups and contracts is defined. With Endpoint Groups, we can decouple and separate the physical or virtual workloads from the constraints of IP addresses and VLANs. 

So, we are grouping similar workloads into groups known as Endpoint Groups. Then, we can control group behavior by applying policy to the groups and not the endpoints in the group. As a security best practice, it is essential to group similar workloads with similar security sensitivity levels and then apply the policy to the endpoint group.

For example, a traditional data center network could have database and application servers in the same segment controlled by a VLAN with no intra-VLAN filtering. The EPG approach removes the barriers we have had with traditional networks with the limitation of the IP address being used as the identifier and locator and the VLANs restrictions.

This is a new way of thinking and allows devices to communicate with each other without having to change the IP address, VLAN, or subnet.

ACI Endpoint Groups
Diagram: ACI Endpoint Groups. Source Cisco Live.

EPG Communication

The EPG provides a better way to provide segmentation than the VLAN, which was never meant to live in a world of security. Anything in the group, by default, can communicate freely, and Inter-EPG communication needs a policy. This policy construct that ACI uses is called a contract. So, having similar workloads of similar security levels in the same EPG makes sense. All devices inside the same endpoint group can talk to each other freely.

This behavior can be modified with intra-EPG isolation, similar to a private VLAN where communication between group members is not allowed. Or, intra-EPG contracts can be used only to allow specific communications between devices in an EPG.

Endpoint groups
Diagram: Cisco Endpoint Groups (EPG).

Data Center Network Challenges

Let us examine well-known data center challenges and how the Cisco ACI network solves them.

Challenges in the traditional data center that Cisco ACI addresses:

  • Complicated Topologies

  • Oversubscription

  • Varying Bandwidths

  • Management Challenges

  • Lack of Portability

  • Issues with ACLs

  • Issues with Spanning Tree

  • Core-Distribution Designs

Complicated topologies

Usually, a traditional data center network design uses core, distribution, and access layers. When you add more devices, this topology can be complicated to manage. Cisco ACI uses a simple spine-leaf topology in which all connections within the fabric run from leaf to spine switches, forming a full mesh between the two tiers. There is no leaf-to-leaf and no spine-to-spine connectivity.

How ACI Cisco overcomes this

The Cisco ACI architecture uses the leaf-spine, consisting of a two-tier “fat tree” topology with equidistant bandwidths. The leaf layer connects to the physical and virtual workloads and network services. The spine layer is the transport layer that interconnects the leaves.

Oversubscription

Oversubscription generally means potentially requiring more resources from a device, link, or component than are available. Therefore, the oversubscription ratio must be examined at multiple aggregation points in the design, including the line card to switch fabric bandwidth and the switch fabric input to uplink bandwidth.

Oversubscription Example

Let’s look at a typical two-layer network topology with access switches and a central core switch. Each access switch has 24 1Gb user ports and one 10Gb uplink port connected to the core switch. In theory, if all the user ports transmitted simultaneously, they would require 24Gb of bandwidth (24 x 1Gb).

However, the uplink port is only 10Gb, limiting the maximum bandwidth available to all the user ports. The uplink port is oversubscribed because the theoretically required bandwidth (24Gb) exceeds the available bandwidth (10Gb). Oversubscription is expressed as a ratio of bandwidth needed to bandwidth available; in this case, it’s 24Gb/10Gb, or 2.4:1.
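A quick back-of-the-envelope check of that ratio, using the numbers from the example above:

    # Oversubscription = bandwidth required / bandwidth available
    access_ports = 24      # 1Gb user ports per access switch
    port_speed_gb = 1
    uplink_gb = 10

    required_gb = access_ports * port_speed_gb   # 24Gb worst case
    ratio = required_gb / uplink_gb               # 2.4
    print(f"{required_gb}Gb required over a {uplink_gb}Gb uplink -> {ratio}:1 oversubscription")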

Varying bandwidths

With traditional core, distribution, and access designs, we have layers of oversubscription at each tier. As a result, endpoints get varying bandwidth depending on whether they are communicating with an endpoint that is near or far away: endpoints on the same switch have more bandwidth available than two endpoints communicating across the core layer.

Users and application owners don’t care about networks; they want to place their workload wherever the compute is and get the same bandwidth regardless of where it is placed. However, with traditional designs, the bandwidth available depends on where the endpoints are located.

How ACI Cisco overcomes this

With the ACI leaf and spine, any two endpoints are equidistant, so any two servers get the same bandwidth. That is a big plus for data center performance, and it means it doesn’t matter where you place the workload, which is a big plus for virtualized workloads. This gives you unrestricted workload placement.

data center challenges
Diagram: Data center challenges.

Lack of portability

Applications are built on top of many building blocks. We use constructs such as VLANs, IP addresses, and ACLs to create connectivity, and we use these constructs to translate the application requirements onto the network infrastructure. These constructs are hardened into the network with configurations applied before connectivity is established.

These configurations are not very portable. It’s not that they were poorly designed; they were simply never meant to be portable. The Locator/ID Separation Protocol (LISP) did a good job of making them more portable, but they remain hard-coded for a particular requirement at a particular time. Therefore, if we have the same requirement in a different data center location, we must reconfigure the IP addresses, VLANs, and ACLs.

How ACI Cisco overcomes this

An application refers to a set of networking components that provides connectivity for a given set of workloads. These workloads’ relationship is what ACI calls an “application,” and the connection is expressed by what ACI calls an application network profile. With a Cisco ACI design, we can create what is known as Application Network Profiles (ANPs).

The ANP expresses the relationship between the application and its communications. It is a configuration template used to express the relationship between segments. The ACI then translates those relationships into networking constructs such as VLANs, VXLAN, VRF, and IP addresses that the devices in the network can then implement.

Issues with ACL

The traditional ACL is very tightly coupled with the network topology, and anything that is tightly coupled kills agility. It is configured on specific ingress and egress interfaces and pre-set to expect a particular traffic flow. These interfaces are usually at demarcation points in the network, yet many other points in the network could benefit from security filtering.

How ACI Cisco overcomes this

The fundamental security architecture of the Cisco ACI design follows an allow-list model where we explicitly define what traffic should be permitted. A contract is a policy construct used to define communication between EPGs.  Without a contract between EPGs, no unicast communication is possible between those EPGs unless the VRF is configured in “unenforced” mode or those EPGs are in a preferred group.

A contract is not required for communication between endpoints in the same EPG (although it can be restricted with intra-EPG isolation or an intra-EPG contract). We have a different construct for applying policy in ACI: the contract construct, within which subjects and filters specify how endpoints are allowed to communicate.

These managed objects are not tied to the network’s topology because they are not applied to a specific interface. Instead, the contracts are used in the intersection between EPGs. They represent rules the network must enforce irrespective of where these endpoints are connected.   

Issues with Spanning Tree Protocol (STP)

A significant shortcoming of STP is its brittle failure mode, which can bring down entire data centers or campus networks when something goes wrong. Though modifications and enhancements have addressed some of these risks, this has come at the cost of technical debt in design and maintenance.

When you think about how this works, the BPDU acts as a hello mechanism, and when a switch stops receiving BPDUs while the link stays up, it moves the port to forwarding. This is how Spanning Tree Protocol causes outages.

How ACI Cisco overcomes this

The Cisco ACI does not run Spanning Tree Protocol natively, meaning the ACI control plane does not run STP. Inside the fabric, we are running IS-IS as the interior routing protocol. If we stop receiving, we don’t go into an all-forwarding state with IS-IS. As we have IP reachability between Leaf and Spine, we don’t have to block ports and see actual traffic flows that are not the same as the physical topology.

So, within the ACI fabric, we have all the advantages of Layer 3 networks, which are more robust and predictable than an STP design. With ACI, we don’t rely on STP for the topology design. Instead, the ACI uses ECMP for Layer 2 and Layer 3 forwarding, which is possible because we have routed links between the leaves and spines in the ACI fabric.

leaf and spine design
Diagram: Leaf and spine design.
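To illustrate the ECMP behavior described above, here is a small, purely illustrative Python sketch of flow-based path selection across a leaf's uplinks to the spines. Real switch ASICs use their own hash functions and field selections; this only shows why one flow sticks to one path while different flows spread across all spines.

```python
import hashlib

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Hash the 5-tuple and map it onto one of the equal-cost uplinks.

    Every packet of a flow hashes to the same value, so the flow stays on
    one spine while different flows are distributed across all of them.
    """
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return uplinks[digest % len(uplinks)]

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(pick_uplink("10.0.0.1", "10.0.0.2", 49152, 443, "tcp", spines))
print(pick_uplink("10.0.0.1", "10.0.0.3", 49153, 443, "tcp", spines))
```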

Core-distribution design

The traditional design uses VLANs to logically segment Layer 2 boundaries and broadcast domains. VLANs use network links inefficiently, resulting in rigid device placement. We also have a cap on the number of VLANs we can create. Some applications require Layer 2 adjacency.

For example, clustering software requires Layer 2 adjacency between source and destination servers. However, if we are routing at the access layer, only servers connected to the same access switch with the same VLANs trunked down would be Layer 2-adjacent. 

How Cisco ACI overcomes this

VXLAN solves this dilemma in ACI by decoupling Layer 2 domains from the underlying Layer 3 network infrastructure. With ACI, we use the concept of overlays to provide this abstraction. Isolated Layer 2 domains can be connected over a Layer 3 network using VXLAN, and packets are transported across the fabric using Layer 3 routing.

This paradigm fully supports Layer 2 networks. Large Layer 2 domains will always be needed, for example, for VM mobility, clusters that don’t or can’t use dynamic DNS, non-IP traffic, and broadcast-based intra-subnet communication.

Cisco ACI Architecture: Leaf and Spine

The fabric is symmetric with a leaf and spine design, and bandwidth is consistent: regardless of where a device is connected to the fabric, it has the same bandwidth as every other device connected to the same fabric. This removes the placement restrictions that we have with traditional data center designs. A spine-leaf architecture is a data center network topology that consists of two switching layers, a spine and a leaf.

The leaf layer comprises access switches that aggregate server traffic and connect directly to the spine or network core. Spine switches interconnect all leaf switches in a full-mesh topology.

For low-latency east-west traffic, optimized traffic flows are imperative for performance, especially for time-sensitive or data-intensive applications. A spine-leaf architecture aids this by ensuring traffic is always the same number of hops from its next destination, so latency is low and predictable.

Displaying a VXLAN tunnel 

We have expanded the original design and added VXLAN. We are creating a Layer 2 network, or, more specifically, a Layer 2 overlay over a Layer 3 routed core. The Layer 2 extension allows the two hosts, desktop 0 and desktop 1, to communicate over the Layer 2 overlay that VXLAN creates.

The IP addresses of the hosts are 10.0.0.1 and 10.0.0.2 and are not reachable via the Leaf switches. The leaf switches cannot ping these. Consider the Leaf and the Spine switches a standard Layer 3 WAN or network for this lab. So we have unicast connectivity over the WAN.

The only IP routing addition I have made is the new loopback addresses on Leaf 1 and Leaf 2, 1.1.1.1/32 and 2.2.2.2/32, used for ingress replication for VXLAN. Remember that the ACI is only one of many products that use Layer 2 overlays; VXLAN can also be used as a Layer 2 DCI. For a lab guide displaying Multicast VXLAN, see the blog post What is VXLAN.

VXLAN overlay
Diagram: VXLAN Overlay

Notice below I am running a ping from desktop 0 to the corresponding desktop. These hosts are in the 10.0.0.0/8 range, and the core does not know these subnets. I’m also running a packet capture on the link Gi1 connected to Leaf A.

Notice the source and destination are 1.1.1.1 and 2.2.2.2, which are the VTEPs, and the ICMP traffic is encapsulated into UDP port 1024. UDP port 1024 is explicitly set in the configuration as the VXLAN port to use.
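To visualize what that capture contains, the following Scapy sketch rebuilds the same encapsulation: the hosts' ICMP exchange carried inside VXLAN over UDP port 1024 between the 1.1.1.1 and 2.2.2.2 VTEP loopbacks. The MAC addresses and the VNI value are placeholders, not taken from the lab.

```python
from scapy.all import Ether, IP, UDP, ICMP
from scapy.layers.vxlan import VXLAN

# Inner frame: the hosts' ICMP echo, exactly what desktop 0 sends to desktop 1.
inner = Ether(src="00:00:00:aa:aa:aa", dst="00:00:00:bb:bb:bb") / \
        IP(src="10.0.0.1", dst="10.0.0.2") / ICMP()

# Outer headers: VTEP loopback to VTEP loopback, UDP destination port 1024
# (the non-default port used in this lab), carrying the VXLAN header.
outer = IP(src="1.1.1.1", dst="2.2.2.2") / UDP(dport=1024) / VXLAN(vni=5000) / inner

outer.show()   # print the full encapsulation stack, mirroring the packet capture
```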

VXLAN unicast mode

ACI Network: VXLAN transport network

In a leaf-spine ACI fabric, we have a native Layer 3 IP fabric that supports equal-cost multipath (ECMP) routing between any two endpoints in the network. Using VXLAN as the overlay protocol allows any workload to exist anywhere in the network.

We can have physical and virtual machines in the same logical layer 2 domain while running layer 3 routing to the top of each rack. So we can have several endpoints connected to each leaf, and for one endpoint to communicate with another endpoint, we use VXLAN.

So, the transport of the ACI fabric is carried out with VXLAN. The ACI encapsulates traffic with VXLAN and forwards the data traffic across the fabric. Any policy that needs to be implemented gets applied at the leaf layer. All traffic on the fabric is encapsulated with VXLAN. This allows us to support standard bridging and routing semantics without the standard location constraints.

Diagram: VXLAN operations. The source is Cisco.
  • A key point – Video 3: Demonstration on overlay networking with VXLAN

The following video gives a deep dive into the operations of VXLAN. The VLAN tag field defined in IEEE 802.1Q has 12 bits for host identification, supporting a maximum of only 4094 VLANs. It’s common these days to have a multi-tiered application deployment where every tier requires its own segment, and with thousands of multi-tier application segments, this address space will run out.

Then came along the Virtual extensible local area network (VXLAN). VXLAN uses a 24-bit network segment ID, called a VXLAN network identifier (VNI), for identification. This is much larger than the 12 bits used for traditional VLAN identification.
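The difference in scale is easy to check:

```python
vlan_ids = 2 ** 12    # 12-bit 802.1Q tag: 4096 values (4094 usable VLANs)
vxlan_vnis = 2 ** 24  # 24-bit VNI field: 16,777,216 possible segments
print(vlan_ids, vxlan_vnis)   # 4096 16777216
```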

Technology Brief : VXLAN – Introducing VXLAN

Council of Oracle Protocol

COOP protocol in ACI and the ACI fabric

The fabric appears to the outside as one switch capable of forwarding at Layers 2 and 3. In addition, the fabric is a Layer 3 routed network that enables all links to be active, providing ECMP forwarding in the fabric for both Layer 2 and Layer 3. Inside the fabric, we have routing protocols such as BGP; we also use the Intermediate System-to-Intermediate System (IS-IS) protocol and the Council of Oracle Protocol (COOP) for all endpoint-to-endpoint forwarding.

The COOP protocol in ACI communicates the mapping information (location and identity) to the spine proxy. A leaf switch forwards endpoint address information to the spine switch ‘Oracle’ using Zero Message Queue (ZMQ). The COOP protocol in ACI is something new to data centers. The Leaf switches use COOP to report local station information to the Spine (Oracle) switches.

COOP protocol in ACI

Let’s look at an example of how the COOP protocol in ACI works. We have a Leaf that learns of a host. The Leaf reports this information; let’s say it knows Host B and sends this to one of the Spine switches chosen randomly using the Council Of Oracle Protocol.

The Spine switch then relays this information to all the other Spines in the ACI fabric so that every Spine has a complete record of every single endpoint. The Spine switches record the information learned via COOP in the Global Proxy Table, which resolves unknown destination MAC/IP addresses when traffic is sent to the proxy address.
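The behavior just described can be pictured with a small, purely illustrative model: a leaf reports an endpoint to one spine, and that spine relays the record to its peers so every spine's proxy table can resolve it. This is only a mental model in Python, not the actual COOP wire protocol.

```python
class Spine:
    def __init__(self, name):
        self.name = name
        self.proxy_table = {}        # endpoint (MAC, IP) -> location (leaf, BD VNID)

    def coop_update(self, endpoint, location, peers=()):
        self.proxy_table[endpoint] = location
        for peer in peers:           # relay to the other spines (oracles)
            peer.proxy_table[endpoint] = location

    def resolve(self, endpoint):
        # Used when a leaf sends unknown destination traffic to the spine proxy.
        return self.proxy_table.get(endpoint)

spines = [Spine("spine-1"), Spine("spine-2")]

# Leaf 1 learns Host B locally and reports it to one randomly chosen spine.
spines[0].coop_update(("0050.5690.3eeb", "10.0.0.20"),
                      {"leaf": "leaf-1", "bd_vnid": 16154554},
                      peers=spines[1:])

print(spines[1].resolve(("0050.5690.3eeb", "10.0.0.20")))
```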

Lab guide on the COOP database.

So, we know that the Spine holds a COOP database of all endpoints in the fabric, built from the mapping information (location and identity) that the leaf switches report to the spine proxy.

The command: Show coop internal info repo key allows us to verify that the endpoint is in the COOP database using the BD VNID of 16154554 mapped to the MAC address of 0050.5690.3eeb. With this command, you can also see the tunnel next hop and IPv4 and IPv6 addresses tied to this MAC address.

coop protocol in ACI
Diagram: COOP protocol in ACI

The fabric constructs

The ACI Fabric contains several new network constructs specific to ACI that enable us to abstract much of the complexity we had with traditional data center designs. These new concepts are ACI’s Endpoint Groups, Contracts, Bridge Domains, and COOP protocol.

In addition, we have a distributed Layer 3 anycast gateway function that ensures optimal Layer 3 and Layer 2 forwarding. We also have constructs you may already be familiar with, such as VRFs. The Layer 3 anycast feature is popular and allows flexible placement of the default gateway, suited for designs that need to be agile.

Extending the ACI Fabric

Developing the Cisco ACI architecture

I have always found extending the data center risky when undergoing data center network design projects. However, the Cisco ACI architecture can be extended without the traditional Layer 2 and Layer 3 Data Center Interconnect (DCI) mechanisms. Here, we can use Multi-Pod and Multi-Site to better control large environments that need to span multiple locations and to let applications share those locations in active-active deployments.

Diagram: Extending the ACI fabric. Source is Cisco

When considering data center designs, terms such as active-active and active-passive are often discussed. In addition, enterprises are generally looking for data center solutions that provide geographical redundancy for their applications.

Enterprises also need to be able to place workloads in any data center where computing capacity exists—and they often need to distribute members of the same cluster across multiple data center locations to provide continuous availability in the event of a data center failure. The ACI gives us options for extending the fabric to multiple locations and location types.

For example, there are stretched fabric, multi-pod, multi-site designs, and, more recently, Cisco Cloud ACI.

Cisco ACI Design
Diagram: Cisco ACI design: Extending the network.

ACI design: Multi pod

The ACI Multi-Pod is the next evolution of the original stretch fabric design we discussed. The architecture consists of multiple ACI Pods connected by an IP Inter-Pod Layer 3 network. With the stretched fabric, we have one Pod across several locations. Cisco ACI MultiPod is part of the “single APIC cluster/single domain” family of solutions; a single APIC cluster is deployed to manage all the interconnected ACI networks.

These ACI networks are called “pods,” and each looks like a regular two-tier spine-leaf topology. The same APIC cluster can manage several pods. All of the nodes deployed across the individual pods are under the control of the same APIC cluster, and the separate pods are managed as if they were logically a single entity. This gives you operational simplicity. We also have a fault-tolerant fabric since each Pod has isolated control plane protocols.

Diagram: Multi-pod. Source is Cisco

ACI design: Cisco cloud ACI

Cisco Cloud APIC is an essential new solution component introduced in the architecture of Cisco Cloud ACI. It plays the equivalent of APIC for a cloud site. Like the APIC for on-premises Cisco ACI sites, Cloud APIC manages network policies for the cloud site it runs on by using the Cisco ACI network policy model to describe the policy intent.

ACI design: Multisite

ACI Multi-Site enables you to interconnect separate APIC cluster domains or fabrics, each representing a separate availability zone. As a result, we have separate and independent APIC domains and fabrics. This way, we can manage multiple fabrics as regions or availability zones. ACI Multi-Site is arguably one of the simplest DCI solutions to deploy: communication between endpoints in separate sites (Layer 2 and Layer 3) is enabled simply by creating and pushing a contract between the endpoints’ EPGs.

Cisco ACI Architecture

ACI Network

Cisco ACI 

  • Leaf and Spine

  • Equidistant endpoints

  • ACI APIC Controller

  • Multi-Pod and Multi-Site

  • VXLAN Overlay

  • Endpoint Groups

  • Bridge Domains

  • VRFs

  • Automation and Consistency

  • Multi-cloud support

  • Zero Trust Security 

  • Central Management

 


SASE Definition

SASE Definition

In today's digital landscape, organizations constantly seek ways to enhance network security, simplify infrastructure, and optimize performance. One emerging concept that has gained significant attention is Secure Access Service Edge (SASE). In this blog post, we will delve into the definition of SASE, its key components, and how it can revolutionize how businesses approach network and security architecture.

Secure Access Service Edge (SASE) is a transformative network architecture model that combines network and security services into a unified cloud-native solution. It offers a holistic approach to networking, allowing organizations to connect securely to cloud resources, applications, and data centers, regardless of their location or the devices being used.


Highlights: SASE Definition

Multiple network and security functions 

SASE definition, or Secure Access Service Edge, is a modern networking solution that combines multiple security functions into a single platform. This solution is designed to provide secure access to cloud-based applications, data, and services. The SASE architecture is built on top of a cloud-native platform that integrates software-defined wide-area networking (SD-WAN) with security functions such as secure web gateway (SWG), cloud access security broker (CASB), firewall as a service (FWaaS), and zero-trust network access (ZTNA).

Traditional complex methods

SASE meaning is becoming increasingly popular among organizations because it provides a more flexible and cost-effective approach to networking and security. The traditional approach to networking and security involves deploying multiple devices or appliances, each with its own set of functions. This approach can be complex, time-consuming, and expensive to manage. On the other hand, SASE simplifies this process by integrating all the necessary functions into a single platform.

SASE: A scalable approach

SASE also provides a more scalable and adaptable solution for organizations adopting cloud-based applications and services. With SASE, organizations can connect to cloud-based platforms like AWS, Azure, or Google Cloud, ensuring secure data and application access. Additionally, SASE provides better visibility and control over network traffic, allowing organizations to monitor and manage their network more effectively.

SASE definition

Related: You may find the following posts helpful for pre-information:

  1. SD-WAN SASE
  2. SASE Solution
  3. Security Automation
  4. SASE Model
  5. Cisco Secure Firewall
  6. eBOOK on SASE



SASE Definition

Key SASE Meaning Discussion Points:


  • New phase of WAN transformation.

  • WAN challenges and how SASE solves them.

  • Challenge: Managing the network.

  • Challenge: Site connectivity.

  • Challenge: Site performance.

  • Challenge: Cloud agility.

  • Challenge: Security.

Vendor Example: Cisco Umbrella

The Power of Secure Access Service Edge (SASE)

One of the key concepts associated with Cisco Umbrella is Secure Access Service Edge (SASE). SASE combines network security and wide-area networking (WAN) capabilities into a single cloud-native service. By converging multiple security functions such as secure web gateways, firewall-as-a-service, and data loss prevention, SASE provides a unified and simplified approach to network security. Cisco Umbrella is crucial in the SASE framework by seamlessly integrating cloud security services with the network.

Key Features and Benefits

Cisco Umbrella offers a range of powerful features that enhance network security. These include threat intelligence, advanced malware protection, secure internet gateway, and DNS-layer security. By leveraging the power of machine learning and data analytics, Umbrella continuously analyzes global internet activity to identify and block threats in real-time. Moreover, its intuitive dashboard gives administrators granular visibility and control over network traffic, enabling them to make informed decisions and respond swiftly to potential threats.

Cisco Umbrella
Diagram: Cisco Umbrella. Source is Cisco

SASE: A Cloud-Centric Approach

Firstly, the SASE meaning comes down to the environment we are in. In a cloud-centric world, users and devices require access to services everywhere. The focal point has changed: it is now the identity of the user and device, whereas the traditional model focused solely on the data center and its many network security components. These environmental changes have created a new landscape we must protect and connect.

Many common problems challenge the new landscape. Due to the appliances deployed for different technology stacks, enterprises are loaded with complexity and overhead. The legacy network and security designs increase latency. In addition, most traffic today is encrypted, and when considering Zero Trust SASE, it needs to be inspected without degrading application performance.

These are reasons to leverage a cloud-delivered secure access service edge (SASE). SASE provides a tailored network fabric optimized where it makes the most sense for the user, device, and application: at geographically dispersed PoPs that enable technologies to secure your environment, such as single packet authorization.

SASE explained
Diagram: SASE explained. Source Fortinet.

SASE Meaning

Main SASE Definition Components

SASE – Secure Access Service Edge

  • Network as a Service (NaaS)

  • Security as a Service (SECaaS)

  • Zero-Trust Architecture

  • Cloud-Native Architecture

Components of SASE:

1. Network as a Service (NaaS): SASE integrates network services such as SD-WAN (Software-Defined Wide Area Network) and cloud connectivity to provide organizations with a flexible and scalable network infrastructure. With NaaS, businesses can optimize network performance, reduce latency, and ensure reliable connectivity across different environments.

2. Security as a Service (SECaaS): SASE incorporates various security services, including secure web gateways, firewall-as-a-service, data loss prevention, and zero-trust network access. By embedding security into the network infrastructure, SASE enables organizations to enforce consistent security policies, protect against threats, and simplify the management of security measures.

3. Zero-Trust Architecture: SASE adopts a zero-trust approach, which assumes that no user or device should be trusted by default, even within the network perimeter. By implementing continuous authentication, access controls, and micro-segmentation, SASE ensures that every user and device is verified before accessing network resources, reducing the risk of unauthorized access and data breaches.

4. Cloud-Native Architecture: SASE leverages cloud-native technologies to provide a scalable, agile, and elastic network and security infrastructure. By transitioning from legacy hardware appliances to software-defined solutions, SASE enables organizations to respond more quickly to changing business requirements, reduce costs, and improve overall efficiency.

Benefits of SASE:

1. Enhanced Security: By integrating security into the network infrastructure, SASE provides a unified and consistent security approach across all network edges, reducing potential vulnerabilities and simplifying security management.

2. Increased Agility: SASE enables organizations to adapt quickly to changing business requirements by providing on-demand network and security services that can be rapidly provisioned and scaled.

3. Improved User Experience: With SASE, users can securely access applications and resources from anywhere, on any device, without compromising performance or experiencing network congestion.

4. Cost Savings: By consolidating network and security services into a single cloud-native solution, organizations can reduce hardware and maintenance costs, streamline their infrastructure, and optimize resource utilization.

Secure Access Service Edge

SASE Advantages

Cloud Delivered: Network and Security

  • Unified and consistent security to all edges.

  • Increased agility with on-demand network and security services

  • Improved user experience. Same access from all locations.

  • Cost savings with a single cloud-native solution.

Secure Access Service Edge

SASE Technologies

Cloud Delivered: Network and Security

  • SD-WAN

  • Cloud Access Security Broker (CASB)

  • NGFW and Firewall as a service

  • Zero Trust Network Access (ZTNA)

  • Secure Web Gateway (SWG)

 

Lab Guide: Phishing Attacks

The Social-Engineer Toolkit (SET)

In this lab, we have a fake Google login page that we can use to capture the username and password. This process is known as phishing, and here, I will use the Social-Engineer Toolkit (SET), specifically designed to perform advanced attacks against the human element. 

Note:

SET was designed to be released with the http://www.social-engineer.org launch and has quickly become a standard tool in a penetration tester’s arsenal. The attacks built into the toolkit are intended to be focused attacks against a person or organization used during a penetration test. There are a couple of steps to perform, and I’m using Kali Linux.

  1. Once the Social Engineering Toolkit loads, select 1) Social-Engineering Attacks from the menu. 
  2. Select 2) Website Attack Vectors from the following menu. 
  3. Select 3) Credential Harvester Attack Method from the following menu. 
  4. Select 1) Web Templates from the next menu. 
  5. The following prompt will ask for your IP address for the POST request. The default IP [xx.xx.xx.xx] is correct, so hit Enter here.
  6. Next, select 2) Google as the template. 

The credential harvester attack is a phishing attack where attackers create deceptive websites or emails to trick unsuspecting victims into providing their login credentials. These malicious actors often mimic legitimate websites or services, luring users into entering their usernames, passwords, or other sensitive information.

Techniques Employed by Credential Harvesters

Credential harvesters employ various techniques to make their attacks more convincing. They may use URL manipulation, where the website’s URL appears genuine, but in reality, it redirects to a fraudulent page designed to capture user credentials. Another method involves creating spoofed emails with links that lead to imitation login pages.

Consequences of Falling Victim to Credential Harvesters

The consequences of falling victim to credential harvesters can be severe. Once attackers obtain login credentials, they can gain unauthorized access to personal accounts, financial information, or corporate networks. This can result in identity theft, financial loss, reputational damage, and compromised privacy.

Analysis: 

    • This is an effortless way for attackers to use malicious links inside emails, texts, or social media messages. If those links are clicked, it directs the user to a fake login page to capture their credentials! 
    • Fortunately, there are several preventive measures individuals and organizations can take to safeguard against credential harvester attacks. Implementing robust and unique passwords, enabling two-factor authentication, and regularly updating software and security patches are effective ways to enhance security.
    • Additionally, being cautious of unsolicited emails, scrutinizing URLs before entering credentials, and educating oneself about phishing techniques can significantly reduce the risk of falling victim to such attacks.

In conclusion, the credential harvester attack method poses a significant threat to individuals and organizations. By understanding the techniques employed by attackers, being aware of the consequences, and implementing preventive measures, we can fortify our defenses against these malicious activities. Remember, staying vigilant and practicing good cybersecurity hygiene is the key to staying one step ahead of cybercriminals.

Back to Basics: SASE Definition

Generally, SASE services include SD-WAN, Zero-Trust Network Access (ZTNA), Cloud Access Security Broker (CASB), NGFW, Secure Web Gateway (SWG), unified management, and orchestration. Just what constitutes a real SASE solution varies significantly by source.

Several organizations, such as the Metro Ethernet Forum (MEF), are trying to establish neutral industry standards for SASE. These standards will pave the way for a universal understanding, the ability to integrate multiple manufacturers into a solution, and a method for teaching SASE.

the rise of sase
Diagram: Cloud-native application security. The rise of SASE.

SASE Meaning

The rise of SASE and digital transformation

There has been a loss of confidence in the network. As a result, organizations uncover weaknesses in their networks when they roll out digital initiatives. This seems to be true for MPLS backbones and in some SD-WAN designs, where there is a lag in security, cloud connectivity, mobility, and site connectivity.

Confidence in SD-WAN and MPLS has significantly decreased when confronted with the demands of digital network transformation. Intrinsically, SD-WAN is not an all-encompassing solution, whereas MPLS is rigid and fixed.

MPLS forwarding
Diagram: MPLS Overlay

It is common to find that organizations were more confident in their networks before adopting digital transformation than after. Therefore, it is difficult to predict the impact of digital change on networks. Enterprises must ensure they have the proper infrastructure performance and security levels. Digital transformation is not just about replacing MPLS. Networking professionals must broaden their focus to encompass security, cloud, and mobility.

sase definition
Diagram: SASE definition. They are driving digital transformation.

WAN Transformation

SASE Meaning

All these problems can be avoided by switching to SASE, a new enterprise networking technology category introduced by Gartner in 2019. SASE meaning is the convergence of security, cloud connectivity, mobility, and site connectivity, enabling the architecture to correlate disparate data points.

It is an all-encompassing solution that provides a ready-made path for the WAN transformation journey. Gartner expects at least 40% of enterprises to have explicit strategies to adopt SASE by 2024.

Today, customers are looking for a WAN transformation solution that connects and secures all edges: sites, cloud resources, mobile users, and anything else that might emerge tomorrow. MPLS is not the right approach, and some SD-WAN deployments are raising question marks. SASE, on the other hand, significantly assists post-digital transformation.

So, let us shine the torch on some of the digital transformation challenges likely to surface. These challenges include complexity with management and operations, site connectivity, performance between locations, inefficient security, and cloud agility.

SASE definition
Diagram: SASE: Combining network and security.

SASE Definition: Secure Access Service Edge (SASE)

The SASE definition combines network security functions (such as SWG, CASB, FWaaS, and Zero Trust Network Access (ZTNA)) with SD-WAN to support organizations’ dynamic, secure access needs. These capabilities are primarily delivered as a service (XaaS) and are based on the entity’s identity, real-time context, and security/compliance policies.

SASE changes the focal point to the identity of the user and device. With traditional network design, this was the on-premises data center. The conventional enterprise network and network security architectures place the internal data center as the focal point for access.

These designs are proving ineffective and cumbersome with the rise of cloud and mobile. Traffic patterns have changed considerably, and so has the application logic.

  • A key point: “Software-defined” secure access

SASE consolidates networking and security-as-a-service capabilities into a cloud-delivered secure access service edge. The cloud-delivered service provides you with policy-based “software-defined” secure access, comprising a worldwide fabric of points of presence (PoPs) and peering relationships. With the PoP design, the general architecture is to move the inspection engines to the sessions, rather than rerouting the sessions to the engines as traditional techniques do. This design is more aligned with today’s traffic patterns and application logic.

        • SASE offers a tailorable network fabric comprising the SASE PoPs geographically dispersed.

The architecture allows you to accurately specify every network session’s performance, reliability, security, and cost, based on identity and context. For practical, secure access, decisions must be centered on the identity of the entity at the source of the connection, not on a traditional construct such as the IP address or mere network location. The requesting entity can be a user, device, branch office, IoT device, or edge computing location, with policy based on these parameters.
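As a purely illustrative sketch of identity- and context-centered access decisions (the attributes, rules, and names are invented, not any vendor's policy engine), the idea looks something like this:

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    identity: str          # user, device, branch, IoT device, etc.
    device_posture: str    # e.g. "managed-patched", "unmanaged"
    location: str
    application: str

def decide(req: AccessRequest) -> str:
    """Grant access based on identity and context, not IP address or network location."""
    if req.device_posture != "managed-patched":
        return "deny"                          # unhealthy device, regardless of where it sits
    if req.application == "finance-app" and req.identity not in {"alice", "bob"}:
        return "deny"                          # per-application allow-list tied to identity
    return "allow-with-inspection"             # verified, but still continuously inspected

print(decide(AccessRequest("alice", "managed-patched", "cafe-wifi", "finance-app")))
```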

 

Lab Guide: Identity-Aware-Proxy

Identity Security with Google Cloud

Next, we will have a look at Identity security and Google Cloud. Here, I have a minimal web application with Google App Engine. Then, an Identity-Aware Proxy (IAP) restricts access based on parameters that I can configure.

Note:

  1. An identity-aware proxy (IAP) is a Google Cloud service allowing fine-grained access control to applications and resources based on user identity. By integrating with Google Cloud Identity and Access Management (IAM), IAP enables organizations to define and enforce access policies easily.
  2. IAP provides a robust solution, whether protecting sensitive data or mitigating the risk of unauthorized access.

See below; I have enabled IAP for a simple application. For access, I now need to tell the IAP services who can access the application. I do this by adding Principles.

Once an app is protected with IAP, it can use the identity information that IAP provides in the web request headers it passes through. For additional identity information, the application will get the logged-in user’s email address and a persistent unique user ID assigned by the Google Identity Service to that user. Notice below the additional lines in the application code that read the IAP-provided identity data from the X-Goog-Authenticated-User- headers that the IAP service supplies.
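A minimal sketch of what those additional lines might look like in a Flask-based App Engine handler is shown below; the route and application structure are illustrative, while the header names are the IAP identity headers described above.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    # Behind IAP, these headers are set by the proxy itself, so the app can
    # treat them as the verified identity (the JWT check described next guards
    # against IAP being turned off or bypassed). Header lookup is case-insensitive.
    email = request.headers.get("X-Goog-Authenticated-User-Email", "")
    user_id = request.headers.get("X-Goog-Authenticated-User-ID", "")
    return f"Hello {email or 'anonymous'} (id: {user_id or 'n/a'})"
```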

Note:

If there is a risk of IAP being turned off or bypassed, your app can check to make sure the identity information it receives is valid. This uses a third web request header added by IAP called X-Goog-IAP-JWT-Assertion. The header’s value is a cryptographically signed object containing user identity data. Your application can verify the digital signature and use the data provided in this object to ensure that IAP provided it without alteration.

Digital signature verification requires several extra steps, such as retrieving the latest set of Google public keys. You can decide whether your application needs these additional steps based on the risk that someone can turn off or bypass IAP and the application’s sensitivity.
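A sketch of that verification step in Python, using the google-auth library and Google's published IAP public keys, might look like the following. The audience string is a placeholder you would take from your IAP settings (for App Engine it typically has the form /projects/PROJECT_NUMBER/apps/PROJECT_ID).

```python
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

IAP_CERTS_URL = "https://www.gstatic.com/iap/verify/public_key"

def validate_iap_jwt(iap_jwt: str, expected_audience: str):
    """Verify the signed assertion IAP adds in the X-Goog-IAP-JWT-Assertion header."""
    decoded = id_token.verify_token(
        iap_jwt,
        google_requests.Request(),
        audience=expected_audience,
        certs_url=IAP_CERTS_URL,   # Google's public keys for IAP signatures
    )
    # The verified claims carry the same identity as the plain headers.
    return decoded["sub"], decoded["email"]
```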

IAP Key Features and Benefits

a) Secure Access Control: IAP offers granular control over who can access specific resources, ensuring that only authorized individuals can gain entry. By leveraging context-aware access policies, organizations can define rules based on user attributes, device security status, and more.

b) Multi-Factor Authentication (MFA): IAP supports using MFA, adding an extra layer of security to the authentication process. The risk of unauthorized access is further reduced by requiring users to provide additional verification factors such as SMS codes or security keys.

c) Centralized Logging and Auditing: IAP provides detailed logs and audit trails, allowing organizations to monitor and track access attempts. This enhances visibility and enables swift action against potential security threats.

Implementing Identity-Aware Proxy

Implementing IAP within your Google Cloud environment is a straightforward process. By following these steps, you can ensure a seamless integration:

a) Enabling IAP: Start by enabling IAP in the Google Cloud Console for the desired project. This will activate the necessary APIs and services.

b) Configuring Access Policies: Define access policies based on user identity, resource paths, and other criteria using the Cloud Console or the IAP API.

c) Fine-Tuning Authentication Methods: Customize the authentication methods according to your organization’s security requirements. This includes enabling MFA and deciding whether to allow or deny unauthenticated users.

Conclusion: Identity-Aware Proxy (IAP) is a robust security solution offered by Google Cloud. With its granular access control, multi-factor authentication, and centralized logging capabilities, IAP provides organizations with the means to ensure secure access to their cloud resources. By implementing IAP, businesses can enhance their security posture and protect against potential threats.

Security and Identity

With a SASE platform, when we create an object, such as a policy in the networking domain, it is then available in other domains, such as security. So, any policies assigned to users are tied to that user regardless of network location. This removes the complexity of managing network and security policies across multiple areas, users, and devices. Again, all of this can be done from one platform.

Also, when examining security solutions, many buy individual appliances that focus on one job. To troubleshoot, you need to gather information, such as the logs, from each device. A SIEM is valuable, but it is resource-heavy and can only be justified in some organizations. For those who don’t have ample resources, the manual process is backbreaking, and there will be false positives.

sase security
Diagram: SASE security. The PoP architecture.

SASE Definition with Challenge 1: Managing the Network

Looking across the entire networking and security industry, everyone sells individual point solutions that are not a holistic joined-up offering. Thinking only about MPLS replacement leads to incremental, point solution acquisitions when confronted by digital initiatives, making their networks more complex and costly.

Principally, distributed appliances for network and security at every location require additional tasks such as installation, ongoing management, regular updates, and refreshes. This results in far too many security and network configuration points. We see this all the time with NOC and SOC integration efforts.

Numerous integration points

The point-solution approach addresses one issue and requires a considerable amount of integration. Therefore, you must constantly add solutions to the stack, likely resulting in management overhead and increased complexity. Let’s say you are searching for a new car. Would you prefer to build the car with all the different parts or buy the already-built one?

In the same way, if we examine the network and security industry, the way it is presently geared up means everything is provided in parts. It’s your job to support, manage, and build the stack over time and scale it when needed. Fundamentally, you need to be an expert in all the different parts.

However, if you abstract the complexity into one platform, you don’t need to be an expert in everything. SASE is one of the effective ways to abstract management and operational complexity.

SASE Meaning: How SASE solves this

Converging network and security into a single platform does not require multiple integration points. This will eliminate the need to deploy these point solutions and the complexities of managing each. Essentially, with SASE, we can bring each point solution functionalities together and place them under one hood – the SASE cloud. SASE merges all of the networking and security capabilities into a single platform.

This way, you now have a holistic joined-up offering. Customers don’t need to perform upgrades, size, and scale their network. Instead, all this is done for them in the SASE cloud, creating a fully managed and self-healing architecture.

Besides, the disruption is minimal if something goes wrong in one of the SASE PoPs. All of this is automatic, and there is no need to set up new tunnels or have administrators step in to perform configurations.

sase definition
Diagram: SASE definition. No more point solutions.

SASE Definition with Challenge 2: Site Connectivity

SD-WAN appliances require other solutions for global connectivity and to connect, secure, and manage mobile users and cloud resources. As a result, many users are turning to Service Providers to handle the integration. The carrier-managed SD-WAN providers integrate a mix of SD-WAN and security devices to form SD-WAN services.

Unfortunately, this often makes the Service Providers inflexible in accommodating new requests. The telco’s lack of agility and high bandwidth costs will remain problematic. Deploying new locations has been the biggest telco-related frustration, especially when connecting offices outside of the telco’s operating region to the company’s MPLS network. For this, they need to integrate with other telcos.

Video: SD-WAN

In the following video, we will address the basics of SD-WAN and the challenges of the existing WAN. We will also go through popular features of SD-WAN and integration points with, for example, SASE.

SD WAN Tutorial

SASE Meaning: How SASE solves this

SASE handles all of the complexities of management. As a result, the administrative overhead for managing and operating a global network that supports site-to-site connectivity and enhanced security, cloud, and mobility is kept to an absolute minimum.

SASE Definition with Challenge 3: Performance Between Locations

The throughput is primarily determined by latency and packet loss, not bandwidth. Therefore, for an optimal experience for global applications, we must explore ways to manage the latency and packet loss end-to-end for last-mile and middle-mile segments. Most SD-WAN vendors don’t control these segments, affecting application performance and service agility.

Consequently, there will be constant tweaking at the remote ends to attain the best performance for your application. With SD-WAN, we can bundle transports and perform link bonding to solve the last mile. However, this does not create any benefits for the middle mile bandwidth.

MPLS will help you overcome the middle-mile problems, but you will likely pay a high price.

Define SASE
Diagram: Define SASE. Link Bonding is only suitable for last-mile performance.

SASE Meaning: How SASE solves this

The SASE cloud already has an optimized converged network and security platforms. Therefore, sites need to connect to the nearest SASE PoP. This way, the sites are placed on the global private backbone to take advantage of global route optimization, dynamic path selection, traffic optimization, and end-to-end encryption. The traffic can also be routed over MPLS, directly between sites (not through the SASE PoP), and from IPsec tunnels to third-party devices. The SASE architecture optimizes the last and middle-mile traffic flows.

Optimization techniques

The SASE global backbone has several techniques that improve the network performance, resulting in predictable, consistent latency and packet loss. The SASE cloud has complete control of each PoP and can employ optimizations. It uses proprietary routing algorithms that factor in latency, packet loss, and jitter.

These routing algorithms favor performance over cost and select the optimal route for every network packet. This is in contrast to Internet routing, where the metrics don’t consider what is best for the application or the application type.
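Purely as an illustration of the idea (the weights, thresholds, and path names are invented, not a vendor's routing algorithm), a per-path scoring function could weight measured latency, loss, and jitter like this:

```python
def path_score(latency_ms, loss_pct, jitter_ms):
    """Lower is better; the weights deliberately favor performance over cost."""
    return latency_ms + 50 * loss_pct + 2 * jitter_ms

paths = {
    "pop-dublin -> pop-newyork (SASE backbone)": (78, 0.05, 2),
    "public internet (default BGP path)":        (95, 0.80, 11),
}

best = min(paths, key=lambda p: path_score(*paths[p]))
print(best)   # the backbone path wins on loss and jitter, not just hop count
```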

SASE Definition with Challenge 4: Cloud Agility

Cloud applications are becoming the most critical to organizations, even more so than those hosted in private data centers. When delivering cloud resources, we must consider more than just providing connectivity. In the past, when we spoke about agility, we were concerned only with the addition of new on-premises sites.

However, now, this conversation needs to encompass the cloud. Primarily, delivering cloud applications is about providing an application experience as responsive as the on-premises applications. However, most SD-WANs have a low response rate for rapidly offering new public cloud infrastructure. MPLS is expensive, rigid, and not built for cloud access.

SASE Meaning: How SASE solves this

Cloud Native Meaning

SASE natively supports cloud data centers (IaaS) and applications (SaaS) without additional configuration, complexity, or point solutions, enabling built-in cloud connectivity. This further enables the rapid delivery of new public cloud infrastructure.

The SASE PoPs are collocated in the data centers, directly connected to the IXP of the leading IaaS providers, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform. In addition, cloud applications are optimized through SASE’s ability to define the egress points. This helps exit the cloud application traffic at the points closest to the customer’s application instance.

The optimal global routing algorithms can determine the best path from anywhere to the customer’s cloud application instance. This provides optimal performance to the cloud applications regardless of the user’s location.

So, when we talk about performance to the cloud with SASE, the latency to the cloud is comparable to the optimized access provided by the cloud providers, such as AWS Direct Connect or Azure Express Route. So, authentically, SASE provides out-of-the-box cloud performance.

SASE Definition with Challenge 5: Security

The security landscape is constantly evolving. Therefore, network security solutions must evolve with it. Ransomware and malware will continue to be the primary security concerns from 2020 onward. Combating them with various point solutions, each with complex integration points scattered through the network domain, is a challenge for the entire organization.

Security must be part of any WAN transformation initiative, protecting users and resources regardless of the underlying network, and managed through a single pane of glass.

However, a bundle of non-integrated security products results in appliance sprawl that hinders your security posture instead of strengthening it. The security solution must defend against emerging threats like malware/ransomware. In addition, it must boost the ability to enforce corporate security policies on mobile users.

Finally, the security solution must also address the increasing cost of buying and managing security appliances and software.

sase edge
Diagram: SASE Edge: The issues of service chaining.

Security and encryption

So, we know there is an increase in complexity due to the disparate tools required to address the different threat vectors. For example, DLP functionality can be spread across the SWG, the CASB, and a dedicated DLP product, with three different teams managing each. And what about the impact of encrypted web traffic on the security infrastructure?

The issue is that most internet traffic is now encrypted, and attackers deliver payloads, issue command-and-control instructions, and exfiltrate data over encrypted protocols. Organizations cannot decrypt all network traffic, both for performance reasons and to avoid looking at sensitive employee information.

There are also issues with the scalability of encrypted traffic management solutions, which can likewise cause performance problems.

Lab Guide: Security Backdoors

Using Bash

Bash, short for “Bourne Again SHell,” is a widely used command-line interpreter in Unix-based systems. It provides powerful scripting capabilities, making it a favorite among system administrators and developers. However, this versatility also brings the potential for misuse. This section will explain what a Bash backdoor is and how it functions.

Note:

In the following, I created a backdoor on a corporate machine to maintain persistence within the environment. I used a bash script and system configuration via cron jobs. You will then connect to the created backdoor. Here, we demonstrate how to use tools available on standard operating system installations to bypass an organization’s security controls.

Cron jobs, derived from the word “chronos” meaning time in Greek, are scheduled tasks that run automatically in the background of your server. They follow a specific syntax, using fields to specify when and how often a task should be executed. You can create precise and reliable automated processes by grasping the structure and components of cron jobs.


Analysis: First, the file called file is deleted with the rm command if it already exists. Next, a named pipe, a new communications channel, is created with that same name. Any information passed to the bash terminal, such as typed commands, is transmitted to a specific IP address and port using the pipe. The | indicates the point at which the output from one Linux command passes information to the next command. Using this single line, you can create a network connection to a specific machine, giving remote access to a user.

Analysis: First, errors when running the cron task are ignored and not printed on the screen. Then, the new cronjob is printed to the screen; in this example, the backdoor bash shell will run every minute. The output of the echoed command is then written to the cronfile with crontab. 

Conclusion: 

Backdoor access refers to a hidden method or vulnerability intentionally created within a system or software that allows unauthorized access or control. It is an alternative entry point that bypasses conventional security measures, often undetected.

While backdoor access can be misused for malicious purposes, it is essential to acknowledge that there are legitimate reasons for its existence. Government agencies may utilize backdoor access to monitor criminal activities or ensure national security. Additionally, software developers may implement backdoor access for debugging and maintenance purposes.

Stringent security measures are necessary to counter the threats posed by backdoor access. Regular system audits, vulnerability assessments, and robust encryption protocols can help identify and patch potential vulnerabilities. Fostering a security-conscious culture among users and promoting awareness of potential risks can strengthen overall cybersecurity.

Video: Stateful Inspection Firewall

We know we have a set of well-defined protocols that are used to communicate over our networks. Let’s call these communication rules. You are probably familiar with the low-layer transport protocols, such as TCP and UDP, and higher application layer protocols, such as HTTP and FTP.

Generally, we interact directly with the application layer and have networking and security devices working at the lower layers. So when Host A wants to talk to Host B, it will go through a series of communication layers with devices working at each layer. A device that works at one of these layers is a stateful firewall.

Stateful Inspection Firewall

MPLS and SD-WAN

MPLS does not protect the resources and users, certainly not those connected to the Internet. On the other hand, SD-WAN service offerings are not all created equal since many do not include firewall/security features for threat protection to protect all edges – mobile devices, sites, and cloud resources. This lack of integrated security complicates SD-WAN deployments. Also, this often leads to Malware getting past the perimeter unnoticed.

The cost involved

Security solutions are expensive, and there is never a fixed price. Some security vendors charge on usage models for quantities you cannot yet predict. This makes the planning process extremely problematic and complex. As the costs keep increasing, we often find that security professionals trade off point-security solutions due to the associated costs. This is not an effective risk-management strategy.

The security controls in mobile VPN solutions are also limited. More often than not, they are very coarse, forcing IT to open access to all the network resources. Protecting mobile users therefore requires additional security tools like next-generation firewalls (NGFWs). So again, we have another point solution. In addition, mobile VPN solutions provide no last- or middle-mile optimization.

SASE Meaning: How SASE solves this

SASE converges a complete security stack into the network, allowing SASE to bring granular control to sites and mobile and cloud resources. This is done by enforcing the zero-trust principles for all edges. SASE provides anti-malware protection for both WAN and Internet traffic. In addition, for malware detection and prevention, SASE can offer signature and machine-based learning protection consisting of several integrated anti-malware engines.

For malware communication, SASE can stop the outbound traffic to C&C servers based on reputation feeds and network behavioral analysis. Mobile user traffic is fully protected by SASE’s advanced security services, including NGFW, secure web gateway (SWG), threat prevention, and managed threat detection and response.

Furthermore, in the case of mobile, SASE mobile users can dynamically connect to the closest SASE PoP regardless of the location. Again, as discussed previously, the SASE cloud’s relevant optimizations are available for mobile users.

Rethink the WAN

The shift to the cloud, edge computing, and mobility offer new opportunities for IT professionals. To support these digital initiatives, the network professionals must rethink their approach to the WAN transformation. WAN transformation is not just about replacing MPLS with SD-WAN. It needs an all-encompassing solution that provides the proper network performance and security level for enhanced site-to-site connectivity, security, mobile, and cloud.

network security solution
Diagram: SASE, a network security solution.

SASE Meaning: SASE wraps up

SASE is a network and security architecture consolidating numerous network and security functions, traditionally delivered as siloed point solutions, into an integrated cloud service. It combines several network and security capabilities along with cloud-native security functions. The functions are delivered from the cloud by the SASE vendor.

They are essentially providing a consolidated, platform-based approach to security. We have a cloud-delivered solution consolidating multiple edge network security controls and network services into a unified solution with centralized management and distributed enforcement.

The appliance-based perimeter

Even though there has been a shift to the cloud, the traditional perimeter network security solution has remained appliance-based. The motivation for moving security controls to the cloud is better protection and performance, plus ease of deployment and maintenance.

The initial performance concerns with the earlier cloud-delivered solutions have been overcome with the introduction of optimized routing and a global footprint. However, opinion remains split: many still consider protection and performance prime reasons to keep network security solutions on-premises.

Key Components of SASE

The key components of SASE include software-defined wide-area networking (SD-WAN), cloud-native secure web gateways (SWG), zero-trust network access (ZTNA), firewall-as-a-service (FWaaS), and data loss prevention (DLP), among others. These components work harmoniously to provide organizations with a holistic and scalable solution for secure network connectivity, regardless of the location or device used by the end-user.

Benefits of SASE

SASE offers numerous benefits for organizations seeking to enhance their network infrastructure and security posture. Firstly, it provides simplified network management by consolidating various functions into a unified platform. Secondly, it offers an improved user experience through optimized connectivity and reduced latency. Additionally, SASE enables organizations to embrace cloud services securely and facilitates seamless scalability to adapt to changing business demands.

Implications for the Future

As businesses embrace digital transformation and remote work becomes more prevalent, the demand for flexible and secure network architectures like SASE is expected to skyrocket. SASE empowers organizations to overcome the limitations of traditional network setups and enables them to thrive in an increasingly dynamic and interconnected world. With its cloud-native approach and emphasis on security, SASE is poised to redefine how networks are designed and managed in the coming years.




Key SASE Definition Summary Points:

Main Checklist Points To Consider

  • The rise of SASE and the causes of digital transformation.

  • Technical details on the issues of MPLS with the lack of agility. 

  • Technical details on the SASE PoP and the converging of networking and security to a SaaS solution.

  • Discuss the numerous challenges of managing the network and how SASE solves this.

  • A final note on the appliance-based perimeter.

 

Summary: SASE Definition

With the ever-evolving landscape of technology and the increasing demand for secure and efficient networks, a new paradigm has emerged in the realm of network security – SASE, which stands for Secure Access Service Edge. In this blog post, we delved into the definition of SASE, its key components, and its transformative impact on network security.

Section 1: Understanding SASE

SASE, pronounced “sassy,” is a comprehensive framework that combines network security and wide area networking (WAN) capabilities into a single cloud-based service model. It aims to provide users with secure access to applications and data, regardless of their location or the devices they use. By converging networking and security functions, SASE simplifies the network architecture and enhances overall performance.

Section 2: The Key Components of SASE

To fully grasp the essence of SASE, it is essential to explore its core components. These include:

1. Secure Web Gateway (SWG): The SWG component of SASE ensures safe web browsing by inspecting and filtering web traffic, protecting users from malicious websites, and enforcing internet usage policies.

2. Cloud Access Security Broker (CASB): CASB provides visibility and control over data as it moves between the organization’s network and multiple cloud platforms. It safeguards against cloud-specific threats and helps enforce data loss prevention policies.

3. Firewall-as-a-Service (FWaaS): FWaaS offers scalable and flexible firewall protection, eliminating the need for traditional hardware-based firewalls. It enforces security policies and controls access to applications and data, regardless of their location.

4. Zero Trust Network Access (ZTNA): ZTNA ensures that users and devices are continuously authenticated and authorized before accessing resources. It replaces traditional VPNs with more granular and context-aware access policies, reducing the risk of unauthorized access.

Section 3: The Benefits of SASE

SASE brings numerous advantages to organizations seeking enhanced network security and performance:

1. Simplified Architecture: By consolidating various network and security functions, SASE eliminates the need for multiple-point solutions, reducing complexity and management overhead.

2. Enhanced Security: With its comprehensive approach, SASE provides robust protection against emerging threats, ensuring data confidentiality and integrity across the network.

3. Improved User Experience: SASE enables secure access to applications and data from any location, offering a seamless user experience without compromising security.

Conclusion:

In conclusion, SASE represents a paradigm shift in network security, revolutionizing how organizations approach their network architecture. By converging security and networking functions, SASE provides a comprehensive and scalable solution that addresses the evolving challenges of today’s digital landscape. Embracing SASE empowers organizations to navigate the complexities of network security and embrace a future-ready approach.

SD WAN Overlay

SD WAN Overlay

SD WAN Overlay

In today's digital age, businesses rely on seamless and secure network connectivity to support their operations. Traditional Wide Area Network (WAN) architectures often struggle to meet the demands of modern companies due to their limited bandwidth, high costs, and lack of flexibility. A revolutionary SD-WAN (Software-Defined Wide Area Network) overlay has emerged to address these challenges, offering businesses a more efficient and agile network solution. This blog post will delve into SD-WAN overlay, exploring its benefits, implementation, and potential to transform how businesses connect.

SD-WAN employs the concepts of overlay networking. Overlay networking is a virtual network architecture that allows for the creation of multiple logical networks on top of an existing physical network infrastructure. It involves the encapsulation of network traffic within packets, enabling data to traverse across different networks regardless of their physical locations. This abstraction layer provides immense flexibility and agility, making overlay networking an attractive option for organizations of all sizes.

Scalability: One of the key advantages of overlay networking is its ability to scale effortlessly. By decoupling the logical network from the underlying physical infrastructure, organizations can rapidly deploy and expand their networks without disruption. This scalability is particularly crucial in cloud environments or scenarios where network requirements change frequently.

Security and Isolation: Overlay networks provide enhanced security by isolating different logical networks from each other. This isolation ensures that data traffic remains segregated and prevents unauthorized access to sensitive information. Additionally, overlay networks can implement advanced security measures such as encryption and access control, further fortifying network security.

Highlights: SD WAN Overlay

The Role of SD-WAN Overlays

SD-WAN overlay is a network architecture that enhances traditional WAN infrastructure by leveraging software-defined networking (SDN) principles. Unlike conventional WAN, where network management is done manually and requires substantial hardware investments, SD-WAN overlay centralizes network control and management through software. This enables businesses to simplify network operations and reduce costs by utilizing commodity internet connections alongside existing MPLS networks. 

SD-WAN, or Software-Defined Wide Area Network, is a technology that simplifies the management and operation of a wide area network. It abstracts the underlying network infrastructure and provides a centralized control plane for configuring and managing network services. SD-WAN overlay takes this concept further by introducing an additional virtualization layer, enabling the creation of multiple logical networks on top of the physical network infrastructure.

SD WAN Overlay 

Overlay Types

  • Tunnel-Based Overlays

  • Segment-Based Overlays

  • Policy-Based Overlays

  • Internet-Based SD-WAN Overlay

  • Hybrid Overlays

  • Cloud-Enabled Overlays

  • MPLS-Based SD-WAN Overlay

  • Hybrid SD-WAN Overlay

So, what exactly is an SD-WAN overlay?

In simple terms, it is a virtual layer added to the existing network infrastructure. These network overlays connect different locations, such as branch offices, data centers, and the cloud, by creating a secure and reliable network.

1. Tunnel-Based Overlays:

One of the most common types of SD-WAN overlays is tunnel-based overlays. This approach encapsulates network traffic within a virtual tunnel, allowing it to traverse multiple networks securely. Tunnel-based overlays are typically implemented using IPsec or GRE (Generic Routing Encapsulation) protocols. They offer enhanced security through encryption and provide a reliable connection between the SD-WAN edge devices.

GRE over IPsec
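
As an illustrative sketch of a tunnel-based overlay (addresses, the pre-shared key, and interface names below are hypothetical), a GRE tunnel protected by an IPsec profile on a Cisco IOS edge router might look like this:

! IKE/IPsec building blocks (hypothetical pre-shared key and peer)
crypto isakmp policy 10
 encryption aes 256
 authentication pre-share
 group 14
crypto isakmp key GRE-SECRET address 203.0.113.2
!
crypto ipsec transform-set TSET esp-aes 256 esp-sha256-hmac
 mode transport
!
crypto ipsec profile GRE-PROT
 set transform-set TSET
!
! GRE tunnel carrying the overlay, encrypted by the IPsec profile
interface Tunnel0
 ip address 10.255.0.1 255.255.255.252
 tunnel source GigabitEthernet0/0
 tunnel destination 203.0.113.2
 tunnel protection ipsec profile GRE-PROT

The GRE header carries the overlay traffic, while the IPsec profile encrypts everything between the two tunnel endpoints.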

2. Segment-Based Overlays:

Segment-based overlays are designed to segment the network traffic based on specific criteria such as application type, user group, or location. This allows organizations to prioritize critical applications and allocate network resources accordingly. By segmenting the traffic, SD-WAN can optimize the performance of each application and ensure a consistent user experience. Segment-based overlays are particularly beneficial for businesses with diverse network requirements.

3. Policy-Based Overlays:

Policy-based overlays enable organizations to define rules and policies that govern the behavior of the SD-WAN network. These overlays use intelligent routing algorithms to dynamically select the most optimal path for network traffic based on predefined policies. By leveraging policy-based overlays, businesses can ensure efficient utilization of network resources, minimize latency, and improve overall network performance.

4. Hybrid Overlays:

Hybrid overlays combine the benefits of both public and private networks. This overlay allows organizations to utilize multiple network connections, including MPLS, broadband, and LTE, to create a robust and resilient network infrastructure. Hybrid overlays intelligently route traffic through the most suitable connection based on application requirements, network availability, and cost. By leveraging mixed overlays, businesses can achieve high availability, cost-effectiveness, and improved application performance.

5. Cloud-Enabled Overlays:

As more businesses adopt cloud-based applications and services, seamless connectivity to cloud environments becomes crucial. Cloud-enabled overlays provide direct and secure connectivity between the SD-WAN network and cloud service providers. These overlays ensure optimized performance for cloud applications by minimizing latency and providing efficient data transfer. Cloud-enabled overlays simplify the management and deployment of SD-WAN in multi-cloud environments, making them an ideal choice for businesses embracing cloud technologies.

Related: For additional pre-information, you may find the following helpful:

  1. Transport SDN
  2. SD WAN Diagram 
  3. Overlay Virtual Networking



SD-WAN Overlay

Key SD WAN Overlay Discussion Points:


  • WAN transformation.

  • The issues with traditional networking.

  • Introduction to Virtual WANs.

  • SD-WAN and SDN discussion.

  • SD-WAN overlay core features.

  • Drivers for SD-WAN.

Back to Basics: SD-WAN Overlay

Overlay Networking

Overlay networking is an approach to computer networking that involves building a layer of virtual networks on top of an existing physical network. This approach improves the underlying infrastructure’s scalability, performance, and security. It also allows for creating virtual networks that span multiple physical networks, allowing for greater flexibility in traffic routes.

At the core of overlay networking is the concept of virtualization. This involves separating the physical infrastructure from the virtual networks, allowing greater control over allocating resources. This separation also allows the creation of virtual network segments that span multiple physical networks. This provides an efficient way to route traffic, as well as the ability to provide additional security and privacy measures.

The diagram below displays a VXLAN overlay. So, we are using VXLAN to create the tunnel that allows Layer 2 extensions across a Layer 3 core.

Overlay networking
Diagram: Overlay Networking with VXLAN
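
As a minimal sketch of how such an overlay is built (the VLAN, VNI, and multicast group values are hypothetical), a VXLAN flood-and-learn configuration on an NX-OS-style switch maps a VLAN to a VNI and carries it over the Layer 3 core:

feature nv overlay
feature vn-segment-vlan-based
!
! Map the local VLAN to a VXLAN network identifier (VNI)
vlan 100
 vn-segment 10100
!
! The NVE interface is the VXLAN tunnel endpoint (VTEP)
interface nve1
 no shutdown
 source-interface loopback0
 member vni 10100 mcast-group 239.1.1.100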

Underlay network

A network underlay is a physical infrastructure that provides the foundation for a network overlay, a logical abstraction of the underlying physical network. The network underlay provides the physical transport of data between nodes, while the overlay provides logical connectivity.

The network underlay can comprise various technologies, such as Ethernet, Wi-Fi, cellular, satellite, and fiber optics. It is the foundation of a network overlay and essential for its proper functioning. It provides data transport and physical connections between nodes. It also provides the physical elements that make up the infrastructure, such as routers, switches, and firewalls.

Overlay networking
Diagram: Overlay networking. Source Researchgate.

SD-WAN with SDWAN overlay.

SD-WAN leverages a transport-independent fabric technology that is used to connect remote locations. This is achieved by using overlay technology. The SDWAN overlay works by tunneling traffic over any transport between destinations within the WAN environment.

This gives true flexibility to route applications across any portion of the network, regardless of the circuit or transport type. This is the definition of transport independence. Having a fabric SDWAN overlay network means that every remote site, regardless of physical or logical separation, is always a single hop away from another. DMVPN, for example, works on the same transport-agnostic design.

DMVPN configuration
Diagram: DMVPN Configuration.
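
As a rough sketch of that idea (addressing and NHRP values are hypothetical), a DMVPN hub exposes a single multipoint GRE tunnel that any spoke, over any transport, can join:

interface Tunnel0
 description DMVPN hub mGRE interface (hypothetical addressing)
 ip address 172.16.0.1 255.255.255.0
 ip nhrp network-id 1
 ip nhrp map multicast dynamic
 tunnel source GigabitEthernet0/0
 tunnel mode gre multipoint
 tunnel key 1

Because the tunnel endpoint is just an IP address, the underlay can be MPLS, broadband, or LTE without changing the overlay.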

SD-WAN overlays offer several advantages over traditional WANs, including improved scalability, reduced complexity, and better control over traffic flows. They also provide better security, as each site is protected by its own dedicated security policies. Additionally, SD-WAN overlays can improve application performance and reliability and reduce latency.

We need more bandwidth.

Modern businesses demand more bandwidth than ever to connect their data, applications, and services. As a result, we have many things to consider with the WAN, such as regulations, security, visibility, branch and data center sites, remote workers, internet access, cloud, and traffic prioritization. These factors are driving the need for SD-WAN.

The concepts and design principles of creating a wide area network (WAN) to provide resilient and optimal transit between endpoints have continuously evolved. However, the driver behind building a better WAN is to support applications that demand performance and resiliency.

SD WAN Overlay 

Key SD WAN Features

  • Full-stack observability

  • Not all traffic treated equally

  • Combining all transports

  • Intelligent traffic steering

  • Controller-based policy

Lab Guide: PfR Operations

In the following guide, I will address PfR. PfR is all about optimizing traffic and originated from OER. OER was a good step forward, but it only performs prefix-based route optimization, and optimizing per prefix is not good enough today. Nowadays, it’s all about application-based optimization.

Performance routing (PfR) is similar to OER but can optimize our routing based on application requirements. OER and PfR are technically 95% identical, but Cisco rebranded OER as PfR.

In the diagram below, we have the following:

  • H1 is a traffic generator that sends traffic to the ISP router loopback interfaces.
  • MC, BR1, and BR2 run iBGP.
  • MC is our master controller.
  • BR1 and BR2 are border routers.
  • Between AS 1 and AS 2 we run eBGP.

Performance based routing

Note:

First, we will look at the MC device and the default routing. We see two entries for the 10.0.0.0/8 network; iBGP uses BR1 as the exit point. 

Once PfR is configured, we can check the settings on the MC and the border routers.
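
For reference, a minimal PfR setup for this kind of topology might resemble the sketch below (the addresses, interface roles, and key-chain names are hypothetical and will differ from the actual lab values). On the master controller:

key chain PFR
 key 1
  key-string PFR-SECRET
!
pfr master
 border 10.1.1.2 key-chain PFR
  interface GigabitEthernet0/1 external
  interface GigabitEthernet0/0 internal
 border 10.1.1.3 key-chain PFR
  interface GigabitEthernet0/1 external
  interface GigabitEthernet0/0 internal
 learn
  throughput
  delay

And on each border router:

key chain PFR
 key 1
  key-string PFR-SECRET
!
pfr border
 local Loopback0
 master 10.1.1.1 key-chain PFR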

Performance based routing

Analysis:

Cisco PfR, or Cisco Performance Routing, is an advanced technology designed to optimize network traffic flows. Unlike traditional routing protocols, PfR considers various factors such as network conditions, link capacities, and application requirements to select the most efficient path for data packets dynamically. This intelligent routing approach ensures enhanced performance and optimal resource utilization.

Key Features of Cisco PfR

1. Intelligent Path Selection: Cisco PfR analyzes real-time network data to determine the best path for traffic flows, considering factors like latency, delay, and link availability. It dynamically adapts to changing network conditions, ensuring optimal performance.

2. Application-Aware Routing: PfR goes beyond traditional routing protocols by considering the specific requirements of applications running on the network. It can prioritize critical applications, allocate bandwidth resources accordingly, and optimize performance for different types of traffic.

Cisco PfR

Benefits of Cisco PfR

1. Improved Network Performance: PfR can dynamically adapt to network conditions, optimizing traffic flows, reducing latency, and enhancing overall network performance. This results in improved user experience and increased productivity.

2. Efficient Utilization of Network Resources: Cisco PfR intelligently distributes traffic across available network links, optimizing resource utilization. Leveraging multiple paths balances the load and prevents congestion, leading to better bandwidth utilization.

3. Enhanced Application Performance: PfR’s application-aware routing ensures that critical applications receive the required bandwidth and quality of service. This prioritization improves application performance, minimizing delays and ensuring a smooth user experience.

4. Simplified Network Management: PfR provides detailed visibility into network performance, allowing administrators to identify and troubleshoot issues more effectively. Its centralized management interface simplifies configuration and monitoring, making network management less complex.

Implementation Considerations

Certain factors must be considered before implementing Cisco PfR. Evaluate the network infrastructure, identify critical applications, and determine the desired performance goals. Proper planning and configuration are essential to maximizing the benefits of PfR.

Knowledge Check: Application-Aware Routing (AAR) with Cisco SD-WAN

If you have multiple connections, such as an MPLS and an Internet connection, both may be actively used (depending on the OMP best-path selection). There might be a better approach. Your MPLS connection may support QoS, while your Internet connection is best effort. A business application that requires QoS should use the MPLS link, while web traffic should only use the Internet connection.

What if MPLS performance degrades? Temporarily switching to the Internet connection could improve the end-user experience.

Multiple connections to the Internet are another example. Fiber, cable, DSL, or 4G might all be available, and you should be able to select the best connection every time.

With Application-Aware Routing (AAR), we can determine which applications should use which WAN connection, and we can failover based on packet loss, jitter, and delay. AAR tracks network statistics from Cisco SD-WAN data plane tunnels to determine the optimal traffic path.
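
As a rough sketch of the idea (the class, policy, and list names are hypothetical, the list definitions and apply-policy stanza are omitted, and the exact syntax varies by release), a centralized application-aware routing policy on a vSmart controller pairs an SLA class with a preferred transport color:

policy
 sla-class VOICE-SLA
  loss 1
  latency 150
  jitter 30
 !
 app-route-policy AAR-POLICY
  vpn-list CORP-VPNS
   sequence 10
    match
     dscp 46
    !
    action
     sla-class VOICE-SLA preferred-color mpls
    !
   !
  !
 !

Traffic matching the sequence stays on the MPLS color as long as that tunnel meets the loss, latency, and jitter thresholds; otherwise it is moved to a compliant path.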

Knowledge Check: NBAR

NBAR, short for Network-Based Application Recognition, is a technology that allows network devices to identify and classify network protocols and applications traversing the network. Unlike traditional network traffic analysis methods that rely on port numbers alone, NBAR utilizes deep packet inspection to identify applications based on their unique signatures and traffic patterns. This granular level of visibility enables network administrators to gain valuable insights into the type of traffic flowing through their networks.

Application Recognition

NBAR finds extensive use in various scenarios. From a network performance perspective, it assists in traffic shaping and bandwidth management, ensuring optimal resource allocation. Moreover, NBAR plays a vital role in Quality of Service (QoS) implementations, facilitating the prioritization of mission-critical applications. Additionally, NBAR’s application recognition capabilities are essential in network troubleshooting, as they help pinpoint the source of congestion and performance issues.
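
As a brief illustration (the class and policy names are hypothetical), NBAR-based classification can feed a QoS policy on a WAN edge interface:

! Classify traffic by application signature rather than port number
class-map match-any CRITICAL-APPS
 match protocol citrix
 match protocol ssl
!
policy-map WAN-EDGE-QOS
 class CRITICAL-APPS
  bandwidth percent 40
 class class-default
  fair-queue
!
interface GigabitEthernet0/0
 ip nbar protocol-discovery
 service-policy output WAN-EDGE-QOS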

SD WAN Overlay: Implementation Considerations

Network Assessment: A thorough network assessment is crucial before implementing the SD-WAN overlay. This includes evaluating existing network infrastructure, bandwidth requirements, application performance, and security protocols. A comprehensive assessment helps identify potential bottlenecks and ensures a smooth transition to the new technology.

Vendor Selection: Choosing the right SD-WAN overlay vendor is vital for a successful implementation. Factors to consider include scalability, security features, ease of management, and compatibility with existing network infrastructure. Evaluating multiple vendors and seeking recommendations from industry experts can help make an informed decision.

Key Considerations for Implementation

Before implementing an SD-WAN overlay, assessing your organization’s specific requirements and goals is essential. Consider network architecture, security needs, scalability, and integration with existing systems. Conduct a thorough evaluation to determine your business’s most suitable SD-WAN solution.

Overcoming Implementation Challenges

Implementing an SD-WAN overlay may present challenges. Common hurdles include network compatibility, data migration, and seamless integration with existing infrastructure. Identify potential roadblocks early on and work closely with your SD-WAN provider to develop a comprehensive implementation plan.

Best Practices for Successful Deployment

To ensure a smooth and successful SD-WAN overlay implementation, follow these best practices:

a. Conduct a pilot phase: Test the solution in a controlled environment to identify and address potential issues before full-scale deployment.

b. Prioritize security: Implement robust security measures to protect your network and data. Consider features like encryption, firewalls, and intrusion prevention systems.

c. Optimize for performance: Leverage SD-WAN overlay’s advanced traffic management capabilities to optimize application performance and prioritize critical traffic.

Monitoring and Maintenance

Once the SD-WAN overlay is implemented, continuous monitoring and maintenance are crucial. Regularly assess network performance, address any bottlenecks, and apply updates as necessary. Implement proactive monitoring tools to detect and resolve issues before they impact operations.

WAN Innovation

The WAN is the entry point between inside the perimeter and outside. An outage in the WAN has a large blast radius, affecting many applications and branch site connectivity. Yet the WAN saw little innovation until now, with the advent of SD-WAN and SASE. SASE is a combination of network and security functions.

SASE Network

If you look at the history of the WAN, you will see that there have been several stages of WAN virtualization. Most WAN transformation projects went from basic hub-and-spoke topologies based on services such as leased lines to fully meshed MPLS-based WAN services. Cost was the main driver for this evolution, not agility.

wide area network
Diagram: Wide Area Network: WAN Technologies.

Issues with the Traditional Network

To understand SD-WAN, we must first discuss some “problems” with traditional WAN connections. There are two types of WAN connections: private and public. Here is how the two compare:

  • Cost: MPLS connections are much more expensive than regular Internet connections.

  • Time to deploy: Private WAN connections take longer than regular Internet connections.

  • Service level agreements: Service providers offer SLAs for private WAN connections but not for regular Internet connections. Several Internet providers offer SLAs for “business” class connections, but they are much more expensive than regular (consumer) connections.

  • Packet loss: Private WAN connections like MPLS experience lower packet loss than Internet connections.

  • Quality of service: Internet connections do not offer quality of service. Outgoing traffic can be prioritized, but that’s it; the Internet itself is like the Wild West. Private WAN connections often support end-to-end quality of service.

As the world of I.T. becomes dispersed, the network and security perimeters dissolve and become less predictable. Before, it was easy to know what was internal and external, but now we live in a world of micro-perimeters with a considerable change in the focal point.

The perimeter is now the identity of the user and device – not the fixed point at an H.Q. site. As a result, applications require a WAN to support distributed environments, flexible network points, and a change in the perimeter design.

Suboptimal traffic flow

The optimal route is the fastest or most efficient path and is therefore preferred for transferring data. Sub-optimal routes are slower and hence not selected. Centralized-only designs resulted in suboptimal traffic flow and increased latency, which degrades application performance.

A key point to note is that traditional networks focus on centralized points in the network that all applications, network, and security services must adhere to. These network points are fixed and cannot be changed.

Network point intelligence

However, the network should be evolved to have network points positioned where it makes the most sense for the application and user. Not based on, let’s say, a previously validated design for a different application era. For example, many branch sites do not have local Internet breakouts.

So, for this reason, we backhauled internet-bound traffic to secure, centralized internet portals at the H.Q. site. As a result, we sacrificed the performance of Internet and cloud applications. Designs that place the H.Q. site at the center of connectivity requirements inhibit the dynamic access requirements for digital business.

Hub and spoke drawbacks.

Simple spoke-type networks are sub-optimal because you always have to go to the center point of the hub and then out to the machine you need rather than being able to go directly to whichever node you need. As a result, the hub becomes a bottleneck in the network as all data must go through it. With a more scattered network using multiple hubs and switches, a less congested and more optimal route could be found between machines.

Knowledge Check: DMVPN as an overlay technology

DMVPN, an acronym for Dynamic Multipoint Virtual Private Network, is a Cisco proprietary solution that provides a scalable and flexible approach to creating virtual private networks over the Internet. Unlike traditional VPNs requiring point-to-point connections, DMVPN utilizes a hub-and-spoke architecture, allowing multiple remote sites to communicate securely.

How DMVPN Works

a) Establishing an mGRE (Multipoint GRE) Tunnel: DMVPN begins by creating a multipoint GRE tunnel, allowing spoke routers to connect to the hub router. This sets the foundation for the overlay.

b) Dynamic Routing Protocol Integration: Once the mGRE tunnel is established, a dynamic routing protocol, such as EIGRP or OSPF, propagates routing information. This allows spoke routers to learn about other remote networks dynamically.

c) IPsec Encryption: To ensure secure communication over the internet, IPsec encryption is applied to the DMVPN tunnels. This provides confidentiality, integrity, and authentication, safeguarding data transmitted between sites.

Note that these are deployment building blocks; the DMVPN "phases" (Phase 1, 2, and 3) describe how spoke-to-spoke traffic is handled, with Phase 3 using NHRP redirects and shortcuts to build dynamic spoke-to-spoke tunnels.

DMVPN Phase 3
Diagram: DMVPN Phase 3 configuration
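
For illustration (addresses and NHRP values are hypothetical), the Phase 3 behavior comes from ip nhrp redirect on the hub and ip nhrp shortcut on the spokes, on top of the mGRE tunnel shown earlier:

! Hub: add NHRP redirect to the existing mGRE tunnel
interface Tunnel0
 ip nhrp redirect
!
! Spoke: mGRE tunnel with NHRP shortcut (hypothetical addressing)
interface Tunnel0
 ip address 172.16.0.2 255.255.255.0
 ip nhrp network-id 1
 ip nhrp nhs 172.16.0.1
 ip nhrp map 172.16.0.1 203.0.113.1
 ip nhrp map multicast 203.0.113.1
 ip nhrp shortcut
 tunnel source GigabitEthernet0/0
 tunnel mode gre multipoint
 tunnel key 1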

A key point on MPLS agility

Multiprotocol Label Switching, or MPLS, is a networking technology that forwards traffic based on short, fixed-length “labels” rather than network addresses, handling forwarding over private wide area networks. As a protocol-independent solution, MPLS assigns a label to each data packet, controlling the path the packet follows. As a result, MPLS significantly improves forwarding efficiency, but it has some drawbacks.

MPLS VPN
Diagram: MPLS VPN

MPLS topologies, once they are provisioned, are challenging to modify. Community tagging and matching provide some degree of flexibility and are commonly used, meaning the customers set BGP communities on prefixes for specific applications. The SP matches these communities and sets traffic engineering parameters like the MED and Local Preference. However, the network topology essentially remains fixed.
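
As a small, hypothetical example of that community-based flexibility, a provider edge might match a customer-set community and adjust Local Preference for the affected prefixes:

ip community-list standard CUST-LOW-PREF permit 65000:90
!
route-map FROM-CUSTOMER permit 10
 match community CUST-LOW-PREF
 set local-preference 90
route-map FROM-CUSTOMER permit 20
!
router bgp 64500
 neighbor 192.0.2.1 remote-as 65000
 neighbor 192.0.2.1 route-map FROM-CUSTOMER in

This tweaks path selection per prefix, but the underlying MPLS topology itself stays fixed.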

digital transformation
Diagram: Networking: The cause of digital transformation.

Connecting remote sites to cloud offerings, such as SaaS or IaaS, is far more efficient over the public Internet. There are many drawbacks to backhauling traffic to a central data center when it is not required; it is more efficient to go direct. SD-WAN shares similar overlay technologies with the DMVPN phases, allowing your branch sites to go directly to cloud-based applications without backhauling to the central H.Q.

Introducing the SD-WAN Overlay

A software-defined wide area network is a wide area network that uses software-defined network technology, such as communicating over the Internet using SDWAN overlay tunnels that are encrypted when destined for internal organization locations. SD-WAN is software-defined networking for the wide area network.

SD-WAN decouples (separates) the WAN infrastructure, whether physical or virtual, from its control plane mechanism and allows applications or application groups to be placed into virtual WAN overlays.

Types of SD-WAN and the SD-WAN overlay: The virtual WANs 

The separation allows us to bring many enhancements and improvements to a WAN that has had very little innovation in the past compared to the rest of the infrastructure, such as server and storage modules. With server virtualization, several virtual machines create application isolation on a physical server.

For example, applications placed in separate VMs operate in isolation from each other, yet the VMs are installed on the same physical host.

Consider SD-WAN to operate with similar principles. Each application or group can operate independently when traversing the WAN to endpoints in the cloud or other remote sites. These applications are placed into a virtual SDWAN overlay.

Cisco SD WAN Overlay
Diagram: Cisco SD-WAN overlay. Source Network Academy

SD-WAN overlay and SDN combined

  • The Fabric

The word fabric comes from the fact that there are many paths from one server to another, which eases load balancing and traffic distribution. SDN aims to centralize the control that enables the distribution of flows over all the fabric paths. Then, we have an SDN controller device. The SDN controller can also control several fabrics simultaneously, managing intra- and inter-data center flows.

  • SD-WAN overlay includes SDN

SD-WAN is used to control and manage a company’s multiple WANs. There are different types of WAN transport: Internet, MPLS, LTE, DSL, fiber, and so on. SD-WAN uses SDN technology to control the entire environment. Like SDN, the data plane and control plane are separated. A centralized controller is added to manage flows, routing and switching policies, packet priority, network policies, and more. SD-WAN technology is based on an overlay, with logical nodes built on top of the underlying networks.

  • Centralized logic

In a traditional network, both the transport functions and the control logic are resident on each device. This is why any configuration or change must be done box-by-box. Configuration was carried out manually or, at best, with an Ansible script. SD-WAN brings Software-Defined Networking (SDN) concepts to the enterprise branch WAN.

Software-defined networking (SDN) is an architecture, whereas SD-WAN is a technology that can be purchased and built on SDN’s foundational concepts. SD-WAN’s centralized logic stems from SDN. SDN separates the control from the data plane and uses a central controller to make intelligent decisions, similar to the design that most SD-WAN vendors operate.

  • A holistic view

The controller has a holistic view. Same with the SD-WAN overlay. The controller supports central policy management, enabling network-wide policy definitions and traffic visibility. The SD-WAN edge devices perform the data plane. The data plane is where the simple forwarding occurs, and the control plane, which is separate from the data plane, sets up all the controls for the data plane to forward.

Like SDN, the SD-WAN overlay abstracts network hardware into a control plane with multiple data planes to make up one large WAN fabric. As the control layer is abstracted, decoupled from the physical layer, and running in software, services can be virtualized and delivered from a central location to any point on the network.

sd-wan technology
Diagram: SD-WAN technology: The old WAN vs the new WAN.

Types of SD WAN and SD-WAN Overlay Features

Enterprises that employ SD-WAN solutions for their network architecture will simplify the complexity of their WAN. Enterprises should evaluate the available deployment options, ranging from thin devices with most of the functionality in the cloud to thicker devices at the branch location performing most of the work. Whichever SD-WAN vendor you choose will have similar features.

Today’s WAN environment requires us to manage many elements: numerous physical components that include both network and security devices, complex routing protocols and configurations, complex high-availability designs, and various path optimizations and encryption techniques. 

Gaining the SD-WAN benefits

Employing the features discussed below will allow you to gain the benefits of SD-WAN: its higher capacity bandwidth, centralized management, network visibility, and multiple connection types. In addition, SD-WAN technology allows organizations to use connection types that are cheaper than MPLS.

virtual private network
Diagram: SD-WAN features: Virtual Private Network (VPN).

Types of SD WAN: Combining the transports

At its core, SD-WAN shapes and steers application traffic across multiple WAN transports. Building on the concept of link bonding, which combines numerous transports and transport types, the SD-WAN overlay improves the idea by moving the functionality up the stack. First, SD-WAN aggregates last-mile services, representing them as a single pipe to the application.

SD-WAN allows you to combine all transport links into one big pipe. SD-WAN is transport agnostic. As it works by abstraction, it does not care what transport links you have. Maybe you have MPLS, private Internet, or LTE. It can combine all these or use them separately.

Types of SD WAN: Central location

From a central location, SD-WAN pulls all of these WAN resources together, creating one large WAN fabric that allows administrators to slice up the WAN to match the application requirements that sit on top. Different applications traverse the WAN, so we need the WAN to react differently.

For example, if you’re running a call center, you want low delay, low latency, and high availability for voice traffic. You may wish to steer this traffic over a path with a strong service-level agreement.

SD WAN traffic steering
Diagram: SD-WAN traffic steering. Source Cisco.

Types of SD WAN: Traffic steering

Traffic steering may also be required: for example, moving voice traffic to another path if the first path is experiencing high latency. If it’s not possible to steer traffic automatically to a better-performing link, a series of path remediation techniques can be run to try to improve performance. File transfer differs from real-time voice: you can tolerate more delay but need more bandwidth.

Here, you may want to use a combination of WAN transports (such as customer broadband and LTE) to achieve higher aggregate bandwidth. This also allows you to automatically steer traffic over different WAN transports when there is degradation on one link. With the SD-WAN overlay, we must start thinking about paths, not links.

SD-WAN overlay makes intelligent decisions

At its core, SD-WAN enables real-time application traffic steering over any link, such as broadband, LTE, and MPLS, assigning pre-defined policies based on business intent. Steering policies support many application types, making intelligent decisions about how WAN links are utilized and which paths are taken.

computer networking
Diagram: Computer networking: Overlay technology.

Types of SD WAN: Steering traffic

The concept of an underlay and overlay are not new, and SD-WAN borrows these designs. First, the underlay is the physical or virtual world, such as the physical infrastructure. Then, we have the overlay, where all the intelligence can be set. The SDWAN overlay represents the virtual WANs that hold your different applications.

A virtual WAN overlay enables us to steer traffic and combine all bandwidths. Similar to how applications are mapped to V.M. in the server world, with SD-WAN, each application is mapped to its own virtual SD-WAN overlay. Each virtual SDWAN overlay can have its own SD WAN security policies, topologies, and performance requirements.

SD-WAN overlay path monitoring

SD-WAN monitors the paths and the application performance on each link (Internet, MPLS, LTE ) and then chooses the best path based on real-time conditions and the business policy. In summary, the underlay network is the physical or virtual infrastructure above which the overlay network is built. An SDWAN overlay network is a virtual network built on top of an underlying Network infrastructure/Network layer (the underlay).

Types of SD-WAN: Controller-based policy

An additional layer of information is needed to make more intelligent decisions about how and where to forward application traffic. This is the controller-based policy approach that SD-WAN offers, incorporating a holistic view.

A central controller can now make decisions based on global information, not solely on a path-by-path basis with traditional routing protocols.  Getting all the routing information and compiling it into the controller to make a decision is much more efficient than making local decisions that only see a limited part of the network.

The SD-WAN Controller provides physical or virtual device management for all SD-WAN Edges associated with the controller. This includes but is not limited to, configuration and activation, IP address management, and pushing down policies onto SD-WAN Edges located at the branch sites.

SD-WAN Overlay Case Study

I recently consulted for a private enterprise. Like many enterprises, they have many applications, both legacy and new. No one knew what applications were running over the WAN. Visibility was at an all-time low. For the network design, the H.Q. has MPLS and Direct Internet access.

There is nothing new here; this design has been in place for the last decade. All traffic is backhauled to the HQ/MPLS headend for security screening. The security stack, which includes firewalls, IDS/IPS, and anti-malware, is located at the H.Q. The remote sites have high latency and limited connectivity options.

 

types of sd wan
Diagram: WAN transformation: Network design.

More importantly, they are transitioning their ERP system to the cloud. As apps move to the cloud, they want to avoid fixed WAN, a big driver for a flexible SD-WAN solution. They also have remote branches. These branches are hindered by high latency and poorly managed IT infrastructure.

But they don’t want an I.T. representative at each site location. They have heard that SD-WAN has centralized logic and can view the entire network from one central location. These remote sites must receive large files from the H.Q., yet the branch sites’ only transport links are single customer broadband links.

The cost of remote sites

Some remote sites have LTE, and the bills are getting larger. The company wants to reduce costs with dedicated Internet access or customer/business broadband. They have heard that with SD-WAN you can combine different transports and apply several path remediations on degraded transports for better performance. So, they decided to roll out SD-WAN. From this new architecture, they gained several benefits.

SD-WAN Visibility

When your business-critical applications operate over different provider networks, it gets harder to troubleshoot and find the root cause of problems. So, visibility is critical to business. SD-WAN allows you to see network performance data in real-time and is essential for determining where packet loss, latency, and jitter are occurring so you can resolve the problem quickly.

You also need to be able to see who or what is consuming bandwidth so you can spot intermittent problems. For all these reasons, SD-WAN visibility needs to go beyond network performance metrics and provide greater insight into the delivery chains that run from applications to users.

Understand your baselines

Visibility is needed to complete the network baseline before the SD-WAN is deployed. This enables the organization to understand existing capabilities, the norm, what applications are running, the number of sites connected, which service providers are used, and whether they’re meeting their SLAs.

Visibility is critical to obtaining a complete picture so teams understand how to optimize the business infrastructure. SD-WAN gives you an intelligent edge, so you can see all the traffic and act on it immediately.

First, look at the visibility of the various flows, the links used, and any issues on those links. Then, if necessary, you can tweak the bonding policy to optimize the traffic flow. Before the rollout of SD-WAN, there was no visibility into the types of traffic or the bandwidth different apps consumed. They had limited knowledge of WAN performance.

SD-WAN offers higher visibility

With SD-WAN, they have the visibility to control and classify traffic on Layer 7 values, such as the URL being used and the domain being accessed, along with the standard port and protocol.

All applications are not equal; some run better on different links. If an application is not performing correctly, you can route it to a different circuit. With the SD-WAN orchestrator, you have complete visibility across all locations, all links, and into the other traffic across all circuits. 

SD-WAN High Availability

The goal of any high-availability solution is to ensure that all network services are resilient to failure. Such a solution aims to provide continuous access to network resources by addressing the potential causes of downtime through functionality, design, and best practices.

The previous high-availability design was active and passive with manual failover. It was hard to maintain, and there was a lot of unused bandwidth. Now, they have more efficient use of resources and are no longer tied to the bandwidth of the first circuit.

There is a more granular application failover mechanism. You can also select which apps are prioritized if a link fails or when a certain congestion ratio is hit. For example, you may have LTE as a backup, which can be very expensive. So applications marked high priority are steered over the backup link, but guest Wi-Fi traffic isn’t.

Flexible topology

Before, they had a hub-and-spoke MPLS design for all applications. They wanted a full-mesh architecture for some applications while keeping the existing hub and spoke for others. However, the service provider couldn’t accommodate the level of granularity they wanted.

With SD-WAN, they can choose topologies better suited to the application type. As a result, the network design is now more flexible and matches the application, rather than the application having to fit a network design that doesn’t suit it.

SD-WAN topology
Diagram: SD-WAN Topologies.

Going Deeper on the SD-WAN Overlay Components

SD-WAN combines transports, SDWAN overlay, and underlay

Look at it this way. With an SD-WAN topology, there are different levels of networking. There is an underlay network, the physical infrastructure, and an SDWAN overlay network. The physical infrastructure is the router, switches, and WAN transports; the overlay network is the virtual WAN overlays.

The SDWAN overlay presents a different network to the application. For example, the voice application will see only the voice overlay. The logical virtual pipe the overlay creates, which the application sees, differs from the underlay.

An SDWAN overlay network is a virtual or logical network created on top of an existing physical network. A classic example of overlay networking is the early internet, which was built as an overlay on top of the telephone network. An overlay network is any virtual layer on top of physical network infrastructure.

Consider an SDWAN overlay as a flexible tag.

This may be as simple as a virtual local area network (VLAN), but it typically refers to more complex virtual layers from an SDN or an SD-WAN. Think of an SDWAN overlay as a tag, so building the overlays is not expensive or time-consuming. In addition, you don’t need to buy physical equipment for each overlay, as the overlay is virtualized in software.

Similar to software-defined networking (SDN), the critical part is that SD-WAN works by abstraction. All the complexities are abstracted into application overlays. For example, application type A can use this SDWAN overlay, and application type B can use that SDWAN overlay. 

I.P. and port number, orchestrations, and end-to-end

Recent application requirements drive a new type of WAN that more accurately supports today’s environment with an additional layer of policy management. The world has moved away from using I.P. addresses and port numbers to identify applications and make the correct forwarding decision.

Types of SD WAN

The market for branch office wide-area network functionality is shifting from dedicated routing, security, and WAN optimization appliances to feature-rich SD-WAN. As a result, WAN edge infrastructure now incorporates a widening set of network functions, including secure routers, firewalls, SD-WAN, WAN path control, and WAN optimization, along with traditional routing functionality. Therefore, consider the following approach to deploying SD-WAN.

SD WAN Overlay Approach

SD WAN Features

  • Application-oriented WAN

  • Holistic visibility and decisions

  • Central logic

  • Independent topologies

  • Application mapping

1. Application-based approach

With SD-WAN, we are shifting from a network-based approach to an application-based approach. The new WAN no longer looks solely at the network to forward packets. Instead, it looks at the business requirements and decides how to optimize the application with the correct forwarding behavior. This new way of forwarding would be problematic when using traditional WAN architectures.

Making business logic decisions with I.P. and port number information is challenging. Standard routing is the most common way to forward application traffic today, but it only assesses part of the picture when making its forwarding decision. 

These devices have routing tables to perform forwarding. Still, with this model, they operate and make decisions on their own little island, losing the holistic view required for accurate end-to-end decision-making.

2. SD-WAN: Holistic decision

The WAN must start to make decisions holistically. The WAN should not be viewed as a single module in the network design. Instead, it must incorporate several elements it has not previously considered in order to capture the correct per-application forwarding behavior. The ideal WAN should be automatable to form a comprehensive end-to-end solution centrally orchestrated from a single pane of glass.

Managed and orchestrated centrally, this new WAN fabric is transport agnostic. It offers application-aware routing, regional-specific routing topologies, encryption on all transports regardless of link type, and high availability with automatic failover. All of these will be discussed shortly and are the essence of SD-WAN.  

3. SD-WAN and central logic        

Besides the virtual SD-WAN overlay, another key SD-WAN concept is centralized logic. On a standard router, local routing tables are computed by an algorithm to determine how to forward a packet to a given destination.

It receives routes from its peers or neighbors but computes paths locally and makes local routing decisions. The critical point to note is that everything is calculated locally. SD-WAN functions on a different paradigm.

Rather than using distributed logic, it utilizes centralized logic. This allows you to view the entire network holistically and with a distributed forwarding plane that makes real-time decisions based on better metrics than before.

This paradigm enables SD-WAN to see how the flows behave along the path. This is because they are taking the fragmented control approach and centralizing it while benefiting from a distributed system. 

The SD-WAN controller, which acts as the brain, can set different applications to run over different paths based on business requirements and performance SLAs, not on a fixed topology. So, for example, if one path does not have acceptable packet loss and latency is high, we can move to another path dynamically.

4. Independent topologies

SD-WAN has different levels of networking and brings the concepts of SDN into the Wide Area Network. Similar to SDN, we have an underlay and an overlay network with SD-WAN. The WAN infrastructure, either physical or virtual, is the underlay, and the SDWAN overlay is in software on top of the underlay where the applications are mapped.

This decoupling or separation of functions allows different application or group overlays. Previously, the application had to work with a fixed and pre-built network infrastructure. With SD-WAN, the application can choose the type of topology it wants, such as a full mesh or hub and spoke. The topologies with SD-WAN are much more flexible.

A key point: SD-WAN abstracts the underlay

With SD-WAN, the virtual WAN overlays are abstracted from the physical device’s underlay. Therefore, the virtual WAN overlays can take on topologies independent of each other without being pinned to the configuration of the underlay network. SD-WAN changes how you map application requirements to the network, allowing for the creation of independent topologies per application.

For example, mission-critical applications may use expensive leased lines, while lower-priority applications can use inexpensive best-effort Internet links. This can all change on the fly if specific performance metrics are unmet.

Previously, the application had to match and “fit” into the network with the legacy WAN, but with an SD-WAN, the application now controls the network topology. Multiple independent topologies per application are a crucial driver for SD-WAN.

types of sd wan
Diagram: SD-WAN Link Bonding.

5. The SD-WAN overlay

SD-WAN optimizes traffic over multiple available connections. It dynamically steers traffic to the best available link. If the available links show any transmission issues, it will immediately move traffic to a better path, or apply remediation to a link if, for example, you only have a single link. SD-WAN delivers application flows from a source to a destination based on the configured policy and the best available network path. A core concept of SD-WAN is the overlay.

SD-WAN solutions provide the software abstraction to create the SD-WAN overlay and decouple network software services from the underlying physical infrastructure. Multiple virtual overlays may be defined to abstract the underlying physical transport services, each supporting a different quality of service, preferred transport, and high availability characteristics.

6. Application mapping

Application mapping also allows you to steer traffic over different WAN transports. This steering is automatic and can be implemented when specific performance metrics are unmet. For example, if Internet transport has a 15% packet loss, the policy can be set to steer all or some of the application traffic over to better-performing MPLS transport.

Applications are mapped to different overlays based on business intent, not infrastructure details like IP addresses. When you think about overlays, it’s common to have three or four. For example, you may have a gold, a platinum, and a bronze SDWAN overlay, and then you map the applications to these overlays.

The applications will have different networking requirements, and overlays allow you to slice and dice your network if you have multiple application types. 

SDWAN Overlay
Diagram: Technology design: SDWAN overlay application mapping.

SD-WAN & WAN metrics

SD-WAN captures metrics that go far beyond the standard WAN measurements. For example, the traditional way would measure packet loss, latency, and jitter to determine path quality. These measurements are insufficient on their own; routing protocols using them still only make forwarding decisions at Layer 3 of the OSI model.

As we know, Layer 3 of the OSI model lacks intelligence and misses the overall user experience. Rather than relying on bits, bytes, jitter, and latency alone, we must start to look at application transactions.

SD-WAN incorporates better metrics beyond those a standard WAN edge router considers. These metrics may include application response time, network transfer time, and service response time. Some SD-WAN solutions monitor each flow’s RTT, sliding windows, and ACK delays, not just the IP or TCP headers. This creates a more accurate view of the application’s performance.
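
To contrast, traditional path monitoring on a WAN edge is typically limited to synthetic probes such as IP SLA (the target, timers, and tracking object below are hypothetical), whereas SD-WAN solutions derive the richer metrics above from live application flows:

! Periodic ICMP probe toward a hypothetical far-end target
ip sla 10
 icmp-echo 198.51.100.1 source-interface GigabitEthernet0/0
 frequency 30
ip sla schedule 10 life forever start-time now
!
! Track object that other features (e.g., static routes) can follow
track 1 ip sla 10 reachability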

SD-WAN Features and Benefits

      • Leverage all available connectivity types.

All SD-WAN vendors can balance traffic across all transports regardless of transport type, which can be done per flow or per packet. This ensures that redundant links, which previously sat idle, are actually used. SD-WAN creates an active-active network and eliminates the need to use and maintain traditional routing protocols for active-standby setups.

      • App-aware routing capabilities 

As we know, application visibility is critical to forwarding efficiently over either transport. Still, we also need to go one step further, examine deep inside the application, and understand what sub-applications exist, such as distinguishing Facebook chat from regular Facebook traffic. This allows you to balance loads across the WAN based on sub-applications.

      • Regional-specific routing topologies

Common topologies include hub and spoke, full mesh, and Internet PoP topologies. Each organization will have different requirements when choosing a topology. For example, voice should use a full-mesh design, while data may require a hub and spoke connecting to a central data center.

As we move heavily into cloud applications, local internet access/internet breakout is a better strategic option than backhauling traffic to a central site when it doesn’t need to be. SD-WAN abstracts the details of the WAN, enabling application-independent topologies. Each application can have its own topology, and this can be changed dynamically. All of this is managed by an SD-WAN control plane.

      • Centralized device management & policy administration 

With the controller-based approach that SD-WAN has, you are not embedding the control plane in the network. This allows you to centrally provision and push policies down any instructions to the data plane from a central location. This simplifies management and increases scale. The manual box-by-box approach to policy enforcement is not the way forward.

The ability to tie everything to a template and automate enables rapid branch deployments, security updates, and other policy changes. It’s much better to manage it all in one central place with the ability to dynamically push out what’s needed, such as updates and other configuration changes. 

      • High availability with automatic failovers 

You cannot apply a single viewpoint to high availability. Many components are involved in creating a high-availability plan, such as device-, link-, and site-level requirements; these should be addressed in an end-to-end solution. In addition, traditional WANs require additional telemetry information to detect failures and brown-out events.

      • Encryption on all transports, irrespective of link type 

Regardless of link type, MPLS, LTE, or the Internet, we need the capacity to encrypt all those paths without the excess baggage and complications that IPsec brings. Encryption should happen automatically, and the complexity of IPsec should be abstracted.

Summary: SD WAN Overlay

In today’s digital landscape, businesses increasingly rely on cloud-based applications, remote workforces, and data-driven operations. As a result, the demand for a more flexible, scalable, and secure network infrastructure has never been greater. This is where SD-WAN overlay comes into play, revolutionizing how organizations connect and operate.

SD-WAN overlay is a network architecture that allows organizations to abstract and virtualize their wide area networks, decoupling them from the underlying physical infrastructure. It utilizes software-defined networking (SDN) principles to create an overlay network that runs on top of the existing WAN infrastructure, enabling centralized management, control, and optimization of network traffic.

Key benefits of SD-WAN overlay 

1. Enhanced Performance and Reliability:

SD-WAN overlay leverages multiple network paths to distribute traffic intelligently, ensuring optimal performance and reliability. By dynamically routing traffic based on real-time conditions, businesses can overcome network congestion, reduce latency, and maximize application performance. This capability is particularly crucial for organizations with distributed branch offices or remote workers, as it enables seamless connectivity and productivity.

2. Cost Efficiency and Scalability:

Traditional WAN architectures can be expensive to implement and maintain, especially when organizations need to expand their network footprint. SD-WAN overlay offers a cost-effective alternative by utilizing existing infrastructure and incorporating affordable broadband connections. With centralized management and simplified configuration, scaling the network becomes a breeze, allowing businesses to adapt quickly to changing demands without breaking the bank.

3. Improved Security and Compliance:

In an era of increasing cybersecurity threats, protecting sensitive data and ensuring regulatory compliance are paramount. SD-WAN overlay incorporates advanced security features to safeguard network traffic, including encryption, authentication, and threat detection. Businesses can effectively mitigate risks, maintain data integrity, and comply with industry regulations by segmenting network traffic and applying granular security policies.

4. Streamlined Network Management:

Managing a complex network infrastructure can be a daunting task. SD-WAN overlay simplifies network management with centralized control and visibility, enabling administrators to monitor and manage the entire network from a single pane of glass. This level of control allows for faster troubleshooting, policy enforcement, and network optimization, resulting in improved operational efficiency and reduced downtime.

5. Agility and Flexibility:

In today’s fast-paced business environment, agility is critical to staying competitive. SD-WAN overlay empowers organizations to adapt rapidly to changing business needs by providing the flexibility to integrate new technologies and services seamlessly. Whether adding new branch locations, integrating cloud applications, or adopting emerging technologies like IoT, SD-WAN overlay offers businesses the agility to stay ahead of the curve.

Implementation of SD-WAN Overlay:

Implementing SD-WAN overlay requires careful planning and consideration. The following steps outline a typical implementation process:

1. Assess Network Requirements: Evaluate existing network infrastructure, bandwidth requirements, and application performance needs to determine the most suitable SD-WAN overlay solution.

2. Design and Architecture: Create a network design incorporating SD-WAN overlay while considering factors such as branch office connectivity, data center integration, and security requirements.

3. Vendor Selection: Choose a reliable and reputable SD-WAN overlay vendor based on their technology, features, support, and scalability.

4. Deployment and Configuration: Install the required hardware or virtual appliances and configure the SD-WAN overlay solution according to the network design. This includes setting up policies, traffic routing, and security parameters.

5. Testing and Optimization: Thoroughly test the SD-WAN overlay solution, ensuring its compatibility with existing applications and network infrastructure. Optimize the solution based on performance metrics and user feedback.

Conclusion: SD-WAN overlay is a game-changer for businesses seeking to optimize their network infrastructure. By enhancing performance, reducing costs, improving security, streamlining management, and enabling agility, SD-WAN overlay unlocks the true potential of connectivity. Embracing this technology allows organizations to embrace digital transformation, drive innovation, and gain a competitive edge in the digital era. In an ever-evolving business landscape, SD-WAN overlay is the key to unlocking new growth opportunities and future-proofing your network infrastructure.


SD WAN | SD WAN Tutorial

In today's digital age, businesses increasingly rely on technology for seamless communication and efficient operations. One technology that has gained significant traction is Software-Defined Wide Area Networking (SD-WAN). This blog post will provide a comprehensive tutorial on SD-WAN, explaining its key features, benefits, and implementation aspects.

SD-WAN stands for Software-Defined Wide Area Networking. It is a revolutionary approach to network connectivity that enables organizations to simplify their network infrastructure and enhance performance. Unlike traditional Wide Area Networks (WANs), SD-WAN leverages software-defined networking principles to abstract network control from hardware devices.


Highlights: SD WAN Tutorial

The Role of Abstraction

Firstly, this SD-WAN tutorial will address how SD-WAN incorporates a level of abstraction into the WAN, creating virtual WANs: WAN virtualization. Now imagine each of these virtual WANs carrying a single application end-to-end across the WAN, rather than being tied to one location, such as a server. Each individual WAN runs to the cloud or an enterprise location, with secure, isolated paths and its own policies and topologies. Wide Area Network (WAN) virtualization is an emerging technology revolutionizing how networks are designed and managed.

Decoupling the Infrastructure

It allows for decoupling the physical network infrastructure from the logical network, enabling the same physical infrastructure to be used for multiple logical networks. WAN virtualization enables organizations to utilize a single physical infrastructure to create multiple virtual networks, each with unique characteristics. WAN virtualization is a core requirement enabling SD-WAN.

Highlighting SD-WAN

This SD-WAN tutorial will address the SD-WAN vendor’s approach to an underlay and an overlay, including the SD-WAN requirements. The underlay consists of the physical or virtual infrastructure; the overlay is the SD-WAN overlay network to which the applications are mapped. SD-WAN solutions are designed to provide secure, reliable, and high-performance connectivity across multiple locations and networks. Organizations can manage their network configurations, policies, and security infrastructure with SD-WAN.

In addition, SD-WAN solutions can be deployed over any type of existing WAN infrastructure, such as MPLS, Frame Relay, and more. SD-WAN offers enhanced security features like encryption, authentication, and access control. This ensures that data is secure and confidential and that only authorized users can access the network.

Related: Before you proceed, you may find the following posts helpful for pre-information:

  1. SD WAN Security 
  2. WAN Monitoring
  3. Zero Trust SASE
  4. Forwarding Routing Protocols



SD-WAN Tutorial

Key SD WAN Tutorial Discussion Points:


  • WAN transformation.

  • SD WAN requirements.

  • Challenges with the WAN.

  • Old methods of routing protocols.

  • SD-WAN overlay core features.

  • Challenges with BGP.

 

Back to basics: SD-WAN Tutorial

SD-WAN requirements with performance per overlay

As each application is in an isolated WAN overlay, we can assign different mechanisms to each overlay, independent of the others. Different performance metrics and topologies can thus be set for each overlay. More importantly, all of this can be done regardless of the underlying transport. The critical point is that each of these virtual WANs is entirely independent.

SD-WAN solutions offer several benefits, such as greater flexibility in routing, improved scalability, and enhanced security. Additionally, SD-WAN solutions can help organizations reduce cyber-attack risks while providing end-to-end visibility into application performance and network traffic.
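
To make the idea of independent overlays concrete, here is a minimal, hypothetical sketch in Python of how per-overlay policies could be modelled: each virtual WAN gets its own topology and SLA targets, independent of the underlay transport. The class name, fields, and values are invented for illustration and are not taken from any vendor's product.

```python
# Illustrative model of per-application overlays, each with independent
# topology and SLA targets, regardless of the underlay transport.
from dataclasses import dataclass

@dataclass
class OverlayPolicy:
    name: str               # application mapped to this virtual WAN
    topology: str           # e.g. "hub-and-spoke" or "full-mesh"
    max_latency_ms: int     # SLA targets applied only to this overlay
    max_loss_pct: float
    transports: tuple       # underlay links this overlay is allowed to use

overlays = [
    OverlayPolicy("voice", "full-mesh", 150, 1.0, ("mpls",)),
    OverlayPolicy("saas-web", "hub-and-spoke", 400, 3.0, ("broadband", "lte")),
]

for o in overlays:
    print(f"{o.name}: {o.topology}, <{o.max_latency_ms} ms, loss <{o.max_loss_pct}%, via {o.transports}")
```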

SD-WAN Tutorial

Key SD-WAN Benefits

  • Improved performance.

  • Not all traffic is treated equally.

  • Zero-trust security protection.

  • Reduced WAN complexity.

  • Central policy management.

Key Features of SD-WAN

Centralized Control and Visibility:

SD-WAN provides a centralized management console, allowing network administrators complete control over their network infrastructure. This enables them to monitor and manage network traffic, prioritize critical applications, and allocate bandwidth resources effectively.

Dynamic Path Selection:

SD-WAN intelligently selects the most optimal path for data transmission based on real-time network conditions. By dynamically routing traffic through the most efficient path, SD-WAN improves network performance, minimizes latency, and ensures a seamless user experience.
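
As a rough illustration of dynamic path selection, the sketch below scores each candidate path on measured latency, jitter, and loss and steers traffic over the lowest score. The weights, thresholds, and metric values are invented for the example and would differ in any real SD-WAN implementation.

```python
# Toy dynamic path selection: lower score is better; loss is weighted
# heavily because it hurts real-time traffic the most.
paths = {
    "mpls":      {"latency_ms": 40, "jitter_ms": 2,  "loss_pct": 0.1},
    "broadband": {"latency_ms": 25, "jitter_ms": 8,  "loss_pct": 0.5},
    "lte":       {"latency_ms": 80, "jitter_ms": 20, "loss_pct": 1.5},
}

def path_score(metrics, w_latency=1.0, w_jitter=2.0, w_loss=50.0):
    return (w_latency * metrics["latency_ms"]
            + w_jitter * metrics["jitter_ms"]
            + w_loss * metrics["loss_pct"])

best = min(paths, key=lambda name: path_score(paths[name]))
print("Steering traffic over:", best)
```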

Security and Encryption:

SD-WAN solutions incorporate robust security measures to protect data transmission across the network. Encryption protocols, firewalls, and intrusion detection systems are implemented to safeguard sensitive information and mitigate potential security threats.

Benefits of SD-WAN

Enhanced Network Performance:

SD-WAN significantly improves network performance by leveraging multiple connections and routing traffic dynamically. It optimizes bandwidth utilization, reduces latency, and ensures consistent application performance, even in geographically dispersed locations.

Cost Savings:

By leveraging affordable broadband internet connections, SD-WAN eliminates the need for expensive dedicated MPLS connections. This reduces network costs and enables organizations to scale their network infrastructure without breaking the bank.

Simplified Network Management:

SD-WAN simplifies network management through centralized control and automation. This eliminates manual configuration and reduces the complexity of managing a traditional WAN infrastructure. As a result, organizations can streamline their IT operations and allocate resources more efficiently.

 

Implementing SD-WAN

Assessing Network Requirements:

Before implementing SD-WAN, organizations must assess their network requirements, such as bandwidth, application performance, and security requirements. This will help select the right SD-WAN solution that aligns with their business objectives.

Vendor Selection:

Organizations should evaluate different SD-WAN vendors based on their offerings, scalability, security features, and customer support. Choosing a vendor that can meet current requirements and accommodate future growth is crucial.

Deployment and Configuration:

Once the vendor is selected, the implementation involves deploying SD-WAN appliances or virtual instances across the network nodes. Configuration consists of defining policies, prioritizing applications, and establishing security measures.

SD-WAN Tutorial and SD-WAN Requirements:

SD-WAN Is Not New

Before we get into the details of this SD-WAN tutorial, the critical point is that the concepts of SD-WAN are not new and share ideas with the DMVPN phases.  We have had encryption, path control, and overlay networking for some time.

However, the main benefit of SD-WAN is that it acts as an enabler to wrap these technologies together and present them to enterprises as a new integrated offering. We have WAN edge devices that forward traffic to other edge devices across a WAN via centralized control. This enables you to configure application-based policy forwarding and security rules across performance-graded WAN paths.

Policy based routing
Diagram: Policy-based routing. Source Paloalto.

The SD-WAN Control and Data Plane

SD-WAN separates the control plane from the data plane functions, uses central control plane components to make intelligent decisions, and forwards these decisions to the data plane SD-WAN edge routers. The control plane components provide the control plane for the SD-WAN network and give the data plane devices, the SD-WAN edge routers, instructions on where to steer traffic.

The brains of the SD-WAN network are the SD-WAN control plane components, which have a fully holistic, end-to-end view. This contrasts with the traditional network, where control plane functions are resident on each device. Put simply, the data plane is where forwarding occurs, while the control plane, separate from the data plane, sets up all the controls the data plane needs in order to forward.
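
The split described above can be pictured with a few lines of Python: a central controller object computes per-application policy from telemetry it alone sees end to end, and the edge routers merely install and apply what they receive. This is an illustrative toy model, not any vendor's controller API.

```python
# Minimal sketch of the control/data-plane split.
class Controller:
    def __init__(self):
        self.policies = {}                              # app -> chosen path

    def compute_policy(self, telemetry):
        # telemetry: app -> {path: latency_ms}; only the controller sees it all
        for app, paths in telemetry.items():
            self.policies[app] = min(paths, key=paths.get)   # lowest latency wins

    def push(self, edges):
        for edge in edges:
            edge.install(self.policies)

class EdgeRouter:
    def __init__(self, name):
        self.name, self.forwarding_table = name, {}

    def install(self, policies):
        # The data plane only applies what it is told; no local best-path logic.
        self.forwarding_table = dict(policies)

controller = Controller()
edges = [EdgeRouter("branch-1"), EdgeRouter("branch-2")]
controller.compute_policy({"voice": {"mpls": 40, "broadband": 25}})
controller.push(edges)
print(edges[0].name, edges[0].forwarding_table)
```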

Video: DMVPN Phases

Under the hood, SD-WAN shares some of the technologies used by DMVPN. In this technical demo, we will start with the first network topology, with a Hub and Spoke design, and recap DMVPN Phase 1. This was the starting point of the DMVPN design phases. However, today, you will probably see DMVPN phase 3, which allows for spoke-to-spoke tunnels, which may be better suited if you don’t need a true hub and spoke. In this demo, there will also be a bit of troubleshooting.


 

SD WAN tutorial: Removing intensive algorithms

BGP-based networks

SDN is about taking intensive network algorithms out of WAN edge router hardware and placing them into a central controller. Previously, in traditional networks, these algorithms ran in individual hardware devices, with control plane points sitting in the data path. BGP-based networks attempted to use the same concepts with Route Reflector (RR) designs.

They moved route reflectors (RR) off the data plane, and these RRs were then used to compute the best-path algorithms. Route reflectors can be positioned anywhere in the network and do not have to sit on the data path.

BGP Route Reflection
Diagram: BGP Route Reflection

With SD-WAN’s controller-based approach, you are not embedding the control plane in the network. This allows you to centrally provision and push policy and instructions down to the data plane from a central location, which simplifies management and increases scale.

SD-WAN can centralize control plane security and routing, resulting in data path fluidity. The data plane can flow based on the policy set by the control plane controller that is not in the data plane. The SD-WAN control plane handles routing and security decisions and passes the relevant information between the edge routers.

SD WAN tutorial
Diagram: SD-WAN: SD WAN tutorial.

SD WAN Tutorial: Challenges With the WAN 

The traditional WAN comes with a lot of challenges. It creates a siloed management effect where different WAN links try to connect everything. Traditional WANs require extensive planning and long circuit provisioning lead times. In addition, adding a branch or remote location can be costly, as additional hardware purchases are required for each site.

wide area network
Diagram: Wide Area Network (WAN): WAN network and the challenges.

Challenge: Visibility

Visibility plays a vital role in day-to-day monitoring, and alerting is crucial to understanding the ongoing operational impact of the WAN. In addition, visibility enables critical performance levels to be monitored as deployments are scaled out. This helps with proactive alerting, troubleshooting, and policy optimization. Unfortunately, the traditional WAN is known for its lack of visibility.

Challenge: Service Level Agreement (SLA)

A service level agreement (SLA) is a legally binding contract between the service provider and one or more clients that lays down the specific terms and agreements governing the duration of the service engagement. For example, a traditional WAN architecture may consist of private MPLS links with Internet or LTE links as backup.

The SLAs within the MPLS service provider environment are usually broken down into bronze, silver, and gold categories. However, these tiers do not fit every geography and should ideally be fine-tuned per location and customer requirement; in practice, these SLAs are very rigid.

Challenge: Static and lacking agility

The WAN’s capacity, reliability, analytics, and security capabilities should be available on demand. Yet the WAN infrastructure is very static. New sites and bandwidth upgrades require considerable processing time, and this static nature prohibits agility. For today’s applications and the agility the business requires, the WAN is not agile enough, and nothing can be performed on the fly to meet business requirements. When it comes to network topologies, they can be depicted either physically or logically. Common topologies include the Star, Full Mesh, Partial Mesh, and Ring topologies.

Fixed topologies

In a physical world, these topologies are fixed and cannot be automatically changed. And the logical topologies can also be hindered by physical footprints. The traditional model of operation forces applications to fit into a specific network topology already built and designed. We see this a lot with MPLS/VPNs. The application needs to fit into a predefined topology. This can be changed with configurations such as adding and removing Route Targets, but this requires administrator intervention.

Route Targets (RT)
Diagram: Complications with Route Targets. Source Cisco.

SD WAN tutorial: The old methods of routing protocols

Routing Protocols

With any SD-WAN tutorial, we must address inconsistencies with traditional routing protocols. For example, routing protocols make forwarding decisions based on destination addresses, and these decisions are made on a hop-by-hop basis. As a result, the paths an application can take are limited by routing-loop restrictions, meaning that the routing protocols will not take a path that could potentially result in a forwarding loop. Although this overcomes the routing loop problem, it limits the number of paths the application traffic can take.

The traditional WAN also struggles to enable micro-segmentation. Micro-segmentation enhances network security by restricting hackers’ lateral movement in the event of a breach. As a result, it has become increasingly widely deployed by enterprises over the last few years. It provides firms with improved control over east-west traffic and helps to keep applications running in the cloud or data center-type environments more secure.

Routing support is also often inconsistent. For example, some traditional WAN vendors support both LAN- and WAN-side dynamic routing and virtual routing and forwarding (VRF), while others support them only on the WAN side. Some support only static routing, and other vendors have no routing support at all.

Video: Routing Convergence

In this video, we will address routing convergence, also known as convergence routing. We know we have Layer 2 switches that create Ethernet. So, all endpoints physically connect to a Layer 2 switch. And if you are on a single LAN with one large VLAN, you are ready with this setup as switches work out of the box, making decisions based on Layer 2 MAC addresses.

So, these Layer 2 MAC addresses are already assigned to the NIC cards on your hosts, so you don’t need to do anything. You can configure the switches to say that this MAC address is available on this port and this MAC is available on this port. Still, it’s better for the switch to dynamically learn this when the two hosts connected to it start communicating and sending traffic. So if you want a switch to learn the MAC address, send a ping, and it will dynamically do all the MAC learning.


SD-WAN Tutorial: Challenges with BGP

The issue with BGP: Border Gateway Protocol (BGP) attributes

Border Gateway Protocol (BGP) is a gateway protocol that enables the internet to exchange routing information between autonomous systems (AS). As networks interact with each other, they need a way to communicate. This is accomplished through peering. BGP makes peering possible. Without it, networks would not be able to send and receive information from each other. However, it comes with some challenges.

A redundant WAN design requires a routing protocol, either dynamic or static, for practical traffic engineering and failover. This can be done in several ways. For example, for the Border Gateway Protocol (BGP), we can set BGP attributes such as the MED and Local Preference or the administrative distance on static routes. However, routing protocols require complex tuning to load balance between border edge devices.

Although these attributes allow granular policy control, they do not cover aspects relating to path performance, such as Round Trip Time (RTT), delay, and jitter. In addition, there has always been a problem with complex routing for the WAN. As a result, it’s tricky to configure Quality of Service (QOS) policies on a per-link basis and design WAN solutions to incorporate multiple failure scenarios.

Issues with BGP: Lack of performance awareness

Due to the lack of performance awareness, BGP may not choose the best-performing path. Therefore, we must ask ourselves whether BGP routes on the best path or merely the shortest path.

bgp protocol
Diagram: SD WAN tutorial and BGP protocol. BGP protocol example.

Issues with BGP: The shortest path is not always the best path

The shortest path is not necessarily the best path. Initially, we didn’t have real-time voice and video traffic, which is highly sensitive to latency and jitter. We also assumed that all links were equal. This is not the case today, where we have a mix-and-match of connections, such as slow LTE and fast MPLS. Therefore, the shortest path is no longer effective.

However, there are solutions on the market to enhance BGP, offering performance-based solutions for BGP-based networks. These could, for example, send out ICMP probes to monitor the network and then, based on the responses, modify BGP attributes such as AS-path prepending to influence the traffic flow. All this is done in an attempt to make BGP more performance-aware.
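
As a hedged sketch of this "probe and prepend" idea, the Python below only decides how long an AS path a given exit should advertise based on a measured round-trip time; it does not speak BGP, and the ASN, threshold, and prepend count are made-up example values you would normally apply through a route policy.

```python
# Toy probe-and-prepend decision: a degraded exit advertises a longer AS path
# so that neighbours prefer the alternative exit.
LOCAL_AS = 65001            # assumed example ASN
RTT_THRESHOLD_MS = 120      # invented degradation threshold

def advertised_as_path(measured_rtt_ms, prepend_count=3):
    if measured_rtt_ms > RTT_THRESHOLD_MS:
        return " ".join([str(LOCAL_AS)] * (1 + prepend_count))
    return str(LOCAL_AS)

print(advertised_as_path(45))    # healthy exit  -> "65001"
print(advertised_as_path(180))   # degraded exit -> "65001 65001 65001 65001"
```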

BGP is not performance-aware

However, we cannot avoid the fact that BGP is unaware of capacity and performance. The common BGP attributes used for path selection are AS-Path length and the multi-exit discriminator (MED). Unfortunately, these attributes do not correlate with the network’s or the application’s performance.

Video: BGP in the Data Center

In this whiteboard session, we will address the basics of BGP. A network exists specifically to serve the connectivity requirements of applications, and these applications are to serve business needs. So, these applications must run on stable networks, and stable networks are built from stable routing protocols. Routing Protocols are a set of predefined rules used by the routers that interconnect your network to maintain the communication between the source and the destination. These routing protocols help to find the routes between two nodes on the computer network.


Issues with BGP: AS-Path that misses critical performance metrics

When BGP receives multiple paths to the same destination with default configurations, it runs the best-path algorithm to decide the best path to install in the IP routing table. Generally, this selection is based on AS-Path length, the number of autonomous systems traversed. However, AS-Path length is not an efficient measure of end-to-end transit.

It misses the entire network shape, which can result in long path selection or paths experiencing packet loss. Also, BGP changes paths only in reaction to changes in the policy or the set of available routes.

BGP protocol explained
Diagram: SD WAN tutorial and BGP protocol explained—the issues.

Issues with BGP: BGP and Active-Active deployments

Configuring BGP at the WAN edge requires the applications to fit into a previously defined network topology. We need something else for applications. BGP is hard to configure and manage when you want active-active or bandwidth aggregation. What options do you have when you want to dynamically steer sessions over multiple links?

Blackout detection only

BGP was not designed to address WAN transport brownouts caused by packet loss. Even with blackouts, i.e., complete link failures, application recovery could take tens of seconds or even minutes to become fully operational. Nowadays, we have more brownouts than blackouts, yet the original design of BGP was to detect blackouts only.

Brownouts can last anywhere from 10 ms to 10 seconds, so it’s crucial to detect the failure in sub-second time and re-route to a better path. To provide resiliency, WAN edge protocols must be combined with additional mechanisms, such as IP SLA and even enhanced object tracking. Unfortunately, these add to configuration complexity.
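
To give a feel for brownout detection, as opposed to waiting for a full link-down event, here is a small simulated Python sketch: probe the active path frequently and fail over once probe loss crosses a threshold. The probe behaviour, threshold, and path names are invented; a real deployment would use IP SLA-style probes and much tighter timers.

```python
# Simulated brownout detector: fail over on probe loss, not on link-down.
import random

LOSS_THRESHOLD = 0.05      # 5% probe loss triggers a reroute
PROBE_WINDOW = 20          # probes per evaluation (e.g. one every 200 ms)

def probe(path):
    # Stand-in for an ICMP/UDP probe; True means the probe came back in time.
    return random.random() > (0.15 if path == "mpls-degraded" else 0.01)

def probe_loss(path, window=PROBE_WINDOW):
    lost = sum(1 for _ in range(window) if not probe(path))
    return lost / window

active = "mpls-degraded"
if probe_loss(active) > LOSS_THRESHOLD:
    active = "broadband-backup"      # brownout detected: reroute to a better path
print("Active path:", active)
```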

IP SLA Configuration
Diagram: Example IP SLA configuration. Source SlidePlayer.

SD WAN Tutorial: Major Environmental Changes

The hybrid WAN, typically consisting of Internet and MPLS, was introduced for cost savings and resilience. However, three emerging factors – new application requirements, increased Internet use, and the adoption of public cloud services – have put traditional designs under pressure.

We also have a lot of complexity at the branch. Many branch sites now include various appliances such as firewalls, intrusion prevention, Internet Protocol (IP) VPN concentrators, WAN path controllers, and WAN optimization controllers.

All these point solutions must be maintained and operated and provide the proper visibility that can be easily digested. Visibility is critical for the WAN. So, how do you obtain visibility into application performance across a hybrid WAN and ensure that applications receive appropriate prioritization and are forwarded over a proper path?

The era of client-server  

The design for the WAN and branch sites was conceived in the client-server era. At that time, the WAN design satisfied the applications’ needs: applications and data resided behind the central firewall in the on-premises data center. Today, we are in a different space with hybrid IT and multi-cloud designs, making applications and data distributed. Data is now omnipresent. The type of WAN and branch that originated in the client-server era was not designed with cloud applications in mind.

Hub and spoke designs.

The “hub and spoke” model was designed for client/server environments where almost all of an organization’s data and applications resided in the data center (i.e., the hub location) and were accessed by workers in branch locations (i.e., the spokes).  Internet traffic would enter the enterprise through a single ingress/egress point, typically into the data center, which would then pass through the hub and to the users in branch offices.

The birth of the cloud resulted in a significant shift in how we consume applications, traffic types, and network topology. There was a big push to the cloud, and almost everything was offered as a SaaS. In addition, the cloud era changed the traffic patterns as the traffic goes directly to the cloud from the branch site and doesn’t need to be backhauled to the on-premise data center.

network design
Diagram: Hub and Spoke: Network design.

Challenges with hub and spoke design.

The hub and spoke model is now outdated. Because the model is centralized, day-to-day operations may be relatively inflexible, and changes at the hub, even in a single route, may have unexpected consequences throughout the network. It may be difficult or even impossible to handle occasional periods of high demand between two spokes.

The result of cloud acceleration is that the best point of access is no longer always the central location. Why would branch sites direct all internet-bound traffic to the central HQ, causing traffic tromboning and adding latency, when it can go directly to the cloud? The hub and spoke design is rarely an efficient topology for cloud-based applications.

Active/Active and Active/Passive

Historically, WANs were built on an “active-passive” model, where a branch can be connected using two or more links, but only the primary link is active and passing traffic. In this scenario, the backup connection only becomes active if the primary connection fails. While this might seem sensible, it is inefficient: the backup link’s capacity sits idle.

The interest in active-active has always been there, but it was challenging to configure and expensive to implement. In addition, active/active designs with traditional routing protocols are hard to design, inflexible, and a nightmare to troubleshoot.

Convergence and application performance problems can arise from active-active WAN edge designs. For example, with active-active, packets that reach the other end can arrive out of order because each link propagates at a different speed. The remote end then has to reassemble them, resulting in additional jitter and delay. Both high jitter and delay are bad for network performance.

The issues arising from active-active are often known as spray and pray. It increases bandwidth but decreases goodput. Spraying packets down both links can result in 20% drops or packet reordering. There will also be firewall issues as they may see asymmetric routes.

TCP out of order packets
Diagram: TCP out-of-order packets. Source F5.

SD-WAN tutorial and SD WAN requirements and active-active paths.

For an active-active design, one must have application session awareness and a design that eliminates asymmetric routing. In addition, you need to slice up the WAN so application flows can work efficiently over either link. SD-WAN does this. WAN designs can also be active-standby, which requires routing protocol convergence in the event of primary link failure.

Unfortunately, routing protocols are known to converge slowly. The emergence of SD-WAN technologies with multi-path capabilities combined with the ubiquity of broadband has made active-active highly attractive and something any business can deploy and manage quickly and easily.

An SD-WAN solution enables the creation of virtual overlays that bond multiple underlay links. These virtual overlays allow enterprises to classify and categorize applications based on their unique service-level requirements and provide fast failover should an underlay link experience congestion, a brownout, or an outage.

With traditional routing, regardless of the mechanism used to speed up convergence and failure detection, several convergence steps must be carried out: a) detecting the topology change, b) notifying the rest of the network about the change, c) calculating the new best path, and d) switching to the new best path. Traditional WAN protocols route down one path and, by default, have no awareness of what’s happening at the application level. For this reason, there have been many attempts to enhance the WAN’s behavior.

Example Convergence Time with OSPF
Diagram: Example Convergence Time with OSPF. Source INE.

A keynote for this SD WAN tutorial: The issues with MPLS

multiprotocol label switching
Diagram: Multiprotocol label switching (MPLS).

MPLS has some great features but is only suitable for some application profiles. As a result, it can introduce more points of failure than traditional internet transport. Its architecture is predefined and, in some cases, inflexible. For example, some Service Providers (SP) might only offer hub and spoke topologies, and others only offer a full mesh.  Any changes to these predefined architectures will require manual intervention unless you have a very flexible MPLS service provider that allows you to do cool stuff with Route Targets.

MPLS forwarding
Diagram: MPLS forwarding

SD-WAN Tutorial and Scenario: Old and rigid MPLS

I designed a headquarters site for a large enterprise during a recent consultancy. MPLS topologies, once provisioned, are challenging to change. MPLS topologies are similar to the brick foundation of a house: once the foundation is laid, the only way to change the original structure is to start over. In its simplest form, an MPLS network consists of Provider Edge (PE) and Provider (P) routers. The P router configuration does not change based on customer requirements, but the PE router configuration does.

Route Targets

We have several technologies, such as the Route Target, to control which routes are imported into and exported out of PE routers. A PE router with matching route targets and the right configuration allows the routes to pass. This creates the customer topologies, such as hub and spoke or full mesh. In addition, the Wide Area Network (WAN) I worked on was fully outsourced. As a result, any request would require service provider intervention with additional design and provisioning activities.

For example, mapping application subnets to a new or existing RT may involve a new high-level design approval with additional configuration templates, which would have to be applied by provisioning teams. It was a lot of work for such a small task. Unfortunately, it puts the brakes on agility and pushes lead times through the roof.

BGP community tagging

While there are ways to overcome this with BGP community tagging and matching, which provides some flexibility, we must recognize that it remains a fixed, predefined configuration. As a result, all subsequent design changes may still require service provider intervention.

SD WAN Requirements

sd wan requirements
Diagram: SD-WAN: The drivers for SD-WAN.

In the following sections of this SD WAN tutorial, we will address the SD-WAN drivers, which range from the need for flexible topologies to bandwidth-intensive applications.

SD-WAN tutorial and SD WAN requirements: Flexible topologies

For example, using DPI, we can have Voice over IP traffic go over MPLS. Here, the SD-WAN will look at the Real-time Transport Protocol (RTP) and the Session Initiation Protocol (SIP). We can also send less critical applications to the Internet, with MPLS reserved for specific apps.

As a result, best-effort traffic is pinned to the Internet, and only critical apps get an SLA and go on the MPLS path. Now we have better utilization of the transports, and circuits never need to be dormant. With SD-WAN, we use the bandwidth you have available and ensure an optimized experience.

The SD-WAN’s value is that the solution tracks network and path conditions in real time, revealing performance issues as they happen, and dynamically redirects data traffic to the next available path.

Then, when the network recovers to its normal state, the SD-WAN solution can redirect the data traffic back to its original path. Therefore, the effects of network degradation, which come in the form of brownouts and soft failures, can be minimized.
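
A minimal sketch of the application-aware steering described above, assuming the classifier can identify the application (here crudely by destination port as a stand-in for DPI): critical apps are pinned to MPLS while best-effort traffic goes straight to the Internet. The port-to-application and application-to-path mappings are examples only.

```python
# Toy application-aware steering: classify the flow, then pick its transport.
APP_BY_PORT = {5060: "voip", 443: "saas-web", 3389: "remote-desktop"}
PATH_BY_APP = {"voip": "mpls", "saas-web": "internet", "remote-desktop": "mpls"}

def steer(dst_port):
    app = APP_BY_PORT.get(dst_port, "best-effort")
    return app, PATH_BY_APP.get(app, "internet")   # best-effort defaults to Internet

for port in (5060, 443, 8080):
    app, path = steer(port)
    print(f"port {port}: classified as {app}, steered over {path}")
```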

VPN Segmentation
Diagram: VPN Segmentation. Source Cisco.

SD-WAN tutorial and SD WAN requirements: Encryption key rotation

Data security has never been a more important consideration than it is today. Therefore, businesses and other organizations must take robust measures to keep data and information safely under lock and key. Encryption keys must be rotated regularly (the standard interval is every 90 days) to reduce the risk of compromised data security.

However, regular VPN-based encryption key rotation can be complicated and disruptive, often requiring downtime. SD-WAN can offer automatic key rotation, allowing network administrators to pre-program rotations without manual intervention or system downtime.
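
As a rough sketch of what automated rotation means in practice, the Python below swaps in a fresh data-plane key once the current one passes its lifetime, with no manual step or downtime. The 90-day interval mirrors the common guidance mentioned above; the key handling is illustrative only, not a production key-management design.

```python
# Illustrative automatic key rotation: callers always get a valid key,
# and rotation happens transparently once the lifetime is exceeded.
import os, time

ROTATION_INTERVAL_S = 90 * 24 * 3600    # roughly the 90-day interval noted above

class KeyManager:
    def __init__(self):
        self.key, self.created = os.urandom(32), time.time()

    def current_key(self):
        if time.time() - self.created > ROTATION_INTERVAL_S:
            self.key, self.created = os.urandom(32), time.time()   # rotate in place
        return self.key

km = KeyManager()
print("active key fingerprint:", km.current_key().hex()[:16], "...")
```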

SD-WAN tutorial and SD WAN requirements: Push to the cloud 

Another critical feature of SD-WAN technology is cloud breakout. This lets you connect branch office users to cloud-hosted applications directly and securely, eliminating the inefficiencies of backhauling cloud-destined traffic through the data center. Given the ever-growing importance of SaaS and IaaS services, efficient and reliable access to the cloud is crucial for many businesses and other organizations. By simplifying how branch traffic is routed, SD-WAN makes setting up breakouts quicker and easier.

  • The changing perimeter location

Users are no longer positioned in one location with corporate-owned static devices. Instead, they are dispersed, and additional latency degrades application performance when they connect to central locations. Optimizations can be made to applications and network devices, but the only real solution is to shorten the link by moving to cloud-based applications. There is a huge push and a rapid shift toward cloud-based applications, with most organizations now moving away from on-premises, in-house hosting to cloud-based management.

The ready-made global footprint enables the usage of SaaS-based platforms that negate the drawbacks of dispersed users tromboning to a central data center to access applications. Logically positioned cloud platforms are closer to the mobile user. In addition, cloud hosting these applications is far more efficient than making them available over the public Internet.


SD-WAN tutorial and SD WAN requirements: Decentralization of traffic

A lot of traffic is now decentralized from the central data center to remote branch sites. Many branches do not run high bandwidth-intensive applications. These types of branch sites are known as light edges. Despite the traffic change, the traditional branch sites rely on hub sites for most security and network services.

The branch sites should connect to the cloud applications directly over the Internet without tromboning traffic to data centers for Internet access or security services. An option should exist to extend the security perimeter into the branch sites without requiring expensive onsite firewalls and IPS/IDS. SD-WAN builds a dynamic security fabric without the appliance sprawl of multiple security devices and vendors.

  • The ability to service chain traffic 

There is also service chaining. Service chaining through SD-WAN allows organizations to reroute their data traffic through one or multiple services, including intrusion detection and prevention devices or cloud-based security services. It thereby enables firms to declutter their branch office networks.

They can, after all, automate how particular types of traffic flows are handled and assemble connected network services into a single chain.

SD-WAN tutorial and SD WAN requirements: Bandwidth-intensive applications 

Exponential growth in demand for high-bandwidth applications such as multimedia in cellular networks has triggered the need to develop new technologies capable of providing the required high-bandwidth, reliable links in wireless environments. The biggest user of internet bandwidth is video streaming—more than half of total global traffic. The Cartesian study confirms historical trends reflecting consumer usage that remains highly asymmetric as video streaming remains the most popular.

  • Richer and hungry applications

Richer applications, multimedia traffic, and growth in the cloud application consumption model drive the need for additional bandwidth. Unfortunately, the congestion leads to packet drops, ultimately degrading application performance and user experience.

SD-WAN offers flexible bandwidth allocation so that you don’t have to go through the hassle of manually allocating bandwidth for specific applications. Instead, SD-WAN allows you to classify applications and specify a particular service level requirement. This way, you can ensure your set-up is better equipped to run smoothly, minimizing the risk of glitchy and delayed performance on an audio conference call.

SD-WAN tutorial and SD WAN requirements: Organic growth 

We also have organic business growth, a big driver for additional bandwidth requirements. The challenge is that existing network infrastructures are static and struggle to respond to this growth in a reasonable period. The last mile of MPLS locks you in, destroying agility. Circuit lead times impede the organization’s productivity and create an overall lag.

SD-WAN tutorial and SD WAN requirements: Costs 

A WAN solution should be simple. To serve the new era of applications, we need to increase the link capacity by buying more bandwidth. However, life is more complex. The WAN is an expensive part of the network, and employing link oversubscription to reduce the congestion is too costly.

Bandwidth comes at a high cost to cater to new application requirements not met by the existing TDM-based MPLS architectures. At the same time, feature-rich MPLS comes at a high price for relatively low bandwidth. You are going to need more bandwidth to beat latency.

On the more traditional side, MPLS and private ethernet lines (EPLs) can range in cost from $700 to $10,000 per month, depending on bandwidth size and distance of the link itself. Some enterprises must also account for redundancies at each site as uptime for higher-priority sites comes into play. Cost becomes exponential when you have a large number of sites to deploy.

SD-WAN tutorial and SD-WAN requirements: Limitations of protocols 

We already mentioned some problems with routing protocols, but leaving IPsec to default raises challenges. IPSec architecture is point-to-point, not site-to-site. Therefore, it does not natively support redundant uplinks. Complex configurations and potentially additional protocols are required when sites have multiple uplinks to multiple providers. 

Left to its defaults, IPsec is not abstracted, and one session cannot be sent over multiple uplinks. This causes challenges with transport failover and path selection. Secure tunnels should be brought up and torn down immediately, and new sites should be incorporated into a secure overlay without much delay or manual intervention.

SD-WAN requirements: Internet of Things (IoT) 

As millions of IoT devices come online, how do we further segment and secure this traffic without complicating the network design? There will be many dumb IoT devices that will require communication with the IoT platform in a remote location. Therefore, will there be increased signaling traffic over the WAN? 

Security and bandwidth consumption are vital issues concerning the introduction of IP-enabled objects. Although encryption is a great way to prevent hackers from accessing data, it is also one of the leading IoT security challenges.

Many IoT devices lack the storage and processing capabilities found on a traditional computer. The result is increased attacks where hackers can easily manipulate the algorithms designed for protection. Also, weak credentials and login details leave nearly all IoT devices vulnerable to password hacking and brute force. Any company that uses factory-default credentials on its devices places its business, assets, customers, and valuable information at risk of a brute-force attack.

SD-WAN tutorial and SD WAN requirements: Visibility

A common service provider challenge is the lack of visibility into customer traffic. The lack of granular detail on traffic profiles leads to expensive over-provisioning of bandwidth and link resilience. In addition, upgrades at both the packet and optical layers often require complete traffic visibility for justification.

Many networks out there are left at half capacity just in case there is an unexpected spike in traffic. As a result, much money is spent on link underutilization that should be spent on innovation. This underutilization and over-provisioning is due to the lack of visibility.

Summary: SD WAN Tutorial

SD-WAN, or Software-Defined Wide Area Networks, has emerged as a game-changing technology in the realm of networking. This tutorial delved into SD-WAN fundamentals, its benefits, and how it revolutionizes traditional WAN infrastructures.

Section 2: Understanding SD-WAN

SD-WAN is an innovative approach to networking that simplifies the management and operation of a wide area network. It utilizes software-defined principles to abstract the underlying network infrastructure and provide centralized control, visibility, and policy-based management.

Section 3: Key Features and Benefits

One of the critical features of SD-WAN is its ability to optimize network performance by intelligently routing traffic over multiple paths, including MPLS, broadband, and LTE. This enables organizations to leverage cost-effective internet connections without compromising performance or reliability. Additionally, SD-WAN offers enhanced security measures, such as encrypted tunneling and integrated firewall capabilities.

Section 4: Deployment and Implementation

Implementing SD-WAN requires careful planning and consideration. This section will explore the different deployment models, including on-premises, cloud-based, and hybrid approaches. We will discuss the necessary steps in deploying SD-WAN, from initial assessment and design to configuration and ongoing management.

Section 5: Use Cases and Real-World Examples

SD-WAN has gained traction across various industries due to its versatility and cost-saving potential. This section will showcase notable use cases, such as retail, healthcare, and remote office connectivity, highlighting the benefits and outcomes of SD-WAN implementation. Real-world examples will provide practical insights into the transformative capabilities of SD-WAN.

Section 6: Future Trends and Considerations

As technology continues to evolve, staying updated on the latest trends and considerations in the SD-WAN landscape is crucial. This section will explore emerging concepts, such as AI-driven SD-WAN and integrating SD-WAN with edge computing and IoT technologies. Understanding these trends will help organizations stay ahead in the ever-evolving networking realm.

Conclusion:

In conclusion, SD-WAN represents a paradigm shift in how wide area networks are designed and managed. Its ability to optimize performance, ensure security, and reduce costs has made it an attractive solution for organizations of all sizes. By understanding the fundamentals, exploring deployment options, and staying informed about the latest trends, businesses can leverage SD-WAN to unlock new possibilities and drive digital transformation.

zero trust network design

Zero Trust Network Design

Zero Trust Network Design

In today's interconnected world, where data breaches and cyber threats have become commonplace, traditional perimeter defenses are no longer enough to protect sensitive information. Enter Zero Trust Network Design, a security approach that prioritizes data protection by assuming that every user and device, inside or outside the network, is a potential threat. In this blog post, we will explore the Zero Trust Network Design concept, its principles, and its benefits in securing the modern digital landscape.

Zero trust network design is a security concept that focuses on reducing the attack surface of an organization’s network. It is based on the assumption that users and systems inside a network are untrusted; therefore, all traffic is considered untrusted and must be verified before access is granted. This contrasts with traditional networks, which often rely on perimeter-based security to protect against external threats.

Key Points:

-Identity and Access Management (IAM): IAM plays a vital role in Zero Trust by ensuring that only authenticated and authorized users gain access to specific resources. Multi-factor authentication (MFA) and strong password policies are integral to this component.

-Network Segmentation: Zero Trust advocates for segmenting the network into smaller, more manageable zones. This helps contain potential breaches and restricts lateral movement within the network.

-Continuous Monitoring and Analytics: Real-time monitoring and analysis of network traffic, user behavior, and system logs are essential for detecting any anomalies or potential security breaches.

-Enhanced Security: By adopting a Zero Trust approach, organizations significantly reduce the risk of unauthorized access and lateral movement within their networks, making it harder for cyber attackers to exploit vulnerabilities.

-Improved Compliance: Zero Trust aligns with various regulatory and compliance requirements, providing organizations with a structured framework to ensure data protection and privacy.

-Greater Flexibility: Zero Trust allows organizations to embrace modern workplace practices, such as remote work and BYOD (Bring Your Own Device), without compromising security. Users can securely access resources from anywhere, anytime.

Implementing Zero Trust requires a well-defined strategy and careful planning. Here are some key steps to consider:

1. Assess Current Security Infrastructure: Conduct a thorough assessment of existing security measures, identify vulnerabilities, and evaluate the readiness for Zero Trust implementation.

2. Define Trust Boundaries: Determine the trust boundaries within the network and establish access policies accordingly. Consider factors like user roles, device types, and resource sensitivity.

3. Choose the Right Technologies: Select security solutions and tools that align with your organization's needs and objectives. These may include next-generation firewalls, secure web gateways, and identity management systems.

Highlights: Zero Trust Network Design

Never Trust, Always Verify

The core concept of zero-trust network design and segmentation is never to trust, always verify. This means that all traffic, regardless of its origin, must be verified before access is granted. This is achieved through layered security controls, including authentication, authorization, encryption, and monitoring.

firewalling device

Authentication is used to verify the identity of users and devices before allowing access to resources. Authorization is used to determine what resources a user or device is allowed to access. Encryption is used to protect data in transit and at rest. Monitoring is used to detect threats and suspicious activity.
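
The four controls above can be pictured as a small pipeline: authenticate the caller, authorize the specific resource, encrypt the payload in transit, and log the event for monitoring. The Python sketch below is a toy illustration with placeholder credential and permission stores and a deliberately fake cipher; it is not a real security implementation.

```python
# Toy layered-controls pipeline: authenticate -> authorize -> encrypt -> log.
import hashlib, hmac, os

USERS = {"alice": hashlib.sha256(b"correct-horse").hexdigest()}   # toy credential store
PERMISSIONS = {"alice": {"payroll-app"}}                          # toy authorization store
AUDIT_LOG = []                                                    # monitoring/visibility

def authenticate(user, password):
    return hmac.compare_digest(USERS.get(user, ""), hashlib.sha256(password).hexdigest())

def authorize(user, resource):
    return resource in PERMISSIONS.get(user, set())

def encrypt(payload, key):
    # Deliberately fake XOR "cipher" just to mark the encryption step; not real crypto.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(payload))

def handle(user, password, resource, payload):
    if not authenticate(user, password):
        return "denied: authentication failed"
    if not authorize(user, resource):
        return "denied: not authorized"
    AUDIT_LOG.append((user, resource))            # every permitted access is logged
    return encrypt(payload, os.urandom(16))

print(handle("alice", b"correct-horse", "payroll-app", b"GET /salary"))
print(handle("alice", b"correct-horse", "crm-app", b"GET /leads"))
```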

Zero Trust Network Segmentation

Zero-trust network design, including zero-trust network segmentation, is becoming increasingly popular as organizations move away from perimeter-based security. By verifying all traffic rather than relying on perimeter-based security, organizations can reduce their attack surface and improve their overall security posture. Segmentation can work at different layers of the OSI Model.

data center network microsegmentation

With a zero-trust network segmentation approach, networks are segmented into smaller islands with specific workloads. In addition, each segment has its own ingress and egress controls to minimize the “blast radius” of unauthorized access to data.

Related: For pre-information, you may find the following helpful:

  1. DNS Security Designs
  2. Zero Trust Access
  3. SD WAN Segmentation



Zero Trust Architecture

Key Zero Trust Network Design Discussion Points:


  • Zero Trust principles.

  • TCP weak connectivity model.

  • Develop a Zero Trust architecture.

  • Issues of the traditional perimeter.

  • The use of micro perimeters.

Back to Basics: Zero Trust Network Design

Challenging Landscape

The drive for zero trust networking and a software-defined perimeter is again gaining momentum. The bad actors are getting increasingly sophisticated, resulting in a pervasive sense of unease with traditional networking and security methods. So why are our network infrastructure and applications open to such severe security risks? This Zero Trust tutorial will recap some of the technological weaknesses driving the path to Zero Trust network design and Zero Trust SASE.

We give devices IP addresses to connect to the Internet, and there are three standard pathways for doing so. None of these techniques ensures attacks will not happen; they are like preventive medicine. Given the sophistication of today's bad actors, we need something closer to total immunization, ensuring that attacks cannot even touch your infrastructure, by implementing a zero trust security strategy and software-defined perimeter solutions.

Understanding Zero Trust Network Design:

Zero Trust Network Design is a security framework that aims to prevent and mitigate cyber-attacks by continuously verifying and validating every access request. Unlike the traditional perimeter-based security model, Zero Trust Network Design leverages several core principles to achieve a higher level of security:

1. Least Privilege: Users and devices are granted only the minimum level of access required to perform their specific tasks. This principle ensures that the potential damage is limited even if a user’s credentials are compromised.

2. Micro-Segmentation: Networks are divided into smaller, isolated segments, making it more challenging for an attacker to move laterally and gain unauthorized access to critical systems or data.

3. Continuous Authentication: Zero-trust network Design emphasizes multi-factor authentication and continuous verification of user identity and device health rather than relying solely on static credentials like usernames and passwords.

4. Network Visibility: Comprehensive monitoring and logging are crucial components of Zero Trust Network Design. Organizations can detect anomalies and potential security breaches in real-time by closely monitoring network traffic and inspecting all data packets.
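
To tie these principles together, here is a hedged Python sketch of a policy decision point that evaluates each access request against role, micro-segment, MFA status, and device health, denying by default in the spirit of least privilege. The field names, segments, and rules are invented for illustration and are not a reference implementation.

```python
# Toy policy decision point: context is evaluated on every request, deny by default.
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user: str
    role: str
    device_healthy: bool
    mfa_passed: bool
    segment: str           # micro-segment the resource lives in
    resource: str

SEGMENT_POLICY = {
    "finance": {"allowed_roles": {"finance-analyst"}, "require_mfa": True},
    "dev":     {"allowed_roles": {"developer"},       "require_mfa": True},
}

def decide(req: AccessRequest) -> str:
    policy = SEGMENT_POLICY.get(req.segment)
    if policy is None or req.role not in policy["allowed_roles"]:
        return "deny"                      # least privilege: deny by default
    if policy["require_mfa"] and not req.mfa_passed:
        return "deny"
    if not req.device_healthy:
        return "deny"                      # context is re-checked on every request
    return "allow"

print(decide(AccessRequest("bob", "developer", True, True, "dev", "git-repo")))     # allow
print(decide(AccessRequest("bob", "developer", True, True, "finance", "ledger")))   # deny
```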

Benefits of Zero Trust Network Design:

Implementing Zero Trust Network Design offers numerous benefits for organizations seeking to protect their sensitive data and mitigate cyber risks:

1. Enhanced Security: By assuming that all users and devices are untrusted, Zero Trust Network Design provides a higher security level against internal and external threats. It minimizes the risk of unauthorized access and helps organizations detect and respond to potential breaches more effectively.

2. Improved Compliance: Many industries have strict regulatory requirements for protecting sensitive data. Zero Trust Network Design addresses these compliance challenges by providing granular control over access permissions and ensuring that only authorized individuals can access critical information.

3. Reduced Attack Surface: By segmenting networks and implementing strict access controls, Zero Trust Network Design reduces the attack surface available to potential attackers. This proactive approach makes it significantly harder for cybercriminals to move laterally within the network and gain access to sensitive data.

4. Simplified User Experience: Contrary to common misconceptions, implementing Zero Trust Network Design does not have to sacrifice user experience. With modern identity and access management solutions, users can enjoy a seamless and secure authentication process regardless of location or device.

Highlighting zero trust network segmentation

Zero-trust network segmentation is a process in which a network is divided into smaller, more secure parts. This can be done by using software firewalls, virtual LANs (VLANs), or other network security protocols. The purpose of zero-trust network segmentation, also known as microsegmentation, is to decrease a network’s attack surface and reduce the potential damage caused by a network breach. It also allows for more granular control over user access, which can help prevent unauthorized access to sensitive data.

Microsegmentation also allows for more efficient deployment of applications and more detailed monitoring and logging of network activity. By leveraging the advantages of microsegmentation, organizations can increase their network’s security and efficiency while protecting their data and resources.

Zero Trust: Changing the Approach to Security

Zero Trust is about fundamentally transforming the underlying philosophy and approach to enterprise security—shifting from outdated and demonstrably ineffective perimeter-centric methods to a dynamic, identity-centric, and policy-based system. Policies are at the heart of Zero Trust—after all, its primary architectural components are Policy Decision Points and Policy Enforcement Points. In our Zero Trust world, policies are the structures organizations create to define which identities are permitted access to resources under which circumstances.

zero trust networking
Diagram: Define Zero Trust: The standard three pathways.

Introduction to Zero Trust Network Design

The idea behind the Zero Trust model and software-defined perimeter (SDP) is a connection-based security architecture designed to stop attacks. It doesn’t expose the infrastructure and its applications. Instead, it enables you to know the authorized users by authenticating, authorizing, and validating the devices they are on before connecting to protected resources.

A Zero-Trust architecture allows you to operate while vulnerabilities, patches, and configurations are in progress. Essentially, it cloaks applications or groups of applications so they are invisible to attack.

zero trust network design
Diagram: Zero Trust Network Design. The Principles. Source cimcor.

Zero Trust principles

Zero Trust Networking (ZTN) and SDP are a security philosophy and a set of Zero Trust principles which, taken together, represent a significant shift in how security should be approached. Foundational security elements used before Zero Trust often achieved only coarse-grained separation of users, networks, and applications.

Zero Trust, on the other hand, enhances this, effectively requiring that all identities and resources be segmented from one another. It enables fine-grained, identity- and context-sensitive access controls driven by an automated platform, although Zero Trust started as a narrowly focused approach of not trusting any network identity until authenticated and authorized.

Traditional security boundaries

Traditionally, security boundaries were placed at the edge of the enterprise network in a classic “castle wall and moat” approach. However, a significant issue with this was the design and how we connected. Traditional non-zero Trust security solutions have been unable to bridge the disconnect between network and application-level security. Traditionally, users (and their devices) obtained broad access to networks, and applications relied upon authentication-only access control.

Issue 1 – We Connect First and Then Authenticate

Connect first, authenticate second.

TCP/IP is a fundamentally open network protocol that facilitates easy connectivity and reliable communications between distributed computing nodes. It has served us well in enabling our hyper-connected world but—for various reasons—doesn’t include security as part of its core capabilities.

TCP has a weak security foundation

Transmission Control Protocol (TCP) has been around for decades and has a weak security foundation. When it was created, security was out of scope. TCP can detect and retransmit errored packets, but left to its defaults, communication packets are not encrypted, which poses security risks. In addition, TCP operates with a Connect First, Authenticate Second model, which is inherently insecure and leaves the two connecting parties wide open to attack. When clients want to communicate and access an application, they first set up a connection.

Only once the connect stage has been carried out successfully can the authentication stage occur, and only after authentication has been completed can we begin to pass data. 

zero trust network design
Diagram: Zero Trust security. The TCP model of connectivity.

From a security perspective, the most important thing to understand is that this connection occurs purely at a network layer with no identity, authentication, or authorization. The beauty of this model is that it enables anyone with a browser to easily connect to any public web server without requiring any upfront registration or permission. This is a perfect approach for a public web server but a lousy approach for a private application.
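
The point can be seen in a few lines of Python (assuming outbound network access): the TCP handshake to a public web server completes before any identity is presented, and whatever authentication exists happens afterwards, inside the application protocol.

```python
# Connect first, authenticate second: the network-layer connection is
# established with no identity involved at all.
import socket

sock = socket.create_connection(("example.com", 80), timeout=5)
# The server has already accepted a stranger; only now, over the open
# connection, could any application-level authentication take place.
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
print(sock.recv(120))
sock.close()
```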

The potential for malicious activity

With this process of Connect First and Authenticate Second, we are essentially opening up the door of the network and the application without knowing who is on the other side. Unfortunately, with this model, we have no idea who the client is until they have carried out the connect phase, and once they have connected, they are already in the network. Maybe the requesting client is not trustworthy and has bad intentions. If so, once they connect, they can carry out malicious activity and potentially perform data exfiltration. 

Developing a Zero Trust Architecture

A zero-trust architecture requires endpoints to authenticate and be authorized before obtaining network access to protected servers. Then, real-time encrypted connections are created between requesting systems and application infrastructure. With a zero-trust architecture, we must establish trust between the client and the application before the client can set up the connection. Zero Trust is all about trust – never trust, always verify.

The trust is bi-directional between the client and the Zero Trust architecture (which can take several forms) and between the application and the Zero Trust architecture. It’s not a one-time check; it’s a continuous mode of operation. Once sufficient trust has been established, we move into the next stage, authentication. Once authentication has been completed, we can connect the user to the application. Zero Trust access events flip the entire security model and make it more robust. 

  • We have gone from connecting first, authenticating second to authenticate first, connect second.
zero trust model
Diagram: The Zero Trust model of connectivity.

Example of a zero-trust network access

Single Packet Authorization (SPA)

The user cannot see or know where the applications are located. SDP hides the application and creates a “dark” network by using Single Packet Authorization (SPA) for the authorization.

SPA, also known as Single Packet Authentication, aims to overcome the open and insecure nature of TCP/IP, which follows a “connect then authenticate” model. SPA is a lightweight security protocol that validates a device or user’s identity before permitting network access to the SDP. The purpose of SPA is to allow a service to be darkened behind a default-deny firewall.

SPA Use Case
Diagram: SPA Use Case. Source mrash Github.

The systems use a one-time password (OTP) generated by an agreed algorithm and embed the current password in the initial network packet sent from the client to the server. The SDP specification mentions using the SPA packet after establishing a TCP connection. In contrast, the open-source implementation from the creators of SPA uses a UDP packet before the TCP connection.

single packet authorization

Issue 2 – Fixed perimeter approach to networking and security

Traditionally, security boundaries were placed at the edge of the enterprise network in a classic “castle wall and moat” approach. However, as technology evolved, remote workers and workloads became more common. As a result, security boundaries necessarily followed and expanded from just the corporate perimeter.

The traditional world of static domains

The traditional world of networking started with static domains. Networks were initially designed to create internal segments separated from the external world by a fixed perimeter. The classical network model divided clients and users into trusted and untrusted groups. The internal network was deemed trustworthy, whereas the external was considered hostile.

The perimeter approach to network and security has several zones. We have, for example, the Internet, DMZ, Trusted, and then Privileged. In addition, public and private address spaces are used to separate network access between these zones. Private addresses were deemed more secure than public ones because they are not reachable from the Internet. However, the assumption that everything on a private address can be trusted is where our problems started.

zero trust architecture
Diagram: Zero Trust security meaning. The issues with traditional networks and security.

The fixed perimeter 

The digital threat landscape is concerning. Applications and networks are hit by external threats from all over the world. Threats also arise internally: insider threats within a user group and insider threats that cross user-group boundaries. Each of these threat types needs to be addressed.

One issue with the fixed perimeter approach is that it assumes trusted internal and hostile external networks. However, we must assume that the internal network is as hostile as the external one.

Over 80% of threats are from internal malware or malicious employees. The fixed perimeter approach to networking and security is still the foundation for most network and security professionals, even though a lot has changed since the design’s inception. 

zero trust network
Diagram: Traditional vs zero trust network. Source is thesslstore

We get hacked daily!

We are now at a stage where 45% of US companies have experienced a data breach. The 2022 Thales Data Threat Report found that almost half (45%) of US companies suffered a data breach in the past year. However, this could be higher due to the potential for undetected breaches.

We are getting hacked daily, and major networks with skilled staff are crashing. Unfortunately, the perimeter approach to networking has failed to provide adequate security in today’s digital world. It works to an extent by delaying an attack. However, a bad actor will eventually penetrate your guarded walls with enough patience and skill.

If a large gate and walls guard your house, you feel safe and think you are fully protected while inside. Yet however large and thick the perimeter protecting your home may be, there is still a chance that someone can climb the walls, reach your front door, and enter your property. If, on the other hand, a bad actor cannot even see your house, they cannot take the next step and try to breach your security.

Issue 3 – Dissolved perimeter caused by the changing environment

The environment has changed with the introduction of the cloud, advanced BYOD, machine-to-machine connections, the rise in remote access, and phishing attacks. We have many internal devices and a variety of users, such as on-site contractors that need to access network resources.

There is also a trend for corporate devices to move to the cloud, collocated facilities, and off-site to customer and partner locations. In addition, it is becoming more diversified with hybrid architectures.

zero trust network design
Diagram: Zero Trust concept.

These changes are causing major security problems for the fixed perimeter approach to networking and security. For example, with the cloud, the internal perimeter is stretched to the cloud, yet traditional security mechanisms are still being used for what is an entirely new paradigm. In addition, an abundance of remote workers now connect from a wide variety of devices and places.

Again, traditional security mechanisms are still being used. As our environment evolves, security tools and architectures must evolve. Let’s face it: the network perimeter has dissolved as your remote users, things, services, applications, and data are everywhere. In addition, as the world moves to the cloud, mobile, and the IoT, the ability to control and secure everything in the network is no longer available.

Phishing attacks are on the rise.

We have witnessed increased phishing attacks that can result in a bad actor landing on your local area network (LAN). Phishing is a type of social engineering where an attacker sends a fraudulent message designed to trick a person into revealing sensitive information to the attacker or to deploy malicious software on the victim’s infrastructure, like ransomware. The term “phishing” was first used in 1994 when a group of teens worked to obtain credit card numbers from unsuspecting users on AOL manually.

Phishing attacks
Diagram: Phishing attacks. Source is helpnetsecurity

Hackers are inventing new ways.

By 1995, they had created a program called AOHell to automate their work. Since then, hackers have continued to invent new ways to gather details from anyone connected to the internet. These actors have created several programs and types of malicious software that are still used today.

Recently, I was a victim of a phishing email. Clicking and downloading the file is very easy if you are not educated about phishing attacks. In my case, the particular file was a .wav file. It looked safe, but it was not.

Issue 4 – Broad-level access

You may have heard of broad-level access and lateral movement. With traditional network and security mechanisms, known as zone-based networking, when a bad actor lands on a particular segment, i.e., a VLAN, they can see everything on that segment. This gives them broad-level access: anyone on a VLAN can generally see everything in that VLAN, and VLAN-to-VLAN communication is not hard to achieve, resulting in lateral movement.

The issue of lateral movements

Lateral movement is the technique attackers use to progress through the organizational network after gaining initial access. Adversaries use lateral movement to identify target assets and sensitive data for their attack. Lateral Movement is the tenth tactic in the MITRE ATT&CK framework: the set of techniques attackers use to move through the network, gaining access to credentials along the way without being detected.

No intra-VLAN filtering

This is made possible as, traditionally, a security device does not filter this low down on the network, i.e., inside of the VLAN, known as intra-VLAN filtering. A phishing email can easily lead the bad actor to the LAN with broad-level access and the capability to move laterally throughout the network. 

For example, a bad actor can initially access an unpatched central file-sharing server; they move laterally between segments to the web developers’ machines and use a keylogger to get the credentials to access critical information on the all-important database servers.

They can then carry out data exfiltration with DNS or even a social media account like Twitter. However, firewalls generally do not check DNS as a file transfer mechanism, so data exfiltration using DNS will often go unnoticed. 

zero trust network design
Diagram: Zero trust application access. One of the many security threats is lateral movements.

Issue 5 – The challenges with traditional firewalls

The limited world of 5-tuple

Traditional firewalls typically control access to network resources based on source IP addresses. This creates a fundamental mismatch: we need to solve a user access problem, but we only have tools that control access based on IP addresses.

As a result, you have to group users, some of whom may work in different departments and roles, to access the same service and with the same IP addresses. The firewall rules are also static and don’t change dynamically based on levels of trust on a given device. They provide only network information.

Maybe the user moves to a riskier location, such as an Internet cafe, or their local firewall or antivirus software is turned off by malware or even by accident. Unfortunately, a traditional firewall cannot detect this; it lives in the limited world of the 5-tuple. Traditional firewalls can only express static rule sets and cannot communicate or enforce rules based on identity information.
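As a purely illustrative sketch, the contrast looks like this. The field names below are hypothetical and do not follow any particular vendor’s syntax; the point is what information each style of policy can express.

```yaml
# Illustrative only: a static 5-tuple rule versus an identity- and
# context-aware policy. Field names and values are hypothetical.

# Traditional 5-tuple rule: network information only, evaluated once.
legacy_firewall_rule:
  src_ip: 10.1.1.0/24
  dst_ip: 192.168.10.20
  protocol: tcp
  src_port: any
  dst_port: 443
  action: allow            # no notion of who the user is or how healthy the device is

# Zero-trust-style policy: identity and context drive the decision and are
# re-evaluated continuously, not just at connection setup.
zero_trust_policy:
  user_group: finance-analysts
  device_posture: [disk-encrypted, av-running, os-patched]
  location_risk: low
  application: payroll-app
  action: allow
```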

TCP 5 Tuple
Diagram: TCP 5 Tuple. Source is packet-foo.

Issue 6 – A Cloud-focused environment

To examine the cloud, consider the analogy of a public parking space. Using a public cloud is like parking your car in a public lot rather than in your own garage. In a public lot, multiple tenants park around your space, and you don’t know what they might do to your car.

Today, we are very cloud-focused, but when moving applications to the cloud, we need to be very security-focused. However, the cloud environment is less mature in providing the traditional security control we use in our legacy environment. 

So, when putting applications in the cloud, you shouldn’t leave security at its defaults. Why? Firstly, we operate in a shared model where a neighboring tenant could potentially steal your encryption keys or data, and there have been plenty of cloud breaches. We also carry over firewalls with static rulesets, along with authentication and key management issues, into cloud protection.

Control point change

One of the biggest problems is that the perimeter moves when you move to a cloud-based application. Servers are no longer under your control. Mobile phones and tablets exacerbate the problem as they can be located anywhere. So, trying to control the perimeter is very difficult. More importantly, firewalls only see and control network information when they really need application and user context.

Defining this perimeter is what ZTNA architecture and the software-defined perimeter are doing. When applications move to the cloud, the firewalls in front of them are managed by the cloud consumer, not by the I.T. teams within the cloud providers.

So when moving applications to the cloud, even though cloud providers provide security tools, the cloud consumer has to integrate security to have more visibility than they have today.

zero trust cloud
Diagram: ZTNA. Zero Trust cloud security.

Before, we had clear network demarcation points set by a central physical firewall creating inside and outside trust zones. Anything outside was considered hostile, and anything on the inside was deemed trusted.

1. Connection-centric model

The Zero Trust model flips this around and considers everything untrusted. To do this, there are no longer pre-defined fixed network demarcation points. Instead, the network perimeter initially set in stone is now fluid and software-based.

Zero Trust is connection-centric, not network-centric. Each user on a specific device connected to the network gets an individualized connection to a particular service hidden by the perimeter.

Instead of having one perimeter every user uses, SDP creates many small perimeters purposely built for users and applications. These are known as micro perimeters. Clients are cryptographically signed into these microperimeters.

security micro perimeters
Diagram: Security micro perimeters.

2. Micro perimeters: Zero trust network segmentation

The micro perimeter is based on user and device context and can dynamically adjust to environmental changes. So, as a user moves to different locations or devices, the Zero Trust architecture can detect this and set the appropriate security controls based on the new context.

The data center is no longer the center of the universe. Instead, the user on specific devices, along with their service requests, is the new center of the universe.

Zero Trust does this by decoupling the user and device from the network. The data plane is separated from the network to remove the user from the control plane. The control plane is where the authentication happens first.

Then, the data plane, the client-to-application connection, transfers the data. Therefore, the users don’t need to be on the network to gain application access. As a result, they have the least privilege and no broad-level access.

  • Concept: Zero trust network segmentation

Zero-trust network segmentation is gaining traction in cybersecurity due to its ability to provide increased protection to an organization’s network. This method of securing networks is based on the concept of “never trust, always verify,” meaning that all traffic must be authenticated and authorized before it can access the network.

This is accomplished by segmenting the network into multiple isolated zones accessible only through specific access points, which are carefully monitored and controlled.

Network segmentation is a critical component of a zero-trust network design. By dividing the network into smaller, isolated units, it is easier to monitor and control access to the network. Additionally, segmentation makes it harder for attackers to move laterally across the network, reducing the chance of a successful attack.

Zero-trust network design segmentation is essential to any organization’s cybersecurity strategy. By utilizing segmentation, authentication, and monitoring systems, organizations can ensure their networks are secure and their data is protected.

Issue 7 – The I.P. address conundrum

Everything today relies on I.P. addresses for trust, but there is a problem: I.P. addresses lack user knowledge to assign and validate the device’s trust. There is no way for an I.P. address to do this. I.P. addresses provide connectivity but do not get involved in validating the trust of the endpoint or the user.

Also, I.P. addresses should not be used as an anchor for network locations as they are today because when a user moves from one place to another, the I.P. address changes. 

security flaws
Diagram: Three main network security flaws.

Can’t have security related to an I.P. address.

But what about the security policy assigned to the old I.P. addresses? What happens when your I.P.s change? Tying anything to an I.P. address is fragile, as it gives us no good hook to hang security policy enforcement on. There are several facets to policy. For example, the user access policy touches on authorization, the network access policy touches on what to connect to, and user account policies touch on authentication.

With I.P. addresses, none of these policies is visible. This is also a significant problem for traditional firewalling, which relies on static configurations; for example, a static rule may state that this particular source can reach this destination using this port number.

Security-related issues to I.P.

  1. This has no meaning. There is no indication of why that rule exists and under what conditions a packet should be allowed from one source to another.
  2. No contextual information is taken into consideration. When creating a robust security posture, we must consider more than ports and IP addresses.

For a robust security posture, you need complete visibility into the network to see who, what, when, and how they connect with the device. Unfortunately, today’s Firewall is static and only contains information about the network.

On the other hand, Zero Trust enables a dynamic firewall with the user and device context to open a firewall for a single secure connection. The Firewall remains closed at all other times, creating a ‘black cloud’ stance regardless of whether the connections are made to the cloud or on-premise. 

The rise of the next-generation firewall?

Next-generation firewalls are more advanced than traditional firewalls. They use the information in layers 5 through 7 (session layer, presentation layer, and application layer) to perform additional functions. They can provide advanced features such as intrusion detection, prevention, and virtual private networks.

Today, most enterprise firewalls are “next generation” and typically include IDS/IPS, traffic analysis and malware detection for threat detection, URL filtering, and some degree of application awareness/control.

Like the NAC market segment, vendors in this area began a journey to identity-centric security around the same time Zero Trust ideas began percolating through the industry. Today, many NGFW vendors offer Zero Trust capabilities, but many operate with the perimeter security model.

Still, IP-based security systems

NGFWs are still IP-based systems offering limited identity and application-centric capabilities. In addition, they are static firewalls. Most do not employ zero-trust segmentation, and they often mandate traditional perimeter-centric network architectures with site-to-site connections and don’t offer flexible network segmentation capabilities. Similar to conventional firewalls, their access policy models are typically coarse-grained, providing users with broader network access than what is strictly necessary.

Diagram: Cloud Application Firewall.

Conclusion:

Zero Trust Network Design represents a paradigm shift in network security, recognizing that traditional perimeter defenses are no longer sufficient to protect against the evolving threat landscape. By implementing this approach, organizations can significantly enhance their security posture, minimize the risk of data breaches, and ensure compliance with regulatory requirements. As the digital landscape evolves, Zero Trust Network Design offers a robust framework for safeguarding sensitive information in an increasingly interconnected world.

 

Summary: Zero Trust Network Design

Traditional network security measures are no longer sufficient in today’s digital landscape, where cyber threats are becoming increasingly sophisticated. Enter zero trust network design, a revolutionary approach that challenges the traditional perimeter-based security model. In this blog post, we will delve into the concept of zero-trust network design, its key principles, benefits, and implementation strategies.

Understanding Zero Trust Network Design

Zero-trust network design is a security framework that operates on the principle of “never trust, always verify.” Unlike traditional perimeter-based security, which assumes trust within the network, zero-trust treats every user, device, or application as potentially malicious. This approach is based on the belief that trust should not be automatically granted but continuously verified, regardless of location or network access method.

Key Principles of Zero Trust

Certain key principles must be followed to implement zero trust network design effectively. These principles include:

1. Least Privilege: Users and devices are granted the minimum level of access required to perform their tasks, reducing the risk of unauthorized access or lateral movement within the network.

2. Microsegmentation: The network is divided into smaller segments or zones, allowing granular control over network traffic and limiting the impact of potential breaches or lateral movement.

3. Continuous Authentication: Authentication and authorization are not just one-time events but are verified throughout a user’s session, preventing unauthorized access even after initial login.

Benefits of Zero Trust Network Design

Implementing a zero-trust network design offers several significant benefits for organizations:

1. Enhanced Security: By adopting a zero-trust approach, organizations can significantly reduce the attack surface and mitigate the risk of data breaches or unauthorized access.

2. Improved Compliance: Zero trust network design aligns with many regulatory requirements, helping organizations meet compliance standards more effectively.

3. Greater Flexibility: Zero trust allows organizations to embrace modern workplace trends, such as remote work and cloud-based applications, without compromising security.

Implementing Zero Trust

Implementing a zero trust network design requires careful planning and a structured approach. Some key steps to consider are:

1. Network Assessment: Conduct a thorough assessment of the existing network infrastructure, identifying potential vulnerabilities or areas that require improvement.

2. Policy Development: Define comprehensive security policies that align with zero trust principles, including access control, authentication mechanisms, and user/device monitoring.

3. Technology Adoption: Implement appropriate technologies and tools that support zero-trust network design, such as network segmentation solutions, multifactor authentication, and continuous monitoring systems.

Conclusion:

Zero trust network design represents a paradigm shift in network security, challenging traditional notions of trust and adopting a more proactive and layered approach. By implementing the fundamental principles of zero trust, organizations can significantly enhance their security posture, reduce the risk of data breaches, and adapt to evolving threat landscapes. Embracing the principles of least privilege, microsegmentation, and continuous authentication, organizations can revolutionize their network security and stay one step ahead of cyber threats.

Ansible Variables

Ansible Variables | Ansible Automation

 

Ansible inventory variable

 

Ansible Variables

In the world of automation, Ansible has emerged as a popular choice for managing and configuring systems. One of the key features that sets Ansible apart is its ability to work with variables. Variables in Ansible enable users to define and store values that can be used throughout the playbook, making it a powerful tool for automation.

Variables in Ansible can be defined in a variety of ways. They can be set globally, at the playbook level, or even at the task level. This flexibility allows users to customize their automation process based on their needs.

One common use case for variables is storing configuration values specific to different environments. For example, suppose you have a playbook that deploys a web application. Using variables, you can define the database connection string, the server IP address, and other environment-specific values separately for development, staging, and production environments. This makes it easy to reuse the same playbook across different environments without modifying the code.
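As a minimal sketch of that idea, the play below loads a per-environment variable file. The file paths, the deploy_env variable (assumed to be supplied, for example with -e), and the variable names are hypothetical.

```yaml
# A minimal sketch: the same play reused across environments by swapping
# variable files. File names and values are hypothetical.
- name: Deploy the web application
  hosts: webservers
  vars_files:
    - "vars/{{ deploy_env }}.yml"   # e.g. vars/development.yml or vars/production.yml
  tasks:
    - name: Show which database this environment points at
      ansible.builtin.debug:
        msg: "Connecting to {{ db_connection_string }} on {{ server_ip }}"
```

Each environment file (for example, vars/production.yml) would then define db_connection_string and server_ip for that environment.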

Another powerful feature of variables in Ansible is their ability to be dynamically generated. This means that you can use calculated or fetched variables instead of hardcoding values at runtime. For example, the “lookup” plugin can read values from external sources like files or databases and assign them to variables. This makes your automation process more dynamic and adaptable.

 

Highlights: Ansible Variables

  • The Process of Decoupling

For your automation journey, you want to be as flexible as possible. For this reason, within the Ansible architecture, we have a process known as decoupling: separating site-specific data from static code. Anything specific to a server or managed device, such as an IP address, can be replaced with Ansible variables. As a best practice, always aim to keep playbooks flexible; if you want to share them with someone else, all you need to change is the variables.

  • Variable Locations

As you know, variables can be defined in several places. Each place you define a variable, such as the inventory (with inventory variables) or the play header, has its own position in the order of precedence. One of the most common approaches is setting variables in a task; Ansible allows this directly with the set_fact module. We will look at setting variables in a task in a moment.

So, for an Ansible architecture with more extensive playbooks, think carefully about the best place to hold your variables so that your playbooks do not become site-specific. With Ansible, you can execute tasks and playbooks on multiple systems with a single command.

  • Ansible Tower

With Ansible Tower, you can have very complex automation requirements with the push of a button. Every site will have variations, and Ansible uses variables to manage system differences. To represent the variations among those systems, you can create variables with standard YAML syntax, including lists and dictionaries. 

 

Before you proceed, you may find the following posts helpful:

  1. Ansible Architecture
  2. Security Automation
  3. Network Configuration Automation
  4. Software Defined Perimeter Solutions
  5. Security Automation

 



Ansible Variables Preference

Key Ansible Variables Discussion points:


  • Ansible architecture and decoupling.

  • Defining Ansible variables.

  • Example: Ansible set variables in task.

  • Facts and variables.

  • Inventory variables.

  • Conditionals and loops.

 

  • A key point: Video on Ansible inventory and its use of Ansible inventory variable.

This video will discuss Ansible automation and the Ansible inventory used to list the target hosts. In addition, the inventory can have a particular Ansible inventory variable known as behavioral variables that can tune how you connect to the managed assets.

 

The Ansible Inventory | Ansible Automation

Back to basics with Ansible Variables

Ansible is open-source automation and orchestration software that can automate most of your operations with IT infrastructure components, including servers, storage, networks, and application platforms. Ansible is one of the most popular automation tools in the IT world and has strong community support with more than 5,000 contributors worldwide. With Ansible, we have the use of variables. 

Ansible uses variables to manage differences between systems. With Ansible, you can execute tasks and playbooks on multiple systems with a single command. To represent the variations among those systems, you can create variables with standard YAML syntax, including lists and dictionaries.

 

Defining Ansible Variables
Diagram: Ansible variables.

 

Defining Ansible Variables | Ansible Set Variables in Task

  • A key point: Ansible set variables in task

Ansible is not a full-fledged programming language. However, it does have several programming language components. One of the most significant of these is variable substitution. The most straightforward way to define variables is to put a vars section in your playbook with the names and values of variables.

Here, we can have Ansible set variables in a task. The following lists the various ways you can set variables in Ansible. Keep in mind that, when doing so, there is an order of precedence.

There are several places where you can define your variables. You can define these variables in your playbooks, inventory, reusable files or roles, or at the command line. During a playbook run, you can create variables by registering a task’s return value or value as a new variable.

When defining variables in multiple places, those variables have variable precedence. After creating variables, you can use those variables in module arguments, such as conditional “when” clauses, templates, and loops. All of these are potent constructs to have in your automation toolbox.

 

Ansible Variables – where they can be defined:

  • vars: section in the play header

  • set_fact module

  • vars_files

  • vars_prompt

  • Task variables (for example, register)

  • At runtime (--extra-vars / -e)

 

Highlighting Ansible set variables in task

  • Define variables: Vars: Section.

If you are starting your automation journey, the simplest way to define variables is to put a vars section in your playbook with the names and values of variables. This allows you to define several configuration-related variables. So, to define variables in plays, include a vars: section in the header of the play where the variables are needed.

Variables defined in plays are only valid within that specific play and don’t have playbook scope. So if you need a variable in a different play, you must define it again. It may be inconvenient to you and difficult to manage across extensive playbooks with multiple teams working on playbook development.
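A minimal sketch of a vars: section in the play header follows; the package name and port value are hypothetical.

```yaml
# Play-scoped variables defined in the play header.
- name: Install and configure a web server
  hosts: webservers
  vars:
    web_package: nginx
    http_port: 8080
  tasks:
    - name: Install the web server package
      ansible.builtin.package:
        name: "{{ web_package }}"
        state: present

    - name: Report the port this play will configure
      ansible.builtin.debug:
        msg: "{{ web_package }} will listen on port {{ http_port }}"
```

As noted above, these variables are only visible inside this play; a second play in the same playbook would need its own definitions.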

 

  • Define variables: Set_fact Module.

We also have the set_fact module. The set_fact is a module used anywhere in a play to set variables. Any variable set this way applies as a fact to the host in which it is set. The set_fact relates the variable to the host used in the play.

Here, you can dynamically set variables based on the result of any task in the playbook. So set_fact is dynamically defining variables. Keep in mind that setting variables this way will have a playbook scope.
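A minimal sketch of set_fact follows; the derived variable name and URL scheme are hypothetical.

```yaml
# set_fact: the variable is computed at runtime and attached to the host
# for the rest of the playbook run.
- name: Derive a host-specific fact
  hosts: webservers
  tasks:
    - name: Build an application URL from discovered facts
      ansible.builtin.set_fact:
        app_url: "https://{{ ansible_facts['fqdn'] }}:8443"

    - name: Use the dynamically set variable
      ansible.builtin.debug:
        msg: "The application will be reachable at {{ app_url }}"
```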

 

  • Define variables: Vars_files

You can also put variables into one or more files using the section called vars_files. This lets you put variables in a file instead of directly in the playbook. What I like most about setting variables this way is that it allows you to separate variables that contain sensitive information. When you define variables in reusable variable files, the sensitive variables are separated from playbooks.

This separation enables you to store your playbooks in, for example, source control software and even share the playbooks without the risk of exposing passwords or other sensitive and personal data. So, when you put variables in files, they are referenced in the playbook using vars_files.

Use vars_files: to specify a list of files containing the variables you want to include. This is convenient when you want to manage the variables independently of the playbook using them, and it is helpful for security purposes.
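A minimal sketch of vars_files follows; the file paths and variable names are hypothetical, and a real secrets file would typically also be protected with Ansible Vault.

```yaml
# vars_files: sensitive values live outside the playbook.
- name: Configure the database tier
  hosts: dbservers
  vars_files:
    - vars/common.yml
    - vars/secrets.yml        # assumed to define db_admin_user and db_admin_password
  tasks:
    - name: Show the admin account being used (never print real passwords)
      ansible.builtin.debug:
        msg: "Using admin account {{ db_admin_user }}"
```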

 

  • Define variables: Vars_prompt

So here, we can use vars_prompt in the play header to prompt users for a variable value. This has playbook scope. By default, the variable is flagged as private, so the user does not see anything while entering the value. We can change this by setting private to no.
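A minimal sketch of vars_prompt follows; the variable names and prompts are hypothetical.

```yaml
# vars_prompt: values are requested interactively when the playbook runs.
- name: Create a user account interactively
  hosts: all
  vars_prompt:
    - name: new_username
      prompt: "Username to create"
      private: no              # show the input; private defaults to yes (hidden)
    - name: new_password
      prompt: "Password for the new user"
      private: yes             # hide the input while typing
  tasks:
    - name: Confirm the requested username
      ansible.builtin.debug:
        msg: "A user named {{ new_username }} will be created"
```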

 

  • Define variables: Defining variables at runtime.  

When you run your playbook, you can define variables by passing variables at the command line using the –extra-vars (or -e) argument. 
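A minimal sketch follows; the playbook name, group name, and variable values passed with -e are hypothetical.

```yaml
# Runtime variables, passed on the command line (hypothetical invocation):
#
#   ansible-playbook site.yml -e "target_env=staging app_version=1.4.2"
#
# Inside the playbook, they are used like any other variable:
- name: Deploy a specific version to a chosen environment
  hosts: "{{ target_env }}"
  tasks:
    - name: Report what was passed in on the command line
      ansible.builtin.debug:
        msg: "Deploying version {{ app_version }} to {{ target_env }}"
```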

 

  • Define variables: Task Variables.

Task variables are made from data discovered while executing tasks or in the fact-gathering phase of a play. These variables are host-specific and are added to the host’s host vars. Variables of this type can be discovered via gather_facts and fact modules, populated from task return data via the register task key, or defined directly by a task using the set_fact or add_host modules. 
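A minimal sketch of a task variable created with register follows; the file path is hypothetical.

```yaml
# register: the return data of one task feeds a later task on the same host.
- name: Capture task output as a variable
  hosts: webservers
  tasks:
    - name: Check whether the service unit file exists
      ansible.builtin.stat:
        path: /etc/systemd/system/myapp.service   # hypothetical path
      register: unit_file

    - name: Report the result gathered by the previous task
      ansible.builtin.debug:
        msg: "Unit file present: {{ unit_file.stat.exists }}"
```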

 

  • A key point: Video on Ansible Job Templates

In this product demonstration, we will go through the key components of Ansible Tower and its use of Job Templates. We will look at the different Job Template parameters that you can use to form an automation job that you can deploy to your managed assets.

 

Ansible Tower Job Template

 

Facts / Variables

  • Ansible Fact: System Variables

Ansible facts are a type of variable. You don’t define Ansible facts; they are discovered. Facts are system variables that contain information about the managed system. Each play starts with an implicit task to gather facts; this can be disabled in the play header. Gathering facts takes time, so if you are not going to use them, you can disable it.

We can also gather facts manually by using the setup module. All of the facts are stored in one big variable called ansible_facts, within which they are categorized into second-tier variables. Because facts are just variables, you can use them in conditionals and when statements.
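A minimal sketch of inspecting facts follows.

```yaml
# Facts: gathered implicitly at the start of the play, refreshed manually
# with the setup module if needed, and referenced via ansible_facts.
- name: Inspect discovered facts
  hosts: all
  tasks:
    - name: Re-run fact gathering manually (optional)
      ansible.builtin.setup:

    - name: Use second-tier fact variables
      ansible.builtin.debug:
        msg: >
          {{ ansible_facts['hostname'] }} runs
          {{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_version'] }}
```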

 

    • Speeding up Fact Gathering

Fact collection can be slow when you work against many hosts, so you can set up a fact cache. If fact caching is enabled, Ansible stores facts in the cache the first time it connects to a host. This is configured in your ansible.cfg, including the “fact_caching_timeout” value. Also, if a play does not reference any Ansible facts, you can turn off fact gathering for that play with the “gather_facts” clause in the play header.
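A minimal sketch of disabling fact gathering for a play that never uses facts follows. (The fact cache itself is configured in ansible.cfg through the fact_caching and fact_caching_timeout settings; the values depend on your environment.)

```yaml
# Skip fact gathering when a play does not reference any ansible_facts.
- name: Fast play with no fact collection
  hosts: all
  gather_facts: false
  tasks:
    - name: A task that does not need any facts
      ansible.builtin.ping:
```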

 

Ansible Inventory Variable

Ansible Variables are a key component of Automation, and they allow for dynamic play content and reusable plays across different sets of an inventory. Variable data, such as specific details on how to connect to a particular host in your inventory, can be included, along with an inventory, in various ways.

While Ansible can discover data about a system during the setup phase, not all data can be discovered. We can define data in the inventory to expand on what Ansible has been able to discover.

 

Ansible variables: The Ansible inventory variables

    • Host and group variables

In the inventory, you can store variable values related to a specific host or group. This allows you to add variables directly to the hosts and groups in your main inventory file. As you add more managed nodes to your Ansible inventory, you will likely want to store variables in separate host and group variable files instead.

    • [host_var and group_var]

Ansible looks for host variable files in a directory called host_vars and group variable files in a directory called group_vars. Remember that Ansible expects these directories to be either in the directory that contains your playbooks or alongside your inventory file. You can break things out even further: Ansible lets us define, for example, group_vars/production as a directory instead of a file.
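A minimal sketch of that layout follows; the paths, hostnames, and values are hypothetical.

```yaml
# Directory layout (relative to the playbook or inventory):
#
#   inventory/hosts
#   group_vars/webservers.yml
#   host_vars/web1.example.com.yml
#
# group_vars/webservers.yml
http_port: 8080
ntp_server: ntp.example.com

# host_vars/web1.example.com.yml would then override or extend these values
# for that single host, for example:
#   http_port: 9090
```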

 

Behavior inventory variables

Behavioral inventory parameters allow you to describe your machines with additional parameters in your inventory file; the ansible_connection parameter, for example, may be helpful. Ansible supports multiple means of transport, which it uses to connect to the managed host. Here is a list of some common behavioral inventory variables and the behaviors they modify, with a short inventory sketch after the diagram below:

    1. ansible_host: This is the DNS name or the Docker container name that Ansible will initiate a connection to.
    2. ansible_port: This specifies the port number that Ansible will use to connect to the inventory host if it is not the default value 22.
    3. ansible_user: This specifies the username Ansible will use to connect with the inventory host, regardless of the connection type.
Ansible Automation Best Practices
Diagram: Ansible automation best practices. Ansible inventory variables.
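A minimal sketch of behavioral inventory variables, using the YAML inventory format, follows; the hostnames, address, port, and user are hypothetical.

```yaml
# YAML inventory with behavioral variables attached to a host.
all:
  children:
    webservers:
      hosts:
        web1.example.com:
          ansible_host: 203.0.113.10   # address Ansible actually connects to
          ansible_port: 2222           # non-default SSH port
          ansible_user: deploy         # remote user for the connection
```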

 

  • A key point: Ansible inventory scalability

You can use the inventory to put host and group variables if you don’t have many hosts. However, as your environment gets larger, managing variables in the inventory will become more challenging. In this case, you need to find a more scalable approach to keep track of your host and group variables.

Even though Ansible variables can hold Booleans, strings, lists, and dictionaries, you can specify only Booleans and strings in an inventory file. Therefore, a more scalable approach to keeping track of host and group variables is to create a separate variable file for each host and group. Ansible expects these variable files to be in YAML format. This allows you to break the inventory into multiple files.

 

Summary: Ways to define ansible variables in playbooks

There are several different ways that variables can be defined.

 

Defining Ansible Variables

  • In the play header using the vars: section

  • In the play header using vars_files

  • Using the set_fact module in a play

  • On the command line with the -e key=value option

  • As inventory variables

  • Using vars_prompt to request values from the user while running the playbook

  • As facts, which are discovered variables that contain system properties

 

Highlighting Ansible conditionals

In a playbook, you may want to execute different tasks depending on the value of a fact, a variable, or the result of a previous task. You may also wish the value of some variables to depend on the value of other variables. You can do all of these things with conditionals. Ansible uses Jinja2 tests and filters in conditionals. Basic conditionals are used with the when clause.

The most straightforward conditional statement applies to a single task. Create the task, then add a when statement that applies a test. Ansible evaluates the test for each host when running the task or playbook. For example, if you are installing MySQL on multiple machines, some of which have SELinux enabled, you might have a task to configure SELinux to allow MySQL to run.

You would only want that task to run on machines with SELinux enabled. In general, you often want to execute or skip a task based on facts. With conditionals based on facts, you can install a specific package only when the operating system is a particular version, skip configuring a firewall on hosts with internal IP addresses, or perform cleanup tasks only when a filesystem is getting full.
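A minimal sketch of the SELinux example follows; the SELinux boolean, package name, and OS family check are illustrative, and the seboolean module comes from the ansible.posix collection.

```yaml
# Fact-based when conditions.
- name: Conditional MySQL and SELinux configuration
  hosts: dbservers
  tasks:
    - name: Allow MySQL under SELinux only where SELinux is enabled
      ansible.posix.seboolean:
        name: mysql_connect_any
        state: true
        persistent: true
      when: ansible_facts['selinux']['status'] == "enabled"

    - name: Install a package only on a particular OS family
      ansible.builtin.package:
        name: mysql-server
        state: present
      when: ansible_facts['os_family'] == "RedHat"
```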

 

  • A key point: Video on Ansible Automation

In this tutorial, we are going to discuss Ansible automation, in particular Ansible Engine run from the CLI. We will discuss the challenging landscape forcing us to move to automation while introducing Ansible playbooks and the other main Ansible components.

 

Ansible Automation Explained

 

Ansible conditionals and When Clause

You can also create conditionals based on variables defined in the playbooks or inventory. So, we have playbook variables, registered variables, and facts that can all be used in conditions and ensure that tasks only run if specific conditions are true. 

 

Handlers and When Statement

There are several ways Ansible can be configured for conditional task execution. We have, for example, handlers, which run only when a task has changed something. Then we have the very powerful when statement, which allows you to run tasks only when specific conditions are true. You can also use register in combination with when statements.

Automation Tutorial
Diagram: Automation tutorial.

 

Using Handlers for conditional task execution

A handler is a task that is only executed when triggered by a task that has changed something. Handlers are executed after all tasks in a play, so you need to organize the contents of your playbook accordingly. If any task fails after the task that notified the handler, the handlers are not executed by default. We can use force_handlers to change this.

Keep in mind that handlers are operational when something has changed. We have a force handler that allows you to force the handler to be started even if subsequent tasks are failing. Simply put, a handler is a particular type of Task that is called only if something changes. 
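A minimal sketch of a handler follows; the template name and service are hypothetical.

```yaml
# A handler runs only if the task that notifies it reports a change.
- name: Push web server configuration
  hosts: webservers
  tasks:
    - name: Deploy the nginx configuration file
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx

  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```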

 

  • Handlers in Pre and Post tasks

Each task section in a playbook is handled separately; any handler notified in pre_tasks, tasks, or post_tasks is executed at the end of each section. As a result, it is possible to execute one handler several times in one play.

 

Using Blocks

Blocks create logical groups of tasks. Blocks also offer ways to handle task errors, similar to exception handling in many programming languages. For error handling, you use a block to define the main tasks to run and then rescue to define tasks that run if tasks in the block fail. You can use always to define tasks that will run regardless of the success or failure of the block and rescue sections.
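A minimal sketch of block, rescue, and always follows; the script path is hypothetical.

```yaml
# block / rescue / always for error handling.
- name: Error handling with blocks
  hosts: webservers
  tasks:
    - block:
        - name: Attempt the risky change
          ansible.builtin.command: /usr/local/bin/migrate-db   # hypothetical script
      rescue:
        - name: Run only if a task in the block failed
          ansible.builtin.debug:
            msg: "Migration failed, rolling back"
      always:
        - name: Run regardless of success or failure
          ansible.builtin.debug:
            msg: "Cleanup and reporting always happen"
```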

 

  • A keynote: Ansible loops

What happens, however, if you have a single task but need to run it against a list of data, for example, creating several user accounts, directories, or something more complex? Like any programming language, loops in Ansible provide an easier way of executing repetitive tasks using fewer lines of code in a playbook.

Examples of commonly used loops include changing ownership of several files and directories with the file module, creating multiple users with the user module, and repeating a polling step until a certain result is reached.
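A minimal sketch of a loop creating several users with one task follows; the usernames are hypothetical.

```yaml
# One task, repeated over a list of items.
- name: Create several users with a single task
  hosts: all
  tasks:
    - name: Ensure application users exist
      ansible.builtin.user:
        name: "{{ item }}"
        state: present
      loop:
        - alice
        - bob
        - carol
```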

 

  • A final point: Managing failures with the Fail Module

Ansible looks at the exit status of a task to determine whether it has failed. When any task fails, Ansible aborts the rest of the playbook on that host and continues with the next host. We can change this behavior in a few ways. For example, we can use ignore_errors in a task to ignore failure, or force_handlers to force a handler that has been triggered to run even if another task fails.

But remember, for a handler to run, there still needs to be a change. We can also use failed_when, which allows you to specify what to look for in command output to recognize a failure. You may have a playbook used to clean up resources, and you want the playbook to ignore every error, keep going until the end, and then fail if there were errors. In this case, when using the fail module, the failing tasks must have ignore_errors set to yes.
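A minimal sketch of that cleanup pattern follows; the command path and the "ERROR" string are hypothetical.

```yaml
# ignore_errors keeps the play going, failed_when defines what counts as a
# failure, and the fail module ends the run deliberately at the end.
- name: Clean up resources and report errors at the end
  hosts: all
  tasks:
    - name: Run a cleanup script and keep going even if it fails
      ansible.builtin.command: /usr/local/bin/cleanup.sh
      register: cleanup_result
      ignore_errors: yes
      failed_when: "'ERROR' in cleanup_result.stdout"

    - name: Fail the play at the end if the cleanup reported errors
      ansible.builtin.fail:
        msg: "Cleanup reported errors on this host"
      when: cleanup_result is failed
```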

 

To summarize, Ansible variables are a powerful tool for automation. They allow users to define and store values that can be used throughout the playbook, making it easy to customize and adapt automation processes. Whether storing configuration values, dynamically generating variables, or working with complex data structures, Ansible variables provide the flexibility and power needed for efficient automation.

 

Ansible inventory variable

Ansible Architecture Diagram

Ansible Architecture | Ansible Automation

 

 

Ansible Architecture

Ansible has emerged as one of the most popular automation tools, revolutionizing how organizations manage and deploy their IT infrastructure. Ansible has gained widespread adoption across diverse industries with its simple yet robust architecture. In this blog post, we will delve into the intricacies of Ansible architecture, exploring how it works and its key components.

At its core, Ansible follows a client-server architecture model. It has three main components: control nodes, managed nodes, and communication channels. Let’s take a closer look at each of these components.

1. Control Node:

The control node acts as the central management point in Ansible architecture. It is the machine from which Ansible is installed and executed. The control node stores the inventory, playbooks, and modules to manage the managed nodes. Ansible uses a declarative language called YAML to define playbook tasks and configurations.

2. Managed Nodes:

Managed nodes are the machines that Ansible is driving. These can be physical servers, virtual machines, or even network devices. Ansible connects to managed nodes over SSH or WinRM protocols, enabling seamless management across different operating systems.

3. Communication Channels:

Ansible utilizes SSH or WinRM protocols to establish secure communication channels between the control node and managed nodes. SSH is used for Linux-based systems, whereas WinRM is for Windows-based systems. This allows Ansible to execute commands, transfer files, and collect information from managed nodes.

 

Highlights: Ansible Architecture

  • Playbooks and Inventory

As a best practice, you don’t want your Ansible architecture, which consists of playbooks and inventory, to be too site-specific. Instead, you need a certain level of abstraction that keeps precise, site-specific information out of the code. Therefore, to develop flexible code, you must separate site-specific information from the code, which is done with variables in Ansible.

Remember that when you separate dynamic code from static, site-specific information, you can use that code on any site with minor modifications to the variables themselves. However, variables can live in different places, and where you place them, such as a play header or the inventory, carries different precedence. So, to hold the site-specific detail, variables are used throughout your Ansible deployment architecture.

 

Before you proceed, you may find the following posts helpful for pre-information:

  1. Network Configuration Automation
  2. Ansible Variables
  3. Network Traffic Engineering

 

  • A key point: Video on Ansible automation and Ansible architecture.

In this video, we will discuss Ansible automation and Ansible architecture, in particular Ansible Engine run from the CLI. This is compared to a different Ansible deployment architecture with Ansible Tower, a platform approach to automation. We will discuss the challenging landscape forcing us to move to automation while introducing Ansible playbooks, Ansible variables, and the other main components.

 

Ansible Automation Explained

 

Ansible Architecture: The Drive for Automation

The move to an Ansible architecture has been driven by several megatrends. Firstly, the rise of distributed computing made the manual approach to almost anything in the IT environment obsolete, not only because it causes many errors and mistakes but also because the configuration drift from the desired to the actual state was considerable.

Not only is this an operational burden, but also a considerable security risk. Today, deploying applications by combining multiple services that run on a distributed set of resources is expected. As a result, configuration and maintenance are more complex than in the past.

 



Ansible Architecture


Key Ansible Architecture Discussion Points:


  • The issues with configuration drift.

  • Ansible CLI and Ansible Tower.

  • Ansible components.

  • Key Ansible features.

  • Ansible modularity and scalability.

  • Ansible deployment architecture.

  • Ansible Architecture diagram.

 

You have two options to implement all of this. First, you can connect up these services by, for example, manually spinning up the servers, installing the necessary packages, and SSHing to each one, or you can go down the path of automation, in particular, automation with Ansible.

So, with Ansible deployment architecture, we have the Automation Engine, the CLI, and Ansible Tower, which is more of an automation platform for enterprise-grade automation. This post focuses on Ansible Engine. 

Ansible Automation Requirements
Diagram: Ansible automation requirements. Ansible deployment architecture.

 

As a quick note, if you have environments with more than a few teams automating, I recommend Ansible Tower or the open-source version of AWX. This is because Ansible Tower has a 60-day trial license, while AWX is fully open-sourced and does not require a license. The open-source version of AWX could be a valuable tool for your open networking journey.

 

Red Hat Ansible Tower
Diagram: Red Hat Ansible Tower. Source RedHat.

 

  • A Key Point: Risky: The Manual Way.

Let me put it this way. If you are configuring manually, you will likely maintain all the settings manually. Or, more importantly, what about mitigating vulnerabilities and determining what patches or packages are installed in a large environment?

How can you ensure all your servers are patched and secured manually? Manipulating configuration files by hand is a tedious and error-prone task. Not to mention time-consuming. Equally, performing pattern matching to make changes to existing files is risky. 

 

  • A Key Point: The issue of Configuration Drift

The manual approach will result in configuration drift, where some servers will drift from the desired state. Configuration drift is caused by inconsistent configuration items across devices, usually due to manual changes and updates and not following the automation path. Ansible is all about maintaining the desired state and eliminating configuration drift.

 

Ansible Architecture – core automation features:

  • Modules

  • Module utilities

  • Plugins

  • Inventory

  • Playbooks

 

Ansible Workflow:

Ansible operates on a push-based model, where the control node pushes configurations and commands to the managed nodes. The workflow involves the following steps:

1. Inventory:

The inventory is a file that contains a list of managed nodes. It provides Ansible with information such as IP addresses, hostnames, and connection details required to establish communication.

2. Playbooks:

Playbooks are YAML files that define the desired state of the managed nodes. They consist of a series of tasks or plays, where each task represents a specific action to be executed on the managed nodes. Playbooks can be as simple as a single task or as complex as a multi-step deployment process.

3. Execution:

Ansible executes playbooks on the control node and communicates with managed nodes to perform the defined tasks. It uses modules, which are small programs written in Python or other scripting languages, to interact with the managed nodes and carry out the required operations.

4. Reporting:

Ansible provides detailed reports on the execution status of tasks, allowing administrators to monitor and troubleshoot any issues that arise. This helps maintain visibility and ensure the desired configurations are applied consistently across the infrastructure.
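Tying the workflow together, a minimal sketch of step 1, the inventory, in YAML format follows; the hostnames are hypothetical. A playbook (step 2) would then target the webservers group, and step 3 runs it with a command such as: ansible-playbook -i inventory.yml site.yml.

```yaml
# A small YAML inventory of managed nodes, grouped by role.
all:
  children:
    webservers:
      hosts:
        web1.example.com:
        web2.example.com:
    dbservers:
      hosts:
        db1.example.com:
```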

  • A key point: Video on Ansible Job Template

In this product demonstration, we will go through the key components of Ansible Tower and its use of Job Templates. We will look at the different Job Template parameters that you can use to form an automation job that you can deploy to your managed assets.

Ansible Tower Job Template

Advantages of Ansible Architecture:

The Ansible architecture offers several advantages, making it a preferred choice for automation:

1. Simplicity:

Ansible’s architecture is designed to be simple and easy to understand. Using YAML playbooks and declarative language allows administrators to define configurations and tasks in a human-readable format.

2. Agentless:

Unlike traditional configuration management tools, Ansible requires no agent software installed on managed nodes. This reduces complexity and eliminates the need for additional overhead.

3. Scalability:

Ansible’s architecture is highly scalable, enabling administrators to manage thousands of nodes simultaneously. SSH and WinRM protocols allow for efficient communication and coordination across large infrastructures.

 

Components of Ansible Deployment Architecture

    • Configuration management

The Ansible architecture is based on a configuration management tool that can help alleviate these challenges. Ansible replaces the need for an operator to tune configuration files manually and does an excellent job in application deployment and orchestrating multi-deployment scenarios. It can also be integrated into CI/CD pipelines.

In reality, Ansible is relatively easy to install and operate. However, it is not a single entity. Instead, it comprises tools, modules, and software-defined infrastructure that form the ansible toolset configured from a single host that can manage multiple hosts.

We will discuss the value of idempotency with Ansible modules later. Even with the idempotent nature of modules, you can still have users of Ansible automating over each other. Ansible Tower or AWX is the recommended solution for multi-team automation efforts.

Ansible vs Tower
Diagram: Ansible vs Tower. Source Red Hat.

 

    • Pre-deployed infrastructure: Terraform

Ansible does not deploy the infrastructure; you could use other solutions like Terraform that are best suited for this. Terraform is infrastructure as a code tool. Ansible Engine is more of a configuration as code. The physical or virtual infrastructure needs to be there for Ansible to automate, compared to Terraform, which does all of this for you.

Ansible is an easy-to-use DevOps tool that manages configuration as code in the same way across any size of environment. Therefore, the size of the domain is irrelevant to Ansible.

As Ansible Connectivity uses SSH that runs over TCP, there are multiple optimizations you can use to increase performance and optimize connectivity, which we will discuss shortly. Ansible is often described as a configuration management tool and is typically mentioned along the same lines as Puppet, Chef, and Salt. However, there is a considerable difference in how they operate. Most notably, the installation of agents.

 

    • Ansible architecture: Agentless

The Ansible architecture is agentless and requires nothing to be installed on the managed systems. Its architecture is serverless and agentless, so it has a minimal footprint. Some configuration management systems, such as Chef and Puppet, are “pull-based” by default.

There, installed agents periodically check in with the central service and pull down configuration. Ansible is agentless and does not require the installation of an agent on the target for Ansible to communicate with the target host.

However, it requires connectivity from the control host to the target inventory ( which contains a list of hosts that Ansible manages) with a trusted relationship. For convenience, we can have passwordless sudo connectivity between our hosts. This allows you to log in without a password and can be a security risk if someone gets to your machines; they could have escalated privileges on all the Ansible-managed hosts.

 

Agentless Automation
Diagram: Agentless Automation. Source Docs at Ansible

 

Ansible Deployment Architecture

Ansible Architecture Key Features

  • Easy-to-read syntax

  • Not a full programming language

  • Jinja2 templating language:

  • Scalability

  • Optimized SSH connectivity

  • Modularity 

 

Key Ansible features:

Easy-to-Read Syntax: Ansible uses the YAML file format and Jinja2 templating. Jinja2 is the template engine for the Python programming language. Ansible uses Jinja2 templating to access variables and facts and extends the defaults of Ansible for more advanced use cases.

Not a full programming language: Remember that Ansible is not a full-fledged programming language, but it has several good features. One of the most important of these is a variable substitution, or using the values of variables in strings or other variables.

In addition, the variables in Ansible make the Ansible playbooks, which are like executable documents, very flexible. Variables are a powerful construct within Ansible and can be used in various ways. Nearly every single thing done in Ansible can include a variable reference. We also have dynamic variables known as facts.

Jinja2 templating language: The defaults of Ansible are extended using Jinja2 templating language. In addition, Ansible’s use of Jinja2 templating adds more advanced use cases to Ansible. One great benefit is that it is self-documenting, so when someone looks at your playbook, it’s easy to understand, unlike a Python code or a Bash script.

So not only is Ansible easy to understand, but with just a few lines of YAML, which is the language used for Ansible, you can install, let’s say, web servers on as many hosts as you like.

Ansible Architecture: Scalability: Ansible can scale. For example, Ansible uses advanced features like SSH multiplexing to optimize SSH performance. Some use cases manage thousands of nodes with Ansible from a single machine.

SSH Connection: Parallel connections: We have three managed hosts: web1, web2, and web3. Ansible makes parallel SSH connections to web1, web2, and web3 and then executes the first task on the list on all three hosts simultaneously. In this example, the first task is installing the Nginx package, so the play would look like the sketch below. Ansible uses the SSH protocol to communicate with hosts, except for Windows hosts.
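A minimal sketch of that task follows; host and group names are assumed to match the inventory described earlier.

```yaml
# Install the Nginx package on web1, web2, and web3; Ansible runs the task
# on all three hosts in parallel over SSH.
- name: Install Nginx on all web hosts
  hosts: web1,web2,web3
  become: true
  tasks:
    - name: Ensure the nginx package is present
      ansible.builtin.package:
        name: nginx
        state: present
```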

This SSH service is usually integrated with the operating system authentication stack, enabling you to use Kerberos to improve authentication security. Ansible uses the same authentication methods that you will already be familiar with. SSH keys are typically the easiest way to proceed as they remove the need for users to input the authentication password every time a playbook is run.
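Below is a minimal sketch of that playbook. It assumes the three hosts are grouped as webservers in the inventory and that they are Debian/Ubuntu machines using the apt package manager; both are assumptions for illustration only.

```yaml
---
- name: Install Nginx on all web servers
  hosts: webservers          # assumed inventory group containing web1, web2, web3
  become: true               # escalate privileges to install the package
  tasks:
    - name: Ensure the nginx package is present
      ansible.builtin.apt:   # swap for yum/dnf on RHEL-family hosts
        name: nginx
        state: present
```

Running the playbook against the group executes this one task on all three hosts in parallel.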

Ansible Connectivity
Diagram: Ansible connectivity and Ansible deployment architecture.

 

  • A key point: Optimizing SSH 

Ansible uses SSH to manage hosts, and establishing an SSH connection takes time. However, you can optimize SSH with several features. Because the SSH protocol runs on top of TCP, you need to create a new TCP connection each time you connect to a remote host with SSH.

You don’t want to open a new SSH connection for every activity. Here, you can use ControlMaster, which allows multiple simultaneous SSH sessions with a remote host over one network connection. ControlPersist, or multiplexing, keeps the connection open for a configurable number of seconds. Pipelining allows more commands to use a single SSH connection. Recall how Ansible executes a task:

    1. It generates a Python script based on the module being invoked
    2. It then copies the Python script to the host
    3. Finally, it executes the Python script

The pipelining optimization executes the Python script by piping it over the existing SSH session instead of copying it, so we use one SSH session instead of two. These options can be configured in ansible.cfg under the ssh_connection section, where you can specify how these connections are used.
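These settings normally live in ansible.cfg, but as a hedged sketch, similar behavior can also be expressed per host group with Ansible's standard connection variables, for example in group_vars; the group name and the 60-second timeout are illustrative.

```yaml
---
# group_vars/webservers.yml (hypothetical file and group name)
# Reuse one SSH connection for many tasks via OpenSSH multiplexing.
ansible_ssh_common_args: "-o ControlMaster=auto -o ControlPersist=60s"
# Send module code over the open SSH session instead of copying it first.
ansible_pipelining: true
```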

 

  • A note on scalability: Ansible and modularity

Ansible scales down well because simple tasks are easy to implement and understand in playbooks. It also scales up well because it allows you to decompose complex jobs into smaller pieces, so we can bring the concept of modularity into playbooks as they become more involved. I like using tags during playbook development; they can save time and effort when testing different parts of the playbook once you know certain parts are 100% working.
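As a hedged sketch of how tags might be used, with task and tag names chosen purely for illustration:

```yaml
---
- name: Configure the web tier
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
      tags: [install]

    - name: Deploy the nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      tags: [configure]

# Re-run only the configuration task once the install part is known to work:
#   ansible-playbook site.yml --tags configure
```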

 

Security wins! No daemons and no listening agents

Once Ansible is installed, it will not add a database, and there are no daemons to start or keep running. You only need to install it on one machine (which could easily be a laptop), and it can manage an entire fleet of remote machines from that central point. No Ansible agent listens on a port, so when you use Ansible, there is no extra attack surface for a bad actor to exploit.

This is a big win for security, following one of the leading security principles: reducing the attack surface. When you run the ansible-playbook command, Ansible connects to the remote servers and does what you want. Ansible is pretty streamlined out of the box, but you can enhance its behavior by configuring the ansible.cfg file.

 

Ansible Architecture: Ansible Architecture diagram.

    • Ansible Inventory: Telling Ansible About Your Servers

The Ansible architecture diagram has several critical components. First, the Ansible inventory is all about telling Ansible about your servers. Ansible can manage only the servers it explicitly knows about; by default, it knows about just one host, localhost, the control host.

You provide Ansible with information about servers by specifying them in an inventory. We usually create a directory called “inventory” to hold this information.

For example, a straightforward inventory file might contain a list of hostnames. The Ansible inventory is the set of systems that a playbook runs against: a list of systems in your infrastructure against which the automation is executed. The following Ansible architecture diagram shows all the Ansible components, including modules, playbooks, and plugins.

Ansible Architecture Diagram
Diagram: Ansible Architecture Diagram.

 

Ansible architecture diagram: Inventory highlights

The inventory commonly contains hosts but can also comprise other components, such as network devices, storage arrays, and other physical and virtual appliances. It can also hold valuable information, such as variables, that is used against the targets during execution.

The inventory can be as simple as a text file, or it can be dynamic, where the inventory is an executable and the data is sourced dynamically. This way, we can store data externally and use it during runtime. For example, we can have a dynamic inventory via Amazon Web Services or create our own dynamic source.
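A hedged sketch of a simple static inventory in YAML format; the file path, groups, and host names are illustrative.

```yaml
---
# inventory/hosts.yml (hypothetical path)
all:
  children:
    webservers:
      hosts:
        web1:
        web2:
        web3:
    dbservers:
      hosts:
        db1:
          ansible_host: 10.0.0.20   # example connection address variable
```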

 

  • A key point: Video on Ansible Inventory

This short educational tutorial discusses the Ansible inventory used to hold the Ansible managed hosts. We then discuss the different types of inventories, static and dynamic, along with the ways you can apply variables to hosts in the inventory.

 

The Ansible Inventory | Ansible Automation

 

Example: AWS EC2 External inventory script  

You will see a connection to the cloud in the Ansible architecture diagram above. If you use Amazon Web Services EC2, maintaining a static inventory file might not be the best approach because hosts may come and go over time, be managed by external applications, or be affected by AWS autoscaling.

For this reason, you can use the EC2 external inventory script. In addition, if your hosts run on Amazon EC2, then EC2 already tracks information about your hosts for you. The Ansible inventory is flexible; you can use multiple inventory sources simultaneously, and mixing dynamic and statically managed inventory sources in the same Ansible run is possible. Many refer to this as an instant hybrid cloud.

Ansible and NMAP
Diagram: Ansible and NMAP. Source Red Hat.

 

Ansible deployment architecture and Ansible modules

Next, within the Ansible deployment architecture, we have the Ansible modules, considered to be the main workhorse of Ansible. You use modules to perform various tasks, such as installing a package, restarting a service, or copying a configuration file. Ansible modules cater to a wide range of system administration tasks.

This list shows the categories of modules you can use; there are over 3,000 modules. So you may be wondering who looks after all these modules. That is where collections, introduced in more recent versions of Ansible, come in.

Ansible architecture

    • Extending Ansible modules

The modules are scripts (written in Python) that are packaged with Ansible and perform some action on the managed host. Ansible has extensive modules covering many areas, including networking, cloud computing, server configuration, containerization, and virtualization.

In addition, there are many modules to support your automation requirements. If no suitable module exists, you can create a custom module with Ansible’s extensive framework. Each task has a one-to-one correlation with a module; for example, a template task will use the template module.

 

    • A key point: Idempotency

Modules strive to be idempotent, allowing the module to run repeatedly without a negative impact. In Ansible, the input is in the form of command-line arguments to the module, and the output is delivered as JSON to STDOUT. Input is generally provided in the space-separated key=value syntax, and it’s up to the module to deconstruct these into usable data.

Most of the Ansible modules are also idempotent. Idempotent means running an Ansible playbook multiple times against a server is safe. For example, if the deploy user does not exist, Ansible will create it. If it does exist, Ansible will not do anything. This is a significant improvement over the shell script approach, where running the script a second time might have different and potentially unintended effects.
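As a hedged sketch of the idempotent behavior described above, using the deploy-user example; the user name and shell are illustrative.

```yaml
---
- name: Ensure the deploy user exists
  hosts: all
  become: true
  tasks:
    - name: Create the deploy user if it is missing
      ansible.builtin.user:
        name: deploy
        shell: /bin/bash
        state: present

# Running this playbook repeatedly is safe: if the user already exists,
# the task reports "ok" and changes nothing.
```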

 

    • The Use of Ansible Ad Hoc Commands

For instance, if you wanted to power off all of your lab machines for the weekend, you could execute a quick one-liner in Ansible without writing a playbook. Ad hoc commands are suitable for running a single task against hosts, and they use the same modules. They are handy for checking configuration on hosts and are also good for learning Ansible.

Note that Ansible is the executable for ad hoc one-task executions, and ansible-playbook is the executable that will process playbooks to orchestrate multiple tasks.

The other side of the puzzle is playbooks, which are used for more complex tasks and are better for use cases where dependencies have to be managed. A playbook can take care of entire application deployments and their dependencies.

Ansible Ad Hoc commands
Diagram: Ansible Commands. Source Docs at Ansible.

 

Ansible Plays

  • Ansible playbooks

An Ansible playbook can contain multiple plays, and each play can execute against different managed assets. An Ansible play is all about “what am I automating”; it then connects to the hosts to perform the actions. Each playbook comprises one or more plays in a list. The goal of a play is to map a group of hosts to some well-defined roles, represented by things Ansible calls tasks.

At a basic level, a task is just a call to an Ansible module. By composing a playbook of multiple plays, it is possible to orchestrate multi-machine deployments: running specific steps on all machines in the web servers group, then particular actions on the database server group, then more commands back on the web servers group, and so on.
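A hedged sketch of such a multi-play playbook; the group names, packages, and task order are illustrative.

```yaml
---
- name: Prepare the web tier
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present

- name: Prepare the database tier
  hosts: dbservers
  become: true
  tasks:
    - name: Install PostgreSQL
      ansible.builtin.apt:
        name: postgresql
        state: present

- name: Return to the web tier
  hosts: webservers
  become: true
  tasks:
    - name: Restart nginx once the database tier is ready
      ansible.builtin.service:
        name: nginx
        state: restarted
```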

 

  • Ansible tasks

Ansible is ready to execute a task once a playbook is parsed and the hosts are determined. Tasks include a name, a module reference, module arguments, and task control directives. By default, Ansible executes each task in order, one at a time, against all machines matched by the host pattern. Each task executes a module with specific arguments. You can also use the “--start-at-task <task name>” flag to tell ansible-playbook to start a playbook in the middle, at a particular task.

 

  • Task execution

Each play contains a list of tasks. Tasks are executed in order, one at a time, against all machines matched by the host pattern before moving on to the next task. It is essential to understand that, within a play, all hosts will get the same task directives. This is because the purpose of a play is to map a selection of hosts to tasks.  

 

OpenShift Networking Deep Dive

OpenShift SDN

OpenShift SDN

In today's fast-paced cloud computing and containerization world, efficient networking solutions are essential to ensure seamless communication between containers and applications. OpenShift SDN (Software-Defined Networking) has emerged as a powerful tool for simplifying container networking and managing the complexities of distributed systems.

This blog post will explore what OpenShift SDN is, its key features, and its benefits to developers and operators.

OpenShift SDN is a networking plugin explicitly developed for OpenShift, a leading container platform. It provides a software-defined networking layer that abstracts the underlying network infrastructure and enables seamless communication between containers across different hosts within a cluster.

By decoupling the networking layer from the physical infrastructure, OpenShift SDN simplifies network configuration and management, making deploying, scaling, and managing containerized applications easier.


Highlights: OpenShift SDN

 

Application Exposure

When considering OpenShift and how OpenShift networking SDN works, you need to fully understand how application exposure works and how to expose applications to the external world so that external clients can access them. For most use cases, the containers in the pods running in Kubernetes (see Kubernetes networking 101) need to be exposed, and this is not done with the pod IP address.

Instead, pod IP addresses serve different internal use cases. Application exposure is done with OpenShift Routes and OpenShift Services; the construct used depends on the level of exposure needed.

The Role of SDN

OpenShift SDN (Software Defined Network) is a software-defined networking solution designed to make it easier for organizations to manage their network traffic in the cloud. It is a network overlay technology that enables distributed applications to communicate over public and private networks. OpenShift SDN is based on the Open vSwitch (OVS) platform and provides a secure, reliable, and highly available layer 3 network overlay. With OpenShift SDN, users can define their network topologies, create virtual networks, and control traffic flows between virtual machines and containers.

 

Related: For pre-information, kindly visit the following:

  1. OpenShift Security Best Practices
  2. ACI Cisco
  3. DNS Security Solutions
  4. Container Networking
  5. OpenStack Architecture
  6. Kubernetes Security Best Practice

 



OpenShift Networking

Key OpenShift SDN Discussion points:


  • Route and Service constructs.

  • Service discovery with DNS.

  • OpenShift SDN Operators.

  • Discussion on Service types.

  • OpenShift Network modes.

 

Back to Basics: OpenShift SDN

Kubernetes has gained considerable traction over the past few years, with OpenShift being one of its most mature distributions. OpenShift removes the complexity of operating Kubernetes and provides several layers of abstraction over vanilla Kubernetes, with an easy-to-consume dashboard.

OpenShift is a platform that helps software teams develop and deploy distributed software built on Kubernetes. It has a large set of built-in tools and can be deployed quickly. While it can significantly help its users and eliminate many traditionally manual operational burdens, keep in mind that OpenShift is a distributed system that must be deployed, operated, and maintained.

 

Key Features of OpenShift SDN:

1. Multitenancy: OpenShift SDN allows multiple tenants to share the same cluster while providing isolation and security between them. It creates virtual networks and implements network policies to control traffic flow.

2. Service Discovery: OpenShift SDN includes a built-in DNS service that automatically assigns unique names to services running within the cluster. This simplifies communication between services, eliminating the need for manual IP address management.

3. Network Policy Enforcement: OpenShift SDN enables fine-grained control over network traffic using network policies. Operators can define rules to allow or deny communication between pods or services based on various criteria, such as IP addresses, ports, and labels.

4. Scalability and Resilience: OpenShift SDN is designed to scale horizontally as the cluster grows, ensuring the network can handle increased traffic and workload. It also provides resilience by automatically detecting and recovering from failures to maintain uninterrupted service.

Benefits of OpenShift SDN:

1. Simplified Networking: OpenShift SDN abstracts the complexities of network configuration, making it easier for developers to focus on building and deploying applications. It provides a consistent networking experience across different clusters and environments.

2. Increased Efficiency: With OpenShift SDN, containers can communicate directly with each other, bypassing unnecessary hops and reducing latency. This improves application performance and enhances overall efficiency.

3. Enhanced Security: The network policies in OpenShift SDN enable operators to enforce strict security measures, protecting sensitive data and preventing unauthorized access. It provides a secure environment for running containerized applications.

4. Seamless Integration: OpenShift SDN seamlessly integrates with other OpenShift components and tools, such as the Kubernetes API, allowing for easy management and monitoring of containerized applications.

 

Kubernetes’ concept of a POD

OpenShift leverages the Kubernetes concept of a pod, the smallest compute unit that can be defined, deployed, and managed: one or more containers deployed together on one host. To a container, a pod is the equivalent of a physical or virtual machine instance. Containers within a pod share local storage and networking, and each pod has its own IP address.

An individual pod has a lifecycle; it is defined, assigned to a node, and then runs until the container(s) exit or are removed for some other reason. Pods can be removed after exiting or retained to allow access to container logs, depending on policy and exit code.

 

Kubernetes POD

In OpenShift, pod definitions are largely immutable; they cannot be modified while running. Changes are implemented by terminating existing pods and recreating them with modified configurations, base images, or both. Additionally, pods are expendable and do not maintain state when recreated. In general, pods should not be managed directly by users but by higher-level controllers.

 

Kubernetes’ Concept of Services

Kubernetes services act as internal load balancers. A service identifies a set of replicated pods and proxies connections to them. Backing pods can be added or removed arbitrarily while the service remains consistently available, enabling everything that depends on it to refer to it at a consistent address. The OpenShift Container Platform uses cluster IP addresses to allow pods to communicate with each other and access the internal network.

The service can be assigned additional externalIP and ingressIP addresses external to the cluster to allow external access. An external IP address can also be a virtual IP address that provides highly available access to the service.

IP addressing and port mappings are assigned to services, which proxy to an appropriate backing pod when accessed. Using a label selector, a service can find all containers running on a specific port that provides a particular network service. Like pods, services are REST objects. 
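A hedged sketch of a service of this kind; the name, labels, and ports are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # label selector that finds the backing pods
  ports:
    - port: 8080        # cluster-internal service port
      targetPort: 8080  # container port the traffic is proxied to
  type: ClusterIP       # default type: an internal cluster IP only
```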

Kubernetes service

 

There are a couple of options for getting hands-on with OpenShift. You can download CodeReady Containers for Linux, Windows, or macOS, or use a pre-built Sandbox Lab environment that Red Hat provides.

Stages:

First, we must extract the files with tar xvf on the CRC Linux archive; this extracts into the current directory. You may want to move the binary to the /usr/local/bin directory. Because CodeReady Containers is a binary you will work with regularly, you want it on your PATH.

Then, we run the CRC setup. The most important thing is ensuring you meet the virtualization requirements: CodeReady Containers requires KVM to be available.

So, it is unlikely that this will work in a public cloud environment unless you can get a bare-metal instance with KVM available. However, once it is downloaded to your local machine, you can run the CRC setup to install it and supply the pull secret.

 

OpenShift Networking SDN

To start with OpenShift networking SDN, we have the Route construct to provide access to specific services from the outside world. There is a connection point between the Route and the Service construct: the Route connects to the Service, and the Service acts as a software load balancer to the correct pod or pods running your application.

There can be several different service types, with the default being ClusterIP. You may consider the Service the first level of exposing applications, but Services are unrelated to external DNS name resolution. To make services reachable by FQDN, we use the OpenShift Route resource, and the Route provides the DNS name.

OpenShift Networking Deep Dive
Diagram: OpenShift networking deep dive.

The default service cluster IP addresses are from the OpenShift Dedicated internal network, and they are used to permit pods to access each other. Services are assigned an IP address and port pair that, when accessed, proxy to an appropriate backing pod.

 

OpenShift Routes
Diagram: Creating OpenShift Routes. Source OpenShift Docs.

 

By default, routes are unsecured, which makes them the easiest to configure. A secured route, however, offers TLS security that keeps your connection private. You can create secure HTTPS routes using the create route command, optionally supplying certificates and keys (PEM-format files that must be generated and signed separately).
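A hedged sketch of an edge-terminated route pointing at a service; the host name and service name are illustrative.

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: web
spec:
  host: web.apps.example.com   # external FQDN served by the router
  to:
    kind: Service
    name: web                  # the service that backs this route
  port:
    targetPort: 8080
  tls:
    termination: edge          # TLS is terminated at the router
```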

 

OpenShift Networking Deep Dive

Service Discovery and DNS

Applications depend on each other to deliver information to users. These relationships are complex to manage in an application spanning multiple independently scalable pods. So, we don’t access applications by pod IP: those IP addresses change for one reason or another, and addressing pods directly is not a scalable solution.

To make this easier, OpenShift deploys DNS when the cluster is deployed and makes it available on the pod network. DNS in OpenShift allows pods to discover the resources in the OpenShift SDN.

 

The DNS Operator

The DNS Operator runs the DNS services and uses CoreDNS. The pods use the internal CoreDNS server for DNS resolution, and each pod’s DNS name server is automatically set to CoreDNS. OpenShift provides its internal DNS, implemented via CoreDNS and dnsmasq, for service discovery; dnsmasq is a lightweight DNS forwarder.

 

Layer Approach to DNS.

DNS in OpenShift takes a layered approach. Originally, DNS in Kubernetes was introduced for service discovery, and that problem was solved a long time ago: DNS was the answer for service discovery back then, as it still is now. Service discovery means that an application or service inside the cluster can reference a service by name, not by IP address.

The pods deployed represent microservices and have a Kubernetes service in front of them, pointing to these pods, which are discovered by DNS name. So the service is transparent. The internal DNS manages this in Kubernetes; originally it was SkyDNS, then KubeDNS, and now it is CoreDNS.

The DNS Operator has several roles:

    1. It creates the default cluster DNS domain, cluster.local.
    2. It assigns DNS names to namespaces; the namespace is part of the FQDN.
    3. It assigns DNS names to services, so both the service and the namespace are part of the FQDN. For example, a service named web in the demo namespace resolves to web.demo.svc.cluster.local.

 

OpenShift DNS Operator
Diagram: OpenShift DNS Operator. Source OpenShift Docs.

 

OpenShift SDN and the DNS processes

The Controller nodes

We have several components that make up the OpenShift cluster network. First, we have the controller nodes; there are multiple controller nodes in a cluster. The role of the controller nodes is to redirect traffic to the pods. We run a router on each controller node and use CoreDNS. In front of this Kubernetes cluster layer sits a hardware load balancer. Then, we have external DNS, which is outside of the cluster.

This external DNS has a wildcard domain; thus, external DNS through the wildcard is resolved to the frontend hardware load balancer. So, users who want to access a service issue the request and contact external DNS for name resolution.

Then, external DNS resolves the wildcard domain to the load balancer, the load balancer distributes traffic to the different control nodes, and from these control nodes the Route and Service are addressed.

OpenShift and DNS: Wildcard DNS.

OpenShift has an internal DNS server, which is reachable only by pods. To make services available by name to the outside, we need an external DNS server configured with a wildcard DNS record. The wildcard record resolves all resources created in the cluster domain to the OpenShift load balancer.

This OpenShift load balancer provides a frontend to the control nodes, which run as ingress controllers: they are part of the internal cluster and have access to internal resources.

 

    • OpenShift ingress operators

For this to work, we need to use the OpenShift Operators. The Ingress Operator implements the IngressController API and enables external access to OpenShift Container Platform cluster services. It does this by deploying one or more HAProxy ingress controllers to handle the routing side.

You can use the Ingress Operator to route traffic by specifying the OpenShift Container Platform Route construct. You may also have heard of the Kubernetes Ingress resource. Both are similar, but the OpenShift Route can have additional security features, along with use cases such as split traffic for blue-green deployments.

The OpenShift route construct and encryption

The OpenShift Container Platform route provides traffic to services in the cluster. In addition, routes offer advanced features that might not be supported by standard Kubernetes Ingress Controllers, such as TLS re-encryption, TLS passthrough, and split traffic for blue-green deployments.

In Kubernetes terms, we use Ingress, which exposes services to the external world. In OpenShift, however, it is a best practice to use Routes. Routes are an alternative to Ingress.

 

openshift networking deep dive
Diagram: OpenShift networking deep dive.

 

We have three pods, each with a different IP address. To access these pods, we need a service. Essentially, the service provides load balancing and distributes the load to the pods using a load-balancing algorithm, round robin by default.

The service is an internal component, and in OpenShift, we have Routes that provide a URL for the services so they can be accessed from the outside world. The URL created by the Route points to the service, and the service points to the pods. In the Kubernetes world, Ingress provides this function instead of Routes.

 

Video: Product demonstration on OpenShift Networking

In the following video, I will demonstrate OpenShift networking. We will go through the different OpenShift networking concepts, including the OpenShift routes, services, pods, replica sets, and much more! At the end of the demonstration, you will understand the OpenShift default networking and how to configure external access. The entire video is a full demo with animated diagrams helping you stay focused for the whole duration.

 

Product Demonstration for OpenShift Networking

 

Different types of services

Type: 

  • ClusterIP: The service is exposed as an IP address internal to the cluster. This is useful for a microservices design where the front end connects to the backend without exposing the service externally. This is the default type; a ClusterIP service gives you a cluster-wide internal IP address.
  • NodePort: A service type that exposes a port on each node’s IP address. This is like port forwarding on the physical node: the node port dynamically exposes a port on the physical node and connects it to the internal cluster pods. External users connect to the port on the node, the traffic is forwarded to the node port, and from there it goes to the pods and is load-balanced across them.
  • LoadBalancer: A service type typically found in public cloud environments, where the cloud provider provisions an external load balancer for the service.
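A hedged sketch of a NodePort service; the name, labels, and port numbers are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 8080         # cluster-internal service port
      targetPort: 8080   # container port
      nodePort: 30080    # port exposed on every node's IP (default range 30000-32767)
```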

 

Forming the network topology: OpenShift SDN networking

New pod creation: OpenShift networking SDN

As new pods are created on a host, the local OpenShift software-defined network (SDN) allocates and assigns an IP address from the cluster network subnet assigned to the node and connects the pod’s veth interface to a port in the br0 switch. It does this with the OpenShift OVS integration, which programs OVS rules via the OVS bridge. At the same time, the OpenShift SDN injects new OpenFlow entries into br0 to route traffic addressed to the newly allocated IP address to the correct OVS port connecting the pod.

openshift SDN
Diagram: OpenShift SDN.

 

Pod network: 10.128.0.0/14

The pod network defaults to use the 10.128.0.0/14 IP address block. Each node in the cluster is assigned a /23 CIDR IP address range from the pod network block. That means, by default, each application node in OpenShift can accommodate a maximum of 512 pods. 
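As a hedged sketch, this addressing can be declared at install time in the networking section of install-config.yaml; the values below mirror the defaults described above, and the exact layout should be verified against your OpenShift version.

```yaml
networking:
  networkType: OpenShiftSDN
  clusterNetwork:
    - cidr: 10.128.0.0/14   # pod network block for the whole cluster
      hostPrefix: 23        # each node receives a /23 slice of the block
  serviceNetwork:
    - 172.30.0.0/16         # assumed default service network
```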

The SDN manages how these IP address ranges are allocated to each application node, while OpenFlow rules programmed into OVS control how traffic is forwarded. The OpenShift cluster-wide network is established via the primary CNI plugin, which is the essence of SDN for OpenShift and configures the overlay network using OVS.

OVS is used in your OpenShift cluster as the communications backbone for your deployed pods: traffic in and out of every pod, and in and out of the OpenShift cluster, flows through OVS. OVS runs as a service on each node in the cluster. The primary CNI SDN plugin enforces network policies using Open vSwitch flow rules, which dictate which packets are allowed or denied.

OpenShift Network Policy Tutorial
Diagram: OpenShift network policy tutorial.

 

Configuring OpenShift Networking SDN

The default flat network

When you deploy OpenShift, the default configuration for the pod network’s topology is a single flat network. Every pod in every project can communicate without restrictions. OpenShift SDN uses a plugin architecture that provides different network topologies. Depending on your network and security requirements, you can choose a plugin that matches your desired topology. Currently, three OpenShift SDN plugins can be enabled in the OpenShift configuration without significantly changing your cluster.

 

OpenShift SDN default CNI network provider

OpenShift Container Platform uses a software-defined networking (SDN) approach to provide a unified cluster network that enables communication between pods across the OpenShift Container Platform cluster. This pod network is established and maintained by the OpenShift SDN, configuring an overlay network using Open vSwitch (OVS).

OpenShift SDN modes:

OpenShift SDN provides three SDN modes for configuring the pod network.

  1. ovs-subnet: Enabled by default. Creates a flat pod network, allowing all pods in all projects to communicate.
  2. ovs-multitenant: Separates pods by project. Applications deployed in a project can only communicate with pods deployed in the same project.
  3. ovs-networkpolicy: Provides fine-grained ingress and egress rules for applications. This plugin can be more complex to operate than the other two.
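As a hedged sketch, the mode is selected in the cluster Network operator configuration; the field names below follow the operator.openshift.io/v1 Network resource and should be verified against your OpenShift version.

```yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: NetworkPolicy   # alternatives: Subnet, Multitenant
```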

 

    • OpenShift ovs-subnet

The OpenShift ovs-subnet is the original OpenShift SDN plugin. This plugin provides basic connectivity for the Pods. This network connectivity is sometimes called a “flat” pod network. It is described as a “flat” Pod network because there are no filters or restrictions, and every pod can communicate with every other Pod and Service in the cluster. Flat network topology for all pods in all projects lets all deployed applications communicate. 

 

    • OpenShift ovs-multitenant

With the OpenShift ovs-multitenant plugin, each project receives a unique VXLAN ID known as a Virtual Network ID (VNID). All the pods and services of an OpenShift project are assigned to the corresponding VNID, so we now have segmentation based on the VNID. This maintains project-level traffic isolation, meaning that pods and services of one project can only communicate with pods and services in the same project; there is no way for pods or services in one project to send traffic to another. The ovs-multitenant plugin is perfect if simply having projects separated is enough.

 

Unique across projects

Unlike the ovs-subnet plugin, which passes all traffic across all pods, this plugin assigns the same VNID to all pods within each project, keeping the VNID unique across projects, and sets up flow rules on the br0 bridge to ensure that traffic is only allowed between pods with the same VNID.

 

VNID for each Project

When the ovs-multitenant plugin is enabled, each project is assigned a VNID. The VNID for each Project is maintained in the etcd database on the OpenShift master node. When a pod is created, its linked veth interface is associated with its Project’s VNID, and OpenFlow rules are created to ensure it can communicate only with pods in the same project.

 

    • The ovs-networkpolicy plugin

The ovs-multitenant plugin cannot control access at a more granular level. This is where the ovs-networkpolicy plugin steps in, adding more configuration power and letting you create custom NetworkPolicy objects. As a result, the ovs-networkpolicy plugin provides fine-grained access control for individual applications, regardless of their project. You can tailor your topology requirements by defining isolation policy with NetworkPolicy objects.

This is Kubernetes NetworkPolicy: you label or tag your application pods, then define a NetworkPolicy object to allow or deny connectivity across your application, as in the sketch below. Network policy mode enables you to configure isolation policies using NetworkPolicy objects and is the default mode in OpenShift Container Platform 4.8.
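A hedged sketch of such a NetworkPolicy; the labels and port are illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web
spec:
  podSelector:
    matchLabels:
      app: web                # the policy applies to pods labelled app=web
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend  # only pods labelled role=frontend may connect
      ports:
        - protocol: TCP
          port: 8080
```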

 

  • OpenShift OVN Kubernetes CNI network provider

OpenShift Container Platform uses a virtualized network for pod and service networks. The OVN-Kubernetes Container Network Interface (CNI) plugin is a network provider for the default cluster network. OVN-Kubernetes is based on the Open Virtual Network (OVN) and provides an overlay-based networking implementation. A cluster that uses the OVN-Kubernetes network provider also runs Open vSwitch (OVS) on each node. OVN configures OVS on each node to implement the declared network configuration.

 

OVN-Kubernetes features

The OVN-Kubernetes Container Network Interface (CNI) cluster network provider implements the following features:

  • Uses OVN (Open Virtual Network) to manage network traffic flows. OVN is a community-developed, vendor-agnostic network virtualization solution.
  • Implements Kubernetes network policy support, including ingress and egress rules.
  • It uses the Geneve (Generic Network Virtualization Encapsulation) protocol rather than VXLAN to create an overlay network between nodes.

 

Container Network Interface CNI
Diagram: Container Network Interface CNI. Source OpenShift Docs

 

Closing comments on OpenShift SDN

OpenShift SDN (Software-Defined Networking) is a crucial component of the OpenShift platform, providing a flexible and scalable networking solution for containerized applications. It enables seamless communication between containers on different nodes within an OpenShift cluster.

At its core, OpenShift SDN leverages the power of Open vSwitch (OVS), a widely used open-source virtual switch. Using OVS, OpenShift SDN can create a virtual network overlay across nodes in the cluster, ensuring efficient networking between containers.

One of the critical advantages of OpenShift SDN is its ability to provide network isolation for different projects and applications running on the OpenShift platform. Each project or application is assigned its isolated network, preventing interference and ensuring security. OpenShift SDN also offers advanced networking features such as network policy enforcement. This allows administrators to define fine-grained rules for traffic flow within the cluster, ensuring that only authorized communication is permitted between containers.

Another notable feature of OpenShift SDN is its support for multi-tenancy. With multi-tenancy, different teams or organizations can share the same OpenShift cluster while maintaining network separation. This enables efficient resource utilization and simplifies management for cluster administrators. OpenShift SDN is designed to be highly scalable and resilient. It can handle many containers and automatically adapts to changes in the cluster, such as adding or removing nodes. This ensures the network remains stable and performant even under high load conditions.

OpenShift SDN utilizes various networking technologies to provide seamless container connectivity, including Virtual Extensible LAN (VXLAN) and Geneve tunneling. These technologies enable the creation of a virtual network fabric that spans the entire cluster, allowing containers to communicate without any physical network limitations.

 

Highlights: OpenShift SDN

OpenShift SDN, short for Software-Defined Networking, is a revolutionary technology that has transformed the way we think about network management in the world of containerization. In this blog post, we delved deep into the intricacies of OpenShift SDN and explored its various components, benefits, and use cases. So, fasten your seatbelts as we embark on this exciting journey!

Section 1: Understanding OpenShift SDN

OpenShift SDN is a networking plugin for the OpenShift Container Platform that provides a robust and scalable network infrastructure for containerized applications. It leverages the power of Kubernetes and overlays network connectivity on top of existing physical infrastructure. OpenShift SDN offers unparalleled flexibility, agility, and automation by decoupling the network from the underlying infrastructure.

Section 2: Key Components of OpenShift SDN

To comprehend the inner workings of OpenShift SDN, let’s explore its key components:

1. Open vSwitch: Open vSwitch is a virtual switch that forms the backbone of OpenShift SDN. It enables the creation of logical networks and provides advanced features like load balancing, firewalling, and traffic shaping.

2. SDN Controller: The SDN controller is responsible for managing and orchestrating the network infrastructure. It acts as the brain of OpenShift SDN, making intelligent decisions regarding network policies, routing, and traffic management.

3. Network Overlays: OpenShift SDN utilizes network overlays to create virtual networks on top of the physical infrastructure. These overlays enable seamless communication between containers running on different hosts and ensure isolation and security.

Section 3: Benefits of OpenShift SDN

OpenShift SDN brings a plethora of benefits to containerized environments. Some of the notable advantages include:

1. Simplified Network Management: With OpenShift SDN, network management becomes a breeze. It abstracts the complexities of the underlying infrastructure, allowing administrators to focus on higher-level tasks and reducing operational overhead.

2. Scalability and Elasticity: OpenShift SDN is highly scalable and elastic, making it suitable for dynamic containerized environments. It can easily accommodate the addition or removal of containers and adapt to changing network demands.

3. Enhanced Security: OpenShift SDN provides enhanced security for containerized applications by leveraging network overlays and advanced security policies. It ensures isolation between different containers and enforces fine-grained access controls.

Section 4: Use Cases for OpenShift SDN

OpenShift SDN finds numerous use cases across various industries. Some prominent examples include:

1. Microservices Architecture: OpenShift SDN seamlessly integrates with microservices architectures, enabling efficient communication between different services and ensuring optimal performance.

2. Multi-Cluster Deployments: OpenShift SDN is well-suited for multi-cluster deployments, where containers are distributed across multiple clusters. It simplifies network management and enables seamless inter-cluster communication.

Conclusion:

In conclusion, OpenShift SDN is a game-changer in the world of container networking. Its software-defined approach, coupled with advanced features and benefits, empowers organizations to build scalable, secure, and resilient containerized environments. Whether you are deploying microservices or managing multi-cluster setups, OpenShift SDN has got you covered. So, embrace the power of OpenShift SDN and unlock new possibilities for your containerized applications!

Chaos Engineering

Chaos Engineering Kubernetes

 

 

Chaos Engineering Kubernetes

In the world of cloud-native computing, Kubernetes has emerged as the de facto container orchestration platform. With its ability to manage and scale containerized applications, Kubernetes has revolutionized modern software development and deployment. However, as systems become more complex, ensuring their resilience and reliability has become a critical challenge. This is where Chaos Engineering comes into play. In this blog post, we will explore the concept of Chaos Engineering in the context of Kubernetes and its importance in building robust, fault-tolerant applications.

Chaos Engineering is a discipline that deliberately injects failure into a system to uncover weaknesses and vulnerabilities. By simulating real-world scenarios, organizations can proactively identify and address potential issues before they impact end-users. Chaos Engineering embraces the philosophy of “fail fast to learn faster,” helping teams build more resilient systems that can withstand unforeseen circumstances and disruptions with minimal impact.

Regarding Chaos Engineering in Kubernetes, the focus is on injecting controlled failures into the ecosystem to assess the system’s behavior under stress. By leveraging Chaos Engineering tools and techniques, organizations can gain valuable insights into the resiliency of their Kubernetes deployments and identify areas for improvement.

 

Highlights: Chaos Engineering Kubernetes

  • The Traditional Application

When considering Chaos Engineering Kubernetes, we must start from the beginning. Not too long ago, applications ran in single private data centers, potentially two data centers for high availability. These data centers were on-premises, and all components were housed internally. Life was easy, and troubleshooting and monitoring any issues could be done by a single team, if not a single person, with predefined dashboards. Failures were known, there was a capacity planning strategy that did not change too much, and you could carry out standard dropped-packet tests.

  • A Static Infrastructure

The network and infrastructure had fixed perimeters and were pretty static. There weren’t many changes to the stack, for example, daily. Agility was at an all-time low, but that did not matter for the environments in which the application and infrastructure were housed. However, nowadays, we are in a completely different environment.

Complexity is at an all-time high, and agility in business is critical. Now, we have distributed applications with components and services located in many different places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. So, in this land of complexity, we must find system reliability. A reliable system is one that you can trust to behave as expected.

 

Before you proceed to the details of Chaos Engineering, you may find the following useful:

  1. Service Level Objectives (slos)
  2. Kubernetes Networking 101
  3. Kubernetes Security Best Practice
  4. Network Traffic Engineering
  5. Reliability In Distributed System
  6. Distributed Systems Observability

 



Kubernetes Chaos Engineering

Key Chaos Engineering Kubernetes Discussion points:


  • Unpredictable failure modes.

  • The need for baseline engineering.

  • Non-ephemeral and ephemeral service types.

  • So many metrics to count.

  • Debugging microservices.

  • The rise of Chaos Engineering.

  • Final points on Service Mesh.

 

  • A key point: Video on Chaos Engineering Kubernetes

In this video tutorial, we go through the basics of how to start a Chaos Engineering project, along with a discussion of baseline engineering. I will introduce how this can be addressed by knowing exactly how your application and infrastructure perform under stress and what their breaking points are.

 

Chaos Engineering: How to Start A Project

 

Back to basics with Chaos Engineering Kubernetes

Today’s standard explanation for Chaos Engineering is “The facilitation of experiments to uncover systemic weaknesses.” The following is true for Chaos Engineering.

  1. Begin by defining “steady state” as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will persist in both the control and experimental groups.
  3. Submit variables that mirror real-world events like servers that crash, hard drives that malfunction, severed network connections, etc.
  4. Then, as a final step. Try to disprove the hypothesis by looking for a steady state difference between the control and experimental groups.

 

Chaos Engineering Scenarios in Kubernetes:

1. Pod Failures: Simulating failures of individual pods within a Kubernetes cluster allows organizations to evaluate how the system responds to such events. By randomly terminating pods, Chaos Engineering can help ensure that the system can handle pod failures gracefully, redistributing workload and maintaining high availability.

2. Network Partitioning: Introducing network partitioning scenarios can help assess the resilience of a Kubernetes cluster. By isolating specific nodes or network segments, Chaos Engineering enables organizations to test how the cluster reacts to network disruptions and evaluate the effectiveness of load balancing and failover mechanisms.

3. Resource Starvation: Chaos Engineering can simulate resource scarcity scenarios by intentionally consuming excessive resources, such as CPU or memory, within a Kubernetes cluster. This allows organizations to identify potential performance bottlenecks and optimize resource allocation strategies.

Benefits of Chaos Engineering in Kubernetes:

1. Enhanced Reliability: By subjecting Kubernetes deployments to controlled failures, Chaos Engineering helps organizations identify weak points and vulnerabilities, enabling them to build more resilient systems that can withstand unforeseen events.

2. Improved Incident Response: Chaos Engineering allows organizations to test and refine their incident response processes by simulating real-world failures. This helps teams understand how to quickly detect and mitigate potential issues, reducing downtime and improving the overall incident response capabilities.

3. Cost Optimization: By identifying and addressing performance bottlenecks and inefficient resource allocation, Chaos Engineering can help optimize the utilization of resources within a Kubernetes cluster. This, in turn, leads to cost savings and improved efficiency.

 

Beyond the Complexity Horizon

Therefore, monitoring and troubleshooting are much more demanding, as everything is interconnected, making it difficult for a single person in one team to understand what is happening entirely. The edge of the network and application boundary surpasses one location and team. Enterprise systems have gone beyond the complexity horizon, and you can’t understand every bit of every single system.

Even if you are a developer closely related to the system and truly understand the nuts and bolts of the application and its code, no one can understand every bit of every single system.  So, finding the correct information is essential, but once you find it, you have to give it to those who can fix it. So monitoring is not just about finding out what is wrong; it needs to alert, and these alerts need to be actionable.

 

Troubleshooting: Chaos engineering kubernetes

Chaos Engineering aims to improve a system’s reliability by ensuring it can withstand turbulent conditions. Chaos Engineering makes Kubernetes more secure. So, if you are adopting Kubernetes, you should adopt Chaos Engineering as an integral part of your monitoring and troubleshooting strategy.

Firstly, we can pinpoint the application errors and understand, at best, how these errors arose. This could be anything from badly ordered scripts on a web page to, say, a database query with bad SQL calls or even unoptimized code-level issues.

Or there could be something more fundamental going on. It is common to have issues with how something is packaged into a container. You can pull in the incorrect libraries or even use a debug version of the container. Or there could be nothing wrong with the packaging and containerization of the container; it is all about where the container is being deployed. There could be something wrong with the infrastructure, either a physical or logical problem—incorrect configuration or a hardware fault somewhere in the application path.

 

Non-ephemeral and ephemeral services

With the introduction of containers and microservices observability, monitoring solutions need to manage both non-ephemeral and ephemeral services. We are collecting data for applications that consist of many different services.

So when it comes to container monitoring and performing Chaos Engineering Kubernetes tests, we need to fully understand the nature of the application that sits on top. Everything is dynamic by nature. You need to have monitoring and troubleshooting in place that can handle this dynamic and transient nature. When monitoring a containerized infrastructure, you should consider the following.

Container Lifespan: Containers have a short lifespan; containers are provisioned and commissioned based on demand. This is compared to the VM or bare-metal workloads that generally have a longer lifespan. As a generic guideline, containers have an average lifespan of 2.5 days, while traditional and cloud-based VMs have an average lifespan of 23 days. Containers can move, and they do move frequently.

One day, we could have workload A on cluster host A, and the next day or even on the same day, the same cluster host could be hosting Application workload B. Therefore, different types of impacts could depend on the time of day.

Containers are Temporary: Containers are dynamically provisioned for specific use cases temporarily. For example, we could have a new container based on a specific image. New network connections will be set up for that container, storage, and any integrations to other services that make the application work. All of this is done dynamically and can be done temporarily.

Different monitoring levels: We have many monitoring levels in a Kubernetes environment. The components that make up the Kubernetes deployment will affect application performance. We have, for example, nodes, pods, and application containers. We have monitoring at different levels, such as the VM, storage, and microservice level.

Microservices change fast and often: Microservices consist of constantly evolving apps. New microservices are added, and existing ones are decommissioned quickly. So, what does this mean to usage patterns? This will result in different usage patterns on the infrastructure. If everything is often changing, it can be hard to derive the baseline and build a topology map unless you have something automatic in place. 

Metric overload: We now have loads of metrics. We now have additional metrics for the different containers and infrastructure levels. We must consider metrics for the nodes, cluster components, cluster add-on, application runtime, and custom application metrics. This is compared to a traditional application stack where we use metrics for components such as the operating system and the application. 

 

  • A key point: Video on Observability vs. Monitoring

We will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are unknown and unpredictable. Therefore, a pre-defined monitoring dashboard will have difficulty keeping up with the rate of change and unknown failure modes.

For this, we should look to have the practice of observability for software and monitoring for infrastructure.

 

Observability vs Monitoring

 

Metric explosion

In the traditional world, we didn’t have to be concerned with the additional components such as an orchestrator or the dynamic nature of many containers. With a container cluster, we must consider metrics from the operating system, application, orchestrator, and containers.  We refer to this as a metric explosion. So now we have loads of metrics that need to be gathered. There are also different ways to pull or scrape these metrics.

Prometheus is the norm in the world of Kubernetes; it uses a very scalable pull approach to get those metrics from HTTP endpoints, either through the Prometheus client libraries or exporters.
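A hedged sketch of a minimal Prometheus scrape configuration; the job name, interval, and targets are illustrative.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: web-app
    scrape_interval: 15s
    static_configs:
      - targets:
          - web1:9100   # example exporter endpoints exposing /metrics
          - web2:9100
```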

Prometheus Monitoring Application
Diagram: Prometheus Monitoring Application: Scraping Metrics.

 

A key point: What happens to visibility  

So we need complete visibility now more than ever. And not just for single components but visibility at a holistic level. Therefore, we need to monitor a lot more data points than we had to in the past. We need to monitor the application servers, Pods and containers, clusters running the containers, the network for service/pod/cluster communication, and the host OS.

All of the data from the monitoring needs to be in a central place so trends can be seen and different queries to the data can be acted on. Correlating local logs would be challenging in a sizeable multi-tier application with docker containers. We can use Log forwarders or Log shippers such as FluentD or Logstash to transform and ship logs to a backend such as Elasticsearch.

 

A key point: New avenues for monitoring

Containers are the norm for managing workloads and adapting quickly to new markets. Therefore, new avenues have opened up for monitoring these environments. So we have, for example, AppDynamics and Elastic search, which are part of the ELK stack, the various logs shippers that can be used to help you provide a welcome layer of unification. We also have Prometheus to get metrics. Keep in mind that Prometheus works in the land of metrics only. There will be different ways to visualize all this data, such as Grafana and Kibana. 

 


 

Microservices complexity: Management is complex

So, with the wave towards microservices, we get the benefits of scalability and business continuity, but managing is very complex. The monolith is much easier to manage and monitor. Also, as they are separate components, they don’t need to be written in the same language or toolkits. So you can mix and match different technologies.

So, this approach has a lot of flexibility, but we can have increased latency and complexity. There are a lot more moving parts that will increase complexity.

We have, for example, reverse proxies, load balancers, firewalls, and other infrastructure support services. What used to be method calls or interprocess calls within the monolith host now go over the network and are susceptible to deviations in latency. 

 

Debugging microservices

With the monolith, the application is simply running in a single process, and it is relatively easy to debug. Many traditional tooling and code instrumentation technologies have been built, assuming you have the idea of a single process. However, with microservices, we have a completely different approach with a distributed application.

Now, your application has multiple processes running in other places. The core challenge is that trying to debug microservices applications is challenging.

So much of the tooling we have today has been built for traditional monolithic applications. So, there are new monitoring tools for these new applications, but there is a steep learning curve and a high barrier to entry. New tools and technologies such as distributed tracing and chaos engineering kubernetes are not the simplest to pick up on day one.

 

  • Automation and monitoring: Checking and health checks

Automation comes into play with the new environment. With automation, we can do periodic checks not just on the functionality of the underlying components, but also health checks of how the application performs. All of this can be automated at specific intervals or in reaction to certain events.

With the rise of complex systems and microservices, it is more important than ever to have real-time monitoring of performance and of metrics that tell you how the systems behave. For example, what is the usual RTT, and how long do transactions take under normal conditions?
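As a minimal sketch of such a check, assuming the service exposes an HTTP health endpoint (the URL and interval here are placeholders), a small loop can record the status code and round-trip time at a fixed interval:

#!/bin/sh
# Poll an assumed health endpoint every 30 seconds and log HTTP status plus round-trip time
URL="http://app.example.local/healthz"    # placeholder endpoint
while true; do
  RESULT=$(curl -s -o /dev/null -w '%{http_code} %{time_total}s' "$URL")
  echo "$(date -u +%FT%TZ) $URL $RESULT"
  sleep 30
done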

 

  • A key point: Video on Distributed Tracing

We generally have two types of telemetry data. We have log data and time-series statistics. The time-series data is also known as metrics in a microservices environment. The metrics, for example, will allow you to get an aggregate understanding of what’s happening to all instances of a given service.

Logs, on the other hand, provide highly fine-grained detail on a given service but have no built-in way to place that detail in the context of a request. Because of how distributed systems fail, you can't use metrics and logs alone to discover and address all of your problems. We need a third piece of the puzzle: distributed tracing.
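To make the metrics side concrete, here is a hedged example of asking Prometheus for an aggregate view across every instance of a service through its HTTP query API; the Prometheus address, metric name, and job label are assumptions for illustration only.

# Aggregate request rate across all instances of an assumed "checkout" service
# (Prometheus assumed reachable on localhost:9090; metric and label names are placeholders)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="checkout"}[5m]))'

Logs and traces then supply the per-request detail that this aggregate view cannot.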

 

Distributed Tracing Explained

 

The Rise of Chaos Engineering

There is a growing complexity of infrastructure, and let’s face it, a lot can go wrong. It’s imperative to have a global view of all the infrastructure components and a good understanding of the application’s performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and trying to validate the health of each piece manually is hard to do. 

In these new environments, especially cloud-native at scale, complexity is at its highest and many more things can go wrong. For this reason, you must prepare as much as possible so that the impact on users is minimal.

So, the dynamic deployment patterns you get with frameworks such as Kubernetes allow you to build better applications, but you need to be able to examine the environment and see whether it is working as expected. Most importantly, the focus here is that to prepare effectively, you need a solid strategy for monitoring production environments.

Chaos Engineering
Diagram: Chaos engineering testing.

 

    • Chaos Engineering for Kubernetes

For this, you need to understand practices like Chaos Engineering and Chaos Engineering tools and how they can improve the reliability of the overall system. Chaos Engineering is the ability to perform tests in a controlled way. Essentially, we intentionally break things to learn how to build more resilient systems.

So, we inject faults in a controlled way to make the overall application more resilient. It comes down to a trade-off and your willingness to accept it: distributed computing carries a considerable trade-off, so you have to monitor efficiently, manage performance, and, more importantly, accurately test the distributed system in a controlled manner.

 

    • Service mesh chaos engineering

Service Mesh is one option for implementing Chaos Engineering. You can also implement Chaos Engineering with Chaos Mesh, a cloud-native Chaos Engineering platform that orchestrates tests in the Kubernetes environment. The Chaos Mesh project offers a rich selection of experiment types, such as Pod lifecycle faults, network faults, Linux kernel faults, I/O faults, and many other stress tests.
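As a hedged sketch of what one of these experiments looks like, the manifest below asks Chaos Mesh to kill a single Pod matching a label selector. It follows the v1alpha1 examples published by the Chaos Mesh project; the namespace and labels are placeholders, so check the fields against the version you have installed.

# Apply a minimal Chaos Mesh Pod-kill experiment (assumes Chaos Mesh is installed in the cluster)
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
  namespace: default
spec:
  action: pod-kill
  mode: one                  # affect one randomly selected Pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: demo              # placeholder label for the target workload
EOF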

Implementing practices like Chaos Engineering will help you understand and manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems. 

Conclusion:

Chaos Engineering has emerged as a valuable practice for organizations leveraging Kubernetes to build and deploy cloud-native applications. By subjecting Kubernetes deployments to controlled failures, organizations can proactively identify and address potential weaknesses, ensuring the resilience and reliability of their systems. As the complexity of cloud-native architectures continues to grow, Chaos Engineering will play an increasingly vital role in building robust and fault-tolerant applications in the Kubernetes ecosystem.

 

Docker security

Docker Container Security: Building a Sandbox

Docker Container Security

In recent years, Docker has revolutionized how software is developed and deployed. Its ability to create lightweight, isolated containers has made it a popular choice among developers. One of the most powerful features of Docker is the ability to create sandboxes, isolated environments that mimic the production environment. This blog post will explore the benefits of building a Docker sandbox and provide a step-by-step guide.

Containerization brings convenience and scalability, but it also introduces unique security challenges. In this blog, I will highlight the fundamental concepts of container security, including container isolation, image vulnerabilities, and the shared kernel model.

To fortify your Docker environment, it is crucial to implement a set of best practices. The sections below explore various security measures, such as image hardening, the least privilege principle, and container runtime security. We will also discuss the significance of regular updates and vulnerability scanning.

Highlights: Container Security


Namespaces and Control Groups

The building blocks behind Docker security options and Docker security best practices, such as the kernel primitives, have been around for a long time, so they are not all new from a security perspective. However, the container itself is not a kernel construct; it is an abstraction built from features of the host operating system kernel. For Docker container security and for building a Docker sandbox, these kernel primitives are the namespaces and control groups that make the container abstraction possible.

The Role of Kernel Primitives

To build a Docker sandbox, Docker uses control groups to control a workload's access to host resources. As a result, Docker allows you to implement resource controls on these container workloads quickly. Fortunately, much of the control group complexity is hidden behind the Docker API, making containers and Container Networking much easier to use.

Then we have namespaces, which control what a container can see. A namespace allows us to take an OS with all its resources, such as filesystems, and carve it into virtual operating systems called containers. Namespaces act as visibility boundaries, and there are several different namespaces.

Related: For additional pre-information, you may find the following helpful.

  1. Container Based Virtualization
  2. Remote Browser Isolation
  3. Docker Default Networking 101
  4. Kubernetes Network Namespace
  5. Merchant Silicon
  6. Kubernetes Networking 101



Docker Security

Key Docker Container Security Discussion points:


  • Docker security best practices.

  • The role of the namespaces and control groups.

  • Kernel and Hypervisor attack surface.

  • How to build a container sandbox.

  • Docker container security starting points.

Back to Basics: Docker Container Security

Containers

For a long time, big web-scale players have been operating container technologies to manage the weaknesses of the VM model. In the container model, the container is analogous to the VM. However, a significant difference is that containers do not require their full-blown OS. Instead, all containers operating on a single host share the host’s OS.

This frees up many system resources, such as CPU, RAM, and storage. Containers are again fast to start and ultra-portable. Consequently, moving containers with their application workloads from your laptop to the cloud and then to VMs or bare metal in your data center is a breeze.

Docker Container Diagram
Diagram: Docker Container. Source Docker.

Sandbox containers

Sandbox containers are a virtualization technology that provides a secure environment for applications and services to run in. They are lightweight, isolated environments that run applications and services safely without impacting the underlying host.

This type of virtualization technology enables rapid deployment of applications while also providing a secure environment that can be used to isolate, monitor, and control access to data and resources. Sandbox containers are becoming increasingly popular as they offer an easy, cost-effective way to deploy and manage applications and services securely.

They can also be used for testing, providing a safe and isolated environment for running experiments. In addition, Sandbox containers are highly scalable and can quickly and easily deploy applications across multiple machines. This makes them ideal for large-scale projects, as they can quickly deploy and manage applications on a large scale. The following figures provide information that is generic to sandbox containers.

Docker Sandbox
Diagram: Docker Sandbox.

Understanding the Risks

Docker containers present unique security challenges that need to be addressed. We will delve into the potential risks associated with containerization, including container breakouts, image vulnerabilities, and compromised host systems.

Implementing Container Isolation

One fundamental aspect of securing Docker containers is isolating them from each other and from the host system. We will explore techniques such as namespace and cgroup isolation and the use of security profiles to strengthen container isolation and prevent unauthorized access.

Regular Image Updates and Vulnerability Scanning

Keeping your Docker images up to date is vital for maintaining a secure container environment. We will discuss the importance of regularly updating base images and utilizing vulnerability scanning tools to identify and patch potential security vulnerabilities in your container images.

Container Runtime Security

The container runtime environment plays a significant role in container security. We will explore runtime security measures such as seccomp profiles, AppArmor, and SELinux to enforce fine-grained access controls and reduce the attack surface of your Docker containers.

Monitoring and Auditing Container Activities

Effective monitoring and auditing mechanisms are essential for promptly detecting and responding to security incidents. We will explore tools and techniques for monitoring container activities, logging container events, and implementing intrusion detection systems specific to Docker containers.

container attack vectors
Diagram: Container attack vectors. Source Adriancitu

Benefits of Using a Docker Sandbox:

1. Replicating Production Environment: A Docker sandbox allows developers to create a replica of the production environment. This ensures the application runs consistently across different environments, reducing the chances of unexpected issues when the code is deployed.

2. Isolated Development Environment: With Docker, developers can create a self-contained environment with all the necessary dependencies. This eliminates the hassle of manually setting up development environments and ensures team consistency.

3. Fast and Easy Testing: Docker sandboxes simplify testing applications in different scenarios. By creating multiple sandboxes, developers can test various configurations, such as different operating systems, libraries, or databases, without interfering with each other.

1st Lab Guide: Privilege Escalation

Privilege Escalation: The Importance of a Sandbox

During an attack, your initial entry point into a Linux system is via a low-privileged account, which provides you with a low-privileged shell. To obtain root-level access, you need to escalate your privileges. This is generally done by starting with enumeration.

Sometimes, your target machine may have misconfigurations that you could leverage to escalate your privileges. Here, we will look for misconfigurations, particularly those that leverage the SUID (Set User Identification) permission.

The SUID permission allows low-privileged users to run an executable with the file system permissions of its owner. For example, if an executable is owned by root and has the SUID bit set, any user who runs it does so with root's permissions.
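The exact commands are in the screenshot below, but as a generic, hedged illustration (not the screenshot's literal contents), SUID enumeration and abuse of a SUID copy of find typically look like this:

# 1) Enumerate executables carrying the SUID bit
find / -perm -4000 -type f 2>/dev/null

# 2) If find itself is SUID root, use it to read a root-only file
find /etc/shadow -exec cat {} \;

# 3) Or spawn a shell that keeps root's effective UID (-p prevents privilege dropping)
find . -exec /bin/sh -p \; -quit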

Note:

A quick way to find all executables with SUID permission is to execute the command labeled Number 1 in the screenshot below. I’m just running Ubuntu on a VM. To illustrate how the SUID permission can be abused for privilege escalation, you will use the found executable labeled number 2.

Privilege Escalation

Analysis:

    • After setting the SUID permission, re-run the command to find all executables. You will see /usr/bin/find now appears in the list.
    • Since find has the SUID permission, you can leverage it to execute commands in the root context. 
    • You should now see the contents of the /etc/shadow file. This file is not visible without root permissions. From here, you can leverage additional commands to execute more tasks and gain a high-privilege backdoor.

privilege attacks

Proactive Measures to Mitigate Privilege Escalation:

To safeguard against privilege escalation attacks, individuals and organizations should consider implementing the following measures:

1. Regular Software Updates:

Keeping operating systems, applications, and software up to date with the latest security patches helps mitigate vulnerabilities that can be exploited for privilege escalation.

2. Strong Access Controls:

Implementing robust access control mechanisms, such as the principle of least privilege, helps limit user privileges to the minimum level necessary for their tasks, reducing the potential impact of privilege escalation attacks.

3. Multi-factor Authentication:

Enforcing multi-factor authentication adds an extra layer of security, making it more difficult for attackers to gain unauthorized access even if they possess stolen credentials.

4. Security Audits and Penetration Testing:

Regular security audits and penetration testing can identify vulnerabilities and potential privilege escalation paths, allowing proactive remediation before attackers can exploit them.

Step-by-Step Guide to Building a Docker Sandbox:

Step 1: Install Docker:

The first step is to install Docker on your machine. Docker provides installation packages for various operating systems, including Windows, macOS, and Linux. Visit the official Docker website and follow the installation instructions specific to your operating system.

Step 2: Set Up Dockerfile:

A Dockerfile is a text file that contains instructions for building a Docker image. Create a new file named "Dockerfile" and define the base image, dependencies, and any necessary configuration. This file serves as the blueprint for your Docker sandbox.

Step 3: Build the Docker Image:

Once the Dockerfile is ready, you can build the Docker image by running the “docker build” command. This command reads the instructions from the Dockerfile and creates a reusable image that can be used to run containers.

Step 4: Create a Docker Container:

Once you have built the Docker image, you can create a container based on it. Containers are instances of Docker images that can be started, stopped, and managed. Use the “docker run” command to create a container from the image you built.

Step 5: Configure the Sandbox:

Customize the Docker container to match your requirements. This may include installing additional software, setting environment variables, or exposing ports for communication. Use the Docker container’s terminal to make these configurations.

Step 6: Test and Iterate:

Once the sandbox is configured, you can test your application within the Docker environment. Use the container’s terminal to execute commands, run tests, and verify that your application behaves as expected. If necessary, make further adjustments to the container’s configuration and iterate until you achieve the desired results.
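Pulling steps 2 through 5 together, here is a minimal, hedged sketch; the base image, the installed package, and the image and container names are placeholders you would replace with your own.

# Step 2: write a minimal Dockerfile (base image and package are placeholders)
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
CMD ["/bin/bash"]
EOF

# Step 3: build the image from the Dockerfile
docker build -t my-sandbox:latest .

# Step 4: create a container from the image and drop into its shell
docker run -it --rm --name sandbox my-sandbox:latest

# Step 5: inside that shell, install extra tooling, set environment variables, and run your tests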

Building a Docker Sandbox

Docker Security Best Practices: Containerized Processes

Containers are often referred to as "containerized processes." Essentially, a container is a Linux process running on a host machine. However, the process has a limited view of the host and can only access a subtree of the filesystem, so it is best to think of a container as a process with a restricted view.

Namespaces provide that limited view, and control groups provide the resource restrictions. The inside of the container looks similar to a V.M., with isolated processes, networking, and file system access. From the outside, however, it looks like a normal process running on the host machine.

2nd Lab Guide: Container Security 

One of the leading security flaws to point out when building a Docker sandbox is that containers, by default, run as root. Notice in the example below that we have a tool running on the Docker host that can perform an initial security scan, called Docker Bench. Remember that running containers as root comes with inherent security risks that organizations must consider carefully. Here are some key concerns:

Note:

1. Exploitation of Vulnerabilities: Running containers as root increases the potential impact of vulnerabilities within the container. If an attacker gains access to a container running as root, they can potentially escalate their privileges and compromise the host system.

2. Escaping Container Isolation: Containers rely on a combination of kernel namespaces, cgroups, and other isolation mechanisms to provide separation from the host and other containers. Running containers as root increases the risk of an attacker exploiting a vulnerability within the container to escape this isolation and gain unauthorized access to the host system.

3. Unauthorized System Access: If a container running as root is compromised, the attacker may gain full access to the underlying host system, potentially leading to unauthorized system modifications, data breaches, or even the compromise of other containers running on the same host.

Containers running as root
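For reference, a common way to run the Docker Bench check mentioned above is straight from its GitHub repository; this is a hedged example, and the findings will vary from host to host.

# Run Docker Bench for Security against the local Docker host (requires git and sudo)
git clone https://github.com/docker/docker-bench-security.git
cd docker-bench-security
sudo sh docker-bench-security.sh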

Docker container security and protection

Containers run as root by default.

The first thing to consider when starting with Docker container security is that containers run as root by default and share the kernel of the host OS. They rely on the boundaries created by namespaces for isolation and on control groups to prevent one container from starving the others of resources. Here, we can avoid things like a noisy neighbor, where one application uses up all the resources on the system and prevents other applications on the same system from performing adequately.

In the early days of containers, this is how container protection started with namespace and control groups, and the protection was not perfect. For example, it cannot prevent all interference in resources the operating system kernel does not manage. 

So, we need to move to a higher abstraction layer with container images. A container image encapsulates your application code along with any dependencies, third-party packages, and libraries. Images are built assets representing all the files needed to run our application on top of the Linux kernel. In addition, images are used to create containers, so we can add further Docker container security here.

Docker Container Security
Diagram: Docker Container Security. Rootless mode. Source Aquasec.

Security concerns. Image and supply chain

To run a container, we need to pull images. The images are pulled locally or from remote registries; we can have vulnerabilities here. Your hosts connected to the registry may be secure, but that does not mean the image you are pulling is secure. Traditional security appliances are blind to malware and other image vulnerabilities as they are looking for different signatures. There are several security concerns here.

Users can pull full or bloated images from untrusted registries or images containing malware. As a result, we need to consider the container threats in both runtimes and the supply chain for adequate container security.

Scanning Docker images during the CI stage provides a quick and short feedback loop on security as images are built. You want to discover unsecured images well before you deploy them and enable developers to fix them quickly rather than wait until issues are found in production.

You should also avoid unsecured images in your testing/staging environments, as they could also be targets for attack. For example, we have image scanning from Aqua, and image assurance can be implemented in several CI/CD tools, including the Codefresh CI/CD platform.
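As one hedged example of wiring such a check into CI, Aqua's open-source Trivy scanner can gate the pipeline on high-severity findings; the image reference below is a placeholder.

# Scan an image and fail the CI step if HIGH or CRITICAL vulnerabilities are found
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/myapp:1.0.0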

Container image security
Diagram: Container image security. Source Aqua.

3rd Lab Guide: Container Security

The following example shows running a container as an unprivileged user rather than as root, so we are not root inside the container. In this example, we use user ID 1500. Notice how we can't access the /etc/shadow password file, which requires root privileges to open. To mitigate the risks associated with running containers as root, organizations should adopt the following best practices:

container security
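For reference, a hedged command-line equivalent of the example above looks like this; the image and the UID/GID are illustrative.

# Run a container as an unprivileged user (UID/GID 1500, as in the example above)
docker run --rm --user 1500:1500 ubuntu:22.04 id
# expected: uid=1500 gid=1500, not uid=0(root)

# Reading a root-only file from inside the container now fails with "Permission denied"
docker run --rm --user 1500:1500 ubuntu:22.04 cat /etc/shadow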

Note:

1. Principle of Least Privilege: Avoid running containers as root whenever possible. Instead, use non-root users within the container and assign only the necessary privileges required for the application to function correctly.

2. User Namespace Mapping: Utilize user namespace mapping to map non-root users inside the container to a different user outside the container. This helps provide an additional layer of isolation and restricts the impact of any potential compromise.

3. Secure Image Sources: Ensure container images from trusted sources are obtained in your environment. Regularly update and patch container images to minimize the risk of known vulnerabilities.

4. Container Runtime Security: Implement runtime security measures such as container runtime security policies, secure configuration practices, and regular vulnerability scanning to detect and prevent potential security breaches.

Bonus Content: Understanding Docker Networking:

Docker networking is a fundamental aspect of containerization, enabling containers to communicate with each other and the host system. By default, Docker provides three types of networks: bridge, host, and overlay. Each network type serves a different purpose and offers various advantages based on your application’s requirements.

1. Bridge Network:

The bridge network is the default network driver in Docker. It allows containers on the same host to communicate with each other using IP addresses. Containers connected to the bridge network can communicate with the outside world through the host's network interface. This network isolates containers from the host's network namespace and from containers on other networks, providing a secure environment for your applications.

2. Host Network:

In the host network mode, containers share the same network stack as the host system. This means containers have direct access to the host’s network interface, bypassing any network isolation provided by Docker. The host network mode is suitable for high-performance scenarios where you need to maximize network throughput.

3. Overlay Network:

The overlay network allows containers to communicate across multiple Docker hosts. This network type is essential for creating and deploying distributed applications across a swarm of Docker hosts. It utilizes the Docker Swarm mode to provide seamless communication between containers running on different hosts, regardless of their physical location.

  • A key point: Lab Guide on Docker Networking

Docker makes several default networks available on the Docker host. Using these network types, you can control whether containers can communicate on the same host or across different hosts. In the example below, we have inspected the default bridge network. All containers attached to this network can communicate with each other and will be assigned an IP address from the 172.17.0.0/16 range.

Also, notice that the scope of the bridge network type is local, meaning it is local to this host. To communicate across different Docker hosts, you would need an overlay network, which is built on VXLAN.

Docker Default networking
Diagram: Docker Default networking

Advanced Networking Concepts:

Apart from the default network types, Docker networking offers several advanced features that enhance container communication and security.

1. DNS Resolution:

Docker automatically assigns each container a unique name, making it easier to reference them in your application code. The embedded DNS server in Docker allows containers to resolve each other’s names, simplifying the communication process.

2. Container-to-Container Communication:

Containers within the same network can communicate with each other directly using their IP addresses. This enables microservices architectures, where each service runs in its container and communicates with others over the network.

3. Container Expose Ports:

Docker allows you to expose specific container ports to the host system or other containers. This enables you to securely expose your application services to the outside world or other containers within the network.

4. Network Security:

Docker provides various security features to protect your containerized applications. You can implement network policies to control inbound and outbound traffic, ensuring that only authorized connections are allowed. Additionally, you can encrypt network traffic using TLS certificates for secure communication between containers.

4th Lab Guide: Container Networking

Containers can be attached to more than one network. Consider the bridge network a standard switch, except that it is virtual: anything attached to it can communicate. So, if we have a container with two virtual Ethernet cards connected to two different switches, the container is in two networks.

inspecting container networks
Diagram: Inspecting container networks
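A hedged command-line version of this lab, with placeholder network and container names, looks like the following; it also shows the embedded DNS resolution available on user-defined networks.

# Create two user-defined bridge networks
docker network create net-a
docker network create net-b

# Start a container on the first network, then attach it to the second as well
docker run -d --rm --name dual-homed --network net-a nginx:alpine
docker network connect net-b dual-homed

# The container now holds an IP address in both networks
docker inspect --format '{{range $k, $v := .NetworkSettings.Networks}}{{$k}}={{$v.IPAddress}} {{end}}' dual-homed

# On user-defined networks, Docker's embedded DNS lets peers reach it by name
docker run --rm --network net-a alpine ping -c 1 dual-homed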

Security concerns: Container breakouts

The container process is visible from the host. Therefore, if a bad actor gains access to the host with the correct privileges, they can compromise all the containers on the host. If an application can read the memory that belongs to your application, it can access your data, so you need to ensure that your applications are safely isolated from each other. If your applications run on separate physical machines, accessing another application's memory is impossible. From a security perspective, physical isolation is the strongest but is often not possible.

If a host gets compromised, all containers running on the host are potentially compromised, too, especially if the attacker gains root or elevates their privileges, such as a member of the Docker Group.

So, your host must be locked down and secured, so container breakouts are hard to do. Also, remember that it’s hard to orchestrate a container breakout. Still, it is not hard to misconfigure a container with additional or excessive privileges that make a container breakout easy.

Docker Container Security
Diagram: Docker container security. Source Aqua.

The role of the Kernel: Potential attack vector

The Kernel manages its userspace processes and assigns memory to each process. So, it’s up to the Kernel to ensure that one application can’t access the memory allocated to another. The Kernel is hardened and battle-tested, but it is complex, and the number one enemy of good security is complexity. You cannot rule out a bug in how the Kernel manages memory; an attacker could exploit that bug to access the memory of other applications.

Hypervisor: Better isolation? Kernel attack surface

So, does the hypervisor give you better isolation than a kernel gives its processes? The critical point is that a kernel is complex and constantly evolving; crucial as it is, it manages memory and device access, whereas the hypervisor has a more specific role. As a result, hypervisors are smaller and more straightforward than whole Linux kernels.

What happens if you compare the lines of code in the Linux kernel to those of an open-source hypervisor? Less code means less complexity, resulting in a smaller attack surface, and a smaller attack surface decreases the likelihood of a bad actor finding an exploitable flaw.

With a shared kernel, userspace processes have some visibility of each other. For example, you can run specific CLI commands and see the processes running on the same machine, and with the correct permissions you can access information about those processes.

This is a fundamental difference between the container and the V.M., and it is why many consider containers weaker in isolation: with a V.M., you can't see one machine's processes from another. The fact that containers share a kernel means they have weaker isolation than V.M.s, and for this reason, from a security perspective, you can place containers inside V.M.s.

Docker Container Security: Building a Docker Sandbox

So, we have some foundational Docker container security that has been here for some time. On the Linux side, security gives us namespaces and control groups, which we have just mentioned, along with secure computing (seccomp), AppArmor, and SELinux, all of which provide isolation and resource protection. Consider these security technologies the first layer of security, closest to the workload. We can then expand from there and create additional layers of protection, building a defense-in-depth strategy.

Container Sandbox
Diagram: Building a Sandbox. Source Aqua

How to create a Docker sandbox environment

As a first layer in creating a Docker sandbox, you must consider the available security module templates. Several security modules can be implemented to help you enforce fine-grained access control over system resources, hardening your containerized environment. More than likely, your distribution comes with security module templates for Docker containers, and you can use these out of the box for some use cases.

However, you may need to tailor the out-of-the-box default templates for other use cases. Templates for Secure Computing, AppArmor, and SELinux will be available. Along with the Dockerfile and workload best practices, these templates will give you an extra safety net. 

Docker Security Best Practices – Goal1: Strengthen isolation: Namespaces

One of the main building blocks of containers is a Linux construct called the namespace, which provides a layer of security for applications running inside containers. By putting a process in a namespace, you can limit what that process can see. A namespace fools a process into thinking it has exclusive access to a set of resources; in reality, other processes in their own namespaces access similar resources in their own isolated environments, while the resources themselves belong to the host system.
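A quick, hedged way to see this visibility boundary is to compare the process list inside a container with the host's view of the same process:

# Start a container that simply sleeps
docker run -d --rm --name ns-demo alpine sleep 300

# Inside the container's PID namespace, only its own processes are visible
docker exec ns-demo ps

# On the host, the same sleep shows up as an ordinary host process
ps aux | grep "sleep 300"

# Clean up
docker stop ns-demo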

Docker Security Best Practices – Goal2: Strengthen isolation: Access control

Access control is about managing who can access what on a system. Linux inherited Unix's Discretionary Access Control (DAC) features. Unfortunately, they are constrained, and there are only a few ways to control access to objects. If you want a more fine-grained approach, there is Mandatory Access Control (MAC), which is policy-driven and granular for many object types.

We have a few solutions for MAC. SELinux was merged into the kernel in 2003 and AppArmor in 2010. These are the most popular in the Linux domain, and both are implemented as modules via the LSM framework.

SELinux was created by the National Security Agency (NSA) to protect systems and was integrated into the Linux kernel. It is a Linux kernel security module that provides access controls, integrity controls, and role-based access control (RBAC).

Docker Security Best Practices – Goal3: Strengthen isolation: AppArmor

AppArmor applies access control on an application-by-application basis. To use it, you associate an AppArmor security profile with each program. Docker loads a default profile for containers; keep in mind that it applies to the containers and not to the Docker daemon. The default profile is called docker-default, and Docker describes it as moderately protective while providing broad application compatibility.

So, when you instantiate a container, it uses the docker-default policy unless you override it with the --security-opt flag. This policy is crafted for the general use case, and the default profile is applied to all container workloads if the host has AppArmor enabled.
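In practice, overriding the default looks like this; the custom profile name is a placeholder and must already be loaded on the host (for example with apparmor_parser):

# Default behavior on an AppArmor-enabled host: the docker-default profile is applied automatically
# To attach a custom, pre-loaded profile instead:
docker run --rm --security-opt apparmor=my-container-profile nginx:alpine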

Docker Security Best Practices – Goal4: Strengthen isolation: Control groups

Containers should not starve other containers of memory or any other host resource, so we use control groups to limit the resources available to different Linux processes. Control groups govern the host's resources and are essential for fending off denial-of-service attacks. If a process is allowed to consume, for example, unlimited memory, it can starve other processes on the same host of that resource.

This could happen inadvertently through a memory leak in the application or maliciously through a resource exhaustion attack that takes advantage of one. Similarly, a container can fork as many processes (PIDs) as the maximum configured for the host kernel.

Unchecked, this is a significant denial-of-service avenue. A container should be limited to the number of processes it requires, which can be set through the CLI. The PID control group subsystem determines the number of processes allowed within a control group and so prevents fork bomb attacks.
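On the Docker CLI, these control group limits are expressed as run-time flags; the values below are illustrative only.

# Cap memory, CPU, and the number of processes a container may create
docker run -d --rm --name limited \
  --memory=256m \
  --cpus=0.5 \
  --pids-limit=100 \
  nginx:alpine

# Confirm the limits that were applied
docker inspect --format 'Memory={{.HostConfig.Memory}} NanoCpus={{.HostConfig.NanoCpus}} PidsLimit={{.HostConfig.PidsLimit}}' limited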

Docker Security Best Practices – Goal5: Strengthen isolation: Highlighting system calls 

System calls run in the Kernel space, with the highest privilege level and kernel and device drivers. At the same time, a user application runs in the user space, which has fewer privileges.  When an application that runs in user space needs to carry out such tasks as cloning a process, it does this via the Kernel, and the Kernel carries out the operation on behalf of the userspace process. This represents an attack surface for a bad actor to play with.

Docker Security Best Practices – Goal6: Security standpoint: Limit the system calls

So, you want to limit the system calls available to an application. If a process is compromised, it may invoke system calls it would not ordinarily use, which could lead to further compromise. You should aim to remove system calls that are not required and so reduce the available attack surface. As a result, you reduce the risk of compromise to the containerized workloads.

Docker Security Best Practices – Goal7: Secure Computing Mode

Secure Computing Mode (seccomp) is a Linux kernel feature that restricts the actions available within containers. There are over 300 syscalls in the Linux system call interface, and your container is unlikely to need access to all of them. For instance, if you don't want containers to change kernel modules, they do not need to call create_module, delete_module, or init_module. Seccomp profiles are applied to a process and determine whether or not a given system call is permitted; here, we can allowlist or blocklist a set of system calls.

The default seccomp profile sets the Kernel’s action when a container process attempts to execute a system call. An allowed action specifies an unconditional list of permitted system calls.

For Docker container security, the default seccomp profile blocks over 40 syscalls without ill effects on the containerized applications. You may want to tailor this to suit your security needs, restrict it further, and limit your container to a smaller group of syscalls. It is recommended to have a seccomp profile for each application that permits precisely the same syscalls it needs to function. This will follow the security principle of the least privileged.
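Attaching such a profile is done with the security-opt flag; the profile file name below is a placeholder for one tailored to your workload.

# Run a container with a custom seccomp profile (my-app-seccomp.json is a placeholder file)
docker run --rm --security-opt seccomp=./my-app-seccomp.json nginx:alpine

# For comparison only: disable seccomp filtering entirely (not recommended)
docker run --rm --security-opt seccomp=unconfined alpine uname -a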

Closing Comments: Building a Docker Sandbox

Docker containers are an increasingly popular way of deploying applications securely and efficiently. However, as with any technology, security risks come with using Docker containers, and it’s essential to understand how to mitigate those risks. The following are critical considerations for building a Docker sandbox.

The first step to ensuring Docker container security is to keep the Docker daemon and the underlying host operating system up-to-date. This includes patching the Docker engine and the host with the latest security updates. Additionally, check the Docker version you are running and upgrade to the latest version if necessary.

Next, it's essential to employ best practices when creating container images. This includes removing unnecessary packages and updating all components to the latest versions. You should also limit access to the images to only the necessary users and avoid including sensitive information, such as passwords, in the images.

When creating containers, you should limit the number of processes and resources each container can access. This will help to prevent malicious processes from running on the host system. Additionally, limit the memory and CPU resources given to each container.

Finally, securing the communication between the Docker container and the host system is essential. This can be done using secure protocols like TLS, SSH, and HTTPS. You should also ensure that you properly configure the firewall and use authentication measures, such as username and password authentication.

  • Container Isolation:

One of the fundamental benefits of Docker is its ability to isolate applications within containers. This isolation prevents any potential vulnerabilities from spreading across the system. Running each application in its container makes it easier to contain security breaches and limit their impact.

  • Regular Image Updates:

Regularly updating Docker images is crucial for maintaining container security. By regularly patching and updating the base images used within containers, you ensure that any known vulnerabilities are addressed promptly. Additionally, monitoring official Docker repositories and subscribing to security mailing lists can provide valuable insights into possible vulnerabilities and required updates.

  • Image Scanning:

Regular image scanning is essential to identify potential security risks within Docker containers. Various tools like Clair and Anchore can automatically scan container images for vulnerabilities. By integrating these tools into your continuous integration and deployment pipelines, you can ensure that only secure images are deployed.

  • Secure Container Configuration:

Properly configuring Docker containers is vital for maintaining security. Limiting the container’s capabilities, such as restricting access to sensitive host directories and disabling unnecessary services, reduces the attack surface. Employing robust authentication mechanisms and enforcing least privilege access control further enhances container security.

  • Network Segmentation:

Implementing network segmentation within Docker containers is an effective way to prevent unauthorized access. By isolating containers into different network segments based on their sensitivity, you can restrict communication between containers and minimize the impact of a potential breach. Docker’s built-in network functionality, such as overlay networks and network policies, allows for granular control over container communication.

  • Runtime Monitoring:

Continuous monitoring of Docker containers during runtime is crucial for detecting and mitigating potential security incidents. Utilize container monitoring tools, like Sysdig and Prometheus, to monitor resource usage, network traffic, and access patterns. Analyzing these metrics lets you identify suspicious activities and take necessary actions to prevent security breaches.

Summary: Container Security

Docker containers have become a cornerstone of modern application development and deployment in today’s digital landscape. With their numerous advantages, it is imperative to address the importance of container security. This blog post delved into the critical aspects of Docker container security and provided valuable insights to help you safeguard your containers effectively.

Section 1: Understanding the Basics of Container Security

Containerization brings convenience and scalability, but it also introduces unique security challenges. This section will highlight the fundamental concepts of container security, including container isolation, image vulnerabilities, and the shared kernel model.

Section 2: Best Practices for Securing Docker Containers

To fortify your Docker environment, it is crucial to implement a set of best practices. This section will explore various security measures, such as image hardening, the least privilege principle, and container runtime security. We will also discuss the significance of regular updates and vulnerability scanning.

Section 3: Securing Container Networks and Communication

Container networking ensures secure communication between containers and the outside world. This section delved into strategies such as network segmentation, container firewalls, and secure communication protocols. Additionally, we touched upon the importance of monitoring network traffic for potential intrusions.

Section 4: Container Image Security Scanning

The integrity of container images is of utmost importance. This section highlights the significance of image-scanning tools and techniques to identify and mitigate vulnerabilities. We explored popular image-scanning tools and discussed how to integrate them seamlessly into your container workflow.

Section 5: Managing Access Control and Authentication

Controlling access to your Docker environment is critical for maintaining security. This section covered essential strategies for managing user access, implementing role-based access control (RBAC), and enforcing robust authentication mechanisms. We also touched upon the concept of secrets management and protecting sensitive data within containers.

Conclusion:

In conclusion, securing your Docker containers is a multifaceted endeavor that requires a proactive approach. By understanding the basics of container security, implementing best practices, securing container networks, scanning container images, and managing access control effectively, you can significantly enhance the security posture of your Docker environment. Remember, container security is an ongoing effort, and vigilance is critical to mitigating potential risks and vulnerabilities.