
OpenShift | Networking

Traditional data center networks face several challenges that make them unable to support today's types of applications, such as microservices and containers. We therefore need a new set of networking technologies, built into OpenShift, that can deal more adequately with today's changing landscape. Firstly, one of the main issues is the tight coupling between the networking and infrastructure components. With traditional data center networking, Layer 4 is coupled to the network topology at fixed network points and lacks the flexibility to support today's containerized applications, which are far more agile than the traditional monolithic application.

Another issue is that containers are short-lived and constantly spun up and torn down. The assets that support the application, such as IP addresses, firewalls, policies, and the overlay networks that glue the connectivity together, are constantly recycled. These changes bring a lot of agility and business benefits, but they contrast sharply with a traditional network, which is relatively static and where changes happen every few months.


Diagram: OpenShift Networking Challenges

 

Endpoint Reachability

Then there is endpoint reachability. Not only have the endpoints changed, but so have the ways we reach them. The application stack previously had very few components, maybe just a cache, a web server, or a database. The most common network service used a load-balancing algorithm to let a source reach an application endpoint or spread requests across several endpoints, typically a simple round-robin or a load balancer that measured load. Essentially, the sole purpose of the network was to provide endpoint reachability. However, changes inside the data center are driving networks and network services towards becoming more integrated with the application.

Nowadays, the network function no longer exists solely to satisfy endpoint reachability; it is fully integrated. In the case of Red Hat's OpenShift, the network is represented as a Software-Defined Networking (SDN) layer. SDN means different things to different vendors, so let me clarify what it means in terms of OpenShift.

 

Highlighting Software-Defined Network (SDN)

In a traditional networking device, the control plane and the forwarding plane share a single box. SDN separates these two planes: the control and forwarding planes are decoupled from each other and can now reside on different devices, bringing many performance and management benefits. This decoupling and tighter network integration make it much easier to divide applications into several microservice components, driving the microservices culture of application architecture. You could say that SDN was a prerequisite for microservices.

 


Diagram: OpenShift Networking explained. Link to YouTube video.

 

Challenges to Docker Networking 

      • Port Mapping and NAT

Docker containers have been around for a while, but when they first came out, networking had significant drawbacks. With Docker container networking, containers connect to a bridge on the node where the Docker daemon is running. To allow connectivity between those containers and any endpoint external to the node, we need port mapping and Network Address Translation (NAT), which by themselves add complexity. Port mapping and NAT have been around for ages, but introducing these functions complicates container networking at scale. They are perfectly fine for three or four containers, but a production network has many more endpoints to deal with. The origins of container networking are based on a simple architecture that is primarily a single-host solution.
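As a concrete illustration of the port-mapping behaviour described above, here is a minimal sketch using the Docker SDK for Python. The docker package, the nginx image, and host port 8080 are illustrative assumptions, not part of the original text.

# A minimal sketch of Docker port mapping (assumes the "docker" Python package
# is installed and a local Docker daemon is running).
import docker

client = docker.from_env()

# The container listens on port 80 inside its own network namespace; to reach it
# from outside the node, the daemon sets up a NAT rule mapping host port 8080 to it.
container = client.containers.run(
    "nginx:alpine",
    detach=True,
    ports={"80/tcp": 8080},  # host 8080 -> container 80 (port mapping / NAT)
)

print(container.name, container.status)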

 

      • Docker at scale: The need for an orchestration layer

The core building blocks of containers, such as namespaces and control groups, are battle-tested. And although the Docker engine manages containers by leveraging Linux kernel resources, it is limited to a single host operating system. Once you get past three or so hosts, the networking becomes hard to manage: everything needs to be spun up in a certain order, and maintaining consistent network connectivity and security, regardless of the mobility of the workloads, is also a challenge. This led to the orchestration layer. Just as a container is an abstraction over the physical machine, the container orchestration framework is an abstraction over the network. This brings us to the Kubernetes networking model, which OpenShift builds on and enhances; for example, we have the OpenShift Route construct that exposes applications for external access. We will discuss OpenShift Routes and Kubernetes Services in just a moment.

 


Diagram: OpenShift Security. Link to YouTube video.

 

Introduction to OpenShift

OpenShift Container Platform (formerly known as OpenShift Enterprise), or OCP, is Red Hat's offering for an on-premises, private platform as a service (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution. Because its foundation is Kubernetes, it shares the same core networking technology along with some enhancements: containers provide the workloads, Kubernetes provides the orchestration layer, and all of it sits upon an SDN layer that glues everything together. It is the role of the SDN to create the cluster-wide network, and the glue that connects all the dots is an overlay network that operates over an underlay network. But first, let us address the Kubernetes networking model of operation.

 

  • The Kubernetes Model: Pod Networking

The Kubernetes networking model was developed to simplify Docker container networking, which had the drawbacks we have just discussed. It did this by introducing the concept of a Pod and Pod networking, which allows multiple containers inside a Pod to share an IP namespace so they can communicate with each other over IPC or localhost. Nowadays, we typically place a single container into a single Pod, and the Pod acts as a boundary layer for any cluster parameters that directly affect the container; we run deployments against Pods, not containers. In OpenShift, we can assign networking and security parameters to Pods that then affect the containers inside. When an app is deployed on the cluster, each Pod gets an IP assigned, and each Pod could host a different application.

For example, Pod 1 could run a web front end and Pod 2 a database, so the Pods need to communicate, and for that we need a network and IP addresses. By default, Kubernetes allocates each Pod an internal IP address for the applications running within it. Pods and their containers can communicate with each other, but clients outside the cluster do not have access to internal cluster resources by default. With Pod networking, every Pod must be able to communicate with every other Pod in the cluster without Network Address Translation (NAT).
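To make the shared IP namespace concrete, here is a minimal sketch that creates a Pod with two containers using the Kubernetes Python client. The kubeconfig, namespace, Pod name, and images are illustrative assumptions.

# A minimal sketch (assumes the "kubernetes" Python client and a valid kubeconfig).
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-frontend"},
    "spec": {
        "containers": [
            # Both containers share the Pod's IP namespace and can talk over localhost.
            {"name": "web", "image": "nginx:alpine"},
            {"name": "sidecar-cache", "image": "redis:alpine"},
        ]
    },
}

created = client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
# The Pod IP is assigned by the cluster; it may still be None until the Pod is scheduled.
print("Pod IP:", created.status.pod_ip)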


Diagram: OpenShift Network Policy.

 

  • A common Service Type: ClusterIP

The most common type of service is "ClusterIP." The ClusterIP is a persistent virtual IP address used for load-balancing traffic internal to the cluster. Services of this type cannot be accessed directly from outside the cluster; there are other service types for that requirement. The ClusterIP service type is considered East-West traffic, since it originates from Pods running in the cluster and is destined for a service IP backed by Pods that also run in the cluster. To enable external access, we need to expose the service that the Pod or Pods represent, and this is done with an OpenShift Route, which provides a URL.

So we have a Service running in front of a Pod or group of Pods, and the default is internal access only. Then we have a Route, which is URL-based and gives the internal Service external access.
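A small sketch of that pattern follows: a ClusterIP Service for internal (East-West) traffic, then an OpenShift Route pointing at the Service for external access. The names, labels, ports, and namespace are illustrative assumptions; the Route is created as a custom object in the route.openshift.io/v1 API group.

# A sketch of exposing Pods internally (ClusterIP Service) and externally (Route).
# Assumes the "kubernetes" client and a kubeconfig pointing at an OpenShift cluster.
from kubernetes import client, config

config.load_kube_config()

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-frontend"},
    "spec": {
        "type": "ClusterIP",               # internal, East-West traffic only
        "selector": {"app": "web-frontend"},
        "ports": [{"port": 80, "targetPort": 8080}],
    },
}
client.CoreV1Api().create_namespaced_service(namespace="default", body=service)

route = {
    "apiVersion": "route.openshift.io/v1",
    "kind": "Route",
    "metadata": {"name": "web-frontend"},
    "spec": {"to": {"kind": "Service", "name": "web-frontend"}},  # URL for external access
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="route.openshift.io", version="v1", namespace="default",
    plural="routes", body=route,
)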

 

Different OpenShift SDN Networking Modes

Depending on your OpenShift SDN configuration, there are different ways you can tailor the network topology. You can have free-for-all Pod connectivity, similar to a flat network, or something stricter with different levels of security boundaries and restrictions. Free-for-all Pod connectivity between all projects might be fine for a lab environment, but for production networks with multiple projects, you may need to tailor the network with segmentation, which can be done with one of the OpenShift SDN plugins that we will get to in just a moment. OpenShift networking does this with an SDN layer that enhances Kubernetes networking, giving us a virtual network across all the nodes built on Open vSwitch. With the OpenShift SDN, this Pod network is established and maintained by the OpenShift SDN, which configures an overlay network using Open vSwitch (OVS).

 

The OpenShift SDN Plugin

We mentioned that you can tailor the virtual network topology to suit your networking requirements; this is determined by the OpenShift SDN plugin and the SDN mode you choose. With the default OpenShift SDN, several modes are available. The SDN mode you choose governs how connectivity between applications is managed and how external access is provided to them; some modes are more fine-grained than others. How are all these plugins enabled? OpenShift Container Platform (OCP) networking relies on the Kubernetes CNI model while supporting several plugins by default, as well as several commercial SDN implementations, including Cisco ACI. The native plugins rely on the Open vSwitch virtual switch and provide segmentation using VXLAN, specifically the VNID, or the Kubernetes NetworkPolicy objects (a small NetworkPolicy sketch follows the plugin list below):

We have, for example:

        • ovs-subnet
        • ovs-multitenant
        • ovs-networkpolicy
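To illustrate the kind of segmentation the ovs-networkpolicy mode enables, here is a sketch of a Kubernetes NetworkPolicy that only admits traffic to database Pods from Pods labelled app=web. The names, labels, and namespace are illustrative assumptions.

# A sketch of a NetworkPolicy for segmentation (assumes the "kubernetes" client).
from kubernetes import client, config

config.load_kube_config()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-web-to-db"),
    spec=client.V1NetworkPolicySpec(
        # Applies to Pods labelled app=db ...
        pod_selector=client.V1LabelSelector(match_labels={"app": "db"}),
        ingress=[
            client.V1NetworkPolicyIngressRule(
                # ... and only allows ingress from Pods labelled app=web.
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(match_labels={"app": "web"})
                    )
                ]
            )
        ],
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(namespace="default", body=policy)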

Observability and Controllability

Observability and Controllability: Issues with Metrics

What Is a Metric: Good for Known

So when it comes to observability and controllability, one needs to understand the limitations of the metric. In reality, a metric is a single number, with tags optionally appended for grouping and searching those numbers. Metrics are disposable, cheap, and have a predictable storage footprint. A metric is a numerical representation of system state over a recorded time interval and can tell you if a particular resource is over- or under-utilized at a particular moment in time. For example, CPU utilization might be at 75% right now.

There are many tools for gathering metrics, such as Prometheus, and several techniques for collecting them, such as the PUSH and PULL approaches. There are pros and cons to each method, but Prometheus and its PULL approach are prevalent in the market. However, if you are looking for full observability and controllability, keep in mind that this sits solely in the world of metrics-based monitoring.
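To ground the PULL approach, here is a minimal sketch using the official prometheus_client library: the application exposes a /metrics endpoint that a Prometheus server scrapes. The metric names, port, and simulated values are illustrative assumptions.

# A minimal sketch of the PULL model with prometheus_client.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path"])
CPU_UTIL = Gauge("cpu_utilization_percent", "Point-in-time CPU utilization")

if __name__ == "__main__":
    start_http_server(8000)          # Prometheus pulls from http://<host>:8000/metrics
    while True:
        REQUESTS.labels(path="/checkout").inc()
        CPU_UTIL.set(random.uniform(40, 90))   # e.g. "CPU is at 75% right now"
        time.sleep(1)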


Diagram: Monitoring Metrics.

 

Metrics: Resource Utilization Only

So metrics are useful for telling us about resource utilization. Within a Kubernetes environment, these metrics are used for auto-healing and auto-scheduling purposes. When it comes to metrics, monitoring performs several functions. First, it collects, aggregates, and analyzes metrics to sift through known patterns that indicate troubling trends. The key point here is that it sifts through known patterns. Then, based on a known event, metrics trigger alerts that notify us when further investigation is needed. Finally, on top of all of this, we have dashboards that display the metric trends adapted for visual consumption.

These monitoring systems work well for identifying previously encountered, known failures, but they don't help as much with the unknown. Unknown failures are the norm these days, with distributed systems and complex system interactions. Metrics are good for dashboards, but there won't be a predefined dashboard for unknowns, because you can't track something you don't know about. Using metrics and dashboards like this is a very reactive approach, yet it's an approach that has been widely accepted as the norm. Monitoring is a reactive practice best suited for detecting known problems and previously identified patterns.

Within a microservices environment, metrics can tell you when a microservice is healthy or unhealthy. Still, a metric will have a hard time telling you whether a microservice's function is taking a long time to complete or whether there is an intermittent problem with an upstream or downstream dependency, so we need different tools to gather this type of information. The issue with metrics is that they only look at individual microservices with a given set of attributes, so they don't give you a holistic view of the entire problem. The application stack now exists in numerous locations and location types, and we need a holistic viewpoint, which a metric does not give. Metrics are used to track simplistic system states that might indicate a service is running poorly or might be a leading indicator, an early warning signal. However, while those measures are easy to collect, they don't turn out to be useful measures for triggering alerts.

 


Diagram: The Three Pillars of Observability: Metrics, Traces, and Logs. Link to YouTube video.

 

  • Issues With Dashboards: Useful Only for a Few Metrics

These metrics are gathered and stored in time-series databases, and we have several dashboards to display them. When these dashboards were first built, there weren't many system metrics to worry about; you could have gotten away with 20 or so dashboards, but that was about it. As a result, it was easy to see the critical data anyone should know about for any given service. Moreover, those systems were pretty simple and did not have many moving parts. This is in contrast to modern services, which typically collect so many metrics that it's impossible to fit them all into the same dashboard.

 

  • Issues with Aggregate Metrics

So we need to find ways to fit all the metrics into a few dashboards, and here the metrics are often pre-aggregated and averaged. The issue is that aggregate values no longer provide meaningful visibility, even when we have filters and drill-downs. We therefore end up predeclaring the conditions that we think we are going to see in the future, relying on instinct from past experience and gut feeling. Remember the network and software hero? You should try to avoid aggregation and averaging within the metrics store. Percentiles, on the other hand, offer a richer view; keep in mind, however, that they require the raw data.
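The following small sketch shows why averaging hides what percentiles reveal; the latency values are made up for illustration.

# Averaged latency hides tail problems that percentiles expose.
latencies_ms = [20, 22, 21, 23, 25, 24, 22, 21, 950, 990]  # two very slow requests

mean = sum(latencies_ms) / len(latencies_ms)

def percentile(values, pct):
    # Simple nearest-rank style percentile over the raw data.
    ordered = sorted(values)
    index = int(round(pct / 100 * (len(ordered) - 1)))
    return ordered[index]

print(f"mean = {mean:.0f} ms")                      # ~212 ms, looks tolerable
print(f"p50  = {percentile(latencies_ms, 50)} ms")  # the typical request is fine
print(f"p95  = {percentile(latencies_ms, 95)} ms")  # the tail pain is obvious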

 

Highlighting Observability: Any Question

Observability and controllability tools take an entirely different approach. They favor exploratory ways of finding problems. Essentially, those operating observability systems don't sit back and wait for an alert or for something to happen; instead, they are constantly asking open-ended questions of the observability system. Observability tools should gather rich telemetry for every possible event, capture the full context of every request, and be able to store and query it. In addition, these new observability tools are specifically designed to query against high-cardinality data. High cardinality allows you to interrogate your event data in any arbitrary way you see fit, ask any question about your system, and inspect its corresponding state.

 


Diagram: Distributed Tracing. Link to YouTube video.

 

Key Observability and Controllability Considerations

  • No Predictions in Advance

Due to the nature of modern software systems, you want the ability to understand any inner state and any services without anticipating or predicting them in advance. For this, we need to gain valuable telemetry and use some new tools and technological capabilities to gather and interrogate this data once it has been collected. Telemetry needs to be constantly gathered in flexible ways to debug issues without predicting how failures may occur. 

The conditions that affect infrastructure health change infrequently, which makes the infrastructure relatively easy to monitor. In addition, we have several well-established practices for prediction, such as capacity planning, and the ability to remediate automatically (for example, auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues.


Diagram: Observability Tools.

 

Due to its relatively predictable and slowly changing nature, the aggregated metrics approach monitors and alerts perfectly for infrastructure problems. So here, a metric-based system works well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached. So you could say that metrics-based systems work well for infrastructure problems that don’t change too much but fall dramatically short in the world of complex distributed systems. For these types of systems, you should opt for an observability and controllability platform. 


Traditional Data Center | Cisco ACI

Traditionally, we have built our networks based on a hierarchical design. This is often referred to as the traditional data center: a three-tier design with an access layer, an aggregation layer, and a core layer. Historically, this design enabled a substantial amount of predictability because aggregation switch blocks simplified the spanning-tree topology. In addition, the need for scalability often pushed this design into modularity, which increased predictability further. However, the main challenge inherent in the three-tier model is that it is difficult to scale. As the number of endpoints increases and workloads need to move between segments, Layer 2 has to be spanned further and further across the network.

The traditional data center design often leads to poor network design and human error. You don't want a Layer 2 segment stretched between data centers unless you have the proper controls in place. Although modularization is still desired in networks today, the general trend has been to move away from this design type, which revolves around spanning tree, to a more flexible and scalable solution with VXLAN and other similar Layer 3 overlay technologies. In addition, the Layer 3 overlay technologies bring a lot of network agility, which is vital to business success.

The word agility refers to making changes, deploying services, and supporting the business at the speed it desires. This means different things to different organizations. For example, a network team can be considered agile if it can deploy network services in a matter of weeks. In other organizations, it could mean that business units should be able to get applications to production or scale core services on demand through automation with the Ansible CLI or Ansible Tower. Regardless of how you define agility, there is little disagreement with the idea that network agility is vital to business success. The problem is that network agility has traditionally been hard to achieve, until now with the Cisco ACI. Let's recap some of the main data center transitions to understand this fully.

 


Diagram: Traditional data center transformation.

 

Traditional Data Center:

  • Layer 2 to the Core

The traditional data center has gone through several transitions. Firstly, we had Layer 2 to the core: from the access layer up to the core, we ran Layer 2 rather than Layer 3. A design like this would, for example, trunk all VLANs to the core, and for redundancy you would manually prune VLANs from the different trunk links. The challenge with this approach is its reliance on Spanning Tree Protocol (STP), which blocks redundant links. As a result, we don't have the full bandwidth, leading to performance degradation and wasted resources. Another challenge is relying on STP convergence to repair the topology. STP does have timers to limit convergence and can be tuned for better performance, but we are still relying on it to fix the topology, and STP was never meant to be a routing protocol. Protocols operating higher up the stack are designed to react to topology changes far more efficiently. STP is not an optimized control-plane protocol, which is a big hindrance in the traditional data center. You could relate this to how VLANs have transitioned to become a security feature even though their original purpose was performance.

 

  • Routing to Access Layer

To overcome these challenges and build stable data center networks, the Layer 3 boundary gets pushed further and further towards the network's edge. Layer 3 networks can use the advances in routing protocols, which handle failures and link redundancy much more efficiently than Spanning Tree Protocol, a protocol that should never have been used in this role in the first place. So we moved to routing at the access layer. With this design, we can eliminate Spanning Tree Protocol towards the core and run Equal Cost MultiPath (ECMP) from the access to the core. We can run ECMP because we are now Layer 3 routing from the access to the core layer instead of running STP, which blocks redundant links. Equal cost multipath routes offer a simple way to share the network load by distributing traffic onto multiple paths. ECMP is typically applied to entire flows or sets of flows; a flow in this respect may be characterized by destination address, source address, transport-level ports, and payload protocol.
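The following sketch illustrates how an ECMP hash keeps a flow pinned to one path: the flow's 5-tuple is hashed and the result selects one of the equal-cost uplinks, so packets of the same flow never reorder across paths. The uplink names, addresses, and CRC32 hash are illustrative assumptions; real hardware uses its own hash functions.

import zlib

UPLINKS = ["spine-1", "spine-2", "spine-3", "spine-4"]

def pick_path(src_ip, dst_ip, src_port, dst_port, protocol):
    # Hash the 5-tuple and pick one of the equal-cost uplinks.
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return UPLINKS[zlib.crc32(flow) % len(UPLINKS)]

# Every packet of this flow hashes to the same uplink:
print(pick_path("10.0.1.10", "10.0.2.20", 49152, 443, "tcp"))
# A different flow may land on a different, equally valid path:
print(pick_path("10.0.1.11", "10.0.2.20", 49153, 443, "tcp"))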

 

A Key Point: Equal Cost MultiPath (ECMP)

Equal Cost MultiPath (ECMP) brings many advantages. Firstly, ECMP gives us full bandwidth across equal-cost links: as we are routing, we no longer have to block redundant links to prevent loops at Layer 2. However, we still have Layer 2 in the network design at the access layer, so parts of the network will still rely on Spanning Tree Protocol and its convergence times when there is a change in the topology. We may have Layer 3 from the access to the core, but we still have Layer 2 connections at the edge and rely on STP to block redundant links to prevent loops. Another potential drawback is that having smaller Layer 2 domains can limit where an application can reside in the data center network, which drives more of a need to transition away from the traditional data center design.

 


Diagram: Data center network design: Equal cost multi path.

 

The Layer 2 domain that the applications may use could be limited to a single server rack connected to one ToR, or two ToRs for redundancy with a Layer 2 interlink between the two ToR switches to pass the Layer 2 traffic. These designs are not optimal, as you have to specify where you want your applications to sit, putting the brakes on agility. As a result, there was another key data center transition: the introduction of overlay data center designs.

 

  • The Rise of Virtualization

Virtualization is the creation of a virtual, rather than actual, version of something, such as an operating system (OS), a server, a storage device, or network resources. Virtualization uses software that simulates hardware functionality to create a virtual system, and it was initially developed during the mainframe era. With virtualization, a virtual machine can exist on any host, so Layer 2 had to be extended to every switch. This was problematic for larger networks, as the core switch had to learn every MAC address for every flow that traversed it. To overcome this and take advantage of the convergence and stability of Layer 3 networks, overlay networks became the choice for data center networking.

VXLAN is an encapsulation protocol that provides data center connectivity using tunneling to stretch Layer 2 connections over an underlying Layer 3 network. In data centers, VXLAN is the most commonly used protocol to create overlay networks that sit on top of the physical network, enabling the use of virtual networks. The VXLAN protocol supports the virtualization of the data center network while addressing the needs of multi-tenant data centers by providing the necessary segmentation on a large scale.

Here we encapsulate traffic into a VXLAN header and forward it between VXLAN tunnel endpoints, known as VTEPs. With overlay networking, we have the overlay and the underlay. By encapsulating the traffic into the VXLAN overlay, we can use the underlay, which in the ACI is built on IS-IS, for Layer 3 stability and redundant paths using Equal Cost Multipathing (ECMP), along with the fast convergence of routing protocols.
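The sketch below shows the encapsulation a VTEP performs: an 8-byte VXLAN header (flags plus a 24-bit VNI, per RFC 7348) is prepended to the original Layer 2 frame, and the result rides in a UDP datagram (destination port 4789) across the Layer 3 underlay. The frame bytes and VNI value are illustrative.

import struct

VXLAN_UDP_PORT = 4789

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    flags = 0x08 << 24                  # "I" bit set: the VNI field is valid
    vni_field = (vni & 0xFFFFFF) << 8   # 24-bit VNI, low 8 bits reserved
    header = struct.pack("!II", flags, vni_field)
    return header + inner_frame         # outer UDP/IP headers are added by the stack

original_l2_frame = bytes(64)           # placeholder Ethernet frame
packet = vxlan_encapsulate(original_l2_frame, vni=10100)
print(len(packet), "bytes of payload destined to UDP port", VXLAN_UDP_PORT)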

 

 

  • The Cisco Data Center Transition

The Cisco data center has gone through several stages. First, we started with Spanning Tree, moved to Spanning Tree with vPCs, and then replaced Spanning Tree with FabricPath, which is a MAC-in-MAC encapsulation. Then we replaced it with VXLAN, which is a MAC-in-IP encapsulation. Today, VXLAN is the de facto overlay protocol for data center networking. The Cisco ACI uses an enhanced version of VXLAN to implement both Layer 2 and Layer 3 forwarding with a unified control plane. Replacing Spanning Tree with VXLAN and its MAC-in-IP encapsulation was a welcome milestone for data center networking.

 

Introduction to the ACI

The Cisco Application Centric Infrastructure (ACI) fabric is the Cisco SDN solution for the data center. Cisco has taken a different approach from the centralized control-plane SDN model used by other vendors and has created a scalable data center solution that can be extended to multiple on-premises, public, and private cloud locations. The ACI fabric has many components, including Cisco Nexus 9000 Series switches and the APIC controller running in the leaf/spine ACI fabric mode. These components form the building blocks of the ACI, supporting a dynamic, integrated physical and virtual infrastructure.

 

A key point. The Cisco ACI version.

Before Cisco ACI 4.1, the Cisco ACI fabric allowed only a two-tier (spine-and-leaf switch) topology, in which each leaf switch is connected to every spine switch in the network with no interconnection between leaf switches or between spine switches. Starting from Cisco ACI 4.1, the fabric allows a multitier (three-tier) fabric with two tiers of leaf switches, which provides the capability for vertical expansion of the Cisco ACI fabric. This is useful for migrating a traditional three-tier core-aggregation-access architecture, which has been a common design model for many enterprise networks and is still required today.

 

A key point. The APIC Controller.

From the management perspective, the network is driven by a database held by the Cisco Application Policy Infrastructure Controller (APIC), working as a cluster. The APIC is the centralized point of control, and everything you want to configure you can do in the APIC. Consider the APIC to be the brains of the ACI fabric; it serves as the single source of truth for configuration within the fabric. The APIC controller is a policy engine and holds the defined policy, which essentially tells the other elements in the ACI fabric what to do. This database allows you to manage the network as a single entity.

In summary, the APIC infrastructure controller is the main architectural component of the Cisco ACI solution. It is the unified point of automation and management for the Cisco ACI fabric, policy enforcement, and health monitoring. The APIC is not involved in data-plane forwarding.


Diagram: Data center layout: The Cisco APIC controller.

 

The APIC represents the management plane, which allows the system to maintain the control and data planes in the network. The APIC is not the control-plane device, nor does it sit in the data traffic path. Remember that the APIC controller can crash and you still have forwarding in the fabric. The ACI solution is not a centralized control-plane SDN approach; the ACI is a distributed fabric with independent control planes on all fabric switches.

 

Modular data center design: The Leaf and Spine 

Leaf-spine is a two-layer data center network topology that's useful for data centers that experience more east-west than north-south traffic. The topology comprises leaf switches (to which servers and storage connect) and spine switches (to which leaf switches connect). In this two-tier Clos architecture, every lower-tier switch (leaf layer) is connected to each top-tier switch (spine layer) in a full-mesh topology. The leaf layer consists of access switches that connect to devices such as servers. The spine layer is the network's backbone and is responsible for interconnecting all leaf switches; every leaf switch connects to every spine switch in the fabric. The path is randomly chosen so the traffic load is evenly distributed among the top-tier switches. Therefore, if one of the top-tier switches fails, performance throughout the data center degrades only slightly.

Unlike the traditional data center, the ACI operates with a leaf-and-spine architecture. Traffic from an end host enters the fabric through what is known in the ACI as a leaf device. We also have the spine devices, which are Layer 3 routers with no special hardware dependencies. In a basic leaf-and-spine fabric, every leaf is connected to every spine, and any endpoint in the fabric is always the same distance, in terms of hops and latency, from every other endpoint internal to the fabric. The ACI spine switches are Clos intermediary switches with many key functions. Firstly, they exchange routing updates with leaf switches via Intermediate System-to-Intermediate System (IS-IS) and rapidly forward packets between leaf switches. They provide endpoint lookup services to leaf switches through the Council of Oracle Protocol (COOP), and they handle route reflection to the leaf switches using Multiprotocol BGP (MP-BGP).

 


Diagram: Cisco ACI Overview.

 

The leaf switches are the ingress/egress points for traffic into and out of the ACI fabric. In addition, they are the connectivity points for the variety of endpoints that the Cisco ACI supports; the leaf switches provide end-host connectivity. The spines act as a fast, non-blocking Layer 3 forwarding plane that supports Equal Cost Multipathing (ECMP) between any two endpoints in the fabric and uses overlay protocols such as VXLAN under the hood. VXLAN enables workloads to exist anywhere in the fabric without introducing too much complexity.

 

      • A Key Point:

This is a big improvement to data center networking, as we can now have workloads, physical or virtual, in the same logical Layer 2 domain even when running Layer 3 down to each ToR switch. The ACI is a scalable solution, as the underlay is specifically built to scale as more links are added to the topology, and it remains resilient when links in the fabric are brought down due to, for example, maintenance or failure.

 

  • The Normalization event

VXLAN is an industry-standard protocol that extends Layer 2 segments over a Layer 3 infrastructure to build Layer 2 overlay logical networks. The ACI infrastructure Layer 2 domains reside in the overlay, with isolated broadcast and failure bridge domains. This approach allows the data center network to grow without the risk of creating too large a failure domain. All traffic in the ACI fabric is normalized as VXLAN packets: at ingress, ACI encapsulates external VLAN, VXLAN, and NVGRE packets in a VXLAN packet. This is known as ACI encapsulation normalization. As a result, forwarding in the ACI fabric is not limited to or constrained by the encapsulation type or encapsulation overlay network. If need be, the ACI bridge domain forwarding policy can be defined to provide standard VLAN behavior where required.

When traffic hits the Leaf, there is a normalization event. The normalization takes traffic sent from the servers to the ACI and makes it ACI compatible. Essentially, we are giving traffic that is sent from the servers a VXLAN ID so it can be sent across the ACI fabric. Traffic is normalized and then encapsulated with a VXLAN header and routed across the ACI fabric to the destination leaf where the destination endpoint is. This is, in a nutshell, how the ACI leaf and Spine work. We have a set of leaf switches that connect to the workloads and the spines that connect to the Leaf. VXLAN is the overlay protocol that carries data traffic across the ACI fabric.  A key point to this type of architecture is that the Layer 3 boundary is moved to the Leaf. This brings a lot of value and benefits to data center design. This boundary makes more sense as we have to route and encapsulate at this layer without going up to the core layer.

 

 


Service Level Objectives (SLOs): Customer-centric view

Site Reliability Engineering (SRE) teams have tools such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets that can guide them on the road to building a reliable system with the customer viewpoint as the metric. These tools form the basis for a reliable system and are the core building blocks of a reliable stack. The first thing you need to understand is what is expected of the service. This introduces the area of service-level management and its components. The core concepts of service-level management are the Service Level Agreement (SLA), Service Level Objective (SLO), and Service Level Indicator (SLI). The common indicators used are availability, latency, duration, and efficiency, and it is critical to monitor these indicators to catch problems before your SLO is violated. These are the cornerstones of developing a good SRE practice.

 


 

Diagram: System Reliability Meaning.

 

      • SLI: Service Level Indicator: A well-defined measure of "successful enough." It is a quantifiable measurement of whether a given user interaction was good enough: did it meet the expectations of the users? Did a web page load within a certain time? This allows you to categorize a given interaction as good or bad.
      • SLO: Service Level Objective: A top-line target for the fraction of successful interactions (a small error-budget sketch follows this list).
      • SLA: Service Level Agreement: the consequences if the objective is missed; it's more of a legal construct.
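Here is a small sketch of turning an SLO into an error budget; the 99.9% target, the request counts, and the 30-day window are illustrative assumptions.

SLO_TARGET = 0.999           # 99.9% of requests should be "good"

def error_budget_remaining(total_requests, bad_requests):
    allowed_bad = (1 - SLO_TARGET) * total_requests   # the error budget
    return allowed_bad - bad_requests                 # negative => SLO violated

# 30-day window with 10M requests: the budget is 10,000 bad requests.
print(error_budget_remaining(total_requests=10_000_000, bad_requests=4_200))
# -> 5800.0 bad requests of budget left: room to ship features or run chaos tests

# The same idea expressed as allowed downtime for an availability SLO:
minutes_in_30_days = 30 * 24 * 60
print((1 - SLO_TARGET) * minutes_in_30_days, "minutes of downtime allowed")  # ~43.2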

 

Reliability is not so much a feature as a practice that must be prioritized and taken into consideration from the very beginning, not something that is added later, for example once a system or service is already in production. The most important feature of any system is reliability, and it's not a feature that a vendor can sell you. So if someone tries to sell you an add-on solution called Reliability, don't buy it, especially if they offer you 100% reliability. Nothing can be 100% reliable all the time. If you strive for 100% reliability, you will miss out on opportunities to perform innovative tasks and on the experimentation and risk-taking that can help you build better products and services.

 


Diagram: Site Reliability Engineering (SRE). Link to YouTube video.

 

Components of a Reliable System

  • Distributed System

To build reliable systems that can tolerate a variety of failures, the system needs to be distributed so that a problem in one location doesn’t mean your entire service stops operating. So you need to be able to build a system that can handle, for example, a node dying or perform adequately with a certain load. To create a reliable system, you need to understand it fully and what happens when the different components that make up the system reach certain thresholds. This is where practices such as Chaos Engineering can help you.

 

  • Chaos Engineering 

Practices like Chaos Engineering can confirm your expectations, give you confidence in your system at different levels, and prove that your system can tolerate a certain amount of failure. Chaos Engineering allows you to find weaknesses and vulnerabilities in complex systems, and it can be automated into your CI/CD pipelines, so you can run various Chaos Engineering verifications before you reach production. These Chaos Engineering tests, such as load and latency tests, can all be automated with little or no human interaction. The practice of Chaos Engineering is often used by Site Reliability Engineering (SRE) teams to improve resilience and should be part of your software development and deployment process.

 


Diagram: Chaos Engineering. Link to YouTube video.

 

It’s All About Perception: Customer-Centric View

Reliability is all about perception. If the user considers your service unreliable, you will lose consumer trust because the perceived quality of the service is poor, so it's important to provide consistency in your services as much as you can. It's OK to have some outages; outages are expected, but you can't have them all the time or for long durations. Users expect outages at some point, but not prolonged ones. Perception is everything, and if the user thinks you are unreliable, you are. Therefore you need a customer-centric view, and customer satisfaction is a critical metric to measure. This is where the key components of service-level management, such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), come into play. There is a balance you need to find between velocity and stability: you can't stop innovation, but you can't take too many risks. An error budget and Site Reliability Engineering (SRE) principles will help you here.

 

  • User Experience: Static Thresholds

User experience means different things to different sets of users. We now have a model where different users of a service may be routed through the system in different ways, using different components, so the experience can vary widely. We also know that services no longer tend to break in the same few predictable ways over and over. With complex microservices and many software interactions, we see a lot of unpredictable failures that have never been seen before; these are often referred to as black holes. Alerts should be few and triggered only by symptoms that directly impact user experience, not because a threshold was reached. If your Pod network reaches a certain threshold, that tells you nothing about the user experience. You can't rely on static thresholds anymore, as they have no relationship to customer satisfaction.

If you are using static thresholds, they can’t reliably indicate any issues with user experience. Alerts should be set up to detect failures that impact user experience. Traditional monitoring falls short in trying to do this as it usually has predefined dashboards looking for something that has happened before. This brings us back to the challenges with the traditional metrics-based monitoring; we rely on static thresholds to define optimal system conditions, which has nothing to do with user experience. However, modern systems change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience. They lack context and are too coarse.

 

How to Approach Reliability 

  • New Tools and Technologies

We have new tools such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so we can figure out where the time has been spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
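A minimal sketch of that instrumentation with OpenTelemetry for Python follows. It assumes the opentelemetry-api and opentelemetry-sdk packages; the span names and the console exporter are illustrative, and a real setup would export to a tracing backend such as Jaeger or an OTLP collector.

import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle-checkout"):       # parent span
    with tracer.start_as_current_span("call-payment-api"):  # where was the time spent?
        time.sleep(0.2)                                      # simulated downstream call
    with tracer.start_as_current_span("write-order-db"):
        time.sleep(0.05)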

We have already touched on Service Level Objectives, Service Level Indicators, and error budgets. You want to know why and how something has happened, not just be told that something happened and then react to an event without looking at it from the customer's perspective. We need to understand whether we are meeting the Service Level Agreement (SLA) by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) can assist you with these measurements.

SLOs and SLIs do more than assist with measurement: they offer a tool for better system reliability and form the base of the reliability stack. SLIs and SLOs help us interact with reliability differently and offer a path to building a reliable system. So now we have the tools and a discipline within which to use them. Can you recall what that discipline is? It is Site Reliability Engineering (SRE).


Diagram: System Reliability Formula.

 

  • SLO-Based Approach to Reliability

If you’re too reliable all the time, you’re also missing out on some of the fundamental features that SLO-based approaches give you. The main area you will be missing out on is the freedom to do what you want, test, and innovate. If you’re too reliable, you’re missing out on opportunities to experiment, perform chaos engineering, ship features quicker than before, or even introduce structured downtime to see how your dependencies react. To learn a system, you need to break it. So if you are 100% reliable, you can’t touch your system, so you will never truly learn and understand your system. You want to give your users a good experience, but you’ll run out of resources in various ways if you try to ensure this good experience happens 100% of the time. SLOs let you pick a target that lives between those two worlds.

 

  • Balance Velocity and Stability

You can't just have reliability by itself; you also need new features and innovation. Therefore, you need to find a balance between velocity and stability, balancing reliability against the other features you have and are proposing to offer. Suppose you have access to a system with an amazing feature that doesn't work; users who have a choice will leave. The framework for finding the balance between velocity and stability is Site Reliability Engineering. So how do you know what level of reliability you need to provide to your customers? This all goes back to the business needs, which reflect the customers' expectations. With SRE, we have a customer-centric approach.

The main source of outages is making changes, even planned ones. These can come in many forms: pushing new features, applying security patches, deploying new hardware, and scaling up to meet customer demand, all of which will greatly impact you if you strive for a 100% reliability target. If nothing changes in the physical or logical infrastructure or other components, we will not have bugs; we could freeze our current user base and never have to scale the system. In reality, this will not happen. There will always be changes, so you need to find a balance.

 


Observability vs Monitoring

To understand the difference between observability and monitoring, we first need to discuss the role of monitoring. Monitoring is an evaluation that helps identify the most valuable and efficient use of resources. So the big question I put to you is: what should you monitor? This is the first step in preparing a monitoring strategy, and there are a couple of questions you can ask yourself to understand whether monitoring by itself is enough or whether you need to move to an observability platform. Firstly, consider what you should be monitoring, why you should be monitoring it, and how you should monitor it. When you know this, you can move on to the different tools and platforms available. Some of these tools are open source and others commercial. When evaluating these tools, one word of caution: does each tool work in a silo, or can it be used across technical domains? Silos break agility in every form of technology.

 


Diagram: Observability vs Monitoring

 

The Starting Point: Observability vs Monitoring

You need to measure and gather the correct event information in your environments, and this will be done with several tools. This will let you know what is affecting your application performance and infrastructure. As a good starting point, there are four golden signals to look out for: latency, saturation, traffic, and errors. These are Google's four golden signals, the four most important metrics to keep track of:

      1. Latency: How long it takes to serve a request
      2. Traffic: The number of requests being made.
      3. Errors: The rate of failing requests. 
      4. Saturation: How utilized the service is.

So now we have a guide on what to monitor. Applying this to Kubernetes, for example to a frontend web service that is part of a tiered application, we would be looking at the following (a small sketch follows the list):

      1. How many requests is the front end processing at a particular point in time?
      2. How many 500 errors are users of the service receiving? 
      3. Is the service over-utilized by requests?
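The sketch below derives the four golden signals for that frontend from raw request records; the records, the capacity figure, and the one-second window are illustrative assumptions.

requests = [
    # (duration in seconds, HTTP status)
    (0.12, 200), (0.34, 200), (0.08, 200), (1.90, 500), (0.25, 200), (2.40, 500),
]
capacity_rps = 50          # what the frontend is sized for (assumption)
window_seconds = 1

traffic = len(requests) / window_seconds                        # requests per second
errors = sum(1 for _, status in requests if status >= 500) / len(requests)
latency_p50 = sorted(d for d, _ in requests)[len(requests) // 2]
saturation = traffic / capacity_rps                             # how "full" the service is

print(f"traffic     = {traffic:.0f} rps")
print(f"error rate  = {errors:.0%}")          # how many 500s are users receiving?
print(f"latency p50 = {latency_p50 * 1000:.0f} ms")
print(f"saturation  = {saturation:.0%}")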

So we already know that monitoring is a form of evaluation to help identify the most valuable and efficient use of resources. With monitoring, we observe and check the progress or quality of something over time. So within this, we have metrics, logs, and alerts. Each has a different role and purpose.

 

    • Monitoring: The Role of Metrics

Metrics are related to some entity and allow you to view how many resources you consume. Metric data consists of numeric values rather than unstructured text such as documents and web pages, and it is typically time series, where values or measures are recorded over some period. Examples of such metrics are available bandwidth and latency. It is important to understand baseline values; without a baseline, you will not know if something is happening out of the norm. What are the usual baseline values of the different metrics for bandwidth and latency? Are there any fluctuations in these metrics? How do these values rise and fall during normal operations and peak usage? This may change over the different days of the week and months. If, during normal operations, you notice a rise in these values, that would be deemed abnormal and should act as a trigger that something could be wrong and needs to be investigated. Keep in mind that these values should not be gathered as a one-off; they should be gathered over time to give you a good understanding of your application and its underlying infrastructure.
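The following small sketch shows one way to use a baseline to decide whether a new reading is out of the norm: compare it with the mean plus a few standard deviations of historical values. The sample latencies and the three-sigma rule are illustrative assumptions.

import statistics

baseline_latency_ms = [42, 45, 41, 44, 43, 46, 40, 44, 45, 43]   # gathered over time
mean = statistics.mean(baseline_latency_ms)
stdev = statistics.stdev(baseline_latency_ms)

def is_abnormal(value_ms, sigmas=3):
    # Flag values well above the normal fluctuation band.
    return value_ms > mean + sigmas * stdev

print(is_abnormal(47))    # False: within normal fluctuation
print(is_abnormal(95))    # True: worth investigating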

 

    • Monitoring: The Role of Logs

Logging is an essential part of troubleshooting application and infrastructure performance. Logs give you additional information about events, which is important for troubleshooting or discovering the root cause. Logs have a lot more detail than metrics, so you will need some way to parse them, usually with a log shipper. A typical log shipper takes these logs from standard out in a Docker container and ships them to a backend for processing. FluentD and Logstash each have their pros and cons and can be used here to group the logs and send them to a backend database such as the ELK stack (Elasticsearch). With this approach, you can enrich the logs before sending them to the backend; for example, you can add GeoIP information, which adds richer context that can help you troubleshoot.
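As a rough sketch of that enrichment step, the snippet below adds a timestamp and GeoIP data to a log event before it would be shipped to a backend. The lookup_geoip helper is a hypothetical stand-in for a real GeoIP database lookup, and the event fields are illustrative.

import json
from datetime import datetime, timezone

def lookup_geoip(ip):
    # Hypothetical lookup; a real shipper would query a GeoIP database here.
    return {"country": "IE", "city": "Dublin"}

def enrich(event):
    event["@timestamp"] = datetime.now(timezone.utc).isoformat()
    event["geoip"] = lookup_geoip(event["client_ip"])
    return event

raw_event = {"client_ip": "203.0.113.7", "status": 500, "path": "/checkout"}
print(json.dumps(enrich(raw_event)))   # shipped to the backend (e.g. Elasticsearch)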

 

    • Monitoring: The Role of Alerting

Then we have alerting, and you need to balance how you monitor and what you alert on. Alerting is never perfect, and it will take time to get the right alerting strategy in place; it is not a simple day-one installation and requires much effort and cross-team collaboration. Alerting on too much will cause alert fatigue, and we are all too familiar with the problems alert fatigue can bring and the tension it creates between departments. To minimize this, you need to consider Service Level Objectives (SLOs) for alerts. SLOs are measurable characteristics such as availability, throughput, frequency, and response times, and they are the foundation of a reliability stack. You should also consider alert thresholds: if these are too tight, you will get a lot of false positives on your alerts.

Even with all of this in place, monitoring is not enough. Due to the sheer complexity of today's landscape, you need to think differently about the tools you use and how you use the intelligence and data you receive from them to resolve issues before they become incidents. Monitoring by itself is not enough: the monitoring tool is just a tool that probably does not cross technical domains, and different groups of users will administer each tool without a holistic view. The tools alone can take you only halfway through the journey. What also needs to be addressed is the culture and the traditional way of working in silos; a siloed environment can undermine the monitoring strategy you want to implement. Here you can look into an observability platform.

 


Diagram: Observability vs Monitoring. Link to YouTube video.

 

Observability vs Monitoring

When it comes to observability versus monitoring, we know that monitoring can detect problems and tell you if a system is down; when your system is up, monitoring doesn't care. Monitoring only cares when there is a problem. The problem has to happen before monitoring takes action, so it's very reactive. An observability platform, on the other hand, is a more proactive practice. It's about what your system and services are doing and how they are doing it. Observability improves your insight into how complex systems are working and lets you quickly get to the root cause of any problem, known or unknown. Observability is best suited for interrogating systems to explicitly discover the source of any problem, along any dimension or combination of dimensions, without needing to predict the failure first. This is a proactive approach.

 

The Pillars of Observability

This is achieved by utilizing a combination of logs, metrics, and traces, so we need data collection, storage, and analysis across these domains, while also being able to alert on what matters most. Let's say you want to draw correlations between units like TCP/IP packets and HTTP errors experienced by your app. The observability platform pulls context from different sources of information, such as logs, metrics, events, and traces, into one central context; distributed tracing adds a lot of value here. When everything is placed into one context, you can easily switch between the necessary views to troubleshoot the root cause. A key component of any observability system is the ability to view these telemetry sources through a single pane of glass.


Diagram: Distributed Tracing in Microservices

 

Monitoring: Known Unknowns / Observability: Unknown Unknowns

Monitoring automatically reports whether known failure conditions are occurring or about to occur. In other words, monitoring systems are optimized for reporting on unknown conditions about known failure modes; this is referred to as known unknowns. In contrast, observability is centered around discovering if and why previously unknown failure modes may be occurring: in other words, discovering unknown unknowns.

The monitoring-based approach of using metrics and dashboards is an investigative practice that leads with the experience and intuition of humans to detect and make sense of system issues. This is ok for a simple legacy system that fails in predictable ways, but the instinctual technique falls short for modern systems that fail in very unpredictable ways. With modern applications, the complexity and scale of their underlying systems quickly make that approach unattainable, and we can’t rely on hunches. Observability tools differ from traditional monitoring tools because they enable engineers to investigate any system, no matter how complex. You don’t need to react to a hunch or have intimate system knowledge to generate a hunch.

 


Diagram: Distributed Tracing Explained: Link to YouTube video.

 

  • Monitoring vs Observability: Working Together?

Monitoring best helps engineers understand infrastructure concerns, while observability best helps engineers understand software concerns, so observability and monitoring can work together. The infrastructure does not change too often, and when it fails, it fails in more predictable ways, so we can use monitoring here. This is in contrast to software system states, which change daily and are not predictable; observability fits this purpose. The conditions that affect infrastructure health change infrequently and are relatively easy to predict, and we have several well-established practices for prediction, such as capacity planning, and the ability to remediate automatically (for example, auto-scaling in a Kubernetes environment), all of which can be used to tackle these types of known issues. Due to its relatively predictable and slowly changing nature, the aggregated-metrics approach monitors and alerts perfectly well for infrastructure problems; here, a metrics-based system works well. Metrics-based systems and their associated alerts help you see when capacity limits or known error conditions of underlying systems are being reached. Monitoring the software is different: there we need access to high-cardinality fields, which may include a user ID or a shopping cart ID. Code that is well-instrumented for observability allows you to answer complex questions that are easy to miss when examining only aggregate performance.

 


OpenShift Security | Best Practices

Securing containerized environments is considerably different from securing a traditional monolithic application because of the inherent nature of the microservices architecture. We went from one entry point to many, and there is a clear difference in the attack surface and entry points to consider, so there is a lot to think about for OpenShift security best practices. In the past, the application stack had very few components, maybe just a cache, a web server, and a database. The most common network service allows a source to reach an application, and the sole purpose of the network is to provide endpoint reachability. As a result, the monolithic application has few entry points, for example ports 80 and 443. Not every monolithic component is exposed to external access or required to accept requests directly, and we designed our networks around these facts.

 

    • Central Security Architecture

Therefore, we often see security enforcement in a fixed, central place in the network infrastructure. This could be, for example, a central security stack consisting of several security appliances, often referred to as a kludge of devices. As a result, the individual components within the application need not worry about carrying out any security checks, as these occur centrally for them. With the common microservices architecture, on the other hand, those internal components are specifically designed to operate and accept requests independently, which brings huge benefits to scaling and deployment pipelines. However, now each of the components may have its own entry points and accept external connections, so they need to be concerned with security individually rather than relying on a central security stack to do it for them.


Diagram: OpenShift Security Guide.

 

    • The Different Container Attack Vectors 

These changes have considerable consequences for security and how you approach your OpenShift security best practices. The security principles still apply, and we are still concerned with reducing the blast radius, least privilege, and so on, but they need to be applied from a different perspective and to multiple new components in a layered approach. Security is never done in isolation. As the number of entry points to the system increases, the attack surface broadens, leading to several container attack vectors that are not seen with the monolith. We have, for example, attacks on the host, images, the supply chain, and the container runtime. Not to mention, there is also a considerable increase in the rate of change for these types of environments; there is an old joke that a secure application is an application stack with no changes.

So when you make a change, you are potentially opening the door to a bad actor. Today’s applications change considerably, sometimes a few times per day for an agile stack. We have unit tests, security tests, and other safety checks that can reduce mistakes, but no matter how much preparation you do, whenever there is a change, there is a chance of a breach. So we have environmental changes that affect security, along with some alarming technical defaults in how containers run, such as running as root by default and with an alarming number of capabilities and privileges.

 

Challenges with Containers

  • Containers running as root

So, as you know, containers run as root by default, share the kernel of the host OS, and the container process is visible from the host. This in itself is a considerable security risk when a container compromise occurs. If a security vulnerability in the container runtime arises and a container escape is performed, then because the application runs as root, it can become root on the underlying host. And if a bad actor gets access to the host with the correct privileges, it can compromise all the containers on that host.

 

  • Risky Configuration

Containers often run with excessive privileges and capabilities, a lot more than they need to carry out their job efficiently. As a result, we need to consider what privileges the container has and whether it runs with any unnecessary capabilities that it does not need. Some of the capabilities a container may have are defaults that fall under risky configurations and should be avoided. You want to keep an eye on CAP_SYS_ADMIN in particular; this flag grants access to an extensive range of privileged activities. The container has isolation boundaries by default with namespaces and control groups (when configured correctly). However, granting excessive container capabilities will weaken the isolation between the container, the host, and the other containers on the same host, essentially dissolving the container’s ring-fence.
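
As a rough illustration of the remediation, here is a sketch using the Docker SDK for Python (pip install docker) that starts a container with every capability dropped, a non-root UID, and a read-only root filesystem. The image, UID, and the capability added back are assumptions for the example:

```python
import docker  # pip install docker; assumes a local Docker daemon

client = docker.from_env()

# Start a container with a minimal privilege set: drop every Linux capability,
# add back only what the workload genuinely needs, and run as a non-root UID.
container = client.containers.run(
    "alpine:3",                     # illustrative image
    command="sleep 3600",
    detach=True,
    user="1001",                    # avoid the root default
    cap_drop=["ALL"],               # strip the default capability set
    cap_add=["NET_BIND_SERVICE"],   # example add-back; omit if not needed
    read_only=True,                 # read-only root filesystem
)
print(container.short_id, container.status)
```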

 


Diagram: OpenShift Security: Link to YouTube video.

 

Starting OpenShift Security 

Then we have OpenShift, which overcomes many of the default security risks that come with running containers, and it does much of this out of the box. If you are looking for further information on securing an OpenShift cluster, kindly check out my course for Pluralsight on OpenShift Security.

OpenShift Container Platform (formerly known as OpenShift Enterprise), or OCP, is Red Hat’s offering for an on-premises private platform as a service (PaaS). OpenShift is based on the Origin open-source project and is a Kubernetes distribution. The foundation of the OpenShift Container Platform is Kubernetes, and it therefore shares some of the same networking technology along with some enhancements. However, as you know, Kubernetes is a complex beast and can fall short by itself when it comes to securing clusters. OpenShift does a good job of taking Kubernetes and wrapping it in a layer of security, such as with Security Context Constraints (SCCs), that gives your cluster a good security baseline.

 

OpenShift Security: Security Context Constraint

When your application is deployed to OpenShift, the default security model will enforce that it runs using an assigned Unix user ID unique to the project you are deploying it to. This prevents images from being run as the Unix root user. When hosting an application using OpenShift, the user ID that a container runs as will be assigned based on which project it is running in. Containers are not allowed to run as the root user by default, which is a big win for security. SCCs also allow you to set different restrictions and security configurations for pods. So, instead of allowing your image to run as root, which is a considerable security risk, you should run it as an arbitrary user by specifying an unprivileged USER, setting the appropriate permissions on files and directories, and configuring your application to listen on unprivileged ports.
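
Expressed in a pod spec, the same intent might look like the following sketch, built with the official Kubernetes Python client. The image name and port are illustrative; on OpenShift, the restricted SCC would normally assign the user ID range for you:

```python
from kubernetes import client

# Pod spec that cooperates with the restricted SCC: no root user,
# no privilege escalation, and an unprivileged port (>1024).
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="web", labels={"app": "web"}),
    spec=client.V1PodSpec(
        security_context=client.V1PodSecurityContext(run_as_non_root=True),
        containers=[
            client.V1Container(
                name="web",
                image="registry.example.com/web:1.0",  # illustrative image
                ports=[client.V1ContainerPort(container_port=8080)],
                security_context=client.V1SecurityContext(
                    allow_privilege_escalation=False,
                    capabilities=client.V1Capabilities(drop=["ALL"]),
                ),
            )
        ],
    ),
)

# client.CoreV1Api().create_namespaced_pod(namespace="my-project", body=pod)
print(pod.spec.containers[0].security_context)
```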

 


Diagram: OpenShift Security Context Constraints.

 

SCC Defaults Access

Security context constraints let you drop privileges by default, which is important and still the best practice. Red Hat OpenShift security context constraints (SCCs) ensure that, by default, no privileged containers run on OpenShift worker nodes, another big win for security. Access to the host network and host process IDs is also denied by default. Users with the required permissions can adjust the default SCC policies to be more permissive. When considering SCCs, think of SCC admission controllers as restricting pod access, similar to how RBAC restricts user access. These cluster-level resources define which resources can be accessed by pods and provide an additional level of control over pod behavior.

 

Restricted Security Context Constraints (SCCs)

There are a few SCCs available by default, and you may have heard of the restricted SCC. By default, all pods, except those for builds and deployments, use a default service account assigned by the restricted SCC, which doesn’t allow privileged containers, that is, those running as the root user or listening on privileged ports (ports below 1024). SCCs can be used to manage the following:

 

    1. Privileged mode: this setting allows or denies a container running in privileged mode. As you know, privileged mode bypasses restrictions such as control groups, Linux capabilities, and secure computing profiles.
    2. Privilege escalation: this setting enables or disables privilege escalation inside the container (all privilege escalation flags).
    3. Linux capabilities: this setting allows the addition or removal of certain Linux capabilities.
    4. Seccomp profile: this setting shows which secure computing profiles are used in a pod.
    5. Read-only root filesystem: this makes the root filesystem read-only.

 

The goal is to assign the fewest possible capabilities for a pod to function fully. This least-privileged model ensures that pods can’t perform tasks on the system that aren’t related to their application’s proper function. The default value for the privileged option is False; setting the privileged option to True is the same as giving the pod the capabilities of the root user on the system. Although doing so shouldn’t be common practice, privileged pods can be useful under certain circumstances. 
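
If you want to see what is allowed on your own cluster, a quick audit sketch like the one below can help. It uses the Kubernetes Python client's custom-objects API against the security.openshift.io/v1 SecurityContextConstraints resource and assumes you already have a kubeconfig from an oc login session:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes an existing kubeconfig / `oc login` session
api = client.CustomObjectsApi()

# SecurityContextConstraints are cluster-scoped custom resources in OpenShift.
sccs = api.list_cluster_custom_object(
    group="security.openshift.io",
    version="v1",
    plural="securitycontextconstraints",
)

for scc in sccs["items"]:
    name = scc["metadata"]["name"]
    privileged = scc.get("allowPrivilegedContainer", False)
    run_as = scc.get("runAsUser", {}).get("type")
    print(f"{name:25} privileged={privileged} runAsUser={run_as}")
```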

 


Diagram: OpenShift Networking: Link to YouTube video.

System Observability

System Observability: The Different Demands

We have seen a considerable drive in innovation that has spawned several megatrends, which have affected how we manage and view our network infrastructure and created the need for observability. In reality, we have seen the decomposition of everything, from one to many. Many services and dependencies in multiple locations need to be managed and operated, instead of the monolith where everything is generally housed internally. These megatrends have resulted in a dynamic infrastructure with new failure modes not seen in the monolith, forcing us to look at different system observability tools and practices.

There has also been a shift in the point of control. As we move towards new technologies, many of these loosely coupled services, and the infrastructure your services sit upon, are not directly under your control. The edge of control has been pushed out, creating different types of network and security perimeters. These perimeters are now closer to the workload than a central security stack, so the workloads themselves have to be concerned with security. For a more detailed explanation of these changes that drive the need for good observability and how they may affect you, a full two-hour course I did for Pluralsight on DevOps operational strategies can be found here: DevOps: Operational Strategies.

 


Diagram: System Observability Design.

 

  • How This Affects Failures

The major issue that I have seen with my clients is that application failures are no longer predictable, and dynamic systems can fail in very creative ways, challenging existing monitoring solutions and, more importantly, the practices that support them. We have a lot of partial failures that are not just unexpected but have never been seen before. Consider, for example, the network hero.

 

  • The Network Hero

This is someone who knows every part of the network and has seen every failure at least once before. That role no longer scales in today’s world, and you need proper Observability instead. When I was working as an engineer, we would have plenty of failures, but more than likely we would have seen them before, and there was a system in place to fix the error. Today’s environment is much different. We can no longer rely on simply seeing Up or Down, setting static thresholds, and then alerting based on those thresholds. A key point to note at this stage is that none of these thresholds consider the customer’s perspective. If your pod is running at 80% CPU, does that mean the customer is unhappy? When looking to monitor, you should look from your customer’s perspective at what matters to them. Content Delivery Network (CDN) providers were among the first to realize this and measure what matters most to the customer.

 

The Different Demands

So the new modern and complex distributed systems place very different demands on your infrastructure and the people that manage the infrastructure.  For example, in microservices, there can be several problems with a particular microservice:

    • The microservice could be running under high resource utilization and therefore slow to respond, causing a timeout.
    • The microservice could have crashed or been stopped and is therefore unavailable.
    • The microservice could be fine, but there could be slow-running database queries behind it.
    • In short, we have a lot of partial failures.

 

Therefore: We Can No Longer Predict

The big shift we see with software platforms is that they are evolving much quicker than products and paradigms that we are using, for example, to monitor them. As a result, we need to consider new practices and technologies with dedicated platform teams along with good system observability. We really can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches, especially the metrics-based approach to monitoring. I’m not saying that these monitoring tools are not doing what you want them to do. But, they work in a siloed environment, and there is a lack of connectivity. So we have monitoring tools working in silos in different parts of the organization and more than likely managed by different people trying to monitor a very dispersed application with multiple components and services in various places. 

 

Diagram: Prometheus Monitoring: Link to YouTube video.

 

  • Relying On Known Failures: Metric-Based Approach

A metrics-based monitoring approach relies on having encountered known failure modes in the past; it depends on known failures and predictable failure modes. So we have predefined thresholds beyond which the system is considered to be behaving abnormally. Monitoring can detect when systems are over or under these previously set thresholds, and then we can set alerts and hope that those alerts are actionable. This is only useful for variants of predictable failure modes. Traditional metrics and monitoring tools can show you performance spikes or tell you that a problem has occurred, but they don’t let you dig into the source of the problem, slice and dice the data, or see correlations between errors. If the system is complex, this approach makes it harder to get to the root cause in a reasonable timeframe.

With traditional metrics systems, you had to define custom metrics, and these were always defined upfront, so you can’t ask new questions about a problem after the fact; you have to define the questions to ask upfront. Then we set performance thresholds, pronounce them “good” or “bad,” and check and re-check those thresholds. We would tweak the thresholds over time, but that was about it. This monitoring style has been the de facto approach, but we no longer want to have to predict how a system can fail. We want to be always looking and always observing, instead of waiting for a problem, such as a certain threshold being reached, before acting.
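
To make the limitation obvious, here is a toy sketch of that style of check. The metric, the threshold, and the sample values are all made up; the point is that the question is baked in upfront and no new questions can be asked of the data afterwards:

```python
# Illustrative static-threshold check: the questions are fixed upfront.
CPU_THRESHOLD = 0.80  # "good" vs "bad" decided in advance

def check_cpu(samples: list) -> None:
    avg = sum(samples) / len(samples)
    if avg > CPU_THRESHOLD:
        # We can alert that the threshold was crossed, but we cannot ask
        # new questions (which user? which endpoint? which version?)
        # because that context was never recorded with the metric.
        print(f"ALERT: avg CPU {avg:.0%} exceeded {CPU_THRESHOLD:.0%}")
    else:
        print(f"OK: avg CPU {avg:.0%}")

check_cpu([0.72, 0.85, 0.91])  # made-up samples
```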


Diagram: System Observability Analysis.

 

  • Metrics: Lack of Connective Event

Metrics do not retain the connective event, so you cannot ask new questions of the existing dataset. These traditional system metrics can miss unexpected failure modes in complex distributed systems. Also, the condition detected via system metrics might be unrelated to what is actually happening. For example, an abnormal number of running threads on one component might indicate that garbage collection is in progress, or it might indicate that slow response times are imminent in an upstream service.

 

  • Users Experience: Static Thresholds

User experience means different things to different sets of users. We now have a model where different users of a service may be routed through the system in different ways, using different components, providing experiences that can vary widely. We also now know that services no longer tend to break in the same few predictable ways over and over. We should have fewer alerts, triggered only by symptoms that directly impact user experience, not because a threshold was reached.

 

  • The Challenge: Can’t Reliably Indicate Any Issues With User Experience

Static thresholds can’t reliably indicate issues with user experience. Alerts should be set up to detect failures that impact user experience, and traditional monitoring falls short in trying to do this. With traditional metrics-based monitoring, we rely on static thresholds to define optimal system conditions, which have nothing to do with user experience. Modern systems, however, change shape dynamically under different workloads. Static thresholds for monitoring can’t reflect impacts on user experience; they lack context and are too coarse.

 

The Need For System Observability

System observability is a practice. Rather than just focusing on a tool that does logging, metrics, or alerting, Observability is all about how you approach problems, and for this, you need to look at your culture. So you could say that Observability is a cultural practice that allows you to be proactive about findings instead of relying on the reactive approach we were used to in the past. Nowadays, we need a different viewpoint, and we generally want to see everything from one place. You want to know how the application works and how it interacts with the other infrastructure components, such as the underlying servers (physical or virtual), the network, and how the data looks both in transit and at rest. What level of observation do you need so you know that everything is performing as it should? And what should you be looking at to get this level of detail?

Monitoring is knowing the data points and the entities we are gathering from. On the other hand, Observability is like when you put all of the data together. So monitoring is the act of collecting data, and Observability is putting it all together in one single pane of glass. Observability is observing the different patterns and deviations from baseline; monitoring is getting the data and putting it into the systems.

 

The 3 Pillars of Observability

We have three pillars of system observability: metrics, traces, and logging. It is an oversimplification to define or view Observability as just having these pillars, but you do need them in place. Observability is all about how you connect the dots between each of these pillars. If someone asked me which one I prefer, it would be distributed tracing. Distributed tracing allows you to visualize each step in a service request’s execution. As a result, it doesn’t matter if services have complex dependencies; you could say that the complexity of dynamic systems is abstracted away with distributed tracing.

 


Diagram: Distributed Tracing: Link to YouTube video.

 

  • Use Case: Challenges Without Tracing

For example, latency can stack up if a downstream database service experiences performance bottlenecks. As a result, the end-to-end latency is high. By the time that latency is detected three or four layers upstream, it can be incredibly difficult to identify which component of the system is the root of the problem because now that same latency is being seen in dozens of other services.

 

  • Distributed Tracing: A Winning Formula

Modern distributed systems tend to scale into a tangled knot of dependencies. Therefore, distributed tracing shows the relationships between various services and components in a distributed system. Traces help you understand system interdependencies. Unfortunately, those inter-dependencies can obscure problems and make them particularly difficult to debug unless their relationships are clearly understood.

 


System Reliability in an Unpredictable World

There have been considerable shifts in our environmental landscape that have caused us to examine how we operate and run our systems and networks. We have had a mega shift with the introduction of various cloud platforms, their services, and containers, along with the complexity of managing distributed systems, which has unveiled large gaps in our current practices and in the technologies we use, not to mention the flaws in the operational practices around those technologies. All of this has caused a knee-jerk reaction in the form of a welcome drive in innovation toward system reliability. Yet some of the technologies and tools used to manage these innovations have not kept pace. Many of these tools have stayed relatively static in our dynamic environment. So we have static tools used in a dynamic environment, which causes friction.

The big shift we see with software platforms is that they are evolving much quicker than products and paradigms that we use to monitor them. We need to consider new practices and technologies with dedicated platform teams to enable a new era of system reliability.

 

    • Lack of Connective Event: Traditional Monitoring

If you examine traditional monitoring systems, they look to capture and examine signals in isolation. The monitoring systems work in a siloed environment, similar to developers and operators before the rise of DevOps. Existing monitoring systems cannot detect the “unknown unknowns” that are common with modern distributed systems, which often leads to disruptions of services. So you may be asking what an “unknown unknown” is.

I’ll put it to you this way: the distributed systems we see today have little predictability, certainly not enough to rely on static thresholds, alerts, and old monitoring tools. If something is static, it can be automated, and we do have static events; for example, in Kubernetes, when a pod reaches a limit, a replica set introduces another pod on a different node as long as certain parameters are met, such as Kubernetes labels and node selectors. However, this is only a small piece of the failure puzzle in a distributed environment. Today, we have what are known as partial failures and systems that fail in very creative ways.

 

    • System Reliability: Creative Ways to Fail

We know that some of these failures are easily predicted, and actions can be taken. For example, if a Kubernetes node reaches a certain utilization, we can automatically reschedule pods onto a different node to stay within our known scale limits. We have predictable failures that can be automated, and not just in Kubernetes but with any infrastructure; an Ansible script is useful when we have predictable events. However, we have much more to deal with than pod scaling; we have many partial failures and complicated failures known as black holes.
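
As a sketch of automating one of those predictable cases, the snippet below creates a HorizontalPodAutoscaler with the Kubernetes Python client so replicas are added when average CPU crosses a known limit. The namespace, deployment name, and numbers are assumptions for illustration:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig

# Known, predictable failure mode -> automate it: scale out when CPU is high.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa", namespace="my-project"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"  # illustrative
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=75,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="my-project", body=hpa
)
```

Automation like this handles the known failure modes; the partial failures and black holes discussed next are the ones it cannot anticipate.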

 

    • Today’s World of Partial Failures

Microservices applications are distributed and susceptible to many external factors. If you examine the traditional monolithic application style, on the other hand, all the functions reside in the same process. It was either switched ON or OFF!! Not much happened in between. So if there is a failure in the process, the application as a whole will fail. The results are binary, usually either Up or Down, and with some basic monitoring in place, this was easy to detect and failures were predictable. In a monolithic application, all functions live within the same process, and a major benefit of this is that there is no such thing as a partial failure.

However, in a cloud-native world, where we have taken the old monolith and broken it into a microservices-based application, a request made from a client can go through multiple hops of microservices, and we can have several problems to deal with. There is a lack of connectivity between the different domains. There will be many monitoring tools and knowledge tied to each domain, and alerts are often tied to thresholds or rate-of-change violations that have nothing to do with user satisfaction. User satisfaction is the key metric to care about.

 


Diagram: Chaos Engineering – How to start a project: Link to YouTube video.

 

Today You Have No Way to Predict

So the new modern and complex distributed systems place very different demands on your infrastructure—considerably different from the simple three-tier application where everything was generally housed in one location.  We really can’t predict anything anymore, which puts the brakes on some traditional monitoring approaches. When you can no longer predict what will happen, you can no longer rely on a reactive approach to monitoring and management. The move towards a proactive approach to system reliability is a welcomed strategy.

 

A Quick Note on Blackholes: Strange Failure Modes

When considering a distributed system, many things can happen. A service or region can disappear, or disappear for a few seconds or milliseconds and appear again. When we have strange failure modes, we consider this as going into a black hole: anything that goes into it disappears. Strange failure modes are unexpected and surprising; there is certainly nothing predictable about them. So what happens when your banking transactions are in the black hole? What if your banking balance is displayed incorrectly, or if you make a transfer to an external account and it does not show up? I did a demo on this in my training course, where I examined the effects of black holes on system reliability and demoed a sample application called Bank of Anthos, in the course DevOps: Operational Strategies.

 

Highlighting Site Reliability Engineering (SRE) and Observability

The practices of Site Reliability Engineering (SRE) and Observability are what is needed to manage these types of unpredictability and unknown failures. SRE is about making systems more reliable, and everyone has a different way of implementing SRE practices. Usually, about 20% of your issues cause 80% of your problems, so you need to be proactive and fix these issues upfront; you need to get ahead of the curve to stop the incidents from occurring. This realization usually happens in the wake of a massive incident, which acts as a teachable moment and often provides the impetus to start a Chaos Engineering project.

 


Diagram: Site Reliability Engineering and Observability: Link to YouTube video.

 

  • New Tools and Technologies: Distributed Tracing

We have new tools such as distributed tracing. If the system becomes slow, what is the best way to find the bottleneck? Here you can use distributed tracing and OpenTelemetry. Tracing helps us instrument our system so that we can figure out where the time has been spent, and it can be used across a distributed microservices architecture to troubleshoot problems. OpenTelemetry provides a standardized way of instrumenting our system and producing those traces.
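
Here is a minimal OpenTelemetry sketch in Python (assuming the opentelemetry-sdk package is installed). The service, span, and attribute names are illustrative; a real deployment would export spans to a collector or tracing backend rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints spans to the console; in production you would
# export to a collector or backend (Jaeger, Tempo, etc.) instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")  # illustrative service name

def handle_request(order_id: str) -> None:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)  # high-cardinality context
        with tracer.start_as_current_span("query_database"):
            pass  # the slow downstream call would be timed here

handle_request("order-1234")
```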

 

  • SLA, SLI, SLO, and Error Budgets

We don’t just want to know when something has happened and then react to an event without looking at it from the customer’s perspective. We need to understand whether we are meeting our SLAs by gathering the number and frequency of outages and any performance issues. Service Level Objectives (SLOs) and Service Level Indicators (SLIs) not only assist you with these measurements; they also offer a tool for achieving better reliability and form the base of the Reliability Stack.
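
A small worked example shows how an SLO translates into an error budget. The 99.9% target, the 30-day window, and the request counts are illustrative numbers only:

```python
# Worked example: a 99.9% availability SLO over a 30-day window.
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60           # 43,200 minutes

error_budget = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {error_budget:.1f} minutes of bad time per 30 days")  # ~43.2

# SLI measured from real traffic: the proportion of "good" requests.
good_requests, total_requests = 998_700, 1_000_000
sli = good_requests / total_requests

budget_consumed = (1 - sli) / (1 - SLO_TARGET)
print(f"SLI={sli:.4%}, error budget consumed: {budget_consumed:.0%}")  # 130%: overspent
```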

 


Chaos Engineering: Don’t forget the baseline

In the past, applications ran in a single private data center, potentially two data centers for high availability. There may have been some satellite PoPs, but generally everything was housed in a few locations. These data centers were on-premises, and all components were housed internally. As a result, troubleshooting and monitoring any issues was relatively easy. The network and infrastructure were pretty static, the network and security perimeters were known, and there weren’t that many changes to the stack on, for example, a daily basis. Nowadays, however, we are in a completely different environment, where we have distributed applications with components and services located in many different places and types of places, on-premises and in the cloud, with dependencies on both local and remote services. In comparison to the monolith, today’s applications have many different types of entry points to the external world.

 

However! A Lot Can Go Wrong

There is a growing complexity to infrastructure, and, let’s face it, a lot can go wrong. It’s imperative to have a global view of all the components of the infrastructure and a good understanding of application performance and health. In a large-scale container-based application design, there are many moving pieces and parts, and trying to manually validate the health of each piece is hard to do. If you want some tips on how to monitor and, more importantly, how to react to events, you can try my course on Monitoring NetDevOps.

Therefore, monitoring and troubleshooting are a lot harder, especially as everything is interconnected in ways that make it difficult for a single person in one team to fully understand what is going on. Nothing is static anymore, and things are moving around all the time. This is why it is even more important to focus on the patterns and to be able to efficiently see the path to where the issue is. Some modern applications could be in multiple clouds and different location types at the same time, so there are multiple data points to consider. If several segments are each slightly overloaded, the sum of those overloaded segments results in poor performance at the application level.

 


Diagram: Chaos Engineering. Link to YouTube video.

 

What Does This Mean to Latency

Distributed computing has lots of components and services, with those components far apart, in contrast to a monolith that has all its parts in one location. As a result of the distributed nature of modern applications, latency can add up. We have both network latency and application latency, and the network latency is several orders of magnitude bigger. As a result, you need to minimize the number of round trips and reduce any unneeded communication to an absolute minimum. When communication is required across the network, it’s better to gather as much data together as possible into bigger payloads that are more efficient to transfer.
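
A quick back-of-the-envelope calculation shows why. The round-trip time and the number of calls below are assumed values purely for illustration:

```python
# Illustrative numbers only: why chatty cross-network calls hurt.
RTT_MS = 2.0          # assumed network round-trip time between two services
ITEMS = 50            # items fetched for one user request

chatty = ITEMS * RTT_MS          # one round trip per item
batched = 1 * RTT_MS             # one batched call returning all items

print(f"Chatty : {chatty:.0f} ms of pure network wait")   # 100 ms
print(f"Batched: {batched:.0f} ms of pure network wait")  # 2 ms
```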

With the monolith, the application simply runs in a single process, and it is relatively easy to debug. A lot of traditional tooling and code instrumentation technologies have been built assuming the idea of a single process. The core challenge is that debugging microservices applications is hard, because much of the tooling we have today was built for traditional monolithic applications. There are new monitoring tools for these new applications, but there is a steep learning curve and a high barrier to entry.

 

A New Approach: Chaos Engineering

For this, you need to understand practices like Chaos Engineering and how they can improve the reliability of the overall system. Chaos Engineering is the ability to perform tests in a controlled way. Essentially, we are breaking things on purpose in order to learn how to build more resilient systems: we inject a variety of issues and faults in a controlled way so we can make the overall application more resilient. Implementing practices like Chaos Engineering will help you understand and better manage unexpected failures and performance degradation. The purpose of Chaos Engineering is to build more robust and resilient systems.
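
As a flavour of what a controlled fault injection can look like, here is a bare-bones sketch that deletes one random pod behind a label selector using the Kubernetes Python client. The namespace and selector are assumptions, and purpose-built tooling such as Litmus or Chaos Mesh adds the safety, scoping, and scheduling you would want around an experiment like this:

```python
import random
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig for a *test* cluster
v1 = client.CoreV1Api()

NAMESPACE = "staging"              # assumption: never point this at production blindly
SELECTOR = "app=web"               # assumption: target only one application

# Controlled experiment: remove one replica and observe whether the system
# self-heals and whether users notice (your steady-state hypothesis).
pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
if pods:
    victim = random.choice(pods)
    print(f"Injecting fault: deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
```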

 

  • A Final Note On Baselines: Don’t Forget Them!!

Creating a good baseline is a critical factor. You need to understand how things work under normal circumstances. A baseline is a fixed point of reference that is used for comparison purposes. You need to know how long it usually takes from starting the application to the actual login, and how long the basic services take before there are any issues or heavy load. Baselines are critical to monitoring. It’s like security: you can’t protect what you can’t see, and the same assumption applies here. Aim for a good baseline and, if you can, have it fully automated. Tests need to be carried out against the baseline on an ongoing basis; you need to test all the time to see how long it takes users to consume your services. Without baseline data, it’s difficult to estimate any changes or demonstrate progress.
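
A baseline can be as simple as a scheduled script that records response times over and over. The sketch below uses the requests library against an illustrative login endpoint; the URL and sample count are assumptions:

```python
import statistics
import time
import requests  # pip install requests

URL = "https://app.example.com/login"   # illustrative endpoint
SAMPLES = 20

timings = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    requests.get(URL, timeout=10)
    timings.append((time.perf_counter() - start) * 1000)

# Record these numbers regularly; they become the baseline you compare against.
print(f"p50={statistics.median(timings):.0f} ms  "
      f"p95={statistics.quantiles(timings, n=20)[18]:.0f} ms")
```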

 


Diagram: Prometheus Monitoring. Link to YouTube video.


Docker Security Options

So you are currently in the virtual machine world and considering the transition to a containerized environment, as you want to smoothen your application pipeline and gain the benefits of a Docker containerized environment. But you have heard from many that containers are insecure, run as root by default, and have tons of capabilities that just scare you. Yes, there are a lot of benefits to a containerized environment, and for some application stacks, containers are the only way to do it. However, along with those benefits comes a new attack surface. So even though the bad actors’ intent may stay the same, we must mitigate a range of new attacks and protect new components. To combat these, you need to be aware of the most common Docker security options.


Diagram: Docker Container Security Supply Chain.

 

Use Case: I’m Sorry to Say This Happened

More often than not, the tools and appliances you have in place are completely blind to containers. The tools just look at a running process and think: well, if the process is secure, then I’m secure. One of my clients ran a container from a Dockerfile that pulled an insecure image. The tools onsite did not know what an image was and therefore could not scan it. As a result, we had malware right in the core of the network, a little bit too close to the database server for my liking.

Yes, we call containers a fancy process, and I’m to blame here too, but we need to consider what is around the container to fully secure it. For a container to function, it needs the support of the infrastructure around it, such as the CI/CD pipeline. For this, you need to consider all of that infrastructure to improve your security posture. If you are looking for quick security tips on Docker security, this course I created for Pluralsight may help you with Docker security options.

 

We have Ineffective Traditional Tools

Containers are not like traditional workloads. We can run an entire application with all its dependencies with a single command. Legacy security tools and processes often assume largely static operations and need to be adjusted to adapt to the rate of change in containerized environments. With non-cloud-native data centers, Layer 4 is coupled with the network topology at fixed network points and lacks the flexibility to support containerized applications. There is often only inter-zone filtering, and east-west traffic may go unchecked. A container changes the perimeter: it moves right to the workload. Just look at a microservices architecture; it has many entry points compared to a monolithic application.

Containers are a world apart from the monolith. Containers are short-lived and constantly spun down, and assets such as servers, IP addresses, firewalls, drives, and overlay networks are recycled to optimize utilization and enhance agility. Traditional perimeters designed with IP address-based security controls lag in a containerized environment, and signature-based controls can’t keep up with rapidly changing container infrastructure. Securing hyper-dynamic container infrastructure using traditional network and endpoint controls won’t work. For this reason, you should adopt tools and techniques that are purpose-built for a containerized environment.

 


Diagram: Docker container security: Link to YouTube video.

 

The Need for Proper Observability

Not only do you need to implement good Docker security options, but you also need to concern yourself with the recent observability tools. We need proper observability of the state of security and the practices used in the containerized environment, and we need to automate this as much as possible: not just the development, but also the security testing, container scanning, and monitoring. You are only as secure as the containers you have running. You need observability into systems and applications and to be proactive with those findings. It is not something that you can buy; it’s a cultural change. You want to know how the application is working with the server, how the network is behaving with the application, and what the data looks like both in transit and at rest.
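
One practical way to automate part of this is to gate the pipeline on an image scan. The sketch below shells out to Trivy, an open-source scanner, and fails the build on HIGH or CRITICAL findings; the image name is illustrative, and you should verify the exact invocation against your scanner's documentation:

```python
import subprocess
import sys

IMAGE = "registry.example.com/web:1.0"  # illustrative image under test

# Fail the pipeline if the scanner reports HIGH or CRITICAL vulnerabilities.
result = subprocess.run(
    ["trivy", "image", "--severity", "HIGH,CRITICAL", "--exit-code", "1", IMAGE],
)
if result.returncode != 0:
    print("Image failed the vulnerability gate; blocking deployment.")
    sys.exit(1)
print("Image passed the vulnerability gate.")
```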

 


Diagram: Monitoring vs Observability: Link to YouTube video.

 

What level of observation do you need so you know that everything is performing as it should? There are several challenges to securing a containerized environment. Containerized technologies are dynamic and complex and require a new approach that can handle the agility and scale of today’s landscape. There are initial security concerns that you must understand before you get started with container security. This will help you explore a better starting strategy.

 


Diagram: Prometheus Monitoring: Link to YouTube video.

 

  • Container Attack Vectors: New Threat Model

We must consider a different threat model and understand how security principles such as least privilege and defense in depth apply to the available Docker security options. With Docker containers, we have a completely different way to run applications and, as a result, a different set of risks to deal with.

Instructions are built into Dockerfiles, which run applications considerably differently from a normal application workload. With the correct rights, a bad actor could put anything into a Dockerfile; without the necessary guardrails in place that understand containers, there will be a threat. Therefore, we need to examine new network and security models, as old tools and methods won’t meet these demands. A new network and security model requires you to mitigate a new set of attack vectors. Bad actors’ intent stays the same, and they are not going away anytime soon. But they now have a different, and potentially easier, attack surface to play with if it is not configured correctly.

Personally, I would consider the container attack surface to be pretty large, and if it is not locked down, there will be many default tools at the disposal of bad actors. For example, we have image vulnerabilities, access control exploits, container escapes, privilege escalation, application code exploits, and attacks on the Docker host and all of the Docker components.

 

  • Docker Security Options: A Final Security Note!

Containers by themselves are secure, and the kernel is pretty much battle-tested; it’s not often you will come across a kernel compromise, but they do happen from time to time. A container escape is hard to orchestrate unless a container misconfiguration results in excessive privileges. You should steer clear of setting container capabilities that provide excessive privileges. If you minimize the container’s capabilities, you strip its functionality down to the bare minimum; the attack surface is therefore limited, minimizing the attack vectors available to the attacker. You also want to keep an eye on CAP_SYS_ADMIN, as this flag grants access to an extensive range of privileged activities, and there are many other capabilities that containers run with by default that can cause havoc.

 


Safe-T SDP- Why Rip and Replace your VPN?

Although organizations realize the need to upgrade their approach to user access control, the deployment of existing technologies is holding back the introduction of the Software Defined Perimeter (SDP). A recent report carried out by the Cloud Security Alliance (CSA) on the “State of Software Defined Perimeter” states that the main barrier to adopting SDP is the existing in-place security technologies.

One can understand the reluctance to take the leap. After all, VPNs have been a cornerstone of secure networking for over two decades. They do provide what they say: secure remote access. However, they have not evolved to appropriately secure our developing environment. In fact, the digital environment has changed considerably in recent times. There is a big push for the cloud, BYOD, and remote workers, putting pressure on existing VPN architectures. As our environment evolves, the existing security tools and architectures must evolve with it.

Undoubtedly, there is a common understanding of the benefits of adopting the zero-trust principles that SDP provides over traditional VPNs. But the truth that organizations want even safer, less disruptive, and less costly deployment models cannot be ignored. VPNs aren’t a solution that works for every situation, but it is also not enough to offer solutions that would involve ripping out the existing architecture completely or confining SDP to certain use cases. The barrier to adopting SDP involves finding a middle ground.

 

Safe-T; Providing the middle ground

Safe-T is aware of this need for a middle ground. Therefore, in addition to the standard SDP offering, Safe-T also offers this middle-ground, to help the customer on the “journey from VPN to SDP”, resulting in a safe path to SDP.

Now organizations do not need to rip and replace the VPN. SDP and VPNs can work together, thereby yielding a more robust security infrastructure. Having network security that can bounce you between IP address locations can make it very difficult for hackers to break in. Besides, if you already have a VPN solution that you are comfortable with, you can continue using it and pair it with Safe-T’s innovative SDP approach. By adopting this new technology you get equipped with a middle-ground that not only improves your security posture but also maintains the advantages of existing VPN.

Recently, Safe-T released a new version of its SDP solution, called ZoneZero, that enhances VPN security by adding SDP capabilities. Adding SDP capabilities allows exposure of, and access to, applications and services, granted only after assessing trust based on policies for an authorized user, location, and application. In addition, access is granted to the specific application or service, rather than to the network, as you would provide with a VPN.

Deploying SDP on top of the existing VPN offers a customized and scalable zero-trust solution. It provides all the benefits of SDP while lowering the risks involved in adopting the new technology. Currently, Safe-T’s ZoneZero is the only SDP solution in the market with a primary focus on enhancing VPN security by adding zero trust capabilities, rather than replacing it.

 

The challenges of just using a traditional VPN

While VPNs have stood the test of time, today, we know that the true security architecture is based upon the concept of zero trust access. VPNs operating by themselves are unable to offer optimum security. Now, let’s examine some of the common shortfalls.

VPNs fall short in the sense that they are not equipped to grant access on a granular, case-by-case level, which is a major problem that SDP addresses. In the traditional security setup, in order to get access to an application, you had to connect a user to a network. For users that were not on the network, for example remote workers, we needed to create a virtual network to place the user on the same network as the application.

To enable external access, organizations started to implement remote access solutions (RAS) to restrict user access and create secure connectivity. To provide application access, an inbound port is exposed to the public internet. However, this open port is visible to anyone on the internet and not just to remote workers.

From a security standpoint, the idea of network connectivity to access an application is likely to bring many challenges. We then moved to the initial layer of zero trust, which was to isolate different layers of security within the network. This provided a way to quarantine the applications that are not meant to be seen, keeping them dark. But this leads to a sprawl of network and security devices.

For example, you could use inspection path control with a stack of hardware. This enabled the users to access only what they were allowed to, based on a blacklist security approach. Security policies provided broad-level and overly permissive access, and the attack surface was simply too wide. Also, the VPN holds only static configurations that have no real meaning; for example, a configuration may state that this particular source can reach this destination using this port number and policy.

With this configuration, however, context is not taken into consideration. There are just ports and IP addresses, and the configuration offers no visibility into the network to see who, what, when, and how they are connecting with the device.

More often than not, access policy models are coarse-grained, which provides users with more access than is required and does not follow the least privilege model. The VPN device provides only network information, and the static policy does not dynamically change based on levels of trust.

Say, for example, the user’s anti-virus software is turned off, either accidentally or by malware. Or maybe you want to re-authenticate when certain user actions are performed. A static policy cannot dynamically detect such cases and change the configuration on the fly. Policy should instead be expressed and enforced based on identity, which takes into consideration both the user and the device.

 

The SDP acceptance

The new technology adoption rate can be slow initially. The primary reason could be the lack of understanding that what you have in place today, by itself, is not the best for your organization in the future. Maybe now is the time to stand back and ask if this is the future that we really want.

The existing technologies, for all the money and time you have spent on them, are not evolving at pace with today’s digital environment. This indicates the necessity for new capabilities to be added, which gets translated into different meanings by the CIO and CTO roles of an organization. CTOs are passionate about embracing new technologies and investing in the future; they are always on the lookout to take advantage of new and exciting opportunities in technology. However, the CIO looks at things differently. Usually, the CIO wants to stay with the known and is reluctant to change, even in the case of loss of service. Their sole aim is to keep the lights on.

This shines the torch on the need to find the middle ground. And that middle-ground is to adopt a new technology that has endless benefits for your organization. The technology should be able to satisfy the CTO group while also taking every single precaution and not disrupting the day-to-day operations.

 

  • The push by the marketers

There is a clash between what is needed and what the market is pushing. The SDP industry standard is to encourage the customers to rip and replace their VPN in order to deploy their SDP solution. But the customers have invested in a comprehensive VPN and are reluctant to replace it.

The SDP market initially pushed a rip-and-replace model, which would eliminate the use of traditional security tools and technologies. This should not be the recommended path, since SDP functionality can overlap with the VPN’s. Although existing VPN solutions have their drawbacks, there should be an option to use SDP in parallel, thereby offering the best of both worlds.

 

How does Safe-T address this?

Safe-T understands that there is a need to go down the SDP path, but you may be reluctant to do a full or partial VPN replacement. So let’s take your existing VPN architecture and add the SDP capability to it.

The solution is placed after your VPN. The existing VPN communicates with Safe-T ZoneZero which will do the SDP functions after your VPN device. From an end user’s perspective, they will continue to use their existing VPN client. In both cases, the users operate as normal. There are no behavior changes and the users can continue using their VPN client.

For example, they authenticate with the existing VPN as before. But the VPN communicates with SDP for the actual authentication process as opposed to communicating with, for example, the Active Directory (AD).

What do you get from this? From an end-user’s perspective, their day-to-day process does not change. Also, instead of placing the users on your network as you would with a VPN, they are switched over to application-based access. Even though they are using a traditional VPN to connect, they are still getting the full benefits of SDP.

This is a perfect stepping stone on the path toward SDP. Significantly, it provides a solid bridge to an SDP deployment. It will lower the risk and cost of the new technology adoption with minimal infrastructure changes. It removes the pain caused by deployment.

 

The ZoneZero™ deployment models

Safe-T offers two deployment models: ZoneZero Single-Node and Dual-Node.

With the single-node deployment, a ZoneZero virtual machine is located between the external firewall/VPN and the internal firewall. All VPN traffic is routed to the ZoneZero virtual machine, which controls which traffic continues to flow into the organization.

In the dual-node deployment model, the ZoneZero virtual machine is located between the external firewall/VPN and the internal firewall. And an access controller is in one of the LAN segments, behind the internal firewall.

In both cases, the user opens the IPSEC or SSL VPN client and enters the credentials. The credentials are then retrieved by the existing VPN device and passed over RADIUS or API to ZoneZero for authentication.

SDP is charting the course to a new kind of network and security architecture. But at this time, a middle ground can reduce the risks associated with the deployment. The only viable option is to run the existing VPN architectures in parallel with SDP. This way, you get all the benefits of SDP with minimal disruption.

 


Safe-T; A Progressive Approach to Zero Trust Access

The foundations that support our systems are built with connectivity, not security, as an essential feature. TCP connects before it authenticates. Security policy and user access based on IP lack context and allow architectures that exhibit overly permissive access. Most likely, this will result in a brittle security posture, driving the need for Zero Trust Access. Our environment has changed considerably, leaving traditional network and security architectures vulnerable to attack. The threat landscape is unpredictable, and we are getting hit by external threats from all over the world. However, the problem is not limited to external threats; there are also insider threats within a user group and insider threats across user group boundaries.

Therefore, we need to find ways to decouple security from the physical network and decouple application access from the network. To do this, we need to change our mindset and invert the security model. The Software-Defined Perimeter (SDP) is an extension of zero trust and presents a revolutionary development. It provides an updated approach that current security architectures fail to address. SDP is often referred to as Zero Trust Access (ZTA). Safe-T’s access control software package is called Safe-T Zero+. Safe-T offers a phased deployment model, enabling you to progressively migrate to a zero-trust network architecture while lowering the risk of technology adoption. Safe-T’s Zero+ model is flexible enough to meet today’s diverse hybrid I.T. requirements and satisfies the zero-trust principles that are used to combat today’s network security challenges.

 

Network Challenges

  • Connect First and Then Authenticate

TCP has a weak security foundation. When clients want to communicate and have access to an application, they first set up a connection. Only after the connect stage has been carried out can the authentication stage be accomplished. Unfortunately, with this model, we have no idea who the client is until they have completed the connect phase, and there is a possibility that the requesting client is not trustworthy.

 

  • The Network Perimeter

We began with static domains, whereby internal and external segments are separated by a fixed perimeter. Public IP addresses are assigned to the external host and private addresses to the internal. If a host is assigned a private IP, it is thought to be more trustworthy than if it has a public IP address. Therefore, trusted hosts operate internally, while untrusted operate externally to the perimeter. Here, the significant factor that needs to be considered is that IP addresses lack user knowledge to assign and validate trust.

Today, I.T. has become more diverse since it now supports hybrid architectures with a variety of different user types: humans, applications, and a proliferation of connected devices. Cloud adoption has become the norm these days, and there is an abundance of remote workers accessing the corporate network from a variety of devices and places.

The perimeter approach no longer accurately reflects the typical topology of users and servers. It was built for a different era, where everything was inside the walls of the organization. Today, however, organizations are increasingly deploying applications in public clouds located in geographic regions remote from the organization’s trusted firewalls and perimeter network. This certainly stretches the network perimeter.

We have a fluid network perimeter where data and users are located everywhere. Hence, we now operate in a completely new environment. But the security policy controlling user access is still built for static corporate-owned devices within the supposedly trusted LAN.

 

  • Lateral Movements

A major concern with the perimeter approach is that it assumes a trusted internal network. However, evidently, 80% of threats are from internal malware or a malicious employee that will often go undetected.

Besides, with the rise of phishing emails, an unintentional click will give a bad actor broad-level access. And once on the LAN, the bad actors can move laterally from one segment to another. They are likely to navigate undetected between, or within the segments.

Eventually, the bad actor can steal the credentials and use them to capture and exfiltrate valuable assets. Even social media accounts can be targeted for data exfiltration since they are not often inspected by the firewall as a file transfer mechanism.

 

  • Issues with the Virtual Private Network (VPN)

What is happening with traditional VPN access is that the tunnel creates an extension between the client’s device and the application’s location. The VPN rules are static and do not dynamically change with the changing levels of trust on a given device. They provide only network information which is a crucial limitation.

Therefore, from a security standpoint, the traditional method of VPN access enables the clients to have broad network-level access. This makes the network susceptible to undetected lateral movements. Also, the remote users are authenticated and authorized but once permitted to the LAN they have coarse-grained access. This obviously creates a high level of risk as undetected malware on a user’s device can spread to an inner network.

Another significant challenge is that VPNs generate administrative complexity and cannot easily handle cloud, or multiple network environments. They require the installation of end-user VPN software clients and knowing where the application that they are accessing is located. Users would have to make changes to their VPN client software to gain access to the applications situated at different locations. In a nutshell, traditional VPNs are complex for administrators to manage and for users to operate.

With public concern over surveillance, privacy, and identity theft growing, an increasing number of people are turning to VPNs to help keep them safer online. But where should you start when choosing the best VPN for your needs?

Also, poor user experience is most likely to occur as you need to backhaul the user traffic to a regional data center. This adds latency and bandwidth costs.

In recent years, torrenting has started to become increasingly popular amongst computer users who wish to download files such as movies, books, and songs. Without having a VPN, this could risk your privacy and security. It is also important to note that you should be very careful when it comes to downloading files to your computer as they could cause more harm than good. 

 

Can Zero Trust Access be the Solution?

The main principle that ZTA follows is that nothing should be trusted, regardless of whether the connection originates inside or outside the network perimeter. Reasonably, today we have no reason to trust any user, device, or application; some companies may try to decrease accessibility with tools like Office 365 distribution groups to allow and disallow specific network permissions for users and devices. You know that you cannot protect what you cannot see, but the fact that you cannot attack what you cannot see also holds true. ZTA makes the application and the infrastructure completely undetectable to unauthorized clients, thereby creating an invisible network.

Preferably, application access should be based on contextual parameters, such as who the user is and where they are located, an assessment of the device’s security posture, and a continuous evaluation of the session. This moves us from a network-centric to a user-centric, connection-based approach to security. Security enforcement should be based on user context and expressed as policies that matter to the business, rather than policies based on subnets that carry no business meaning. The authentication workflows should include context-aware data, such as device ID, geographic location, and the time and day the user requests access.
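
To make this concrete, the sketch below shows what a context-aware, deny-by-default access decision could look like in Python. It is a minimal illustration, not Safe-T’s or any vendor’s policy engine; the attribute names (device_posture, allowed_geos, and so on) and the example policy are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical request context -- field names are illustrative only.
@dataclass
class AccessRequest:
    user: str
    device_id: str
    geo: str                 # country code reported for the client IP
    device_posture: float    # 0.0 (unknown/compromised) .. 1.0 (healthy)
    requested_app: str
    timestamp: datetime

# A toy, per-application policy: user-centric and context-aware.
POLICY = {
    "finance-app": {
        "allowed_users": {"alice", "bob"},
        "allowed_geos": {"IE", "GB"},
        "min_posture": 0.8,
        "business_hours": range(8, 19),   # 08:00-18:59
    }
}

def evaluate(req: AccessRequest) -> bool:
    """Return True only if every contextual check passes (deny by default)."""
    rules = POLICY.get(req.requested_app)
    if rules is None:
        return False                      # unknown application: deny
    return (
        req.user in rules["allowed_users"]
        and req.geo in rules["allowed_geos"]
        and req.device_posture >= rules["min_posture"]
        and req.timestamp.hour in rules["business_hours"]
    )

# Example: access is granted only while all contextual checks hold.
req = AccessRequest("alice", "laptop-42", "IE", 0.9, "finance-app",
                    datetime(2024, 5, 1, 10, 30))
print(evaluate(req))
```

In a real zero trust deployment, the same evaluation would be re-run continuously during the session, not just at login time.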

It is not good enough to provide network access; we must provide granular application access with a dynamic segment of 1, where an application micro-segment is created for every incoming request. Micro-segmentation controls access by subdividing the larger network into small, secure application micro-perimeters internal to the network. This abstraction layer locks down lateral movement. In addition, zero trust access implements a policy of least privilege by enforcing controls that give users access only to the resources they need to perform their tasks.
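
The “segment of 1” idea can be pictured as a broker that issues a short-lived grant covering exactly one user, one application, and one session, rather than subnet-wide access. The sketch below is purely illustrative; names such as MicroSegment and grant() are invented for this example and do not describe any particular product.

```python
import secrets
import time

class MicroSegment:
    """A short-lived, single-user, single-application access grant."""
    def __init__(self, user: str, app: str, ttl_seconds: int = 300):
        self.user = user
        self.app = app
        self.token = secrets.token_urlsafe(16)   # identifies this segment only
        self.expires_at = time.time() + ttl_seconds

    def is_valid(self, user: str, app: str) -> bool:
        # Valid only for the exact user/app pair and only until expiry.
        return (user, app) == (self.user, self.app) and time.time() < self.expires_at

def grant(user: str, app: str) -> MicroSegment:
    # In a real ZTA broker, this would run only after authentication,
    # authorization, and device-posture checks have all passed.
    return MicroSegment(user, app)

# Usage: each request gets its own segment; nothing else is reachable.
seg = grant("alice", "finance-app")
assert seg.is_valid("alice", "finance-app")
assert not seg.is_valid("alice", "hr-app")     # lateral movement blocked
```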

 

Characteristics of Safe-T

Safe-T provides a secure application and file access solution built on 3 main pillars:

1) An architecture that implements zero trust access,

2) A proprietary secure channel that enables users to remotely access and share sensitive files, and

3) User behavior analytics.

Safe-T’s SDP architecture is designed to implement the essential capabilities delineated in the Cloud Security Alliance (CSA) SDP architecture. Safe-T’s Zero+ is built from these main components:

The Safe-T Access Controller is the centralized control and policy enforcement engine that enforces end-user authentication and access. It acts as the control layer, governing the flow between end-users and backend services.

Secondly, the Access Gateway acts as a front end to all the backend services published to an untrusted network, while the Authentication Gateway presents a pre-configured authentication workflow, supplied by the Access Controller, to the end user in a clientless web browser. The authentication workflow is a customizable set of authentication steps that supports 3rd-party IdPs (Okta, Microsoft, DUO Security, etc.) as well as built-in options such as captcha, username/password, No-Post, and OTP.
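
A customizable authentication workflow can be thought of as an ordered chain of steps that must all succeed before access is granted. The following is a generic Python illustration of that idea, not Safe-T’s configuration format; the step names (captcha_step, otp_step, etc.) are invented for this sketch.

```python
from typing import Callable, Dict, List

# Each step receives the login context and returns True (pass) or False (fail).
AuthStep = Callable[[Dict], bool]

def captcha_step(ctx: Dict) -> bool:
    return ctx.get("captcha_solved", False)

def password_step(ctx: Dict) -> bool:
    # Placeholder: a real system would verify against a directory or IdP.
    return ctx.get("password") == ctx.get("expected_password")

def otp_step(ctx: Dict) -> bool:
    return ctx.get("otp") == ctx.get("expected_otp")

def run_workflow(steps: List[AuthStep], ctx: Dict) -> bool:
    """Execute the configured steps in order; any failure denies access."""
    return all(step(ctx) for step in steps)

# A workflow assembled per application, e.g. captcha -> password -> OTP.
workflow = [captcha_step, password_step, otp_step]
ctx = {
    "captcha_solved": True,
    "password": "s3cret", "expected_password": "s3cret",
    "otp": "123456", "expected_otp": "123456",
}
print(run_workflow(workflow, ctx))   # True only if every step passes
```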

 

Safe-T Zero+ Capabilities

The Safe-T Zero+ capabilities are in line with zero trust principles. With Safe-T Zero+, clients requesting access must pass through authentication and authorization stages before they can reach a resource. Any network resource for which these steps have not been completed is blackened, i.e. rendered invisible to the client, and URL rewriting is used to hide the backend services.

This reduces the attack surface to an absolute minimum and follows Safe-T’s axiom: if you can’t be seen, you can’t be hacked. In a conventional environment, giving users access to services behind a firewall means opening ports on that firewall. This presents a security risk, as a bad actor could reach the service directly via the open port and exploit any vulnerabilities in it.
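
URL rewriting in this context essentially means the gateway exposes opaque, externally visible paths and maps them to internal service addresses that the client never sees. The sketch below shows the general idea with made-up hostnames and paths; it is not Safe-T’s implementation.

```python
from typing import Optional

# Illustrative mapping held by an access gateway: the client only ever sees
# the opaque external path; the internal address is never exposed.
REWRITE_MAP = {
    "/app/7f3c9a": "https://10.0.12.5:8443/finance",   # example internal address
    "/app/b21e44": "https://10.0.12.9:8443/hr",
}

def rewrite(external_path: str, authorized: bool) -> Optional[str]:
    """Resolve an external path to a backend URL, but only for authorized sessions."""
    if not authorized:
        return None          # unauthenticated clients see nothing -- the service stays dark
    return REWRITE_MAP.get(external_path)

print(rewrite("/app/7f3c9a", authorized=True))    # internal URL, used server-side only
print(rewrite("/app/7f3c9a", authorized=False))   # None -- service stays invisible
```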

Another paramount capability of Safe-T Zero+ is a patented technology called reverse access, which eliminates the need to open incoming ports in the internal firewall and removes the need to store sensitive data in the demilitarized zone (DMZ). It extends to on-premises, public cloud, and hybrid environments, meeting diverse I.T. requirements. Zero+ can be deployed on-premises, as part of Safe-T’s SDP services, or on AWS, Azure, and other cloud infrastructures, thereby protecting both cloud and on-premises resources.
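
The general pattern behind reverse access (often described as an outbound-only or reverse tunnel) is that a component inside the LAN dials out to the gateway in the DMZ or cloud, so no inbound port ever has to be opened on the internal firewall. The sketch below shows only that generic pattern with plain sockets and a hypothetical gateway hostname; it is not Safe-T’s patented implementation.

```python
import socket

GATEWAY_HOST = "gateway.example.com"   # hypothetical external gateway
GATEWAY_PORT = 443

def internal_connector() -> None:
    """Runs inside the LAN: dials OUT to the gateway and then serves
    requests that the gateway relays back over that same connection.
    No inbound firewall port is required on the internal network."""
    with socket.create_connection((GATEWAY_HOST, GATEWAY_PORT)) as conn:
        conn.sendall(b"REGISTER finance-app\n")   # announce the service it fronts
        while True:
            request = conn.recv(4096)
            if not request:
                break                              # gateway closed the tunnel
            # Handle the relayed request locally and answer over the
            # already-established outbound connection.
            conn.sendall(b"HTTP/1.1 200 OK\r\n\r\nhello from inside the LAN")

# Illustrative only: internal_connector() would be started as a service
# on a host inside the LAN, while the gateway listens in the DMZ or cloud.
```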

Zero+ also provides user behavior analytics that monitor actions against the protected web applications. This allows the administrator to inspect the details of anomalous behavior, and forensic assessment becomes easier because there is a single source for logging.

Finally, Zero+ provides a unique, native HTTPS-based file access solution for the NTFS file system, replacing the vulnerable SMB protocol. Users can still create a standard mapped network drive in Windows Explorer, which now provides a secure, encrypted, and access-controlled channel to shared backend resources.
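
From the client’s point of view, HTTPS-based file access reduces to an authenticated, encrypted download instead of an SMB mount. A minimal, generic sketch follows; the URL, token, and endpoint are hypothetical and do not describe Safe-T’s actual API.

```python
import urllib.request

# Hypothetical access-controlled file endpoint exposed by a gateway.
FILE_URL = "https://gateway.example.com/files/reports/q3.xlsx"
ACCESS_TOKEN = "example-session-token"       # issued after the ZTA auth workflow

def fetch_file(url: str, token: str, dest: str) -> None:
    """Download a protected file over HTTPS using a bearer token."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        out.write(resp.read())

# Example (hypothetical endpoint):
# fetch_file(FILE_URL, ACCESS_TOKEN, "q3.xlsx")
```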

 

Deployment Strategy

Safe-T customers can select whichever architecture meets their on-premises or cloud-based requirements.

 

There are 3 options:

i) The customer deploys three VMs: 1) Access Controller, 2) Access Gateway, and 3) Authentication Gateway. The VMs can be deployed on-premises in an organization’s LAN, on Amazon Web Services (AWS) public cloud, or on Microsoft’s Azure public cloud.

ii) The customer deploys the 1) Access Controller VM and 2) Access Gateway VM on-premises in their LAN. The customer deploys the Authentication Gateway VM on a public cloud, such as AWS or Azure.

iii) The customer deploys the Access Controller VM on-premises in the LAN, and Safe-T deploys and maintains two VMs, 1) Access Gateway and 2) Authentication Gateway, both hosted on Safe-T’s global SDP cloud service.

 

ZTA Migration Path

Today, organizations recognize the need to move to a zero trust architecture. However, there is a difference between recognition and deployment, and new technology brings considerable risk. Traditional Network Access Control (NAC) and VPN solutions fall short in many ways, but a rip-and-replace model is a very aggressive approach.

To begin the transition from legacy access to ZTA, look for a migration path that you feel comfortable with. You might, for example, run a traditional VPN in parallel with your SDP solution, limited to a group of users for a set period of time. A sensible starting point would be a server used primarily by experienced users, such as DevOps or QA personnel. This keeps the risk minimal if any problem occurs during the phased deployment of SDP access in your organization.

A recent survey carried out by the CSA indicates that SDP awareness and adoption are still at an early stage. However, when you do go down the path of ZTA, selecting a vendor whose architecture matches your requirements is key to successful adoption. For example, look for SDP vendors who allow you to continue using your existing VPN deployment while adding SDP/ZTA capabilities on top of it. This sidesteps the risks involved in switching to a completely new technology.