Data Center Design – Control Plane Conundrum
The goal of data networks and data center design is simple: transport end-user traffic from A to B without dropping packets. The metrics we use to judge how well we achieve that goal, however, can differ widely. The data center is evolving through a series of topology and technology changes. The new data center technologies we see today, such as FabricPath, LISP, TRILL and VXLAN, are driven by a change in end-user requirements: the application has changed.
These new technologies address new challenges, yet the fundamental questions remain the same: where to place the Layer 2/Layer 3 boundary, and whether Layer 2 is needed in the access layer. The questions stay the same; what has evolved is the set of technologies available to answer them.
In the traditional core, distribution and access model, the access layer is Layer 2, and servers that talk to each other within the access layer sit in the same IP subnet and VLAN. The same access VLAN spans the access layer switches for east-west traffic, and outbound traffic leaves via a First Hop Redundancy Protocol (FHRP) such as Hot Standby Router Protocol (HSRP). Servers in different VLANs are isolated from each other and cannot communicate directly; inter-VLAN communication requires a Layer 3 device. Virtualization had its humble beginnings in VLANs, which were used to segment traffic at Layer 2. The virtualization aspect comes from the fact that two servers physically connected to different switches can still communicate with each other, provided they are in the same VLAN and that VLAN spans both switches. Each VLAN is a broadcast domain, either within a single Ethernet switch or shared among connected switches. Whenever a switch interface that belongs to a VLAN receives a broadcast frame (destination MAC ffff.ffff.ffff), the switch must forward the frame out of all other ports in the same VLAN. This approach is very simple in design and is almost a plug and play network.
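The flooding rule described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real switch is implemented: the `Switch` class, its method names and the port numbers are all hypothetical, and trunking, aging and spanning tree are ignored.

```python
BROADCAST_MAC = "ffff.ffff.ffff"


class Switch:
    """Toy model of a VLAN-aware Layer 2 switch (hypothetical, for illustration)."""

    def __init__(self):
        self.port_vlan = {}   # port number -> access VLAN
        self.mac_table = {}   # (vlan, mac) -> port where that MAC was last seen

    def add_port(self, port, vlan):
        self.port_vlan[port] = vlan

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        vlan = self.port_vlan[in_port]
        # Learn the source MAC against the ingress port, scoped to the VLAN.
        self.mac_table[(vlan, src_mac)] = in_port

        known_port = self.mac_table.get((vlan, dst_mac))
        if dst_mac != BROADCAST_MAC and known_port is not None:
            # Known unicast: forward out of a single port (never back out of the ingress port).
            return [known_port] if known_port != in_port else []

        # Broadcast or unknown unicast: flood to every other port in the same VLAN.
        return sorted(
            port for port, port_vlan in self.port_vlan.items()
            if port_vlan == vlan and port != in_port
        )


switch = Switch()
for port, vlan in [(1, 10), (2, 10), (3, 10), (4, 20)]:
    switch.add_port(port, vlan)

flooded_ports = switch.receive(1, "aaaa.aaaa.aaaa", BROADCAST_MAC)   # ports 2 and 3 only
unicast_ports = switch.receive(2, "bbbb.bbbb.bbbb", "aaaa.aaaa.aaaa")  # port 1 only
```

Port 4 never sees the broadcast because it sits in VLAN 20, which is the isolation property the paragraph above describes.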
The first question to ask yourself is why not connect everything in the data center into one large Layer 2 broadcast domain? Layer 2 is a plug and play network so why not?
The reason is that large Layer 2 networks have many scaling issues. Layer 2 networks don’t have controlled or efficient network discovery protocols. Address Resolution Protocol (ARP) is used to locate end hosts, and it relies on broadcast requests and unicast replies. A single host might not generate much of this traffic, but imagine what happens when 10,000 hosts are connected to the same broadcast domain. There is also no hierarchy in MAC addressing. Unlike Layer 3 networks, where you can use summarization and hierarchical addressing, MAC addresses are flat, and attaching several thousand hosts to a single broadcast domain creates large forwarding tables. Because end hosts are not necessarily static, they are likely to be attached to and removed from the network at regular intervals, which also creates a high rate of change in the control plane. You can of course build a large Layer 2 data center with multiple tenants if the tenants don’t need to communicate with each other. Shared services requirements such as WAAS or load balancing can be met by spinning up the service VM in the tenant's Layer 2 broadcast domain. Even so, this design will hit scaling and management issues.
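To put rough numbers on the ARP and flat-addressing points above, here is a small back-of-the-envelope sketch. The request rate of 6 ARP requests per host per minute is purely an assumption chosen for illustration, and the function names are hypothetical; only the `ipaddress` module calls are real standard library APIs.

```python
import ipaddress


def arp_broadcasts_received_per_minute(hosts, requests_per_host_per_minute):
    """Every ARP request is a broadcast, so each host hears the requests of all the others."""
    return (hosts - 1) * requests_per_host_per_minute


def summarize(prefixes):
    """Collapse contiguous IP prefixes into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(prefix) for prefix in prefixes]
    return list(ipaddress.collapse_addresses(networks))


# Assumed rate: 6 ARP requests per host per minute in one 10,000-host broadcast domain.
arp_load = arp_broadcasts_received_per_minute(10_000, 6)

# Layer 3 can aggregate: 64 contiguous /24 subnets collapse into a single /18 route.
subnets = [f"10.0.{i}.0/24" for i in range(64)]
summary = summarize(subnets)

# Layer 2 cannot: MAC addresses are flat, so a switch needs one entry per host.
mac_entries_needed = 10_000
```

Under these assumptions every host has to process tens of thousands of broadcasts a minute, while the routed design advertises one prefix where the switched design carries 10,000 table entries.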
There is a general consensus to move away from Layer 2 design to a more robust and scalable Layer 3 design. But why is there still a need for Layer 2 in the data center?
a) Servers that perform the same function might need to communicate with each other because of a clustering protocol or simply as part of the application's inner workings. If that communication consists of clustering heartbeats or other server-to-server packets that are not routable and don’t understand the IP layer, the servers need to be on the same VLAN, i.e. the same Layer 2 domain.
b) Stateful devices such as firewalls and load balancers need Layer 2 adjacency as they constantly exchange connection and session state information.
c) Servers that are dual-homed – A single server with two NICs, one connected to each switch, requires Layer 2 adjacency if the adapter has a standby interface that takes over the same MAC and IP addresses after a failure. In this situation, the active and standby interfaces must be on the same VLAN and use the same default gateway.
d) If your virtualization solutions are not able to handle Layer 3 VM mobility, you may need to stretch VLANs between PoDs / Virtual Resource Pools or even data centers so you can move VMs around the data center at Layer 2 (without changing their IP address).
Cisco took a big step with the recent introduction of Dynamic Fabric Automation (DFA), which, similar to Juniper QFabric, offers both Layer 2 switching and Layer 3 routing at the access layer / ToR. First, FabricPath (IS-IS for Layer 2 connectivity) runs in the core, which gives optimal Layer 2 forwarding between all the edges. Second, the same Layer 3 address is configured on all the edges, which gives optimal Layer 3 forwarding across the whole fabric. On the edge you can have Layer 3 Leaf switches, for example the Nexus 6000 series, or you can integrate Layer 2 only devices such as the Nexus 5500 series or the Nexus 1000v. You can also connect external routers or even UCS or FEX to the fabric. In addition to running IS-IS as the control plane, DFA uses MP-iBGP, with some Spine nodes acting as Route Reflectors, to exchange IP forwarding information. DFA also employs a FabricPath technique called “Conversational Learning”. The first packet triggers a full RIB lookup, and subsequent packets are switched from the switching cache, which is implemented in hardware.
This technology provides Layer 2 mobility throughout the data center while also providing optimal traffic flow using Layer 3 routing. Cisco states that “DFA provides a scale-out architecture without congestion points in the network while providing optimized forwarding for all types of applications”.
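The "full lookup on the first packet, cache for the rest" behaviour can be modelled with a simple software sketch. This is not DFA's implementation (the real cache lives in hardware); the `CachedForwarder` class, the prefixes and the next-hop names are all hypothetical and exist only to show the idea.

```python
import ipaddress


class CachedForwarder:
    """Toy forwarder: longest-prefix-match RIB lookup once, then a per-destination cache."""

    def __init__(self, rib):
        # rib maps prefix strings to next-hop names (both invented for this example).
        self.rib = [(ipaddress.ip_network(prefix), next_hop) for prefix, next_hop in rib.items()]
        self.cache = {}
        self.rib_lookups = 0

    def _rib_lookup(self, destination):
        """Slow path: scan the whole RIB and keep the most specific matching prefix."""
        self.rib_lookups += 1
        address = ipaddress.ip_address(destination)
        matches = [(network, next_hop) for network, next_hop in self.rib if address in network]
        if not matches:
            return None
        return max(matches, key=lambda match: match[0].prefixlen)[1]

    def forward(self, destination):
        """Fast path: serve from the cache; fall back to the RIB only on a miss."""
        if destination not in self.cache:
            self.cache[destination] = self._rib_lookup(destination)
        return self.cache[destination]


forwarder = CachedForwarder({"10.0.0.0/8": "spine-1", "10.1.2.0/24": "leaf-7"})
first = forwarder.forward("10.1.2.5")    # triggers the full RIB lookup
second = forwarder.forward("10.1.2.5")   # answered from the cache
```

After both calls `forwarder.rib_lookups` is still 1, which is the point of the technique: the expensive lookup is paid once per conversation rather than once per packet.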
Terminating Layer 3 at the access / ToR has clear advantages and disadvantages. The main advantage is a smaller broadcast domain, but this comes at the cost of a smaller mobility domain across which VMs can be moved. Terminating Layer 3 at the access layer can also result in sub-optimal routing, because inter-subnet traffic may hairpin (trombone), taking multiple unnecessary hops across the data center fabric.
What has the industry introduced to overcome these limitations and address the new challenges? – Overlay Networking
In its simplest form, an overlay is a dynamic tunnel between two endpoints that enables Layer 2 frames to be transported between those endpoints. These overlay-based technologies provide a level of indirection that keeps switch table sizes from growing in proportion to the number of end hosts supported. The types of overlays that exist today include FabricPath, TRILL, LISP, VXLAN, NVGRE, OTV, PBB and Shortest Path Bridging. They are essentially virtual networks that sit on top of a physical network, and the physical network is often unaware of the virtual layer above it.
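As a concrete example of "Layer 2 frames transported between endpoints", the sketch below builds and parses the 8-byte VXLAN header defined in RFC 7348 (a flags byte with the I bit set, followed by a 24-bit VNI). The function names are hypothetical, and the outer Ethernet, IP and UDP headers that a real tunnel endpoint would add are left out to keep the example short.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08   # the "I" flag from RFC 7348
VXLAN_HEADER_LEN = 8


def vxlan_encap(vni, inner_frame):
    """Prepend a VXLAN header to an inner Ethernet frame (outer IP/UDP omitted)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits). Word 2: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(packet):
    """Return (vni, inner_frame) from a VXLAN payload built by vxlan_encap."""
    flags_word, vni_word = struct.unpack("!II", packet[:VXLAN_HEADER_LEN])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, packet[VXLAN_HEADER_LEN:]


inner = b"\xff" * 6 + b"\xaa" * 6 + b"\x08\x06" + b"payload"   # made-up inner frame bytes
packet = vxlan_encap(5000, inner)
vni, recovered = vxlan_decap(packet)
```

The physical network forwards on the outer headers only, so core switches never need to learn the inner MAC addresses; that is where the table-size relief comes from.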
FabricPath is a Layer 2 technology that brings Layer 3 benefits, such as multipathing, to classical Layer 2 networks by running IS-IS at Layer 2. This eliminates the need for Spanning Tree Protocol and so avoids the pitfalls of large Layer 2 networks. As a result, FabricPath enables massive Layer 2 networks that support multipath (ECMP). TRILL is an IETF standard that, like FabricPath, is a Layer 2 technology providing the same Layer 3 benefits to Layer 2 networks by using IS-IS.
LISP is popular in Active / Active data centers for DCI route optimization and mobility. It separates the location of a host from its endpoint identifier (EID), which allows VMs to move across subnet boundaries while keeping their endpoint identity.
Popular encapsulation formats include VXLAN (proposed by Cisco and VMware) and STT (originally created by Nicira, but likely to be deprecated over time as VXLAN comes to dominate).
OTV is a Cisco proprietary innovation in the Data Center Interconnect (DCI) space for extending Layer 2 across data center sites. FabricPath can be used as a DCI technology over short distances with dark fiber, but it is primarily used for intra-DC communication, whereas OTV has been designed specifically for DCI. Failure boundaries and site independence are preserved in OTV networks because OTV uses a control plane protocol to synchronize MAC addresses between sites and prevents unknown unicast floods. Recent software releases can selectively allow unknown unicast floods for certain VLANs, an option that is not available if you use FabricPath as the DCI technology.
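The LISP locator/identifier split mentioned above can be illustrated with a minimal mapping table. This sketch models only the idea, not the LISP protocol itself: the `MappingSystem` class and its methods are hypothetical, and the addresses come from documentation ranges.

```python
class MappingSystem:
    """Toy EID-to-locator database (hypothetical; real LISP uses Map-Register/Map-Request)."""

    def __init__(self):
        self._eid_to_locator = {}

    def register(self, eid, locator):
        """Record (or update) which site locator currently fronts this endpoint identifier."""
        self._eid_to_locator[eid] = locator

    def resolve(self, eid):
        """Return the current locator for an EID, or None if it is unknown."""
        return self._eid_to_locator.get(eid)


mapping = MappingSystem()
vm_eid = "10.1.1.10"                      # the VM's identity, which never changes

mapping.register(vm_eid, "192.0.2.1")     # VM lives behind the data center A locator
before_move = mapping.resolve(vm_eid)

mapping.register(vm_eid, "198.51.100.1")  # VM moves to data center B; only the locator changes
after_move = mapping.resolve(vm_eid)
```

Traffic sources keep addressing the VM by the same EID before and after the move; only the mapping lookup returns a different locator, which is what lets the VM cross subnet boundaries without renumbering.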
Software-Defined Networking (SDN) offers another potential way to balance control plane scaling, Layer 2 VM mobility and optimal ingress / egress traffic flow. At a basic level, SDN can be used to create direct paths through the network fabric that effectively isolate private networks. An SDN network allows you to choose the correct forwarding information on a per-flow basis, and this per-flow optimization eliminates the need for VLAN separation in the data center fabric. Instead of using VLANs to enforce traffic separation, the SDN controller holds a set of policies that determine whether traffic may be forwarded from a particular source to a particular destination.
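A per-flow policy check of the kind described above might look like the following sketch. It assumes a default-deny controller keyed on source and destination only; the `PolicyController` class and the tenant host names are hypothetical and do not correspond to any particular SDN controller's API.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    dst: str


class PolicyController:
    """Toy SDN controller: flows are dropped unless a policy explicitly allows them."""

    def __init__(self):
        self._allowed = set()

    def allow(self, src, dst):
        """Install a one-way policy permitting traffic from src to dst."""
        self._allowed.add(Flow(src, dst))

    def decide(self, src, dst):
        """Return the action a switch should apply to the first packet of this flow."""
        return "forward" if Flow(src, dst) in self._allowed else "drop"


controller = PolicyController()
controller.allow("tenant-a-web", "tenant-a-db")

same_tenant = controller.decide("tenant-a-web", "tenant-a-db")    # allowed by policy
cross_tenant = controller.decide("tenant-b-web", "tenant-a-db")   # no policy, so dropped
reverse_path = controller.decide("tenant-a-db", "tenant-a-web")   # policies are one-way here
```

Both tenants could share the same physical fabric and even the same VLAN in this model; the isolation comes from the policy lookup rather than from the broadcast domain.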