Data Center Design – Active/Active Design
The challenge of data center design: “Layer 2 is weak and IP is not mobile”
In the past, best practice recommended connecting networks in distinct data centers through Layer 3 (routing), isolating the well-known Layer 2 failure modes. However, business requirements now drive the applications, and the applications in turn are changing the connectivity requirements between data centers. The need for Active/Active data centers has been driven by the following:
a) Increased dependence on East West traffic
b) Clustered Applications
c) Business Continuity
d) Workload Mobility
The general recommendation is Layer 3 connections with path separation through Multi-VRF, point-to-point VLANs, or MPLS/VPN. However, some applications cannot function across a Layer 3 boundary, and the majority of geo-clusters require Layer 2 adjacency between their nodes, whether for heartbeat and connection-state information (status and control synchronization) or to share virtual IP and MAC addresses so traffic can be handled on failover. A few clustering products (Veritas, Oracle RAC) do support communication over Layer 3, but they are a minority and do not represent the general case.
Virtual machine migration between data centers increases application availability, but stateful migration currently requires Layer 2 adjacency between the ESX hosts and a consistent LUN at both sites. In other words, if the VM loses its IP address it loses its state and its TCP sessions drop, resulting in a cold migration (the VM reboots) rather than a hot migration (the VM does not reboot).
As a result of the stretched-VLAN requirement, data center architects started to deploy traditional Layer 2 over the DCI and, unsurprisingly, were faced with interesting results. Although flooding and broadcasts are necessary for IP communication in Ethernet networks, they can become dangerous in a DCI environment. Traffic tromboning, where flows cross the DCI unnecessarily because of suboptimal routing within the extended VLANs, can also form between two stretched data centers. Tromboning can affect either ingress or egress traffic. On egress, FHRP filtering isolates the HSRP partnership and provides an active/active setup for HSRP. On ingress, the options include GSLB, route injection, and LISP.
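The egress tromboning problem, and the effect of FHRP filtering, can be sketched as follows. This is a toy model rather than a protocol implementation: the site names are hypothetical, and the "path" is simply the list of sites a packet visits before leaving the data center.

```python
# Toy model of egress tromboning in a stretched VLAN. With a single HSRP
# active gateway, a VM in the other site must cross the DCI just to reach
# its default gateway; FHRP filtering gives each site a local active
# gateway. Site names are hypothetical.

def egress_path(vm_site: str, active_gateway_sites: set) -> list:
    """Return the sites an egress packet traverses before leaving the DC."""
    if vm_site in active_gateway_sites:
        return [vm_site]                 # local active gateway: optimal egress
    # No local active gateway: traffic trombones across the DCI
    remote = next(iter(active_gateway_sites))
    return [vm_site, remote]

# Single HSRP active gateway in DC-A (no FHRP filtering):
print(egress_path("DC-B", {"DC-A"}))          # ['DC-B', 'DC-A'] - one DCI crossing

# FHRP filtering isolates HSRP per site, so both sites are active:
print(egress_path("DC-B", {"DC-A", "DC-B"}))  # ['DC-B'] - no DCI crossing
```

With filtering in place, each site's VMs exit through their local gateway, which is exactly the active/active HSRP behavior described above.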
To overcome many of these problems, virtualization technologies can be used for Layer 2 extension between data centers. They include vPC, VSS, FabricPath, VPLS, and OTV.
In summary, different technologies can be used for LAN extension, and the main media over which they can be deployed are Ethernet, MPLS, and IP.
1) Ethernet: VSS, vPC, or FabricPath
2) MPLS: EoMPLS, A-VPLS, or H-VPLS
3) IP: OTV
Ethernet Extensions and Multi-Chassis EtherChannel (MEC)
This approach requires protected DWDM or direct fiber and works between two data centers only. It cannot support a multi-data-center topology, i.e. a full mesh of data centers, but it can support hub-and-spoke topologies.
Previously, a LAG could only terminate on one physical switch. VSS-MEC and vPC both extend the port-channeling concept so that link aggregation spans two separate physical switches. This allows Layer 2 topologies to be built on link aggregation, eliminating the dependency on STP and letting you scale the available Layer 2 bandwidth by bonding physical links.
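The bandwidth-scaling behavior of a bonded port channel can be sketched as per-flow hashing. Real switches hash on Layer 2-4 header fields in hardware; this minimal model just shows the key property: a given flow is always pinned to the same member link (avoiding packet reordering), while different flows spread across all bonded links.

```python
# Minimal sketch of flow-based hashing across a multi-chassis EtherChannel.
# CRC32 is used so the mapping is deterministic across runs; real hardware
# uses its own hash on L2-L4 fields.
import zlib

def member_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Pick a port-channel member link for a flow (deterministic hash)."""
    key = f"{src_mac}->{dst_mac}".encode()
    return zlib.crc32(key) % n_links

# The same flow always lands on the same member link:
link = member_link("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02", 4)
assert link == member_link("00:aa:bb:cc:dd:01", "00:aa:bb:cc:dd:02", 4)
```

Because the hash is per-flow rather than per-packet, aggregate bandwidth scales with the number of bonded links without reordering any single conversation.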
Because vPC and VSS present a single logical connection from an STP perspective, disjoint STP instances can be deployed in each data center. Such isolation can be achieved with BPDU filtering on the DCI links or with a separate Multiple Spanning Tree (MST) region on each site. At the time of writing, vPC does not support Layer 3 peering, but if you need a Layer 3 link you can simply add one; unlike the extended Layer 2 links, it does not need to run over dark fiber or protected DWDM.
Cisco has validated this design, and the results are freely available on the Cisco site. In summary, they tested a variety of combinations, such as VSS-VSS, VSS-vPC, and vPC-vPC, and validated the design with 200 Layer 2 VLANs and 100 SVIs, or 1000 VLANs and 1000 SVIs with static routing.
At the time of writing, the M-series modules for the Nexus 7000 support native encryption of Ethernet frames through the IEEE 802.1AE standard. This implementation uses the Advanced Encryption Standard (AES) cipher with a 128-bit shared key.
Ethernet Extensions and FabricPath
FabricPath gives network operators the ability to design and implement a scalable Layer 2 fabric that allows VLANs anywhere, reducing the physical constraints on server location. It provides a highly available design with up to 16 active Layer 2 paths, each of which can be a 16-member port channel, for both unicast and multicast. This lets MSDC networks use flat topologies in which nodes are separated by a single hop (equidistant endpoints). Cisco has not targeted FabricPath as a primary DCI solution, since it lacks the DCI-specific functions of OTV and VPLS; its primary purpose is Clos-based architectures. However, if you need to interconnect three or more sites over short distances with high-quality point-to-point optical transmission links, FabricPath is a valid option, provided your WAN links support remote port shutdown and protection against micro-flapping. By default, OTV and VPLS should be the first solutions considered, as they are Cisco Validated Designs with DCI-specific features; for example, OTV has the capability to flood unknown unicast for specific VLANs.
IP Core with Overlay Transport Virtualization (OTV)
OTV provides dynamic encapsulation with multipoint connectivity for up to 10 sites (NX-OS 5.2 supports 6 sites; NX-OS 6.2 supports 10 sites).
OTV, sometimes also called Over-the-Top Virtualization, is a DCI-specific technology that enables Layer 2 extension across data center sites by employing MAC-in-IP encapsulation with built-in loop prevention and failure-boundary preservation. There is no data-plane learning: all unicast and multicast reachability between sites is advertised via the overlay control plane (Layer 2 IS-IS) running on top of the provider's network.
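The control-plane-driven MAC-in-IP forwarding model can be sketched as a lookup table keyed by destination MAC. In real OTV the table is populated by Layer 2 IS-IS and the outer header is a proper IP/UDP encapsulation toward the remote edge device; here the addresses are hypothetical and the "encapsulation" is just a tuple.

```python
# Conceptual sketch of OTV forwarding: MAC reachability comes only from
# the overlay control plane, never from data-plane flooding.

# MAC address -> IP of the remote OTV edge device (learned via IS-IS)
otv_mac_table = {
    "00:50:56:aa:00:01": "10.1.1.1",  # host behind the DC-A edge device
    "00:50:56:bb:00:02": "10.2.2.2",  # host behind the DC-B edge device
}

def encapsulate(dst_mac: str, frame: bytes):
    """MAC-in-IP: wrap the L2 frame toward the remote site's edge device.

    Unknown unicast has no control-plane entry and is dropped rather than
    flooded, preserving the failure boundary between sites.
    """
    remote_ip = otv_mac_table.get(dst_mac)
    if remote_ip is None:
        return None                    # no control-plane entry: drop
    return remote_ip, frame            # (outer IP destination, inner frame)

print(encapsulate("00:50:56:bb:00:02", b"payload"))  # ('10.2.2.2', b'payload')
print(encapsulate("00:50:56:cc:00:03", b"payload"))  # None - not flooded
```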
OTV is supported on the Nexus 7000 since NX-OS Release 5.0 and on the ASR 1000 since IOS XE Release 3.5.
As a DCI technology, OTV offers robust high availability: most failures converge in under a second, with only extreme and very unlikely failures, such as an entire device going down, resulting in convergence of less than 5 seconds.
Locator/ID Separation Protocol (LISP)
Locator/ID Separation Protocol (LISP) has many applications. As the name suggests, it separates the location and the identity of network hosts, making it possible for VMs to move across subnet boundaries while retaining their IP addresses. LISP works well both when you have to move workloads and when you have to distribute workloads across data centers, making it a perfect complementary technology for an active/active data center design. It provides you with:
a) Global IP mobility across subnets for disaster recovery and cloud bursting (without LAN extension), with optimized routing across extended-subnet sites.
b) Routing with extended subnets for Active/Active data centers and distributed clusters (with LAN extension).
LISP addresses the problems of both ingress and egress traffic tromboning. It maintains a location mapping table, so when a host move is detected, updates are triggered automatically and the ingress routers (ITRs or PITRs) start sending traffic to the new location. For ingress flows inbound on the WAN, LISP also answers the limitations of BGP in controlling ingress traffic. Without LISP, you are limited to specific-route filtering: if you have a PI prefix such as a /16 and break it up into 4 x /18 advertisements, you may still get poor ingress load balancing on your DC WAN links, and even splitting it into 8 x /19 may give unfavorable results. LISP works differently from BGP in that a LISP proxy provider advertises the /16 on your behalf (you do not advertise the /16 from your DC WAN links) and distributes traffic 50:50 to your DC WAN links. LISP can achieve a near-perfect 50:50 split at the DC edge.
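The de-aggregation arithmetic above, and the contrast with LISP's mapping-based balancing, can be sketched as follows. The prefix and router names are hypothetical; the weighted selection is a toy model of an ITR choosing between two locators (RLOCs) carrying equal weights in a mapping entry.

```python
# BGP de-aggregation: a /16 splits into 4 x /18 or 8 x /19, but which
# more-specific attracts which traffic is still up to distant BGP policy.
import ipaddress
import zlib

pi = ipaddress.ip_network("10.0.0.0/16")      # hypothetical PI prefix
print(len(list(pi.subnets(new_prefix=18))))   # 4
print(len(list(pi.subnets(new_prefix=19))))   # 8

# LISP instead keeps the /16 whole and balances per-flow across the two
# DC WAN routers (RLOCs) using equal weights in the mapping entry:
rlocs = [("DC-A-edge", 50), ("DC-B-edge", 50)]

def itr_select(flow_id: str) -> str:
    """Weighted per-flow RLOC choice at the ingress tunnel router."""
    total = sum(w for _, w in rlocs)
    point = zlib.crc32(flow_id.encode()) % total
    for rloc, weight in rlocs:
        if point < weight:
            return rloc
        point -= weight
    return rlocs[-1][0]
```

With equal weights, flows hash evenly across both edges, which is the near-50:50 split the text describes; adjusting the weights in the mapping entry shifts the ratio without touching BGP advertisements.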