Leaf-Spine – Data Center Design

Tree-based topologies have been the mainstay of data center networks. Traditionally, Cisco has recommended a multi-tier, tree-based data center topology as depicted in the diagram below. These networks are characterized by aggregation pairs ( AGGs ) that aggregate traffic at many points in the network. Hosts connect to access or edge switches, these switches connect to the distribution layer, and the distribution layer connects to the core. The core should offer no services ( firewall, load balancing or WAAS ); its main role is to forward packets as quickly as it can. The aggregation switches define the boundary of the Layer 2 domain, and VLANs are used to contain broadcast traffic and further subdivide it into segmented groups.

Multi-Tier tree-based topology


The design was based on the principle of fault avoidance, and the strategy for implementing this principle is to take each switch and its connected links and build redundancy into it. This led to the introduction of port-channels and devices deployed in pairs. Servers pointed to a First Hop Redundancy Protocol such as HSRP or VRRP ( Hot Standby Router Protocol or Virtual Router Redundancy Protocol ). This steady-state type of network design led to many inefficiencies:

a) Inefficient use of bandwidth via a single rooted core.

b) Operational and configuration complexity.

c) The cost of having redundant hardware.

d) Not optimized for small flows.

With recent changes to application and user requirements, the functions of data centers have changed, which in turn has changed the topology and design of the data center. The traditional aggregation-point style of design was inefficient, and recent changes in end-user requirements are driving architects to design around the following key elements.

New Data Center Requirements

1) Equidistant endpoints with non-blocking network core.

Equidistant endpoints means that every device is a maximum of one hop away from every other device, resulting in consistent latency in the data center. The term “non-blocking” refers to the internal forwarding performance of the switch. Basically, non-blocking is the ability to forward at line rate in both directions ( tx/rx ): sender X can send to receiver Y without being blocked by a simultaneous sender Q.

A blocking architecture is one that cannot deliver the full amount of bandwidth even when the individual switching modules are not oversubscribed or when not all ports are transmitting simultaneously.

2) Unlimited workload placement and mobility.

The application team wants the ability to place an application at any point in the network and have it communicate with existing services such as storage. This usually means that VLANs need to sprawl for vMotion to work. The main question is where we need large Layer 2 domains. Bridging doesn’t scale, and not just because of spanning tree issues: MAC addresses are not hierarchical and cannot be summarized. There is also a limit of 4094 usable VLANs, because the 802.1Q VLAN ID is a 12-bit field.
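As a rough illustration of why hierarchical addressing scales where bridging does not, the sketch below uses Python's standard `ipaddress` module to collapse four contiguous subnets into a single summary route. The subnets are made-up examples; a flat MAC address table has no equivalent operation. The last lines show where the VLAN limit comes from.

```python
import ipaddress

# Four contiguous rack subnets (hypothetical addressing, for illustration only).
rack_subnets = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# IP prefixes are hierarchical, so the four routes collapse into one summary.
summary = list(ipaddress.collapse_addresses(rack_subnets))
print(summary)  # [IPv4Network('10.1.0.0/22')]

# The 802.1Q VLAN ID is a 12-bit field; IDs 0 and 4095 are reserved.
usable_vlans = 2**12 - 2
print(usable_vlans)  # 4094
```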

3) Lossless transport for storage and other elephant flows.

To support this type of traffic, data centers require not only conventional QoS tools but also Data Center Bridging ( DCB ) tools such as Priority flow control ( PFC ), Enhanced transmission selection ( ETS ) and Data Center Bridging Exchange ( DCBX ) to be applied throughout their designs. These standards are enhancements that allow lossless transport and congestion notification over full-duplex 10 Gigabit Ethernet networks.

DCB Features and Benefits

Feature | Benefit
Priority-based Flow Control ( PFC ) | Manages bursty single traffic source on a multiprotocol link.
Enhanced Transmission Selection ( ETS ) | Enables bandwidth management between traffic types for multiprotocol links.
Congestion notification | Addresses the problems of sustained congestion by moving corrective action to the edge of the network.
Data Center Bridging Exchange Protocol ( DCBX ) | Allows the exchange of enhanced Ethernet parameters.

4) Simplified provisioning and management.

Simplified provisioning and management is key to operational efficiency. Auto-provisioning, and letting users manage their own networks, remain a challenge for future networks.

5) High server to access layer transmission rate at Gigabit and 10 Gigabit Ethernet.

Before the advent of virtualization, servers transitioned from 100 Mbps to 1GbE as processor performance increased. With the introduction of high-performance multicore processors, and with each physical server hosting multiple VMs, the processor-to-network bandwidth requirements increased dramatically, making 10 Gigabit Ethernet the most common network access option for servers. The popularization of 10 Gigabit Ethernet for server access has provided a straightforward way to consolidate multiple Gigabit Ethernet interfaces into a single connection, which makes Ethernet an extremely viable technology for future-proof I/O consolidation. In addition, to reduce networking costs, data centers now carry data and storage traffic over Ethernet using protocols such as iSCSI ( Internet Small Computer System Interface ) and FCoE ( Fibre Channel over Ethernet ). FCoE allows the transport of Fibre Channel over a lossless Ethernet network.

FCoE Frame Format

Although there has been some talk of the introduction of 25 Gigabit Ethernet due to the excessive price of 40 Gigabit Ethernet, the two main speeds on the market are Gigabit and 10 Gigabit Ethernet.

The following table compares Gigabit and 10 Gigabit Ethernet:

Gigabit Ethernet | 10 Gigabit Ethernet
+ Well known and field tested | + Much faster vMotion
+ Standard and cheap copper cabling | + Converged storage & network ( FCoE or lossless iSCSI/NFS )
+ NIC on motherboard | + Reduces the number of NICs per server
 | + Built-in QoS with ETS and PFC
 | + Uses fiber cabling, which has lower energy consumption and error rate
– Numerous NICs per hypervisor host, maybe up to 6 ( user data, vMotion, storage ) | – More expensive NICs
– No storage / networking convergence; unable to combine networking and storage onto one NIC | – Usually requires new cabling to be laid, which in turn could mean more structured panels
– No lossless transport for storage and elephant flows | – SFPs for single-mode or multimode fiber can cost up to $4000 list each

Understanding Spine and Leaf

The key difference between traditional aggregation layers/points and fabric networks is that fabrics don’t aggregate.

If every edge router is to send at 10 Gbps to every other edge router, we must add bandwidth between Routers A and B. For example, if three hosts each send at 10 Gbps, we need a core that supports 30 Gbps.

Traditional Aggregation Topology

We have to add bandwidth at the core because of cases like the following: two routers each want to send 10 Gbps ( 2 x 10 Gbps in total ), but the core supports a maximum of 10 Gbps ( a single 10 Gbps link between Router A and B ). Both streams of data must then be interleaved onto the oversubscribed link so that both senders get an equal share of the bandwidth.



When you have more bandwidth coming into the core than the core can accommodate, you get blocking and oversubscription. Blocking and oversubscription cause delay and jitter, which are bad for some applications, so we need a way to provide full bandwidth between all end hosts.

Oversubscription is expressed as the ratio of inputs to outputs ( e.g. 3:1 ), or as a percentage calculated as (1 – (# outputs / # inputs)). For example, (1 – (1 output / 3 inputs)) = 67% oversubscribed. There will always be some oversubscription on the network and we cannot get away from it entirely, but as a general rule of thumb an oversubscription ratio of 3:1 is considered best practice. Some applications will operate fine when oversubscription occurs, and it is up to the architect to have a thorough understanding of application traffic patterns, bursting needs, and baseline states in order to accurately define the oversubscription limits a system can tolerate.
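The two ways of expressing oversubscription can be worked through in a few lines. This is a minimal sketch; the function name and parameter names are illustrative, and the figures are the three-hosts-into-one-link case used earlier.

```python
from fractions import Fraction

def oversubscription(inputs_gbps: int, outputs_gbps: int) -> tuple[Fraction, float]:
    """Return (ratio of input to output bandwidth, percent oversubscribed)."""
    ratio = Fraction(inputs_gbps, outputs_gbps)
    percent = (1 - outputs_gbps / inputs_gbps) * 100
    return ratio, percent

# Three hosts sending at 10 Gbps each into a single 10 Gbps core link.
ratio, percent = oversubscription(inputs_gbps=30, outputs_gbps=10)
print(f"{ratio.numerator}:{ratio.denominator}, {percent:.0f}% oversubscribed")
# 3:1, 67% oversubscribed
```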

The simplest solution to the oversubscription and blocking problems would be to increase the bandwidth between Router A and B, as shown in the diagram labelled “Traditional Aggregation Topology”. This is feasible up to a certain point. As the number of edge hosts grows, the capacity between Router A and B must grow with it, for example from 10 Gbps to 30 Gbps. Data center links are expensive, and so are the optics used to connect them.

The Solution is a Leaf – Spine design

The solution is to divide the core into a number of Spine devices, which exposes the internal fabric. This is achieved by spreading the fabric across multiple devices ( Leaf and Spine ). As a result, every Leaf edge switch connects to every Spine core switch, giving every edge device the full bandwidth of the fabric. This effectively places multiple traffic streams in parallel, as opposed to the traditional multi-tier design that stacked multiple streams onto a single link. The higher degree of equal-cost multi-path routing ( ECMP ) found in leaf and spine architectures allows for greater cross-sectional bandwidth between layers, and thus greater east-west bandwidth.

There is also a reduction in the fault domain compared to traditional access, distribution, core designs. A failure of a single device reduces the available bandwidth only by a fraction, and only traffic in transit will be lost with a link failure. ECMP reduces exposure to a single fault and shrinks the fault domain.
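To put a number on "only a fraction", the sketch below computes the uplink capacity of one Leaf before and after a Spine failure. The fabric size ( 4 Spines, 40 Gbps links ) is a hypothetical example, not a figure from any particular deployment.

```python
def leaf_uplink_capacity(spines: int, link_gbps: int, failed_spines: int = 0) -> int:
    """Uplink bandwidth of one leaf that has one link to every spine."""
    if not 0 <= failed_spines <= spines:
        raise ValueError("failed_spines must be between 0 and spines")
    return (spines - failed_spines) * link_gbps

# Hypothetical fabric: 4 spines, one 40 Gbps link from each leaf to each spine.
healthy = leaf_uplink_capacity(spines=4, link_gbps=40)
degraded = leaf_uplink_capacity(spines=4, link_gbps=40, failed_spines=1)
print(healthy, degraded, f"{1 - degraded / healthy:.0%} lost")
# 160 120 25% lost
```

With more Spines the loss from a single failure shrinks further, since each Spine carries 1/N of the Leaf's uplink capacity.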

Leaf-spine network topology

Origination of the Leaf and Spine design

The Clos network was originally designed by Charles Clos in 1952 as a multi-stage, circuit-switched interconnection network that provided a scalable approach to building large-scale voice switches. It constrained high-speed switching fabrics and required low-latency non-blocking switching elements.

There has been an increase in the deployment of Clos-based models in data centers. Usually the Clos network is folded around the middle to form a “folded-Clos” network, referred to as a leaf-spine topology. A folded three-stage Clos fabric has two tiers of switches: servers connect directly to ToR ( top of rack ) Leaf switches, and every Leaf connects to every Spine switch. The Spine is responsible for interconnecting all Leafs and provides the means for hosts in one rack to talk to hosts in another rack. The Leafs are responsible for physically connecting the servers and for distributing traffic equally via ECMP across all Spine nodes.
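The way a Leaf spreads traffic across Spines with ECMP can be sketched as a hash over each flow's 5-tuple. This is only an illustration: real switches use their own hardware hash functions, SHA-256 is a stand-in here, and the Spine names and addresses are made up.

```python
import hashlib
from collections import Counter

SPINES = ["spine-1", "spine-2", "spine-3", "spine-4"]  # hypothetical names

def pick_spine(src_ip, dst_ip, proto, src_port, dst_port, spines=SPINES):
    """Map a flow's 5-tuple to one spine; the same flow always takes the same path."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return spines[int.from_bytes(digest[:4], "big") % len(spines)]

# 1000 made-up TCP flows between two hosts, differing only in source port.
flows = [("10.0.1.10", "10.0.2.20", "tcp", 40000 + i, 443) for i in range(1000)]
print(Counter(pick_spine(*f) for f in flows))

# A single flow is pinned to one spine, so its packets stay in order.
assert pick_spine(*flows[0]) == pick_spine(*flows[0])
```

Because the choice is made per flow rather than per packet, the distribution is only roughly even, and one very large flow still occupies a single path.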

Leaf and Spine : Folded 3-stage Clos fabric

Deployment considerations:

A. Fixed or Modular Switches

Fixed switches | Modular switches
+ Cheaper | + Gradual growth
+ Lower power consumption | + Larger fabrics with leaf/spine topologies
+ Require less space | + Built-in redundancy with redundant SUPs and SSO/NSF
+ More ports per RU | + In-service software redundancy
 | + Easier to manage
– Hard to manage | – More expensive
– Difficult to expand |
– More cabling due to increase in device numbers |

The Leaf layer is what determines the size of the Spine and also the oversubscription ratios. It is responsible for advertising subnets into the network fabric. An example of a Leaf device would be a Nexus 3064 which provides:

1) Line-rate for Layer 2 and Layer 3 on all ports.

2) Shared memory buffer space.

3) Throughput of 1.28 terabits per second ( Tbps ) and 950 million packets per second ( Mpps ).

4) 64-way ECMP

The Spine layer is responsible for learning infrastructure routes and physically interconnecting all Leaf nodes. The Nexus 7K is the platform of choice for the Spine layer. The F2 series line cards provide 48 x 10G line-rate ports and fit the requirements of a spine architecture very well.
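How the Leaf port split drives both the Spine count and the oversubscription ratio can be sketched with some simple arithmetic. The port counts below are hypothetical ( a 64-port Leaf using 16 ports as uplinks, and a Spine offering 48 ports ), and every port is assumed to run at the same speed.

```python
def size_fabric(leaf_ports: int, leaf_uplinks: int, spine_ports: int) -> dict:
    """Size a two-tier fabric where every leaf has one link to every spine.

    All ports are assumed to run at the same speed.
    """
    server_ports = leaf_ports - leaf_uplinks
    return {
        "spines": leaf_uplinks,                      # one uplink per spine
        "max_leaves": spine_ports,                   # one spine port per leaf
        "max_servers": spine_ports * server_ports,
        "oversubscription": server_ports / leaf_uplinks,
    }

# Hypothetical numbers: 64-port leaf with 16 uplinks, 48-port spine.
print(size_fabric(leaf_ports=64, leaf_uplinks=16, spine_ports=48))
# {'spines': 16, 'max_leaves': 48, 'max_servers': 2304, 'oversubscription': 3.0}
```

Using fewer uplinks per Leaf frees more server ports but raises the oversubscription ratio, which is the trade-off the Leaf layer imposes on the rest of the design.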

The following are the types of implementations you could have with this topology:

a) Layer 3 fabric with standard routing.

b) Large-scale bridging ( FabricPath, TRILL or SPB ).

c) Many-chassis MLAG ( Cisco VSS ).

The remainder of this article will focus on Layer 3 fabrics with standard routing.

B. Non-Redundant Layer 3 Design

Non-Redundant Layer 3 Design

Design Summary

a) Layer 3 extends directly to the access layer. Layer 2 VLANs do not span the Spine layer.

b) Servers are connected to single switches. Servers are not dual-connected to two switches, i.e. there is no server-to-switch redundancy or MLAG.

c) All connections between the switches are pure routed point-to-point Layer 3 links.

d) There are no inter-switch VLANs, so no VLAN ever extends beyond a single switch.

The Challenge

When the Spine switches advertise only the default route to the Leaf switches, the Leaf switches lose visibility of the rest of the network, and you will need additional intra-Spine links. Intra-Spine links should not be used for data plane traffic in a leaf-spine architecture.

Summarization is Bad

Design Assumptions

The Spine layer is passing a default route to the Leaf.

The link between the Leaf connecting to Host 1 and Spine Z fails. In the diagram, the link is marked with a red “X”.


Host 4 sends traffic into the fabric that is destined for Host 1. This traffic is spread ( ECMP ) across all links connecting Host 4’s Leaf to the Spine layer. Some of it hits Spine C, and because C no longer has a direct link to the Leaf connecting Host 1 ( it has failed ), some traffic may be dropped while other traffic takes a suboptimal path. To overcome this you would need to add inter-switch links between the Spine switches, which is not recommended.
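The failure scenario can be modelled in a few lines. This is a simplified sketch: the Leaf names, the three Spines A, B and C, and the `links_up` table are assumptions made for illustration, and there are no intra-Spine links.

```python
# Leaf-to-spine links that are up. The leaf1-C link has failed.
links_up = {
    "leaf1": {"A", "B"},        # leaf connecting Host 1
    "leaf4": {"A", "B", "C"},   # leaf connecting Host 4
}

def next_hops(src_leaf, dst_leaf, summarized):
    """Spines the source leaf will ECMP across for traffic to dst_leaf."""
    candidates = links_up[src_leaf]
    if summarized:
        # Only a default route is known, so every reachable spine looks equally good.
        return sorted(candidates)
    # Specific prefixes: a spine advertises dst_leaf's subnet only if it can reach it.
    return sorted(s for s in candidates if s in links_up[dst_leaf])

def black_holed(src_leaf, dst_leaf, summarized):
    """Spines that receive traffic but have no link to the destination leaf."""
    return [s for s in next_hops(src_leaf, dst_leaf, summarized)
            if s not in links_up[dst_leaf]]

print(black_holed("leaf4", "leaf1", summarized=True))   # ['C']
print(black_holed("leaf4", "leaf1", summarized=False))  # []
```

With specific prefixes the source Leaf simply stops using the Spine that lost its link, which is why the Leaf switches need enough table space to hold them.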


Recommendations

1) Buy Leaf switches that can support enough IP prefixes, and do not summarize from Spine to Leaf.

2) Always use 40G links instead of channels of 4 x 10G links, because the bandwidth of a link aggregation does not affect routing costs. If you lose a member link in the port-channel, the cost of the port-channel does not change, which could result in congestion on the remaining links. You could use Embedded Event Manager ( EEM ) scripting to change the OSPF cost after one of the port-channel member links fails. This adds complexity to the network, as you no longer have equal-cost routes. That in turn would lead you to the Cisco proprietary protocol EIGRP, which does support unequal-cost routing. If you did not want to run a Cisco proprietary protocol, you could implement MPLS TE between the ToR switches; you would need to check that the DC switches support MPLS label switching.

3) Use QSFP optics as they are more robust than SFP optics. This will lower the likelihood of one of the parallel links failing.
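The congestion risk with 4 x 10G channels can be illustrated with a small calculation. It assumes, as described above, that routing keeps sharing traffic equally after a member link fails; the uplink names and the 70 Gbps offered load are hypothetical.

```python
def uplink_utilisation(offered_gbps, uplinks):
    """Share traffic equally across equal-cost uplinks and report each one's load.

    `uplinks` maps an uplink name to its currently available capacity in Gbps.
    Equal sharing models routing costs that ignore the lost member link.
    """
    share = offered_gbps / len(uplinks)
    return {name: share / capacity for name, capacity in uplinks.items()}

# Hypothetical leaf with two uplinks, each a 4 x 10 Gbps port-channel.
healthy = {"po1": 40, "po2": 40}
degraded = {"po1": 30, "po2": 40}   # one member of po1 has failed

print(uplink_utilisation(70, healthy))   # {'po1': 0.875, 'po2': 0.875}
print(uplink_utilisation(70, degraded))  # po1 is offered 35 Gbps but can carry only 30
```

A single 40G link either carries its full capacity or fails outright and drops out of ECMP, so this partially degraded state does not arise.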


C. Redundant Layer 3 Design

Redundant Layer 3 Design

Design Summary

a) Servers are dual-homed to two different switches.

b) Servers have one IP address due to the restrictions of TCP applications. Ideally use LACP ( Link Aggregation Control Protocol ) between the servers and the switches.

c) Layer 2 trunk links between the Leaf switches are needed to carry VLANs that span both switches. This restricts VLANs from spanning the core, which avoids creating a large L2 fabric based on STP.

d) ToR switches will need to be in the same subnet ( share the servers’ subnet ) and advertise this subnet into the fabric. This is because the servers are dual-homed to two switches with one IP address.

The challenge

The Leaf switches both advertise the same subnet to the Spine switches, so the Spine switches think they have two paths to reach the host. A Spine switch will spread its traffic for Host 1 across both Leaf switches that connect Host 1 and Host 2. In certain scenarios this could result in traffic to the hosts traversing the inter-switch link between the Leaf nodes. This may not be a problem if the majority of traffic is leaving the servers northbound ( traffic leaving the data center ). However, if there is a lot of inbound traffic, this link could become a bottleneck and congestion point.

This may not be an issue if this is a hosting web server farm, because the majority of traffic will be leaving the data center to the users that are external.


Recommendations

1) If there is a lot of east-west traffic ( 80% ), it is mandatory to use LAG ( Link Aggregation Group ) between the servers and the ToR Leaf switches.

2) The two Leaf switches must support MLAG ( Multichassis Link Aggregation ). With MLAG on the Leaf switches, when either Leaf receives traffic destined for host X, it knows it can reach the host directly through its own connected link, resulting in optimal southbound traffic flow.

3) Most LAG solutions place traffic from a single TCP session onto a single uplink, limiting the session’s throughput to the bandwidth of a single uplink interface. However, Dynamic NIC teaming, available in Windows Server 2012 R2, can split a single TCP session into multiple flows and distribute them across all uplinks.

4) Use dynamic link aggregation ( LACP ) and not static port-channels. The LAGs used between servers and switches should use LACP to prevent traffic black-holing.

About Matt Conran
