Spine Leaf Architecture



What is spine and leaf architecture? A spine-leaf architecture is a data center topology that consists of two switching layers. The leaf layer consists of access switches that aggregate traffic from endpoints, which could be traditional servers or containers, and connect directly to the spine, which is the network core. There are often at least two spine switches for redundancy, and every spine switch connects to every leaf switch in a full-mesh leaf and spine topology. With a spine and leaf data center network design, the leaf switches do not connect directly to each other.

Instead, all leaf-to-leaf connectivity goes through the spine. The physical layout is generally the same from one deployment to the next, and the logical layout is built on top of it with a network overlay protocol, more than likely VXLAN. An example of a data center fabric that uses such a design is Cisco ACI, which has three main components: the Application Policy Infrastructure Controller ( APIC ), the spine switches, and the leaf switches.


Key Spine Leaf Architecture Discussion Points:

  • Introduction to the spine leaf architecture and what is involved.

  • Highlighting the details of this type of data center design.

  • Critical points on spine-leaf switch requirements.

  • Technical details on the origins of this design.

  • Technical solutions that can be used in the leaf and spine design.


  • A key point: Video on spine leaf switch architecture with Cisco ACI

The following video provides a good overview of spine and leaf architecture. We will examine the leaf and spine data center architecture, which is a considerable step from the traditional DC design. As a use case, we will focus on how Cisco has adopted the leaf and spine design with its Cisco ACI product. We will address the components and how they form the Cisco ACI data center fabric.


Back to basics with data center design

At its most straightforward, a data center is a physical facility that houses applications and data. Its design is based on a network of computing and storage resources that enables the delivery of shared applications and data. The critical elements of a data center design include routers, switches, firewalls, storage systems, servers, and application-delivery controllers.

The data center should be flexible enough to deploy and support new services quickly. Such a design needs substantial initial planning and consideration of port density, access layer uplink bandwidth, actual server capacity, and oversubscription, to name a few.


Traditional Tree-Based Topologies

Tree-based topologies sit at the opposite end of the spectrum from a spine-leaf switch design. They have been the mainstay of data center networks. Traditionally, Cisco has recommended a multi-tier tree-based data center topology, as depicted in the diagram below.

These networks are characterized by aggregation pairs ( AGGs ) that aggregate traffic from many network points. Hosts connect to access or edge switches, which connect to distribution, and distribution connects to the core.

The core should offer no services ( firewall, load balancing, or WAAS ), and its central role is to forward packets as quickly as possible. The aggregation switches define the boundary for the Layer 2 domain, and VLANs are used to contain broadcast traffic and further subdivide traffic into segmented groups. This style of design operates very differently from a spine leaf architecture.


The traditional three-tier model was based on the following design principles:

  1. The access switch connects to endpoints, e.g., servers.
  2. The aggregation or distribution switches provide redundant connections to access switches.
  3. The core switches provide fast transport between aggregation switches, typically connected in a redundant pair for high availability.
  4. Networking and security services such as load balancing or firewalling were typically connected to the distribution layers.


Diagram: The traditional data center design. Non spine leaf architecture.


The focus of the design

The design’s focus was based on fault avoidance principles. The strategy for implementing this principle is to take each switch and its connected links and build redundancy into it. This led to the introduction of port channels and devices deployed in pairs. In addition, servers pointed to a First Hop Redundancy Protocol such as HSRP ( Hot Standby Router Protocol ) or VRRP ( Virtual Router Redundancy Protocol ). Unfortunately, this steady-state type of network design led to many inefficiencies:

  1. Inefficient use of bandwidth via a single-rooted core.
  2. Operational and configuration complexity.
  3. The cost of having redundant hardware.
  4. It is not optimized for small flows.

Recent changes to application and user requirements have changed the functions of data centers, which in turn has changed the topology and design of the data center to a spine-leaf switch topology. The traditional aggregation point design style was inefficient, and recent changes in end-user requirements are driving architects to design around the following key elements.


Spine Leaf Architecture: Requirements

At the most basic level, a spine-leaf architecture collapses one of these tiers, as depicted in the diagram below. It is characterized by the following:

  1. The removal of the Spanning Tree Protocol (STP)
  2. Increased use of fixed-port switches over modular models for the network backbone
  3. More cabling to purchase and manage, given the higher interconnection count
  4. A scale-out vs. scale-up of infrastructure.

Diagram: What is spine and leaf architecture? 2-tier spine leaf design.


Leaf and Spine Main Points

With the introduction of the cloud and containerized infrastructure, there was an increase in east-west traffic. East-west traffic differs from north-south traffic in that it moves laterally from server to server. Generally, this type of traffic flow stays internal to the data center.

With the change in traffic patterns, we need to design our data centers to have low latency and optimized traffic flows, especially for time-sensitive or data-intensive applications. A spine-leaf data center design aids this by ensuring traffic always crosses the same number of hops to reach its destination, so latency is lower and predictable.

STP has always been problematic in the data center. With a leaf and spine design, capacity improves because STP is no longer required. In the past, STP blocked redundant paths between two switches, so only one could be active at any time.

As a result, the remaining active paths were often oversubscribed. Spine-leaf architectures instead rely on Equal-Cost Multipath ( ECMP ) routing to load balance traffic across all available paths while still preventing network loops. So instead of running STP to the spine layer, we can run routing protocols.
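As a rough illustration of how ECMP spreads flows while keeping each flow on one path, the sketch below hashes a flow's 5-tuple to choose one of several equal-cost spine uplinks. The function name, the spine names, and the use of SHA-256 are assumptions made for illustration; real switches use vendor-specific hardware hash functions.

```python
import hashlib

def pick_uplink(flow, uplinks):
    """Pick one equal-cost uplink for a flow by hashing its 5-tuple."""
    # flow = (src_ip, dst_ip, protocol, src_port, dst_port)
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an integer, then map it to a link.
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]

# Hypothetical leaf with four equal-cost uplinks, one per spine.
spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.0.1.10", "10.0.2.20", 6, 49152, 443)
chosen = pick_uplink(flow, spines)
print(chosen)
```

Because the hash depends only on the flow's header fields, every packet of a given flow takes the same path, which avoids reordering, while different flows are spread across all spines.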

We also have better scalability. We can add additional spine switches, and leaf switches can be seamlessly inserted when port density becomes problematic. There is no need to take down the core layer for upgrades.

Diagram: STP Blocking. Source: Cisco Press free chapter.


Data Center Requirements

  • 1) Equidistant endpoints with non-blocking network core.

Equidistant endpoints mean that any two leaf switches are at most one spine hop apart, resulting in consistent latency in the data center. The term “non-blocking” refers to the internal forwarding performance of the switch.

Non-blocking is the ability to forward at line rate in both the transmit and receive directions: sender X can send to receiver Y and not be blocked by a simultaneous sender. A blocking architecture cannot deliver the total bandwidth even if the individual switching modules are not oversubscribed or if not all ports are transmitting simultaneously.

  • 2) Unlimited workload placement and mobility.

The application team wants to place the application at any point in the network and communicate with existing services like storage. This usually means that VLANs need to sprawl for vMotion to work. The main question is, where do we need large Layer 2 domains? Bridging doesn’t scale, and that’s not just because of spanning tree issues; it’s because MAC addresses are not hierarchical and cannot be summarized. There is also a limit of 4094 usable VLANs, because the VLAN ID field is only 12 bits wide.
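The VLAN limit follows from the 12-bit VLAN ID field in the 802.1Q tag. The short calculation below shows where the figure of roughly 4000 comes from and, for comparison, the size of the 24-bit VXLAN Network Identifier ( VNI ) space. The variable names are illustrative only.

```python
VLAN_ID_BITS = 12          # width of the VLAN ID field in an 802.1Q tag
RESERVED_VLAN_IDS = 2      # IDs 0 and 4095 are reserved
VXLAN_VNI_BITS = 24        # width of the VXLAN Network Identifier

usable_vlans = 2 ** VLAN_ID_BITS - RESERVED_VLAN_IDS
vxlan_segments = 2 ** VXLAN_VNI_BITS

print(usable_vlans)    # 4094
print(vxlan_segments)  # 16777216
```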

  • 3) Lossless transport for storage and other elephant flows.

To support this type of traffic, data centers require not only conventional QoS tools but also Data Center Bridging ( DCB ) tools such as Priority-based Flow Control ( PFC ), Enhanced Transmission Selection ( ETS ), and Data Center Bridging Exchange ( DCBX ) to be applied throughout their designs. These standards are enhancements that allow lossless transport and congestion notification over full-duplex 10 Gigabit Ethernet networks.




Priority-based Flow Control ( PFC ): Manages a bursty single traffic source on a multiprotocol link.

Enhanced Transmission Selection ( ETS ): Enables bandwidth management between traffic types for multiprotocol links.

Congestion notification: Addresses the problems of sustained congestion by moving corrective action to the edge of the network.

Data Center Bridging Exchange ( DCBX ) protocol: Allows the exchange of enhanced Ethernet parameters.


  • 4) Simplified provisioning and management.

Simplified provisioning and management are critical to operational efficiency. However, enabling auto-provisioning and allowing users to manage their own networks remains a challenge for future networks.

  • 5) High server-to-access layer transmission rate at Gigabit and 10 Gigabit Ethernet.

Before the advent of virtualization, servers transitioned from 100Mbps to 1GbE as processor performance increased. With the introduction of high-performance multicore processors and each physical server hosting multiple VMs, the processor-to-network connection bandwidth requirements increased dramatically, making 10 Gigabit Ethernet the most common network access option for servers.

In addition, the popularization of 10 Gigabit Ethernet for server access has provided a straightforward way to consolidate multiple Gigabit Ethernet interfaces into a single connection, making Ethernet an extremely viable technology for future-proof I/O consolidation.

In addition, to reduce networking costs, data centers are now carrying data and storage traffic over Ethernet using protocols such as iSCSI ( Internet Small Computer System Interface ) and FCoE ( Fibre Channel over Ethernet ). FCoE allows the transport of Fibre Channel frames over a lossless Ethernet network.

Diagram: FCoE frame format.


Although there has been some talk of introducing 25 Gigabit Ethernet due to the excessive price of 40 Gigabit Ethernet, the two main speeds on the market are Gigabit and 10 Gigabit Ethernet. The following is a comparison between Gigabit and 10 Gigabit Ethernet, with advantages marked + and disadvantages marked -:

Gigabit Ethernet

  + Well known and field-tested
  + Standard and cheap copper cabling
  + NIC on the motherboard
  - Numerous NICs per hypervisor host, possibly up to 6 NICs ( user data, vMotion, storage )
  - No storage/networking convergence; unable to combine networking and storage onto one NIC
  - No lossless transport for storage and elephant flows

10 Gigabit Ethernet

  + Much faster vMotion
  + Converged storage and network ( FCoE or lossless iSCSI/NFS )
  + Reduces the number of NICs per server
  + Built-in QoS with ETS and PFC
  + Uses fiber cabling, which has lower energy consumption and error rate
  - More expensive NICs
  - Usually requires new cabling to be laid, which in turn could mean more structured panels
  - SFPs for either single-mode or multimode fiber can cost up to $4000 list each

Spine-Leaf Switch Design

The critical difference between traditional aggregation layers/points and fabric networks is that fabric doesn’t aggregate. If we want every edge router to be able to send 10 Gbps to every other edge router, we must add bandwidth between routers A and B. For example, if we have three hosts sending at 10 Gbps each, we need a core that supports 30 Gbps.

We must add bandwidth at the core because otherwise, if two routers each wanted to send 10 Gbps of data and the core only supported a maximum of 10 Gbps ( a 10 Gbps link between routers A and B ), both data streams would have to be interleaved onto the oversubscribed link so that both senders get equal bandwidth.

You get blocking and oversubscription when more bandwidth comes into the core than the core can accommodate. Blocking and oversubscription cause delay and jitter, which is bad for some applications, so we must find a way to provide total bandwidth between each end host.

Oversubscription is expressed as the ratio of inputs to outputs ( e.g., 3:1 ) or as a percentage calculated as 1 - ( # outputs / # inputs ). For example, 1 - ( 1 output / 3 inputs ) = 67% oversubscribed. There will always be some oversubscription on the network, and there is nothing we can do to get away from that, but as a general rule of thumb, an oversubscription ratio of 3:1 is considered best practice.

Some applications will operate fine when oversubscription occurs. It is up to the architect to thoroughly understand application traffic patterns, bursting needs, and baseline states to define the oversubscription limits a system can tolerate accurately.
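To make the oversubscription arithmetic concrete, here is a minimal sketch that computes both the ratio and the percentage form from a switch's total downlink and uplink bandwidth. The function name and the example leaf ( 48 x 10 Gbps server ports, 4 x 40 Gbps uplinks ) are assumptions chosen for illustration.

```python
from math import gcd

def oversubscription(downlink_gbps, uplink_gbps):
    """Return (ratio string, percent oversubscribed) for a switch."""
    divisor = gcd(downlink_gbps, uplink_gbps)
    ratio = f"{downlink_gbps // divisor}:{uplink_gbps // divisor}"
    # Percentage form from the text: 1 - (outputs / inputs)
    percent = (1 - uplink_gbps / downlink_gbps) * 100
    return ratio, percent

# Hypothetical leaf: 48 x 10 Gbps server ports and 4 x 40 Gbps uplinks.
ratio, percent = oversubscription(48 * 10, 4 * 40)
print(ratio, round(percent))  # 3:1 67
```

With 480 Gbps facing the servers and 160 Gbps facing the spine, this example leaf lands exactly on the 3:1 rule of thumb.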

The simplest solution to overcome the oversubscription and blocking problems would be to increase the bandwidth between Router A and B, as shown in the diagram labeled “Traditional Aggregation Topology.” This is feasible only up to a certain point. The links between Router A and B must keep growing, for example from 10 Gbps to 30 Gbps, as the number of edge hosts grows, and data center links and the optics used to connect them are expensive.


The Solution

The solution is to divide the core devices into several spine devices, which exposes the internal fabric and enables a spine leaf architecture similar to what you see with ACI networks. This is achieved by spreading the fabric across multiple devices ( leaf and spine ).

Spreading the fabric results in every leaf edge switch connecting to every spine core switch, so every edge device has access to the total bandwidth of the fabric. This places multiple traffic streams in parallel, unlike the traditional multitier design that stacks multiple streams onto a single link.

In addition, the higher degree of equal-cost multi-path routing ( ECMP ) found with leaf and spine architectures allows for greater cross-sectional bandwidth between layers, and thus greater east-west bandwidth. There is also a reduction in the fault domain compared to traditional access, distribution, and core designs.

A failure of a single device only reduces the available bandwidth by a fraction, and with a link failure only the traffic in transit on that link is lost. ECMP therefore reduces exposure to any single fault and keeps the fault domain small.
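The "fraction" lost to a single failure can be estimated with simple arithmetic. The sketch below assumes every leaf has one equal-speed uplink to every spine, so each spine carries an equal share of the fabric bandwidth; the function name is illustrative.

```python
def remaining_bandwidth_fraction(spine_count, failed_spines=1):
    """Fraction of leaf-to-spine bandwidth left after spine failures.

    Assumes each leaf has one equal-speed uplink to every spine.
    """
    if not 0 <= failed_spines <= spine_count:
        raise ValueError("failed_spines must be between 0 and spine_count")
    return (spine_count - failed_spines) / spine_count

# A traditional redundant core pair loses half its capacity on one failure...
print(remaining_bandwidth_fraction(2))  # 0.5
# ...while a four-spine fabric keeps three quarters of it.
print(remaining_bandwidth_fraction(4))  # 0.75
```

Under this assumption, adding spines both increases capacity and shrinks the impact of any one device failing.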


Origination of the spine and leaf design

Charles Clos initially designed the Clos network in 1952 as a multi-stage circuit-switched interconnection network to provide a scalable approach to building large-scale voice switches. It constrained high-speed switching fabrics and required low-latency, non-blocking switching elements.

There has been an increase in the deployment of Clos-based models in data centers. Usually, the Clos network is folded around the middle to form a “folded-Clos” network, referred to as a spine leaf architecture. The spine-leaf switch design consists of three types of switches:

  • Servers connect directly to ToR ( top of rack ) switches.
  • ToR connects to aggregation switches.
  • Intermediate switches connect to aggregation switches. 

The spine is responsible for interconnecting all leaf switches and allows hosts in one rack to talk to hosts in another. The leaf switches are responsible for physically connecting the servers and distributing traffic via ECMP across all spine nodes.
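Since every leaf connects to every spine, the port counts of the two switch types bound the size of the fabric. The following is a minimal sizing sketch under the assumption of a two-tier fabric with exactly one link between each leaf and each spine; the function name and the port counts in the example are hypothetical.

```python
def fabric_size(spine_ports, leaf_ports, uplinks_per_leaf):
    """Size a two-tier fabric with one link from every leaf to every spine."""
    spines = uplinks_per_leaf                 # one uplink per spine
    leaves = spine_ports                      # each spine port terminates one leaf
    server_ports_per_leaf = leaf_ports - uplinks_per_leaf
    return {
        "spines": spines,
        "leaves": leaves,
        "server_ports": leaves * server_ports_per_leaf,
        "fabric_links": leaves * spines,
    }

# Hypothetical example: 32-port spines, 64-port leaves, 16 uplinks per leaf.
size = fabric_size(spine_ports=32, leaf_ports=64, uplinks_per_leaf=16)
print(size)
```

In this example, 48 server ports and 16 uplinks per leaf at the same speed also give a 3:1 oversubscription ratio at the leaf.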


Leaf and Spine: Folded 3-Stage Clos fabric

Spine-leaf switch deployment considerations:

A. Spine-leaf switch: Fixed or modular switches


Fixed switches

  + Cheaper
  + Lower power consumption
  + Require less space
  + More ports per RU
  - Hard to manage
  - Difficult to expand
  - More cabling due to an increase in device numbers

Modular switches

  + Gradual growth
  + Larger fabrics with leaf/spine topologies
  + Built-in redundancy with redundant SUPs and SSO/NSF
  + In-service software redundancy
  + Easier to manage
  - More expensive


The leaf layer determines the size of the spine and the oversubscription ratios. It is responsible for advertising subnets into the network fabric. An example of a leaf device would be a Nexus 3064, which provides the following:

  1. Line rate for Layer 2 and Layer 3 on all ports.
  2. Shared memory buffer space.
  3. Throughput of 1.28 terabits per second ( Tbps ) and 950 million packets per second ( Mpps )
  4. 64-way ECMP



The spine layer is responsible for learning infrastructure routes and physically interconnecting all leaf nodes. The Nexus 7K is an example platform for the spine device layer. The F2 series line cards can provide 48 x 10G line rate ports and fit the requirements of a spine architecture very well.

The following are the types of implementations you could have with this topology:

  1. Layer 3 fabric with standard routing.
  2. Large-scale bridging ( FabricPath, TRILL, or SPB ).
  3. Many-chassis MLAG ( Cisco VSS ).

This article will focus on Layer 3 fabrics with standard routing.


B. Spine-leaf switch: Non-redundant layer 3 design

Spine-leaf switch: Design Summary

  1. Layer 3 directly to the access layer. Layer 2 VLANs do not span the spine layer.
  2. Servers are connected to single switches. Servers are not dual connected to two switches, i.e., there is no server to switch redundancy or MLAG.
  3. All connections between the switches will be pure routed point-to-point layer 3 links.
  4. There are no inter-switch VLANs, so no VLAN will ever go beyond one switch.
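One way to picture the routed point-to-point links in this design is as an addressing plan with one small subnet per leaf-to-spine link. The sketch below uses Python's standard ipaddress module to carve /31 subnets ( RFC 3021 ) out of a pool. The pool 10.255.0.0/24, the device names, and the function name are assumptions for illustration only, not part of the original design.

```python
import ipaddress

def p2p_link_plan(leaves, spines, pool="10.255.0.0/24"):
    """Assign one /31 subnet to every leaf-spine point-to-point link."""
    subnets = ipaddress.ip_network(pool).subnets(new_prefix=31)
    plan = {}
    for leaf in leaves:
        for spine in spines:
            subnet = next(subnets)
            # A /31 contains exactly two addresses, one per link end.
            spine_ip, leaf_ip = list(subnet)
            plan[(leaf, spine)] = (str(leaf_ip), str(spine_ip), str(subnet))
    return plan

plan = p2p_link_plan(["leaf1", "leaf2"], ["spine1", "spine2"])
for link, addressing in plan.items():
    print(link, addressing)
```

Each link gets its own subnet, so no VLAN or broadcast domain ever extends beyond a single switch, which matches the design summary above.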


Spine-leaf switch: The challenge

When the spine switches only advertise a default route to the leaf switches, the leaf switches lose visibility of the rest of the network, and recovering from certain link failures requires additional intra-spine links. However, intra-spine links should not be used for data plane traffic in a leaf-spine architecture.

Spine-leaf switch: Design assumptions

The spine layer passes a default route to the leaf layer. The link between the leaf connecting to Host 1 and Spine C fails. In the diagram, the link is marked with a red “X.” Host 4 sends traffic to the fabric destined for Host 1.

This traffic is spread ( ECMP ) across all links connecting Host 4’s leaf to the spine layer. Some of the traffic hits Spine C, and as C no longer has a direct link to the leaf connecting to Host 1 ( it has failed ), some traffic may be dropped while the rest takes a sub-optimal path. To overcome this, you must add inter-switch links between the spine switches, which is not recommended.
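The failure scenario above can be modeled in a few lines. In this sketch, a source leaf that only holds a default route keeps using every spine, including one that can no longer reach the destination leaf, whereas a leaf with specific routes would stop using it. The topology dictionary, the leaf names, and spines A and B are hypothetical; only Spine C with its failed link comes from the scenario.

```python
def spines_for_destination(spine_links, dst_leaf, specific_routes):
    """Return (spines the source leaf sends to, spines that can deliver)."""
    delivering = {spine for spine, leaves in spine_links.items() if dst_leaf in leaves}
    # With only a default route, every spine looks equally good to the source leaf.
    chosen = delivering if specific_routes else set(spine_links)
    return chosen, delivering

# Which leaves each spine can still reach; spine C has lost its link to leaf1.
spine_links = {
    "A": {"leaf1", "leaf4"},
    "B": {"leaf1", "leaf4"},
    "C": {"leaf4"},
}

chosen, delivering = spines_for_destination(spine_links, "leaf1", specific_routes=False)
blackholed = chosen - delivering
print(blackholed)  # {'C'}
```

With specific_routes=True the set of blackholing spines is empty, which is why the recommendations favor leaf switches that can hold enough prefixes to avoid summarization.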


  • A key point: Video on Spine and Leaf design with Cisco ACI.

The following video will address fabric deployment and provisioning in Cisco ACI. All of this is done automatically for you, and we will check to ensure this has been done. Cisco ACI operates over a leaf and spine architecture.

We will confirm this by checking the individual ports on each ACI node, the LLDP status, and the IS-IS adjacency status. We will also examine the traditional DC design based on the 3-tier architecture, whose many drawbacks forced the move to a leaf and spine data center design.



Spine-leaf switch: Recommendations

  1. Buy Leaf switches that can support enough IP prefixes and don’t use summarization from Spine to Leaf.
  2. Always use 40G links instead of channels of 4 x 10G links because link aggregation bandwidth does not affect routing costs. If you lose a member link in the port channel, the cost of the port channel does not change, which could result in congestion on the remaining links. You could use Embedded Event Manager ( EEM ) scripting to change the OSPF cost after one of the port channel members fails, but this would add complexity to the network as you would no longer have equal-cost routes. This would lead you to use the Cisco proprietary protocol EIGRP, which supports unequal cost routing. If you didn’t want to support a Cisco proprietary protocol, you could implement MPLS TE between the ToR switches, but first you would need to check that the DC switches support the MPLS switching of labels.
  3. Use QSFP optics as they are more robust than SFP optics. This will lower the likelihood of one of the parallel links failing.
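The port-channel concern in recommendation 2 can be shown with a small calculation. The sketch assumes a leaf with two equal-cost uplinks, each a 4 x 10G port channel, and an offered load of 70 Gbps; because the routing cost does not change when a member link fails, ECMP keeps splitting the load equally. The function name and the traffic figures are assumptions for illustration.

```python
def ecmp_load(offered_gbps, path_capacities_gbps):
    """Split load equally across equal-cost paths.

    Returns one (load, capacity, congested) tuple per path.
    """
    share = offered_gbps / len(path_capacities_gbps)
    return [(share, capacity, share > capacity) for capacity in path_capacities_gbps]

# Both 4 x 10G port channels healthy: 35 Gbps on each 40 Gbps bundle.
healthy = ecmp_load(70, [40, 40])
# One member fails in the first bundle: capacity drops to 30 Gbps,
# but the unchanged routing cost still sends 35 Gbps its way.
degraded = ecmp_load(70, [30, 40])
print(healthy)
print(degraded)
```

A single 40G link avoids this partial-failure state: it is either fully up or removed from the ECMP set altogether.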


C. Spine-leaf switch: Redundant layer 3 design

Spine-leaf switch: Design Summary

  1. The servers are dual-homed to two different switches.
  2. Servers have one IP address due to the restriction of TCP applications. Ideally, use LACP ( Link Aggregation Control Protocol ) between the servers and the switches.
  3. Layer 2 trunk links between the Leaf switches are needed to carry VLANs that span both switches. This restricts VLANs from spanning the core, which avoids creating a sizeable L2 fabric based on STP.
  4. ToR switches must be in the same subnets ( share the server’s subnet) and advertise this subnet into the fabric. Again, the servers are dual-homed to 2 switches with one IP address.


Spine-leaf switch: The challenges

The leaf switches both advertise the same subnet to the spine switches, so the spine switches think they have two paths to reach the host. A spine switch will therefore spread traffic destined for Host 1 across the leaf switches connecting Host 1 and Host 2. In specific scenarios, this could result in traffic to the hosts traversing the inter-switch link between the leaf nodes. This may not be a problem if most traffic leaves the servers northbound ( traffic leaving the data center ), for example in a web hosting server farm, where most traffic leaves the data center toward external users. However, if there is a lot of inbound traffic, this link could become a bottleneck and congestion point.


Spine-leaf switch: Recommendation

  1. If there is a lot of east-to-west traffic ( 80 % ), using LAG ( Link Aggregation Group ) between the servers and ToR Leaf switches is mandatory.
  2. The two Leaf switches must support MLAG ( Multichassis Link Aggregation ). The result of using MLAG on the Leaf switches is that when either connecting Leaf receives traffic destined for host X, it knows it can reach it directly through its connected link—resulting in optimal southbound traffic flow.
  3. Most LAG solutions place traffic generated from a single TCP session onto a single uplink, limiting the TCP session throughput to the bandwidth of a single uplink interface. However, Dynamic NIC teaming is available in Windows Server 2012 R2 which can split a single TCP session into multiple flows and distribute them across all uplinks.
  4. Use dynamic link aggregation – LACP and not static port channels. The LAGs between servers and switches should use LACP to prevent traffic blackholing.


Key Spine Leaf Architecture Summary Points:

Main Checklist Points To Consider

  • The spine leaf architecture consists of a leaf layer and a spine layer. Endpoints connect to the leaf layer, and the spine switches act as the core.

  • This layout of the leaf and spine gives you optimal load balancing and ECMP for any endpoint in any location.

  • The traditional tree-based topologies are not suited for virtualization, and you will always be limited by the core port count.

  • The spine and leaf can build massive data centers with, for example, a folded 3-stage Clos design.

  • Cisco ACI is an example of a leaf and spine design. VXLAN is the most common overlay protocol that works over what is known as the underlay.


Matt Conran