Cumulus – Pure Layer-3 Data Centre
The source for this blog post is Ivan Pepelnjak’s recent Software Gone Wild podcast (show 38) with Cumulus Networks.
The challenges of designing a true layer-3-only data centre surface at the access layer. Dual-connected servers terminating on separate Top-of-Rack (ToR) switches cannot have more than one IP address, a limitation that results in VLAN sprawl, unnecessary ToR inter-switch links, and uplink broadcast domain sharing. Dinesh Dutt and Cumulus Networks devised a clever solution that redistributes Address Resolution Protocol (ARP) entries into the routing protocol, avoiding Multi-Chassis Link Aggregation (MLAG) designs and allowing pure layer-3 data centre networks. Layer 2 was not built with security in mind. A layer-3-only data centre eliminates the security problems that come with shared layer-2 domains. For a brief introduction to Cumulus architecture, kindly visit my previous post on Cumulus networks.
Are we using the “right” layer 2 protocol?
Layer 1 is the easy layer. It defines the encoding scheme needed to pass ones and zeros between devices. Things get more interesting at Layer 2, where adjacent devices exchange frames (layer-2 packets) for reachability. Layer-2 addresses, known as MAC addresses, are commonly used at Layer 2 but are not always needed. The need arises when more than two devices are attached to the same physical network. Imagine a device receiving a stream of bits: does it matter whether Ethernet, native IP or CLNS/CLNP comes in at the “second” layer? The question we should ask ourselves is whether we are using the “right” layer 2 protocol. Many networks implement VLANs to support random IP address assignment and IP mobility. The switches perform layer-2 forwarding even though they might be capable of layer-3 forwarding. They forward packets based on MAC addresses within a subnet, yet a layer-3 switch does not need layer-2 information to route IPv4 or IPv6 packets. Cumulus has gone one step further and made it possible to configure every server-to-ToR interface as a layer-3 interface. Their design permits multipath default route forwarding, removing the need for ToR interconnects and for sharing a common broadcast domain across uplinks.
Bonding vs ECMP
A typical server environment consists of a single server with two uplinks. For device and link redundancy, the uplinks are bonded into a port channel and terminated on different ToR switches, forming an MLAG. Because this is an MLAG design, the ToR switches need an inter-switch link. You cannot bond server NICs to two separate ToR switches without creating an MLAG.
If you don’t want to use an MLAG, there are other Linux bonding modes available on hosts, such as “active | passive” and “active | passive on receive” (roughly active-backup and balance-tlb in Linux bonding terms). The “active | passive” mode is popular as it offers predictable packet forwarding and easier troubleshooting. The “active | passive on receive” mode receives on one link but transmits on both. Usually, you can only receive on one interface, as that is the one whose MAC address is in the ARP cache of your neighbors. A third mode (roughly balance-alb) is available, but it relies on a trick: it sends different ARP replies to different neighbors. This places both MAC addresses into the ARP caches of your neighbors, allowing both interfaces to receive. To prevent MAC address flapping at the ToR switch, a separate MAC address is transmitted on each link. If a switch receives the same MAC address over two separate interfaces, it will generate a MAC address flapping error. All of these bonding modes share a common problem: we can’t associate one IP address with two MAC addresses. These solutions also require ToR inter-switch links. The only way around this is to implement a pure layer-3 equal-cost multi-path (ECMP) routing solution between host and ToR.
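To make the MAC flapping point concrete, here is a minimal Python sketch of a switch's MAC learning table. It is a toy model, not how any particular switch implements learning; the class name, port names (swp1, swp2) and MAC addresses are made up for illustration.

```python
class MacTable:
    """Toy model of a switch's MAC learning table (illustrative only)."""

    def __init__(self):
        self.entries = {}  # MAC address -> port it was last learned on
        self.flaps = []    # (mac, old_port, new_port) records

    def learn(self, mac, port):
        """Learn a source MAC on a port; record a flap if the port changed."""
        old_port = self.entries.get(mac)
        if old_port is not None and old_port != port:
            self.flaps.append((mac, old_port, port))
        self.entries[mac] = port


# The same source MAC seen alternately on two ports: every move is a flap.
shared = MacTable()
for port in ("swp1", "swp2", "swp1"):
    shared.learn("00:00:5e:00:53:01", port)

# A separate source MAC per link: each MAC stays on its own port.
distinct = MacTable()
distinct.learn("00:00:5e:00:53:01", "swp1")
distinct.learn("00:00:5e:00:53:02", "swp2")
distinct.learn("00:00:5e:00:53:01", "swp1")

print("flaps with a shared MAC:", len(shared.flaps))
print("flaps with distinct MACs:", len(distinct.flaps))
```

With a distinct MAC per link the table stays stable, but the host then has two MAC addresses for one IP address, which is exactly the problem described above.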
Pure Layer-3 Solution Complexities
Firstly, we cannot have one IP address with two MAC addresses. To overcome this, we use additional Linux features. Linux supports unnumbered interfaces, permitting the assignment of the same IP address to both interfaces: one IP address for two physical NICs. Next, we assign a /32 anycast IP address to the host via a loopback interface.
Secondly, the end hosts need to send to a next hop that is not on a shared subnet. Linux allows you to specify an attribute on the default route called “on link” (the onlink flag in iproute2). This attribute tells the end host: “I might not be on a directly connected subnet with the next hop, but trust me, the next hop is on the other side of this link”. It basically forces the host to send ARP requests for the next hop, regardless of subnet assignment. Together, these techniques enable the assignment of the same IP address to both interfaces and permit forwarding to a default route out of both interfaces. Each interface is in its own broadcast domain. Subnets can span two ToR switches without requiring bonding or an inter-switch link.
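The host-side pieces above can be sketched as a small Python helper that only builds iproute2 command strings and executes nothing. Treat it as an illustration under assumptions rather than the Cumulus recipe: the function name, interface names and addresses are hypothetical, and the exact commands a given distribution needs may differ.

```python
def host_l3_commands(host_ip, uplinks):
    """Return iproute2 command strings for a dual-homed layer-3 host.

    host_ip: the host's single address (hypothetical example value below).
    uplinks: list of (interface, tor_next_hop) pairs (hypothetical values).
    Nothing is executed; the caller decides what to do with the strings.
    """
    # The /32 lives on the loopback so it survives the loss of either NIC.
    cmds = [f"ip addr add {host_ip}/32 dev lo"]
    for ifname, _ in uplinks:
        # Reuse the same /32 on each NIC: one IP address, two physical links.
        cmds.append(f"ip addr add {host_ip}/32 dev {ifname}")
        cmds.append(f"ip link set dev {ifname} up")
    # One multipath default route; "onlink" tells the kernel to trust that
    # each next hop sits on the far side of the link despite no shared subnet.
    hops = " ".join(
        f"nexthop via {gw} dev {ifname} onlink" for ifname, gw in uplinks
    )
    cmds.append(f"ip route add default {hops}")
    return cmds


# Example with documentation-range addresses (assumed, not from the podcast).
commands = host_l3_commands(
    "192.0.2.10",
    [("eth0", "198.51.100.1"), ("eth1", "198.51.100.2")],
)
for line in commands:
    print(line)
```

Keeping the helper free of side effects makes it easy to review the generated commands before running them on a real host.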
Return traffic is slightly different and depends on what the ToR advertises back to the network. There are two modes. In the first, the ToR advertises a /24 to the rest of the network, and everything works fine until a server-to-ToR link fails. It then becomes a layer-2 problem, because the ToR has already told the network it can reach the whole subnet, so return traffic has to traverse a ToR inter-switch link to get back to the server. That goes against the earlier design requirement of removing ToR inter-switch links. So you need to opt for the second mode and advertise a /32 for each host back into the network. Take the information learnt in ARP, treat it as a host routing protocol and redistribute it into the data centre routing protocol, i.e. redistribute ARP. The ARP table gives you the list of neighbors, and redistribution pushes those entries into the routed fabric as /32 host routes, so only the /32s that are active and present in the ARP table are advertised. Note that this is not the default mode and is currently an experimental feature.
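The redistribution idea can be sketched in a few lines of Python. This is a conceptual model only, not the Cumulus implementation: the neighbor-entry shape, the set of states treated as live, and the ToR and port names are all assumptions made for the example.

```python
import ipaddress


def host_routes_from_arp(neighbors, server_ports):
    """Turn live ARP entries on server-facing ports into /32 host routes.

    neighbors: list of dicts with "ip", "dev" and "state" keys (a
    hypothetical shape, loosely modelled on a neighbor table).
    server_ports: set of port names that face servers.
    """
    # Assumption: these states count as "present"; FAILED and INCOMPLETE do not.
    live_states = {"REACHABLE", "STALE", "DELAY", "PROBE"}
    routes = set()
    for entry in neighbors:
        if entry["dev"] in server_ports and entry["state"] in live_states:
            routes.add(ipaddress.ip_network(entry["ip"] + "/32"))
    return routes


# The server's link to tor1 has failed, so only tor2 still has a live entry.
tor1_arp = [{"ip": "192.0.2.10", "dev": "swp1", "state": "FAILED"}]
tor2_arp = [{"ip": "192.0.2.10", "dev": "swp1", "state": "REACHABLE"}]

# Map each advertised host route to the ToRs that advertise it into the fabric.
rib = {}
for tor, arp in (("tor1", tor1_arp), ("tor2", tor2_arp)):
    for route in host_routes_from_arp(arp, {"swp1"}):
        rib.setdefault(route, []).append(tor)

for route, tors in rib.items():
    print(route, "via", tors)
```

Because tor1 no longer advertises the /32, the fabric sends return traffic straight to tor2, and no inter-switch link is needed.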
Additional information on layer-3 data centre for Cumulus at IPspace.net