IGP Control Plane Convergence
What is the longest the network could take to converge?
A networking device is tasked with two planes of operation: the control plane and the data plane. The job of the data plane is to switch traffic across the router's interfaces as fast as possible, i.e. the moving of packets. The control plane has the more complex job of building the state the data plane needs to operate efficiently. The network's control plane is tasked with finding the best path from any given source to any given network destination. To do this efficiently, it must react quickly and adapt dynamically to changes in the network. I found two similar definitions of convergence.
“Convergence is the amount of time (and thus packet loss) after a failure in the network and before the network settles into a steady state.” Also, “Convergence is the amount of time (and thus packet loss) after a failure in the network and before the network responds to the failure.”
The difference between the two is very subtle but very important – steady state vs just responding.
The control plane's reaction to topology changes can be separated into the four parts below. Each area must be addressed individually, as leaving one out results in slow convergence and application time-outs. Follow through each of the components and ask yourself the corresponding questions:
a) Detecting the topology change. How long does it take to detect the failure?
b) Notifying the rest of the network about the change. How long does it take to describe the failure?
c) Calculating a new best path. How long does it take to find another loop-free route?
d) Switching to the new best path. How long does it take to switch to the alternate route?
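The four questions above add up to a simple budget: total convergence time is roughly the sum of the four phases, so the slowest phase dominates. The sketch below is only an illustration of that idea; the class name, field names and the sample timings are hypothetical and not measurements from any platform.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Per-phase convergence times in seconds (illustrative values only)."""
    detect_s: float   # a) detecting the topology change
    notify_s: float   # b) notifying the rest of the network
    compute_s: float  # c) calculating a new best path
    switch_s: float   # d) switching to the new best path

    def total(self) -> float:
        # Assumes the phases run one after another with no overlap.
        return self.detect_s + self.notify_s + self.compute_s + self.switch_s

    def slowest_phase(self) -> str:
        phases = {
            "detect": self.detect_s,
            "notify": self.notify_s,
            "compute": self.compute_s,
            "switch": self.switch_s,
        }
        return max(phases, key=phases.get)


# Hypothetical numbers chosen purely for the example.
budget = ConvergenceBudget(detect_s=0.15, notify_s=0.05, compute_s=0.2, switch_s=0.1)
print(f"total: {budget.total():.2f}s, slowest phase: {budget.slowest_phase()}")
```

Tuning a phase that is already fast buys little; the phase returned by `slowest_phase()` is where the effort should go first.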
Before we dive into each individual step, I would like to share convergence times for each routing protocol. The times displayed below are from a Cisco Live session and are based on real-world case studies and field research. The steps described above are grouped into the following fields: Detect, Describe, Find Alternative and Total Time.
|Detect||<1 second best, 105 seconds average||<1 second best, 20 seconds average||<1 second best, 15 seconds average, 30 seconds worst|
|Describe||15 seconds average, 30 seconds worst||1 second best, 5 seconds average||2 seconds|
|Find Alternative||15 seconds average, 30 seconds worst||1 second average||*** <500 ms per query hop average; assume a 2 second average|
|Total Time||Best Average Case: 31 seconds; Average Case: 135 seconds; Worst Case: 179 seconds||Best Average Case: 2 to 3 seconds; Average Case: 25 seconds; Worst Case: 45 seconds||Best Average Case: <1 second; Average Case: 20 seconds; Worst Case: 35 seconds|
*** The alternate route is found before the describe phase; this is due to the feasible successor design in EIGRP route path selection.
EIGRP is the fastest, but only marginally. This is because EIGRP has a pre-built loop-free path known as a feasible successor (FS). The FS is a route with a higher metric than the successor whose reported distance is still lower than the successor's feasible distance, which makes it a loop-free backup to the successor route. The effect of a pre-computed backup route on convergence is that EIGRP can react locally to a change in the network topology, and nowadays this is usually done in the FIB. Without a feasible successor, EIGRP would have to query for the alternative route, increasing convergence time. You can, however, use Loop-Free Alternate (LFA) with OSPF to pre-compute an alternate path, but LFA only works with certain topologies and does not guarantee against micro-loops (EIGRP does guarantee against micro-loops).
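To make the feasible successor idea concrete, here is a minimal sketch of the EIGRP feasibility condition: a neighbor qualifies as a backup only if the distance it reports to the destination is lower than the feasible distance of the best path. The sketch deliberately simplifies things: metrics are plain additive integers rather than the EIGRP composite metric, the feasible distance is taken as the current best total, and the names `Neighbor` and `select_paths` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Neighbor(NamedTuple):
    name: str
    reported_distance: int  # the neighbor's own metric to the destination
    link_cost: int          # cost of the local link towards that neighbor


def select_paths(neighbors: List[Neighbor]) -> Tuple[Neighbor, List[Neighbor]]:
    """Return (successor, feasible_successors) for one destination."""
    by_total = sorted(neighbors, key=lambda n: n.reported_distance + n.link_cost)
    successor = by_total[0]
    feasible_distance = successor.reported_distance + successor.link_cost
    # Feasibility condition: reported distance strictly below the feasible
    # distance, so the neighbor cannot be routing back through this router.
    feasible = [n for n in by_total[1:] if n.reported_distance < feasible_distance]
    return successor, feasible


# Hypothetical topology: three neighbors offering a path to the same prefix.
neighbors = [
    Neighbor("A", reported_distance=10, link_cost=5),   # total 15, best path
    Neighbor("B", reported_distance=12, link_cost=10),  # total 22, RD 12 < 15
    Neighbor("C", reported_distance=20, link_cost=5),   # total 25, RD 20 >= 15
]
successor, backups = select_paths(neighbors)
print(successor.name, [n.name for n in backups])
```

In this example B can be installed as a backup immediately, while C fails the check even though it is a valid path; losing both A and B would force the router to query its neighbors.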
Ask yourself a question: is < 1 second convergence fast enough for today's applications? For some non-critical applications that run on top of TCP, the answer is certainly yes. TCP has built-in back-off algorithms that deal with packet loss by retransmitting to recover lost segments. But non-bulk data applications such as video and VoIP have much stricter requirements and need fast convergence and minimal packet loss. A 5 second delay in routing protocol convergence could mean several hundred voice calls being dropped. A 50 second delay on a fully utilized Gigabit Ethernet link implies about 6.25 GB of information lost.
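The 6.25 GB figure is straightforward arithmetic: link rate in bits per second, multiplied by the outage duration, divided by eight bits per byte. A small sketch, assuming the link is fully utilized for the whole outage and using decimal gigabytes (the function name is just for illustration):

```python
def bytes_lost(link_bps: float, outage_s: float) -> float:
    """Bytes that could not be delivered while the path was down.

    Assumes the link was running at line rate for the entire outage.
    """
    return link_bps * outage_s / 8


lost = bytes_lost(link_bps=1e9, outage_s=50)  # Gigabit Ethernet, 50 s outage
print(f"{lost / 1e9:.2f} GB lost")
```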
To add resilience to a network you can aim to make the network redundant. When you add redundancy, you are essentially betting that outages of the original path and the backup path will not occur at the same time, and that the primary path does not fate-share with the backup path (they do not share common underlying infrastructure, i.e. physical conduits or power). There needs to be a limit on the number of links you add to make your network redundant: adding 50 extra links does not make your network 50 times more redundant. It actually does the opposite! The control plane is tasked with finding the best path and must react to modifications in the network as quickly as possible. However, every additional link you add slows down the convergence of the router's control plane, as there is additional information to compute, resulting in longer convergence times. The correct number of backup links is a trade-off between redundancy and availability. The optimal level of redundancy between two points is two or three links. A fourth link would actually make the network converge more slowly.
Routing protocol algorithms can be tweaked to back off exponentially and deal with bulk information, but no matter how much timer tweaking you do, more information in the routing databases results in longer convergence times. The primary way to reduce network convergence time is to reduce the size of your routing tables, either by accepting just a default route, creating a flooding domain boundary or using some other configuration method. For example, a common approach in OSPF to reduce the size of routing tables and the flooding domain is to create OSPF stub areas, which limit the amount of information in the area. EIGRP limits the query domain by creating EIGRP stub routers and by intelligently designing aggregation points.
Now let's revisit the components of routing protocol convergence:
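As a small illustration of how aggregation shrinks the amount of reachability information, the sketch below collapses four contiguous /24 prefixes into a single summary using Python's standard `ipaddress` module. The prefixes are made up for the example, and this only shows the address arithmetic, not how a router is configured to advertise a summary.

```python
import ipaddress

# Four hypothetical specific routes behind one aggregation point.
specifics = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# collapse_addresses merges adjacent and overlapping networks.
summary = list(ipaddress.collapse_addresses(specifics))

print(f"{len(specifics)} routes advertised as {len(summary)}: {summary[0]}")
```

Routers beyond the aggregation point carry one entry instead of four, and a flap of any single /24 no longer has to be propagated and recomputed past that point.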
1. Failure detection
2. Failure propagation ( flooding, etc.) IGP Reaction.
3. Topology/Routing calculation. IGP Reaction.
4. Update of the routing and forwarding table ( RIB & FIB)
The first and foremost problem facing the control plane is quickly detecting topology changes. Detecting the failure is the most critical and most challenging part of network convergence, and it can occur at different layers of the OSI stack: Physical Layer (Layer 1), Data Link Layer (Layer 2), Network Layer (Layer 3) and Application Layer (Layer 7). Many techniques are used to detect link failures, but they generally come down to two basic types:
Event-driven notification – loss of carrier, or one element of the network detecting a failure and notifying the other network elements.
Polling-driven notification – generally HELLO protocols that test the path for reachability, such as Bidirectional Forwarding Detection (BFD).
Event-driven notifications are always preferred over polling-driven notifications, as the latter have to wait for several missed polls (typically three) before declaring a path down. However, there are cases where you have multiple Layer 2 devices in the path and a HELLO polling system is the only method that can be used to detect a failure.
Layer 1: Ethernet mechanisms like auto-negotiation (1 GigE) and link fault signalling (10 GigE 802.3ae / 40 GigE 802.3ba) can signal local failures to the remote end.
But the challenge is getting the signal across an optical cloud, as relaying the fault information to the other end is not always possible. When there is a “bump” in the Layer 1 link, the remote end cannot always detect the failure. In this case the Ethernet link fault signalling gets lost in the service provider's network. The actual link-down / interface-down event detection is hardware-dependent. Older platforms such as the 6704 line cards for the Catalyst 6500 used a per-port polling mechanism, which resulted in a link failure detection time of about 1 second. More recent Nexus switches and the latest Catalyst 5600 line cards have an interrupt-driven notification mechanism, resulting in very fast and predictable detection of link failure.
Layer 2: Layer 2 detection mechanisms kick in if Layer 1 mechanisms don't. Unidirectional Link Detection (UDLD) is a Cisco proprietary lightweight Layer 2 failure detection protocol designed for detecting one-way connections caused by physical or soft failures and mis-wirings.
UDLD is a fairly slow protocol: it uses an average message interval of 15 seconds and takes an average of 21 seconds for detection. These times have raised questions about its use in today's data centers. The chance of mis-wiring is now very small, physical unidirectional failures are always communicated by Layer 1 mechanisms, and STP Bridge Assurance takes care of soft failures in either direction. STP Bridge Assurance turns STP into a bidirectional protocol and ensures that spanning tree never fails open and only fails closed. Failing open means that if a switch does not hear from its neighbor, it immediately starts forwarding on originally blocked ports, causing havoc in the network.
Layer 3: In some cases failure detection has to rely on HELLO protocols at Layer 3. This is needed when there are intermediate Layer 2 hops over Layer 3 links and when you have concerns about unidirectional failures on point-to-point physical links.
All Layer 3 protocols use HELLOs to maintain neighbor adjacency and a DEAD time to declare a neighbor dead. These timers can be tuned for faster convergence, but this is generally not recommended because of the increase in CPU utilization, which can cause false positives, and the challenges faced with ISSU and SSO. Enabling Bidirectional Forwarding Detection (BFD) as the Layer 3 detection mechanism is strongly recommended over aggressive protocol timers, and you should try to use BFD for all protocols. BFD is a lightweight hello protocol designed for sub-second Layer 3 failure detection and can run over multiple transport protocols such as MPLS, TRILL, IPv6 and IPv4, making it the preferred method for Layer 3 failure detection.
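The gap between protocol hellos and BFD comes down to simple multiplication: a polling-based detector declares a failure after a number of consecutive missed polls, so detection time is roughly the poll interval times that multiplier. The sketch below compares commonly cited OSPF broadcast defaults (10 second hello, dead interval of four hellos) with an illustrative BFD setting of 50 ms times 3; treat both sets of values as assumptions, since actual defaults and supported minimums vary by platform.

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Worst-case time to declare a neighbor down with a polling protocol."""
    return interval_ms * multiplier


# Assumed values for illustration; check your platform's real defaults.
ospf_default = detection_time_ms(interval_ms=10_000, multiplier=4)
bfd_example = detection_time_ms(interval_ms=50, multiplier=3)

print(f"OSPF hello/dead: {ospf_default} ms")
print(f"BFD 50 ms x 3:   {bfd_example} ms")
print(f"BFD detects roughly {ospf_default // bfd_example}x faster in this example")
```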
When a change occurs in the network topology, it needs to be registered with the local router and transmitted throughout the rest of the network. The change information is transmitted differently by link state and distance vector protocols. Link state protocols must flood information to every device in the network, while distance vector protocols must process the topology change at every hop through the network. The processing of information at every hop may lead you to conclude that link state protocols will always converge more quickly than distance vector protocols, but this isn't really the case: EIGRP, thanks to its pre-computed backup path, can converge more quickly than any link state protocol when a feasible successor is available.
To propagate topology changes as quickly as possible, OSPF (link state) can group changes into a few LSAs while slowing down the rate at which information is flooded, i.e. it doesn't flood on every change every time. This is accomplished with link state flood timer tuning combined with exponential back-off, for example the link-state advertisement delay / initial link-state advertisement throttle delay. No such timers exist for distance vector protocols. Reducing the size of the routing table is the only option for EIGRP, and this can be done with the aggregation and filtering of reachability information (summary routes or stub routers).
Topology and routing calculation is similar to the second step: this is where link state protocols use exponential back-off timers. These timers adjust the amount of time OSPF and IS-IS wait after receiving new topology information before calculating the best path. More information about this, and in particular Fast Reroute, in my next post.
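The exponential back-off behaviour can be sketched as follows. The model is a simplified version of the start / hold / maximum style of throttle timer: the first calculation runs after a short initial delay, and each further calculation triggered while the network is still unstable waits twice as long as the previous one, up to a ceiling. The function name and the sample values are hypothetical, and real implementations also reset the timer after a quiet period, which is omitted here.

```python
from typing import List


def spf_wait_times(start_ms: int, hold_ms: int, max_ms: int, events: int) -> List[int]:
    """Wait before each of `events` back-to-back SPF runs (simplified model).

    Assumes every new topology change arrives before the network has been
    quiet long enough for the timer to reset.
    """
    waits = []
    current_hold = hold_ms
    for i in range(events):
        if i == 0:
            waits.append(start_ms)  # react fast to the first change
        else:
            waits.append(min(current_hold, max_ms))
            current_hold = min(current_hold * 2, max_ms)  # double, capped
    return waits


# Hypothetical timer values: 50 ms start, 200 ms hold, 5 s maximum.
print(spf_wait_times(start_ms=50, hold_ms=200, max_ms=5000, events=7))
```

The design goal is to converge quickly on an isolated failure while protecting the CPU during a storm of changes, since each additional SPF run is delayed further.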
Update of the routing and forwarding table ( RIB & FIB)
Finally, after the topology information has been flooded through the network and a new best path has been calculated, the new best path must be installed in the Forwarding Information Base (FIB). The FIB is basically a copy of the RIB optimized for forwarding, and the forwarding process finds it much easier to read than the RIB. The installation is usually done in hardware, and most vendors offer features that install a pre-computed backup path in the line card's forwarding table, so the failover from the primary path to the backup path can be done in milliseconds and without interrupting the router's CPU.
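A pre-installed backup path can be pictured as a forwarding entry that already holds two next hops, so that a link failure only flips which one is used instead of waiting for a new route calculation. The sketch below is a toy model in software; the class name, prefixes and interface names are hypothetical, and real line cards implement this in hardware.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class FibEntry:
    primary: str                  # outgoing interface of the best path
    backup: Optional[str] = None  # pre-computed loop-free alternate, if any


def next_hop(entry: FibEntry, down: Set[str]) -> Optional[str]:
    """Pick the outgoing interface without recomputing any routes."""
    if entry.primary not in down:
        return entry.primary
    if entry.backup is not None and entry.backup not in down:
        return entry.backup
    return None  # no usable path until the control plane reconverges


# Hypothetical forwarding table.
fib: Dict[str, FibEntry] = {
    "10.0.0.0/24": FibEntry(primary="eth0", backup="eth1"),
    "10.0.1.0/24": FibEntry(primary="eth0"),
}

down_interfaces = {"eth0"}  # local link failure detected
for prefix, entry in fib.items():
    print(prefix, "->", next_hop(entry, down_interfaces))
```

The first prefix keeps forwarding over `eth1` immediately, while the second is blackholed until the routing protocol finishes the detect, describe and calculate steps.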