Buffers and Packet Drops
Today’s data centres host a mixture of applications and workloads, all with different performance requirements. Some applications need predictable latency, while others need sustained throughput. It’s usually the case that the slowest flow is the ultimate factor determining end-to-end performance. So, to satisfy varied conditions and achieve predictable application performance, we must focus on consistent bandwidth and uniform latency for ALL flow types and workloads.
Poor performance is down to many factors, some of which can be controlled and some of which can’t. One factor that can be monitored is the buffer size in the network devices that interconnect source and destination points. Inadequate buffers cause bandwidth to be unfairly allocated among different flow types. Flows that do not receive adequate bandwidth exhibit long tails and long completion times, degrading performance.
The speed of a network is all about how quickly you can move a data file from one location to another and complete the transfer. Some factors you can influence; others, such as the physical distance between endpoints, you can’t control. This is why we see a lot of content distributed closer to the end user, for example with intelligent caching, to improve user response latency and reduce the cost of data transmission. TCP’s connection-oriented procedure affects application performance more for distant endpoints than it does for source-destination pairs internal to the data centre.
We can’t change the laws of physics, and distance will always be a factor, but there are ways to optimise networking devices to improve application performance. One way is to optimise the buffer sizes and select the right buffer architecture to support applications that send bursty traffic. There is an ongoing debate as to whether big or small buffers are best, and whether we need lossless transport or should simply drop packets.
How long a flow takes to complete is significantly affected by TCP congestion control and network device buffering. TCP was invented over 35 years ago and makes sure that sent blocks of data are received intact. It creates a logical connection between source-destination pairs, with endpoint addressing carried out at the lower IP layer. The congestion control element was added later to allow data transfers to speed up or slow down based on current network conditions.
Big Buffers vs Small Buffers
Both small and large buffer sizes have different effects on application flow types. Some sources claim that small buffers optimise performance, while others claim that larger buffers are better. Many of the web giants, including Facebook, Amazon, and Microsoft, use small-buffer switches. It depends on your environment. Understanding your application traffic patterns and testing optimisation techniques are essential to finding the sweet spot. Most out-of-the-box applications are not going to be fine-tuned for your environment, and the only rule of thumb is to lab test.
Complications arise when the congestion control behaviour of TCP interacts with the network device buffer. The two have different purposes. TCP congestion control continuously probes for available network bandwidth, using packet drops as its signal. Buffering, on the other hand, is used to avoid packet loss. In a congestion scenario, the traffic is buffered, so the sender and receiver have no way of knowing that there is congestion, and TCP’s congestion behaviour is never initiated. So the two mechanisms that are meant to improve application performance don’t complement each other and require careful testing for your environment.
The first step is to find your network’s threshold at which packets get dropped. Tools such as iperf3, tcpdump, and tcpprobe can be used to test and understand the effects on TCP. There is no point looking at a vendor’s reports and concluding that their “real world” testing characteristics fit your environment. They are only guides, and “real world” traffic tests are misleading. Usually, no standard RFC is used for vendor testing, and vendors will always try to make their own products appear better by tailoring the test (packet size, etc.) to suit their equipment. Recently, there were conflicting buffer testing results for both the Arista 7124S and the Cisco Nexus 5000.
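One practical way to hunt for that threshold is to run iperf3 with its `--json` flag at increasing rates and watch the retransmit counter, which climbs sharply once the buffer starts dropping. As a minimal sketch (the JSON sample below is a hypothetical, heavily trimmed report; real iperf3 output carries many more fields):

```python
import json

# Hypothetical, trimmed iperf3 --json report; real output has many more fields.
sample = """
{
  "end": {
    "sum_sent": {"bytes": 1250000000, "seconds": 10.0, "retransmits": 42},
    "sum_received": {"bytes": 1249000000, "seconds": 10.0}
  }
}
"""

def summarise(report: str) -> dict:
    """Pull throughput and retransmit counts out of an iperf3 JSON report."""
    sent = json.loads(report)["end"]["sum_sent"]
    return {
        "throughput_gbps": sent["bytes"] * 8 / sent["seconds"] / 1e9,
        # Retransmits rise sharply once the offered load crosses the drop threshold:
        "retransmits": sent["retransmits"],
    }

print(summarise(sample))
```

Repeating this while stepping up the target bandwidth gives you a rough drop threshold for your own fabric rather than a vendor’s.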
The Nexus 5000 works best when most ports are congested at the same time, while the Arista 7100 performs best when some ports are congested but not all. The fact is that these platforms have different buffer architectures; buffer size, buffer discipline, and buffer management all influence how you should test.
TCP Congestion Control
The discrepancy and uneven bandwidth allocation among flows boils down to how TCP naturally reacts to and interacts with insufficient packet buffers and the resulting packet drops. This behaviour is known as the TCP/IP bandwidth capture effect. It does not affect overall bandwidth so much as individual Query Completion Times (QCT) and Flow Completion Times (FCT) for applications. QCT and FCT are prime metrics for measuring TCP-based application performance.
A TCP stream’s pace of transmission is based on a built-in feedback mechanism. The ACK packets from the receiver adjust the sender’s rate to match the available network bandwidth. With each ACK received, the sender’s TCP incrementally increases the pace of sending packets to use all available bandwidth. On the other hand, it takes 3 duplicate ACK messages for TCP to conclude that a packet has been lost on the connection and start the retransmission process.
So, in the case of inadequate buffers, packets are dropped to signal the sender to ease its rate of transmission. TCP flows whose packets are dropped back off and naturally receive less bandwidth than the flows that do not back off. The flows that don’t back off get greedy and take more bandwidth. This causes some flows to receive more bandwidth than others in an unfair manner. By default, the decision of which flows get dropped and which are left alone is uncontrolled and made purely by chance.
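A toy additive-increase/multiplicative-decrease model makes the capture effect easy to see. In this deliberately biased sketch (all numbers are arbitrary illustrations, not real TCP parameters), two flows share a link, but only flow B happens to have its packets dropped at each congestion event, so flow A steadily captures most of the bandwidth:

```python
# Toy AIMD model of the bandwidth capture effect: two flows share a link,
# but by "chance" only flow B's packets are dropped at congestion events.
LINK = 100  # link capacity in arbitrary units

def simulate(rounds: int = 80):
    a, b = 10.0, 10.0        # current sending rates of flows A and B
    for _ in range(rounds):
        a += 1               # additive increase for both flows each round
        b += 1
        if a + b > LINK:     # congestion: only flow B sees the drop...
            b /= 2           # ...so only flow B performs multiplicative decrease
    return a, b

a, b = simulate()
print(f"flow A ≈ {a:.0f}, flow B ≈ {b:.0f}")  # flow A ≈ 90, flow B ≈ 7
```

Neither flow is misbehaving; the asymmetry comes purely from which flow the buffer happened to drop, which is exactly the unfairness described above.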
This is conceptually similar to the CSMA/CD bandwidth capture effect on shared Ethernet. Stations that collided with other stations on a shared LAN would back off and receive less bandwidth. This is not much of a problem these days, as all modern switches support full duplex.
DCTCP & Explicit Congestion Notification (ECN)
There is a newer variant of TCP called DCTCP that improves congestion behaviour. DCTCP relies on ECN to enhance the TCP congestion control algorithm.
On a recent episode of Ivan Pepelnjak’s IPspace podcast, JR Rivers from Cumulus explained that signalling congestion early and explicitly results in better performance and flow behaviour. DCTCP measures how often you experience congestion and uses that to determine how quickly it should reduce or increase its offered load, based on the level of congestion. DCTCP certainly reduces latency and provides fairer behaviour between streams. The approach JR recommends and is testing is to use DCTCP with both ECN and Priority Flow Control (pause environments).
Microbursts are small bursts of traffic lasting only a few microseconds, commonly seen in Web 2.0 environments. This traffic pattern is the opposite of what we see with storage traffic, which typically arrives in large, sustained bursts.
Bursts only become a problem and cause packet loss when there is oversubscription: many senders communicating with one receiver. This results in what is known as fan-in, which causes packet loss. Fan-in could be a communication pattern of, say, 23-to-1 or 47-to-1, in many-to-one unicast or multicast. All these sources send packets to one destination, causing congestion and packet drops. One way to overcome this is to have sufficient buffering.
It’s important for network devices to have sufficient packet memory bandwidth to handle these types of bursts. If they don’t have the required buffers, fan-in can increase end-to-end latency, degrading application performance. Latency is never good for application performance, but it’s still not as bad as packet loss. When the switch can buffer traffic correctly, packet loss is eliminated, and the TCP window can scale to its maximum size.
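The trade-off between loss and latency is simple arithmetic. In this back-of-the-envelope sketch (the 10 Gb/s egress rate, 1500-byte packets, and buffer sizes are assumptions for illustration), a 47-to-1 synchronised burst either overflows a shallow buffer or queues up in a deep one:

```python
# Back-of-the-envelope incast arithmetic: N synchronised senders each burst
# B packets of 1500 bytes into one 10 Gb/s egress port with S packets of buffer.
PKT_BITS = 1500 * 8   # assumed packet size on the wire
LINE_RATE = 10e9      # assumed 10 Gb/s egress port

def incast(senders: int, burst: int, buffer_pkts: int):
    arriving = senders * burst
    dropped = max(0, arriving - buffer_pkts)   # overflow is tail-dropped
    queued = min(arriving, buffer_pkts)
    # Worst-case queueing delay seen by the last buffered packet:
    delay_us = queued * PKT_BITS / LINE_RATE * 1e6
    return dropped, round(delay_us, 1)

print(incast(47, 16, buffer_pkts=256))   # shallow buffer: (496, 307.2) -> drops
print(incast(47, 16, buffer_pkts=1024))  # deeper buffer:  (0, 902.4)  -> no loss
```

The deeper buffer eliminates loss at the cost of roughly three times the queueing delay, which is exactly the latency-versus-loss trade described above.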
Mice & Elephant flows
There are two types of flows in data centre environments: large elephant flows and smaller mice flows. Elephant flows may represent only a small proportion of the number of flows but consume most of the total data volume.
Mice flows are, for example, control and alarm messages, and they are usually pretty significant. As a result, they should be given priority over larger elephant flows, but this is sometimes not the case with simple buffer types that don’t distinguish between flow types. Priority can be given by regulating the elephant flows with intelligent switch buffers.
Mice flows are often bursty: one query is sent to many servers, and many small responses are sent back to the single originating host. These messages are often small, requiring only 3 to 5 TCP packets. As a result, the TCP congestion control mechanism may never even be invoked, since detecting loss takes 3 duplicate ACK messages. Elephant flows, due to their size, will invoke the TCP congestion control mechanism; mice flows are simply too small to.
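The reason a mice flow can’t recover quickly is mechanical: fast retransmit needs 3 duplicate ACKs, and a duplicate ACK can only be generated by a packet sent *after* the lost one. A small sketch of that logic (the function and its parameters are illustrative, not a real TCP implementation):

```python
# Why a dropped packet in a 3-5 packet mice flow waits for a retransmission
# timeout (RTO): fast retransmit needs 3 duplicate ACKs, and each duplicate
# ACK is triggered by a packet sent after the lost one.
def recovery_mode(flow_pkts: int, lost_idx: int) -> str:
    dup_acks = flow_pkts - lost_idx - 1   # packets sent after the lost packet
    return "fast retransmit" if dup_acks >= 3 else "RTO timeout"

print(recovery_mode(flow_pkts=4, lost_idx=2))     # mice flow  -> "RTO timeout"
print(recovery_mode(flow_pkts=100, lost_idx=50))  # elephant   -> "fast retransmit"
```

An RTO is typically orders of magnitude longer than an RTT in the data centre, so a single dropped packet can dominate a mice flow’s completion time.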
Mice and elephant flows react differently when combined in a shared buffer. Simple buffers, whether small or deep, operate on a first-come, first-served basis and do not distinguish between flow sizes; every flow is treated equally. Bandwidth-aggressive elephant flows can quickly fill up the buffer and starve latency-sensitive mice flows, adding to their latency. Longer latency results in longer flow completion times, a prime metric for measuring application performance. Intelligent buffers, on the other hand, understand the types of flows and schedule accordingly. With intelligent buffers, elephant flows are given early congestion notification, and under stress, the mice flows are expedited. This offers a better living arrangement for both mice and elephant flows.
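The difference in flow completion time is easy to illustrate with a toy queue model (queue lengths and drain rate are arbitrary assumptions): an elephant has already filled the shared buffer with 100 packets when a 4-packet mouse arrives, and we compare a plain FIFO against a buffer that expedites mice:

```python
from collections import deque

# Toy comparison of a FIFO shared buffer vs one that expedites mice flows:
# an elephant has 100 packets queued when a 4-packet mouse arrives.
def mouse_fct(expedite_mice: bool, elephant_pkts: int = 100, mouse_pkts: int = 4) -> int:
    q = deque(["E"] * elephant_pkts)
    if expedite_mice:
        q.extendleft(["M"] * mouse_pkts)  # intelligent buffer: mice jump the queue
    else:
        q.extend(["M"] * mouse_pkts)      # FIFO: mice wait behind the elephant
    last_mouse = 0
    for t, pkt in enumerate(q, start=1):  # drain one packet per time unit
        if pkt == "M":
            last_mouse = t
    return last_mouse                     # time at which the mouse flow completes

print(mouse_fct(expedite_mice=False))  # FIFO:      104 time units
print(mouse_fct(expedite_mice=True))   # expedited:   4 time units
```

The elephant barely notices the reordering, but the mouse’s completion time drops by more than an order of magnitude, which is the “better living arrangement” in miniature.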
You first need to be able to measure your application performance and understand your traffic scenarios. Small-buffer switches are used for the most critical applications and do very well; you are unlikely to make a bad decision with small buffers. So it’s better to start by tuning your application. Out-of-the-box behaviour is generic and doesn’t take failures or packet drops into consideration. Understanding the application and then tuning hosts and network devices in an optimised leaf-and-spine fabric is the way forward. If you have a lot of incast traffic, then having large buffers on the leaf will benefit you more than having large buffers on the spine.