Transmission Control Protocol (TCP) offers applications a reliable byte stream, with congestion control mechanisms adjusting flows to the current network load. Designed in the 1970s, TCP is the most widely used transport protocol and remains largely unchanged, unlike the networks it operates within. Its designers understood that links could fail and decided to decouple the network layer (IP) from the transport layer (TCP). This lets IP route around link failures without breaking the end-to-end TCP connection. Dynamic routing protocols do this automatically, with no knowledge required at the transport layer.
Despite its wide adoption, TCP does not fully align with the multipath characteristics of today’s networks. Its main drawback is that it is a single-path-per-connection protocol: once the stream is placed on a path between the endpoints of the connection, it cannot be moved to another path, even though multiple paths may exist between the peers. This is suboptimal, as most of today’s networks and end hosts are multipath-capable for better performance and robustness.
The ability to use multiple paths for a single TCP session increases resource usage and resilience. This is achieved with extensions to regular TCP that carry a connection across multiple links simultaneously. The core aim of Multipath TCP (MPTCP) is to allow a single TCP connection to use multiple paths simultaneously by using abstractions at the transport layer. Because it operates at the transport layer, its operation is transparent to the layers above and below; no network- or link-layer modifications are needed.
There is no need to make any changes to the network or the end hosts. The end hosts use the same socket API calls, and the network continues to operate as before. No special configuration is required, as MPTCP is a capability exchange between hosts. Multipath TCP is fully backwards compatible with regular TCP.
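This backwards compatibility is visible at the socket API. A minimal sketch, assuming a Linux host: Python 3.10+ exposes `socket.IPPROTO_MPTCP` (protocol number 262), and an application that requests it but runs on a kernel without MPTCP support can simply fall back to plain TCP — the same graceful fallback MPTCP itself performs on the wire.

```python
import socket

# IPPROTO_MPTCP (262 on Linux) is exposed by Python 3.10+; older
# interpreters may not define the constant, so look it up defensively.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def open_stream_socket():
    """Prefer an MPTCP socket, fall back to regular TCP."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        # Kernel without MPTCP support: same socket API, plain TCP.
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sock = open_stream_socket()
print(sock.type == socket.SOCK_STREAM)
```

Everything after `socket()` — `connect`, `send`, `recv` — is unchanged, which is exactly why applications need no modification.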
MPTCP binds a TCP connection between two hosts, not between two interfaces as regular TCP does. Regular TCP creates a connection between two IP endpoints, identified by source/destination IP address and port number, and the application has to choose a single link for the connection. MPTCP, however, creates new TCP connections known as subflows, letting the application use a different link for each subflow.
Subflows are set up the same way as regular TCP connections. Each consists of a flow of TCP segments operating over an individual path while remaining part of the overall MPTCP connection. Subflows are not fixed and may fluctuate in number during the lifetime of the parent Multipath TCP connection.
Multipath TCP Use Cases
Multipath TCP is particularly useful in multipath data centre and mobile phone environments. Most mobile phones can connect via both WiFi and the 3G network. MPTCP enables both combined throughput and the switching of interfaces (WiFi/3G) without disrupting the end-to-end TCP connection.
For example, if you are currently on a 3G network with an active TCP stream, the stream is bound to the 3G interface. If you want to move to the WiFi network, you need to reset the connection, and all ongoing TCP connections will therefore be reset. With MPTCP, the swapping of interfaces is transparent.
Next-generation leaf-and-spine data centre networks are built with Equal-Cost Multipath (ECMP). Within the data centre, any two endpoints are equidistant. When one endpoint communicates with another, a TCP flow is placed on a single link, not spread over multiple links. As a result, single-path TCP collisions may occur, reducing the throughput available to that flow. This is commonly seen with large elephant flows, not small mice flows.
In a data centre, when a server starts a TCP connection, it gets placed on a path and stays there. With MPTCP, instead of using a single path per connection, you can use many subflows per connection. If some of those subflows become congested, traffic is simply not sent over them, improving fairness and bandwidth utilisation.
The default behaviour of spreading traffic across a LAG or ECMP next hops is hash-based distribution of packets. An array of buckets is created, and each outbound link is assigned one or more buckets. Fields are taken from the outgoing packet header, such as source/destination IP address or MAC address, and hashed on this endpoint identification. The hash selects a bucket, and the packet is queued to the interface assigned to that bucket.
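A toy sketch of this bucket mechanism (the link names, bucket count, and hash choice are illustrative, not any vendor's implementation). The key property it demonstrates is that every packet of a given 5-tuple hashes to the same bucket, so an entire TCP flow is pinned to one link:

```python
import hashlib

LINKS = ["eth0", "eth1", "eth2", "eth3"]   # hypothetical LAG members
NUM_BUCKETS = 16

# Each bucket maps to one outbound link (simple round-robin assignment).
buckets = [LINKS[b % len(LINKS)] for b in range(NUM_BUCKETS)]

def select_link(src_ip, dst_ip, src_port, dst_port, proto=6):
    # Hash the 5-tuple; every packet of a flow hits the same bucket,
    # so a whole TCP flow stays on a single link regardless of load.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return buckets[digest % NUM_BUCKETS]

a = select_link("10.0.0.1", "10.0.0.2", 40000, 80)
b = select_link("10.0.0.1", "10.0.0.2", 40000, 80)
print(a == b)  # True: the same flow always lands on the same link
```

Note that nothing in `select_link` consults queue depth or drop counters, which is precisely the shortcoming discussed next.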
The issue is that this load-balancing algorithm does not take interface congestion or packet drops into account. With all mice flows this is fine, but once you mix mice and elephant flows together, performance suffers. An algorithm is needed to identify congested links and then reshuffle the traffic. MPTCP is a good fit when you have a mix of mice and elephant flows; generally, it does not improve performance in environments with only mice flows.
With small files, say 50 KB, MPTCP offers the same performance as regular TCP. As the file size increases, MPTCP usually matches link bonding. The benefits of MPTCP come into play when files are very big (300 KB). At this level, MPTCP outperforms link bonding, as its congestion control can balance the load over the links more efficiently.
MPTCP Connection Setup
The aim is a single TCP connection with many subflows. The two endpoints using MPTCP are synchronised and hold connection identifiers for each of the subflows.
MPTCP starts the same as regular TCP. If additional paths are available, additional TCP subflow sessions are combined into the existing TCP session. The original TCP session and the subflow sessions appear as one to the application, and the overall Multipath TCP connection looks like a regular TCP connection. Identifying additional paths boils down to the number of IP addresses on the hosts.
The TCP handshake starts as normal, but the first SYN carries a new MP_CAPABLE option (value 0x0) and a unique connection identifier. This allows the client to indicate that it wants to do MPTCP. At this stage, the application simply creates a standard TCP socket, with additional variables indicating that it wants to do MPTCP.
If the receiving server is MP_CAPABLE, it replies with a SYN/ACK carrying MP_CAPABLE along with its own connection identifier. Once the connection is agreed, the client and server set up state. Inside the kernel, this creates a meta socket acting as the layer between the application and all the TCP subflows.
When multiple paths are detected (based on IP addresses), the client starts a regular TCP handshake with the MP_JOIN option (value 0x1), using the server’s connection identifier. The server then replies to set up the subflow. New subflows are created, and as data is sent from the application to the meta socket, the scheduler schedules it over each of the subflows.
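On the wire, MP_CAPABLE and MP_JOIN are subtypes of a single TCP option (kind 30, assigned to MPTCP). A rough sketch of the SYN-time layouts from RFC 6824 (MPTCP v0) — the flag and key values below are illustrative placeholders, not captured traffic:

```python
import struct

TCPOPT_MPTCP = 30           # IANA-assigned TCP option kind for MPTCP
MP_CAPABLE, MP_JOIN = 0x0, 0x1   # option subtypes from the handshake

def mp_capable_syn(sender_key, version=0, flags=0x81):
    # kind, length=12, subtype<<4 | version, flags, 64-bit sender key.
    # flags=0x81 (checksum + HMAC-SHA1 bits) is just an example value.
    return struct.pack("!BBBBQ", TCPOPT_MPTCP, 12,
                       (MP_CAPABLE << 4) | version, flags, sender_key)

def mp_join_syn(addr_id, token, nonce):
    # kind, length=12, subtype<<4, address ID, 32-bit receiver token
    # (derived from the connection identifier), 32-bit sender nonce.
    return struct.pack("!BBBBII", TCPOPT_MPTCP, 12,
                       MP_JOIN << 4, addr_id, token, nonce)

opt = mp_capable_syn(sender_key=0x1234567890ABCDEF)
print(len(opt), opt[0], opt[2] >> 4)  # 12 30 0
```

The subtype nibble is what distinguishes "start an MPTCP connection" (0x0) from "join an existing one" (0x1), matching the option values described above.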
TCP sequence numbers
Regular TCP uses sequence numbers, enabling the receiving side to put packets back in the correct order before delivering them to the application. The sender can figure out which packets were lost by looking at the ACKs.
With MPTCP, packets travel over multiple paths, so sequence numbers are needed first to put packets back in order before they are passed to the application, and second to inform the sender of any packet loss on a path.
When an application sends data, each segment is assigned a data sequence number. TCP looks at the subflows to decide where to send the segment. When it sends on a subflow, it uses a subflow-level sequence number in the TCP header, while the data sequence number is carried in the TCP options.
The sequence number in the TCP header informs the sender of any packet loss on that subflow. The data sequence number is used by the receiver to reorder packets before delivering them to the application.
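The two-level numbering can be illustrated with a toy receiver (the subflow names and framing are invented for the example). Segments arrive on different subflows, each carrying a per-subflow sequence number and a connection-level data sequence number; delivery order depends only on the latter:

```python
# Toy receiver: segments arrive on two subflows, each carrying its own
# subflow sequence number plus a connection-level data sequence number
# (DSN). The meta level reorders by DSN before delivery, regardless of
# which subflow a segment used.
segments = [
    # (subflow, subflow_seq, data_seq, payload)
    ("wifi", 1, 3, b"C"),
    ("lte",  1, 1, b"A"),
    ("wifi", 2, 4, b"D"),
    ("lte",  2, 2, b"B"),
]

def deliver_in_order(segs):
    # Per-subflow loss detection would use subflow_seq; the byte stream
    # handed to the application is ordered by the data sequence number.
    return b"".join(p for _, _, _, p in sorted(segs, key=lambda s: s[2]))

print(deliver_in_order(segments))  # b'ABCD'
```

Note each subflow's own numbering (1, 2, ...) is gapless and local to that path, which is what lets standard per-path loss detection keep working.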
MPTCP Congestion Control
Congestion control was never a problem in circuit switching: resources are reserved at call setup to prevent congestion during data transfer, at the cost of considerable bandwidth underutilisation due to the reserved circuits.
We then moved to packet switching, where a link carries no reservations and flows can use as much of it as they want. This increases link utilisation but also the possibility of congestion. To address this, congestion control mechanisms were added to TCP, and similar mechanisms are employed for MPTCP.
Normal TCP congestion control maintains a congestion window for each connection; on each ACK, the window increases, and on a drop, the window is halved.
MPTCP operates in a similar way, maintaining one congestion window per subflow path. As with normal TCP, a drop on a subflow halves that subflow’s window. However, the increase rules differ from standard TCP: subflows with larger windows receive a larger increase, a larger window indicating lower loss. As a result, traffic moves dynamically from congested to uncongested links.
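A toy sketch in the spirit of the coupled ("linked increases") algorithm of RFC 6356: losses halve only the affected subflow's window, while the per-ACK increase is coupled across subflows so the connection as a whole is no more aggressive than a single TCP flow. Here `alpha` is taken as a fixed constant for simplicity; the real algorithm recomputes it from the subflows' windows and RTTs.

```python
# Per-subflow windows, coupled increase: a simplified model, not the
# kernel implementation.
def on_ack(cwnds, i, alpha=1.0):
    """ACK on subflow i: increase capped by the coupled term."""
    total = sum(cwnds)
    # min() of the coupled increase (alpha / total window) and the
    # uncoupled TCP increase (1 / own window), per RFC 6356's formula.
    cwnds[i] += min(alpha / total, 1.0 / cwnds[i])

def on_loss(cwnds, i):
    """Drop on subflow i: halve only that subflow's window."""
    cwnds[i] = max(cwnds[i] / 2.0, 1.0)

cwnds = [10.0, 10.0]       # two subflows
on_loss(cwnds, 0)          # drop on subflow 0: its window halves
on_ack(cwnds, 1)           # ACK on subflow 1: small coupled increase
print(cwnds[0], round(cwnds[1], 3))  # 5.0 10.067
```

Repeated over time, the lossy subflow's window stays small while the clean subflow keeps growing, which is the window-based mechanism behind traffic shifting away from congested links.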