Today, Linux is no longer just a standalone operating system; it serves a variety of functions across the network. The number and type of applications the networking stack must support varies widely, from Android handsets to data centre routers and switches, both virtualized and bare metal. Some applications are outbound oriented, others inbound oriented. With such a broad spectrum of applications, it is hard for one networking solution to fit them all. This has put pressure on Linux networking to evolve and support a variety of application stacks with different network requirements. The challenge arises from the different expectations placed on an end host compared with a middle node running Linux. The Linux stack must behave differently in each of these roles.
Recap on Linux Networking Subsystem
The Linux system architecture consists of user space, the kernel, and the underlying hardware. At the top sits user space with its various user applications. In the middle, the kernel forwards packets and accepts instructions from user space. At the very bottom is the hardware, such as the CPU, RAM, and NIC. One way to communicate between user space and the kernel is via Netlink. A Netlink socket handles bidirectional communication between the two. It can be created in user space with the socket() system call or in the kernel with netlink_kernel_create(). The following shows a Netlink socket created in kernel and userspace.
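As a rough illustration of the user space side only, the sketch below builds an RTM_GETLINK dump request by hand and, on Linux, sends it over a NETLINK_ROUTE socket. The constant values and struct layouts are taken from linux/netlink.h and linux/rtnetlink.h; the function names build_getlink_request and dump_links_raw are made up for this example, and the reply is returned unparsed.

```python
import socket
import struct

# Layouts and constants as defined in <linux/netlink.h> and <linux/rtnetlink.h>.
NLMSG_HDR = struct.Struct("=IHHII")   # nlmsghdr: len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # ifinfomsg: family, pad, type, index, flags, change
RTM_GETLINK = 18
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300


def build_getlink_request(seq=1):
    """Return the bytes of a 'dump all links' request (hypothetical helper)."""
    payload = IFINFOMSG.pack(socket.AF_UNSPEC, 0, 0, 0, 0)
    length = NLMSG_HDR.size + len(payload)
    header = NLMSG_HDR.pack(length, RTM_GETLINK, NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + payload


def dump_links_raw():
    """Send the request to the kernel and return the first raw reply (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, 0))  # (pid, groups): pid 0 lets the kernel pick an address
        s.send(build_getlink_request())
        return s.recv(65536)
```

In practice a library such as pyroute2 or libnl would normally handle the message encoding; the manual packing here is only meant to show what travels over the socket.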
The Netlink protocol implementation resides under the net/netlink folder of the kernel source. af_netlink provides the Netlink kernel socket API, genetlink provides the generic Netlink API, and diag provides information about Netlink sockets.
The Linux networking subsystem is part of kernel space and is one of the most important subsystems. Even on a host that is not connected to a network, the networking subsystem is still used, for example for the client-server interaction of the X Window System.
The Linux kernel networking stack processes incoming packets from Layer 2 up to the network layer (Layer 3). Packets destined for the local system are then passed up for local delivery to the transport layer protocols (Layer 4), where applications listen on TCP or UDP sockets. Packets not destined for the local system are sent back down the stack for transmission. The kernel does not handle anything above Layer 4; all higher layers are handled by user space applications.
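The local delivery versus transmit decision described above can be sketched as a small function. This is a simplified model, not kernel code: classify_packet and its parameters are hypothetical names, and the forwarding_enabled flag is an assumption standing in for the host's IP forwarding setting.

```python
import ipaddress


def classify_packet(dst, local_addrs, forwarding_enabled=True):
    """Decide what happens to a packet based on its destination address.

    Returns 'local_deliver' when the destination is one of our own
    addresses, 'forward' when it is not and forwarding is allowed,
    and 'drop' otherwise.
    """
    dst_ip = ipaddress.ip_address(dst)
    local = {ipaddress.ip_address(a) for a in local_addrs}
    if dst_ip in local:
        return "local_deliver"   # handed up to TCP/UDP sockets
    if forwarding_enabled:
        return "forward"         # sent back down the stack for transmission
    return "drop"                # assumption: a plain host does not forward
```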
sk_buff and net_device
The sk_buff and net_device structures are fundamental to the networking subsystem. The network device driver (net_device structure) receives and transmits packets, either passing them up the stack (Layer 3 to Layer 4) or transmitting them on an outgoing interface. To determine the interface and the specific packet handling actions, a lookup in the routing subsystem is performed for every incoming and outgoing packet. Many things may affect packet traversal, such as Netfilter hooks, the IPsec subsystem, and the TTL. The sk_buff (socket buffer) represents a packet's data and headers. Packets are received from the wire by a NIC (netdevice), placed in an sk_buff, and then passed through the network stack.
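To make the per-packet routing lookup more concrete, here is a minimal longest-prefix-match sketch. The route table, interface names, and the lookup_route function are all hypothetical, and the linear scan is only for readability; the kernel uses a much more efficient lookup structure.

```python
import ipaddress

# Hypothetical routing table: (prefix, outgoing interface).
ROUTES = [
    ("0.0.0.0/0", "eth0"),
    ("10.0.0.0/8", "eth1"),
    ("10.1.0.0/16", "eth2"),
]


def lookup_route(dst, routes=ROUTES):
    """Return the interface of the most specific route matching dst, or None."""
    dst_ip = ipaddress.ip_address(dst)
    best = None
    for prefix, dev in routes:
        net = ipaddress.ip_network(prefix)
        # Keep the matching route with the longest prefix seen so far.
        if dst_ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None
```

With this table, a packet for 10.1.2.3 matches all three routes and leaves via eth2, the most specific match.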
Handling packets in user space can hurt performance. Every crossing of the user / kernel boundary has a cost, so an application that crosses it frequently pays heavily. You should minimize this by keeping as much processing as possible in the kernel and below, going to user space only briefly when needed. For example, transit traffic does not need to go to user space every time.
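One way to get a feel for the boundary cost is to push the same payload through many small system calls and then through one large one. The sketch below is an illustration under simple assumptions: write_in_chunks is a made-up helper, it writes to the null device so no real I/O is involved, and the timings also include Python interpreter overhead, so they should be read as relative rather than absolute.

```python
import os
import time


def write_in_chunks(payload, chunk_size):
    """Write payload to the null device; return (system calls made, seconds)."""
    fd = os.open(os.devnull, os.O_WRONLY)
    calls = 0
    start = time.perf_counter()
    try:
        for offset in range(0, len(payload), chunk_size):
            # Each os.write() is one crossing into the kernel and back.
            os.write(fd, payload[offset:offset + chunk_size])
            calls += 1
    finally:
        os.close(fd)
    return calls, time.perf_counter() - start


if __name__ == "__main__":
    data = b"x" * 65536
    for size in (1, 64, 65536):
        calls, seconds = write_in_chunks(data, size)
        print(f"chunk={size:>6} calls={calls:>6} time={seconds:.6f}s")
```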
Linux Networking and Android
Linux is used extensively as the base for Android phones. The Linux networking stack has different needs on mobile devices than on data centre devices. Phones move all the time and connect to different networks of varying quality. They are connected to multiple networks nearly all the time. If a device is on the Wi-Fi network and needs to send an SMS, it has to bring up the cellular network, which is on a different IP interface.
Users want all networks available at the same time, and the Linux stack must switch seamlessly across network boundaries. Normally in Linux, when you remove an IP address, the TCP connections using it stay in place in the hope that the address will come back. On a phone, the TCP connections have to be shot down instead, so that applications do not stay blocked on reads that will never complete. As a result, on every network switch, the TCP connections are closed.
Linux must also support per-application and per-socket routing, for example connecting to a wireless printer while on the cellular network. There is also a method to let users know if they are connecting to a Wi-Fi network that doesn't have a backhaul connection. To do this, Linux needs to use DNS and open a TCP connection on the backhaul network. For such a small device, the networking stack needs to handle many different functions.
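The backhaul check described above (resolve a name, then open a TCP connection) might look roughly like the following. This is a sketch under stated assumptions: has_backhaul is a hypothetical function, the host and port are supplied by the caller, and the socket is not pinned to a particular interface, which a real per-network check would need to do.

```python
import socket


def has_backhaul(host, port, timeout=2.0):
    """Return True if host resolves and a TCP connection to it succeeds."""
    try:
        # Step 1: DNS resolution.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    # Step 2: try a TCP connection to each resolved address.
    for family, socktype, proto, _canonname, addr in candidates:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)
                return True
        except OSError:
            continue
    return False
```

A successful TCP handshake only shows that the path works at Layer 4; detecting a captive portal would need an application-level probe on top of this.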
Linux Networking and the Data Center
Linux adoption has accelerated in the data centre, where it is the base for open source cloud environments. Many virtual switch functions are available with hardware offload for accelerated performance. The Linux kernel supports three types of software bridges: Bridge, MACVLAN, and Open vSwitch. There is also a NIC embedded switch solution with SR-IOV that may be used instead of the software switch. Recently, there have been many new bridge features, such as FDB manipulation, VLAN filtering, learning / flooding control, non-promiscuous bridge mode, and VLAN filtering for 802.1ad (Q-in-Q).
A typical packet processing pipeline of a switch includes:
- Packet parsing and classification – L2, L3, L4, tunnelling, VXLAN VNI, inner packet L2, L3, L4.
- Push/pop for VLAN or encapsulation / decapsulation for tunnelling.
- QoS related functions such as metering, shaping, marking, and scheduling.
- Switching operations.
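The first two stages of this pipeline can be sketched for the simplest case, an Ethernet frame with an optional 802.1Q tag. The helper names parse_l2, push_vlan, and pop_vlan are invented for this example, and only a single VLAN tag is handled; tunnel headers such as VXLAN are left out.

```python
import struct

ETH_P_8021Q = 0x8100  # EtherType that marks an 802.1Q VLAN tag


def parse_l2(frame):
    """Classify a frame by its Ethernet header and optional VLAN tag."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    info = {"dst": dst.hex(":"), "src": src.hex(":"), "vlan": None}
    if ethertype == ETH_P_8021Q:
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        info["vlan"] = tci & 0x0FFF  # low 12 bits: VLAN ID
        info["pcp"] = tci >> 13      # top 3 bits: priority
    info["ethertype"] = ethertype
    return info


def push_vlan(frame, vid, pcp=0):
    """Insert an 802.1Q tag after the source MAC address."""
    tag = struct.pack("!HH", ETH_P_8021Q, (pcp << 13) | (vid & 0x0FFF))
    return frame[:12] + tag + frame[12:]


def pop_vlan(frame):
    """Remove the outer 802.1Q tag if present; otherwise return frame as is."""
    if struct.unpack("!H", frame[12:14])[0] != ETH_P_8021Q:
        return frame
    return frame[:12] + frame[16:]
```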
The data plane is accelerated by decomposing the packet processing pipeline and offloading some stages to hardware ASICs. Layer 2 features that can be offloaded to an ASIC may include MAC learning and ageing, STP handling, IGMP snooping, and VXLAN. It is also possible to offload Layer 3 functions to ASICs.
Linux Switch Types
The bridge is a standard MAC and VLAN bridge, containing an FDB (forwarding database), STP (Spanning Tree Protocol), and IGMP functions. The bridge keeps a record of MAC-to-port mappings in the FDB. Building up the FDB is called “MAC learning” or simply the “learning process”.
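The learning process can be illustrated with a toy model. The LearningBridge class below is a hypothetical simplification: it keys the FDB on (VLAN, MAC), floods frames for unknown destinations, and leaves out ageing, STP, and IGMP entirely.

```python
class LearningBridge:
    """Toy model of MAC learning; not the kernel bridge implementation."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # (vlan, mac) -> port on which that MAC was last seen

    def receive(self, in_port, src_mac, dst_mac, vlan=1):
        """Learn the source address and return the list of output ports."""
        # Learning: remember which port the source MAC lives behind.
        self.fdb[(vlan, src_mac)] = in_port

        out_port = self.fdb.get((vlan, dst_mac))
        if out_port is None:
            # Unknown destination: flood to every port except the ingress.
            return sorted(self.ports - {in_port})
        if out_port == in_port:
            # Destination is on the segment the frame came from: filter it.
            return []
        return [out_port]
```

For example, the first frame from host A to host B is flooded; once B replies, the bridge has learned both addresses and forwards later frames to a single port.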
MACVLAN is a switch based on static MAC and VLAN assignments. It uses unicast filtering instead of promiscuous mode and supports a number of modes: private, VEPA, bridge, and passthru. MACVLAN can be thought of as a reverse VLAN under Linux. It takes a single interface and creates multiple virtual interfaces with different MAC addresses. Essentially, it enables the creation of independent logical devices over a single Ethernet device, a “many to one” relationship, in contrast to the “one to many” relationship where you map a single NIC to multiple networks. MACVLAN offers isolation in the sense that each virtual interface only sees traffic addressed to its own MAC address.
Open vSwitch (OVS) is a flow-based switch that performs MAC learning like the Linux bridge. It supports protocols like STP and, more importantly, OpenFlow. Its forwarding is based on flows: everything is forwarded according to a flow table. It is becoming the de facto software switch and has an impressive feature list, now including stateful services and connection tracking. It is also used in many complex use cases involving nested Open vSwitch designs with OVN (Open Virtual Network). By default, OVS acts as a learning switch and learns like a normal Layer 2 switch. For advanced operations, it can be connected to an SDN controller, or OpenFlow rules can be added manually from the command line.
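Flow-based forwarding can be pictured as a priority-ordered table of match/action entries. The FlowTable class below is a hypothetical sketch, not the OVS data structures: matches are plain dictionaries, actions are strings, and the default behaviour described above is modelled as a single lowest-priority entry whose action is "NORMAL" (ordinary learning-switch forwarding).

```python
class FlowTable:
    """Toy flow table: the highest-priority matching entry wins."""

    def __init__(self):
        self.flows = []  # list of (priority, match dict, action list)
        # Assumption: model the default learning-switch behaviour as a
        # catch-all entry with the lowest priority.
        self.add_flow(0, {}, ["NORMAL"])

    def add_flow(self, priority, match, actions):
        self.flows.append((priority, dict(match), list(actions)))
        # Stable sort keeps insertion order among equal priorities.
        self.flows.sort(key=lambda flow: flow[0], reverse=True)

    def lookup(self, packet):
        """Return the actions of the first entry whose fields all match."""
        for _priority, match, actions in self.flows:
            if all(packet.get(field) == value for field, value in match.items()):
                return actions
        return []  # no entry matched: drop
```

Adding an entry such as add_flow(100, {"in_port": 1, "eth_type": 0x0800}, ["output:2"]) overrides the default for IPv4 traffic arriving on port 1, while all other traffic still falls through to the catch-all.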