Network Function Virtualization or Holy Cows?
This blog post is based on Ivan Pepelnjak’s recent webinar on Network Function Virtualization.
While Software Defined Networking (SDN) and Network Function Virtualization (NFV) are often mentioned in the same context, they serve separate functions in the network. NFV is used to program network functions such as network overlays, QoS and VPNs. SDN is used to program network flows. They have completely different heritages. SDN was born in academic labs and took root in the hyper-scale data centres of Google, Amazon and Microsoft. Its use case is now moving from the internal data centre to service provider and mobile networks. NFV, on the other hand, was pushed by service providers in 2012-2013, and the work is driven by a European Telecommunications Standards Institute (ETSI) working group. ETSI has proposed an NFV reference architecture and published a number of white papers and technology leaflets.
What is NFV? ASIC vs Intel x86 Processor
To understand NFV, consider the inside of both proprietary network devices and standard servers. The inside of a network device looks similar to that of a standard server. They share a number of common components, including flash, a PCI bus and RAM. Apart from the number of physical ports, the architecture is very similar. The ASICs (application-specific integrated circuits) are not as important as vendors would like you to believe. When buying a networking device, you are not paying for the hardware. The hardware is cheap; what you are actually paying for is software and maintenance. Hardware is the smallest component of the total price. So why can’t you run network services on Intel x86? Why is there a need to run these services on vendor-proprietary software? A general-purpose OS on x86 can perform just as well as some routers with dedicated silicon.
The concept of NFV is simple: deploy network services in VM format on generic, non-proprietary hardware. This increases network agility, as you can now deploy services in seconds rather than weeks. Faster time-to-deployment makes it possible to introduce new concepts and products at the speed the business needs from today’s networks. NFV also reduces the number of redundant devices. For example, why have two firewall devices in active/standby when you can insert or replace a failed firewall in seconds with NFV? It also simplifies the network and reduces the shared state in network components. Shared state is always bad for a network, and too much device complexity leads to what Ivan Pepelnjak calls device “holy cows”. A holy cow is a network device so ingrained in the network, with old and obsolete configurations, that it cannot be moved easily or cheaply (everything can be moved at a cost).
However, not everything can be expressed in software. You can’t replace a terabit switch with an Intel CPU. It may be cheaper to replace a top-end Cisco GSR or CRS with an Intel x86 server, but functionally it is far from practical. There will always be a requirement for hardware-based forwarding, and this is unlikely to change. But if your existing hardware already does its forwarding on an Intel x86 CPU, there is no reason why the function can’t run on generic hardware. Possible network functions for NFV include firewalls, Deep Packet Inspection (DPI) devices, Intrusion Detection Systems, SP CE and PE devices, and server load balancers. DPI is rarely done in hardware, so why can’t we put it on an x86 server? Load balancing can be scaled out across many virtual devices in a pay-as-you-grow model, making it an accepted NFV candidate. There is no need to put 20 IP addresses on one load balancer when you can easily scale out 20 independent load balancing instances in NFV format.
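The scale-out idea for load balancers can be pictured with a small sketch. It is not taken from the webinar: the class name, the `deploy_per_vip` helper, the round-robin policy and all addresses (documentation ranges) are assumptions made for illustration. It simply shows 20 small instances, each owning one virtual IP, instead of one appliance holding all 20.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class LoadBalancerInstance:
    """Hypothetical model of one small virtual load balancer owning one VIP."""
    name: str
    vip: str
    backends: list

    def __post_init__(self):
        # Round-robin is an assumed policy, chosen only to keep the sketch short.
        self._rotation = cycle(self.backends)

    def pick_backend(self):
        """Return the next backend server for a new connection."""
        return next(self._rotation)


def deploy_per_vip(vip_to_backends):
    """Create one independent instance per virtual IP (hypothetical helper)."""
    fleet = {}
    for number, (vip, backends) in enumerate(vip_to_backends.items(), start=1):
        fleet[vip] = LoadBalancerInstance(
            name=f"lb-{number:02d}", vip=vip, backends=list(backends)
        )
    return fleet


# Twenty services, each with its own VIP and two backend servers.
services = {
    f"203.0.113.{n}": [f"10.0.{n}.10", f"10.0.{n}.11"] for n in range(1, 21)
}
fleet = deploy_per_vip(services)
print(len(fleet), "independent instances deployed")
```

Because each instance holds state for a single service only, adding a twenty-first service means starting one more VM rather than reconfiguring a shared device.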
Control Plane Functionality
While the relevant use cases of NFV continue to evolve, an immediate and widely accepted use case is control plane functionality. Control plane functions don’t require intensive hardware-based forwarding. They provide reachability and control information for end-to-end communication.
For example, take the case of a BGP Route Reflector (RR) or a LISP Mapping Database. An RR does not participate in data plane forwarding. It is usually deployed in a redundant route reflector cluster that provides control plane services, reflecting routes from one node to another. It is not in the data transit path. We have been using proprietary vendor hardware as route reflectors for ages because it had the best BGP stack. But buying a high-end Cisco or Juniper device just to run RR control plane services is a waste of money and hardware resources. Why buy a router with good forwarding hardware when you only need the control plane software element? LISP Mapping Databases are commonly deployed on x86 rather than on a dedicated routing appliance. This is how the lispers.net open ecosystem mapping server is deployed. Any router needed only for control plane services can be put in a VM.
Control plane functionality is not the only NFV use case. Service providers are also implementing NFV for virtual PE and virtual CE devices. They offer per-customer services by building a unique service chain for each customer. Some providers want to allow customers to build their own service chain. With this, you can quickly create new services and measure their adoption rates to see whether anyone is buying the product, which is a great way to test new services.
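A per-customer service chain can be pictured as an ordered list of virtual network functions applied to each packet. The sketch below is only an illustration under stated assumptions: packets are modelled as plain dictionaries, and the `firewall`, `nat` and `build_chain` helpers, the customer names and the addresses are all hypothetical rather than any provider’s actual implementation.

```python
def firewall(blocked_ports):
    """Hypothetical VNF: drop packets destined to a blocked port."""
    def apply(packet):
        return None if packet["dst_port"] in blocked_ports else packet
    return apply


def nat(public_ip):
    """Hypothetical VNF: rewrite the source address to a public IP."""
    def apply(packet):
        # Return a copy so the caller's packet is left untouched.
        return {**packet, "src_ip": public_ip}
    return apply


def build_chain(*functions):
    """Compose VNFs in order; a function returning None drops the packet."""
    def run(packet):
        for function in functions:
            packet = function(packet)
            if packet is None:
                return None
        return packet
    return run


# Each customer gets its own chain, built from the same building blocks.
chains = {
    "customer-a": build_chain(firewall({23}), nat("203.0.113.1")),
    "customer-b": build_chain(nat("203.0.113.2")),
}

packet = {"src_ip": "10.0.0.5", "dst_port": 443}
print(chains["customer-a"](packet))
```

Offering a new service to one customer then amounts to adding one more function to that customer’s chain, which is what makes quick trials of new services practical.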
There are three elements relating to performance: management, control, and data plane. Management and control plane performance is not as critical as data plane performance. As long as you get decent protocol convergence times, it should be good enough. Data plane forwarding, by contrast, is critical for performance. The performance you get out of the box with an x86 device isn’t great, maybe 1 or 2 Gbps of forwarding. If you do something very simple like switching Layer 2 packets, the performance will increase to 2 or 3 Gbps per core. Unfortunately, this is considerably less than what the actual hardware can do. The hardware can push 50 to 100 Gbps through a mid-range server. So why is out-of-the-box performance so bad?
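To get a feel for these figures, it helps to convert throughput into packets per second and a per-packet time budget. The arithmetic below is a rough sketch resting on assumptions that are not in the webinar: minimum-size 64-byte Ethernet frames and 20 bytes of per-frame wire overhead (preamble plus inter-frame gap). The function names are made up for this example.

```python
# Assumed per-frame wire overhead: 8 bytes preamble/SFD + 12 bytes inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(gbps, frame_bytes=64):
    """Packets per second needed to fill a link of `gbps` with frames of `frame_bytes`."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return gbps * 1e9 / bits_on_wire


def nanoseconds_per_packet(gbps, frame_bytes=64):
    """Time budget for handling one packet at the given rate."""
    return 1e9 / packets_per_second(gbps, frame_bytes)


# Rates quoted in the text: out-of-the-box per core vs. what the hardware can push.
for rate in (2, 50, 100):
    pps = packets_per_second(rate)
    budget = nanoseconds_per_packet(rate)
    print(f"{rate:>3} Gbps: {pps / 1e6:7.2f} Mpps, {budget:6.2f} ns per packet")
```

Under these assumptions, 100 Gbps of minimum-size frames leaves roughly 6.7 ns per packet in total, which gives a sense of how little room there is for a long software forwarding path.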
The problem lies with the TCP/IP stack and the Linux kernel. The Linux kernel was never designed for high-speed packet forwarding. It doesn’t offer a great data plane but does offer excellent control plane functionality. To improve performance you may need multi-core processing. Sometimes the forwarding path taken by the virtual switch is so long that it kills performance, especially when tunnel encapsulation and decapsulation are involved. In the past, when you started to use Open vSwitch (OVS) with GRE tunnelling, performance fell drastically. VLANs never had this problem because they used a different code path. Now, with the latest version of OVS, performance is not an issue. It’s actually faster than most of its alternatives, such as the Linux bridge. The performance has increased due to architectural changes for multithreading, megaflows and additional classifier improvements. It can be optimized further with Intel DPDK, a set of enhanced libraries and drivers that enable kernel bypass and deliver impressive performance. Performance may also be gained by moving the hypervisor out of the picture with SR-IOV. SR-IOV takes a single physical NIC and slices it into multiple virtual NICs; you then connect a VM to one of the virtual NICs, allowing the VM to work with the hardware directly.
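The benefit of megaflows can be illustrated with a toy flow cache. This is a deliberately simplified sketch, not Open vSwitch code: the `FlowCache` class, the `slow_path` function and the fixed list of match fields are assumptions for illustration, whereas a real switch works out which fields to wildcard dynamically. The idea shown is that a cache entry matching only the fields the slow path actually consulted covers many more packets than an exact-match entry.

```python
class FlowCache:
    """Toy fast-path cache keyed on a chosen subset of packet fields."""

    def __init__(self, slow_path, match_fields):
        self.slow_path = slow_path        # full (expensive) classification
        self.match_fields = match_fields  # fields that form the cache key
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, packet):
        key = tuple(packet[name] for name in self.match_fields)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        # Cache miss: consult the slow path once and remember the result.
        self.misses += 1
        action = self.slow_path(packet)
        self.entries[key] = action
        return action


def slow_path(packet):
    """Hypothetical policy that only looks at the destination address."""
    return "output:2" if packet["dst_ip"] == "10.0.0.2" else "drop"


# 100 connections to the same destination, differing only in source port.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 1000 + n, "dst_ip": "10.0.0.2"}
    for n in range(100)
]

# Exact-match cache: every new source port is a miss.
microflow = FlowCache(slow_path, ("src_ip", "src_port", "dst_ip"))
# Wildcarded cache: safe here only because slow_path consults dst_ip alone.
megaflow = FlowCache(slow_path, ("dst_ip",))

for packet in packets:
    microflow.lookup(packet)
    megaflow.lookup(packet)

print("microflow misses:", microflow.misses)
print("megaflow misses:", megaflow.misses)
```

In this toy setup the exact-match cache takes the slow path for every connection, while the wildcarded cache takes it once and serves the rest from the fast path.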