Removing State from Network Functions

We have seen a major evolution in technology where network functions can run in software on non-proprietary commodity hardware, be it a grey box or white box deployment model.

Taking network functions from a physical appliance and putting them into a virtual appliance is only half the battle. The move to software does provide on-demand elasticity and scale for network functions, as well as quick recovery from failures. However, we are still hindered by one major factor: the state that each network function needs to process.

We still face the challenges created by the tight coupling of state and processing in each individual network function, be it a firewall, load balancer, or intrusion prevention system (IPS). Having the state tightly coupled with the network function limits the network function's agility, scalability, and recovery from failures.
Compounding this, we have seen an increase in network complexity.

The rise of public cloud and the emergence of hybrid and multi-cloud have made data center connectivity more complicated and more critical than ever.

What is state

Before we delve into the potential ways to solve this problem, mainly with the introduction of stateless network functions, let us first describe the different types of state. We have two: dynamic and static.

The dynamic state is continuously updated by the network function processes. The dynamic state could be anything from connection information in a firewall to the server mappings in the load balancer. The static state, on the other hand, could include something like pre-configured firewall rules or the IPS signature database.

The dynamic state must persist across instance failures and be available to the network functions when they are scaling in or out. The static state, on the other hand, is the easy one: it can be replicated to a network function instance at boot time.
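To make the distinction concrete, here is a minimal Python sketch of the two kinds of state. The class and field names (StaticState, DynamicState, allowed_ports, connections) are illustrative assumptions and do not come from any particular product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticState:
    """Pre-configured data, e.g. firewall rules; copied to each instance at boot."""
    allowed_ports: frozenset


@dataclass
class DynamicState:
    """Per-flow data that changes as packets are processed."""
    connections: dict = field(default_factory=dict)

    def track(self, flow, status):
        self.connections[flow] = status


# Static state: every instance can hold an identical read-only copy.
rules = StaticState(allowed_ports=frozenset({80, 443}))
instance_a_rules = rules
instance_b_rules = rules

# Dynamic state: updated continuously, so a private copy is lost with
# the instance unless it is shared or persisted somewhere else.
table = DynamicState()
flow = ("10.0.0.5", 40000, "192.0.2.10", 443)  # hypothetical flow
table.track(flow, "ESTABLISHED")
```

The static rules never change after boot, so copying them is safe. The connection table changes with every new flow, which is why it is the hard part.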

Stateless network functions

Stateless Network Functions is a new and disruptive approach that splits the design of a network function into a stateless processing component and a separate data store layer.

There also needs to be some kind of orchestration layer that can monitor the network function instances for load and failure, and adjust the number of instances accordingly. Decoupling the state from a network function enables a more elastic and resilient infrastructure. So how does this work?

Well, from a 20,000-foot view, the network functions themselves become stateless. The statefulness of an application such as a stateful firewall is maintained by storing the state in a separate data store. The data store provides the resilience of the state. No state is stored on the individual network functions themselves.
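The idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a real implementation: DataStore is an in-memory stand-in for a remote, resilient key-value store, and the class names, method names, and addresses are all hypothetical.

```python
class DataStore:
    """In-memory stand-in for a remote, resilient key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class StatelessFirewall:
    """Firewall instance that holds no connection state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def outbound(self, src, dst):
        """Record an outbound connection attempt and let it through."""
        self.store.put((src, dst), "SYN_SENT")
        return True

    def inbound(self, src, dst):
        """Allow return traffic only if the matching outbound flow is known."""
        return self.store.get((dst, src)) is not None


store = DataStore()
fw_a = StatelessFirewall("fw-a", store)
fw_b = StatelessFirewall("fw-b", store)

client, server = ("10.0.0.5", 40000), ("192.0.2.10", 443)
fw_a.outbound(client, server)            # SYN leaves through fw-a
allowed = fw_b.inbound(server, client)   # SYN-ACK returns through fw-b
unsolicited = fw_b.inbound(("198.51.100.7", 80), client)
```

Because both instances read and write the same store, fw-b can admit the return packet even though it never saw the original SYN, while traffic with no matching entry is still rejected.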

Data store example

The data store can be, for example, RAMCloud. RAMCloud is a distributed key-value storage system that offers high-speed storage for large-scale applications.

It is purposely designed for situations where a large number of servers need low-latency access to a durable data store. RAMCloud keeps all data in DRAM. As a result, the network functions can read RAMCloud objects remotely over the network in as little as 5μs.
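A quick back-of-the-envelope calculation shows what a 5μs read means for a packet path. The sketch below assumes the worst case where each read waits for the previous one to finish (no batching or caching). The function names are made up for illustration, and the 5μs figure is the best case quoted above, not a measurement.

```python
READ_LATENCY_US = 5            # best-case remote read latency quoted above
US_PER_SECOND = 1_000_000


def max_serial_reads_per_second(latency_us):
    """Upper bound on reads per second when each read waits for the previous one."""
    return US_PER_SECOND // latency_us


def lookup_time_us(reads_per_packet, latency_us=READ_LATENCY_US):
    """Store time added to one packet that needs several sequential reads."""
    return reads_per_packet * latency_us


serial_limit = max_serial_reads_per_second(READ_LATENCY_US)   # 200000
two_read_cost = lookup_time_us(2)                             # 10 microseconds
```

Under these assumptions a single thread doing one blocking read per packet tops out at 200,000 lookups per second, so the number of store accesses per packet matters a great deal.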

Stateless network functions advantages

Stateless network functions may not be useful for all network functions but are useful for the common network functions that can be re-designed in a stateless manner.

A stateless design is useful for functions such as a stateful firewall, intrusion prevention system, network address translator, and load balancer. Removing the state and placing it in a data store brings a number of advantages to network management.

First, elasticity: as the state is accessed via a data store, a new instance can be launched and traffic immediately directed to it. Secondly, resilience: a new instance can be spawned instantaneously upon failure.

Finally, as an individual packet can be handled by any one of the instances, packets traversing different paths do not have issues with asymmetric and multi-path routing.
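The elasticity and resilience points rely on something adjusting the number of instances as load changes or instances fail. A minimal sketch of one such reconciliation pass follows. The function names, the "nf-N" naming scheme, and the packets-per-second capacity model are all assumptions made for illustration.

```python
import math


def desired_instances(load_pps, capacity_pps, minimum=1):
    """Number of instances needed to carry the offered load."""
    return max(minimum, math.ceil(load_pps / capacity_pps))


def reconcile(running, failed, load_pps, capacity_pps):
    """Return the instances that should be running after one pass."""
    healthy = [name for name in running if name not in failed]
    target = desired_instances(load_pps, capacity_pps)
    # Launch replacements or extra instances; with state held in the
    # data store, none of them needs a state transfer before taking traffic.
    next_id = 0
    while len(healthy) < target:
        name = f"nf-{next_id}"
        if name not in running:
            healthy.append(name)
        next_id += 1
    # Retire surplus instances when load drops.
    return healthy[:target]


after_failure = reconcile(["nf-0", "nf-1"], {"nf-1"}, 150_000, 100_000)
after_scale_out = reconcile(["nf-0", "nf-1"], set(), 250_000, 100_000)
after_scale_in = reconcile(["nf-0", "nf-1", "nf-2"], set(), 90_000, 100_000)
```

The loop only counts instances. It can afford to be this simple because any instance can serve any flow.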

Problems with having state: Failure

The majority of network designs have redundancy built in. It sounds easy: when one data center fails, let the secondary take over. When the data center interconnect (DCI) is configured properly, everything should work upon failover, correct?

Let’s not forget about one little thing called state. Suppose there is a firewall in each data center. The network address translation (NAT) function in the primary data center stores the mappings for two flows, let’s call them F1 and F2. Upon failure, the second firewall in the other data center takes over and traffic is directed to it.

However, any packets belonging to flows F1 and F2 will have no entry in the second firewall. The lookup fails and existing connections time out, causing application failure. Asymmetric routing causes the same kind of problem. If one firewall holds the state for a client-to-server connection (created by the SYN packet) and the return SYN-ACK passes through a different firewall, the lookup fails there and the packet is dropped.
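The failover case can be reproduced with a toy model. In this sketch each NAT instance keeps its translation table in local memory. The class name, the public addresses, and the starting port are hypothetical values chosen for illustration.

```python
class StatefulNat:
    """NAT instance that keeps its translation table in local memory."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.table = {}          # flow id -> (public ip, public port)
        self.next_port = 20000

    def translate(self, flow_id):
        """Create a mapping for a new outbound flow and return it."""
        mapping = (self.public_ip, self.next_port)
        self.table[flow_id] = mapping
        self.next_port += 1
        return mapping

    def lookup(self, flow_id):
        """Return the existing mapping, or None if this instance never saw the flow."""
        return self.table.get(flow_id)


primary = StatefulNat("203.0.113.1")
secondary = StatefulNat("203.0.113.2")

primary.translate("F1")
primary.translate("F2")

# The primary data center fails and traffic is redirected to the secondary.
after_failover = [secondary.lookup(f) for f in ("F1", "F2")]
```

Both lookups on the secondary return None: the mappings for F1 and F2 were lost along with the primary.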

Some have tried to design distributed active-active firewalls in an attempt to solve the layer 3 issues and the problem of asymmetric traffic flows over stateful firewalls. The solution looks perfect. Simply configure both wide area network (WAN) routers to advertise the same IP prefix to the outside world. This attracts inbound traffic and passes it through the nearest firewall. Nice and easy. The active-active firewalls would exchange flow information, solving the asymmetrical flow problems. Check out Ivan Pepelnjak's experience.

Distributed active-active firewall state across data centers works better in PowerPoint than in real life.

Problems with having state: Scaling

The tight coupling of state and processing can also cause problems when scaling network functions. Scaling out NAT functions has the same effect as a NAT box failure. Packets from a flow that was established on one firewall and are then directed to a new instance will result in a failed lookup.
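A small simulation shows why scaling out breaks existing flows. It assumes traffic is spread across instances by a hash of the flow, modeled here as a simple modulo on an integer flow id. That distribution scheme and the names below are illustrative assumptions; the article does not specify how flows are steered.

```python
def pick_instance(flow_id, instance_count):
    """Stand-in for hashing a flow's 5-tuple onto one of the instances."""
    return flow_id % instance_count


flows = range(12)

# Two NAT instances, each with a private table of the flows it has seen.
tables = [set(), set()]
for f in flows:
    tables[pick_instance(f, 2)].add(f)

# Scale out to three instances; the new one starts with an empty table.
tables.append(set())

# Flows now steered to an instance that holds no entry for them.
failed = [f for f in flows if f not in tables[pick_instance(f, 3)]]
```

In this toy setup 8 of the 12 existing flows land on an instance with no entry for them, which is exactly the failed lookup described above.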

About Matt Conran
