Kubernetes Networking 101
Kubernetes is an open-source cluster management tool released by Google in June 2014. Google has said it launches over 2 billion containers per week, and Kubernetes was designed to control and manage the orchestration of all these containers. Kubernetes grew out of Google's internal Borg and Omega systems, and the lessons learned on the journey from Borg to Kubernetes have been passed on to the open source community. Kubernetes went 1.0 in July 2015 and is now at version 1.3.0. Kubernetes deployments support GCE, AWS, Azure, vSphere and bare metal.
Kubernetes – Bird's-Eye View
At a very high level, Kubernetes enables a group of hosts to be viewed as a single compute instance. This single compute instance, consisting of multiple physical hosts, is used to deploy containers against. It offers a completely different abstraction level to what we have with single-container deployments. Users start to think about high-level application services and the concept of service design only. They are no longer concerned with individual container deployment, as the orchestrator looks after deployment, scale, and management. For example, users tell the orchestration system they want a specific type of application with defined requirements: now go deploy it for me. The orchestrator manages the entire rollout, selects the target hosts and manages the container lifecycle. Users don't get involved with host selection. This type of abstraction allows users to focus on design and workload requirements only – the orchestrator takes care of all the low-level deployment and management details.
Distributed systems are more fine-grained now, with Kubernetes driving microservices. Microservices is a fast-moving topic involving the breaking down of applications into many specific services. All services have their own lifecycles and collaborate with each other. Splitting the monolith into microservices is not a new idea (though the term is), but the emergence of new technologies is having a profound effect.
These domain-specific containers require a uniform way to communicate and access each other's services. A strategy needs to be maintained to manage container interaction. How do we scale containers? What's the process for container failure? How do we react to container resource limits? Although Docker does help with container management, Kubernetes orchestration works on a different scale and looks at the entire application stack, allowing management at a service / application level (let's jump into Docker Swarm another time).
To take full advantage of the portability of containers and microservices, we need a management and orchestration system. Containers can't just be thrown into a sea of compute and be expected to tie themselves together and work efficiently. A management tool is required to govern and manage the life of containers, where they are placed, and who they can talk to. Containers have a complicated existence, and many pieces are used to patch up their communication flow and management. We have updates, high availability, service discovery, patching, security and, of course, networking. The most important aspect to keep in mind with Kubernetes, or any container management system, is that it is not concerned with individual container placement. The focus is on workload placement. Users enter high-level requirements and the scheduler does the rest – where, when, and how many?
Looking at workloads to analyze placement optimizes application deployment. For example, some processes that are part of the same service benefit from network proximity. Front-end tiers sending large chunks of data to a back-end database tier should be close to each other, not tromboning across the network to another Kubernetes host for processing. When there is common data that needs to be accessed and processed, it makes sense to put containers "close" to each other in a cluster. The following diagram displays the core Kubernetes architecture.
To achieve this, Kubernetes uses four main constructs to build the application stack – Pods, Services, Labels, and Replication Controllers. All constructs are configured and combined, resulting in a full application stack with all management components. Pods group related containers on the same host, Labels tag objects, Replication Controllers reconcile desired and actual state at a Pod level (not a container level), and Services enable Pod-to-Pod communication. These constructs enable the management of your entire application lifecycle as a whole, as opposed to individual application components. Constructs are defined in configuration files in either YAML or JSON format.
Pods are the smallest scheduling unit in Kubernetes. They hold a set of closely related containers, all sharing fate and resources. Containers in a Pod share the same network namespace and must run on the same host. The main idea behind keeping similar or related containers together is that processing is performed locally and does not incur the latency of traversing from one physical host to another. Local processing is always faster than remote processing.
Pods essentially hold containers with related pieces of the application stack. A key point is that they are ephemeral and follow a specific lifecycle. They should be able to come and go without any service interruption, as any service-destined traffic should be directed towards the "service" endpoint IP address, not the Pod IP address. Even though Pods have a Pod-wide IP address, service reachability is carried out with service endpoints. Services are not as ephemeral (although they can be deleted) and don't come and go as much as Pods. They act as the front-end VIP to back-end Pods (more on this later). This type of architecture really hammers home the level of abstraction Kubernetes is aiming for.
The following example displays a Pod definition file. We have some basic configuration parameters here, such as the name and ID of the Pod. Also, notice that the object kind is set to "Pod". This is set according to the object we are defining; later we will see it set to "Service" when defining a service endpoint. In this example, we are defining two containers – "testpod80" and "testpod8080". We also have the option to specify the container image and a Label. As Kubernetes assigns the same IP to the Pod in which both containers live, we should be able to browse to the same IP on different port numbers, 80 or 8080. Traffic gets redirected to the respective container.
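A minimal sketch of what such a Pod definition might look like in the v1 API (the image names and the label are placeholders, not taken from the original file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  labels:
    app: testpod              # label used later to select this Pod
spec:
  containers:
  - name: testpod80
    image: nginx              # placeholder image listening on port 80
    ports:
    - containerPort: 80
  - name: testpod8080
    image: example/app:latest # hypothetical image listening on port 8080
    ports:
    - containerPort: 8080
```

Both containers share the Pod's IP, so browsing to `<pod-ip>:80` and `<pod-ip>:8080` reaches the respective containers.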
Containers within a Pod share their network namespace. All containers within a Pod can reach each other's ports on localhost. This does reduce the isolation between containers, but any more isolation would somewhat defeat the point of having Pods in the first place. They are meant to group "similar" containers sharing the same resource volumes, RAM, and CPU. For Pod segmentation, we have Labels – the Kubernetes tagging system.
Labels offer another level of abstraction by tagging items as part of a group. They are essentially key-value pairs categorizing constructs. When we create Kubernetes constructs, we have the option to set a label, which acts as a tag for that construct. This means you can access a group of objects by specifying the label assigned to those objects. For example, labels distinguish containers as being part of a web or database tier. The "selector" field tells Kubernetes which labels to use in finding Pods to forward traffic to.
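As an illustration, a Pod tagged as part of the web tier and a Service selector matching it might look like this (the names and label values are hypothetical):

```yaml
# Pod metadata carrying a tier label
metadata:
  name: web-pod
  labels:
    tier: web
---
# Service spec fragment selecting all Pods labeled tier=web
spec:
  selector:
    tier: web
```

Labels can also be queried on the command line, e.g. `kubectl get pods -l tier=web`.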
The Replication Controller (RC) manages the lifecycle and state of Pods. It ensures the desired state always matches the actual state. When you create an RC, you define how many copies (aka replicas) of the Pod you want in the cluster. The RC maintains the correct number by creating or removing Pods at any given time. Kubernetes doesn't care about the number of containers running in a Pod; its only concern is the number of Pods. It works at a Pod level.
The following is an example of an RC definition file. Here you can see that the desired number of replicas is "2", meaning the controller should maintain two copies of the Pod. Changing the number up or down will either increase or decrease the number of Pods the Replication Controller manages. For example, if the RC notices there are too many Pods, it will stop some to bring the cluster back to the desired state. The RC always keeps track of the desired state and brings the actual state back to what was originally specified in the definition file. We may also assign a label for grouping Replication Controllers together.
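A sketch of such an RC definition, reusing the hypothetical Pod template from earlier (the image name and labels are placeholders):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: testpod-rc
spec:
  replicas: 2                # desired state: two copies of the Pod
  selector:
    app: testpod             # manage Pods carrying this label
  template:                  # Pod template used to create new replicas
    metadata:
      labels:
        app: testpod
    spec:
      containers:
      - name: testpod8080
        image: example/app:latest   # hypothetical image
        ports:
        - containerPort: 8080
```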
Service endpoints enable the abstraction of services and the ability to scale horizontally. Essentially, a Service is an abstraction defining a logical set of Pods. Services represent groups of Pods acting as one and allow Pods to access services in other Pods without directing service-destined traffic to a Pod IP. Remember, Pods are short-lived!
The IP address a service endpoint gets comes from the "portal net" range defined on the API server. The address has local significance to the host, so make sure it doesn't clash with the docker0 bridge IP address.
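In early releases this range was set with the API server's --portal-net flag (later renamed --service-cluster-ip-range); an illustrative value, not a recommendation:

```
kube-apiserver --portal-net=10.254.0.0/16 ...
```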
Pods are targeted by accessing a service that represents a group of Pods. A service can be viewed with a similar analogy to a load balancer, sitting in front of Pods and accepting front-end service-destined traffic. Services act as the main hooking point for service / Pod interactions. They offer the high-level abstraction to Pods and the containers within. All traffic gets redirected to the service IP endpoint, which in turn performs the redirection to the correct back end. Traffic hits the service IP address (portal net), and a netfilter iptables rule forwards it to a local high port number on the host. The kube-proxy service creates the high port number, forming the basis for load balancing. The load-balancing object then listens on that port. The kube-proxy acts as a full proxy, maintaining two distinct TCP connections: one from the container to the proxy, and another from the proxy to the load-balanced destination.
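A rough sketch of the userspace proxy flow described above (the VIP, port numbers, and chain name are illustrative, not taken from a live cluster):

```
# Redirect traffic destined for service VIP 10.254.0.10:80 to a local
# high port (36790 here) on which kube-proxy is listening:
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.254.0.10/32 -p tcp \
    --dport 80 -j REDIRECT --to-ports 36790
# kube-proxy then opens a second TCP connection from itself to a
# chosen backend Pod, completing the two-connection proxy path.
```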
The following is an example of a service definition file. The service listens on port 80 and sends traffic to the back-end container port 8080. Notice how the object kind is set to "Service" and not "Pod" as in the previous definition file.
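A minimal sketch of the service definition (the names and selector are placeholders matching the earlier hypothetical examples):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: testpod-svc
spec:
  selector:
    app: testpod        # forward to Pods carrying this label
  ports:
  - port: 80            # port the service listens on
    targetPort: 8080    # container port traffic is sent to
```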
Kubernetes Networking Model
The Kubernetes networking model dictates that each Pod should have a routable IP address. This makes communication between Pods easier by not requiring the NAT or port mappings we had with earlier versions of Docker networking. With Kubernetes, for example, we can have a web server and database server placed in the same Pod and use the local interface for cross-communication. As there is no additional translation, performance is better than that of a NAT'd approach.
Kubernetes Network Proxy
Kubernetes fulfills service -> Pod integration by enabling a network proxy called the kube-proxy on every node in a cluster. The network proxy is always there, even if Pods are not running. Its main task is to route traffic to the correct Pod, and it can do TCP/UDP stream forwarding or round-robin TCP/UDP forwarding. The kube-proxy captures service-destined traffic and proxies requests from the service endpoint back to the Pod running the application. The traffic is forwarded to the Pods on the target port defined in the definition file, while the local port the proxy listens on is a random high port assigned during service creation. To make all this work, Kubernetes uses iptables and virtual IP addresses.
When using Kubernetes alongside, for example, OpenContrail, the kube-proxy is disabled on all hosts, and connectivity is implemented by the OpenContrail vRouter module via overlays (MPLS over UDP encapsulation). Another vendor at the forefront is Midokura, the co-founder behind OpenStack Project Kuryr. This project aims to bring any SDN plugin (MidoNet, Dragonflow, OVS, etc.) to containers. More on these another time.
Kubernetes Pod-IP Approach
The Pod's IP address is reachable by all other Pods and hosts in the Kubernetes cluster. The address is not usually routable outside of the cluster. This should not be too much of a concern, as most traffic stays within application tiers inside the cluster. Any inbound external traffic is handled by mapping external load balancers to services in the cluster.
The Pod-IP approach assumes that all Pods can reach each other without creating specific links. They can access each other by IP rather than through a port mapping on the physical host. Port mappings hide the original address by performing a masquerade – source NAT – similar to how your residential router hides local PC and laptop IP addresses from the public Internet. Cross-node communication is much simpler, as every Pod has an IP address. There isn't any port mapping or NAT like there is with default Docker networking. If the kube-proxy receives traffic for a Pod that is not on its host, it simply forwards the traffic to the correct Pod IP for that service.
The IP-per-Pod model offers a simplified approach to K8s networking. A unique IP per host would potentially need port mappings on the host IP as the number of containers increases; managing port assignment would become an operational and management burden, similar to earlier versions of Docker. On the flip side, a unique, routable IP per container would surely hit scalability limits.
Kubernetes PAUSE container
Kubernetes has what's known as a PAUSE container, also referred to as the Pod infrastructure container. It handles networking by holding the network namespace and IP address for the containers in that Pod. Some refer to the PAUSE container as an implementation detail you can safely ignore.
Within a Pod, each container uses the "mapped container" mode to connect to the pause container. The mapped-container mode is implemented with a source and target container grouping: the source container is the user-created container, and the target container is the infrastructure pause container. Traffic destined for the Pod IP first lands on the pause container and gets passed to the back-end containers. The pause container and the user-built containers all share the same network stack. Remember the Pod definition file with two containers – port 80 and port 8080? It is the pause container that actually listens on these port numbers.
In summary, the Kubernetes model introduces three methods of communication:
- a) Pod-to-Pod communication – directly by IP address. Kubernetes assigns each Pod a Pod-wide IP address, simplifying communication.
- b) Pod-to-Service communication – client traffic is directed to the service virtual IP, which is then intercepted by the kube-proxy process (running on all hosts) and directed to the correct Pod.
- c) External-to-internal communication – external access is captured by an external load balancer, which targets nodes in the cluster. The kube-proxy determines the correct Pod to send traffic to. More on this in a separate post.
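On a supported cloud such as GCE, external access can be requested declaratively through the service type; a hedged sketch reusing the earlier hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: testpod-external
spec:
  type: LoadBalancer    # ask the cloud provider for an external load balancer
  selector:
    app: testpod
  ports:
  - port: 80
    targetPort: 8080
```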
Docker & Kubernetes Networking Comparison
Docker uses host-private networking. The Docker engine creates a default bridge, and every container gets a virtual ethernet (veth) pair connected to that bridge. The veth acts like a pipe – one end is attached to the docker0 bridge namespace and the other end to the container's Linux namespace. This provides connectivity between containers on the same Docker bridge.
All containers are assigned an address from the 172.17.0.0/16 range, with 172.17.42.1 assigned to the default bridge, acting as the container gateway. Any off-host traffic requires port mappings and NAT for communication. The container's real IP address is hidden, and the network sees the container traffic as coming from the Docker node's physical IP address. The resulting effect is that containers can talk to each other by IP address only on the same virtual bridge. Any off-host container communication requires messy port allocations. Recently, there have been enhancements to Docker networking, including multi-host native connectivity without translation.
Although there are enhancements to Docker networking, the NAT / port-mapping design is not a clean solution. The K8s model offers a different approach: the docker0 bridge gets a routable IP address. This means any outside host can directly access a Pod by its IP address rather than through a port mapping on the physical host. Kubernetes has no NAT for container-to-container or container-to-node traffic.
For instant hands-on experience, you can sign up for a trial at GCE. GCE has a ready-made Google Container Engine, allowing you to play with Kubernetes clusters. Kindly see my previous setup post on Kubernetes basics. Jon Langemak has some excellent deployment and operational notes for Kubernetes on bare metal with SaltStack.