Network namespaces
For the container approach to work, every global resource in the system must be wrapped in some sort of namespace. This wrapping has been done for some relatively simple resources, such as the utsname information or process IDs; some of the resulting code has already found its way into the mainline. There is not a whole lot of use, however, for containers which are completely isolated from the rest of the world; usually some sort of networking capability is needed. For example, containers can usefully contain a web browser (keeping it from exposing the rest of the system should it prove vulnerable) or a web server - but only if networking works. But containers should not be able to see each other's packet streams, and, ideally, should be able to bind to the same ports without interfering with each other.
Making that work requires network namespaces. These namespaces virtualize all access to network resources (interfaces, port numbers, etc.), allowing each container the network access it needs (but no more). As with all other problems in computer science, the network namespace issue can be addressed with another layer of indirection. There is a small problem with this approach, however: the networking code is a vast pile of complex, highly-tuned code overseen by developers who have little tolerance for changes which introduce performance overhead or potential bugs. Getting any sort of network namespace implementation merged is going to require quite a bit of very careful work.
One approach can be seen in the L2 network namespace patch set posted recently by Dmitry Mishin. These patches concentrate on the lower levels of the network stack, trying to get proper namespaces established for network devices and the IPv4 layer. In an attempt to minimize churn in the networking code, the L2 namespace patch introduces the idea of the "current network namespace," kept in a per-CPU variable. The current namespace is implemented as a stack, with push and pop operations; in theory, it allows all network operations to happen within the proper namespace. Your editor was unable to convince himself that this scheme would work properly in the face of any sort of kernel preemption, but that may just be a matter of not having looked hard enough.
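The push/pop idea is easier to picture with a toy model. The sketch below is plain user-space Python, not kernel code; `NetNamespace`, `CpuState`, and `run_in_namespace` are hypothetical names invented for illustration, and nothing here is taken from the patch itself. It only shows how a per-CPU stack lets code deep in a call chain ask "which namespace am I in?" without being handed a parameter. It says nothing about preemption, which is exactly the concern raised above.

```python
class NetNamespace:
    """Hypothetical stand-in for a network namespace object."""

    def __init__(self, name):
        self.name = name


ROOT_NS = NetNamespace("root")


class CpuState:
    """Models one CPU's 'current namespace' stack (illustrative only)."""

    def __init__(self):
        # The root namespace sits at the bottom and is never popped.
        self._stack = [ROOT_NS]

    def current(self):
        return self._stack[-1]

    def push(self, ns):
        self._stack.append(ns)

    def pop(self):
        if len(self._stack) == 1:
            raise RuntimeError("cannot pop the root namespace")
        return self._stack.pop()


def run_in_namespace(cpu, ns, operation):
    """Run operation(cpu) with ns as the current namespace, then restore."""
    cpu.push(ns)
    try:
        return operation(cpu)
    finally:
        cpu.pop()


cpu0 = CpuState()
container = NetNamespace("container-a")

# The operation never receives the namespace; it asks the CPU for it.
seen = run_in_namespace(cpu0, container, lambda cpu: cpu.current().name)
after = cpu0.current().name
```

Here `seen` holds the container's name and `after` is back to the root namespace once the operation returns.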
The net_device structure gains a net_ns field, providing the namespace to which the device belongs. It is set to whatever namespace is current when the device is created. The device lookup functions have become namespace-aware; if a device does not belong to the current namespace, it becomes invisible. A different version of the loopback device is created for each namespace. Then, the IPv4 routing code has been extended so that each namespace gets its own set of routing tables. The code which matches incoming packets to sockets has also been made namespace-aware; there is still a single hash table, but the namespace has been made part of the match criteria.
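As a rough illustration of what "namespace-aware" means for lookups, consider the following user-space Python sketch. All names (`Device`, `find_device`, `bind`, `lookup_socket`) are hypothetical and the data structures are far simpler than anything in the kernel; the only point is that the namespace becomes part of the match, so a single shared table can serve every namespace.

```python
class Device:
    """Toy network device carrying the namespace it was created in."""

    def __init__(self, name, net_ns):
        self.name = name
        self.net_ns = net_ns


def find_device(devices, name, current_ns):
    """Return the named device only if it belongs to current_ns."""
    for dev in devices:
        if dev.name == name and dev.net_ns == current_ns:
            return dev
    return None  # devices in other namespaces are simply invisible


def bind(table, net_ns, port, owner):
    """Bind a port in one shared table; the key includes the namespace."""
    key = (net_ns, port)
    if key in table:
        raise OSError("address already in use")
    table[key] = owner


def lookup_socket(table, net_ns, port):
    """Match an incoming packet's (namespace, port) to a socket owner."""
    return table.get((net_ns, port))


# Each namespace gets its own loopback; eth0 stays in the root namespace.
lo_root = Device("lo", "root")
lo_a = Device("lo", "container-a")
eth0 = Device("eth0", "root")
devices = [lo_root, lo_a, eth0]

# Two containers bind the same port without colliding.
table = {}
bind(table, "container-a", 80, "web server A")
bind(table, "container-b", 80, "web server B")
```

With this setup, looking up "lo" from container-a finds that container's own loopback device, "eth0" is not visible there at all, and port 80 resolves to a different owner in each namespace.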
Network interfaces made up of real hardware will normally remain in the root namespace. Communication with other namespaces is made possible by way of a "virtual Ethernet" device, included with the patch set. A virtual device can be thought of as a wire into a restricted namespace; it presents one device within that namespace and one in the parent (normally root) namespace. Packets written to one end show up at the other. With the addition of a few routing rules in the root namespace, packets meeting the right criteria can be directed into (and out of) specific namespaces.
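The "wire" description can be modelled in a few lines. This is a user-space Python sketch under obvious simplifying assumptions: `VethEnd` and `make_veth_pair` are made-up names, packets are just byte strings, and no routing is involved. It shows only the pairing: two device ends in different namespaces, each delivering what it transmits to the other's receive queue.

```python
from collections import deque


class VethEnd:
    """One end of a toy virtual Ethernet pair (illustrative only)."""

    def __init__(self, name, namespace):
        self.name = name
        self.namespace = namespace
        self.peer = None
        self.rx_queue = deque()

    def transmit(self, packet):
        # Whatever is written to this end shows up at the other end.
        self.peer.rx_queue.append(packet)

    def receive(self):
        # Return the oldest queued packet, or None if nothing is waiting.
        return self.rx_queue.popleft() if self.rx_queue else None


def make_veth_pair(parent_ns, child_ns):
    """Create two connected ends, one per namespace."""
    parent_end = VethEnd("veth0", parent_ns)
    child_end = VethEnd("eth0", child_ns)
    parent_end.peer = child_end
    child_end.peer = parent_end
    return parent_end, child_end


parent_end, child_end = make_veth_pair("root", "container-a")

parent_end.transmit(b"request")   # root namespace -> container
into_container = child_end.receive()

child_end.transmit(b"reply")      # container -> root namespace
out_of_container = parent_end.receive()
```

The routing rules mentioned above would decide which packets get handed to `parent_end.transmit()` in the first place; that part is not modelled here.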
The L2 namespace patch provides the plumbing for the creation of little virtualized Internets within a single system, but it does not yet provide complete isolation. A process within its namespace can reconfigure its interfaces, perhaps creating problems for the system as a whole. Tightening things down is left to the L3 namespace patch, posted by Daniel Lezcano. An L3 namespace is always the child of an L2 namespace; it is the end of the line, however, being unable to have child namespaces of its own. There are also no network admin capabilities in an L3 namespace; once an L3 namespace is created, it is stuck with whatever network configuration its parent gave it.
The end result is that a contained system can be put within an L3 namespace and it should be able to perform networking without interfering with (or even seeing) other systems in other namespaces.
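The L2/L3 relationship can be summarized in a small model. The following user-space Python sketch uses hypothetical class and method names (`L2Namespace`, `L3Namespace`, `set_address`, `create_l3_child`); how the real patch hands configuration to a child is not described here, so the "copy the parent's chosen addresses at creation time" step is purely an assumption made for the example.

```python
from types import MappingProxyType


class L2Namespace:
    """Toy L2 namespace: may reconfigure itself and create L3 children."""

    def __init__(self, name):
        self.name = name
        self.addresses = {}

    def set_address(self, ifname, address):
        self.addresses[ifname] = address

    def create_l3_child(self, name, addresses):
        # Assumption: the parent picks the child's configuration up front.
        return L3Namespace(name, self, addresses)


class L3Namespace:
    """Toy L3 namespace: fixed configuration, no children, no admin rights."""

    def __init__(self, name, parent, addresses):
        self.name = name
        self.parent = parent
        # Read-only view: the configuration cannot change after creation.
        self.addresses = MappingProxyType(dict(addresses))

    def set_address(self, ifname, address):
        raise PermissionError("no network admin capabilities in an L3 namespace")

    def create_l3_child(self, name, addresses):
        raise PermissionError("an L3 namespace cannot have child namespaces")


l2 = L2Namespace("l2-a")
l2.set_address("eth0", "10.0.0.1")
l3 = l2.create_l3_child("l3-a", {"eth0": "10.0.0.2"})

try:
    l3.set_address("eth0", "10.0.0.99")
    reconfigured = True
except PermissionError:
    reconfigured = False
```

After this runs, `reconfigured` is `False` and the child still has the address its parent assigned.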
A somewhat different approach can be seen in the network namespace patches posted by Eric W. Biederman. Eric, aware of the challenges involved in getting network namespaces merged, is far more concerned with the process than the specific namespace implementation. So his patches focus mostly on getting the internal APIs right.
The first step is to figure out how network namespaces are to be represented. Rather than use a structure, Eric has opted for a mechanism which marks all network-related global resources in a special way. These resources get linked into a special section of the kernel which can be cloned when a new namespace is created. Each global variable becomes an offset into the per-namespace section; it must be accessed by way of a special macro. This approach appears cumbersome, but it has a couple of advantages. If a module with per-namespace variables is loaded, those variables can be added to each existing namespace on the fly. And, if namespaces are not in use, the overhead of the whole mechanism drops to zero. This is an important feature: to have a hope of being merged, a network namespace implementation will have to have no impact on systems which are not using it.
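A toy model may make the offset scheme less abstract. The Python sketch below is a user-space illustration only: `PerNamespaceVars`, `per_ns`, and `set_per_ns` are invented names standing in for the special section and the accessor macro, and the "section" is just a list. It shows variables being declared as offsets, sections being cloned per namespace, and a late declaration (as from a module load) being added to namespaces that already exist. The zero-overhead case for systems without namespaces is not modelled.

```python
class PerNamespaceVars:
    """Toy registry: a template of initial values plus one copy per namespace."""

    def __init__(self):
        self.template = []   # initial value of each variable, indexed by offset
        self.sections = []   # every live namespace's private copy

    def declare(self, initial_value):
        """Declare a per-namespace variable and return its offset."""
        offset = len(self.template)
        self.template.append(initial_value)
        # Namespaces that already exist gain the new variable on the fly.
        for section in self.sections:
            section.append(initial_value)
        return offset

    def clone_section(self):
        """Create the private section for a new namespace."""
        section = list(self.template)
        self.sections.append(section)
        return section


def per_ns(section, offset):
    """Stand-in for the accessor macro: read a variable in one namespace."""
    return section[offset]


def set_per_ns(section, offset, value):
    """Write a variable in one namespace only."""
    section[offset] = value


registry = PerNamespaceVars()
ROUTE_COUNT = registry.declare(0)

root_section = registry.clone_section()
child_section = registry.clone_section()
set_per_ns(child_section, ROUTE_COUNT, 3)   # does not affect the root copy

# A "module" loaded later declares its own per-namespace variable.
MODULE_STATE = registry.declare("idle")
```

After this runs, the root namespace still sees a `ROUTE_COUNT` of 0, the child sees 3, and both can read `MODULE_STATE` even though they were created before it was declared.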
The patch set (31 parts strong) then works through various parts of the networking API, adding a namespace parameter to functions which need it. There is no global "current namespace" concept in Eric's patches; it is, instead, an explicit parameter everywhere. Thus, for example, every function which creates a socket (they exist in every protocol implementation) gets a namespace parameter. The sk_buff structure (which represents a packet) has a namespace field assigned from either the process creating it (for outbound packets) or the device it was received from; the various protocol-specific functions are expected to take that namespace into account. Functions dealing with netlink sockets get namespace parameters, as do those which implement network device lookup, event generation, and Unix-domain sockets. Like the L2 patches, Eric's implementation includes a virtual network device (called "etun") which can be used to route packets between namespaces.
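The packet-tagging rule described here can be sketched as follows. This is user-space Python with hypothetical names (`Device`, `Packet`, `build_outbound`, `build_inbound`, `protocol_input`); it is not the sk_buff API. It shows the namespace traveling with the packet and being consulted explicitly, rather than being read from any global state.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    net_ns: str


@dataclass
class Packet:
    payload: bytes
    net_ns: str   # set once, when the packet is created


def build_outbound(sender_ns, payload):
    """Outbound packets take the namespace of the process creating them."""
    return Packet(payload, sender_ns)


def build_inbound(device, payload):
    """Inbound packets take the namespace of the receiving device."""
    return Packet(payload, device.net_ns)


def protocol_input(packet, queues):
    """Toy protocol handler: deliver into the queue of the packet's namespace."""
    queues[packet.net_ns].append(packet.payload)


queues = {"root": [], "container-a": []}
etun_end = Device("etun0", "container-a")   # device living in the container

protocol_input(build_inbound(etun_end, b"from the wire"), queues)
protocol_input(build_outbound("root", b"from a root process"), queues)
```

Each payload ends up only in the queue belonging to its own namespace; nothing in `protocol_input()` depends on which task or CPU happens to be running it.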
Unlike the L2/L3 patches, Eric's work deals with the virtualization of the networking-related /proc, sysctl, and sysfs interfaces. Doing so requires adding shadow directory support to sysfs. Shadow directories loosen the connection between sysfs and the internal kobject hierarchy, allowing different namespaces to see different contents in the same locations.
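The effect of shadow directories, as described, is that one path yields different listings depending on who is looking. A minimal user-space Python sketch of that behavior follows; `ShadowedDirectory` and its methods are invented for illustration and bear no relation to the actual sysfs implementation.

```python
class ShadowedDirectory:
    """Toy directory holding one set of entries per namespace."""

    def __init__(self, path):
        self.path = path
        self._shadows = {}   # namespace -> set of entry names

    def add_entry(self, net_ns, entry):
        self._shadows.setdefault(net_ns, set()).add(entry)

    def listdir(self, net_ns):
        """Return the entries visible from the given namespace."""
        return sorted(self._shadows.get(net_ns, ()))


net_class = ShadowedDirectory("/sys/class/net")
net_class.add_entry("root", "lo")
net_class.add_entry("root", "eth0")
net_class.add_entry("container-a", "lo")
net_class.add_entry("container-a", "etun0")

root_view = net_class.listdir("root")
container_view = net_class.listdir("container-a")
```

Both views come from the same path, yet `root_view` lists eth0 and lo while `container_view` lists etun0 and lo.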
A key aspect of Eric's patch is that it implements little namespace mechanism. Instead, much of the networking stack is made to test the namespace it is given and fail if the root namespace is not in use. The idea is to get the interfaces right first, then to start to fill in the mechanism in relatively small pieces. The tests ensure that the network stack will not surprise users by doing the wrong thing if it is not yet fully prepared to handle non-root namespaces.
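The "test and fail" strategy is simple enough to show directly. In the user-space Python sketch below, `require_root`, `create_socket`, and the `NamespaceNotSupported` exception are all hypothetical; the actual error reported by the patches is not stated here, so a custom exception stands in for it. The point is that the interface already takes a namespace, while the body refuses anything it cannot yet handle.

```python
ROOT_NS = "root"


class NamespaceNotSupported(Exception):
    """Raised when code that is not yet namespace-aware sees a non-root namespace."""


def require_root(net_ns, what):
    """Guard used by code paths that only work in the root namespace so far."""
    if net_ns != ROOT_NS:
        raise NamespaceNotSupported(f"{what} is not yet namespace-aware")


def create_socket(net_ns, family):
    """The signature is final; the mechanism behind it can be filled in later."""
    require_root(net_ns, "socket creation")
    return {"family": family, "net_ns": net_ns}


root_socket = create_socket(ROOT_NS, "inet")

try:
    create_socket("container-a", "inet")
    container_result = "created"
except NamespaceNotSupported as err:
    container_result = str(err)
```

Once a given protocol is converted, its `require_root()` call can be dropped without touching any caller, which is what makes the piecemeal approach workable.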
Despite the posting of all these patches, the amount of discussion has been quite low. One gets the sense that the network developers have not yet started to take these patches seriously. This issue seems unlikely to go away, however; there remains a great deal of interest in getting container features into the mainline kernel. Sooner or later, this discussion is likely to take off.