Kernel development
Brief items
Kernel release status
The current development kernel is 3.16-rc7, released on July 27. Linus appears to be happier with the pace of change at this point: "We obviously *do* have various real fixes in here, but none of them look all that special or worrisome. And rc7 is finally noticeably smaller than previous rc's, so we clearly are calming down. So unlike my early worries, this might well be the last rc, we'll see how next week looks/feels."
Stable updates: 3.15.7, 3.14.14, 3.10.50, and 3.4.100 were released on July 28. The 3.15.8, 3.14.15, 3.10.51, and 3.4.101 updates are in the review process as of this writing; they can be expected on or after August 1.
Quotes of the week
Oh, look you just left your house. The merging of your code is many many miles distant and you just started walking that road, just now, not when you started writing it, not when you started legal review, not when you rewrote it internally the 4th time. You just did it this moment.
Kernel development news
Two paths to a better readdir()
A common filesystem workload follows a simple pattern: work through a list of files in a directory, and use stat() to obtain information about each of those files. The "ls -l" command is a classic example of this type of workload, but there are many others. This workload has always run more slowly on Linux systems than developers would like, but getting a solution into the kernel has happened even more slowly. Recently, a pair of possible solutions was posted by Abhi Das; perhaps this time this issue will be resolved — in a surprising way.

The problem with the "ls -l" workload is simple enough: two system calls are required for each file of interest. A call to getdents() (usually via the readdir() function in the C library) obtains the name of a file in the directory; then stat() is called to get the information about that file. The stat() call, in particular, can be expensive, with each call forcing the underlying filesystem to perform I/O to obtain the desired information. In some cases, that information may be spread across multiple on-disk data structures, requiring even more I/O, even if the calling application does not actually use everything that stat() returns. Doing all this work is inefficient; it would be nice if there were a way to limit the information gathered to what the application actually needs and to get that information in batches.
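To make that cost concrete, here is a minimal sketch of the classic two-syscalls-per-file loop; it uses only standard POSIX interfaces and is shown purely for illustration:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* The pattern described above: one readdir() (getdents() underneath)
 * stream over the directory, plus one stat() call for every entry. */
static int list_directory(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    struct stat st;
    char name[4096];

    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        snprintf(name, sizeof(name), "%s/%s", path, de->d_name);
        if (stat(name, &st) == 0)        /* the expensive per-file call */
            printf("%-30s %10lld bytes\n", de->d_name,
                   (long long) st.st_size);
    }
    closedir(dir);
    return 0;
}

Each pass through that loop can force the filesystem to read inode information from disk; that per-file cost is exactly what the proposals described below try to reduce.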
This issue is not new; it was, in fact, already somewhat old when it was discussed at the 2009 Linux Storage and Filesystem Workshop. A proposed solution, in the form of the xstat() system call, was posted in 2010 but did not get very far. At this point, well into 2014, some filesystems have code to try to optimize for this kind of workload, but there is still no general solution in the kernel. For the last few years, there has appeared to be little interest among developers in working on this problem.
In that setting, Abhi has come forward with two independent solutions demonstrating two separate approaches to the problem. His hope is to get feedback on both and, once one of them emerges as the preferred solution, get it into the mainline kernel.
xgetdents()
The first approach builds on the 2010 xstat() system call by David Howells. It adds two new system calls:
int xstat(int dirfd, const char *filename, unsigned int flags,
          unsigned int mask, struct xstat *info);
int fxstat(int fd, unsigned int flags, unsigned int mask,
           struct xstat *info);
The first form looks up a given file by name, while the second returns information for an open file identified by its descriptor. The flags field is there to change the operation of the system call; there is little use of it in this patch set. Of more interest is mask, which tells the kernel which information is being requested by the application. There are quite a few bits that can be set here; examples include XSTAT_MODE (for the file protection bits), XSTAT_UID (file owner), XSTAT_RDEV (underlying storage device), XSTAT_ATIME (last access time), or XSTAT_INO (inode number). XSTAT_ALL_STATS can be used to request all available information. On a successful return, the info structure will be filled in with the requested data.
On top of this work, Abhi has added another system call:
int xgetdents(unsigned int fd, unsigned int flags, unsigned int mask, void *buf, unsigned int count);
Here, fd is a file descriptor for the directory of interest, while flags and mask are as above (though mask has been extended to allow the application to request various types of extended attribute data). Information is returned in buf, which is a simple byte array, count bytes in length. The xgetdents() call will attempt to retrieve information about multiple files in the given directory until buf fills.
The actual data returned in buf is somewhat complex. The top-level structures defining this information are:
struct xdirent_blob {
    unsigned int    xb_xattr_count;
    char            xb_blob[1];     /* contains variable length data like
                                     * NULL-terminated name, xattrs etc */
};

struct linux_xdirent {
    unsigned long   xd_ino;
    char            xd_type;
    unsigned long   xd_off;
    struct xstat    xd_stat;
    unsigned long   xd_reclen;
    struct xdirent_blob xd_blob;
};
The documentation of the return format is somewhat sparse. Actually, it does not exist at all, so one is forced to reverse-engineer it from the code. It appears that information for each file will be returned in one variable-length linux_xdirent structure. The name of the file is the first thing stored in xd_blob, followed by extended attribute information if that has been requested. This structure clearly requires a bit of work to understand and pick apart on the user-space side, but it does have the advantage of allowing all of that information to be collected and returned in a single system call.
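For illustration, here is a hedged sketch of how user space might walk that buffer. It assumes the structure definitions above are available from a (hypothetical) header and that records are packed back-to-back, with xd_reclen giving each record's total size; the patches do not spell out alignment rules, so treat this as a sketch of the idea rather than a drop-in parser:

#include <stdio.h>
#include <sys/types.h>

/* 'ret' is the number of bytes xgetdents() reported as filled in. */
static void walk_xdirents(const char *buf, ssize_t ret)
{
    const char *p = buf;

    while (p < buf + ret) {
        const struct linux_xdirent *xd = (const struct linux_xdirent *) p;
        /* The NUL-terminated name comes first in xd_blob; any requested
         * extended-attribute data follows it. */
        const char *name = xd->xd_blob.xb_blob;

        printf("inode %lu  name %s\n", xd->xd_ino, name);
        p += xd->xd_reclen;              /* step to the next record */
    }
}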
dirreadahead()
The alternative approach is rather simpler. It adds a single system call:
int dirreadahead(unsigned int fd, loff_t *offset, unsigned int count);
This call will attempt to initiate the reading of file information for count files in the directory represented by fd, starting at the given offset within the directory. The offset value will be updated to reflect the number of files whose information was actually read. One can thus use multiple dirreadahead() calls to work through a directory with the kernel maintaining the offset value as things progress.
In this case, it is still necessary to call getdents() and stat() to get the needed information. What changes is that, with luck, the filesystem will have already pulled that information into an internal cache, so the calls should be handled quickly. Reading information for multiple files at once allows batching to be done; even if the information is dispersed on physical media, the necessary I/O operations can be reordered for optimal execution.
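The intended usage might look something like the following sketch. There is no C-library wrapper for an unmerged system call, so the __NR_dirreadahead number defined here is purely a placeholder:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* __NR_dirreadahead was never allocated upstream; this value exists
 * only to make the example self-contained. */
#define __NR_dirreadahead 999

/* Hint the kernel to start fetching inode information for the next
 * 'count' entries of the directory before the usual getdents()/stat()
 * loop runs.  Failure is harmless: the loop still works, just without
 * the warmed-up caches. */
static void prefetch_directory(int dirfd, unsigned int count)
{
    loff_t offset = 0;

    (void) syscall(__NR_dirreadahead, dirfd, &offset, count);
}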
The introductory message to the two patch sets included some benchmark results on the GFS2 filesystem. Both approaches performed better than mainline kernels when presented with a workload heavy with getdents() and stat() system calls. Perhaps surprisingly, dirreadahead() consistently performed far better than xgetdents(). That result may be an artifact of the xgetdents() implementation or of the GFS2 filesystem, but it shows that the far simpler readahead-based approach is worthy of consideration.
The readahead idea quickly led to questions of whether the kernel could somehow perform this readahead automatically, as it does with basic file I/O. Trond Myklebust noted that the NFS client tries to detect workloads where this kind of readahead might be of value. In the general case, though, this detection is hard to do; there is no obvious connection within the kernel between the getdents() and stat() calls. So, for now at least, it may be up to user space to communicate that information directly. Either of the two interfaces described here could be used for that communication, but it seems that the relative simplicity of the dirreadahead() approach would argue strongly in its favor, even in the absence of better benchmark results.
The RCU-tasks subsystem
The read-copy-update (RCU) mechanism is charged with keeping old versions of data structures around until it knows that no CPU can hold a reference to them; once that happens, the structures can be freed. Recently, though, a potential RCU user came forward with a request for something different: would it be possible to defer the destruction of an old data structure until it is known that no process holds a reference to it? The answer would appear to be "yes," as demonstrated by the RCU-tasks subsystem recently posted by Paul McKenney.

Normal RCU works on data structures that are accessed via a pointer. When an RCU-protected structure must change, the code that maintains that structure starts by making a copy. The changes are made to the copy, then the relevant pointer is changed to point to that new copy. At this point, the old version is inaccessible, but there may be code running that obtained a pointer to it before the change was made. So the old structure cannot yet be freed. Instead, RCU waits until every CPU in the system goes through a context switch (or sits idle). Since the rules for RCU say that references to data structures can only be held in atomic context, the "every CPU has context switched" condition guarantees that no references to an old data structure can be held.
It seems that the rules for the trampolines used by the tracing code are different, though, in that a process can be preempted while still holding a reference to (i.e. running within) an old version. Given that, normal RCU will not work for the management of these structures, meaning that some other, slower locking mechanism must be used. Using an RCU-like mechanism would require that the rules be changed somewhat.
In the normal RCU case, only one process can hold a reference to a protected structure on any given CPU; as a result, RCU focuses on figuring out when no CPU can hold a reference to a given data structure. In this case, there might be multiple processes on each CPU with a reference to the protected data structure, so the focus has to shift. Thus, RCU-tasks is a mechanism designed to figure out when no processes (rather than no processors) can hold such a reference.
With this interface, code that has replaced a protected data structure will arrange for the disposal of the old version with a call to:
void call_rcu_tasks(struct rcu_head *rhp, void (*func)(struct rcu_head *rhp));
Once the appropriate "grace period" has passed, func() will be called with the given rhp to free the structure. For users of RCU-tasks, that is pretty much the entire API. Unlike ordinary RCU, RCU-tasks has no equivalent to rcu_read_lock() for access to protected data structures.
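As an illustration, here is a hedged sketch of how a tracing trampoline might be retired with this API. The trampoline structure and the free_trampoline_text() helper are hypothetical; only the call_rcu_tasks() interface itself comes from the patch set:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Kernel-style sketch, not taken from the actual patches. */
struct trampoline {
    void *text;                 /* generated code that tasks may be running in */
    struct rcu_head rcu;        /* handed to call_rcu_tasks() */
};

static void trampoline_free_cb(struct rcu_head *rhp)
{
    struct trampoline *tr = container_of(rhp, struct trampoline, rcu);

    /* By the time this runs, no task can be preempted inside tr->text. */
    free_trampoline_text(tr->text);     /* hypothetical helper */
    kfree(tr);
}

/* Called after the trampoline has been unhooked, so that no new callers
 * can enter it. */
static void trampoline_retire(struct trampoline *tr)
{
    call_rcu_tasks(&tr->rcu, trampoline_free_cb);
}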
Ordinary RCU has, over the years, acquired a great deal of complexity in order to maximize the scalability of the subsystem. RCU-tasks, instead, is refreshingly simple, at least in its initial implementation. There is a single linked list of rcu_head structures that have been passed to call_rcu_tasks() but that have not yet been acted upon. The patch set adds a new kernel thread charged with managing that list. Once every second, it wakes up to see if any new entries have been added to the list (a subsequent patch replaces the poll with a wait queue). If so, the entire list is moved to a separate list, and the wait for a new grace period to pass begins.
That wait starts by creating a separate list of every runnable process in the system; tasks that are not runnable cannot, by the rules, hold a reference to data structures protected by RCU-tasks, and, thus, need not be considered. For each runnable task, a special "rcu_tasks_holdout" flag is set in the task structure. Hooks have been placed in the scheduler to clear that flag whenever the task voluntarily gives up the CPU or returns to user space. The RCU-tasks kernel thread goes into a separate loop, waking up every tenth of a second, to work through the list of "holdout" tasks; any that have had their flag reset are removed from the list. Once the list is empty, the destructor callbacks can be called and the cycle can start anew.
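In rough outline, and with heavy simplification, that cycle might look like the following; this is a paraphrase of the description above rather than the actual patch code, and the helper functions named here are hypothetical:

#include <linux/sched.h>
#include <linux/list.h>
#include <linux/rcupdate.h>

/* One RCU-tasks grace period, heavily simplified. */
static void rcu_tasks_one_grace_period(struct list_head *callbacks)
{
    LIST_HEAD(holdouts);
    struct task_struct *t;

    /* Any currently runnable task might still be running inside an old
     * structure, so mark each one as a "holdout". */
    rcu_read_lock();
    for_each_process(t) {
        if (task_is_runnable_somehow(t)) {          /* hypothetical test */
            t->rcu_tasks_holdout = true;
            add_to_holdout_list(&holdouts, t);      /* hypothetical helper */
        }
    }
    rcu_read_unlock();

    /* The scheduler hooks clear ->rcu_tasks_holdout when a task
     * voluntarily schedules or returns to user space; poll every tenth
     * of a second until no holdouts remain. */
    while (!list_empty(&holdouts)) {
        schedule_timeout_interruptible(HZ / 10);
        drop_cleared_holdouts(&holdouts);           /* hypothetical helper */
    }

    /* Grace period complete: run the queued destructor callbacks. */
    invoke_queued_callbacks(callbacks);             /* hypothetical helper */
}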
The code gets somewhat more complex as the patch series goes on. The addition of testing infrastructure and stall detection adds somewhat to its footprint. The biggest change, though, is the handling of tasks that exit while they are on the holdout list. Clearly, checking for the "holdout" flag in a task structure that may no longer exist is a bad idea, so this case does need to be properly handled. Doing so involves adding a new type of lock-protected doubly linked list and a bunch of management code; it is the biggest part of the entire patch set.
Thus far, we have not yet seen patches to make other code actually use this new facility. Most of the comments on this patch set have come from Peter Zijlstra, who is concerned about the overhead of polling and the lack of accounting of that overhead. So there are a few questions yet to be answered. While RCU-tasks may well prove to be a useful addition to the RCU API, nobody is expecting to see it in the 3.17 merge window.
Control groups, part 5: The cgroup hierarchy
In earlier articles, we have looked at hierarchies in general and at how hierarchy is handled by specific cgroup subsystems. Now, it is time to draw all of this together to try to understand what sort of hierarchy or hierarchies are needed and how this can be supported in the current implementation. As was recently reported, the 3.16 Linux kernel will have under-development support for a so-called "unified hierarchy". The new ideas introduced with that development will not be discussed yet, as we cannot really appreciate what value they might bring until we fully understand what we have. A later article will unpack the unified hierarchy, but for now we will start by understanding what might be called the "classic" cgroup hierarchies.
Classic cgroup hierarchies
In the classic mode, which may ultimately be deprecated, but is still fully supported, there can be several separate cgroup hierarchies. Each hierarchy starts its life as a root cgroup, which initially holds all processes. This root node is created by mounting an instance of the "cgroup" virtual filesystem, and all further modifications to the hierarchy happen through manipulations of this filesystem, particularly mkdir to create cgroups, rmdir to remove cgroups, and mv to rename a cgroup within the same parent. Once the cgroups are created, processes can be moved between them by writing process ID numbers into special files. When a suitably privileged user writes a PID number to cgroup.procs in a cgroup, that process is moved from the cgroup it currently resides in to the target cgroup.
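Both kinds of manipulation are ordinary filesystem operations, so no special tools are required. A minimal sketch in C might look like the following; the mount point, subsystem list, and group path are examples only, not anything the kernel mandates:

#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Create a classic hierarchy with the cpu and cpuacct subsystems
 * attached; equivalent to:
 *    mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu */
static int mount_cpu_hierarchy(void)
{
    return mount("cgroup", "/sys/fs/cgroup/cpu", "cgroup", 0,
                 "cpu,cpuacct");
}

/* Move a process into a cgroup (created earlier with mkdir()) by
 * writing its PID to that group's cgroup.procs file. */
static int move_pid_to_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    FILE *f;

    snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", (int) pid);
    return fclose(f);       /* permission problems may only appear here */
}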
This is a very "organizational" way to manipulate a hierarchy: create a new group and find someone to fill it. While this may seem natural for a filesystem-based hierarchy, we shouldn't assume it is the best way to manipulate all hierarchies. The simple hierarchy of sessions and process groups that we found in 4.4BSD works quite differently. There is no distinction between creating a group and putting the first process in the group.
We should assess the mechanisms for manipulation by considering whether they are fit-for-purpose, as well as whether they are convenient to implement. If the goal is to allow processes internal to the hierarchy, this mechanism is quite suitable. If we were to prefer to keep all processes in the leaves, it doesn't seem so well suited.
When a hierarchy is created, it is associated with a fixed set of cgroup subsystems. The set can be changed, but only if the hierarchy has no subgroups below the root, so, for most practical purposes, it is fixed. Each subsystem can be attached to at most one hierarchy, so, from the perspective of any given subsystem, there is only one hierarchy, but no assumptions can be made about what other subsystems might see.
Thus it is possible to have 12 different hierarchies, one for each subsystem, or a single hierarchy with all 12 subsystems attached, or any other combination in between. It is also possible to have an arbitrary number of hierarchies each with zero subsystems attached. Such a hierarchy doesn't allow any control of the processes in the various cgroups, but it does allow sets of related processes to be tracked.
Systemd makes use of this feature by creating a cgroup tree mounted at /sys/fs/cgroup/systemd with no controller subsystems. It contains a user.slice sub-hierarchy that classifies processes that result from login sessions, first by user and second by session. So:
/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-1.scope
represents a cgroup that contains all the processes associated with the first login session for the user with UID 1000 (slice and scope are terms specific to systemd).
These "session scopes" seem to restore one of the values of the original process groups in V7 Unix — a clear identification of which processes belong to which login session. This probably isn't very interesting on a single-user desktop, but could be quite valuable on a larger, multi-user machine. While there is no direct control possible of the group, we can tell exactly which processes are in it (if any) by looking at the cgroup.procs file. Provided the processes aren't forking too quickly, you could even signal all the processes with something like:
kill $(cat cgroup.procs)
The tyranny of choice
Probably the biggest single problem with the classic approach to hierarchy is the tyranny of choice. There seems to be a lot of flexibility in the different ways that subsystems can be combined: some in one hierarchy, some in another, none at all in a third. The problem is that this choice, once made, is system-wide and difficult to change. If one need suggests a particular arrangement of subsystems, while another need suggests something different, both needs cannot be met on the same host. This is particularly an issue when containers are used to support separate administrative domains on the one host. All administrative domains must see the same associations of cgroup subsystems to hierarchies.
This suggests that a standard needs to be agreed upon. Obvious choices are to have a single hierarchy (which is where the "unified hierarchy" approach appears to be headed) or a separate hierarchy for each subsystem (which is very nearly the default on my openSUSE 13.1 notebook: only cpu and cpuacct are combined). With all that we have learned so far about cgroup subsystems, we might be able to understand some of the implications of keeping subsystems separate or together.
As we saw, particularly in part 3, quite a few subsystems do not perform any accounting or, when they do, do not use that accounting to impose any control. These are debug, net_cl, net_perf, device, freezer, perf_event, cpuset, and cpuacct. None of these make very heavy use of hierarchy and, in almost all cases, the functionality provided by hierarchy can be achieved separately.
A good example is the perf_event subsystem and the perf program that works with it. The perf tool can collect performance data for a collection of processes and provides various ways to select those processes, one of which is to specify the UID. When a UID is given, perf does not just pass this to the kernel to ask for all matching processes, but rather examines all processes listed in the /proc filesystem, selects those with the given UID, and asks the kernel to monitor each of those independently.
The only use that the perf_event subsystem makes of hierarchy is to collect subgroups into larger groups, so that perf can identify just one larger group and collect data for all processes in all groups beneath that one. Since the exact same effect could be achieved by having perf identify all the leaf groups it is interested in (whether they are in a single larger group or not) in a manner similar to its selection of processes based on UID, the hierarchy is really just a minor convenience — not an important feature. For similar reasons, the other subsystems listed could easily manage without any hierarchy.
There are two uses of hierarchy among these subsystems that cannot be brushed away quite so easily. The first is with the cpuset subsystem. It will sometimes look upward in the hierarchy to find extra resources to use in an emergency. This feature is an intrinsic dependency on hierarchy. As we noted when we first examined this subsystem, similar functionality could easily be provided without depending on hierarchy, so this is a minor exception.
The other use is most obvious in the devices subsystem. It relates not to any control that is imposed but to the configuration that is permitted: a subgroup is not permitted to allow access that its parent denies. This use of hierarchy is not for classifying processes so much as for administrative control. It allows upper levels to set policy that the lower levels must follow. An administrative hierarchy can be very effective at distributing authority, whether to user groups, to individual users, or to containers that might have their own sets of users. Having a single administrative hierarchy, possibly based on the one that systemd provides by default, is a very natural choice and would be quite suitable for all these non-accounting subsystems. Keeping any of them separate seems to be hard to justify.
Network Traffic Control — another control hierarchy
The remaining subsystems, which are the ones most deserving of the term "resource controllers", manage memory (including hugetlb), CPU, and block-I/O resources. To understand these, it will serve us to take a diversion and look at how network resources are managed.
Network traffic can certainly benefit from resource sharing and usage throttling, but we did not see any evidence for network resource control in our exploration of the different cgroup subsystems, certainly not in the same way as we did for block I/O and CPU resources. This is particularly relevant since one of the documented justifications for multiple hierarchies, as mentioned in a previous article, is that there can be a credible need to manage network resources separately from, for example, CPU resources.
Network traffic is in fact managed by a separate hierarchy, and this hierarchy is even separate from cgroups. To understand it we need at least a brief introduction to Network Traffic Control (NTC). The NTC mechanism is managed by the tc program. This tool allows a "queueing discipline" (or "qdisc") to be attached to each network interface. Some qdiscs are "classful" and these can have other qdiscs attached beneath them, one for each "class" of packet. If any of these secondary qdiscs are also classful, a further level is possible, and so on. This implies that there can be a hierarchy of qdiscs, or several hierarchies, one for each network interface.
The tc program also allows "filters" to be configured. These filters guide how network packets are assigned to different classes (and hence to different queues). Filters can key off various values, including bytes within the packet, the protocol used for the packet, or — significant to the current discussion — the socket that generated the packet. The net_cl cgroup subsystem can assign a "class ID" to each cgroup that is inherited by sockets created by processes in that cgroup, and this class ID is used to classify packets into different network queues.

Each packet will be classified by the various filters into one of the queues in the tree and then will propagate up to the root, possibly being throttled (for example by the Token Bucket Filter, tbf, qdisc) or being competitively scheduled (e.g. by the Stochastic Fair Queueing, sfq, qdisc). Once it reaches the root, it is transmitted.
This example emphasizes the value in having a hierarchy, and even a separate hierarchy, to manage scheduling and throttling for a resource. It also shows us that it does not need to be a separate cgroup hierarchy. A resource-local hierarchy can fit the need perfectly and, in that case, a separate cgroup hierarchy is not needed.
Each of the major resource controllers, for CPU, memory, block I/O, and network I/O, maintains a separate hierarchy to manage its resources. For the first three, those hierarchies are managed through cgroups, but the networking hierarchy is managed separately. This observation might suggest that there are two different sorts of hierarchies present here: some for tracking resources and some (possibly one "administrative hierarchy") for tracking processes.
The example in Documentation/cgroups/cgroups.txt does seem to acknowledge the possibility of a single hierarchy for tracking processes but worries that it "may lead to [a] proliferation of ... cgroups". If we included the net_cl subsystem in the systemd hierarchy described earlier, we would potentially need to create several sub-cgroups in each session for the different network classes that might be required. If other subsystems (e.g. cpu or blkio) wanted different classifications within each session, a combinatorial explosion of cgroups could result. Whether or not this is really a problem depends on internal implementation details, so we will delay further discussion of this until the next article, which focuses on exactly that subject.
One feature of cgroups hierarchies that is not obvious in the NTC hierarchies is the ability to delegate part of the hierarchy to a separate administrative domain when using containers. By only mounting a subtree of a cgroup's hierarchy in the namespace of some container, the container is limited to affecting just that subtree. Such a container would not, however, be limited in which class IDs can be assigned to different cgroups. This could appear to circumvent any intended isolation.
With networking, the issue is resolved using virtualization and indirection. A "veth" virtual network interface can be provided to the container that it can configure however it likes. Traffic from the container is routed to the real interface and can be classified according to the container it came from. A similar scheme could work for block I/O, but CPU or memory resource management could not achieve the same effect without full KVM-like virtualization. These would require a different approach for administrative delegation, such as the explicit sub-mount support that cgroups provides.
How separate is too separate?
As we mentioned last time, the accounting resource controllers need visibility into the ancestors of a cgroup to impose rate limiting effectively and need visibility into the siblings of a cgroup to effect fair sharing, so the whole hierarchy really is important for these subsystems.
If we take the NTC as an example, it could be argued that these hierarchies should be separate for each resource. NTC takes this even further than cgroups can, by allowing a separate hierarchy for each interface. blkio could conceivably want different scheduling structures for different block devices (swap vs database vs logging), but that is not supported by cgroups.
There is, however, a cost in excessive separation of resource control, much as there is (according to some) a cost in the separation of resource management as advocated for micro-kernels. This cost is the lack of "effective co-operation" identified by Tejun Heo as part of the justification for a unified hierarchy.
When a process writes to a file the data will first go into the page cache, thus consuming memory. At some later time, that memory will be written out to storage thus consuming some block-I/O bandwidth, or possibly some network bandwidth. So these subsystems are not entirely separate.
When the memory is written out, it will quite possibly not be written by the process that initially wrote the data, or even by any other process in the same cgroup. How, then, can this block-I/O usage be accounted accurately?
The memory cgroup subsystem attaches extra information to every page of memory so that it knows where to send a refund when the page is freed. It seems like we could account the I/O usage to this same cgroup when the page is eventually written, but there is one problem. That cgroup is associated with the memory subsystem and so could be in a completely different hierarchy. The cgroup used for memory accounting could be meaningless to the blkio subsystem.
There are a few different ways that this disconnect could be resolved:
- Record the process ID with each page and use it to identify what should be charged for both memory usage and block-I/O usage, as both subsystems understand a PID. One problem would be that processes can be very short lived. When a process exits, we would need to either transfer its outstanding resource charges to some other process or a cgroup, or to just discard them. This is similar to the issue we saw in the CPU scheduler, where accounting just to a process would not easily lead to proper fairness for process groups. Preserving outstanding charges efficiently could be a challenge.
- Invent some other identifier that can safely live arbitrarily long, can be associated with multiple processes, and can be used by each different cgroup subsystem. This is effectively the "extra level of indirection" that proverbially can solve any problem in computer science. The class ID that connects the net_cl subsystem with NTC is an example of such an identifier. While there can be multiple hierarchies, one for each interface, there is only a single namespace of class ID identifiers.
- Store multiple identifiers with each page, one for memory usage and one for I/O throughput.
The struct page_cgroup structure that is used to store extra per-page information for the memory controller currently costs 128 bits per page on a 64-bit system — 64 bits for a pointer to the owning cgroup and 64 bits for flags, three of which are defined. If an array index could be used instead of a pointer, and a billion groups were deemed to be enough, two indexes and an extra bit could be stored in half the space. Whether an index could be used with sufficient efficiency is another exercise left for the interested reader.
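As a purely illustrative rendering of that arithmetic (the thought experiment above made concrete, not a proposed kernel structure):

#include <stdint.h>

/* Two 30-bit indexes (room for about a billion cgroups each) plus a few
 * flag bits packed into 64 bits, i.e. half of what the current
 * pointer-plus-flags layout of struct page_cgroup uses per page. */
struct packed_page_info {
    uint64_t mem_cgroup_idx   : 30;   /* index into a memory-cgroup table */
    uint64_t blkio_cgroup_idx : 30;   /* index into a blkio-cgroup table */
    uint64_t flags            : 4;    /* the three defined flags, plus a spare */
};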
A good solution to this problem could have applicability in other situations: anywhere that one process consumes resources on behalf of another. The md RAID driver in Linux will often pass I/O requests directly down to the underlying device in the context of the process that initiated the request. In other cases, some work needs to be done by a helper process that will then submit the request. Currently, the CPU time to perform that work and the I/O throughput consumed by the request are charged to md rather than to the originating process. If some "consumer" identifier or identifiers could be attached to each I/O request, md and other similar drivers would have some chance of apportioning the resource charges accordingly.
Unfortunately, no good solution to this problem exists in the current implementation. While there are costs to excessive separation, those costs cannot be alleviated by simply attaching all subsystems to the same hierarchy.
In the current implementation, it seems best to keep the accounting subsystems, cpu, blkio, memory, and hugetlb, in separate hierarchies, to acknowledge that networking already has a separate hierarchy thanks to the NTC, and to keep all the non-accounting subsystems together in an administrative hierarchy. These will depend on intelligent tools to effectively combine separate cgroups when needed.
Answers ...
We are now in a position to answer a few more of the issues that arose in an earlier article in this series. One was the question of how groups are named. As we saw above, this is the responsibility of whichever process initiates the mkdir command. This contrasts with job control process groups and sessions, for which the kernel assigns a name in an (almost) arbitrary way when a process calls setsid() or setpgid(0,0). The difference may be subtle, but it does seem to say something about the expected authority structures. For job control process groups, the decision to form a new group comes from within a member of the new group. For cgroups, the decision is expected to come from outside. Earlier, we observed that embodying an administrative hierarchy in the cgroups hierarchy seemed to make a lot of sense. The fact that names are assigned from outside aligns with that observation.
Another issue was whether it was possible to escape from one group into another. Since moving a process involves writing the process ID to a file in the cgroup filesystem, this can be done by any process with write access to that file, using normal filesystem access checks. When a PID is written to that file, there is a further check that the owner of the process performing the write is also the owner of the process being added, or is privileged. This means that any user can move any of their processes into any group where they have write access to cgroup.procs, irrespective of how much of the hierarchy that crosses.
Put another way, we can restrict where a process is moved to, but there is much less control over where it can be moved from. A cgroup can only be considered to be "closed" if the owners of all processes in it are barred from moving a process into any cgroup outside of it. Like the hierarchy manipulations looked at earlier, these make some sense when thinking about the cgroup hierarchy as though it were a filesystem, but not quite as much when thinking about it as a classification scheme.
... and questions
The biggest question to come out of this discussion is whether there is a genuine need for different resources to be managed using different hierarchies. Is the flexibility provided by NTC well beyond need or does it set a valuable model for others to follow? A secondary question concerns the possibility for combinatorial explosion if divergent needs are imposed on a single hierarchy and whether the cost of this is disproportionate to the value. In either case, we need a clear understanding of how to properly charge the originator of a request that results in some service process consuming any of the various resources.
Of these questions, the middle one is probably the easiest: what exactly are the implementation costs of having multiple cgroups? So it is to this topic we will head next time, when we look at the various data structures that hook everything together under the hood.