The evolution of control groups

By Jonathan Corbet
October 29, 2013

2013 Kernel Summit

The control group (cgroup) subsystem is currently under intensive development; some of that work will lead, eventually, to ABI changes that are visible from user space. Given the amount of controversy around this subsystem, it was not surprising to see control groups show up on the 2013 Kernel Summit agenda. Tejun Heo led a session to describe the consensus that had been reached on the cgroup interface, only to find that there are still a few loose ends to be tied down.

Tejun started by reminding the group that the multiple hierarchy feature of cgroups, whereby processes can be placed in multiple, entirely different hierarchies, is going away. The unified hierarchy work is not entirely usable yet, though, because it requires that all controllers be enabled for the full hierarchy. Some controllers still are not hierarchical at all; they are being fixed over time. The behavior of controllers is being made more uniform as well.

One big change that has been decided upon recently is to make cgroup controllers work on a per-process basis; currently they apply per-thread instead. Among other things, that means that threads belonging to the same process can be placed in different control groups, leading to various headaches. Of all the controllers only the CPU controller has any business working with individual threads. For that case, some sort of special interface will be introduced that will, among other things, allow processes to set CPU policies for their own threads.

That interface, evidently, might be implemented with yet another special-purpose virtual filesystem. There was some concern about how the cgroup subsystem may be adding features that, essentially, constitute new system calls without review; there were also concerns about how the filesystem-based interface suffers from race conditions. Peter Zijlstra worried about how the new per-thread interface might look, saying that there were a lot of vague details that still need to be worked out. Linus wondered if it was really true that only the CPU controller needs to look at individual threads; some server users, he said have wanted per-thread control for other resources as well.

Linus also warned that it might not be possible to remove the old cgroup interface for at least ten years; as long as somebody is using it, it will need to be supported. Tejun seemed unworried about preserving the old interface for as long as it is needed. Part of Tejun's equanimity may come from a feeling that it will not actually be necessary to keep the old interface for that long; he said that even Google, which has complained about the unified hierarchy plans in the past, has admitted that it can probably make that move. So he doesn't see people needing the old interface for a long time.

In general, he said, the biggest use for multiple hierarchies has been to work around problems in non-hierarchical controllers; once those problems are fixed, there will be less need for that feature. But he still agrees that it will need to be maintained for some years, even though removal of multiple hierarchy support would simplify things a lot. Linus pointed out that, even if nobody is using multiple hierarchies currently, new kernels will still need to work on old distributions for a long time. Current users can be fixed, he said, but Fedora 16 cannot.

Hugh Dickins worried that, if the old interface is maintained, new users may emerge in the coming years. Should some sort of warning be added to tell those users to shift to the new ABI? James Bottomley said, to general agreement, that deprecation warnings just don't work; distributions just patch them out to avoid worrying their users. Tejun noted that new features will only be supported in the new ABI; that, hopefully, will provide sufficient incentive to use it. Hugh asked what would happen if somebody submitted a patch extending the old ABI; Tejun said that the bar for acceptance would be quite high in that case.

From the discussion, it was clear that numerous details are still in need of being worked out. Paul Turner said that there is a desire for a notification interface for cgroup hierarchy changes. That, he said, would allow a top-level controller to watch and, perhaps, intervene; he doesn't like that idea, since Google wants to be able to delegate subtrees to other processes. In general, there seems to be a lack of clarity about who will be in charge of the cgroup hierarchy as a whole; the systemd project has plans in that area, but that creates difficulties when, for example, a distribution is run from within a container. Evidently some sort of accord is in the works there, but there are other interesting questions, such as what happens when the new and old interfaces are used at the same time.

All told, there is a fair amount to be decided still. Meanwhile, Tejun said, the next concrete step is to fix the locking, which is currently too strongly tied to the internal locking of the virtual filesystem layer. After that is done, it should be possible to post a prototype showing how the new scheme will work. That posting may happen by the end of the year.

[Next: Linux-next and -stable].

Index entries for this article
Kernel	Control groups
Conference	Kernel Summit/2013

The evolution of control groups

Posted Oct 30, 2013 0:41 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

What about cgroups that are meant to be NOT hierarchic, like the freezer cgroup?

For example, I might want to freeze MySQL and Apache together even though they might be in entirely different cgroup subtrees.

The evolution of control groups

Posted Oct 30, 2013 8:17 UTC (Wed) by johannbg (guest, #65743) [Link] (2 responses)

"systemd project has plans in that area, but that creates difficulties when, for example, a distribution is run from within a container"

Jonathan could you clarify what exactly those difficulties are?

Systemd in containers

Posted Oct 30, 2013 12:44 UTC (Wed) by corbet (editor, #1) [Link] (1 responses)

I do not have any detailed knowledge there, that's just what was said. But, clearly, anything running in a container will also be in a subtree of the control group hierarchy. If systemd wants to control the *whole* hierarchy, there could be trouble: systemd running in the container wrestling with systemd outside of the container in particular.

But it's possible I'm totally confused.

Systemd in containers

Posted Oct 30, 2013 16:25 UTC (Wed) by dowdle (subscriber, #659) [Link]

systemd offers the systemd-detect-virt command which helps systemd become aware if it is inside of a VM, a container, or a container within a VM. What it does with that knowledge, I'm not sure... but I would expect it to not assume it is king of the system... and that there may or may not be another systemd running on the virt/container host.

The evolution of control groups

Posted Oct 31, 2013 17:16 UTC (Thu) by luto (subscriber, #39314) [Link] (18 responses)

This is the first thing that actively worries me about systemd. Every other systemd feature is in the category if "if you don't want it, don't use it".

The problem with a single control group hierarchy is that, if something like systemd claims control of that hierarchy and you want to use cgroups for something else, you're screwed. There's one hierarchy, and systemd is using it.

I would *love* to see some minimal extra kernel functionality added to allow efficiently tracking and killing process subtrees. Then systemd could use that instead of control groups and life would be good.

The evolution of control groups

Posted Oct 31, 2013 18:29 UTC (Thu) by raven667 (subscriber, #5198) [Link] (17 responses)

Well the drive for a single userspace cgroup client is coming from the kernel side, to try and control the complexity of API by making a guarantee that only one thing is fiddling with it on a system. As long as systemd provides it's own userspace API, over dbus maybe, for informing it of the changes you want to make then I think it will be good for all practical purposes.

I'm not an expert on cgroups but if I've been reading the coverage correctly it seems that one of the problems with delegated and multiple access in the existing system is that you can set priorities which are global values but you can only see the values for the processes in your group so you can't know whether your priorities will play well with the rest of the system. For example is 43 (out of 1-1024) more or less than another containers priority? Only a control daemon (or administrator) which can see everything knows what the "right" values are for priorities to be able to set them in a meaningful way. A global control daemon can have a more complex policy as to what priority values clients can actually get set to, preventing a container guest from setting priority higher than your management processes for example.

The evolution of control groups

Posted Oct 31, 2013 18:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

>I'm not an expert on cgroups but if I've been reading the coverage correctly it seems that one of the problems with delegated and multiple access in the existing system is that you can set priorities which are global values but you can only see the values for the processes in your group so you can't know whether your priorities will play well with the rest of the system.
This is a red herring. It's extremely easy to work around this by introducing an extra level of hierarchy.

The evolution of control groups

Posted Oct 31, 2013 18:40 UTC (Thu) by luto (subscriber, #39314) [Link] (2 responses)

As long as systemd provides it's own userspace API, over dbus maybe, for informing it of the changes you want to make then I think it will be good for all practical purposes.

I disagree. This might be good for all practical purposes if:

Systemd were widely-enough deployed that writing code that required systemd were a reasonably thing to do.
Systemd's API provided access to the full set of functionality available from the kernel
Systemd's API allowed the creation of a cgroup hierarchy isn't just subclasses of systemd units

(1) isn't true and won't be unless Canonical has a major change of heart. (2) is a big unknown, since AFAIK no one has seen this API. (3) appears to be about as likely as hell freezing over, for two reasons. First, the systemd people probably don't want to do that. Second, and more importantly, systemd needs the cgroup hierarchy for its own infternal purposes, and, since there's only one hierarchy, systemd can't expose a hierarchy that's too different from its own.

Since process subtree tracking is independently useful (see, for example, the subreaper stuff) and semantically has nothing whatsoever to do with partitioning kernel resources, it would be nice to see it split out.

Then, the One True Userspace Cgroup Daemon could be a separate project (that would cooperate with systemd by default) and which, importantly, could be turned off if you don't want it.

The evolution of control groups

Posted Oct 31, 2013 20:56 UTC (Thu) by raven667 (subscriber, #5198) [Link] (1 responses)

#1 seems like a non-sequitur, what systemd provides for an API is only relevant to machines running systemd, if the machines you run are running upstart or something else then you'll need to create your own management daemon and API with whatever rules you want. I'm sure some workloads will have custom cgroup managers created to handle them but if there isn't enough interest to create alternative generic management daemons, then there isn't enough interest to create alternative management daemons.

#3 In the new API that the kernel developers want which has just one complete hierarchy then any subtree is going to be a child of something no matter what you use to manage it.

The evolution of control groups

Posted Oct 31, 2013 21:38 UTC (Thu) by luto (subscriber, #39314) [Link]

#1 seems like a non-sequitur, what systemd provides for an API is only relevant to machines running systemd, if the machines you run are running upstart or something else then you'll need to create your own management daemon and API with whatever rules you want. I'm sure some workloads will have custom cgroup managers created to handle them but if there isn't enough interest to create alternative generic management daemons, then there isn't enough interest to create alternative management daemons.

On current kernels, I can use cgroups without systemd. But in the new regime, if I want to write a program that uses cgroups, I'll have to write a version for systemd boxes and a version for non-systemd boxes. I can't just link against libsystemdcgroups and expect it to work on all machines. This sucks.

#3 In the new API that the kernel developers want which has just one complete hierarchy then any subtree is going to be a child of something no matter what you use to manage it.

Easy example: suppose I want to have httpd live in a cgroup. Then I want to run, from the terminal, a program in the same cgroup as httpd.service.

Currently, this is possible and works fine (although it's ugly and I understand the kernel folks' desire to change it).

In a sensible new design, it would still be possible -- I would just switch my process into httpd's cgroup. No sweat. There's still just one hierarchy.

In the new systemd design, this is impossible. My program isn't in httpd.service from systemd's point of view (and it won't be), so I can't do it. But this is absurd -- I want to be able to tell systemd to fsck off and let me manage my own cgroups. This will not be possible anymore.

The evolution of control groups

Posted Oct 31, 2013 18:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (12 responses)

> As long as systemd provides it's own userspace API, over dbus maybe, for informing it of the changes you want to make then I think it will be good for all practical purposes.
I really detest the DBUS part, actually. I have a couple of systems that use subtree delegation just fine, all I need to do is to set appropriate permissions on the cgroups tree. I can use traditional Linux access control for that, no need for anything fancy.

In the DBUS use-case I don't even know where to begin - what access control systems are used there?

The evolution of control groups

Posted Nov 1, 2013 17:07 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (11 responses)

Probably polkit. It's used to deny access to things like the Suspend and Hibernate methods for powerkit (which is now provided by logind; not sure what access controls it uses).

The evolution of control groups

Posted Nov 1, 2013 22:05 UTC (Fri) by nix (subscriber, #2304) [Link] (10 responses)

polkit? You mean the program which has just thrown away all user policy customization and required everyone to rewrite all their local security policies in Javascript?

*That* horror?

The evolution of control groups

Posted Nov 1, 2013 22:15 UTC (Fri) by luto (subscriber, #39314) [Link] (1 responses)

To be fair, in my opinion it's a considerable improvement over writing the policies in pkla language.

(Also, there wasn't really an intelligent solution to the problem that polkit is solving before. The disaster that is /etc/security isn't really better. Neither are groups, especially if you want to tie actions to physical presence.)

The evolution of control groups

Posted Nov 1, 2013 22:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Right now cgroups are pretty much close to my perfect solution - I just can set the regular file permissions and inspect them using the usual tools.

The evolution of control groups

Posted Nov 1, 2013 22:21 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (7 responses)

https://fedorahosted.org/polkit-pkla-compat/

The evolution of control groups

Posted Nov 1, 2013 22:24 UTC (Fri) by luto (subscriber, #39314) [Link] (1 responses)

At least you don't have to use this thing :)

The evolution of control groups

Posted Nov 1, 2013 22:28 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

Compatibility plugs are usually a transitional mechanism. So no, you don't have to use it.

The evolution of control groups

Posted Nov 1, 2013 23:07 UTC (Fri) by nix (subscriber, #2304) [Link] (4 responses)

Nice! This should have been highlighted in huge flashing letters by the policy-kit people when they pushed this incompatible change on us. (Not that it *is* entirely compatible: I was *using* ReturnValue, dammit.)

The evolution of control groups

Posted Nov 1, 2013 23:31 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (3 responses)

That wouldn't have been possible. This was developed independent of the polkit developers sometime later.

The evolution of control groups

Posted Nov 3, 2013 19:54 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

That's... a bit of an indictment of the polkit developers. Should we trust them not to break all our configuration again without caring about the people they leave out in the cold? They did it once before, after all...

The evolution of control groups

Posted Nov 3, 2013 20:18 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

The older version is afaik parallel installable (Older version: PolicyKit, new version: polkit), so if you want to continue using the older format, you can as long as the distros include it. However the compatibility layer made the older version unnecessary. If you want to be relatively sure, you can ask them directly.

The evolution of control groups

Posted Nov 6, 2013 17:55 UTC (Wed) by nix (subscriber, #2304) [Link]

Unfortunately that's not true, as the naming change happened at a different time to (a lot earlier than) the config change. Anyone in the middle is stuck. Ah well, the compat thing should do the trick, with a bit of effort...

The evolution of control groups

Posted Nov 5, 2013 16:12 UTC (Tue) by sbohrer (guest, #61058) [Link]

Of all the controllers only the CPU controller has any business working with individual threads. For that case, some sort of special interface will be introduced that will, among other things, allow processes to set CPU policies for their own threads.

And what about the cpuset controller? Apparently I have no business wanting my threads to have different CPU affinities? Yes I know about sched_setaffinity() and pthread_setaffinity_np(). Using cpusets also allows creating/tweaking scheduler policies with cpuset.sched_load_balance and cpuset.sched_relax_domain_level and I'm not currently aware of any other APIs that do this.

The other advantage of using cpusets over sched_setaffinity() is that it becomes trivial to see and manage which threads/processes are pinned to which cores by walking the cpuset cgroups. For example you could make one cpuset group per core, plus a generic sysdefault group that contains everything else not specifically pinned. Then threads/processes can easily find and pin themselves to an empty core group and a master management daemon can twiddle the CPU affinity of the sysdefault group to keep all of the generic OS crap off of the cores that have tasks pinned. Again, yes this all could be done by repeatedly walking /proc and trying to identify and set the affinity of all processes in the system, but that sucks.