The evolution of control groups
Tejun Heo started by reminding the group that the multiple hierarchy feature of cgroups, whereby processes can be placed in multiple, entirely different hierarchies, is going away. The unified hierarchy work is not entirely usable yet, though, because it requires that all controllers be enabled for the full hierarchy. Some controllers still are not hierarchical at all; they are being fixed over time. The behavior of controllers is being made more uniform as well.
One big change that has been decided upon recently is to make cgroup controllers work on a per-process basis; currently they apply per-thread instead. Among other things, that means that threads belonging to the same process can be placed in different control groups, leading to various headaches. Of all the controllers only the CPU controller has any business working with individual threads. For that case, some sort of special interface will be introduced that will, among other things, allow processes to set CPU policies for their own threads.
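To make the distinction concrete, here is a rough sketch (mine, not from the session) of what per-thread placement looks like under the current cgroup interface: the v1 "tasks" file accepts individual thread IDs, while "cgroup.procs" moves a whole process. The mount point and group names below are invented for the example.

```c
/* Sketch: per-thread vs. per-process placement under the current (v1)
 * cgroup interface.  Assumes the cpu controller is mounted at
 * /sys/fs/cgroup/cpu and that the "fast" and "slow" groups already
 * exist; the paths and names are made up. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static void write_id(const char *path, long id)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%ld\n", id);
        if (fclose(f)) {
                perror(path);
                exit(1);
        }
}

int main(void)
{
        /* "tasks" takes thread IDs: this moves only the calling thread,
         * leaving the process's other threads wherever they were. */
        write_id("/sys/fs/cgroup/cpu/fast/tasks", syscall(SYS_gettid));

        /* "cgroup.procs" takes a process ID and moves the whole thread
         * group: the per-process behavior that is to become the rule. */
        write_id("/sys/fs/cgroup/cpu/slow/cgroup.procs", getpid());
        return 0;
}
```

It is exactly this scattering of one process's threads across different groups that the per-process change would rule out, except through the CPU controller's planned special-case interface.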
That thread-level interface, evidently, might be implemented with yet another special-purpose virtual filesystem. There was some concern about how the cgroup subsystem may be adding features that, essentially, constitute new system calls without review; there were also concerns about how the filesystem-based interface suffers from race conditions. Peter Zijlstra worried about how the new per-thread interface might look, saying that there were a lot of vague details that still need to be worked out. Linus wondered if it was really true that only the CPU controller needs to look at individual threads; some server users, he said, have wanted per-thread control for other resources as well.
Linus also warned that it might not be possible to remove the old cgroup interface for at least ten years; as long as somebody is using it, it will need to be supported. Tejun seemed unworried about preserving the old interface for as long as it is needed. Part of Tejun's equanimity may come from a feeling that it will not actually be necessary to keep the old interface for that long; he said that even Google, which has complained about the unified hierarchy plans in the past, has admitted that it can probably make that move. So he doesn't see people needing the old interface for a long time.
In general, he said, the biggest use for multiple hierarchies has been to work around problems in non-hierarchical controllers; once those problems are fixed, there will be less need for that feature. But he still agrees that it will need to be maintained for some years, even though removal of multiple hierarchy support would simplify things a lot. Linus pointed out that, even if nobody is using multiple hierarchies currently, new kernels will still need to work on old distributions for a long time. Current users can be fixed, he said, but Fedora 16 cannot.
Hugh Dickins worried that, if the old interface is maintained, new users may emerge in the coming years. Should some sort of warning be added to tell those users to shift to the new ABI? James Bottomley said, to general agreement, that deprecation warnings just don't work; distributions just patch them out to avoid worrying their users. Tejun noted that new features will only be supported in the new ABI; that, hopefully, will provide sufficient incentive to use it. Hugh asked what would happen if somebody submitted a patch extending the old ABI; Tejun said that the bar for acceptance would be quite high in that case.
From the discussion, it was clear that numerous details are still in need of being worked out. Paul Turner said that there is a desire for a notification interface for cgroup hierarchy changes. That, he said, would allow a top-level controller to watch and, perhaps, intervene; he doesn't like that idea, since Google wants to be able to delegate subtrees to other processes. In general, there seems to be a lack of clarity about who will be in charge of the cgroup hierarchy as a whole; the systemd project has plans in that area, but that creates difficulties when, for example, a distribution is run from within a container. Evidently some sort of accord is in the works there, but there are other interesting questions, such as what happens when the new and old interfaces are used at the same time.
All told, there is a fair amount to be decided still. Meanwhile, Tejun said, the next concrete step is to fix the locking, which is currently too strongly tied to the internal locking of the virtual filesystem layer. After that is done, it should be possible to post a prototype showing how the new scheme will work. That posting may happen by the end of the year.
[Next: Linux-next and -stable].
Index entries for this article
Kernel: Control groups
Conference: Kernel Summit/2013
Posted Oct 30, 2013 0:41 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
For example, I might want to freeze MySQL and Apache together even though they might be in entirely different cgroup subtrees.
Posted Oct 30, 2013 8:17 UTC (Wed) by johannbg (guest, #65743)
Jonathan, could you clarify what exactly those difficulties are?
Posted Oct 30, 2013 12:44 UTC (Wed) by corbet (editor, #1)
I do not have any detailed knowledge there, that's just what was said. But, clearly, anything running in a container will also be in a subtree of the control group hierarchy. If systemd wants to control the *whole* hierarchy, there could be trouble: systemd running in the container wrestling with systemd outside of the container in particular.
But it's possible I'm totally confused.
Posted Oct 31, 2013 17:16 UTC (Thu) by luto (subscriber, #39314)
The problem with a single control group hierarchy is that, if something like systemd claims control of that hierarchy and you want to use cgroups for something else, you're screwed. There's one hierarchy, and systemd is using it.
I would *love* to see some minimal extra kernel functionality added to allow efficiently tracking and killing process subtrees. Then systemd could use that instead of control groups and life would be good.
Posted Oct 31, 2013 18:29 UTC (Thu) by raven667 (subscriber, #5198)
I'm not an expert on cgroups, but if I've been reading the coverage correctly, it seems that one of the problems with delegated and multiple access in the existing system is that you can set priorities, which are global values, but you can only see the values for the processes in your group, so you can't know whether your priorities will play well with the rest of the system. For example, is 43 (out of 1-1024) more or less than another container's priority? Only a control daemon (or administrator) which can see everything knows what the "right" values are for priorities to be able to set them in a meaningful way. A global control daemon can have a more complex policy as to what priority values clients can actually get set to, preventing a container guest from setting priority higher than your management processes, for example.
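A concrete illustration of why that visibility matters (my numbers, assuming cpu.shares-style proportional weights, where contending sibling groups divide CPU time in proportion to their values): the same setting can mean almost nothing or almost everything depending on siblings you may not be able to see.

```c
/* Sketch: a relative weight is meaningless in isolation; what a group
 * receives depends on the weights of the siblings it contends with. */
#include <stdio.h>

static double share(double mine, double siblings)
{
        return 100.0 * mine / (mine + siblings);
}

int main(void)
{
        /* The same value, 43, against two different sets of siblings. */
        printf("43 next to 1024 -> %.1f%% of contended CPU\n", share(43, 1024));
        printf("43 next to 10   -> %.1f%% of contended CPU\n", share(43, 10));
        return 0;
}
```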
Posted Oct 31, 2013 18:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
This is a red herring. It's extremely easy to work around this by introducing an extra level of hierarchy.
Posted Oct 31, 2013 18:40 UTC (Thu) by luto (subscriber, #39314)
> As long as systemd provides it's own userspace API, over dbus maybe, for informing it of the changes you want to make then I think it will be good for all practical purposes.
I disagree. This might be good for all practical purposes if:
(1) isn't true and won't be unless Canonical has a major change of heart. (2) is a big unknown, since AFAIK no one has seen this API. (3) appears to be about as likely as hell freezing over, for two reasons. First, the systemd people probably don't want to do that. Second, and more importantly, systemd needs the cgroup hierarchy for its own internal purposes, and, since there's only one hierarchy, systemd can't expose a hierarchy that's too different from its own.
Since process subtree tracking is independently useful (see, for example, the subreaper stuff) and semantically has nothing whatsoever to do with partitioning kernel resources, it would be nice to see it split out.
Then, the One True Userspace Cgroup Daemon could be a separate project (that would cooperate with systemd by default) and which, importantly, could be turned off if you don't want it.
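For reference, the "subreaper stuff" refers to the PR_SET_CHILD_SUBREAPER prctl() operation that went into Linux 3.4. A minimal sketch (mine) of how a service manager can use it to keep orphaned descendants of a service from reparenting to init, which covers the tracking side of what cgroups are used for, though not resource control:

```c
/* Minimal sketch: mark the calling process as a child subreaper so that
 * orphaned descendants reparent to it (and deliver SIGCHLD to it) instead
 * of going to init.  PR_SET_CHILD_SUBREAPER appeared in Linux 3.4. */
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0)) {
                perror("prctl");
                return 1;
        }

        /* ... spawn a service here; when its children orphan themselves
         * (double-forking daemons, for example), they remain our
         * descendants and can be waited on and signalled from here. */
        for (;;) {
                pid_t pid = wait(NULL);

                if (pid < 0)
                        break;          /* no children left */
                printf("reaped %d\n", pid);
        }
        return 0;
}
```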
Posted Oct 31, 2013 20:56 UTC (Thu) by raven667 (subscriber, #5198)
#1 seems like a non-sequitur, what systemd provides for an API is only relevant to machines running systemd, if the machines you run are running upstart or something else then you'll need to create your own management daemon and API with whatever rules you want. I'm sure some workloads will have custom cgroup managers created to handle them but if there isn't enough interest to create alternative generic management daemons, then there isn't enough interest to create alternative management daemons.
#3 In the new API that the kernel developers want which has just one complete hierarchy then any subtree is going to be a child of something no matter what you use to manage it.
Posted Oct 31, 2013 21:38 UTC (Thu) by luto (subscriber, #39314)
On current kernels, I can use cgroups without systemd. But in the new regime, if I want to write a program that uses cgroups, I'll have to write a version for systemd boxes and a version for non-systemd boxes. I can't just link against libsystemdcgroups and expect it to work on all machines. This sucks.
> #3 In the new API that the kernel developers want which has just one complete hierarchy then any subtree is going to be a child of something no matter what you use to manage it.
Easy example: suppose I want to have httpd live in a cgroup. Then I want to run, from the terminal, a program in the same cgroup as httpd.service.
Currently, this is possible and works fine (although it's ugly and I understand the kernel folks' desire to change it).
In a sensible new design, it would still be possible -- I would just switch my process into httpd's cgroup. No sweat. There's still just one hierarchy.
In the new systemd design, this is impossible. My program isn't in httpd.service from systemd's point of view (and it won't be), so I can't do it. But this is absurd -- I want to be able to tell systemd to fsck off and let me manage my own cgroups. This will not be possible anymore.
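For what it's worth, the "switch my process into httpd's cgroup" step described here is, on current kernels, just a write of one's own PID into the target group's cgroup.procs file. The path below is only an example of where a systemd-managed v1 hierarchy might keep the service's group; it is not taken from the comment.

```c
/* Sketch: move the calling process into another group by writing our PID
 * to that group's cgroup.procs.  The path is an invented example of a
 * systemd-managed v1 location for httpd.service. */
#include <stdio.h>
#include <unistd.h>

#define TARGET "/sys/fs/cgroup/systemd/system.slice/httpd.service/cgroup.procs"

int main(void)
{
        FILE *f = fopen(TARGET, "w");

        if (!f) {
                perror(TARGET);
                return 1;
        }
        fprintf(f, "%d\n", getpid());
        return fclose(f) ? 1 : 0;
}
```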
Posted Oct 31, 2013 18:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
I really detest the DBUS part, actually. I have a couple of systems that use subtree delegation just fine, all I need to do is to set appropriate permissions on the cgroups tree. I can use traditional Linux access control for that, no need for anything fancy.
In the DBUS use-case I don't even know where to begin - what access control systems are used there?
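The "appropriate permissions" approach mentioned above is, under cgroup v1, ordinary file ownership: chown the subtree's directory so the delegate can create child groups, and its task files so it can move its own processes. A minimal sketch, with an invented path and uid; which files need handing over can vary a bit by controller, but directory plus tasks/cgroup.procs is the usual minimum.

```c
/* Sketch: delegating a v1 cgroup subtree with plain file permissions.
 * The path and uid/gid are invented for the example. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SUBTREE "/sys/fs/cgroup/cpu/myapp"

static void must(int ret, const char *what)
{
        if (ret) {
                perror(what);
                exit(1);
        }
}

int main(void)
{
        uid_t uid = 1000;       /* the user the subtree is delegated to */
        gid_t gid = 1000;

        if (mkdir(SUBTREE, 0755) && errno != EEXIST) {
                perror(SUBTREE);
                exit(1);
        }

        /* Owning the directory lets the delegate create child groups;
         * owning the task files lets it move its own processes around. */
        must(chown(SUBTREE, uid, gid), "chown " SUBTREE);
        must(chown(SUBTREE "/tasks", uid, gid), "chown tasks");
        must(chown(SUBTREE "/cgroup.procs", uid, gid), "chown cgroup.procs");
        return 0;
}
```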
Posted Nov 1, 2013 17:07 UTC (Fri) by mathstuf (subscriber, #69389)
Posted Nov 1, 2013 22:05 UTC (Fri) by nix (subscriber, #2304)
*That* horror?
Posted Nov 1, 2013 22:15 UTC (Fri) by luto (subscriber, #39314)
(Also, there wasn't really an intelligent solution to the problem that polkit is solving before. The disaster that is /etc/security isn't really better. Neither are groups, especially if you want to tie actions to physical presence.)
Posted Nov 5, 2013 16:12 UTC (Tue) by sbohrer (guest, #61058)
> Of all the controllers only the CPU controller has any business working with individual threads. For that case, some sort of special interface will be introduced that will, among other things, allow processes to set CPU policies for their own threads.
And what about the cpuset controller? Apparently I have no business wanting my threads to have different CPU affinities? Yes, I know about sched_setaffinity() and pthread_setaffinity_np(). Using cpusets also allows creating/tweaking scheduler policies with cpuset.sched_load_balance and cpuset.sched_relax_domain_level, and I'm not currently aware of any other APIs that do this.

The other advantage of using cpusets over sched_setaffinity() is that it becomes trivial to see and manage which threads/processes are pinned to which cores by walking the cpuset cgroups. For example, you could make one cpuset group per core, plus a generic sysdefault group that contains everything else not specifically pinned. Then threads/processes can easily find and pin themselves to an empty core group, and a master management daemon can twiddle the CPU affinity of the sysdefault group to keep all of the generic OS crap off of the cores that have tasks pinned. Again, yes, this all could be done by repeatedly walking /proc and trying to identify and set the affinity of all processes in the system, but that sucks.
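The layout described here is straightforward to set up by hand. The sketch below is my own illustration, not the commenter's code: it assumes cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset (a common but not universal location), memory node 0, and invented group names.

```c
/* Sketch: one cpuset group per core plus a catch-all "sysdefault" group,
 * as described in the comment above.  Paths and names are assumptions. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ROOT "/sys/fs/cgroup/cpuset"

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
                perror(path);
                exit(1);
        }
}

/* Create ROOT/<name> and assign it CPUs and memory nodes; a v1 cpuset
 * cannot accept tasks until cpuset.cpus and cpuset.mems have been set. */
static void make_group(const char *name, const char *cpus)
{
        char path[256];

        snprintf(path, sizeof(path), ROOT "/%s", name);
        if (mkdir(path, 0755) && errno != EEXIST) {
                perror(path);
                exit(1);
        }
        snprintf(path, sizeof(path), ROOT "/%s/cpuset.cpus", name);
        write_str(path, cpus);
        snprintf(path, sizeof(path), ROOT "/%s/cpuset.mems", name);
        write_str(path, "0");
}

int main(void)
{
        char name[32], cpus[16], path[256], tid[32];
        int cpu;

        /* One group per core 1-3; everything not explicitly pinned is
         * meant to live in "sysdefault", confined to core 0. */
        for (cpu = 1; cpu <= 3; cpu++) {
                snprintf(name, sizeof(name), "core%d", cpu);
                snprintf(cpus, sizeof(cpus), "%d", cpu);
                make_group(name, cpus);
        }
        make_group("sysdefault", "0");

        /* A pinned thread claims core2 by writing its thread ID to that
         * group's tasks file; a management daemon can see who owns which
         * core simply by reading these files. */
        snprintf(path, sizeof(path), ROOT "/core2/tasks");
        snprintf(tid, sizeof(tid), "%ld", (long)syscall(SYS_gettid));
        write_str(path, tid);
        return 0;
}
```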