KS2010: Scheduler issues
Among those is the fact that group scheduling bundles both bandwidth (the amount of CPU time allocated to a group) and priority into a single value. There are some real scalability issues with group scheduling; the wakeup path, in particular, is getting costly. Paul complained about a lack of cooperative scheduling APIs. Management of group scheduling is difficult; for the desktop case, automatic tty-based grouping will make life easier, but it won't help on server systems. There is no notion of priority between groups and no upper bound on the bandwidth any given group can consume. There are load-balancing problems, especially when networking comes into the picture. And there is no notion of idle or batch scheduling in the group context.
With regard to load balancing, Paul said that the weight-based balancing tends to hurt CPU utilization. The balancing of groups is "primitive," leading to "herd migrations" which don't help the problem. There is no NUMA awareness in the group scheduler. The scheduler also does not account for the CPU time consumed by interrupt handling, leading to skewed scheduling results. Threaded interrupt handlers were suggested as a possible way of mitigating that last problem.
Google wants to use SCHED_IDLE for low-priority tasks, but it works poorly with load balancing. Since idle tasks have essentially no weight, the scheduler will not move them over to an idle core. These tasks also get a minimum share of the CPU which, while small, is still too high; it is not possible to isolate those loads entirely from the rest of the system.
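The size of that minimum share can be estimated with a rough sketch. It assumes that a task's share of a busy CPU is proportional to its load weight, and that the weights are 1024 for a nice-0 task and 3 for a SCHED_IDLE task; those numbers are recalled from kernels of that era and should be treated as assumptions, not as something stated in the session.

```python
# Assumed load weights (recalled from 2.6.3x-era kernels); illustrative only.
NICE_0_WEIGHT = 1024    # weight of a normal nice-0 task
IDLE_WEIGHT = 3         # weight given to a SCHED_IDLE task

def cpu_share(weight, all_weights):
    """Fraction of one CPU a task gets when every listed task is runnable."""
    return weight / sum(all_weights)

# One SCHED_IDLE task competing with a single nice-0 task on the same CPU.
weights = [NICE_0_WEIGHT, IDLE_WEIGHT]
idle_share = cpu_share(IDLE_WEIGHT, weights)
print(f"idle task share: {idle_share:.4%}")
```

Under these assumptions the idle task still receives roughly 0.3% of the CPU, which is small but not zero. The same tiny weight adds almost nothing to a CPU's computed load, which is consistent with the balancer seeing no reason to move such a task.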
Talking about scalability, Paul called out tg_shares_up(), which handles the distribution of CPU bandwidth. It is costly; since it is running across the Google cluster, he said, it may well be the function which is consuming the most CPU time on the planet. Something needs to be done to streamline that part of the system. Wakeup costs are high too; Paul would like to find a way to offload some of that cost to the CPU where the target process is running. That, he says, would spread out the costs and reduce cross-processor lock contention.
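To give a feel for the kind of work involved, here is a minimal sketch of splitting one group's shares across CPUs in proportion to the group's runnable weight on each CPU. The helper `distribute_shares` is hypothetical and is not the kernel's code; the proportional rule and the even-split fallback are assumptions made for illustration.

```python
def distribute_shares(group_shares, per_cpu_weight):
    """Split a group's total shares across CPUs by per-CPU runnable weight.

    Hypothetical helper, not the kernel's code: each CPU gets
    group_shares * (its weight / total weight), with an even split if
    the group has nothing runnable anywhere.
    """
    total = sum(per_cpu_weight)
    if total == 0:
        return [group_shares / len(per_cpu_weight)] * len(per_cpu_weight)
    return [group_shares * w / total for w in per_cpu_weight]

# A group with 1024 shares whose runnable weight sits mostly on CPU 0.
per_cpu = distribute_shares(1024, [3072, 1024, 0, 0])
print(per_cpu)  # [768.0, 256.0, 0.0, 0.0]
```

Even in this toy form the calculation touches every CPU for every group, and it has to be redone as load moves around; that is one plausible reading of why the real function becomes expensive on large machines with many groups.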
Google has posted some patches which allow the specification of an upper bound for CPU utilization; Paul would like to see that work merged. He would like to see the addition of priorities to group scheduling. Also nice would be a means by which the fairness window could be different for each group. High-priority groups should be given their fair share over relatively small periods; low-priority work really only needs its share over longer periods.
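One common way to express an upper bound on CPU utilization is a quota of run time per fixed period. The sketch below assumes that model; the class `GroupBandwidth` and its fields are hypothetical names chosen for illustration, and nothing here is taken from the posted patches.

```python
class GroupBandwidth:
    """Hypothetical quota/period limiter for one scheduling group."""

    def __init__(self, quota_us, period_us):
        self.quota_us = quota_us        # CPU time allowed per period
        self.period_us = period_us      # length of the accounting period
        self.used_us = 0                # CPU time consumed this period
        self.period_start_us = 0        # start of the current period

    def charge(self, now_us, ran_us):
        """Account ran_us of CPU time at now_us.

        Returns True if the group may keep running, False if it has
        used up its quota and should be throttled until the next period.
        """
        # Refill the budget whenever a new period has begun.
        if now_us - self.period_start_us >= self.period_us:
            self.period_start_us = now_us - (now_us % self.period_us)
            self.used_us = 0
        self.used_us += ran_us
        return self.used_us < self.quota_us

# A group capped at 25ms of CPU time in every 100ms period (25%).
bw = GroupBandwidth(quota_us=25_000, period_us=100_000)
results = [bw.charge(t, 10_000) for t in (10_000, 20_000, 30_000)]
after_refill = bw.charge(100_000, 10_000)
print(results, after_refill)  # [True, True, False] True
```

In this model the period length plays the role of the fairness window: a short period bounds how long a group can be starved or throttled, while a long period gives the same average share with coarser granularity.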
Paul also talked about yet another variant on deadline scheduling called EEVDF (earliest eligible virtual deadline first). It works with virtual deadlines, so it's not aimed at realtime use. But it enables the sort of scheduling that Google would like, and it mixes very well with the current CFS scheduler. Evidently it provides non-uniform latency periods, implementing the variable windows that Google would like, and has nice idle-scheduling behavior as well.
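The core selection rule can be illustrated with a toy simulation. This is a sketch of the textbook formulation, not of anything Google presented: each client has a weight and a request length, virtual time advances by the served time divided by the total weight, a client is eligible once virtual time reaches its eligible time, and its virtual deadline is its eligible time plus request/weight. The names `Client` and `simulate` are hypothetical, and all clients are assumed to stay runnable and to use each request in full.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class Client:
    name: str
    weight: int                       # relative CPU share
    request: int                      # length of each service request
    eligible: Fraction = Fraction(0)  # virtual eligible time
    deadline: Fraction = Fraction(0)  # virtual deadline

def simulate(clients, steps):
    """Run a toy EEVDF loop and return the order in which clients are served."""
    total_weight = sum(c.weight for c in clients)
    vtime = Fraction(0)
    for c in clients:
        c.deadline = c.eligible + Fraction(c.request, c.weight)
    order = []
    for _ in range(steps):
        ready = [c for c in clients if c.eligible <= vtime]
        if not ready:
            # Nobody is eligible yet: jump virtual time forward.
            vtime = min(c.eligible for c in clients)
            ready = [c for c in clients if c.eligible <= vtime]
        # Among eligible clients, serve the earliest virtual deadline.
        chosen = min(ready, key=lambda c: c.deadline)
        order.append(chosen.name)
        # Serving one request advances virtual time by request / total weight.
        vtime += Fraction(chosen.request, total_weight)
        # The next request becomes eligible at the old deadline.
        chosen.eligible = chosen.deadline
        chosen.deadline = chosen.eligible + Fraction(chosen.request, chosen.weight)
    return order

order = simulate([Client("A", 2, 1), Client("B", 1, 1)], 6)
print(order)  # ['A', 'B', 'A', 'A', 'B', 'A']
```

Client A, with twice the weight, receives four of the six slots. Because the deadline is request/weight past the eligible time, a client that asks for shorter requests gets nearer deadlines at the same weight, which is how a per-client latency window can fall out of this scheme.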
Then, there was talk of "cooperative scheduling," which includes a mechanism by which threads can be notified when they are preempted or migrated. The notification mechanism was not clearly described; it sounded like a variant on signals. There is also a desire for a "thread nomination" mechanism by which one thread can pick another to run at any given time.
There was also some talk of testing, which, Paul said, is hard. One thing that has helped a lot is linsched, a scheduler simulator which has recently been fixed up and posted by Google. Linsched makes it easy to run tests in a highly repeatable way.
Posted Nov 3, 2010 23:07 UTC (Wed)
by promotion-account (guest, #70778)
"The scheduler also does not account for the CPU time consumed by interrupt handling, leading to skewed scheduling results."
I guess that's handled now in the 2.6.37 merge window.
Posted Nov 4, 2010 9:10 UTC (Thu)
by sthibaul (✭ supporter ✭, #54477)
I guess this was Scheduler Activations?