The weighted TEO cpuidle governor
Sampat started with a bit of background. The TEO governor is based on the idea that timer events are the most likely way that a system will wake up; they also happen to be the most deterministic, since they are known before the system goes idle. But CPUs are subject to wakeups from other sources — interrupts in particular — and that complicates the situation. So the TEO governor maintains a short history of actual idle times that is used to come up with a (hopefully) better guess for what the next idle period will really be.
This history is an eight-entry circular buffer that indicates the recent pattern of non-timer wakeups. When the time comes to pick an idle state, the TEO governor looks at how many of those wakeups led to a sleep time that was less than expected; if the answer is "a majority of them", then the average observed sleep time is used to select an idle state that is shallower than would have otherwise been chosen. It works well, he said, but maybe it can be made better?
He started by testing the idea of whether more history would improve the situation. Increasing the size of the idle-times buffer to 128 did not really help, though. With a set of benchmark results, Sampat showed performance numbers that were sometimes better and sometimes worse; latency often improved, but power consumption got much worse. More history led to the selection of shallower sleep states more often, in other words.
It turns out, he said, that an average is not a good model of the distribution of sleep times, and a longer history may not reflect what is going to happen in the future. So he concluded that what is needed is to store and manage the history differently. The cpuidle governor would benefit from a way to answer a specific question: if the kernel is about to pick a given sleep state, what are the chances that the actual sleep time will better match a sleep that is one level shallower?
The weighted governor
The result was the weighted-history TEO governor, which replaces the history buffer with an NxN matrix, where N is the number of sleep states supported by the processor. The rows correspond to the sleep state the TEO governor would pick in any given situation; each column along that row indicates the probability that the corresponding state should actually be chosen. If the system in question had three sleep states ("shallow", "medium", and "deep"), the matrix would be initialized to look like this:
Shallow Medium Deep Shallow 70% 15% 15% Medium 15% 70% 15% Deep 15% 15% 70%
In other words, the matrix is set up so that the chances of each state selection being correct are 70%, with the remaining 30% spread across all the other states. Giovanni Gherdovich asked whether this initial distribution was hard-coded; the answer was "yes for now", and that the values have been chosen from a set of experiments Sampat ran.
After each wakeup, the actual behavior is measured and the probabilities are tweaked accordingly. The actual amount of adjustment that should be performed is still unclear, he said; more experiments and testing are needed.
When it comes time to make a prediction, the governor uses a biased random-number generator to pick a state; the biasing is done so that the chances of picking any particular state are the same as the observed probability that said state is the correct one. Why do that rather than just pick the highest-probability state? Often it turns out that the probabilities are fairly close, so a subset of the available states are all about as likely to be correct. The system will self-correct when the random-number generator steers it wrong.
Results
A number of benchmark results followed, showing variable results. With schedbench, latency was better some times and worse others, but power consumption was always less. The accuracy of the sleep-state choices was similar to the unweighted TEO governor for a small number of threads, but improved for larger numbers of threads. Rafael Wysocki, the author of the TEO governor, said that he was surprised to see TEO doing as well as it does; he deliberately chose a simple algorithm to minimize the overhead involved.
Sampat modified the ebizzy benchmark to make it do occasional sleeps, and got better results than TEO for both throughput and power consumption. The pgbench benchmark showed mixed results, with things getting worse as more clients were added. Hackbench results saw better results with relatively short run times, and a consistent 8-10% improvement in power consumption.
At this point, some confusion about the results became evident. Sampat characterized the results as "overshooting" or "undershooting", which most people expected to refer to the sleep state chosen, but actually referred to the sleep residency time. So "overshooting" meant picking a sleep state that was too shallow — the residency time overshot the estimate. This terminology seems highly likely to change in the near future.
Wysocki observed that picking a sleep state that is too shallow is generally better than picking one that is too deep. Not sleeping deeply enough will cost some power, but sleeping too deeply can hurt the performance of the system (in both latency and throughput terms).
Sampat finished with an overview of the work that is yet to be done. The aging algorithm still needs some work; workloads change over time, and old history can lead to poor predictions going forward. He tried simply decaying the highest-probability state, but that led to large variance in the results.
Another issue is the initial weights put in the matrix; these were
determined through experiments, but more testing is needed. Wysocki
disagreed, though, saying that with proper aging, the initial states don't
matter much. The governor will correct itself over time. But that depends
on the aging working well, so that is the important part to work on. The
session concluded with Wysocki saying that the work looks promising and can
be discussed further on the mailing list.
Index entries for this article | |
---|---|
Kernel | Power management/cpuidle |
Conference | OS-Directed Power-Management Summit/2020 |
Posted May 14, 2020 22:18 UTC (Thu)
by Paf (subscriber, #91811)
[Link] (6 responses)
The reasoning for the randomization seems... thin? It’s surely not any worse to just always pick the highest probability, and have a mechanism for tie breaking?
Posted May 14, 2020 22:57 UTC (Thu)
by corbet (editor, #1)
[Link] (5 responses)
Posted May 16, 2020 4:15 UTC (Sat)
by martinfick (subscriber, #4455)
[Link] (4 responses)
Posted May 16, 2020 13:24 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted May 16, 2020 19:33 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (2 responses)
Cheers,
Posted May 16, 2020 20:18 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted May 16, 2020 22:30 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted May 15, 2020 9:27 UTC (Fri)
by scientes (guest, #83068)
[Link]
Posted May 15, 2020 9:34 UTC (Fri)
by pgdx (guest, #119243)
[Link] (3 responses)
Posted May 15, 2020 14:49 UTC (Fri)
by nilsmeyer (guest, #122604)
[Link]
Posted May 15, 2020 18:06 UTC (Fri)
by elel (subscriber, #100484)
[Link]
Posted May 30, 2020 20:51 UTC (Sat)
by Hi-Angel (guest, #110915)
[Link]
The weighted TEO cpuidle governor
Bear in mind that the number of sleep states is typically rather larger than the three I showed in the simplified example. If you are going to pick between four choices, each of which has a probability between, say, 18-22%, does it really make sense to always pick the highest? In a sense, this is the mechanism for tie breaking.
Random numbers
Random numbers
Random numbers
Random numbers
Wol
Random numbers
Random numbers
Wol
The weighted TEO cpuidle governor
"when there is nothing for the system to do"
"when there is nothing for the system to do"
"when there is nothing for the system to do"
"when there is nothing for the system to do"