
Scheduler behavioral testing

July 10, 2019


OSPM

Validating scheduler behavior is a tricky affair, as multiple subsystems both compete and cooperate with each other to produce the task placement we observe. Valentin Schneider from Arm described the approach taken by his team (the folks behind energy-aware scheduling — EAS) to tackle this problem.

Energy-aware scheduling relies on several building blocks, and any bug or issue present in those could affect the final task placement. In no particular order, they are:

  • per-entity load tracking (PELT), to have an idea of how much CPU bandwidth tasks actually use
  • cpufreq (schedutil), to get just the right performance level for a given workload
  • misfit, to migrate CPU-bound tasks away from LITTLE CPUs

The LISA test framework has been designed to help validate these. At Arm, it is mainly used in two ways:
  • Fortnightly mainline integration: This consists of taking the latest tip/sched/core and adding all in-flight (on the list or soon to be) patches from the team. It is then tested on different Arm boards with several hundred test iterations.

    Rafael J. Wysocki pointed out that his patches don't land in tip/sched/core, but Arm folks would want to test them nonetheless, since they rely on a properly functioning cpufreq implementation. It was agreed that Wysocki's linux-pm branch should be part of the testing base for future mainline integrations.

  • Patch validation: Anyone can easily validate their patches (or patches they're reviewing) using LISA and some board on their desk or in the continuous integration (CI) system.

In short, the tests run by LISA consist of synthetic workloads generated with rt-app; their execution is traced and the resulting traces are post-processed. The point of using rt-app is to carefully craft synthetic workloads that target specific scheduler behavior while involving as few other subsystems as possible. It shouldn't be too difficult to see why hackbench doesn't fit that bill.
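
To make the "carefully crafted" part concrete: a periodic rt-app task that runs for a fixed busy time every period has a well-defined duty cycle, and therefore a utilization that PELT should converge to. The sketch below (plain Python with invented field names, not LISA's or rt-app's actual interface) shows the kind of expectation that can be derived from a workload profile before it is ever run:

    # Minimal sketch, not the actual LISA/rt-app API: a periodic task
    # described by its busy time and period, and the utilization we
    # would expect PELT to converge to for it (scale: 0..1024).

    from dataclasses import dataclass

    PELT_SCALE = 1024  # utilization scale used by the scheduler

    @dataclass
    class PeriodicTask:
        name: str
        run_us: int      # busy time per activation (microseconds)
        period_us: int   # activation period (microseconds)

        @property
        def duty_cycle(self) -> float:
            return self.run_us / self.period_us

        @property
        def expected_util(self) -> int:
            # A 20% duty cycle should settle around 0.2 * 1024 ≈ 205
            return round(self.duty_cycle * PELT_SCALE)

    tasks = [
        PeriodicTask("small", run_us=2_000, period_us=16_000),
        PeriodicTask("big",   run_us=12_000, period_us=16_000),
    ]

    for t in tasks:
        print(f"{t.name}: duty={t.duty_cycle:.2f} expected_util={t.expected_util}")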

An example of those tests is the EAS behavior test suite. Since LISA uses rt-app to craft its test workloads, it is possible to read the generated rt-app workload profile to estimate the utilization of the generated tasks before even running them. Furthermore, the energy model used by EAS is available to user space (via debugfs), so it can be fetched by the test tool.

With these two pieces of data, we can estimate an energy-optimal task placement and compute an optimal "energy budget". We can then run the workload and record the task placement via the sched_switch and sched_wakeup trace events. With the same energy model, we can estimate how much energy this placement cost, and compare the optimal versus estimated costs.
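
The kernel's energy model is more detailed than what fits here (it carries per-OPP cost tables for each performance domain), but a deliberately simplified sketch conveys the idea: pick the lowest operating point whose capacity covers a CPU's total utilization, and weight its power by how busy the CPU is at that operating point. The model values and function names below are illustrative, not the kernel's:

    # Simplified sketch of estimating the energy of a task placement from
    # an EAS-style energy model. Not the kernel's exact formula: each CPU
    # has a list of (capacity, power) operating points, and a placement
    # maps task utilizations onto CPUs.

    # Hypothetical energy model: per-CPU (capacity, power_mW) operating
    # points, ordered by capacity.
    ENERGY_MODEL = {
        0: [(128, 20), (256, 45), (446, 95)],     # LITTLE CPU
        1: [(512, 150), (768, 310), (1024, 520)], # big CPU
    }

    def opp_for(cpu, util):
        """Lowest OPP whose capacity covers 'util' (else the highest one)."""
        opps = ENERGY_MODEL[cpu]
        for cap, power in opps:
            if cap >= util:
                return cap, power
        return opps[-1]

    def estimate_energy(placement):
        """placement: {cpu: [task_util, ...]} -> relative energy estimate."""
        total = 0.0
        for cpu, utils in placement.items():
            util = sum(utils)
            cap, power = opp_for(cpu, util)
            # Power weighted by how busy the CPU is at that operating point.
            total += power * (util / cap)
        return total

    optimal  = estimate_energy({0: [100, 150], 1: []})  # pack on the LITTLE
    observed = estimate_energy({0: [100], 1: [150]})    # what the trace showed
    print(f"optimal={optimal:.1f} observed={observed:.1f} "
          f"overhead={(observed / optimal - 1) * 100:.0f}%")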

Energy is not everything, however, as we must make sure we maintain a sufficient level of performance. Rt-app provides some post-execution statistics that we monitor, letting us validate both energy efficiency and performance on a single workload execution.
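
One way to express such a performance check — assuming per-activation "slack" values, i.e. the time left before the next period once an activation's work completed, which is an assumption about the available statistics rather than rt-app's exact log format — is:

    # Hedged sketch: judge performance from per-activation slack values.
    # Negative slack means the activation overran its period. The data
    # layout here is an assumption, not rt-app's actual log format.

    def performance_ok(slack_us_per_activation, max_overrun_ratio=0.01):
        """Pass if at most 1% of activations overran their period."""
        overruns = sum(1 for slack in slack_us_per_activation if slack < 0)
        return overruns / len(slack_us_per_activation) <= max_overrun_ratio

    # Example: 200 activations, two of which finished late.
    slacks = [1500] * 198 + [-300, -120]
    print("performance OK:", performance_ok(slacks))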

Load-tracking signal (PELT) tests also rely on trace events. However, there are no PELT events in the upstream kernel and the addition of new trace events is frowned upon by the scheduler maintainers. The concern is that this would create an ABI that could be leveraged by some user-space tool and would then have to be maintained. For now, this testing is done with out-of-tree trace events.

Giovanni Gherdovich interjected that these trace events are very useful for debugging purposes and that he's been backporting them to his kernels for a few years now. Thankfully, Qais Yousef has been working on an upstream-friendly solution involving the separation of the trace points and the definition of their associated trace events, which hasn't met any severe objection. See this thread for more information.

Schneider also pointed out that he used this framework to write a test case for his first-ever mailing list patch review. According to him, it was fairly straightforward to create a synthetic workload and verify that the values obtained from the trace events behaved as described by the patch set, even though the actual implementation might have eluded him. This is a nice ramping up activity that benefits both reviewer and developer.

Now, as mentioned earlier, these tests target specific scheduler bits and thus rely heavily on having little interference from undesired tasks. Buildroot is used to obtain a minimal-yet-functioning system. When the user space cannot be changed (e.g. for testing on Android devices), the freezer control group is used for comparable (though inferior) results.
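
For the curious, the freezer approach boils down to moving the unwanted tasks into a frozen control group for the duration of the test. A rough sketch, assuming a cgroup v1 freezer hierarchy mounted at /sys/fs/cgroup/freezer (the path and task selection are illustrative, not LISA's actual implementation):

    # Rough sketch of quiescing non-test tasks with the cgroup v1 freezer.
    # Assumes the freezer controller is mounted at /sys/fs/cgroup/freezer;
    # this is an illustration, not how LISA actually does it.

    import os

    FREEZER = "/sys/fs/cgroup/freezer/lisa_frozen"

    def freeze_tasks(pids):
        os.makedirs(FREEZER, exist_ok=True)
        for pid in pids:
            # Kernel threads and exiting tasks may refuse the move; skip them.
            try:
                with open(os.path.join(FREEZER, "tasks"), "w") as f:
                    f.write(str(pid))
            except OSError:
                pass
        with open(os.path.join(FREEZER, "freezer.state"), "w") as f:
            f.write("FROZEN")

    def thaw_tasks():
        with open(os.path.join(FREEZER, "freezer.state"), "w") as f:
            f.write("THAWED")

The test tasks themselves, along with anything needed to keep the connection to the board alive (sshd, adbd), are naturally left out of the frozen group, and everything is thawed once the workload completes.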

Still, all of this careful "white-rooming" cannot prevent system tasks from executing, such as sshd, adbd, or NFS exchanges. That is why the tests also monitor non-test-related tasks ("noisy" tasks) and assert that their runtime was not too significant. Should that not be the case, the test result is changed to "undecided". This extra result type (compared to "pass" and "fail") was added to prevent real failures from being ignored: if a test is known to be easily impacted by background activity and has a ~20% failure rate because of that, those failures would never be investigated (akin to the boy who cried wolf). Properly discarding test results when the initial conditions and expected environment are not met makes it easier to detect genuine bugs and errors.
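
The decision logic amounts to a third verdict gated on how much CPU time the noisy tasks consumed; a minimal sketch (threshold and names invented for illustration) might look like this:

    # Sketch of the "undecided" logic: if tasks unrelated to the test ran
    # for too large a share of the test window, the result is neither a
    # pass nor a failure. Threshold and names are illustrative.

    NOISE_THRESHOLD = 0.01  # background tasks may use at most 1% of the window

    def verdict(passed, noise_runtime_s, test_duration_s,
                threshold=NOISE_THRESHOLD):
        if noise_runtime_s / test_duration_s > threshold:
            return "undecided"   # environment too noisy to trust the result
        return "pass" if passed else "fail"

    # sshd + sugov + rcu_preempt ran for 0.05s of a 10s test: still decisive.
    print(verdict(passed=False, noise_runtime_s=0.05, test_duration_s=10))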

The list of usual noisy tasks was given for the HiKey 960:

  • irq/63-tsensor (thermal alarm IRQ)
  • sshd
  • rcu_preempt
  • sugov (schedutil governor kthread)

These tasks mostly run for less than 1% of the entire test duration, which is acceptable. Still, it is interesting to know what else is being run on the system.




