Testing for kernel performance regressions
Still, there are places where more formalized regression testing could be helpful. Your editor has, over the years, heard a large number of presentations given by large "enterprise" users of Linux. Many of them expressed the same complaint: they upgrade to a new kernel (often skipping several intermediate versions) and find that the performance of their workloads drops considerably. Somewhere over the course of a year or so of kernel development, something got slower and nobody noticed. Finding performance regressions can be hard; they often only show up in workloads that do not exist except behind several layers of obsessive corporate firewalls. But the fact that there is relatively little testing for such regressions going on cannot help.
Recently, Mel Gorman ran an extensive set of benchmarks on a set of machines and posted the results. He found some interesting things that tell us about the types of performance problems that future kernel users may encounter.
His results include a set of scheduler tests, consisting of the "starve," "hackbench," "pipetest," and "lmbench" benchmarks. On an Intel Core i7-based system, the results were generally quite good; he noted a regression in 3.0 that was subsequently fixed, and a regression in 3.4 that still exists, but, for the most part, the kernel has held up well (and even improved) for this particular set of benchmarks. At least, until one looks at the results for other processors. On a Pentium 4 system, various regressions came in late in the 2.6.x days, and things got a bit worse again through 3.3. On an AMD Phenom II system, numerous regressions have shown up in various 3.x kernels, with the result that performance as a whole is worse than it was back in 2.6.32.
Mel has a hypothesis for why things may be happening this way: core kernel developers tend to have access to the newest, fanciest processors and are using those systems for their testing. So the code naturally ends up being optimized for those processors, at the expense of the older systems. Arguably that is exactly what should be happening; kernel developers are working on code to run on tomorrow's systems, so that's where their focus should be. But users may not get flashy new hardware quite so quickly; they would undoubtedly appreciate it if their existing systems did not get slower with newer kernels.
He ran the sysbench tool on three different filesystems: ext3, ext4, and xfs. All of them showed some regressions over time, with the 3.1 and 3.2 kernels showing especially bad swapping performance. Thereafter, things started to improve, with the developers' focus on fixing writeback problems almost certainly being a part of that solution. But ext3 is still showing a lot of regressions, while ext4 and xfs have gotten a lot better. The ext3 filesystem is supposed to be in maintenance mode, so it's not surprising that it isn't advancing much. But there are a lot of deployed ext3 systems out there; until their owners feel confident in switching to ext4, it would be good if ext3 performance did not get worse over time.
Another test is designed to determine how well the kernel does at satisfying high-order allocation requests (being requests for multiple, physically-contiguous pages). The result here is that the kernel did OK and was steadily getting better—until the 3.4 release. Mel says:
On the other hand, the test does well on idle systems, so the anti-fragmentation logic seems to be working as intended.
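For readers unfamiliar with the term, a high-order allocation is simply a request to the kernel's page allocator with a nonzero "order". The sketch below is not part of Mel's benchmarks and the function names are invented for illustration; it only shows what such a request looks like in kernel code:

```c
/*
 * Minimal sketch (illustrative only): a high-order allocation inside
 * the kernel.  The "order" argument is the log2 of the number of
 * physically contiguous pages requested; order 2 asks for 4 contiguous
 * pages, which can only be satisfied if fragmentation has not broken
 * free memory into smaller pieces.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *grab_contiguous_buffer(void)
{
	/* Ask for 2^2 = 4 physically contiguous pages. */
	struct page *pages = alloc_pages(GFP_KERNEL, 2);

	if (!pages)
		return NULL;	/* failure is often a fragmentation problem */
	return pages;
}

static void release_contiguous_buffer(struct page *pages)
{
	__free_pages(pages, 2);	/* the order must match the allocation */
}
```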
Quite a few other test results have been posted as well; many of them show regressions creeping into the kernel in the last two years or so of development. In a sense, that is a discouraging result; nobody wants to see the performance of the system getting worse over time. On the other hand, identifying a problem is the first step toward fixing it; with specific metrics showing the regressions and when they first showed up, developers should be able to jump in and start fixing things. Then, perhaps, by the time those large users move to newer kernels, these particular problems will have been dealt with.
That is an optimistic view, though, that is somewhat belied by the minimal response to most of Mel's results on the mailing lists. One gets the sense that most developers are not paying a lot of attention to these results, but perhaps that is a wrong impression. Possibly developers are far too busy tracking down the causes of the regressions to be chattering on the mailing lists. If so, the results should become apparent in future kernels.
Developers can also run these tests themselves; Mel has released the whole set under the name MMTests. If this test suite continues to advance, and if developers actually use it, the kernel should, with any luck at all, see fewer core performance regressions in the future. That should make users of all systems, large or small, happier.
| Index entries for this article | |
|---|---|
| Kernel | Development tools/MMTests |
| Kernel | Performance regressions |
Posted Aug 3, 2012 22:18 UTC (Fri)
by rbrito (guest, #66188)
[Link]
I will be glad to report whatever results I see with my computers (and I may even throw in some powerpc for comparison).
Posted Aug 3, 2012 22:52 UTC (Fri)
by mikov (guest, #33179)
[Link] (24 responses)
Posted Aug 3, 2012 23:32 UTC (Fri)
by gregkh (subscriber, #8)
[Link] (23 responses)
Posted Aug 4, 2012 8:33 UTC (Sat)
by fhuberts (subscriber, #64683)
[Link] (12 responses)
Posted Aug 4, 2012 10:11 UTC (Sat)
by Cato (guest, #7643)
[Link] (2 responses)
Organising this for Linux would be harder given a bootable system is required, but it could be done.
Posted Aug 4, 2012 11:49 UTC (Sat)
by jnareb (subscriber, #46500)
[Link] (1 responses)
This works so well because, by default, the CPAN client runs tests when installing modules and sends those results to CPANtesters. So it is very easy to become a CPANtesters contributor.
Perhaps a request at install / upgrade time to perform regression benchmarks of one's system before first run would be a good idea for a Linux testers project?
Posted Aug 4, 2012 15:09 UTC (Sat)
by Cato (guest, #7643)
[Link]
Posted Aug 4, 2012 10:27 UTC (Sat)
by siim@p6drad-teel.net (subscriber, #72030)
[Link]
Posted Aug 4, 2012 14:42 UTC (Sat)
by gregkh (subscriber, #8)
[Link] (5 responses)
That's exactly what we do, and what we expect when we do the -rc releases.
You are running them to test that nothing breaks on your machine, right?
If not, please do so.
Posted Aug 4, 2012 21:34 UTC (Sat)
by smoogen (subscriber, #97)
[Link]
Posted Aug 6, 2012 10:24 UTC (Mon)
by geertj (subscriber, #4116)
[Link] (3 responses)
Wrong. I am not testing -rc releases, because I have other stuff to do. And I'm not complaining that my hardware doesn't work either, which makes my behavior wholly consistent. Just pre-empting that comment... :)
There is a lot more that could be done to make this "crowdsourced testing" more effective. Currently it is quite difficult to test out -rc releases. You have to know how to compile a kernel, and how to install and run it in your distribution. Certainly not rocket science, but not easy for the average distro user, which is who you'd need to go after for large-scale outsourced testing.
Just an idea: what if bootable live test images could be created for -rc releases? Ideally they would need no local storage, but optionally they could use a dedicated partition. The live image could do a whole bunch of tests and send back the results, together with information about the hardware the tests ran on. Those could be analyzed for problems. The tests could also include performance tests.
Every time a new -rc was released, you'd rebuild the live image and ask people via G+, Facebook, Twitter, the mailing list, etc., to burn it to a CD and boot their system with it. The CD runs for a few hours overnight, sends back the results, and then says "Thank you, I'm done". I bet you that this could increase your testing base by 10x. Of course you want to be pretty sure that it is safe and put a lot of safeguards in place to make sure it is (which of course doesn't mean you don't need a pretty scary disclaimer before running the CD).
Posted Aug 6, 2012 11:14 UTC (Mon)
by niner (subscriber, #26151)
[Link]
In short: it would be a step forward but there's still plenty of stuff which users would have to test manually. But of course: the perfect is the enemy of the good.
Posted Aug 8, 2012 16:33 UTC (Wed)
by broonie (subscriber, #7078)
[Link]
Posted Feb 7, 2013 10:52 UTC (Thu)
by rbrito (guest, #66188)
[Link]
Having to compile the kernels is a burden indeed, especially for those with weaker machines.
At least for Debian-based distributions it seems that Canonical provides daily compiled kernels, which is cool to have in mind (I only remembered that when I read your comment):
http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/
Posted Aug 4, 2012 23:11 UTC (Sat)
by krakensden (subscriber, #72039)
[Link] (1 responses)
It's a little sad.
Posted Aug 5, 2012 4:43 UTC (Sun)
by dirtyepic (guest, #30178)
[Link]
Posted Aug 4, 2012 10:30 UTC (Sat)
by man_ls (guest, #15091)
[Link] (6 responses)
Posted Aug 4, 2012 12:08 UTC (Sat)
by robert_s (subscriber, #42402)
[Link] (5 responses)
Posted Aug 4, 2012 15:21 UTC (Sat)
by man_ls (guest, #15091)
[Link] (3 responses)
Some examples: mount a virtualized SATA disk, format it, create a few files, read them back and check their contents. Mount a virtualized USB disk and do the same. Create several filesystems and stress-test them. Check that the commands issued are in correct order. And so on.
This is (obviously) spoken from utter ignorance of kernel internals, just from the point of view of basic software engineering: if it is not tested and verified it is not finished. I am sure kernel devs will know how to implement the idea or ignore it if the effort is not worth it. But for me it would be a fascinating project.
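For what it's worth, the simplest of the checks described above might look something like the sketch below; it is purely illustrative and assumes a hypothetical /mnt/test mount point for the filesystem under test:

```c
/*
 * Illustrative sketch: write a known pattern to a file on the
 * filesystem under test, read it back, and verify the contents.
 * The path is a placeholder for wherever the test filesystem is mounted.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const char *path = "/mnt/test/pattern.dat";	/* hypothetical mount point */
	char pattern[4096], readback[4096];
	FILE *f;

	/* Fill the buffer with a deterministic byte pattern. */
	for (size_t i = 0; i < sizeof(pattern); i++)
		pattern[i] = (char)(i & 0xff);

	f = fopen(path, "wb");
	if (!f || fwrite(pattern, 1, sizeof(pattern), f) != sizeof(pattern) || fclose(f)) {
		perror("write");
		return EXIT_FAILURE;
	}

	f = fopen(path, "rb");
	if (!f || fread(readback, 1, sizeof(readback), f) != sizeof(readback)) {
		perror("read");
		return EXIT_FAILURE;
	}
	fclose(f);

	if (memcmp(pattern, readback, sizeof(pattern)) != 0) {
		fprintf(stderr, "data corruption detected\n");
		return EXIT_FAILURE;
	}
	puts("contents verified");
	return EXIT_SUCCESS;
}
```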
Posted Aug 4, 2012 19:09 UTC (Sat)
by robert_s (subscriber, #42402)
[Link] (2 responses)
But that's the stuff that gets tested anyway, by people. The trouble _is_ with the esoteric hardware.
Posted Aug 4, 2012 22:39 UTC (Sat)
by man_ls (guest, #15091)
[Link] (1 responses)
Posted Aug 6, 2012 11:17 UTC (Mon)
by niner (subscriber, #26151)
[Link]
Posted Aug 6, 2012 10:27 UTC (Mon)
by njd27 (subscriber, #5770)
[Link]
I actually work on the Linux driver for a particular family of input devices from one manufacturer. Most of the customers are integrating devices which are targeting older kernel versions rather than mainline, so our main focus is there. But we do track patches that are going into mainline and make sure they are sane, and do some occasional testing.
Posted Aug 4, 2012 10:33 UTC (Sat)
by ajb (guest, #9694)
[Link]
In practice, I suspect most hardware companies don't have sufficient incentive to do this, but they might for some products. Some teams also don't really have a model, preferring to develop their device using FPGAs. (I'm not counting RTL models, which are too expensive to run to use for software testing)
Posted Aug 4, 2012 21:47 UTC (Sat)
by mikov (guest, #33179)
[Link] (1 responses)
- Should I be testing my distro's kernel or the latest mainline? If the latter, why am I testing a kernel I am not going to use for years, if at all?
- Where do I even report the bugs? LKML? Bugzilla? My distro?
In reality, it would be a full time job for a couple of people to deal with all this. Highly paid jobs. Not every business can afford that.
A stable source level API would go a long way towards improving the situation, but that is not likely to happen :-)
Posted Aug 4, 2012 22:11 UTC (Sat)
by dlang (guest, #313)
[Link]
You should test the mainline kernel precisely because your distro's kernel is going to be based on it.
You don't have to test every -rc kernel, but the more testing that you do, the less likely you are to run into problems when you upgrade to the latest release of your distro.
If you test only when your distro upgrades their kernel every 5 years, then finding where in the 5 years of development things went wrong is an impossible task.
If you test the released kernels every 3 months, there's only 3 months of work to go through.
If you test each kernel release around the -rc3-5 range, you are even better off as the developers are currently thinking about that release, and looking for problems to fix.
> - Where do I even report the bugs? LKML? Bugzilla? My distro?
That's easy: if you are testing a mainline kernel, LKML is the best place to report bugs.
Posted Aug 3, 2012 23:42 UTC (Fri)
by aliguori (guest, #30636)
[Link] (1 responses)
I had thought the kernel is moving towards autotest for this kind of stuff. autotest is actually pretty capable of harnessing benchmarks like this and collating results.
Posted Aug 4, 2012 23:43 UTC (Sat)
by pabs (subscriber, #43278)
[Link]
kerneloops.org is also still down :(
Posted Aug 4, 2012 10:52 UTC (Sat)
by copsewood (subscriber, #199)
[Link]
In one sense, the current problem resulting from the lack of a coherent test facility might be equivalent to the previous practice of trying to manage the kernel patch queue using an email spool without source code revision control. BitKeeper and then Git weren't introduced that long ago, and their absence must have constrained what Linus could realistically do. I suspect the test problem is also inherently likely to get worse until some kind of standardisation and incentive mechanism enables a major distributed community effort to come together into a coherent test facility. Even if it can't cover all automated test requirements, covering enough of them seems likely (if it's at all feasible) to greatly improve the quality of released software.
Those with the compute resources likely to be needed to crunch the test software (to the extent hardware emulation is possible) are not always the same as those with the incentive to write the test cases, when it comes to generic as opposed to hardware-specific kernel features. The incentive to contribute hardware emulations would be to get hardware onto a 'platinum level' support list so more purchasers buy it. The incentive to contribute test cases would be so that your tests are automatically run on time and regressions resulting from newer software are automatically reported.
Could the Linux Foundation, working with other interested parties, attract the resources to fund and develop a cloud-type test facility? This probably wouldn't work for drivers unless either software emulators for the hardware in question exist, or the physical hardware could be installed within the test farm using various standardised protocols allowing for timed tests, automated output comparisons, resets and so on. I guess the funding for this would mainly come from hardware manufacturers who want the highest possible level of kernel support.
So if this thought experiment ever leads to feasible development, there seem likely to be 3 main contributors.
a. A vendor-independent, trusted, and funded community body which runs the main test rig, e.g. the Linux Foundation.
b. Contributors of generic test cases for kernel features intended to run on many different types of hardware.
c. Contributors of physical hardware requiring dedicated device drivers, software emulators for that hardware and tests which can run on scaffolded or emulated hardware. (Eventually manufacturers could contractually commit to their customers to supporting physical and or emulated hardware within this facility for a stated period after it ceases to be in production.)
There seem likely to be a few reasons why this isn't feasible, but I can't think of any quite yet.
Posted Aug 4, 2012 22:12 UTC (Sat)
by drdabbles (guest, #48755)
[Link] (3 responses)
The next problem is that there is little incentive for the hardware vendors that contribute directly to the kernel (I'm looking at you, Intel) to also contribute a test suite for their contributions. Moreover, it's quite conceivable that contributing a test suite would allow people to reverse engineer the hardware in question to some extent. I don't see this as a problem, but then again I don't make billions of dollars a year selling chipsets embedded on nearly every device in existence. There may also be regulatory concerns here, much like open-source WiFi drivers had to contend with here in the US several years ago. So, it's a sticky situation.
Additionally, you have the problem of how to actually execute tests. Software bugs aren't always painfully obvious. You don't always get a panic or segfault when a program or driver messes up. In fact, sometimes the program runs perfectly unaware that it has completely ruined data. This problem of subtle bugs can be seen in audio chipset drivers frequently. Sometimes an upgrade causes audio artifacts, sometimes the ports are reversed, and sometimes the power saving mechanisms act wonky. But only after a suspend from a warm reboot where a particular bit in memory wasn't initialized to 0. These things are extremely hard to detect with a test suite, because the suite has no idea if the audio is scratchy or if the port a user expects to be enabled is working properly.
Finally, if you could overcome the issues above, you have the case where suspend/resume/reboot/long run time causes a problem. To test this, the test suite needs complete access to a computer at a very low level. Virtualization will only get you a small portion of the way there. Things like PCI pass through are making this kind of test easier, but that in itself invalidates tests on a very basic hardware access level. This is where the idea of hardware vendors contributing resources to a test farm becomes a great idea. And as a comment above mentioned, the Linux Foundation could create some incentives for this. A platinum list of vendors / devices would be excellent! My organization ONLY uses Linux and *BSD in the field, so having that list to purchase from would be a pretty big win for us.
I think the solution will have to be several layers. A social change will need to be made, where developers don't dread writing test suites for their software. A policy change may be needed such as, "No further major contributions will be accepted without a test suite attached as well". And finally, the technical requirements to actually execute these test suites.
The good news is that it would be pretty easy to boot a kernel with a TESTSUITE=yes option that kicks it into a mode where only test suites are executed. The box would never boot the OS, but would sit there running tests for all built (module or in-kernel) components, hammering on hardware to make drivers fail. Passing a further option that points at a storage device could be useful for saving results to a raw piece of hardware in a format that could be uploaded and analyzed easily.
Posted Aug 4, 2012 23:44 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (2 responses)
Posted Aug 4, 2012 23:54 UTC (Sat)
by drdabbles (guest, #48755)
[Link] (1 responses)
Having to install 2, 3, or 4 test suite packages just to run the tests means nobody will ever actually run them.
Perhaps a solution like Git, built specifically for the Linux kernel use case, could be helpful.
Posted Aug 5, 2012 2:56 UTC (Sun)
by shemminger (subscriber, #5739)
[Link]
Random testing is often better than organized testing! Organized testing works for benchmarks, but the developer in Taiwan who boots on a new box and reports that the wireless doesn't work is priceless.
Posted Aug 5, 2012 0:03 UTC (Sun)
by amworsley (subscriber, #82049)
[Link]
Posted Aug 6, 2012 13:31 UTC (Mon)
by cmorgan (guest, #71980)
[Link] (3 responses)
Phoronix has been doing kernel benchmarking for years now and has pointed out a bunch of kernel releases/distro releases that have performance regressions.
Chris
Posted Aug 6, 2012 22:45 UTC (Mon)
by dlang (guest, #313)
[Link] (2 responses)
Many of them really don't make sense, and some (the disk-related ones that I know of) are downright misleading, giving 'better' scores for situations where things are misbehaving.
Many people have tried to point this out and made no progress in getting them changed.
Posted Aug 6, 2012 23:02 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
But most other cases are fine. Phoronix usually captures quite real performance regressions.
Posted Aug 7, 2012 13:02 UTC (Tue)
by cmorgan (guest, #71980)
[Link]
Just wanted to point out that Phoronix has been a great resource for exactly the kind of benchmarking between kernel releases that the article was referring to, and has been doing so for years.
I wonder if any kernel developers have used the Phoronix results to help target fixes for various regressions.
Posted Aug 8, 2012 21:27 UTC (Wed)
by deater (subscriber, #11746)
[Link] (1 responses)
Posted Aug 8, 2012 23:00 UTC (Wed)
by dlang (guest, #313)
[Link]
However, if there is a series of tests over time, you can show that while there is a 10% noise factor in the tests, over the last X releases, there is a downwards trend or something like that.
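As a purely illustrative sketch of that idea, with invented numbers rather than real benchmark data, a least-squares fit of one score per release can separate a steady trend from release-to-release noise:

```c
/*
 * Illustrative sketch (scores are made up): fit a least-squares line
 * through one benchmark score per kernel release.  Individual results
 * bounce around by several percent, but the sign and size of the
 * fitted slope show whether there is a real trend.
 */
#include <stdio.h>

int main(void)
{
	/* Hypothetical throughput scores, one per release, oldest first. */
	const double score[] = { 100.0, 104.0, 97.0, 95.0, 99.0, 92.0, 94.0, 88.0 };
	const int n = sizeof(score) / sizeof(score[0]);
	double sx = 0, sy = 0, sxx = 0, sxy = 0;

	for (int i = 0; i < n; i++) {
		sx  += i;
		sy  += score[i];
		sxx += (double)i * i;
		sxy += i * score[i];
	}

	/* Standard least-squares slope over release index vs. score. */
	double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
	double mean  = sy / n;

	printf("mean score %.1f, trend %+.2f per release (%+.1f%% per release)\n",
	       mean, slope, 100.0 * slope / mean);
	return 0;
}
```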
I'd be willing to donate CPU cycles on most of my machines to test the kernel, drivers, etc. if the results of that would be aggregated somewhere...
>
> That's exactly what we do, and what we expect when we do the -rc releases.
> You are running them to test that nothing breaks on your machine, right?
A technical solution might be to simulate the hardware and test it automatically, perhaps in a special virtual machine. The initial effort would pay off after a few iterations.
In this case, anything is better than nothing. The more esoteric specialized hardware will probably be harder to emulate; but just testing the most common hardware interfaces would surely save a lot of time and eventually enable devs to change code that may currently be too brittle to touch.
Yes, so it is wasting the time of those people who have to painstakingly compile -rc kernels, load them and check that everything works. Not to speak of performance regressions in the drivers and filesystems, which I imagine must be hard to find and boring work. A publicly available test suite, meanwhile, could be run privately and also on the public git repos after every push. It is called continuous deployment; we do it at my company and it is fun, challenging stuff!
- there is no static set of hardware that I use. You can't buy the same hardware even if you wanted to. So, I would have to be testing all the time. Considering that the kernel changes all the time...
Part of the problem is that the kernel varies so much from release to release, often in un-git-bisectable ways.
I run regression tests of the overhead of the perf_event syscalls. You can see some results here.
There are often wild swings in the results of 5-10% from kernel release to release, but it just seems to be "noise". I've tried bisecting but it just gets you nowhere. When I went to the kernel devs they just said it's probably different cache layout and similar effects from unrelated pieces of the code. So sadly regressions are just completely lost in the noise unless they are *major* slowdowns.
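As a rough idea of what such an overhead test might look like (this is a minimal sketch, not deater's actual test suite, and the iteration count is arbitrary), one can time a perf_event_open()/enable/disable/read/close cycle:

```c
/*
 * Illustrative sketch of measuring perf_event syscall overhead:
 * time a loop of perf_event_open(), an enable/disable ioctl pair,
 * a read() of the counter, and close().  Linux-only.
 */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	/* No glibc wrapper exists, so call the syscall directly. */
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	struct timespec start, end;
	const int iterations = 10000;	/* arbitrary */

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.disabled = 1;	/* counter starts disabled */

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iterations; i++) {
		int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
		uint64_t count;

		if (fd < 0) {
			perror("perf_event_open");
			return 1;
		}
		ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
		ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
		if (read(fd, &count, sizeof(count)) != sizeof(count))
			perror("read");
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
	printf("%.0f ns per open/enable/disable/read/close cycle\n", ns / iterations);
	return 0;
}
```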