
Digging in the kernel dust

October 24, 2017

This article was contributed by Tom Yates


Kernel Recipes

Refactoring the kernel means taking some part of the kernel that is showing its age and rewriting it so it works better. Thomas Gleixner has done a lot of this over the past decade; he spoke at Kernel Recipes about the details of some of that work and the lessons that he learned. By way of foreshadowing how much fun this can be, he subtitled the talk "Digging in Dust".

Gleixner's original motivation for taking up his spade was to get the realtime (RT) patches into the mainline kernel, which he found involved constantly working around the shortcomings of the mainline code base. In addition, ten years of spending every working day digging around in dust can make you quite angry, he said, which can also be a big incentive to make things better.

Hotplug support

[Thomas Gleixner]

The first concrete example he discussed was CPU hotplug support. This was added in 2002, when Rusty Russell implemented it with a notifier-based architecture. According to Gleixner, the naming scheme was poor and, as the community worked on the code, further problems were added: the locking was handled by uninstrumented magic that deliberately evaded lockdep oversight, and the code developed interesting and undocumented ordering requirements. Moreover, startup and teardown of the registered notifiers did not mirror each other: startup reasonably ran high-priority notifiers first, but teardown also ran high-priority notifiers first, rather than unwinding in reverse order. This is not reasonable: a pile should be disassembled in the reverse of its build order, he said. This behavior meant that anything sensitive to startup/teardown symmetry ended up needing two notifiers. The absence of any debugging for the hotplug notifiers didn't help with figuring out what was going on.

One might expect such code to be less than completely robust; according to Gleixner it was known to be fragile for a long time. The response for most of this time was to apply successive layers of duct tape to hold it together; the attempt to add RT support made it fall apart completely, which is why Gleixner first picked up the pieces in 2012. He decided to convert the structure to a state machine and redo startup/teardown to be symmetric; he got as far as a proof-of-concept patch when he ran out of spare time. He handed the work over to a (tactfully unnamed) company that promised to pick it up, but the company merely applied yet more duct tape. So at the end of 2015, when he got some funding for the RT project, he revisited it; it has taken almost two years to complete the work.

The first step was to analyze all the notifiers and document the ordering requirements. During this process several dozen low-level bugs, such as resource leaks, memory leaks, and null-pointer problems, were found and fixed, predominantly in the error paths as these had received very little in-the-field testing over the years. Gleixner noted one advantage of a state machine that comes into play here: if startup fails halfway through, teardown is done using the regular state machine teardown code rather than some failure-scenario-specific code. Once the analysis and documentation were complete, a conversion of notifiers to hotplug states was started, gradually removing old infrastructure as it was no longer needed.
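
That state machine is what the rest of the kernel now sees through the hotplug-state API. As a rough sketch of how a subsystem registers a symmetric pair of callbacks, assuming a hypothetical "mysubsys" (cpuhp_setup_state() and CPUHP_AP_ONLINE_DYN are the real interface; everything else here is illustrative):

    #include <linux/cpu.h>
    #include <linux/cpuhotplug.h>

    /* Called for each CPU already online at registration time and
     * for every CPU brought up afterward. */
    static int mysubsys_cpu_online(unsigned int cpu)
    {
            /* set up per-CPU state for @cpu */
            return 0;
    }

    /* Called as each CPU goes down; undoes the online callback. */
    static int mysubsys_cpu_offline(unsigned int cpu)
    {
            /* tear down per-CPU state for @cpu */
            return 0;
    }

    static int __init mysubsys_init(void)
    {
            int ret;

            /* Register a dynamic hotplug state. If an online callback
             * fails partway through bringing a CPU up, the core unwinds
             * by running the teardown callbacks of the states already
             * completed, so no caller needs bespoke failure-path code. */
            ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mysubsys:online",
                                    mysubsys_cpu_online, mysubsys_cpu_offline);
            return ret < 0 ? ret : 0;
    }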

Once this was complete and the notifiers were gone, he tackled the locking problem, reimplementing the locks with semaphores. This meant that lockdep could see what was going on, revealing about 25 deadlocks lurking in the existing code. Although these were hard to trigger in the mainline kernel, which is probably why they had not been fixed before, the RT patches tend to expose them. Steven Rostedt's hotplug-stresstest script would usually lock up any machine after four or five runs.

One problem identified came from the decision to have a user-space procfs interface writing into a variable that was used, live, by the kernel. This tended to produce nasty race conditions, which were handled in about a dozen stacked patches, the net effect of which was to reduce the size of the race window until it was hard to see. Gleixner's preferred solution was to create a shadow variable that user space writes into, which the kernel's internal variable was synchronized with when (and only when) the system was in a state that allowed it to change safely.
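
As a rough illustration of that pattern (the names below are hypothetical, not the actual kernel code): the procfs handler writes only to a shadow copy, and the live value is updated from it only at a point known to be safe.

    /* Value consumed by the hot path; never written from user space. */
    static unsigned int param_active;

    /* Shadow copy owned by the procfs write handler. */
    static unsigned int param_shadow;

    /* procfs write path: touch only the shadow copy. */
    static void param_store(unsigned int val)
    {
            WRITE_ONCE(param_shadow, val);
    }

    /* Called only while the subsystem is quiescent (e.g. with the
     * relevant lock held), so the update cannot race with readers. */
    static void param_sync(void)
    {
            param_active = READ_ONCE(param_shadow);
    }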

Much was learned from this experience. First, if you unearth bugs in the kernel, you're expected to fix them. Gleixner and his colleagues drew the maintainers' attention to the first few bugs they found in the notifier callbacks, asking for advice on how to deal with them, and were either completely ignored or knocked back on the basis that the code had been working fine for five years, so it must be correct. Once they followed this up with patches, maintainers responded.

When digging in the kernel "the amount of crap you find is immense; it's really amazing". He added (to nervous giggles from the audience): "when you think you've seen the worst already, no, it's getting worse next week". Patience is required. You start off with something simple; you pull at it and other bits fall out that need attention; before you know it you have to rewrite stuff all over the place. This is normal, and you mustn't give up, he said. He also learned that corporations cannot be trusted to keep their promises.

Finally, estimating the effort needed for kernel reworking is hard; his estimate of the time required was off by a factor of two and his estimate of the person-hours required was off by a factor of three. In response to a question from the audience, he allowed that his estimate was reached by taking a reasonable guess, doubling it, and converting to the next largest unit; even so, it was off by a factor of two. In the end, the total work required to rework CPU hotplugging was about two person-years, which worked out to about 1.2 patches per person-day.

Timer wheel

The next example he looked at was the timer wheel rework, which he approached by going back to the original work done in 1997. He and his colleagues reread the original paper on which the work was based, re-examined why it was implemented in the first place, and tried to work out how it could be done better. They found that when you try to go back 20 years, everyone you ask strenuously denies remembering anything. It took them three months to do the analysis work (tracing and examining the code, figuring out the use cases and the precision each case required), followed by two months to design and write a proof-of-concept (during which they implemented several possible designs) and a month to do the final implementation, including the posting and review process.

Certain enhancements are still works-in-progress, such as the business of choosing which CPU to place a new timer on. It's desirable to have a timer fire on a busy CPU, to avoid waking a CPU from a lower-power state just to deal with it. The old timer wheel worked out which CPU would be busiest when the timer fired by putting it on the one that was busiest now. As Gleixner pointed out, that's not guaranteed to work well, though he didn't put it that kindly.

For the reworked code, they noted that over 99% of timers are canceled before they fire; they therefore wanted to keep the cost of setting a timer as low as possible, avoiding a costly and complex decision-making process by always queueing the timer on the local CPU. If that CPU goes idle, it lets other CPUs know that it has pending timers; either they will be canceled or, if one actually triggers, a busy CPU can take over running it.
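
A sketch of the dominant usage pattern that motivates this tradeoff (the names are made up; timer_setup(), mod_timer(), and del_timer() are the standard kernel timer API):

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    static struct timer_list retrans_timer;

    /* Slow path: runs only if the expected reply never arrives. */
    static void retrans_timeout(struct timer_list *t)
    {
            /* retransmit the request */
    }

    static void start_request(void)
    {
            /* The common operation, so it must be cheap: the reworked
             * wheel simply queues the timer on the local CPU. */
            mod_timer(&retrans_timer, jiffies + 5 * HZ);
    }

    static void reply_received(void)
    {
            /* The fate of over 99% of such timers: canceled before
             * they ever fire. */
            del_timer(&retrans_timer);
    }

    /* in init code: timer_setup(&retrans_timer, retrans_timeout, 0); */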

By his account, the rework was a success; there was minimal fallout, with tea and medals for all. Sadly, a week before Kernel Recipes 2017, and more than a year after the rework patches were merged, a user-space-visible regression surfaced. This issue is ongoing, because members of the network community (who were most affected by this regression) initially responded by arguing that the right fix was the removal of the whole timer wheel rework — until it was pointed out that the network subsystem would suffer the biggest performance hit from doing so.

Lessons learned from this rework include being prepared for paleontological research; going back 20 years in the kernel is pretty much a visit to Jurassic Park. Gleixner mentioned some projects out there that try to map tokens in the kernel code to contemporaneous discussions on mailing lists, and said how useful these would have been for a rework like this. Don't expect that anyone will know anything, or that anyone will care about what you're doing, he said. Finally, be prepared for late surprises; if you're reworking stuff this far down in the kernel, you can't burn your notes the day after the patches are merged.

A question from the audience asked how useful Git changelogs could be for reworking. For anything more than about five years old, Gleixner said, they're useless; the vast majority of changelogs either say something like "fix bug", or describe what the patch is doing instead of why. As a maintainer, Gleixner spends a lot of time trying to teach people to write good changelogs, and reckons he rewrites about half of the changelogs that pass through him. Whether as a result of this or not, changelogs have been getting steadily better over the last five years and can now often be helpful.

He also finds git grep/log/blame useful, along with the history git trees (both the BitKeeper history, which he has converted to Git and made available on kernel.org, and the really ancient history tree which imported the delta patches from one version to the next). He uses Coccinelle to figure out which pieces of code touch a given data structure and finds it extremely helpful. Mail archives, where they exist, are useful in direct proportion to their searchability. He uses Quilt to keep patch queues under control in big refactoring jobs. Finally, the utility of a good espresso machine should never, he said, be underestimated. In response to an audience question about cross-reference indexes, he expressed disdain for any tool that requires a web browser.

Should you, he asked rhetorically, have a go at refactoring? Yes, but not if you're faint of heart. You will need to be stubborn, and you will have to dig into some dark and evil places, but you will learn a lot about kernel history and why things are the way they are, even though they sometimes don't make any sense at all. You get used to understanding undocumented code written by authors who have vanished, which is a useful skill. You will spend a lot of time fighting the "it works fine" mentality: merely proving that a bug exists isn't enough, as many people regard bugs that haven't been triggered as non-issues. It's fun, for some definition of fun; to laughter, he noted that one of the greatest rewards of refactoring is the understanding that the Linux kernel works largely by chance, not design.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

Index entries for this article
GuestArticles: Yates, Tom
Conference: Kernel Recipes/2017



Digging in the kernel dust

Posted Oct 24, 2017 11:32 UTC (Tue) by dskoll (subscriber, #1630) [Link]

That was entertaining. It also mirrors my experience with other largish and old software projects. I sheepishly admit to having gone through a similar process on a much smaller scale on software I wrote originally.

Digging in the kernel dust

Posted Oct 24, 2017 13:58 UTC (Tue) by petur (guest, #73362) [Link]

"the vast majority of changelogs either say something like "fix bug", or describe what the patch is doing instead of why"

I get a similar feeling with comments too

Digging in the kernel dust

Posted Oct 24, 2017 15:32 UTC (Tue) by firasha (guest, #4230) [Link] (1 responses)

A very enjoyable read, it's always great to hear what develops out of the RT work! Those interested in seeing Thomas's talk can do so here.

"Sadly, a week before Kernel Recipes 2017, and more than a year after the rework patches were merged, a user-space-visible regression surfaced. This issue is ongoing, because members of the network community (who were most affected by this regression) initially responded by arguing that the right fix was the removal of the whole timer wheel rework — until it was pointed out that the network subsystem would suffer the biggest performance hit from doing so."

I was curious about the above issue, so went digging in the mailing list dust for the original threads:

[PATCH RESEND 0/2] enable hires timer to timeout datagram socket

[PATCH v2 0/2] enable hires timer to timeout datagram socket

[In future articles it would be great if you could include links to the talk(s) and mailing list discussion(s), to make it easy for interested readers to get the details.]

It's been almost a month since Thomas's talk, and as far as I can tell discussion about the above issue died after Thomas's last email of 25 September. No new mailing list threads on the subject have appeared (at least, that I could find) and the only change to /net/core/datagram.c since then is unrelated.

Digging in the kernel dust

Posted Oct 30, 2017 16:47 UTC (Mon) by nayna (subscriber, #91044) [Link]

Thank You, Tom, for this article and, firasha, for posting these links.

The TPM device driver is another subsystem impacted by this change. The original discussion and a comparison between msleep()/usleep_range() can be found here:

https://sourceforge.net/p/linux-ima/mailman/message/35685...
https://sourceforge.net/p/linux-ima/mailman/message/35773...

As a result of the non-cascading timer wheel change, we had to replace msleep() with usleep_range().
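
For short delays, that kind of replacement looks roughly like this (the function is a made-up stand-in for the driver's wait loops; msleep() and usleep_range() are the real interfaces from <linux/delay.h>):

    #include <linux/delay.h>

    static void tpm_short_wait(void)
    {
            /*
             * This used to be msleep(5). msleep() is backed by the
             * timer wheel, whose reduced accuracy for short sleeps
             * after the rework can stretch a nominal 5 ms delay well
             * past its target.
             *
             * usleep_range() is backed by hrtimers instead; the
             * min/max window (in microseconds) still lets the kernel
             * coalesce the wakeup with other timers.
             */
            usleep_range(5000, 5500);
    }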

Digging in the kernel dust

Posted Oct 24, 2017 18:15 UTC (Tue) by fredex (subscriber, #11727) [Link] (11 responses)

This quote reveals a pretty scary situation:

"one of the greatest rewards of refactoring is the understanding that the Linux kernel works largely by chance not design"

but I suppose that's true of many large systems, not just Linux.

Digging in the kernel dust

Posted Oct 24, 2017 20:54 UTC (Tue) by halla (subscriber, #14185) [Link] (9 responses)

No -- this is true of _all_ large systems -- there is no large system in existence that works by design. Humans are unable to _design_ a large system, then implement it. It can't be done, and it has never been done.

Digging in the kernel dust

Posted Oct 24, 2017 21:28 UTC (Tue) by dskoll (subscriber, #1630) [Link]

A fellow devotee of Systemantics, I see!

Digging in the kernel dust

Posted Oct 24, 2017 21:30 UTC (Tue) by roc (subscriber, #30627) [Link] (4 responses)

This is true, and suggests that for real security you need a microkernel architecture, perhaps something like seL4.

Digging in the kernel dust

Posted Oct 26, 2017 10:37 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

What it actually means is that to build a real-world complex system you need a dose of pragmatism, and to accept that not every single constraint can be identified, analysed, and designed against beforehand.

Digging in the kernel dust

Posted Oct 27, 2017 17:33 UTC (Fri) by hkario (subscriber, #94864) [Link]

"Every problem can be solved by introducing another level of indirection"

that's a joke

Digging in the kernel dust

Posted Oct 30, 2017 20:33 UTC (Mon) by ttelford (guest, #44176) [Link] (1 responses)

How does playing a shell game between kernel vs userspace help with the "human brain doesn't grok concurrency" problem?

Digging in the kernel dust

Posted Oct 30, 2017 21:40 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

By mitigating the impact of wrong decisions.

One of the more interesting features in Erlang is the "let it fail" design ideology. Instead of trying to recover from error conditions, you just fail cleanly and restart the affected parts. And then you design your system with the assumption that individual components will fail, except maybe for a very limited minimal subset.

It translates well into microkernel architectures. In theory, that is.

Digging in the kernel dust

Posted Oct 27, 2017 2:47 UTC (Fri) by flussence (guest, #85566) [Link]

It's worse than that. Programming, apart from being a gigantic mess with multiple entire industries dedicated to dealing with how bad it is, is an abstraction we use to cope with the madness of digital electronics. Which is an abstraction over the horrors of analogue electronics at nanometre scale… and _that's_ also a lie the signals engineers comfort themselves with in order to avoid thinking about quantum physics. All systems are large.

Digging in the kernel dust

Posted Nov 1, 2017 4:19 UTC (Wed) by eternaleye (guest, #67051) [Link] (1 responses)

Hm, I'd argue that seL4 may be the best counterexample yet constructed in the software domain.

Digging in the kernel dust

Posted Nov 10, 2017 9:32 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

seL4 is provable because it is (compared to the mainstream macrokernels) tiny.

Digging in the kernel dust

Posted Nov 2, 2017 14:14 UTC (Thu) by mips (guest, #105013) [Link]

This quote made me wonder whether anyone has attempted a rewrite of the kernel. I can't find anything; I assume it's simply too complex (the chap in the article has spent two years on what I presume to be a minute portion of the kernel), too much of a moving target, and perhaps there are rights/authorship issues.

Also presumably the people with the chops to rewrite it also have enough sense to know better.

Digging in the kernel dust

Posted Oct 24, 2017 21:29 UTC (Tue) by tonyblackwell (subscriber, #43641) [Link] (1 responses)

Thanks Tom for a superb summary of the talk, for those of us not into kernel intricacies.

Digging in the kernel dust

Posted Oct 26, 2017 6:15 UTC (Thu) by madhatter (subscriber, #4665) [Link]

You're welcome, and thank you for that.

Digging in the kernel dust

Posted Oct 25, 2017 7:29 UTC (Wed) by unixbhaskar (guest, #44758) [Link]

Thanks a bunch, Thomas! It was entertaining and invigorating. Wish I had the capability to help you out with the changelogs... wondering...

Digging in the kernel dust

Posted Oct 25, 2017 18:53 UTC (Wed) by bjorntopel (subscriber, #80345) [Link]

The whole talk is on YouTube: https://www.youtube.com/watch?v=mxxicJZ8cis

Excellent talk, really enjoyed it!

RT patchset size reduced?

Posted Oct 26, 2017 7:16 UTC (Thu) by swilmet (guest, #98424) [Link]

The talk didn't explain how much the RT patchset has been reduced thanks to the CPU hotplug and timer wheel refactorings. It would be great to see the size of the RT patchset over time, both for the number of commits and the number of insertions/deletions.

Digging in the kernel dust

Posted Oct 26, 2017 17:12 UTC (Thu) by xorbe (guest, #3165) [Link] (3 responses)

Can someone further elucidate on what the user-facing network regression was wrt the timer wheel rework?

Digging in the kernel dust

Posted Oct 26, 2017 22:08 UTC (Thu) by firasha (guest, #4230) [Link]

Between this comment and that of bjorntopel I feel like no one's even bothered to read mine. The links to the video and the userspace regression discussion were all provided...

Digging in the kernel dust

Posted Oct 27, 2017 16:31 UTC (Fri) by zlynx (guest, #2285) [Link] (1 responses)

The TL;DR summary: timer slack caused a requested socket timeout of 180 seconds to return after up to 190 seconds.

Code relied on this being an accurate timer even though it isn't specified to be accurate.

Digging in the kernel dust

Posted Oct 28, 2017 6:46 UTC (Sat) by nijhof (subscriber, #4034) [Link]

If 90+% of the socket timeouts get cancelled anyway, then maybe one could use a hybrid timer: use a timer wheel timeout for 150s, then in the few cases that that triggers, calculate the actual waiting time and schedule a hires timeout for the remaining 30 - slack seconds?

Of course I don't know how easy it is to find out how long the timer wheel timer waited for exactly. And the locking to avoid canceling while rearming would be interesting...
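
For what it's worth, a rough sketch of that idea (everything here is hypothetical; the timer-wheel and hrtimer APIs are real, but no such hybrid exists in the kernel, and the locking concerns above are ignored):

    #include <linux/timer.h>
    #include <linux/hrtimer.h>
    #include <linux/jiffies.h>

    static struct timer_list coarse;    /* cheap wheel timer, fires early */
    static struct hrtimer precise;      /* armed only if the wheel fires */
    static unsigned long deadline;      /* absolute deadline, in jiffies */

    /* Stage 2: the hrtimer delivers the accurate timeout. */
    static enum hrtimer_restart sock_timeout(struct hrtimer *t)
    {
            /* time out the socket here */
            return HRTIMER_NORESTART;
    }

    /* Stage 1: the wheel timer fires somewhat before the deadline;
     * only now pay for an accurate hrtimer over the remainder. */
    static void coarse_fired(struct timer_list *t)
    {
            long remaining = (long)(deadline - jiffies);

            if (remaining < 1)
                    remaining = 1;
            hrtimer_start(&precise,
                          ms_to_ktime(jiffies_to_msecs(remaining)),
                          HRTIMER_MODE_REL);
    }

    static void arm_hybrid_timeout(unsigned long timeout_jiffies)
    {
            deadline = jiffies + timeout_jiffies;
            /* Stop short of the deadline; the common case, cancellation,
             * never touches the hrtimer at all. */
            mod_timer(&coarse, deadline - timeout_jiffies / 6);
    }

    /* init: timer_setup(&coarse, coarse_fired, 0);
     *       hrtimer_init(&precise, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
     *       precise.function = sock_timeout; */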


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds