Digging in the kernel dust
Refactoring the kernel means taking some part of the kernel that is showing its age and rewriting it so it works better. Thomas Gleixner has done a lot of this over the past decade; he spoke at Kernel Recipes about the details of some of that work and the lessons that he learned. By way of foreshadowing how much fun this can be, he subtitled the talk "Digging in Dust".
Gleixner's original motivation for taking up his spade was to get the realtime (RT) patches into the mainline kernel, which he found involved constantly working around the shortcomings of the mainline code base. In addition, ten years of spending every working day digging around in dust can make you quite angry, he said, which can also be a big incentive to make things better.
Hotplug support
The first concrete example he discussed was CPU hotplug support. This was added in 2002, when Rusty Russell implemented it with a notifier-based architecture. According to Gleixner, the naming scheme was poor and, as the community worked on the code, further problems were added: the locking was handled by uninstrumented magic that deliberately evaded lockdep oversight, and the code developed interesting, undocumented ordering requirements. Moreover, the handling of registered notifiers at startup and teardown was asymmetric: startup reasonably did high-priority tasks first, but so did teardown. This is not reasonable: a pile should be disassembled in the reverse of the order in which it was built, he said. The asymmetric behavior meant that anything sensitive to startup/teardown ordering ended up needing two notifiers. The absence of any debugging support for the hotplug notifiers didn't help with figuring out what was going on.
One might expect such code to be less than completely robust; according to Gleixner it was known to be fragile for a long time. The response for most of this time was to apply successive layers of duct tape to hold it together; the attempt to add RT support made it fall apart completely, which is why Gleixner first picked up the pieces in 2012. He decided to convert the structure to a state machine and redo startup/teardown to be symmetric; he got as far as a proof-of-concept patch when he ran out of spare time. He handed the work over to a (tactfully unnamed) company that promised to pick it up, but the company merely applied yet more duct tape. So at the end of 2015, when he got some funding for the RT project, he revisited it; it has taken almost two years to complete the work.
The first step was to analyze all the notifiers and document the ordering requirements. During this process several dozen low-level bugs, such as resource leaks, memory leaks, and null-pointer problems, were found and fixed, predominantly in the error paths as these had received very little in-the-field testing over the years. Gleixner noted one advantage of a state machine that comes into play here: if startup fails halfway through, teardown is done using the regular state machine teardown code rather than some failure-scenario-specific code. Once the analysis and documentation were complete, a conversion of notifiers to hotplug states was started, gradually removing old infrastructure as it was no longer needed.
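The end product of that conversion is the cpuhp state-machine interface that subsystems use today. As a minimal sketch (the callback names and the use of a dynamically allocated state slot are illustrative, not taken from the talk), a driver registers a symmetric pair of callbacks once, and the core guarantees that teardown runs in the reverse order of startup:

```c
#include <linux/cpu.h>
#include <linux/cpuhotplug.h>
#include <linux/module.h>

/* Illustrative callback: invoked on each CPU as it comes online. */
static int example_cpu_online(unsigned int cpu)
{
	pr_info("example: CPU %u coming online\n", cpu);
	return 0;	/* non-zero would roll back already-completed states */
}

/* Teardown counterpart: the core calls it in the reverse order on the
 * way down, so startup/teardown symmetry comes for free. */
static int example_cpu_offline(unsigned int cpu)
{
	pr_info("example: CPU %u going offline\n", cpu);
	return 0;
}

static enum cpuhp_state example_state;

static int __init example_init(void)
{
	/* CPUHP_AP_ONLINE_DYN requests a dynamically allocated state;
	 * the startup callback also runs for already-online CPUs. */
	int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example:online",
				    example_cpu_online, example_cpu_offline);
	if (ret < 0)
		return ret;
	example_state = ret;
	return 0;
}

static void __exit example_exit(void)
{
	cpuhp_remove_state(example_state);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```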
Once this was complete and the notifiers were gone, he tackled the locking problem, reimplementing the locks with semaphores. This meant that lockdep could finally see what was going on, revealing about 25 deadlocks lurking in the existing code. Although these were hard to trigger in the mainline kernel, which is probably why they had not been fixed before, the RT patches tend to expose them; Steven Rostedt's hotplug stress-test script would usually lock up a machine after four or five runs.
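In mainline, the user-visible face of that locking rework is that hotplug exclusion is now taken through an ordinary, lockdep-tracked reader lock. A minimal sketch, assuming the post-rework cpus_read_lock()/cpus_read_unlock() interface (the function and its loop body are invented for illustration):

```c
#include <linux/cpu.h>
#include <linux/cpumask.h>

/* Illustrative only: a section that must not race with CPU hotplug.
 * Because the hotplug lock is now a conventional, lockdep-visible
 * primitive, ordering violations show up in lockdep reports rather
 * than as rare, mysterious hangs. */
static void walk_online_cpus(void)
{
	unsigned int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		/* per-CPU work that assumes this CPU stays online */
	}
	cpus_read_unlock();
}
```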
One problem identified came from the decision to have a user-space procfs interface writing into a variable that was used, live, by the kernel. This tended to produce nasty race conditions, which were handled in about a dozen stacked patches, the net effect of which was to reduce the size of the race window until it was hard to see. Gleixner's preferred solution was to create a shadow variable that user space writes into, which the kernel's internal variable was synchronized with when (and only when) the system was in a state that allowed it to change safely.
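That pattern might look something like the following sketch; the variable names and the mutex are invented for illustration, and the real synchronization point is subsystem-specific:

```c
#include <linux/mutex.h>

static unsigned int live_value;		/* read by the kernel's hot paths */
static unsigned int shadow_value;	/* target of the procfs write handler */
static DEFINE_MUTEX(shadow_lock);

/* Called from the procfs write path: only the shadow copy changes,
 * so user space can never yank the live value out from under the
 * kernel. */
static void example_store(unsigned int val)
{
	mutex_lock(&shadow_lock);
	shadow_value = val;
	mutex_unlock(&shadow_lock);
}

/* Called only at points where the subsystem is known to be quiescent
 * enough for the value to change safely; the race window is removed
 * rather than merely shrunk. */
static void example_sync(void)
{
	mutex_lock(&shadow_lock);
	live_value = shadow_value;
	mutex_unlock(&shadow_lock);
}
```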
Much was learned from this experience. First, if you unearth bugs in the kernel, you're expected to fix them. Gleixner and his colleagues drew the maintainers' attention to the first few bugs they found in the notifier callbacks, asking for advice on how to deal with them, and were either completely ignored or knocked back on the basis that the code had been working fine for five years, so it must be correct. Once they followed this up with patches, maintainers responded.
When digging in the kernel, "the amount of crap you find is immense; it's really amazing". He added, to nervous giggles from the audience: "when you think you've seen the worst already, no, it's getting worse next week". Patience is required: you start off with something simple; you pull at it and other bits fall out that need attention; before you know it, you have to rewrite stuff all over the place. This is normal, and you mustn't give up, he said. He also learned that corporations cannot be trusted to keep their promises.
Finally, estimation of effort in kernel reworking is hard; his estimate of the time required was off by a factor of two, and his estimate of the person-hours required was off by a factor of three. In response to a question from the audience, he allowed that his estimate was reached by taking a reasonable guess, doubling it, and converting to the next largest unit — and still it was off by a factor of two. In the end, the total work required to rework CPU hotplugging was about two person-years, which worked out to about 1.2 patches per person-day.
Timer wheel
The next example he looked at was the timer wheel rework, which he approached by going back to the original work from 1997. He and his colleagues reread the original paper on which the work was based, re-examined why it was implemented in the first place, and tried to work out how it could be done better. They found that, when you try to go back 20 years, everyone you ask strenuously denies remembering anything. It took them three months to do the analysis work (tracing and examining the code, figuring out the use cases and the precision each case required), followed by two months to design and write a proof of concept (during which they implemented several possible designs), and a month to do the final implementation, including the posting and review process.
Certain enhancements are still works in progress, such as the business of choosing which CPU a new timer should be placed on. It's desirable to have a timer fire on a busy CPU, to avoid waking a CPU from a low-power state just to deal with it. The old timer wheel tried to predict which CPU would be busiest when the timer fired by placing the timer on the CPU that was busiest at arming time. As Gleixner pointed out, that's not guaranteed to work well, though he didn't put it that kindly.
For the reworked code, they noted that over 99% of timers are canceled before they fire; they therefore wanted to keep the cost of setting a timer as low as possible, so they avoided any costly and complex decision-making process by always queueing the timer on the local CPU. If that CPU goes idle, it lets other CPUs know that it has pending timers; either those timers will be canceled or, if one actually triggers, a busy CPU can take over running it.
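A sketch of the usage pattern the rework optimizes for is the classic timeout timer: armed cheaply on the local CPU, then almost always canceled before it fires. The context structure and timeout values below are hypothetical; timer_setup() and from_timer() are the timer interfaces introduced in the same era as the rework:

```c
#include <linux/jiffies.h>
#include <linux/timer.h>

struct example_ctx {
	struct timer_list timeout;
};

/* Runs only in the statistically rare case that the operation
 * actually times out. */
static void example_timeout_fn(struct timer_list *t)
{
	struct example_ctx *ctx = from_timer(ctx, t, timeout);

	/* handle the timeout for ctx ... */
}

static void example_start(struct example_ctx *ctx)
{
	timer_setup(&ctx->timeout, example_timeout_fn, 0);
	/* mod_timer() queues on the local CPU's wheel, keeping the
	 * cost of arming as low as possible. */
	mod_timer(&ctx->timeout, jiffies + msecs_to_jiffies(100));
}

static void example_complete(struct example_ctx *ctx)
{
	/* The common case (over 99%, per the analysis above): the
	 * operation finished in time, so the timer is canceled. */
	del_timer(&ctx->timeout);
}
```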
By his account, the rework was a success; there was minimal fallout, with tea and medals for all. Sadly, a week before Kernel Recipes 2017, and more than a year after the rework patches were merged, a user-space-visible regression surfaced. This issue is ongoing, because members of the network community (who were most affected by this regression) initially responded by arguing that the right fix was the removal of the whole timer wheel rework — until it was pointed out that the network subsystem would suffer the biggest performance hit from doing so.
Lessons learned from this rework include being prepared for paleontological research; going back 20 years in the kernel is pretty much a visit to Jurassic Park. Gleixner mentioned some projects out there that try to map tokens in the kernel code to contemporaneous discussions on mailing lists, and said how useful these would have been for a rework like this. Don't expect that anyone will know anything, or that anyone will care about what you're doing, he said. Finally, be prepared for late surprises; if you're reworking stuff this far down in the kernel, you can't burn your notes the day after the patches are merged.
A question from the audience asked how useful Git changelogs are for this kind of rework. For anything more than about five years old, Gleixner said, they are useless; the vast majority of changelogs either say something like "fix bug" or describe what the patch is doing instead of why. As a maintainer, Gleixner spends a lot of time trying to teach people to write good changelogs, and reckons he rewrites about half of the changelogs that pass through his hands. Whether as a result of this or not, changelogs have been getting steadily better over the last five years and can now often be helpful.
He also finds git grep/log/blame useful, along with the history git trees (both the BitKeeper history, which he has converted to Git and made available on kernel.org, and the really ancient history tree which imported the delta patches from one version to the next). He uses Coccinelle to figure out which pieces of code touch a given data structure and finds it extremely helpful. Mail archives, where they exist, are useful in direct proportion to their searchability. He uses Quilt to keep patch queues under control in big refactoring jobs. Finally, the utility of a good espresso machine should never, he said, be underestimated. In response to an audience question about cross-reference indexes, he expressed disdain for any tool that requires a web browser.
Should you, he asked rhetorically, have a go at refactoring? Yes, but not if you're faint of heart. You will need to be stubborn, and you will have to dig into some dark and evil places, but you will learn a lot about kernel history and why things are the way they are, even though they sometimes don't make any sense at all. You get used to understanding undocumented code written by authors who have since vanished, which is a useful skill. You will spend a lot of time fighting the "it works fine" mentality: merely proving that a bug exists isn't enough, as many people regard bugs that haven't been triggered as non-issues. It's fun, for some definition of fun; to laughter, he noted that one of the greatest rewards of refactoring is the understanding that the Linux kernel works largely by chance, not design.
[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]
Index entries for this article
GuestArticles: Yates, Tom
Conference: Kernel Recipes/2017
Posted Oct 24, 2017 11:32 UTC (Tue)
by dskoll (subscriber, #1630)
[Link]
That was entertaining. It also mirrors my experience with other largish and old software projects. I sheepishly admit to having gone through a similar process on a much smaller scale on software I wrote originally.
Posted Oct 24, 2017 13:58 UTC (Tue)
by petur (guest, #73362)
[Link]
I get a similar feeling with comments too
Posted Oct 24, 2017 15:32 UTC (Tue)
by firasha (guest, #4230)
[Link] (1 responses)
I was curious about the above issue, so went digging in the mailing list dust for the original threads:
[PATCH RESEND 0/2] enable hires timer to timeout datagram socket
[PATCH v2 0/2] enable hires timer to timeout datagram socket
[In future articles it would be great if you could include links to the talk(s) and mailing list discussion(s), to make it easy for interested readers to get the details.]
It's been almost a month since Thomas's talk, and as far as I can tell discussion about the above issue died after Thomas's last email of 25 September. No new mailing list threads on the subject have appeared (at least, that I could find) and the only change to /net/core/datagram.c since then is unrelated.
Posted Oct 30, 2017 16:47 UTC (Mon)
by nayna (subscriber, #91044)
[Link]
The TPM device driver is another subsystem impacted by this change. The original discussion and a comparison between msleep() and usleep_range() can be found here:
https://sourceforge.net/p/linux-ima/mailman/message/35685...
https://sourceforge.net/p/linux-ima/mailman/message/35773...
As a result of the non-cascading timer wheel change, we had to replace msleep() with usleep_range().
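[For illustration, a conversion of that shape might look like the following; the values are hypothetical, not the TPM driver's actual timeouts:]

```c
#include <linux/delay.h>

/* On the reworked timer wheel, msleep() guarantees only "at least
 * this long", with slack that grows with the delay; usleep_range()
 * is hrtimer-based and respects the requested bounds. */
static void example_wait(void)
{
	/* was: msleep(5); */
	usleep_range(5000, 5500);	/* min/max in microseconds */
}
```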
Posted Oct 24, 2017 18:15 UTC (Tue)
by fredex (subscriber, #11727)
[Link] (11 responses)
"one of the greatest rewards of refactoring is the understanding that the Linux kernel works largely by chance not design"
but I suppose that's true of many large systems, not just Linux.
Posted Oct 24, 2017 20:54 UTC (Tue)
by halla (subscriber, #14185)
[Link] (9 responses)
Posted Oct 24, 2017 21:28 UTC (Tue)
by dskoll (subscriber, #1630)
[Link]
A fellow devotee of Systemantics, I see!
Posted Oct 24, 2017 21:30 UTC (Tue)
by roc (subscriber, #30627)
[Link] (4 responses)
Posted Oct 26, 2017 10:37 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link]
Posted Oct 27, 2017 17:33 UTC (Fri)
by hkario (subscriber, #94864)
[Link]
that's a joke
Posted Oct 30, 2017 20:33 UTC (Mon)
by ttelford (guest, #44176)
[Link] (1 responses)
Posted Oct 30, 2017 21:40 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
One of the more interesting features in Erlang is the "let it fail" design ideology. Instead of trying to recover from error conditions you just fail cleanly and restart the affected parts. And then you design your system with the assumption that individual components will fail, except maybe for a very limited minimal subset.
It translates well into microkernel architectures. In theory, that is.
Posted Oct 27, 2017 2:47 UTC (Fri)
by flussence (guest, #85566)
[Link]
Posted Nov 1, 2017 4:19 UTC (Wed)
by eternaleye (guest, #67051)
[Link] (1 responses)
Posted Nov 10, 2017 9:32 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link]
Posted Nov 2, 2017 14:14 UTC (Thu)
by mips (guest, #105013)
[Link]
Also presumably the people with the chops to rewrite it also have enough sense to know better.
Posted Oct 24, 2017 21:29 UTC (Tue)
by tonyblackwell (subscriber, #43641)
[Link] (1 responses)
Posted Oct 26, 2017 6:15 UTC (Thu)
by madhatter (subscriber, #4665)
[Link]
Posted Oct 25, 2017 7:29 UTC (Wed)
by unixbhaskar (guest, #44758)
[Link]
Posted Oct 25, 2017 18:53 UTC (Wed)
by bjorntopel (subscriber, #80345)
[Link]
Excellent talk, really enjoyed it!
Posted Oct 26, 2017 7:16 UTC (Thu)
by swilmet (guest, #98424)
[Link]
Posted Oct 26, 2017 17:12 UTC (Thu)
by xorbe (guest, #3165)
[Link] (3 responses)
Posted Oct 26, 2017 22:08 UTC (Thu)
by firasha (guest, #4230)
[Link]
Posted Oct 27, 2017 16:31 UTC (Fri)
by zlynx (guest, #2285)
[Link] (1 responses)
Code relied on this being an accurate timer even though it isn't specified to be accurate.
Posted Oct 28, 2017 6:46 UTC (Sat)
by nijhof (subscriber, #4034)
[Link]
Of course I don't know how easy it is to find out how long the timer wheel timer waited for exactly. And the locking to avoid canceling while rearming would be interesting...
Digging in the kernel dust
A very enjoyable read, it's always great to hear what develops out of the RT work! Those interested in seeing Thomas's talk can do so here.
Digging in the kernel dust
How does playing a shell game between kernel vs userspace help with the "human brain doesn't grok concurrency" problem?
Digging in the kernel dust
seL4 is provable because it is (compared to the mainstream macrokernels) tiny.
RT patchset size reduced?
Digging in the kernel dust
Between this comment and that of bjorntopel I feel like no one's even bothered to read mine. The links to the video and the userspace regression discussion were all provided...