Kernel regression tracking, part 1

By Jonathan Corbet
October 31, 2017

2017 Kernel Summit

The kernel development community has run for some years without anybody tracking regressions; that changed one year ago when Thorsten Leemhuis stepped up to the task. Two conversations were held on the topic at the 2017 Kernel and Maintainers summits in Prague; this article covers the first of those, held during the open Kernel-Summit track.

Leemhuis began by pointing out that he started doing this work even though he does not work for a Linux company; he is, instead, a journalist for the largest computer magazine in Germany. He saw a mention of the gap that was left after Rafael Wysocki stopped tracking regressions, and thought that he might be a good fit for the job. This work is being done in his spare time. When he started, he had thought that the job would be difficult and frustrating; in reality, it turned out to be even worse than he expected.

Why is it so hard? The first problem is that nobody actually tells him about regressions, so he has to hunt them down himself. That means digging through a lot of mailing lists and bug trackers. Wysocki noted that things are worse than they were years ago when he did the job; there are a lot more information sources now. It is more, Wysocki said, than any one person can follow.

Leemhuis went on to say that a lot of regressions are also fixed without him even noticing. Nobody tells him about progress toward fixing regressions, so that, too, must be tracked manually. He had asked developers to include a special identifier in discussions on regressions, but nobody has done it. That is unfortunate, since he had thought it would be a useful mechanism; perhaps, he said, he should have tried harder. Ben Herrenschmidt agreed, saying that it can be hard to get people to change their established workflow to incorporate a new mechanism. James Bottomley noted that maintainers would, in general, rather avoid having their bugs termed "regressions", since that increases the pressure for an immediate fix.

Leemhuis raised the idea of creating a dedicated mailing list for regressions, with reporters asked to copy their reports there. Wysocki agreed that this might be useful, but said that the information on how to report regressions properly needs to be better communicated. Laura Abbott concurred, saying that the documentation in this area should be improved.

Herrenschmidt noted that most bug reports come from distributor kernels rather than the mainline. For distributions like Fedora, which ships something close to a current mainline kernel, these reports can be relevant, though they are still a version or two behind the current development kernels. Reports of bugs in enterprise kernels, by contrast, have little value. Bottomley added that Linus Torvalds is mostly interested in mainline regressions; the resources just don't exist to track regressions in distributor kernels as well.

There was general agreement that only mainline regressions should be tracked, but Ted Ts'o said that the community could look for volunteers to track regressions in older kernel versions. The work is still useful, he said, and would train others to help with regression tracking. The problem with this idea, Bottomley replied, is that one has to be an idiot to want to do this work — an idea that Leemhuis seemed to concur with. There won't, Bottomley added, be a flood of volunteers in this area. Matthew Wilcox's suggestion that the situation could change because there are a lot of journalists being laid off was not seen as entirely helpful.

Abbott said that, in her role as a Fedora kernel maintainer, she sees a lot of bug reports, but many of them are of low quality. They need to be filtered before being passed on to any sort of core regression list. Arnd Bergmann added that Linaro has been doing more testing recently and finding regressions in linux-next. But Leemhuis said he is really only interested in regressions that make it to the mainline at this point.

Leemhuis went on to say that, while Wysocki used the kernel's Bugzilla tracker to handle regressions, it "looks like double-entry accounting" to him and he has avoided it. There is a lot of overhead associated with working in Bugzilla, and kernel developers tend not to like it. So he has been using the mailing lists instead, but perhaps that was the wrong decision?

Wysocki replied that he used Bugzilla because it was suitable for him; it provided a useful archive of the discussions around regressions. Ts'o said that the real problem is that Torvalds will not dictate a single bug-tracking system for the kernel, so the information is scattered around the community. The kernel Bugzilla is not perfect, he said, but it has the advantage of actually existing and being available. Wysocki added that there needs to be a database somewhere; it should be possible to point people to a definitive entry for a regression. Takashi Iwai said that, for distributors, the most important thing to have is an overview of the situation; that is missing now. There is no comprehensive list of problems, so distributors must go through the time-consuming task of polling a number of different bug trackers.

Wilcox asked if distributors use the regression list for decisions on which kernel versions to ship, or whether those decisions are purely based on time. Abbott replied that Fedora tries to ship the latest mainline kernel, but the decision on pushing a specific kernel does depend on the current regressions. A significant Intel or AMD graphics regression will cause a kernel to be held back, she said, while "an obscure USB dongle" problem will not.

Ben Hutchings said that the situation at Debian is similar, at least outside of the long-term support releases. Iwai said that openSUSE Tumbleweed ships the latest kernel, meaning that regression reports are relevant to the current mainline release, not the development kernel that the kernel developers are working on currently. There are, he said, not many people testing the -rc kernels. Jiri Kosina added that SUSE tracks the "Fixes" tags in patches to see which bug fixes are relevant to the kernels they have shipped; those fixes will be backported if needed. That has led to a reduction in the regressions reported with openSUSE kernels.
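As a rough illustration of the kind of "Fixes" tag tracking Kosina described, the sketch below pulls the commit IDs out of "Fixes:" lines in a patch description and checks them against a set of commits known to be in a shipped kernel. It is not SUSE's tooling; the function names and the `shipped_commits` set are hypothetical, and only the tag layout (an abbreviated commit ID followed by the quoted subject) follows the usual kernel convention.

```python
import re

# Kernel convention for the tag, one per line of the commit message:
#   Fixes: 1a2b3c4d5e6f ("subject of the commit being fixed")
FIXES_RE = re.compile(r'^Fixes:\s*([0-9a-f]{8,40})\b',
                      re.IGNORECASE | re.MULTILINE)


def fixed_commits(message):
    """Return the (possibly abbreviated) commit IDs named in Fixes: tags."""
    return [commit.lower() for commit in FIXES_RE.findall(message)]


def is_relevant(message, shipped_commits):
    """True if the patch fixes a commit contained in a shipped kernel.

    shipped_commits is a hypothetical set of full 40-character commit IDs
    (lowercase) present in the kernel that the distributor has shipped.
    """
    return any(
        full.startswith(short)
        for short in fixed_commits(message)
        for full in shipped_commits
    )
```

A distributor could run a check like this over each new mainline or stable commit to build a list of backport candidates; deciding whether a given fix is actually needed would still be up to a human.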

Leemhuis asked if he should query developers via email more often the way Wysocki did; Wysocki replied that he didn't do that — his scripts did. Mark Brown said that was a good thing, since the scripts were more polite than their author. Overall, there didn't appear to be any opposition to more email if that's what is needed to improve regression tracking.

As the discussion came to a close, it was noted that regression reporting is hard for most users. They don't know where to send their reports, and there is little information out there to help them. The noise on the mailing lists does not help. The kernel Bugzilla is especially problematic since it is the wrong place to report many bugs, but it's not clear which ones or where they should actually go. Ts'o said that, if it were up to him, he would designate the Bugzilla as being for all kernel bugs, and that subsystem maintainers would simply be told to cope with it. In the absence of such a policy, users will continue to struggle.

The final suggestion came from Abbott, who said that perhaps users who send email to the linux-kernel list (and nobody else) should get an automatic response. That response would inform them that email sent only to the list is unlikely to be read by many people and would thus probably not get a response. It would include suggestions regarding how to more successfully report bugs. This idea was generally well received.
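The check behind such an automatic response could be quite small. The following sketch, using Python's standard email module, decides whether a message was addressed to the linux-kernel list and nobody else. It is only an assumption about how such a responder might work; the function name is hypothetical, and a real implementation would also need to decide how to treat replies within existing threads.

```python
from email import message_from_string
from email.utils import getaddresses

# The linux-kernel mailing list address.
LIST_ADDRESS = "linux-kernel@vger.kernel.org"


def sent_only_to_list(raw_message):
    """True if the only To/Cc recipient of the message is the list itself."""
    msg = message_from_string(raw_message)
    # Gather every address from all To: and Cc: headers.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    addresses = {addr.lower() for _name, addr in recipients if addr}
    return addresses == {LIST_ADDRESS}
```

Messages for which this returns true would be the ones to receive the canned reply pointing at better ways to report bugs.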

This topic was revisited during the invitation-only Maintainers Summit two days later.

[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event].

Posted Oct 31, 2017 21:05 UTC (Tue) by roc (subscriber, #30627) (5 responses)

We quite often find kernel bugs when working on rr. For severe bugs, especially regressions, we post to LKML and address specific maintainers because it's the only way to get a timely response (although not guaranteed). Less severe bugs go to Bugzilla because it doesn't seem like a good idea to spam specific maintainers with such bugs. As this article says, a message sent to LKML without addressing specific individuals will be forever lost in the noise. It seems that the Bugzilla bugs are only rarely looked at, though, so we file them more because it's the right thing to do than because we need action.

Requiring bug reporters to directly email maintainers about specific bugs seems really bad for everyone. It makes the bar for reporting bugs very high, since you need to track down the right maintainers, which is practically impossible for end users, and if you guess wrong then your bug is likely to be ignored. I assume it also sucks for maintainers, since it must be difficult to share the workload between maintainers or if the maintainers change, plus every maintainer has to implement their own issue tracking in their email client.

It's amazing how far Linux has come with such immature development practices across the board. This nonsense is tolerated because Linux is so successful, but in a less successful project it would be scorned. If a serious alternative open-source kernel ever arises with more rational development practices and takes share from Linux, people will look back and say "how did they ever think developing *that* way was a good idea?"

Posted Nov 1, 2017 10:37 UTC (Wed) by pizza (subscriber, #46) (4 responses)

> It's amazing how far Linux has come with such immature development practices across the board. This nonsense is tolerated because Linux is so successful, but in a less successful project it would be scorned. If a serious alternative open-source kernel ever arises with more rational development practices and takes share from Linux, people will look back and say "how did they ever think developing *that* way was a good idea?"

So is Linux successful in part due to this methodology, or in spite of it? Or both?

Posted Nov 1, 2017 10:50 UTC (Wed) by roc (subscriber, #30627) (3 responses)

I think you could make a good case for both. I think it's pretty clear that lack of centralized issue tracking is a big problem now. But you could also argue that lack of centralized issue tracking made life easier or more fun for developers in the past, and that helped get Linux where it is today.

I just don't like to see questionable Linux development practices justified by "Linux is successful, so this must be right".

Posted Nov 1, 2017 12:22 UTC (Wed) by Paf (subscriber, #91811)

The most you can say is “Linux is successful, so this isn’t utterly crippling.” (Like the single maintainer model with Linus, which eventually was.)

Posted Nov 1, 2017 21:33 UTC (Wed) by neilbrown (subscriber, #359) (1 responses)

> I think it's pretty clear that lack of centralized issue tracking is a big problem now.

Is it? Can you point to some evidence, please?
Or maybe I should just say [citation needed].

Posted Nov 2, 2017 0:36 UTC (Thu) by roc (subscriber, #30627)

Isn't the information in this article evidence enough? Distro kernel people are independently tracking regressions. Wouldn't it be better if they could share data with each other, and with the maintainers who are also tracking regressions?

Whether maintainers are tracking regressions in their heads, in a text file, writing notes on a napkin, or however, they are doing it, just in a way that's inaccessible to others.

Posted Nov 1, 2017 3:05 UTC (Wed) by neilbrown (subscriber, #359) (10 responses)

I must confess that I don't really understand this fascination with tracking regressions.
As a developer or maintainer, my only interest in regressions is fixing them. Once fixed they don't need to be tracked - though they are usually recorded using Fixes: tags.
As a user, I still just want it to be fixed, though finding a work-around, or reverting to an old version are certainly options. So I look for someone to complain to and make a noise.

How is centralized tracking meant to help? I don't imagine that its use as a management-metric would really matter to most people. Seriously: who cares?

Posted Nov 1, 2017 7:01 UTC (Wed) by corbet (editor, #1) (4 responses)

I do believe the point is to (1) have a sense for how ready a given kernel is, and (2) increase the odds that regressions get fixed rather than falling through the cracks.

Posted Nov 1, 2017 16:14 UTC (Wed) by knurd (subscriber, #113424) (1 responses)

I'd add (3) (or maybe (2b)): get maintainers back on track in case they do not take regressions seriously (like it was the case recently with AppArmor: https://lkml.org/lkml/2017/10/3/1)

Posted Nov 1, 2017 22:14 UTC (Wed) by neilbrown (subscriber, #359)

> like it was the case recently with AppArmor: https://lkml.org/lkml/2017/10/3/1

This is not an example of "maintaining a regression list is useful". This is an example of "members of the community supporting each other to encourage change". James reported a regression, the maintainer disagreed, someone else (Thorsten, and eventually Linus) joined in to make the case.
You don't need a maintained list of regressions to get things fixed (and I doubt it helps much). You need people to care and report and contribute and persist.

We always need more competent people to follow issues in various fora, to review not only patches but also bug reports and design discussions and anything else. In this case Thorsten Leemhuis joined in and pushed things along. This was a valuable contribution to be applauded, but it is not a contribution that needs to be centralized; it just needs to be done. Were you following the thread at the time? Maybe even you could have pushed things along.

Posted Nov 1, 2017 21:29 UTC (Wed) by neilbrown (subscriber, #359) (1 responses)

> (1) have a sense for how ready a given kernel is

Given that mainline is the only focus discussed, and given the current development model, this seems to mean "should Linus release an -rc7, or go straight to -final". If that is ever a hard decision, then just default to -rc7. In fact, I wonder why we don't have a fixed N-week cycle (with variation if Linus' holidays require it).

> (2) increase the odds that regressions get fixed rather than falling through the cracks

Does it though? And is it the most beneficial way to achieve that?

My core point is that regression tracking is best done in a distributed fashion (like everything else in the community except "being Linus" which is centralized). If you find a regression then it is your responsibility to push for a solution, just as if I find a regression it is mine. If we both hit the same regression then we might end up working together and pushing harder for less individual effort.
The more people who take responsibility, the more data, testing, and expertise is available, and the more likely it is that a fix will be found. I think that reporting a bug and following through to a solution is a good way for people of any skill level to feel connected with the community. Giving people responsibility is an important first step to them taking it. If a "regression maintainer" takes that responsibility, we say to the community "we don't need you".

Posted Nov 1, 2017 22:59 UTC (Wed) by Paf (subscriber, #91811)

Maybe instead we say “check with this person” and “I know you don’t have time or don’t know how to follow up, so here’s someone who’s committed to making that easy”.

Not “we don’t need you”, but “we know you have little time, so here’s help”.

Posted Nov 1, 2017 9:44 UTC (Wed) by roc (subscriber, #30627)

In my experience, when I report a kernel regression maintainers do not drop everything else to fix it right away. (Nor should they necessarily; I imagine maintainers sometimes have higher-priority work to do than fix a minor regression.)

Posted Nov 1, 2017 10:52 UTC (Wed) by Funcan (subscriber, #44209)

Tracking helps reduce future regressions; e.g. if there's a clear trend for more regressions in one part of one subsystem, then it's a clear guide that more review and testing (preferably automated testing) should target that area. The amount of automated testing against the kernel is definitely rising, and so having a guide for that work is highly likely to be useful.

Posted Nov 1, 2017 12:27 UTC (Wed) by Paf (subscriber, #91811)

Another thought:
Regressions can sort of... sneak in. They are often unintended side-effects of changes, and can gradually accumulate because no one is looking at an area (notably of performance) any more. And then, gradually, something that used to be all tuned up doesn’t work well any more.

That’s a slightly different argument, though. I, like you, don’t really understand tracking regressions vs new bugs. So little of the kernel is actually providing totally novel functionality, outside of drivers, that I think most bugs could be considered regressions in a certain light. I mean, when you do the next rewrite of path lookup, that’s great and useful work, but it doesn’t provide new end user functionality, just performance. So I guess any bugs in it are regressions...?

Posted Nov 1, 2017 18:39 UTC (Wed) by jhoblitt (subscriber, #77733) (1 responses)

The current situation is that end users have to figure out if their 'regression' is fixed by trial and error. I believe the desire is some sort of useful bug tracker or documentation that can be referenced rather than having to compute by braille. Software projects with more moderate rates of change tend to have summary changelogs but that's probably mission impossible for the kernel. Perhaps a serious user-space regression changelog is within the realm of the possible?

Posted Nov 1, 2017 21:51 UTC (Wed) by neilbrown (subscriber, #359)

> The current situation is that end users have to figure out if their 'regression' is fixed by trial and error.

How else can you ever really know? It is standard-operating-procedure, when you hit a problem, to make sure you are running the latest version of everything - isn't it? It is only when you experience a bug on the latest software that any of this becomes important.
Surely the best way to find out if the regression is known is to use "your favorite search engine". That doesn't require a centralized regression list (except that one automatically maintained by the engine's web crawler).

Posted Nov 2, 2017 14:24 UTC (Thu) by jani (subscriber, #74547)

> As the discussion came to a close, it was noted that regression
> reporting is hard for most users. They don't know where to send
> their reports, and there is little information out there to help
> them. The noise on the mailing lists does not help. The kernel
> Bugzilla is especially problematic since it is the wrong place to
> report many bugs, but it's not clear which ones or where they
> should actually go.

MAINTAINERS has a "B:" entry to specify the preferred channel for bug reports per subsystem. That's a start. But it needs to be used more, and then made more accessible to bug reporters.
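To give a sense of what "more accessible" could look like, here is a minimal sketch that collects the "B:" entries from MAINTAINERS-style text into a subsystem-to-channel map. It assumes the usual layout of that file (a title line per subsystem, then single-letter tagged lines, with blank lines between entries); the function name is hypothetical, and the sketch does nothing special about the explanatory header at the top of the real file.

```python
import re

# Tagged lines look like "M:<tab>Name <address>" or "B:<tab>https://...".
FIELD_RE = re.compile(r'^([A-Z]):\s*(.*)$')


def bug_channels(maintainers_text):
    """Map each subsystem title to the list of its B: entries."""
    channels = {}
    title = None
    for line in maintainers_text.splitlines():
        if not line.strip():
            title = None               # a blank line ends the current entry
            continue
        field = FIELD_RE.match(line)
        if field is None:
            if title is None:
                title = line.strip()   # first untagged line names the subsystem
            continue
        tag, value = field.groups()
        if tag == "B" and title is not None:
            channels.setdefault(title, []).append(value.strip())
    return channels
```

A bug-reporting guide or script could then look up the subsystem a user cares about and show the listed channel, falling back to a generic answer when no "B:" entry exists.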

> Ts'o said that, if it were up to him, he would designate the
> Bugzilla as being for all kernel bugs, and that subsystem
> maintainers would simply be told to cope with it. In the absence
> of such a policy, users will continue to struggle.

I don't think you can "simply tell" maintainers to do anything.

For drm/i915 we prefer bug reports at https://bugs.freedesktop.org/ because it's much more likely the graphics bugs get reassigned between kernel and userspace components than between kernel components. We cope with https://bugzilla.kernel.org/ by telling people to file the bugs at fdo instead.