James Bottomley: Distribution kernel bugzillas considered harmful

September 17, 2018

Related Material:

git_rebase.html

Additional Participants: Jiri Kosina, Konstantin Ryabitsev, Laura Abbott, Mark Brown, Mauro Carvalho Chehab, Paul E. McKenney, Steven Rostedt, and Takashi Iwai.

James Bottomley is seeing people running community distributions in the cloud who waste considerable effort tracking down performance regressions, usually with unsatisfactory results. By the time these people call James in, there is very little time left to fix the problem. Usually, upstream works and bisection finds the fix, but bisection can be quite slow. One problem is that community distributions (even LTS ones) lack the resources to even triage bugs. This calls for automated bisection that is so reliable that even cloud users can run it successfully. It also calls for community-distribution bugzillas to state explicitly and strongly that the bug reporter is expected not only to have located the fix, but preferably also to have already backported it.
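As a minimal sketch of the step that such automation repeats, assuming the reporter has a reproducer that exits 0 when the regression is absent, the Python helper below builds the checked-out tree, runs the reproducer, and converts the outcome into the exit code that `git bisect run` interprets: 0 for good/old, 1 for bad/new, and 125 to skip a commit that cannot be tested. Skipping build failures rather than marking them bad is one way to keep an unattended bisection from blaming the wrong commit. The `hunting_fix` flag inverts the verdict for the case described above, where bisection looks for the commit that fixed the problem rather than the one that broke it. The names `verdict` and `bisect_step` are hypothetical, not part of any existing tool.

```python
import subprocess
from typing import Sequence

# Exit codes that `git bisect run` interprets:
#   0 -> good/old, 1..127 except 125 -> bad/new, 125 -> skip this commit.
EXIT_OLD, EXIT_NEW, EXIT_SKIP = 0, 1, 125


def verdict(built: bool, bug_present: bool, hunting_fix: bool = False) -> int:
    """Map one bisection step's outcome to a `git bisect run` exit code."""
    if not built:
        # An unbuildable commit says nothing about the bug; calling it bad
        # would steer the bisection into the wrong half of the range.
        return EXIT_SKIP
    # Regression hunt: "new" means the bug is present.
    # Fix hunt: "new" means the bug is gone.
    is_new = (not bug_present) if hunting_fix else bug_present
    return EXIT_NEW if is_new else EXIT_OLD


def bisect_step(build_cmd: Sequence[str], test_cmd: Sequence[str],
                hunting_fix: bool = False) -> int:
    """Build the checked-out tree, then run a reproducer that exits 0 when
    the bug is absent. Both commands are placeholders supplied by the caller."""
    if subprocess.run(build_cmd).returncode != 0:
        return verdict(built=False, bug_present=False, hunting_fix=hunting_fix)
    bug_present = subprocess.run(test_cmd).returncode != 0
    return verdict(built=True, bug_present=bug_present, hunting_fix=hunting_fix)
```

For a fix hunt, a small wrapper script ending in `sys.exit(bisect_step(["make", "-j8"], ["./reproducer.sh"], hunting_fix=True))` (both commands hypothetical) could be handed to `git bisect run` after `git bisect start --term-old=broken --term-new=fixed <fixed-commit> <broken-commit>`. For a kernel, the test command would in practice also have to install and boot the new kernel, for example in a VM, which this sketch leaves to the caller.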

Mark Brown noted that improving bisectability would help many other testing efforts as well. Paul E. McKenney said that he rebases his tree to improve bisectability, but that he can do this only because his -rcu tree carries a small number of patches. Furthermore, Paul sometimes has to apply fix patches for unrelated bugs while bisecting, and believes that this could be automated. Steven Rostedt asked how rebasing helps bisection, and Paul replied that the trick is to rebase the fix into the original buggy patch, so that no span of commits contains the bug. However, Paul is not optimistic about the prospects of applying this sort of rebasing regime to something like -stable. James also rebases his SCSI tree, but only before submitting it to Linus, and noted that rebasing is harder when others develop on top of your tree. Paul keeps date-coded branches in place for at least six months so that the old pre-rebase commits remain available, avoiding pulling the rug out from under those developing on his tree. Jiri Kosina noted that SCSI rebases sometimes cause problems for his workflow, calling out a Linus Torvalds email arguing against rebasing. James asked for details so that such breakage could be avoided. Jiri stated that he orders the patches in his tree in the same git-topological order that they will have upstream. This requires a one-to-one mapping between his patches and upstream commits, and rebases destroy that mapping.
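Paul's trick can be illustrated with a toy model, loosely based on `git rebase -i --autosquash`, which folds a commit whose subject begins with `fixup! <subject>` (the form `git commit --fixup` produces) into the earlier commit with that subject. In the sketch below, the hypothetical `Commit` record simply lists the bugs each commit adds and fixes; `buggy_commits` reports every commit at which the cumulative tree contains a bug. Before folding, the bug spans several commits, any of which a bisection might land on; after folding, no commit in the series contains it. This is an illustration of the idea, not how git represents commits, and the subjects are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Toy commit: a subject plus the (hypothetical) bugs it adds and fixes."""
    subject: str
    adds: set[str] = field(default_factory=set)
    fixes: set[str] = field(default_factory=set)


def buggy_commits(series: list[Commit]) -> list[str]:
    """Subjects of commits at which the cumulative tree contains a known bug."""
    live: set[str] = set()
    hits = []
    for c in series:
        live = (live | c.adds) - c.fixes
        if live:
            hits.append(c.subject)
    return hits


def autosquash(series: list[Commit]) -> list[Commit]:
    """Fold each 'fixup! <subject>' commit into the earlier commit with that
    subject, loosely modeled on `git rebase -i --autosquash`."""
    prefix = "fixup! "
    result: list[Commit] = []
    for c in series:
        if c.subject.startswith(prefix):
            wanted = c.subject[len(prefix):]
            for i, t in enumerate(result):
                if t.subject == wanted:
                    # The folded commit carries the change with its fix applied.
                    result[i] = Commit(t.subject, (t.adds | c.adds) - c.fixes,
                                       t.fixes | c.fixes)
                    break
            else:
                result.append(c)  # no target found: keep the fixup in place
        else:
            result.append(c)
    return result


# Hypothetical series: the bug introduced first is only fixed at the end.
series = [
    Commit("rcu: add feature X", adds={"oops"}),
    Commit("rcu: cleanup A"),
    Commit("rcu: cleanup B"),
    Commit("fixup! rcu: add feature X", fixes={"oops"}),
]
```

Here `buggy_commits(series)` flags the first three commits, while `buggy_commits(autosquash(series))` flags none. The same model also shows why this is hard to apply to -stable or to trees others build on: folding rewrites the earlier commit, so its identity changes.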

Takashi Iwai applauded the notion of automated bisection, but noted that hour-long full-distro builds can make bisection painfully slow. Takashi would therefore like some semi-automatic way to reduce the kernel config, and thus the build time. Takashi also agreed that having the reporter do the bisection and fix confirmation would be good, and asked for suggestions to improve the process. Jiri recalled a web-based bisection tool that was never finished, the idea being that the user just clicks “Good” and “Bad” buttons, with everything else automated. Konstantin Ryabitsev liked the idea of a “bisect at home”, saying “I'll totally host the hell out of this”. Sasha Levin noted that KernelCI is already working on automated bisection, so adding simple test cases might get it to where Konstantin wants it to be. Laura Abbott played with bisection scripts for Fedora, but found them uninteresting and fragile. Laura suggested extending the existing distro-build targets to just build the sources, relying on services such as COPR to package the binaries. Konstantin was glad to hear of KernelCI's progress, coining “bisecting as a service” while he was at it. Takashi also liked the KernelCI news. Mauro Carvalho Chehab liked KernelCI as well, but called out the need for hardware as a blocker in many cases.
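The “Good”/“Bad” button tool that Jiri recalled needs little more than some server-side bookkeeping. The minimal sketch below uses a hypothetical `BisectSession` class: the server picks the midpoint of the remaining range for the user to test, each click halves that range, and `builds_left()` reports the worst-case number of remaining builds, ceil(log2 n) for n candidate commits. That bound shows why Takashi's build times matter. A thousand candidate commits need up to ten builds, so at an hour per full-distro build the user waits about ten hours, and any reduction in per-build time pays off at every step.

```python
class BisectSession:
    """Hypothetical server-side state behind a "Good" / "Bad" button UI.

    `commits` is the candidate range, oldest first: the last entry is known
    bad, and a known-good commit is assumed to precede the first entry.
    """

    def __init__(self, commits):
        if not commits:
            raise ValueError("need at least one candidate commit")
        self.commits = list(commits)
        self.lo, self.hi = 0, len(self.commits) - 1  # first bad is in [lo, hi]

    def _mid(self) -> int:
        return (self.lo + self.hi) // 2

    def done(self) -> bool:
        return self.lo == self.hi

    def current(self):
        """Commit the user should build, boot, and test next."""
        return self.commits[self._mid()]

    def good(self) -> None:
        """User clicked "Good": the first bad commit is after the midpoint."""
        if not self.done():
            self.lo = self._mid() + 1

    def bad(self) -> None:
        """User clicked "Bad": the first bad commit is at or before the midpoint."""
        if not self.done():
            self.hi = self._mid()

    def builds_left(self) -> int:
        """Worst-case remaining builds: ceil(log2(number of candidates))."""
        return (self.hi - self.lo).bit_length()

    def first_bad(self):
        return self.commits[self.lo] if self.done() else None
```

A web front end would only need to render `current()`, wire the two buttons to `good()` and `bad()`, and show `builds_left()` so the user knows how much machine time remains.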

Laura said that Fedora does encourage bug reporters to bisect, but that there are obstacles:

Bisecting on local machines is slow and people often don't want to give up their machine resources.
Reproducer test cases are required for bisection, but are pretty rare.
People are hesitant to run bisections and build kernels, because there are a lot of steps involved. Pointing people to wiki pages with instructions is not always a substitute for explaining how to do the setup.
People are hesitant to report bugs upstream, and need to be told where to report them. It is sometimes necessary to run get_maintainer.pl for people; otherwise, they just file the bug against the kernel.org bugzilla or just e-mail LKML. Laura started a skeleton of a web project to provide a web interface for get_maintainer.pl, but it never got very far.
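The core lookup that such a web interface would wrap can be sketched briefly, assuming a simplified reading of the kernel's MAINTAINERS file: each section is a title line followed by tagged lines, where `M:` names a maintainer, `L:` a mailing list, and `F:` a file pattern. A pattern ending in `/` covers everything below that directory, and `*` does not cross a `/`. The real get_maintainer.pl handles many more tags and heuristics (reviewers, exclusion patterns, keyword matches, git history), so this is an illustration rather than a replacement. The helper names and the sample entries are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    maintainers: list[str] = field(default_factory=list)  # M: lines
    lists: list[str] = field(default_factory=list)        # L: lines
    patterns: list[str] = field(default_factory=list)     # F: lines


def parse_maintainers(text: str) -> list[Section]:
    """Parse MAINTAINERS-style blocks: a title line, then 'X:<tab>value' lines."""
    sections: list[Section] = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the section
            continue
        tag = re.match(r"([A-Z]):\s*(.*)", line)
        if tag is None:
            current = Section(title=line.strip())
            sections.append(current)
        elif current is not None:
            key, value = tag.groups()
            if key == "M":
                current.maintainers.append(value)
            elif key == "L":
                current.lists.append(value)
            elif key == "F":
                current.patterns.append(value)
    return sections


def pattern_matches(pattern: str, path: str) -> bool:
    """Trailing '/' matches everything below a directory; '*' and '?' stay
    within one path component (a simplification of get_maintainer.pl)."""
    if pattern.endswith("/"):
        return path.startswith(pattern)
    regex = "".join(
        "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, path) is not None


def who_to_mail(sections: list[Section], path: str) -> tuple[list[str], list[str]]:
    """Collect maintainers (To:) and mailing lists (Cc:) for one file path."""
    to: list[str] = []
    cc: list[str] = []
    for s in sections:
        if any(pattern_matches(p, path) for p in s.patterns):
            to += [m for m in s.maintainers if m not in to]
            cc += [l for l in s.lists if l not in cc]
    return to, cc


# Hypothetical two-section excerpt in MAINTAINERS style (not real entries).
SAMPLE = """\
SCSI SUBSYSTEM
M:\tExample Maintainer <maint@example.org>
L:\tscsi-list@example.org
F:\tdrivers/scsi/

NET DRIVERS
M:\tOther Person <other@example.org>
L:\tnet-list@example.org
F:\tdrivers/net/*
"""
```

With this sample, `who_to_mail(parse_maintainers(SAMPLE), "drivers/scsi/sd.c")` returns the SCSI maintainer and list, which a web page could present to a reporter who only knows which file or driver is involved.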

Laura agreed that although tooling can help, additional documentation is also required, and some people will still need one-on-one guidance.