Sasha Levin: Bug-introducing patches

September 11, 2018

Related Material:

  1. May 2018 bug-introducing patches thread.
  2. Machine learning and stable kernels LWN article
  3. Greg Kroah-Hartman rant.
  4. Linus Torvalds rant.

LWN QoTW:

If I don't have special rights as a maintainer and you don't trust me that I use my common sense when I'm using these special rights, then you degraded me to a patch juggling monkey. On the day this happens, I'll step down.

— Thomas Gleixner

Additional Participants: Al Viro, Andy Shevchenko, Boris Brezillon, Dan Carpenter, Daniel Vetter, Dan Rue, Dan Williams, David Howells, David Lang, David Sterba, Eduardo Valentin, Eric W. Biederman, Fengguang Wu, Geert Uytterhoeven, Greg Kroah-Hartman, Guenter Roeck, James Bottomley, Jani Nikula, Jan Kara, Jens Axboe, Jiri Kosina, Justin Forbes, kbuild test robot, Ken Moffat, Krzysztof Kozlowski, Laura Abbott, Laurent Pinchart, Linus Torvalds, Mark Brown, Matthew Wilcox, Mauro Carvalho Chehab, Olof Johansson, Pavel Machek, Randy Dunlap, Stephen Rothwell, Steven Rostedt, Theodore Y. Ts'o, Thomas Gleixner, Tony Lindgren, Ulf Hansson, Vinod Koul, and Willy Tarreau.

Introduction

Sasha referred back to an earlier contentious discussion on regressions induced by bug fixes. Daniel Vetter asked for a summary of problems and different viewpoints from this earlier discussion, a view seconded by Steven Rostedt. Sasha obliged, stating that there was a general consensus that fixes going into late -rc cycles not only get less testing and review, but are also buggier than merge-window patches. Sasha called out the following proposed solutions:

  1. Prohibiting non-critical patches beyond (say) -rc4.
  2. Prohibiting fixes to bugs introduced before the current release's merge window.
  3. Instituting a one-week period just before release during which no patches are taken.
  4. Mandating a minimum -next testing time for all fixes.

Sasha noted that almost all post-merge-window commits were written within 2-3 days of being merged and, worse yet, that these high-risk commits are very often pushed to -stable. Sasha also noted that a one-size-fits-all approach seemed unlikely to work.
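
As a rough illustration of how such an age measurement might be made (a hypothetical sketch, not Sasha's actual tooling), one can compare each commit's author date with its committer date over a late-rc range; the gap is a crude proxy for how long the patch existed before a maintainer applied it:

    #!/usr/bin/env python3
    # Hypothetical sketch: estimate how old patches were when applied,
    # using author-date vs. committer-date as a rough proxy. Run from a
    # mainline git clone; the v4.18 range below is only an example.
    import subprocess

    def merge_lag_days(rev_range):
        """Yield (sha, days between authorship and application)."""
        out = subprocess.run(
            ["git", "log", "--no-merges", "--format=%H %at %ct", rev_range],
            capture_output=True, text=True, check=True)
        for line in out.stdout.splitlines():
            sha, author_ts, commit_ts = line.split()
            yield sha, (int(commit_ts) - int(author_ts)) / 86400.0

    lags = list(merge_lag_days("v4.18-rc4..v4.18"))
    if lags:
        young = sum(1 for _, days in lags if days <= 3)
        print(f"{young}/{len(lags)} commits applied within 3 days of authorship")

Note that rebasing refreshes committer dates, so this measures when a maintainer last applied the patch, not when Linus pulled it.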

Which patches are most likely to introduce regressions?

Steve pointed out that the fast-paced development cycle inherently limits the amount of time available to test regression fixes. Steve also argued that being in Linus's tree is the best test, adding the old saw “We don't need QA, that's what users are for”. Greg KH responded by pointing to his 2013 rant on the lack of testing of patches sent to -stable, but noted that the 2013 rant applied to patches between -rc1 and -rc2 rather than after -rc4.

Linus Torvalds argued that the sample of people in the May 2018 bug-introducing patches thread was too small to be representative, and argued that merge-window patches should be expected to have more out-of-tree testing precisely because they are development patches rather than bug fixes. Linus then listed several other reasons why late-rc fixes should in fact be expected to have more bugs than merge-window commits:

  1. The bugs being fixed tend to be nastier: If they weren't nasty, they would have been found earlier.
  2. The fixes are more likely to be subtle, given that it took time to find the bug.
  3. Unlike merge-window patches, there is a sharp upper bound on the possible out-of-tree lifespan of a late-rc fix, set by the remaining duration of the release cycle.

Sasha did not take this sitting down, arguing that many of the late -rc fixes were fixing bugs from long before the merge window, arguing strenuously that late -rc fixes needed to clear a higher bar than Linus currently requires, and that the current de facto policy was needlessly injecting bugs into -stable. Linus did not take Sasha's not sitting down sitting down, arguing that at least some of what Sasha was seeing was due to selection bias. Linus also called out the downsides of alternative strategies: deferring the merge of the fix (lack of testing), waiting until the next merge window (defers the backport to -stable), and reverting the commit that introduced the bug (not always feasible, for example, when the bug is a consequence of several widely separated commits [ Ed. note: For example, this ]). Steve suggested collecting data on what found the bugs injected after -rc5, suggesting that it would make sense to require more -next testing only if a significant fraction of them were detected by the various bots that are normally run on -next releases.

Linus also defined a “good merge window” as one where he doesn't have to bisect anything, and further called out the i915 developers as having greatly improved the quality of their submissions (Daniel and Mark Brown confirmed that they have in fact ramped up their testing, and also that -next stability seems to have improved over time). However, Linus echoed Steve in stating that the real testing occurs once distro releases and real users come into the picture. Sasha countered that reduced testing for late-in-cycle patches reduces the incentive for further development of bots. Sasha also took exception to the call for better -stable curation, pointing out that some problematic commits came from the most careful maintainers. Linus made a number of counter-arguments, perhaps most notably predicting that merge-window -stable patches might prove to be more troublesome in v4.19 than late-rc patches, while still expecting late-rc patches to be more troublesome in most releases. Sasha noted that he and Linus agreed that late-rc patches were more problematic than merge-window patches, but disagreed as to the cause. Sasha said that he does not think that the answer is WONTFIX. Steve suggested CANTFIX.

Integration issues

Geert Uytterhoeven argued that splitting complex patches by subsystem reduces the amount of integration testing that the full set receives. Daniel expressed surprise at how much SoC vendors split up their work, but Geert stated that SoC maintainers normally keep their own integration branches, and described how various device-introduction bootstrapping issues are handled. Olof pointed out that some splitting is required in order to scale the development/maintainer team across multiple subsystems and vendors.

Integration issues: -next

Linus also noted that -next testing covers his tree, so that accepting a patch does not prevent additional -next testing. However, Steve pointed out that -next quickly fills with new code after -rc5 [ Ed. note: Like RCU patches! ], which makes it more difficult for -next testing to find subtle bugs amidst the noise of commits being readied for the next merge window.

Guenter Roeck asked how useful added time in -next would be, given that it had been failing many of his tests over the past week. Stephen Rothwell asked if these failures were due to the mount API changes; Guenter said that he believed so but had not yet had time to bisect, then shortly thereafter completed the bisection and reported the bug. Steven suggested that this was a situation where Linus should yell at maintainers, which led Eduardo Valentin to ask whether everyone should be doing that, further suggesting that things not be posted to -next until they were deemed ready to go to Linus. Steven agreed with this suggestion, stating that this is why he asked for Linus to yell at transgressors, though he later softened his request to one for a good Linus rant. Nevertheless, Steve would like to see the 0day test robot have a couple of days to work over patches before they go to -next. Stephen Rothwell allowed as how maybe one of his -next maintainer duties was yelling at people, except that “yelling is not one of my better skills”. Linus liked the idea of others taking over at least some of his yelling duties (“when daddy gets home, you'll really get it”), but suggested that it would be valuable to track -next failures and report them to him, thus letting him avoid pulling the offending commits. However, this does require that each failure be pinpointed to at least the pull request that is to blame.
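
Pinpointing a failure to the offending commit (and thus to the offending pull request) is usually a job for git bisect; a minimal automated bisection of a -next failure might look like the following sketch, in which the test script and both tags are placeholders:

    #!/usr/bin/env python3
    # Hypothetical automated bisection: mark the failing -next tag bad and
    # the last known-good release good, then let git drive a test command.
    import subprocess

    def bisect(bad, good, test_cmd):
        # "git bisect start <bad> <good>" takes the bad revision first.
        subprocess.run(["git", "bisect", "start", bad, good], check=True)
        try:
            # "git bisect run" checks out successive revisions and runs
            # test_cmd: exit 0 means good, 125 means skip, and other
            # codes from 1 to 127 mean bad.
            subprocess.run(["git", "bisect", "run"] + test_cmd, check=True)
        finally:
            subprocess.run(["git", "bisect", "reset"], check=True)

    # Placeholder build-and-boot test script; the tags are examples only.
    bisect("next-20180911", "v4.19-rc3", ["./build-and-boot-test.sh"])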

As an alternative to yelling, Mark suggested consistently reporting test failures, noting that kernelci failure rates have improved over the past four years. Guenter demurred, stating that consistently reporting -next failures caused the reports' recipients to start yelling at him about the “noise”, for example, for not having pulled in fixes that weren't yet in -next (Mark, Steve, and Daniel sympathized, and Dan Carpenter reported similar yelling induced by static-checker warnings). This yelling has caused Guenter to stop reporting -next bugs, instead working on them himself in his copious spare time. Jani Nikula wondered why the fixes wouldn't be in -next, and Mark, Geert, and James pointed out that the required QA delays would delay the fixes reaching -next. Daniel works toward a six-hour CI turnaround in order to avoid this problem, which requires that about one week of test time be carried out in those six hours. [ Ed. note: That requires 28 machines (168 hours of testing divided by 6 hours of wall-clock time), assuming a six-hour minimum inter-patch arrival time, which may be outside the budget of many maintainers. ] Mark added that CI cannot do longer-running tests, which are needed to find some types of bugs.

James noted that -next is built on Australian time, which means a 24-hour fix delay for many other timezones. James therefore suggested some sort of coordination list to prevent multiple redundant bug reports, the idea being to reduce the irritation level and hopefully thus the amount of yelling that Geert must endure when reporting bugs in -next. Geert suggested the -next mailing list. Mark called out kernel-build-reports at lists.linaro.org as one bug-reporting forum, but noted that it is pretty much a firehose, suggesting a bug tracker as an alternative. Mark also believes that most maintainers should be able to send repetitive emails without yelling, especially given that the Linux kernel community is not that overburdened with people reporting bugs from testing. James suggested that if a maintainer's yelling causes tests not to be run, perhaps we are in fact overburdened, though Mark disagreed. [ Ed. note: Or was James's statement an example of British irony? ]

What should go to -next and when?

Tony Lindgren said that the mm tree could use more testing before -next, but suspected that the complexity of mm patches might be a factor. Tony proposed that patches younger than 24 hours not go to -next. James Bottomley disagreed, arguing that 0day is overloaded and that about 80% of the automated tests won't be run until the commits hit -next. For example, 0day runs more thorough xfstests runs on commits that are in -next. James's criteria for pushing to -next are:

  1. Local tests pass.
  2. Code inspection complete.
  3. 0day test robot didn't find any issues.

Even with these gates, -next testing often finds additional bugs, so James believes that further delays entering -next will simply further delay the location and fixing of bugs, noting that -next (and not 0day) is supposed to find integration bugs. Eduardo wondered why 0day cares whether a commit is in -next, implicitly asking why 0day doesn't run the more thorough xfstests unconditionally. Geert believed that 0day runs fewer tests when heavily loaded. More viewpoints were aired on what testing should be done by which bot at which time. Stephen called out the documented entrance criteria for -next testing:

Thanks for adding your subsystem tree as a participant of linux-next. As you may know, this is not a judgement of your code. The purpose of linux-next is for integration testing and to lower the impact of conflicts between subsystems in the next merge window.

You will need to ensure that the patches/commits in your tree/series have been:

  1. Submitted under GPL v2 (or later) and include the Contributor's Signed-off-by.
  2. Posted to the relevant mailing list.
  3. Reviewed by you (or another maintainer of your subsystem tree).
  4. Successfully unit tested.
  5. Destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him to fetch). It is allowed to be rebased if you deem it necessary.

Al Viro argued that the first point in Stephen's list above should instead read “Compatible with GPL v2”.

Eduardo expected that improving the test bots would be more helpful than adjusting -next admission criteria, but agreed wholeheartedly with Stephen's “successfully unit tested” criterion. Steve called out the kernel's tools/testing/selftests/ suite, but was skeptical of the community's ability to actually enforce the “successfully unit tested” criterion. Eduardo added that testing commits for the thermal subsystem requires a thermal chamber, which the typical hacker seems not to have, and that even thermal-chamber testing might miss corner cases arising from unusual workloads. Guenter suggested emulating a thermal chamber in qemu. Linus agreed that while the test bots have improved greatly, they are not all that helpful for hardware issues.
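
For reference, one way to exercise a single target from that selftests suite is shown in the sketch below; the “timers” target is only an example, and a configured kernel source tree is assumed:

    #!/usr/bin/env python3
    # Hypothetical wrapper around the in-tree kselftest suite: build and
    # run one target from a kernel source tree. "timers" is an example;
    # see tools/testing/selftests/ for the available targets.
    import subprocess, sys

    def run_selftests(kernel_tree=".", targets="timers"):
        result = subprocess.run(
            ["make", "-C", "tools/testing/selftests",
             f"TARGETS={targets}", "run_tests"],
            cwd=kernel_tree)
        return result.returncode

    if __name__ == "__main__":
        sys.exit(run_selftests(*sys.argv[1:]))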

Thomas Gleixner wondered why some patches of interest to him were stuck in random branches (repeatedly rebased) for more than 18 months.

What should go into mainline and when?

Sasha argued that only critical fixes should go into late -rc releases, and that these fixes should be carefully tested before hitting mainline, suggesting that a Tested-by tag be required, along with a posting to LKML and an Acked-by or Reviewed-by tag. Steve pointed out that fixing new bugs late in the -rc cycle sometimes requires fixing old bugs as well, which makes it difficult to justify any rule requiring the fixing of only those bugs introduced in the most recent merge window.

What should go into -stable and when?

Linus proposed that -stable commits be more carefully curated, but also noted that the -stable maintainers face the same dilemma that he faces when considering late-in-cycle fixes (though he does argue that a fix should not go into -stable until a week or so after it hits mainline). Linus also stated that his rule for -rc5 was to take only things marked for -stable, regardless of the merge window during which the bug was injected. Guenter also argued that -stable acceptance criteria should be stricter than those of mainline, and that patches should be given some time to “soak” in mainline before being applied to -stable. Laura Abbott noted that she had previously proposed a longer -rc time for stable for this same reason: the new mainline fixes might themselves have bugs. Sasha responded with a set of regressions from the CoreOS example, asking which of them should have been excluded from -stable.

There was an extended discussion on the merits and failings of these commits, with several process faults called out.

Laura agreed that these commits should have been added to -stable, but only after some test time to find the corresponding bugs. Sasha agreed, but asked how this might be done, suggesting a stable-next branch containing candidate backports. Greg KH pointed out that he already (almost always) waits for a full -rc cycle and runs all the bots before pushing to -stable, and suspected that a stable-next branch would simply be ignored. Sasha responded that a longer wait would allow time for more manual testing, suggesting the following process:

  1. Grab stable-tagged commits as they go into Linus's tree and put them on top of the appropriate stable-next branch (e.g., linux-4.14.y-next).
  2. X times a week, pick a batch of ~2-3 week old commits, put them in the -rc branch, and send out a review request.
  3. Wait a few days for reviews.
  4. Ship it.

Linus suggested automating the step of grabbing stable commits, later clarifying that he was suggesting doing the automated grabbing before Greg picks up the -stable patches, which Greg felt was semi-reasonable. Sasha has done something like that in the past, but got complaints about bug-report noise. Guenter suggested not widely reporting failures, but instead using them to block the progress of the offending commits to -stable (while notifying the submitter of such blocking). Guenter also suggested adding networking and virtual/network filesystem tests to catch more bugs. Mark recommended also including more of the subsystem maintainers' and developers' test suites. Dan Rue said that Linaro builds and tests quite a few stable-rc branches, but noted that reducing the number of such branches allowed test coverage to be increased. Dan also noted that adding a new test suite to a CI/CD framework is a non-trivial undertaking, so that centralizing effort around fewer test suites also increases coverage. Ted argued that the test bots, though useful, were no substitute for a good set of subsystem-specific regression tests, a sentiment with which Jens Axboe agreed. Daniel agreed that well-run subsystems should have good test suites and processes.
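
A minimal sketch of what such automated grabbing might look like (hypothetical; any real stable tooling is considerably more sophisticated): scan newly arrived mainline commits for stable tags and cherry-pick them onto the appropriate stable-next branch, punting conflicting backports to a human:

    #!/usr/bin/env python3
    # Hypothetical sketch of step 1 above: find new mainline commits whose
    # changelogs Cc stable and queue them onto a stable-next branch. The
    # branch name and starting point are examples; 'origin' tracks mainline.
    import subprocess

    def stable_tagged_commits(since):
        """Return SHAs (oldest first) of commits Cc'ed to stable."""
        out = subprocess.run(
            ["git", "log", "--reverse", "--format=%H", "--no-merges",
             "--grep=stable@vger.kernel.org", f"{since}..origin/master"],
            capture_output=True, text=True, check=True)
        return out.stdout.split()

    def queue_on_stable_next(shas, branch="linux-4.14.y-next"):
        subprocess.run(["git", "checkout", branch], check=True)
        for sha in shas:
            # -x records the upstream SHA, as the stable rules require.
            if subprocess.run(["git", "cherry-pick", "-x", sha]).returncode:
                # A conflict means the backport needs a human; skip it.
                subprocess.run(["git", "cherry-pick", "--abort"])
                print(f"needs manual backport: {sha}")

    queue_on_stable_next(stable_tagged_commits("v4.18"))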

Sasha created an autogenerated stable-next branch for v4.14. Sasha also noted that -next has a pending-fixes branch, and suggested that it be tested more heavily. Jens expressed doubt that this pending-fixes branch would be tested much. Mauro agreed, and gave this as a reason not to wait too long between sending to -next and sending to mainline.

Tests involving special hardware

Mauro argued that CI is even more advantageous for smaller teams, but called out the difficulties in testing audio/video input hardware. Laurent Pinchart suggested that known scenes and signals can be used to test such input devices. Mauro agreed that video output could be handled reasonably, but argued that test setups for things like cameras, microphones, speakers, keyboards, mice, touchscreens, and so on were much more complex and fragile. For example, microphone testing is subject to interference from background noise. Laurent responded that test setups could be quite small, and, while agreeing that there was some expense, pointed out that the degree to which the community values this functionality could be measured by how much it is willing to invest. Eduardo called out similar complexity for thermal testing, but noted that progress might be made by limiting the scope of the testing and through use of automation, perhaps including emulation.

Maintainer duty cycles

Mauro Carvalho Chehab noted that the merge window was the slowest time for him, with the busiest time being the weeks leading up to the merge window. Stephen said that his quiet time was -rc1 to -rc3, due to the offloading from -next to mainline during the merge window. Tony Lindgren said that he used to see regressions after -rc6, but over the past few years has reduced their number by testing mainline a few times a week and reporting any bugs he finds, which then tend to get fixed quickly.