Sasha Levin: Issues with stable process

July 30, 2015

Related Material:

2014 ksummit-discuss: Jiri Kosina: Stable workflow
2014 ksummit-discuss: Li Zefan: Stable Issues
2013 ksummit-discuss: Jiri Kosina: Stable trees and pushy maintainers

Additional Participants: Andy Lutomirski, Geert Uytterhoeven, Greg Kroah-Hartman (Greg KH), Guenter Roeck, James Bottomley, Jan Kara, Jonathan Cameron, Jonathan Corbet, Josh Boyer, Linus Torvalds, Mark Brown, Masami Hiramatsu, Neil Brown, Olof Johansson, Rafael J. Wysocki, Stephen Rothwell, Steven Rostedt, and Zefan Li.

People tagged: Dave Jones, Edward Snowden, Kees Cook.

Please note that this is a summarized summary.

Sasha Levin calls out several issues with the current processes for stable trees:

Bug fixes introduced in -rc releases do not necessarily go through -next testing before acceptance. If these fixes introduce other bugs (as one in six fixes does on average), these other bugs will be directly injected into the various stable trees. This destabilizes the stable trees, which is not the desired effect.
Sasha has never received comments during the review cycle for a stable release. He instead gets comments from users after the patch breaks something. Sasha suggests extending the review cycle to include downstream users. [ Ed. note: Would this mean -rc releases for stable trees? ]
There is no standard way to verify that a given stable tree is taking in only those patches relevant to that tree. Sasha would like a mechanism to allow easy comparison of which patches are pulled into which -stable tree and also to compare backports.
Sasha wonders what people expect to happen when they mark a patch for stable, how much added noise would be acceptable, and if there is some better way to handle this than the current email-based scheme.
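The comparison Sasha asks for in the third item could be approximated from commit messages alone. The sketch below is only an illustration, not a tool discussed in the thread: it assumes each backport carries one of the two conventional upstream markers ("commit <sha> upstream." or "[ Upstream commit <sha> ]"), and the function names upstream_ids and compare_trees are hypothetical.

```python
import re

# Conventional markers that stable backports carry in their commit message.
# Assumption: every backport in the compared trees uses one of these forms.
UPSTREAM_RE = re.compile(
    r"^\s*(?:\[ Upstream commit ([0-9a-f]{40}) \]"
    r"|commit ([0-9a-f]{40}) upstream\.?)\s*$",
    re.MULTILINE,
)


def upstream_ids(messages):
    """Collect the mainline commit IDs referenced by a tree's commit messages."""
    ids = set()
    for msg in messages:
        for match in UPSTREAM_RE.finditer(msg):
            ids.add(match.group(1) or match.group(2))
    return ids


def compare_trees(a_messages, b_messages):
    """Report which mainline commits are in tree A only, tree B only, or both."""
    a, b = upstream_ids(a_messages), upstream_ids(b_messages)
    return {"only_a": a - b, "only_b": b - a, "both": a & b}
```

The commit messages themselves could come from something like "git log --format=%B" on each stable branch. Backports that lack a marker would be invisible to this comparison.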

Bug Flow-Through Rate

Geert Uytterhoeven stated that if the bug in the original fix is serious, a fix to the fix should flow through quickly. Plus the original fix will appear in -next as soon as it hits mainline. Sasha argues that the original fix should have been in -next before it hit mainline, noting that there is often quite a delay between the fix and its acceptance. Also, many -stable maintainers have several weeks between releases, which might leave the bug unfixed for quite some time. Sasha suggests aligning -stable releases with the kernel releases and differentiating between patches going into LTS vs. other stable trees.

Neil Brown suggests that patches not go into -stable until after a full mainline release or after the patch has been in Linus's tree for two weeks. Neil suggests tagging urgent fixes with "URGENT" in the stable CC line. Andy Lutomirski agrees that non-critical fixes are getting into -stable too quickly. Sasha suggests that fixes should not go into mainline without at least two weeks of -next testing, unless a top-level maintainer overrides, presumably in addition to the two-week delay from mainline to stable that Neil suggests. Sasha also believes that mainline has become more stable over time, and that processes need to be adjusted accordingly. However, Neil doubts that two weeks of -next testing is really sufficient (Steven Rostedt and Rafael J. Wysocki agree). James Bottomley believes that two weeks in -next isn't going to catch significantly more bugs than two days would. Neil also suggests that the stable trees wait until -rc8 to pull that release's fixes from mainline.
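Neil's proposed gate for -stable can be written down as a simple predicate. This is a minimal sketch of the rule as summarized above, not an existing script: the function name stable_eligible, its parameters, and the representation of the "URGENT" tag as a boolean are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Two weeks in Linus's tree, per Neil's suggestion.
SOAK_PERIOD = timedelta(days=14)


def stable_eligible(merged_on, today, urgent=False, in_full_release=False):
    """Return True if a mainline fix may be queued for -stable.

    A fix qualifies when it is tagged urgent, when it has already shipped
    in a full mainline release, or when it has been in mainline for at
    least the soak period.
    """
    if urgent or in_full_release:
        return True
    return (today - merged_on) >= SOAK_PERIOD
```

For example, a fix merged on July 1 would become eligible on July 15, while one tagged urgent would be eligible immediately.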

Rafael J. Wysocki suspects that real testing doesn't happen until the stable release and also that some commits should not be CCed to stable until after they have spent some time in mainline. Mark Brown says that embedded developers make heavy use of -next due to it being the first integration point for multiple subsystems. Mark says that kernelci.org and Olof's testbot find a goodly number of -next bugs. Sasha also sends out quite a few patches for bugs found by -next testing. Sasha Levin notes that backported fixes by definition never show up in mainline or -next, so he is concerned that they don't see enough testing.

Stable Process Escapes

Neil Brown noted that some fixes never did make mainline. Greg KH and Sasha asked for root-cause analysis. Neil clarified that he meant that the bug appeared in a stable release, but not in a mainline release (as opposed to a mainline release candidate).

-next Tree Criteria

Jon Corbet suspects that people see -next as -rc1 without quality control, which he believes reduces their willingness to test it. Stephen Rothwell would like to see more stability, in other words, more testing of patches before they hit -next. Geert Uytterhoeven notes that -rc1 used to be volatile and scary, so that great courage was required to test prior to -rc3 or -rc4. Geert believes -next plays the role that -rc1 used to play. James Bottomley agrees and seconds the thought that bug fixes should cook a bit in -next. Rafael J. Wysocki believes that -next evolves too fast for long-running tests to be useful. Mark Brown argues that -next is about integration rather than long-running tests. Geert Uytterhoeven notes that even when issues are found in -next, commits still sometimes get into mainline unfixed.

Stephen Rothwell sees that a lot of code doesn't hit -next until the merge window opens, and that some doesn't hit -next ever. Mark Brown would like to see patches be in -next for at least one cycle so that the automated testing has a chance to run. However, Mark feels that urgent fixes should have a fastpath. Sasha Levin argues for simply delaying the next release if such bugs appear, especially if they appear in the later -rc releases. Mark countered that anything that disables automated testing slows things down, so keeping critical fixes out is unacceptable. Sasha would rather revert the offending commit in those cases. Steven Rostedt notes that Linus often prefers a quick fix (even if untested) over a revert, Geert Uytterhoeven notes that reverting is sometimes highly non-trivial, and Jan Kara added that reverting usually happens if the revert is trivial and if the state after the offending commit is worse than before. Sasha replied that it depends on the triviality of the fix [compared to the triviality of the revert], and pointed out that more people rely on -rc releases than used to be the case, which suggests that policies be revisited. Steven Rostedt suggests that delaying movement of patches into stable releases might give better results. Neil Brown suggests that the rule be “in a 0day tree” instead of “in -next”, but Sasha called out the integration benefits of -next. Neil is concerned that the non-automated nature of -next will introduce excessive delays. James requires that patches be in -next for two days, which he believes is a reasonably short time period, and which has occasionally caught bogus fixes. Neil Brown suggests automatically checking whether patches accepted into mainline have undergone 0day testing. Andy Lutomirski is unconvinced that any sort of automated test is a substitute for a -rc or a real release, noting a couple of bugs missed by 0day and -next.

Linus Torvalds stated that -next is supremely useful, but mostly because it catches build breakage and integration issues. However, Linus believes that very few people actually run -next, which limits its bug-finding capabilities. Jan Kara agreed, noting that he recently introduced a couple of regressions that none of the automated tooling caught, but which manual testing hit a couple of days after the regressions hit mainline.

Maintainership/Test Process

Steven Rostedt objected that fixes for critical bugs cannot wait for a two-week testing cycle, and will not wait that long if they break Linus's build or boot. James Bottomley argues that critical bugs are likely the result of failing to follow the process in the first place, especially in the case of build breakage, which should be caught by the 0day Test Robot. James suggests that maintainers run their own tests, but that they also package up these tests and hand them to Fengguang for inclusion in the 0day Test Robot's suite of tests (Mark Brown: kselftest!). Mark Brown noted that critical bugs can happen even if everyone is following the process because of interactions, hardware dependencies, and the like. Steven Rostedt feels that some of the tests he runs are not yet ready for handoff, and suggests a discussion on enabling others to run such tests, and also on the possibility of a long-running class of selftests. Andy Lutomirski believes that kselftests is already too slow as it is.

Testing Stable Release Candidates

Olof Johansson no longer runs his testbot on stable releases, but might resume such testing given tagged or branched -rc candidates from stable releases. Greg KH notes that kernelci.org now does stable -rc releases, and thus believes that Olof's testing them would be redundant. Olof seemed happy to remove this item from his todo list. Guenter Roeck is happy to test stable trees as they currently are. Guenter's tests do not commence until a couple of hours after the last commit to a stable tree to avoid testing churn. Zefan Li says that the 0day Test Robot includes linux-stable.git, but is not sure whether or not it tests the queue branches. Sasha Levin would like formal stable release candidates, suggesting that this would prompt additional testing by end users. Greg KH does weekly stable releases, so he expects that there is not enough time to get a release candidate propagated through a distro to its end users. Sasha Levin expects that this would work better for the LTS releases, and asked why once per week. Greg KH says that waiting longer than one week means huge piles of patches, and that people don't like waiting longer than that. Plus it is good to get bug fixes out sooner rather than later. Sasha Levin suggests keeping the weekly spacing, but releasing the current release at the same time as the next release's release candidate. Greg KH expects that such an approach would be confusing, that no one would actually test the release candidates, and that he does not have bandwidth for the additional releases.

Distro Testing of Stable Trees

Greg KH does get feedback from Fedora, but says that bugs are unusual, indicating that more testing won't help much. Josh Boyer says that Fedora avoids .1 stable releases due to the relative instability of Linus's -rc1 releases. However, this policy was set before the current “must be in a Linus release” rule, so perhaps .1 stable releases are now more stable. However, there are normally a lot of post-rc1 fixes, so Josh suspects that waiting for .2 is still advisable.

Collaboration Between Stable Trees

Masami Hiramatsu suggested that bugzilla might be used to track patches going into different stable trees. Zefan Li runs scripts that check for 4.x fixes for regressions introduced by patches he has added to his 3.4 stable tree. He also noted that Ben is much more aggressive about backporting patches to 3.2, even when they don't trivially apply. Zefan suspects that a patch that applies to an older stable tree likely should be applied to the newer stable trees. Zefan would not notify the author of the original patch of any failures stemming from applying that patch to 3.4 because such failures are quite frequent. Sasha likes Zefan's scripts, but believes that a bad fix pulled into a late mainline -rc could still cause trouble. Sasha points to Zefan's experience with Ben's backporting as a reason for more collaboration among stable-tree maintainers, and wonders if more recent stable trees would be better served by notifying the original patch author. Zefan agreed that collaboration would be good, particularly among maintainers of closely related stable trees. Zefan calls out scripting as an especially fruitful area of collaboration. Sasha agreed on scripting, though Greg KH seemed reluctant to rewrite his scripts yet again. Greg nevertheless believes that a face-to-face discussion would be useful.
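The kind of check Zefan describes, looking in newer kernels for fixes to patches already backported, could be sketched using the "Fixes:" tag convention in mainline commit messages. Zefan's actual scripts are not shown in the discussion, so everything below is an assumption for illustration: the function name pending_fixes, the (sha, message) input format, and the use of lowercase full-length commit IDs for the backported set.

```python
import re

# A mainline "Fixes:" tag names the commit being fixed by an abbreviated ID.
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})\b", re.MULTILINE | re.IGNORECASE)


def pending_fixes(backported, mainline):
    """Find mainline commits that fix something already backported.

    backported: set of full mainline commit IDs present in the stable tree.
    mainline:   iterable of (sha, message) pairs from newer mainline history.
    Returns a list of (fixing_sha, fixed_sha) pairs still missing from the
    stable tree.
    """
    hits = []
    for sha, message in mainline:
        if sha in backported:
            continue  # this fix has already been picked up
        for match in FIXES_RE.finditer(message):
            abbrev = match.group(1).lower()
            for full in backported:
                if full.startswith(abbrev):
                    hits.append((sha, full))
    return hits
```

Such a check only finds follow-up fixes whose authors added a "Fixes:" tag, so it complements rather than replaces review by the stable maintainers.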