Dan Williams: Move Fast and Oops Things

May 15, 2014

Participants: Andy Grover, Chris Mason, Dan Carpenter, Geert Uytterhoeven, Greg KH, James Bottomley, John W. Linville, Josh Boyer, Laurent Pinchart, Matt Fleming, Neil Brown, Rafael J. Wysocki, Randy Dunlap, and Theodore Ts'o.

People tagged: Andrew Morton, Dave Chinner, Fengguang Wu, Linus Torvalds, and Stephen Rothwell.

Dan Williams notes that Facebook incorporates more code from more developers more quickly than does the Linux kernel, and suggests a few mechanisms that he believes might allow the kernel to move more quickly, including “merge karma”, the ability to merge NACKed code in a disabled state, and more aggressive discovery and running of tests. Neil Brown asks if the Linux kernel is really being slowed down by anything other than the limited amount of “competent engineering time”. Dan Williams believes that testing and limited audience are additional limiting factors.

Chris Mason suspended disbelief and described some differences between Facebook's and the kernel's modes of operation, noting Facebook's centralized patch-review tool, automated pre-review testing of patches, lack of maintainers, and use of “gatekeepers” to allow testing of new features against a subset of the users. Chris also called out the fact that, unlike the kernel community, Facebook has full control over deployment of its code. Randy Dunlap likes the automated testing, and agreed that the maintainership bottleneck is a problem. Andy Grover also called out non-technical differences, including the kernel's lack of a common management chain (maintainers can say “no” to changes, but cannot normally order a developer to work on some specific change), the kernel's lack of control over its developers (not paying them, after all!), and the fact that the kernel's developers might be working for competing organizations. Andy also asked what can be done about various kernel-maintainership failure modes, suggesting that lessons could be learned from other large-scale FOSS projects. Dan Carpenter suggested pinging the maintainer after a month of no response, falling back on Andrew Morton if all else fails. Dan also pointed out that review is the tough part of maintainership.

Dan Williams clarified that he is not asking for a maintainer-free model, instead wanting to give maintainers more ways to say “no”, and also calling for more tests to be provided with patches. Greg KH argued that the reason testing is weak is that no one has stepped up to maintain the tests, which includes ensuring that the tests continue to work. Greg also said that he is getting funding for someone to work this issue, hopefully in a few months. Matt Fleming argued that unit tests should instead be owned by everyone, so that if a given patch resulted in a test failure, that patch's author would be responsible for fixing things. Matt also suggested that the maintainership model might work well for collecting test results, reporting failures, and running the tests, pointing out Fengguang Wu and Stephen Rothwell as people in this sort of role, but expects that being responsible for fixing all test breakage would cause one's head to explode. Greg KH suggested that “owned by everyone” might not work well in the real world. Greg also noted that he has identified someone interested in maintaining tests, so would like to give it a chance, risks of cranial explosions notwithstanding, to which Matt replied “Go for it!”.
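
For concreteness, the sort of patch-accompanying test under discussion need not be elaborate. The following is a minimal editorial sketch (not code from the thread) of a standalone userspace test in the tools/testing/selftests mold, exiting nonzero on failure so that an automated harness can track regressions; the specific check is purely illustrative:

    /* Illustrative standalone test: pass/fail reported via exit status. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
            /* Purely illustrative check: the raw syscall agrees with libc. */
            if (syscall(SYS_getpid) != getpid()) {
                    fprintf(stderr, "FAIL: getpid mismatch\n");
                    return EXIT_FAILURE;
            }
            printf("PASS\n");
            return EXIT_SUCCESS;
    }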

Neil Brown suggested that moving too fast would result in too many regressions. Dan Williams recommended avoiding these regressions by hiding suspect code behind “gatekeepers”, which are checks in the code that are disabled by default. Matt Fleming agreed that “gatekeepers”, also called “feature bits”, are often used in web development, but doubts that they can be effective in the Linux kernel because the kernel community does not control the execution environment. Neil is concerned that use of gatekeepers would delay application uptake of new features, which would in turn delay accumulating the experience needed to fix those features, thus actually slowing things down rather than speeding them up. Dan believes that the gatekeeper approach would work better for drivers than for core kernel code: drivers for new hardware or for new cross-cutting features could be made provisionally available. Neil Brown agreed (suggesting dynamic debug control, which Dan Williams liked very much), but Rafael J. Wysocki doubted that gatekeepers would work even for drivers, given that there is no centralized control over the deployment of the Linux kernel. Dan disagrees that centralized control is necessary, arguing that overriding a gatekeeper is not all that different from applying the corresponding patch: more convenient, but not different in kind.
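
To make the idea concrete (this sketch is editorial, with hypothetical names, and is not code from the thread): a driver-level gatekeeper can be as simple as a default-off module parameter guarding the suspect path, and making the parameter writable gives a runtime toggle in the spirit of Neil's dynamic-debug suggestion:

    #include <linux/module.h>

    static bool enable_experimental;        /* the gate: closed by default */
    module_param(enable_experimental, bool, 0644);
    MODULE_PARM_DESC(enable_experimental,
                     "Opt in to the experimental transfer path (default: N)");

    /* Stand-ins for the real transfer paths. */
    static int foo_dma_transfer(void *buf, size_t len) { return 0; }
    static int foo_pio_transfer(void *buf, size_t len) { return 0; }

    static int foo_transfer(void *buf, size_t len)
    {
            if (enable_experimental)
                    return foo_dma_transfer(buf, len);  /* new, suspect code */
            return foo_pio_transfer(buf, len);          /* proven fallback */
    }

    MODULE_LICENSE("GPL");

Opening the gate is then a matter of writing to /sys/module/<module>/parameters/enable_experimental, which is Dan's point in concrete form: more convenient than applying a patch, but not different in kind.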

Laurent Pinchart argued that this is what the much-abused CONFIG_EXPERIMENTAL Kconfig option was intended for, and expects that any gatekeeper functionality would be abused in the same way. Laurent also pointed out the potential security dangers of gatekeepers, and noted that they are not risk-free: a gatekeepered feature can, for example, break the non-enabled case, thus increasing rather than decreasing the maintainers' burden. Rafael reiterated the security concerns, and argued that staging is already supposed to act as the gatekeeper for drivers. Dan Carpenter argued that staging slows development, due to the stricter rules that apply to in-tree development. Dan believes that developers would often be better off fixing things up out of tree, but agrees that a benefit of staging is getting a bunch of isolated out-of-tree drivers into one place where they can be cleaned up.

Geert Uytterhoeven noted that some of the rules speed things up, in particular the rebasing requirements, which reduce the number of commits. James Bottomley replied that many maintainer trees, his included, do rebase, for example, when dropping problematic patches. John W. Linville pointed out that rebasing can be problematic when there are downstream trees. John also noted that there is a point of diminishing returns from stalling patches, as some bugs simply aren't going to show up until after those patches are merged. James replied that rebasing can be managed so as to avoid the pain: he encourages people to develop against Linus's tree rather than against his own, and their work can then usually be merged without incident. James switches his tree to immutable mode only after sending pull requests to Linus. Geert noted that some of James's git commands could be shortened.
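
Laurent's “breaking the non-enabled case” hazard is easy to picture with a compile-time gate in the CONFIG_EXPERIMENTAL style. In this editorial sketch (the CONFIG symbol and functions are hypothetical), using IS_ENABLED() rather than #ifdef at least keeps both branches visible to the compiler, which reduces the risk of the disabled path silently rotting:

    #include <linux/kconfig.h>

    /* Hypothetical gated paths; in real code these would live in the driver. */
    static int foo_new_fast_path(void *req) { return 0; }
    static int foo_legacy_path(void *req) { return 0; }

    static int foo_submit(void *req)
    {
            /*
             * IS_ENABLED() evaluates to a compile-time constant, but both
             * branches are still parsed and type-checked, unlike with
             * #ifdef, where the rarely built non-enabled case can rot.
             */
            if (IS_ENABLED(CONFIG_FOO_EXPERIMENTAL_PATH))
                    return foo_new_fast_path(req);
            return foo_legacy_path(req);
    }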

Theodore Ts'o noted that a number of maintainers are already using gatekeeper-like things, calling out xfs, ext4, and btrfs. Ted also pointed out that the advantage of centralized control of deployment is that you can identify everything using a given feature and make coordinated changes. In contrast, the Linux kernel community has no way of determining who is using what feature, which makes it much more difficult to withdraw features or even to change their APIs. All of this means that a Google or a Facebook can afford to take more risk with new features because they have a way of updating them or even backing them out if needed. Furthermore, Google and Facebook need only worry about their particular workloads, which, though large and complex, are a very small subset of the union of all workloads that the Linux kernel must support. In short, the Linux kernel cannot necessarily blindly adopt practices that work well at Google and Facebook. Dan Williams agreed that ABI concerns can loom large, but disagreed that the kernel necessarily needs to have a slower development cycle. Dan also said that he is not advocating that the Linux kernel adopt patches whose only purpose is to support mega-datacenters, and stated that although he is not advocating gatekeepers for the core kernel, he has not yet been convinced that drivers cannot use them. Ted agreed that gatekeepers might be useful when adding new functionality to existing drivers, but only when this new functionality does not require enhancements to the core kernel. However, Ted wonders why anything additional is needed to enable this, pointing out the possibility of using sysfs for this purpose.
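
Ted's sysfs point is that no new mechanism would be required: a driver can already expose a per-device gate as an ordinary attribute. A minimal editorial sketch (all names hypothetical):

    #include <linux/device.h>
    #include <linux/kernel.h>

    static bool foo_feature_enabled;        /* gate closed by default */

    static ssize_t feature_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
    {
            return scnprintf(buf, PAGE_SIZE, "%d\n", foo_feature_enabled);
    }

    static ssize_t feature_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
    {
            unsigned long val;

            if (kstrtoul(buf, 0, &val))
                    return -EINVAL;
            foo_feature_enabled = !!val;
            return count;
    }
    /* Defines dev_attr_feature; register it via device_create_file(). */
    static DEVICE_ATTR_RW(feature);

A user would then opt in with something like “echo 1 > /sys/devices/.../feature”, with no kernel-wide gatekeeper infrastructure needed.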

Dan Williams quoted Dave Chinner as arguing that gatekeepered functionality should instead be maintained out of tree, noting that merging experiments can result in accidentally making long-term promises to users. Chris Mason argued that moving functionality upstream sooner rather than later is key to community ownership, and that merging btrfs helped it develop more quickly.

Greg KH said that he didn't believe that staging sped up development for other parts of the kernel, and noted that many new features would not meet their intended users until after an enterprise distro ships them, some years later. Greg further cautioned against assuming that a feature is OK before it has seen many real users. Dan Williams believes that staging does speed up development of the staging drivers themselves because it attracts more developers to fix up the code. Dan also suggested that having a gatekeeper release valve might help relieve some of the pressure against merging new features. Greg KH agrees that this might work for a filesystem or a driver, but not for a new syscall or a new userspace API, calling out cgroups as an example of how hard it is to get these things right. Josh Boyer believes that the increasing use of community distributions is allowing new features to see real use more quickly, without having to wait for the enterprise distributions, especially when those features support a hot new project such as Docker.