Mark Brown: Testing

July 20, 2015

Related Prior ksummit-discuss Threads:

  1. 2014: David Woodhouse: Device error handling / reporting / isolation
  2. 2014: Dave Jones: coverity, static checking etc.
  3. 2014: Catalin Marinas: Run-time kernel checking
  4. 2014: Chris Mason: Application performance: regressions, controlling preemption
  5. 2014: James Bottomley: Encouraging more reviewers
  6. 2013: Jiri Slaby: static checking; COMPILE_TEST

Additional Participants: Alexey Dobriyan, Andy Lutomirski, Dan Carpenter, Fengguang Wu, Geert Uytterhoeven, Guenter Roeck, Jiri Kosina, Josh Boyer, Julia Lawall, Kees Cook, Kevin Hilman, Masami Hiramatsu, Mel Gorman, Michael Ellerman, Peter Hüwe, Shuah Khan, and Steven Rostedt.

People tagged: Geert Uytterhoeven, Grant Likely, Kevin Hilman, Stephen Rothwell, and Wolfram Sang.

Given the high-volume discussions in prior years, one might assume that there was nothing new to add. However, the topic of testing does appear to have progressed. This is a good thing, especially if it has managed to progress enough to keep up with the bugs.

Mark Brown noted that the topic of testing is usually covered, but suggested that additional discussion would be helpful both in making people aware of what is available and in working out what additional testing would be useful. Mark called out Shuah Khan's kselftest development and Fengguang Wu's 0day test robot as examples of what is available, and suggested that further progress could be made by pooling resources and by upstreaming Kconfig fragments that are designed to support testing. Mark also suggested that defconfigs be slimmed down in favor of Kconfig fragments. Finally, Mark raised the perennial topic of what additional testing would be useful, and suggested that this discussion take the form of a workshop session combined with a core-day readout. Shuah Khan volunteered to organize these sessions, though primarily focused on kselftest.

In response to Mark's Kconfig-fragment suggestion, Alexey Dobriyan argued that this would result in everyone testing with the same .configs, which could actually decrease test coverage. Mark replied that although that might happen, a big benefit of Kconfig fragments would be lowering the barriers to new people joining the testing efforts.
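
The fragment approach under discussion can be sketched as follows. The "later fragment wins" merge semantics mirror the kernel's scripts/kconfig/merge_config.sh; the file names and options here are made up for illustration, and the tiny awk merge stands in for the real script:

```shell
# Sketch of Kconfig-fragment merging ("later fragment wins"), the same
# semantics as the kernel's scripts/kconfig/merge_config.sh. The file
# names base.config and selftest.config are illustrative only.
set -eu
workdir=$(mktemp -d)
cat > "$workdir/base.config" <<'EOF'
CONFIG_FTRACE=n
CONFIG_DEBUG_FS=y
EOF
cat > "$workdir/selftest.config" <<'EOF'
CONFIG_FTRACE=y
EOF
# Later files override earlier ones, keyed on the CONFIG_ symbol.
# (Real usage: scripts/kconfig/merge_config.sh -m base.config selftest.config)
merge_fragments() {
    awk -F= '/^CONFIG_/ { opt[$1] = $0 } END { for (o in opt) print opt[o] }' "$@" | sort
}
merged=$(merge_fragments "$workdir/base.config" "$workdir/selftest.config")
echo "$merged"   # → CONFIG_DEBUG_FS=y, then CONFIG_FTRACE=y
```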

Steven Rostedt took this as his cue to argue that tests should have three results instead of two, with UNSUPPORTED joining the traditional PASS and FAIL. A test that (for example) attempts to use ftrace in a kernel that does not have ftrace enabled should result in UNSUPPORTED. As an alternative to Kconfig fragments, Steven suggested a central repository of working .config files, complete with documentation on what hardware or filesystem is required to support a given test. Mark agreed, noting that current tests are already supposed to call out the unsupported case, and that easing the job of running tests was central to the discussion. Kees Cook pointed out that a given test might also need a particular sysctl setting or privilege level, and that it would be good to record that information in machine-readable form to enable automated configuration checking. Andy Lutomirski noted that the topic of machine-readable output was discussed at the 2014 Linux Kernel Summit, but was not aware of it having been implemented. Mark replied that the 2014 discussion was focused on very basic PASS/FAIL criteria, and that it might be time to look at extending the automation.
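
Steven's three-result scheme might look something like the following sketch. The tracefs path is illustrative; the "skip" exit code of 4 matches what kselftest later standardized as KSFT_SKIP, distinct from PASS (0) and FAIL (1):

```shell
# Sketch of a three-result test: report UNSUPPORTED rather than FAIL
# when the kernel lacks the needed feature. Exit code 4 follows the
# kselftest skip convention; the directory path is just an example.
run_ftrace_test() {
    tracedir=$1          # e.g. /sys/kernel/debug/tracing
    if [ ! -d "$tracedir" ]; then
        echo "UNSUPPORTED: ftrace not available in this kernel"
        return 4         # skip, distinct from PASS (0) and FAIL (1)
    fi
    echo "PASS"
    return 0
}

rc=0
run_ftrace_test /no/such/tracing >/dev/null || rc=$?
echo "exit code: $rc"    # prints "exit code: 4"
```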

Masami Hiramatsu would like selftesting to be extended to cover the programs in the tools/ directory, instead of focusing only on the kernel itself. Masami also pointed out that people can get working .config files on many systems via /proc/config.gz. Steven Rostedt still likes the idea of having per-test .config files, possibly with an “all tests” .config file that covers all tests. Guenter Roeck would like more shared configurations, especially working configurations for qemu. Guenter stated that he does builds for various configurations for a number of architectures, but has no idea whether his chosen configurations are relevant. Mark Brown listed Olof's build-and-boot test, Dan's runs of smatch and other static-analysis tools, Coverity, and his own builder, in addition to the kselftest, 0day test robot, and kernelci.org efforts (the last of which Michael Ellerman and Guenter Roeck called out as the closest thing to a proper kernel continuous-integration setup) that he had called out in his initial posting. Other tests called out include:

  1. Guenter Roeck's build test, to which he would like to add automated bisect and email notification.
  2. Peter Hüwe's TPM tests. Peter also called out Wolfram Sang's ninja-check scripts for style checking.
  3. Mel Gorman called out his “Marvin”, which does monthly testing, but using longer and more complex tests than does the 0day test robot. Marvin does manually triggered automatic bisection.
  4. Geert Uytterhoeven (and Guenter Roeck) called out Geert's Build Regression system, which runs on each -rc release. Geert plans additional automation. Michael Ellerman called out Geert's buildbot as “epic”. Geert confirmed a very large number of builders, namely 900. This setup also generates about 10GB of log data per month, which sometimes requires manual cleanup.
  5. Geert Uytterhoeven also called out Michael Ellerman's kisskb work, which at one time included automated bisection; however, this bisection was dropped due to excessive resource consumption. [ Ed. note: This is not too surprising considering the very large number of architectures and configurations that kisskb builds. ] Michael Ellerman said that kisskb does not automatically notify (he was afraid of it becoming a spam-bot), but that he now believes that automatic notification is the way to go to get people to act on the failures.
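
Masami's /proc/config.gz suggestion is straightforward in practice, provided the kernel was built with CONFIG_IKCONFIG_PROC. In this self-contained sketch a fabricated config.gz stands in for the real /proc/config.gz:

```shell
# Sketch of recovering a working .config from a running kernel, per
# Masami's suggestion. Requires CONFIG_IKCONFIG_PROC=y; here a fake
# config.gz stands in for /proc/config.gz so the sketch is runnable.
set -eu
workdir=$(mktemp -d)
printf 'CONFIG_IKCONFIG_PROC=y\nCONFIG_FTRACE=y\n# CONFIG_KASAN is not set\n' \
    | gzip > "$workdir/config.gz"
# On a real system: zcat /proc/config.gz > .config
zcat "$workdir/config.gz" > "$workdir/.config"
# A test harness can then check for the options it depends on:
has_option() { grep -qx "$1=y" "$workdir/.config"; }
if has_option CONFIG_FTRACE; then
    echo "CONFIG_FTRACE enabled in this kernel"
fi
```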

Dan Carpenter wondered if the 0day test robot had obsoleted Build Regression, and Geert replied that he believed that kisskb builds more and different configurations, but confirmed that people mostly ignore the resulting emails. Jiri Kosina speculated that the reason for their being ignored is that the emails are one huge report, making it difficult to tell whether your code is responsible for a given failure, and that they are sent directly to LKML without CCing the people of interest. Josh Boyer agreed that although LKML is a great archival mechanism, it is now useless as a general form of communication. Josh believes that this is the case for bug reports in general, not just Build Regression emails. Kevin Hilman asked how quickly build results are available after a branch is pushed and whether build artifacts are available. Kevin is interested in reducing overall computational load by having kernelci.org consume build artifacts produced by others. It turns out that kernelci.org has an interface for consuming a JSON file requesting that a given build be tested. Guenter Roeck replied that the build-result latency depends on the load on the system and the branch in question, but that the typical time is anywhere between two hours and about three days. Guenter does not currently post build products, but could in theory do so. In practice, this would require about 100GB of storage somewhere out on the web, as well as a lot of bandwidth. Guenter used to provide JSON, but had to curtail this due to the high CPU load from the resulting JSON requests. Kevin feels Guenter's storage pain, given that kernelci.org uses about 400GB for 45 days' worth of builds, boots, and tests. Michael Ellerman noted that some corporations' lawyers were less than happy with the thought of distributing binaries on an ongoing basis. Guenter had not considered the legal aspect, but believes that gaining approval should be possible given reasonable lawyers.

Mark Brown noted that the 0day test robot did not suffer from Build Regression's issues because the 0day test robot sends the errors directly to the people mentioned in the offending commit. That said, the 0day test robot's notifications are one-offs: for many classes of problems, if the problem persists, no additional notifications are sent. Mark noted that the regular “all the issues” email from Build Regression helps keep these issues on the front burner. Fengguang Wu stated that the 0day test robot tests randconfig and all(yes|no|def|mod)config, and that it also builds anything that it finds in arch/*/configs. However, to maintain the one-hour response-time goal, only about 10% of them are run immediately, with the remaining 90% tested as time permits. The 0day test robot now tests 543 git trees, and Fengguang is always interested in adding more. Fengguang agreed that the 0day test robot can sometimes fail to catch errors, for example, due to:

  1. Regressions in 0day itself, which is in active development, where “active development” means more than 1,000 patches over the past year, plus an additional 2,000 patches to the related Linux Kernel Performance (LKP) tests.
  2. 0day's report-once logic is tricky and can sometimes miss issues.
  3. Bisection can fail. This is rare, but it does sometimes happen.
  4. Machine hangs, network and disk failures, and other hazards of running large build farms.

Fengguang added that heavy load can increase latency, but should not result in anything being lost. He also noted that you can request the full build report by dropping him an email. [ Ed. note: I subscribe to the full build report for RCU, and it can be quite helpful. ] Mark Brown suspected that he had been seeing this latency, as sometimes a problem was fixed in -next before 0day reported it. Mark would also prefer to fetch build results on demand rather than having them emailed to him. Fengguang Wu noted that 0day tests -next at a lower priority, and said that he would increase that priority given the interest in it. He also asked that people send him reports when errors are missed or subjected to excessive delays so that he can fix any problems. Fengguang explained that the report-once logic only suppresses re-notification for ten days, so that build errors will eventually be re-reported if not fixed. Finally, Fengguang suggested that Mark use procmail to direct the full build reports to a local mbox and then check it on demand. Mark objected that this procmail approach wouldn't make the errors go away when fixed, to which Fengguang suggested just checking the most recent full build report, which would contain all the latest information.
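
Fengguang's procmail suggestion would amount to a recipe along the following lines; the Subject pattern and mbox name are guesses for illustration, not 0day's actual report format:

```
# Hypothetical ~/.procmailrc recipe: file periodic full build reports
# into a local mbox to be read on demand. The Subject pattern is a
# placeholder, not the actual 0day subject line.
:0:
* ^Subject:.*build report
0day-reports
```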

Mark Brown also noted that Grant Likely and Kevin Hilman have been working on testing qemu, including working configurations. Julia Lawall added that the 0day test robot runs Coccinelle, and that Coccinelle can run correctly and find problems even if the build fails. Shuah Khan asked whether there is interest in including kselftest in qemu boot tests (to which Guenter Roeck replied in the affirmative), and noted that a “quicktest” option was added in 4.2 in order to meet qemu time constraints.

Alexey Dobriyan argued that the confusion about testing tools happens because the testsuite directories are hidden under tools/ (in tools/testing/selftests/); he would like the directory containing self-tests to instead appear as a top-level directory, with make test running its tests rather than the current make kselftest. [ Ed. note: There appears to be some disagreement over what the various .config files would contain and what variety of hardware the tests are expected to run on automatically. ] Shuah Khan questioned the usefulness of relocating selftests/, and suggested that ktest could handle the required .config files. Alexey Dobriyan said that his goal was visibility rather than usability, calling out git's t/ directory as an example of good visibility. Steven Rostedt is willing to make ktest run the kselftests, but believes that some other way of handling .config files is needed, especially in the case where ktest is not being run from within a git repository. Shuah thanked Steven for the clarification, and pointed out that a few of the kselftests depend on specific kernel configurations, and that kselftest simply exits gracefully if the proper configuration is not in place. So kselftest tests as much as it can given the kernel configuration, and relies on the user to build the kernel appropriately.

Guenter Roeck suggested that listing all of the testing efforts would be helpful. Guenter also argued that simply providing test results on the web is insufficient, because people don't go looking for them. Guenter believes that even email notification is insufficient, arguing that the “Build regression” emails sent by Geert are widely ignored. Guenter instead argues that new problems should be automatically bisected, with the patch author and maintainer being notified. Mark Brown agreed with the need for bisection, calling it out as a necessary aspect of interpreting test results (and later noting that his own builder's failure reports are “lovingly hand crafted using traditional artisan techniques”). Mark also pointed out that testers need to actively push results and to test against ongoing development. Prompt analysis and notification is something that Mark believes the 0day test robot gets right. [ Ed. note: Agreed! ] Finally, Mark pointed out that the more we come to rely on automated testing, the greater the impact of bugs that break the build or prevent successful boot. Guenter Roeck agreed, calling out 4.1 as being particularly bad in the build-and-boot department, and wondered if automatic reversion of broken patches will be needed to minimize the impact. Mark Brown suggested that this impact needs to be discussed, and that automated reversion is one solution, but that Stephen Rothwell's carrying of fixup patches in -next is another possible approach.