Josh Triplett: Kernel tinification: shrinking the kernel and avoiding size regressions

May 2, 2014

Participants: Andy Lutomirski, Christoph Lameter, Dave Jones, David Woodhouse, Guenter Roeck, H. Peter Anvin, James Bottomley, Jan Kara, Jiri Kosina, Josh Boyer, Josh Triplett, Julia Lawall, Mark Brown, Matthew Wilcox, Michael Kerrisk, Steven Rostedt, Ted Ts'o, Tim Bird, and Tony Luck.

People tagged: Paul E. McKenney, Darren Hart, Greg KH, Andrew Morton, Sarah Sharp, Julia Lawall, Dan Carpenter, Tom Zanussi, and Michael Kerrisk.

Josh Triplett suggested a renewed focus on size requirements, both for main memory and mass storage. Josh listed a number of tools that can help spot size regressions and shrink size, and also suggested discussions on why size matters as well as how to avoid size regressions.

Dave Jones added that reducing size can also reduce attack surface, thus improving security. Dave would like to see Kconfig options that allow the more obscure (and thus perhaps more buggy) system calls to be removed, in particular sys_remap_file_pages(). Dave would like to see discussion of which syscalls could be configured out without too much userspace damage and what the optimal degree of configurability would be. He also notes that a number of syscalls are already configurable in this way. Josh Triplett would like to see most syscalls be optional, which would allow specialized devices to reduce both memory footprint and attack surface. However, Josh notes that seccomp also decreases attack surface, and does so without the need to build a separate kernel, but that seccomp does not free us from the obligation of securing kernel APIs from hostile userspace. Josh included a list of syscalls that do not appear in kernel/sys_ni.c, and thus always exist, and also included a list of related classes of system calls (for example, legacy syscalls could be excluded by devices running only non-legacy userspace code). Tim Bird described a mechanism leveraging SYSCALL_DEFINE that allowed individual syscalls to be excluded. Josh countered with a suggestion to make syscall functions garbage-collectible [presumably via LTO or something similar], and to keep only those that are referenced from the actual syscall table. Christoph Lameter noted that kernel size matters for performance, with smaller size leaving more of the processor caches for the application. Christoph therefore calls for the ability to remove unwanted functionality (e.g., cgroups), and for userspace tools (e.g., systemd) to tolerate a kernel with reduced functionality. James Bottomley contrasted memory footprint with cache footprint, arguing that in some cases, unused kernel code does not take any of the processor cache away from the user application. 
That said, James agreed that cgroups does add to the execution path of a number of system calls, but asked what the measured performance impact actually is. James also suggested using static branching to out-of-line areas to reduce that impact, if needed. Christoph responded that instruction layout matters, so that just focusing on instruction count will miss optimization opportunities. In particular, although static branching can reduce the number of instructions speculated and executed, it still puts pressure on TLBs. Christoph suggested sorting functions so as to put the most frequently used set in one place, where they could be covered by a single huge page, preferably using automated tools for this purpose. Christoph also noted that there are older kernels in production use in the financial industry because these older kernels have better performance and latency. A smaller memory footprint is required to get these sites to move to newer kernels. Steven Rostedt believes that most of the core kernel code (excluding modules) is already covered by huge pages. Steven also noted that his experiments moving tracepoint code out of line did not produce measurable benefits, and asked whether the use of older kernels was due more to their having fewer features than to raw size. Christoph agreed that the kernel is covered by huge pages, but noted that there are a limited number of huge-page TLB entries, and a bloated kernel would consume them at the expense of application code, which also wants to use huge pages. Christoph agreed that features were also important, and noted that folding of small functions into larger ones (and vice versa) can help, but that it can be difficult to determine which way to go. Josh argued that factoring out helper functions, when done properly, should improve the cache hit rate of the code making up the helper functions. James agreed that link-time optimizations can group functions, but reiterated Steven's call for actual measurements of the benefit. 
James also pointed out that the compiler often inlines functions, undoing the careful by-hand refactoring. Christoph objected that providing proof requires doing all the work up front. Steven replied “Hello Chicken, Meet Egg!” Julia Lawall asked what sorts of functions are to be refactored, pointing out that similar drivers often have similar code, but that only the code for drivers used by a particular OS instance will be executed. Julia then asked if all of the similar functions need to actually be executed for there to be any benefit. Steven said that he was thinking more in terms of core kernel code than of drivers. Matthew Wilcox believes that any benefit will be workload dependent, with scientific workloads typically being more sensitive to cache issues than memory-intensive commercial workloads.
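
Josh Triplett's suggestion of keeping only the syscall functions referenced from the actual syscall table amounts to a reachability pass over the call graph. The following is a minimal sketch of that idea in Python, not a description of how LTO or the linker actually implements it. The call graph, the helper name remap_helper, and the two-entry syscall table are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it references.
# The edges are illustrative only and do not reflect real kernel code.
CALL_GRAPH = {
    "sys_read": ["vfs_read"],
    "sys_write": ["vfs_write"],
    "sys_remap_file_pages": ["remap_helper"],
    "vfs_read": ["rw_verify_area"],
    "vfs_write": ["rw_verify_area"],
    "rw_verify_area": [],
    "remap_helper": [],
}

# Hypothetical syscall table for a specialized device that only reads and writes.
SYSCALL_TABLE = ["sys_read", "sys_write"]


def reachable(roots, graph):
    """Return every function reachable from the given roots (breadth-first)."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        fn = queue.popleft()
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


kept = reachable(SYSCALL_TABLE, CALL_GRAPH)
# Anything not reachable from the table is a candidate for garbage collection.
discarded = sorted(set(CALL_GRAPH) - kept)

print("kept:", sorted(kept))
print("discarded:", discarded)
```

In this toy graph, dropping sys_remap_file_pages from the table also makes its private helper unreachable, while the helper shared by the remaining syscalls stays.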

Mark Brown would like a way of auditing which system calls are actually in use on a given system as a tinification aid. Tony Luck suggested strace -c, but Mark pointed out that he needs a system-wide view of system calls, whereas strace -c only tracks a single process. Dave Jones suggested tracepoints or kprobes, Andy Lutomirski suggested programming seccomp to send SIGSYS and then watching the kernel logs, and David Woodhouse suggested setting up per-syscall audit rules for each system call believed to be unused. Mark Brown raised concerns about tracepoint buffer overflow, but agreed that it could work in a suitably constrained setup. He also agreed that kprobes could work, at least given a reasonably canned setup. Mark is also concerned that the userspace tools required for per-syscall audit might be too heavyweight for many target systems, but nevertheless believes that this approach would work in many cases. Tony Luck pointed out that the worst-case syscall usage is needed, and that any monitoring tool will only list typical syscall usage. For example, trivial testing might show that bash does not use the pipe() system call, resulting in fatal disappointment the first time some user typed dmesg | grep ixgbe. H. Peter Anvin suggested using seccomp to sandbox processes, preventing them from using functionality not required for a given super-low-end embedded system. Jan Kara believes that security modules and audit subsystems are to be used for this purpose, but then asked whether he was dreaming too much.
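
Tony Luck's caveat about typical versus worst-case usage can be illustrated with a small sketch. It assumes the observed syscall names have already been collected by some means (tracepoints, kprobes, or audit rules, as suggested above). The function name unobserved_syscalls and the sample data are hypothetical and not taken from any real tool.

```python
def unobserved_syscalls(candidates, observed_runs):
    """Return the candidate syscalls that never appear in any observed run."""
    seen = set()
    for run in observed_runs:
        seen.update(run)
    return sorted(set(candidates) - seen)


# Hypothetical data: syscalls under consideration for removal, and the
# syscall names recorded during two monitoring sessions.
candidates = ["pipe", "remap_file_pages", "read", "write"]
trivial_run = ["read", "write"]           # quick smoke test, no pipelines
pipeline_run = ["read", "write", "pipe"]  # a user runs a shell pipeline

# With only the trivial run, pipe() looks safe to remove.
print(unobserved_syscalls(candidates, [trivial_run]))

# Once a pipeline has been observed, pipe() is no longer on the list.
print(unobserved_syscalls(candidates, [trivial_run, pipeline_run]))
```

The set of "unused" syscalls can only shrink as more workloads are observed, so a list produced from limited monitoring is an upper bound on what can safely be removed.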

James Bottomley liked the idea of reducing attack surface, but is concerned about having a huge number of per-syscall Kconfig options and about userspace binary incompatibilities induced by kernels with different sets of supported syscalls. He prefers an approach where there is a Kconfig option for each use case, such as secure routers and reduced-attack-surface distributions. This was seconded by Guenter Roeck and by Steven Rostedt, who recalls Linus asking for config profiles. Dave Jones agreed that it will be tricky to draw a precise line between core and optional syscalls, and that having workload-specific “profiles” could be helpful. However, Dave was skeptical that a reduced-attack-surface option would help, given the tendency of people to want the reduced attack surface, but also to want one or more of the normally excluded system calls. Dave also pointed out that the large distros are guaranteed to have a critical mass of users for each and every system call, which led him to suggest a runtime option to disable unneeded system calls. David Woodhouse questioned the utility of use-case-based configurations, asking if anyone had seen the list of things that OpenWRT packages. James uses OpenWRT, and likes its kitchen-sink approach. However, James doubts that his use case is typical, and thus expects that a secure-router profile would include OpenWRT.

Josh Boyer noted that new system calls could be disabled by default, which would prevent users from growing attached to them, and would also allow the distros to gauge demand for a given new system call. H. Peter Anvin argued that disabling a system call by default was equivalent to not providing it at all, seconded by Michael Kerrisk. Josh countered that the system call could be enabled as soon as some package requiring it was added to that distro, but noted that this does not help the “one binary doesn't work on multiple distros” problem. In fact, Josh believes that any large general-purpose distro would simply enable the widest range of system calls.

Ted Ts'o pointed out that system calls are a small fraction of the total attack surface, and that new system calls are added fairly infrequently in any case (though Michael Kerrisk noted that sched_getattr() and sched_setattr() were added just this past March), and sometimes (as in renameat()) require very little code. Ted is more concerned with the attack surface provided by things like pluggable security LSMs, control groups, namespaces, and systemd. Dave Jones agreed that the rate of addition of system calls has slowed down, but noted that the rate at which bugs were exposed via system calls has accelerated. Dave is not all that concerned about system calls like renameat(), which simply enhance other system calls, instead calling out system calls that enable significant quantities of code, especially those system calls that are used only by a few very well-written applications, which tend to avoid exposing buggy corner cases by design. Steven Rostedt argued that the acceleration in bug-finding is due more to advances in testing (specifically, Dave Jones's trinity) than to added system calls. Michael Kerrisk agreed with Dave, arguing that the APIs delivered to userspace “continue to be infested with bugs and design infelicities, many of which go undetected for a long time.” Michael gave the addition of the recvmmsg() function's timeout argument as an example of a poorly done feature addition (for more information, see the bugzilla or discussion thread).

Ted called out the attack surface exposed via non-syscall mechanisms such as pseudo filesystems, new ioctls, fallocate code points, and so on, but with special concern for code that can be exercised by non-root users. Ted notes that root-only code tends to be used by a few well-behaved programs, which makes it easier to change root-only code. In contrast, code used by non-root programs might be used by any code anywhere, making it almost impossible to change the user-visible API, which in turn suggests maximal paranoia in design, coding, review, and testing. Dave Jones pointed out that secure boot means that some root-only code might be used by large amounts of code of dubious provenance. Michael Kerrisk is tracking kernel API changes here.

Ben Hutchings suggested restricting large-code-size features to root, for example, using perf_event_paranoid=3 to restrict sys_perf_event_open() to programs running as root. Ben also pointed out that Michael Kerrisk's documentation efforts seemed to find odd corner cases, which re-raises the old question of whether code should not be accepted into the kernel until the documentation is done. Dave Jones suggested that lack of test cases also block acceptance of new features. Michael Kerrisk agreed, arguing that test cases and documentation should go hand in glove, further suggesting this as a separate LKS topic.

Mention of secure boot prompted Josh Boyer to ask if last year's “What to do about the secure_modules/trusted_kernel/whatever patch set that distros are carrying to support Secure Boot?” topic should be reprised, given that progress along these lines appeared to have been derailed again. The resulting subthread summary may be found here.

Jiri Kosina notes that systemd, which is a mandatory component of a number of distros, depends on optional kernel features. In short, pruning the kernel might require pruning userspace utilities and daemons. Ted Ts'o suggested that anyone interested in working this problem feel free to start a separate ksummit-discuss thread, but preferably only after coming up with at least one proposed solution.

Tim Bird called out some of his work on deeply embedded systems (here and here). Tim said that these techniques eliminated 161 syscalls from a default-configured kernel (saving 95KB) and 120 syscalls from a minimal-configured kernel (saving 48KB). H. Peter Anvin called out the irony of one of Tim's techniques being to prevent LTO from preventing unreferenced code from being optimized out.
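
As a rough back-of-the-envelope check on Tim's figures, the average saving per removed syscall can be computed directly from the numbers above. This sketch assumes only the reported totals. The real per-syscall savings will vary widely, and the dictionary keys are merely labels for the two configurations.

```python
# (syscalls removed, KB saved), as reported for each kernel configuration.
savings = {
    "default": (161, 95),
    "minimal": (120, 48),
}

# Average KB saved per removed syscall in each configuration.
per_syscall_kb = {name: kb / count for name, (count, kb) in savings.items()}

for name, value in per_syscall_kb.items():
    print(f"{name}: about {value:.2f} KB per syscall")
```

On these numbers the average works out to somewhat more than half a kilobyte per syscall for the default configuration and 0.4 KB for the minimal one.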