Ric Wheeler: Persistent memory devices and the storage/file system stack

August 2, 2013

Participants: Ric Wheeler, Daniel Phillips, Chris Mason, Dave Jones, Miklos Szeredi, Tony Luck, Ben Hutchings, Christoph Lameter.

People tagged: Matthew Wilcox, Jens Axboe, Jeff Moyer, Ingo Molnar, Zach Brown.

Ric Wheeler proposes a break-out session covering persistent memory and storage. Ric notes that with persistent RAM, you cannot expect a simple reboot to fully clean up state, which might be a bit of a surprise for some parts of the boot path. Ric also calls out devices that support atomic write and shingled disk drives.

Daniel Phillips asks what we are going to do with these devices in real kernel code, arguing that we should not leave this topic solely to the applications. Daniel also fears a proliferation of standards, and would like the kernel community to experiment with this new hardware so that we can drive these standards in a sane direction.

Chris Mason has been working on patches to support atomic I/O (presumably including the atomic writes that Ric called out), including kernel interfaces that will atomically create files pre-filled with data or atomically execute multiple writes to multiple files.
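
Chris's actual interfaces are not described here, but the "create a file pre-filled with data and publish it atomically" half of the idea can already be approximated with O_TMPFILE plus linkat(), which was merged for Linux 3.11 at about the same time as this discussion. A minimal sketch (the /tmp paths are arbitrary examples); the multi-file atomic-write half has no mainline equivalent:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            const char msg[] = "pre-filled contents\n";
            char path[64];
            int fd;

            /* Create an unnamed file in /tmp; it has no directory entry yet. */
            fd = open("/tmp", O_TMPFILE | O_WRONLY, 0600);
            if (fd < 0) {
                    perror("open(O_TMPFILE)");
                    return 1;
            }

            /* Fill it while it is still invisible to other processes. */
            if (write(fd, msg, sizeof(msg) - 1) != (ssize_t)(sizeof(msg) - 1)) {
                    perror("write");
                    return 1;
            }

            /*
             * Publish the fully written file under a name in one step.
             * (Fails if the target name already exists.)
             */
            snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
            if (linkat(AT_FDCWD, path, AT_FDCWD, "/tmp/published-file",
                       AT_SYMLINK_FOLLOW) < 0) {
                    perror("linkat");
                    return 1;
            }

            close(fd);
            return 0;
    }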

Dave Jones seemed excited by this prospect, if only from the viewpoint of more bugs for Trinity to find, but also asked after Ingo Molnar's and Zach Brown's syslet work. Miklos Szeredi suggested using the at*() variants of the syscalls, generalizing the “dirfd” argument to name the transaction.
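
No such transaction file descriptor exists today; the fragment below is purely a sketch of the shape Miklos's suggestion might take, with transaction_begin() and transaction_commit() invented for illustration and the path-resolution semantics deliberately left open:

    /*
     * Purely hypothetical sketch: transaction_begin() and
     * transaction_commit() do not exist, and no kernel interprets a
     * "transaction fd" in place of a directory fd.  This only shows the
     * shape of Miklos's suggestion.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int transaction_begin(void);        /* assumed: returns a transaction fd */
    int transaction_commit(int txfd);   /* assumed: applies all staged ops   */

    static int update_config(void)
    {
            int txfd = transaction_begin();

            /*
             * Existing at*() calls, but with the transaction fd where a
             * directory fd would normally go.  How path resolution would
             * interact with the transaction is one of the open questions.
             */
            int fd = openat(txfd, "/data/config.new", O_WRONLY | O_CREAT, 0644);

            /* ... write the new configuration to fd ... */
            close(fd);
            renameat(txfd, "/data/config.new", txfd, "/data/config");

            /* Either every staged operation becomes visible, or none does. */
            return transaction_commit(txfd);
    }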

Matthew Wilcox argued that the advent of low-latency storage devices means that the storage stack must pay more attention to NUMA issues. Tony Luck expressed interest, but noted that there were some tradeoffs. Do you move to a CPU close to the device, giving up memory and cache locality (and taking a migration-induced hit to performance and latency) in favor of storage locality? How would the application decide whether or not the improved scheduler latency was worth the cost? Tony also noted that this decision depends on platform- and generation-specific information, and is suspicious of rules of thumb such as “use this if you are going to access >32MB of data.” [ Ed.: Suppose you need several storage devices, each of which happens to be associated with a different NUMA node? ] Matthew agreed that migrating from one socket to another mid-stream was usually unwise, but noted that some applications can determine where the storage is before building up local state, giving database queries and git compression/checksumming as examples. Longer term, Matthew would like to get the scheduler involved in this decision process. Mel Gorman expressed support for this longer-term plan, but would like to see it happen sooner rather than later. Mel also wants to see hints passed back up to the application, and raised the issue of NUMA-awareness of accesses to mmap()ed pages.

Christoph Lameter called out the potential performance benefits of binding acquisitions of a given lock to a given CPU, of binding threads to storage devices, and of storage controllers that write directly into CPU cache (thus saving the presumed destination CPU the overhead of a cache miss, as in Intel DDIO and PCIe TLP Processing Hints). Christoph also favors RDMA capabilities for storage, getting the kernel out of the storage data path in a manner similar to InfiniBand.
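
Some of this locality awareness can be prototyped from userspace today. The sketch below reads the numa_node attribute that sysfs exposes for many PCI-attached devices and binds the calling thread to that node with libnuma; the exact sysfs path is an assumption that varies by transport, and nothing here reflects a specific proposal from the discussion. Build with -lnuma:

    #include <numa.h>
    #include <stdio.h>

    static int device_numa_node(const char *blockdev)
    {
            char path[256];
            FILE *f;
            int node = -1;

            /* Assumed layout; the depth of numa_node varies by transport. */
            snprintf(path, sizeof(path),
                     "/sys/block/%s/device/numa_node", blockdev);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (fscanf(f, "%d", &node) != 1)
                    node = -1;
            fclose(f);
            return node;    /* -1 means "unknown / no affinity" */
    }

    int main(int argc, char **argv)
    {
            const char *dev = argc > 1 ? argv[1] : "nvme0n1";
            int node = device_numa_node(dev);

            if (numa_available() < 0 || node < 0) {
                    fprintf(stderr, "no NUMA information for %s\n", dev);
                    return 1;
            }

            /* Run this thread on, and prefer memory from, the device's node. */
            if (numa_run_on_node(node) != 0) {
                    perror("numa_run_on_node");
                    return 1;
            }
            numa_set_preferred(node);

            printf("%s is on node %d; thread bound accordingly\n", dev, node);
            return 0;
    }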

Some discussion of the merits and limitations of these ideas ensued. Chris Mason liked the idea of binding processes close to specific devices, including NUMA-aware swap-device selection. Matthew Wilcox called out multipath I/O and RAID configurations as possible confounding factors. [ Ed.: Back in the day, when SCSI disks over FibreChannel were considered “low latency,” the dinosaurs then roaming the earth simply ensured that all NUMA nodes had a FibreChannel path to each device, which eliminated disk-device affinity from consideration. Alas, this is not practical for today's high-speed solid-state storage devices, and FibreChannel can no longer be considered particularly low latency by comparison. ]

Chris Mason suggested an API that takes a file descriptor and returns one mask of CPUs for reads and another for writes, but noted that this would not necessarily be a small change. Christoph Lameter asked that processes be guaranteed to wake up on the socket to which the relevant device is attached, similar to what can be done with the networking stack. Matthew expressed some dissatisfaction with Christoph's suggestions regarding RDMA and wakeup locality, which indicates that a face-to-face discussion between the two of them would be entertaining, whatever the potential for enlightenment might be. Ben Hutchings noted that networking chooses where to place incoming packets based on where the corresponding user thread has been running.
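
No interface along the lines of Chris's suggestion exists; the fragment below merely sketches one shape it might take, with fd_affinity() invented for illustration (only sched_setaffinity() is a real call):

    /*
     * Hypothetical sketch: fd_affinity() does not exist.  It illustrates
     * one possible shape for an API that, given a file descriptor,
     * reports which CPUs are close to the device(s) servicing reads and
     * writes (the two sets can differ for RAID or multipath setups).
     */
    #define _GNU_SOURCE
    #include <sched.h>

    int fd_affinity(int fd, cpu_set_t *read_cpus, cpu_set_t *write_cpus); /* assumed */

    static void bind_for_reads(int fd)
    {
            cpu_set_t rd, wr;

            if (fd_affinity(fd, &rd, &wr) == 0)
                    /* sched_setaffinity() is real: restrict this thread to the read-side CPUs. */
                    sched_setaffinity(0, sizeof(rd), &rd);
    }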

Chris Mason listed the following three desired pieces for a solution to the problem of NUMA-aware storage:

  1. Allow the application to ask the kernel which CPUs are close to a given resource, where “resources” include memory, networking, and storage. (A userspace approximation using existing sysfs attributes is sketched after this list.)
  2. Add NUMA-awareness to the kernel writeback infrastructure, so that writeback daemons and filesystem helper threads are local to the corresponding storage.
  3. “The scheduler already has code to pull procs toward the CPUs waking them. Tuning this for NUMA, multiqueue and high IOPS isn't obvious at all.”
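
The first item can be approximated today for PCI-attached resources, because the PCI core already exports local_cpulist (and local_cpus) for each device. A minimal sketch, with the example device paths being assumptions to be replaced by whatever actually backs the resource in question:

    #include <stdio.h>

    static void print_close_cpus(const char *label, const char *sysfs_dev)
    {
            char path[256], cpus[256];
            FILE *f;

            snprintf(path, sizeof(path), "%s/local_cpulist", sysfs_dev);
            f = fopen(path, "r");
            if (!f || !fgets(cpus, sizeof(cpus), f))
                    printf("%s: no locality information\n", label);
            else
                    printf("%s: close CPUs are %s", label, cpus);
            if (f)
                    fclose(f);
    }

    int main(void)
    {
            /* Example devices; substitute the PCI devices backing your resources. */
            print_close_cpus("network", "/sys/class/net/eth0/device");
            print_close_cpus("storage", "/sys/bus/pci/devices/0000:01:00.0");
            return 0;
    }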