Joerg Roedel: Core Kernel support for Compute-Offload Devices

August 11, 2015

Additional Participants: Andy Lutomirski, Arnd Bergmann, Benjamin Herrenschmidt, Catalin Marinas, Christoph Lameter, David Woodhouse, Jerome Glisse, Joerg Roedel, and Rik van Riel.

People tagged: Andrea Arcangeli, Arnd Bergmann, Benjamin Herrenschmidt, Christoph Lameter, David Woodhouse, Jérôme Glisse, Jesse Barnes, Mel Gorman, Paul E. McKenney, Rik van Riel, and Will Deacon.

Please note that this is a condensed summary.

Joerg Roedel called out the need for core-kernel support for compute-offload devices, including the following questions:

1. Do we need the concept of an off-CPU task in the kernel, together with a common interface to create and manage such tasks, and probably a (collection of) batch scheduler(s) for them?
2. Changes in memory management for devices accessing user address spaces:
   a. How can we best support the different memory models these devices support?
   b. How do we handle the off-CPU users of an mm_struct?
   c. How can we attach common state for off-CPU tasks to mm_struct (and what needs to be in there)?
3. Does it make sense to implement automatic migration of system memory to device memory (when available) and vice versa? How do we decide what and when to migrate?
4. What features do we require in the hardware to support it with a common interface?

Benjamin Herrenschmidt noted a connection to the FPGA topic.

1. Off-CPU Task?

Rik van Riel pointed out that in cases where compute-offload devices share address space with CPUs, it would be easiest if the mm_struct had references for threads running on the compute-offload devices.
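
As a rough illustration of Rik's point, the following userspace sketch models an address space that counts device-side users alongside CPU users, and is torn down only once both counts reach zero. Everything here is hypothetical: `struct model_mm` and the `model_mm_*` helpers are invented for this example and are not the kernel's mm_struct or any existing kernel interface, and a real implementation would need atomic reference counts and locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of an address space; NOT the kernel's mm_struct. */
struct model_mm {
    int  cpu_users;     /* CPU threads using this address space */
    int  offcpu_users;  /* device-side threads holding a reference (assumed field) */
    bool torn_down;     /* set once the last user of either kind is gone */
};

/* Start with a single CPU user: the thread that created the address space. */
static void model_mm_init(struct model_mm *mm)
{
    mm->cpu_users = 1;
    mm->offcpu_users = 0;
    mm->torn_down = false;
}

/* Tear down only when neither CPU nor device users remain. */
static void model_mm_maybe_teardown(struct model_mm *mm)
{
    if (mm->cpu_users == 0 && mm->offcpu_users == 0)
        mm->torn_down = true;
}

/* A thread on a compute-offload device starts using the address space. */
static void model_mm_get_offcpu(struct model_mm *mm)
{
    assert(!mm->torn_down);
    mm->offcpu_users++;
}

/* A device-side thread is done with the address space. */
static void model_mm_put_offcpu(struct model_mm *mm)
{
    assert(mm->offcpu_users > 0);
    mm->offcpu_users--;
    model_mm_maybe_teardown(mm);
}

/* A CPU thread is done with the address space. */
static void model_mm_put_cpu(struct model_mm *mm)
{
    assert(mm->cpu_users > 0);
    mm->cpu_users--;
    model_mm_maybe_teardown(mm);
}
```

In this model, if a device thread takes a reference with model_mm_get_offcpu() and the last CPU thread then calls model_mm_put_cpu(), the address space stays alive until the device thread calls model_mm_put_offcpu().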

Ben Herrenschmidt suggested that page-fault handling for devices could be improved. However, scheduling on IBM's CAPI devices is implemented entirely in hardware. That said, SPU scheduling on the Cell processor was done by the CPU, so an optional scheduler for attached devices is needed (and Arnd Bergmann recounted a few [...] HMM patchset already does all this for anonymous memory, and that he is working on a proof of concept for file-backed storage, and asked Joerg for review. Ben Herrenschmidt noted that cache-coherent devices may need to be handled a bit differently than non-cache-coherent devices.

David Woodhouse also noted that Intel has hardware support that can indicate that a given page was accessed from a device, but with no indication of which device if there is more than one. Ben Herrenschmidt pointed out that it is useful to also track which CPUs have accessed a given page, in addition to tracking the devices. Joerg suggested separate per-device page tables, but also pointed out that this goes against a common goal of having the external hardware reuse the CPU page tables (Jerome's HMM patch maintains separate page tables because some current hardware's page tables are not compatible with those of the CPU). Joerg also wondered what sort of hardware assist was being considered in earlier discussions.
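
The difference between a single "accessed by some device" indication and per-agent tracking can be sketched in userspace C as follows. This is only a model under stated assumptions: `struct model_page_access`, the 64-agent limit, and the `model_*` helpers are hypothetical names made up for this example, and none of it describes how Intel hardware or the HMM patchset actually records accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed limit for this sketch: at most 64 devices and 64 CPUs. */
#define MODEL_MAX_AGENTS 64u

/* Hypothetical per-page access record; not a real kernel or hardware structure. */
struct model_page_access {
    bool     any_device;  /* single bit: some device touched the page */
    uint64_t dev_mask;    /* bit n set: device n touched the page */
    uint64_t cpu_mask;    /* bit n set: CPU n touched the page */
};

/* Record an access by device 'dev'; sets both the coarse bit and the mask. */
static void model_record_dev_access(struct model_page_access *pa, unsigned int dev)
{
    assert(dev < MODEL_MAX_AGENTS);
    pa->any_device = true;
    pa->dev_mask |= UINT64_C(1) << dev;
}

/* Record an access by CPU 'cpu'. */
static void model_record_cpu_access(struct model_page_access *pa, unsigned int cpu)
{
    assert(cpu < MODEL_MAX_AGENTS);
    pa->cpu_mask |= UINT64_C(1) << cpu;
}

/* Did this particular device touch the page? Needs the mask, not the single bit. */
static bool model_dev_accessed(const struct model_page_access *pa, unsigned int dev)
{
    assert(dev < MODEL_MAX_AGENTS);
    return (pa->dev_mask >> dev) & 1u;
}

/* Did this particular CPU touch the page? */
static bool model_cpu_accessed(const struct model_page_access *pa, unsigned int cpu)
{
    assert(cpu < MODEL_MAX_AGENTS);
    return (pa->cpu_mask >> cpu) & 1u;
}
```

With only the any_device bit, a migration policy could tell that the page is in use by a device but not which one; the per-device and per-CPU masks are what would let it decide where the page is best placed.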

4. Common Interfaces

Joerg Roedel favored greater commonality across these devices, and also noted the likely need for IOMMU-related work. David noted that while internal on-chip functionality might reasonably have per-vendor specifications, in the longer term PCIe devices will support the transaction-layer packet (TLP) prefixes needed for shared virtual memory (SVM) use, which argues for the commonality that Joerg favors.