David Woodhouse: Device error handling / reporting / isolation

May 12, 2014

Participants: Bjorn Helgaas, Daniel Vetter, James Bottomley, Joerg Roedel, Josh Triplett, Laurent Pinchart, Matthew Wilcox, Rafael J. Wysocki, Roland Dreier, Tony Luck, and Will Deacon.

People tagged: Joerg Roedel, someone with intimate knowledge of EEH as used on Power systems, and KVM folks.

David Woodhouse raised the topic of device-driver errors, particularly in the presence of IOMMUs, and even more particularly in cases where a device emits an endless stream of errors that prevents the kernel from getting anything else done. It turns out that there are a number of ways to shut up such a device, including PCI function-level reset, power cycling the device, or perhaps configuring the IOMMU to ignore further errors from that device. Of course, the possibility of ignoring errors raises the question of when error reporting should be re-enabled. David would like these decisions to be made in generic device code, for both PCI and non-PCI devices, rather than inconsistently in each IOMMU driver.

Bjorn Helgaas, Rafael J. Wysocki, Joerg Roedel, and James Bottomley all expressed interest, with James indicating a further interest in avoiding an Intel-IOMMU-centric solution. James also asked if the errors in question were due to the device sending addresses that don't have IOTLB entries. David confirmed this case and added a second one: attempts to write through read-only mappings. However, David believes that other errors will come to light as well, and that Intel IOMMUs have properties similar to those of other vendors' IOMMUs.

Laurent Pinchart suggested that one of the other classes of errors will prove to be attempts to perform secure accesses on non-secure IOTLB entries. Laurent also suggested partitioning the problem into (1) identifying the offending device and (2) identifying a mechanism to handle the errors. Laurent doubts that #1 can be completely generic, and believes that #2 will require both generic and driver-specific code. Will Deacon pointed out that some non-PCI devices lack a specified way of making sure that a newly-killed device does not still have transactions in flight. In addition, ignoring fault reports can result in queue overflows in some implementations.

Josh Triplett is interested in using IOMMUs to protect against buggy and even against malicious devices. This is particularly important for systems (like some laptops) that allow external PCIe devices to be plugged in. David Woodhouse noted that this use case is what prompts the current implementation to give devices with no driver zero privileges and to give devices with a driver carefully whitelisted privileges. Roland Dreier argued that no special action should be required if there is no device driver, because no bits get turned on in the PCI command register until pci_enable_device() time. Roland also noted that his wifi adapter can already sniff and modify all his network traffic, so there is a limit to what IOMMU-level protection can accomplish, and wishes that VT-d were in better shape so that distros might enable it by default.

Tony Luck likes the idea of defending against buggy hardware from a RAS perspective. James Bottomley wondered what exactly needs to be done for RAS beyond having the IOMMU corral the device. Joerg Roedel wants proper fault handling, even on laptops and desktops, arguing that this will be needed for newer GPUs. Laurent Pinchart wants a mechanism to correctly report and handle IOMMU faults in order to prevent interrupt storms from causing DoS.

Daniel Vetter has considerable experience with these sorts of interrupt storms. In fact, they cause so much trouble that Daniel disables IOMMUs on his development systems, which in turn allows regressions to go unnoticed, reinforcing distros' decisions not to enable IOMMUs by default. Daniel therefore would like to see IOMMU interrupt-storm handling as a first step towards making IOMMU enablement safe on both development and production systems. Joerg Roedel agreed that the developer use case must be taken into account, but believes that there needs to be some way of re-enabling fault reporting for a device that was previously ignored due to interrupt storms. Daniel suggested that a disable/enable cycle of the PCI bus master should be a sufficient signal, but also suggested that simply re-enabling fault reporting in the IOMMU whenever any child device is re-enabled would suffice. In the latter case, if the interrupt storm resumed, the storm handling would simply kick in once again.
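
The storm-handling idea discussed above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code and not anything proposed in the discussion: the class name FaultStormGuard, its methods, and the threshold and window values are all hypothetical, chosen purely for illustration. It counts faults per device in a sliding time window, stops reporting faults from a device once the count exceeds the threshold, and resumes reporting when the driver signals a bus-master disable/enable cycle.

```python
from collections import deque


class FaultStormGuard:
    """Toy model of per-device fault-storm suppression (all names hypothetical)."""

    def __init__(self, threshold=5, window=1.0):
        self.threshold = threshold  # faults tolerated per window (assumed value)
        self.window = window        # window length in seconds (assumed value)
        self.recent = {}            # device -> deque of recent fault timestamps
        self.muted = set()          # devices whose fault reports are being ignored

    def report_fault(self, device, now):
        """Return True if the fault is reported, False if it is suppressed."""
        if device in self.muted:
            return False
        stamps = self.recent.setdefault(device, deque())
        stamps.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) > self.threshold:
            # Storm detected: stop reporting faults from this device.
            self.muted.add(device)
            stamps.clear()
            return False
        return True

    def bus_master_reenabled(self, device):
        """Model a bus-master disable/enable cycle: start listening again."""
        self.muted.discard(device)
        self.recent.pop(device, None)
```

In this model, a device that resumes its storm after being re-enabled is simply muted again once it crosses the threshold, which matches the "storm handling would kick in once again" behavior described above. What the sketch leaves out is exactly what the discussion flagged as hard: in-flight transactions and fault-queue overflow while reports are being ignored.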

Roland Dreier noted that there are many other PCI errors besides IOMMU faults, and wondered if this other error handling can also be consolidated. Roland is concerned about NVMe devices, which are PCIe-connected and might be put into hot-pluggable JBODs, at which point it becomes apparent that the kernel reacts less well to PCIe hotplug than to (say) SAS hotplug. Matthew Wilcox has been hearing rumors about NVMe hotplug problems, but hasn't seen bug reports. Matthew therefore requested that people put up or shut up on this topic. Roland replied that he was not trying to spread FUD, and that in any case the issues he is seeing are PCIe configuration problems rather than bugs in the NVMe driver itself.