Hannes Reinecke: SCSI error handling

July 8, 2013

Participants: Hannes Reinecke, Jan Kara, Alasdair G Kergon.

People tagged: (none)

Hannes Reinecke, who is working on revamping SCSI error handling, is looking into ways to make good use of the SCSI sense code. He is also interested in the more general problem of improving SCSI error handling.

SCSI Sense Code

Hannes lists a couple of existing approaches: uevents, which offer no way to prioritize events, and syslog. He proposes leveraging the latest version of vprintk_emit() to emit only structured data, so that there would be no text message in dmesg, but instead a structured message that could be processed by tools designed for this purpose. Hannes would like a discussion of the merits of this approach and of any alternatives.
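To make the idea of a tool that consumes structured records concrete, here is a minimal user-space sketch in Python. It assumes records in the /dev/kmsg layout (a "priority,sequence,timestamp,flags;text" header line followed by continuation lines of the form " KEY=value"). The function name parse_kmsg_record and the SCSI_SENSE_KEY, SCSI_ASC, and SCSI_ASCQ property names are hypothetical, chosen only for illustration; the discussion does not specify which keys the SCSI layer would emit.

```python
def parse_kmsg_record(record: str) -> dict:
    """Split one /dev/kmsg-style record into header fields, text, and properties.

    Assumed layout: "prio,seq,timestamp,flags;text" on the first line,
    then zero or more continuation lines of the form " KEY=value".
    """
    lines = record.rstrip("\n").split("\n")
    header, _, text = lines[0].partition(";")
    fields = header.split(",")
    prio = int(fields[0])

    props = {}
    for line in lines[1:]:
        # Continuation lines carrying structured data start with a space.
        if line.startswith(" "):
            key, _, value = line[1:].partition("=")
            props[key] = value

    return {
        "level": prio & 7,        # low three bits: log level
        "facility": prio >> 3,    # remaining bits: facility
        "seq": int(fields[1]),
        "timestamp_us": int(fields[2]),
        "text": text,             # empty if only structured data was emitted
        "props": props,
    }


# Hypothetical record: no text message, only structured properties.
sample = (
    "3,1201,98765432,-;\n"
    " SUBSYSTEM=scsi\n"
    " DEVICE=+scsi:0:0:0:0\n"
    " SCSI_SENSE_KEY=3\n"      # hypothetical key name
    " SCSI_ASC=0x11\n"         # hypothetical key name
    " SCSI_ASCQ=0x00\n"        # hypothetical key name
)

if __name__ == "__main__":
    rec = parse_kmsg_record(sample)
    print(rec["level"], rec["props"])
```

Because the record carries its log level in the header, a consumer of this kind could prioritize by level, which is the capability uevents are said to lack.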

Jan Kara suggested use of the existing netlink facility. Hannes replied that he had in fact tried netlink, and found the following shortcomings:

A new infrastructure was required.
The limited depth of netlink sockets can result in lost messages.
Netlink must do skb_alloc() for each message, which fails in low-memory situations (although Hannes admits that this last is a weak argument).

Hannes said that a second attempt by Ewan Milne used udev (patchset submitted to linux-scsi), but that this also had problems. Hannes then reiterated his preference for vprintk_emit(). Alasdair G Kergon expressed interest in Hannes's vprintk_emit() idea, noting that device-mapper is under increasing pressure to “abuse” (Alasdair's quotes) uevents to report error conditions.

Hidehiro Kawai is trying to handle user-space errors by adding a hash value to structured printk() output. H. Peter Anvin attested to the “warm” reception that the idea of unique IDs received at the 2011 LKS.

Bringing SCSI Error Handling Into the 21st Century

Hannes states that current SCSI error handling is based on the old SCSI-2 standard, which dates back to 1994. Hannes argues that the FAST_FAIL bit was invented specifically to bypass the old-style error handler, and believes that a better way forward is to update the error handler. Hannes is working on doing just this, and has updates that permit command aborts to be sent from the timeout handler and that implement an overall eh_deadline to specify a time limit on error handling, after which a host reset is sent. Hannes would also like to fail commands before the error handler completes in order to avoid I/O stalls, to dispense with the now-obsolete TARGET RESET command, and to account for the fact that BUS RESET has no direct meaning on modern SCSI transports. Hannes would like to define a meaningful error escalation strategy that takes into account modern SCSI commands. This escalation strategy should preferably terminate early when recovery proves impossible, disabling the LUN in this case.
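The interaction between an escalation ladder and an overall deadline can be illustrated with a small user-space simulation in Python. This is only a sketch of the control flow described above, not kernel code: the recover() function, the step names, and the injectable clock parameter are all hypothetical, and the real error handler's steps and ordering may differ.

```python
import time
from typing import Callable, Sequence, Tuple

# A recovery step: a label plus a callable returning True on success.
Step = Tuple[str, Callable[[], bool]]


def recover(
    steps: Sequence[Step],
    eh_deadline: float,
    host_reset: Callable[[], bool],
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Try each step in order until one succeeds or the deadline expires.

    Once eh_deadline seconds have elapsed, the remaining steps are skipped
    and a host reset is attempted. If the host reset also fails, recovery
    is declared impossible and the device is taken offline.
    """
    start = clock()
    for name, attempt in steps:
        if clock() - start >= eh_deadline:
            break  # out of time: skip the remaining, gentler steps
        if attempt():
            return name  # recovered at this level
    if host_reset():
        return "host reset"
    return "device offline"  # terminate early: disable the LUN


if __name__ == "__main__":
    # Hypothetical ladder: the abort fails, the LUN reset succeeds.
    ladder = [
        ("abort command", lambda: False),
        ("LUN reset", lambda: True),
    ]
    print(recover(ladder, eh_deadline=10.0, host_reset=lambda: True))
```

Passing the clock as a parameter is a design choice made here so that the deadline path can be exercised without real delays; a caller would normally rely on the default monotonic clock.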