[dm-devel] IO error on DM device

lbt transter at gmail.com
Thu Mar 30 18:17:58 UTC 2006


On 3/29/06, Murthy, Narasimha Doraswamy (STSD) <narasimha.murthy at hp.com>
wrote:
>
> Hi Alasdair,
>
> We are seeing an I/O error problem on a DM device when the HBA ports of
> another host, seen through the same switch, are disabled/enabled. We do not
> understand why the paths are failed when ports on other hosts are
> disabled. Please explain.
>
> Below is the problem description and steps to reproduce.
>
> *Problem:* I/O error on DM device on one host when HBA ports of
> another host are disabled.
>
> *OS distros :*  RHEL4.0 U2/U3.
>
> *HOW-TO reproduce the problem:*
>
> 1. Configure 2 storage arrays (A1, A2) and two hosts (H1, H2) in the same
> zone, so that both hosts can see both arrays. Create and present
> LUNs (L1, L2) from array (A1) to host (H1).
>
> 2. Stop the multipathd daemon (for the purpose of testing why the I/O error
> occurs when ports of other hosts are failed). If it is not stopped, the
> problem may take a long time to reproduce.
>
> 3. Start I/O on the DM device representing LUNs L1 and L2 on host H1. We used
> the *dt tool* to exercise I/O.
>
> 4. Disable the host ports of host H2, or any port of array A2, one after the
> other (a few times), OR disable and enable the same port of the other host
> a few times (maybe 4-5 times).
>
> 5. Application (dt tool) aborts with IO error on host H1.
>
> =====
>
>  Snippet of *syslog output* (while doing I/O on /dev/dm-0)
>
> Feb  1 11:47:14 apwtest52 kernel: SCSI error : <2 0 0 1> return code =
> 0x20000
>
> Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector
> 1584600
>
> Feb  1 11:47:14 apwtest52 kernel: device-mapper: dm-multipath: Failing
> path 8:0.     <=================path failed, after disabling/enabling the H2
> host port 1
>
> Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector
> 1584608
>
> Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 1 1> return code =
> 0x20000
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector
> 861400
>
> Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing
> path 8:96.   <=================path failed, after disabling/enabling  the H2
> host port 2
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector
> 861408
>
> Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code =
> 0x20000
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 452760
>
> Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing
> path 8:64.  <=================path failed after disabling/enabling the H2
> host port 1
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 452768
>
> Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code =
> 0x20000
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 453784
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 453792
>
> Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code =
> 0x20000
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 454808
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 454816
>
> Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code =
> 0x20000
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 863960
>
> Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector
> 863968
>
> Feb  1 11:48:40 apwtest52 kernel: SCSI error : <2 0 1 1> return code =
> 0x20000
>
> Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector
> 935384
>
> Feb  1 11:48:40 apwtest52 kernel: device-mapper: dm-multipath: Failing
> path 8:32.  <================= after disabling/enabling  the H2 host port 2
>
> Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector
> 935392
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116924    <============All paths to the device /dev/dm-0 failed
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116925
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116926
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116927
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116928
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116929
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116930
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116931
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116932
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116933
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116934
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116935
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116936
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116937
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116938
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116939
>
> Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical
> block 116940
>
>
>
> *Observations :*
>
>       As we fail ports on the other host, paths of the DM device are
> failed, and each subsequent disabling/enabling of ports (i.e. A2 or H2
> ports) results in more path failures. This leads to an all-paths-failed
> condition, which in turn results in the I/O error on RHEL4.0 U2/U3.
>
>      Through the device-mapper debug driver we find that there is no
> valid path in *__choose_pgpath()* and that *m->current_pgpath* (m is a
> pointer to struct multipath) is NULL when it comes to map_io() in
> dm-mpath.c.
>
> Another observation is that we are not seeing any I/O errors when the same
> test is executed on SLES9 SP3/SP4.
>
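
As an aside on reading the syslog snippet above: each "Failing path 8:N" line
names a major:minor pair, and these can be matched to the sdX names in the
neighbouring end_request lines. The sketch below assumes the conventional
Linux sd numbering (major 8, 16 minors per disk); the helper name sd_name is
made up for illustration and is not part of any tool.

```python
SD_MAJOR = 8          # first SCSI disk major number on Linux
MINORS_PER_DISK = 16  # each sd disk reserves 16 minors (whole disk + partitions)


def sd_name(dev):
    """Map a 'major:minor' string such as '8:96' to an sd device name.

    Hypothetical helper: only handles major 8 (sda..sdp).
    """
    major, minor = (int(part) for part in dev.split(":"))
    if major != SD_MAJOR:
        raise ValueError("only major %d is handled in this sketch" % SD_MAJOR)
    index = minor // MINORS_PER_DISK      # which disk: 0 -> sda, 1 -> sdb, ...
    partition = minor % MINORS_PER_DISK   # 0 means the whole disk
    name = "sd" + chr(ord("a") + index)
    return name if partition == 0 else name + str(partition)


if __name__ == "__main__":
    # The four paths reported as failed in the quoted log.
    for path in ("8:0", "8:96", "8:64", "8:32"):
        print(path, "->", sd_name(path))
```

With this mapping the four failed paths line up with the sda, sdg, sde and
sdc errors in the log.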

Are you using the QLogic or Emulex driver with your RHEL4 and SLES9 systems?


The 0x20000 errors you are seeing correspond to the DID_BUS_BUSY error. This
is considered one of the 'retryable' errors by the SCSI layer. As far as I
know, the QLogic driver uses the DID_BUS_BUSY return code to force a retry
of I/O for various fabric events. (I think they are planning to clean this
up in their driver, or maybe already have, removing this hack in favor of
the block/unblock interface. This could be double-checked by looking at the
QLogic source code for the driver version you're using.) Since dm I/O uses a
failfast flag, these retryable errors won't get retried by the SCSI layer
and are immediately propagated up to dm, which is probably why you're
getting errors even on paths that should be okay. If this is the case, I
would expect the same problem to occur on your SLES9 systems too if you're
using the QLogic driver.
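
To make the 0x20000 decoding concrete, here is a minimal sketch. It assumes
the usual Linux SCSI result layout (status in bits 0-7, message in bits 8-15,
host byte in bits 16-23, driver byte in bits 24-31) and the host-byte values
from the kernel's scsi.h. The function names are made up for illustration,
and midlayer_would_retry is only a rough model of the failfast behaviour
described above, not the kernel's actual logic.

```python
# Host-byte values as defined in the Linux kernel's SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result code into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def host_byte_name(result):
    """Return the symbolic name of the host byte, if it is in the table."""
    host = decode_scsi_result(result)["host"]
    return HOST_BYTE_NAMES.get(host, "unknown (0x%02x)" % host)


def midlayer_would_retry(result, failfast, retries_left=True):
    """Simplified model (an assumption, not kernel code): a retryable
    host error is retried only if the request is not marked failfast."""
    retryable = host_byte_name(result) == "DID_BUS_BUSY"
    return retryable and retries_left and not failfast


if __name__ == "__main__":
    code = 0x20000  # the return code from the quoted syslog
    print(hex(code), "->", host_byte_name(code))
    print("retried without failfast:", midlayer_would_retry(code, failfast=False))
    print("retried with failfast:   ", midlayer_would_retry(code, failfast=True))
```

In this model the same DID_BUS_BUSY result is retried for an ordinary request
but handed straight back for a failfast one, which is the situation dm is in.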

I assume you're not using queue_if_no_path? From a dm and user perspective,
this is the only thing I can think of to work around this issue until the
patch to propagate error codes up the stack is included and/or QLogic stops
using DID_BUS_BUSY to force retries.
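
As a rough illustration of why queue_if_no_path helps here, the toy model
below contrasts the two behaviours once every path has been failed. It is a
sketch only: the class and method names are invented for this example and do
not correspond to dm-mpath.c.

```python
from collections import deque


class MultipathModel:
    """Toy model of a multipath device (hypothetical, not dm-mpath code)."""

    def __init__(self, paths, queue_if_no_path=False):
        self.active = set(paths)            # paths currently usable
        self.queue_if_no_path = queue_if_no_path
        self.queued = deque()               # I/O held while no path is usable

    def fail_path(self, path):
        self.active.discard(path)

    def reinstate_path(self, path):
        """Bring a path back and return any I/O that was waiting for it."""
        self.active.add(path)
        flushed = list(self.queued)
        self.queued.clear()
        return flushed

    def submit(self, io):
        if self.active:
            return "dispatched"
        if self.queue_if_no_path:
            self.queued.append(io)          # hold the I/O instead of failing it
            return "queued"
        return "EIO"                        # all paths failed: error to the caller


if __name__ == "__main__":
    paths = ["8:0", "8:32", "8:64", "8:96"]
    for queueing in (False, True):
        dev = MultipathModel(paths, queue_if_no_path=queueing)
        for p in paths:
            dev.fail_path(p)
        print("queue_if_no_path=%s:" % queueing, dev.submit("write-1"))
```

Without queueing, the application sees the error as soon as the last path is
failed; with it, the I/O waits until a path is reinstated.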

Thanks,
lan

> Please provide some pointers on why we are seeing this behavior or is this a
> known thing at this point in time?
>
> Thanks and regards
>
> -Murthy