Fwd: [dm-devel] IO error on DM device

Ryan O'Hara rohara at redhat.com
Thu Mar 30 18:59:56 UTC 2006


Pardon the top-post.

When a port gets disabled, unplugged, etc., the switch issues an RSCN 
(Registered State Change Notification) event. This is normal. All the 
HBAs still connected to the switch (and in the same zone, if I 
understand correctly) will see this RSCN event. It is propagated as a 
DID_BUS_BUSY, which is the 0x20000 you are seeing, and should be 
transient. However, this error seems to cause all the other IO paths to 
die when it is caught in a multipath environment.
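
To make the 0x20000 concrete, here is a rough Python sketch (an 
illustration only, not kernel code) of how that return code splits into 
its component bytes. The helper name decode_scsi_result is made up for 
this example; the byte layout and the DID_* values are assumed to match 
the kernel's include/scsi/scsi.h.

```python
# Host-byte values, assumed to match the Linux SCSI midlayer definitions
# in include/scsi/scsi.h (only the first few codes are listed here).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    host = (result >> 16) & 0xFF
    return {
        "status_byte": result & 0xFF,         # bits 0-7: SCSI status
        "msg_byte": (result >> 8) & 0xFF,     # bits 8-15: message byte
        "host_byte": host,                    # bits 16-23: host (HBA) code
        "driver_byte": (result >> 24) & 0xFF, # bits 24-31: driver code
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


fields = decode_scsi_result(0x20000)
print(fields["host_name"])  # DID_BUS_BUSY
```

Only the host byte is set in 0x20000, so the error originates at the 
HBA/transport level rather than from the target device.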

For md, the solution is to use mdmpd (md's version of the multipath 
daemon). This keeps the "innocent" IO paths up and running. Without it, 
unplugging one HBA will cause all the IO paths to fail.

So based on the comments below and steps to reproduce, I'd say this is 
expected behavior since multipathd is turned off. What happens if 
multipathd is left running? It is implied that not stopping multipathd 
will still cause a failure but will take much longer. I am no expert on 
multipathd, so perhaps one of the dm developers can comment on 
how multipathd handles these SCSI errors.

Ryan



Jonathan E Brassow wrote:
>
> This is the bus busy error that makes it through the multipath 
> drivers... Do you guys have any comments/status that you can give on 
> this issue?
> 
> brassow
> 
> Begin forwarded message:
> 
>     *From: *"Murthy, Narasimha Doraswamy (STSD)" <narasimha.murthy at hp.com>
>     *Date: *March 29, 2006 10:03:03 AM CST
>     *To: *"Alasdair G Kergon" <agk at redhat.com>
>     *Cc: *device-mapper development <dm-devel at redhat.com>, Securelinux
>     <securelinux at hp.com>
>     *Subject: *[dm-devel] IO error on DM device
>     *Reply-To: *device-mapper development <dm-devel at redhat.com>
> 
>     Hi Alasdair,
> 
>     We are seeing an IO error problem on a DM device when the HBA ports
>     of another host, seen through the same switch, are disabled/enabled.
>     We do not understand why the paths are failed when ports on
>     other hosts are disabled. Please explain.
> 
>     Below is the problem description and steps to reproduce.
> 
>     *Problem     : * I/O Error on DM device on one host when HBA ports
>     of another host are disabled.
> 
>     *OS distros :*  RHEL4.0 U2/U3.
> 
>     *HOW-TO reproduce the problem:*
> 
>     1. Configure 2 storage arrays (A1, A2) and two hosts (H1, H2) in the
>     same zone, so that both the hosts can see both the arrays. Create
>     and present LUNs (L1, L2) from array (A1) to host (H1).
> 
>     2. Stop the multipathd daemon (to investigate why the IO error
>     occurs when ports of other hosts fail). If it is left running, the
>     problem may take a long time to reproduce.
> 
>     3. Start I/O on the DM device representing LUNs L1 and L2 on host H1.
>     We used the *dt tool* to exercise IO.
> 
>     4. Disable host ports of host H2 or any port of array A2 one after
>     the other (a few times), OR disable and enable the same port of the
>     other host a few times (maybe 4-5 times).
> 
>     5. Application (dt tool) aborts with IO error on host H1.
> 
> 
>     =====
> 
>     Snippet of *syslog output* (while doing I/O on /dev/dm-0)
> 
> 
>     Feb  1 11:47:14 apwtest52 kernel: SCSI error : <2 0 0 1> return code
>     = 0x20000           
> 
>     Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda,
>     sector 1584600
> 
>     Feb  1 11:47:14 apwtest52 kernel: device-mapper: dm-multipath:
>     Failing path 8:0.     <=================path failed, after
>     disabling/enabling the H2 host port 1
> 
>     Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda,
>     sector 1584608
> 
>     Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 1 1> return code
>     = 0x20000                
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg,
>     sector 861400
> 
>     Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath:
>     Failing path 8:96.   <=================path failed, after
>     disabling/enabling  the H2 host port 2
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg,
>     sector 861408
> 
>     Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code
>     = 0x20000
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 452760
> 
>     Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath:
>     Failing path 8:64.  <=================path failed after
>     disabling/enabling the H2 host port 1
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 452768
> 
>     Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code
>     = 0x20000
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 453784
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 453792
> 
>     Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code
>     = 0x20000
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 454808
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 454816
> 
>     Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code
>     = 0x20000
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 863960
> 
>     Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde,
>     sector 863968
> 
>     Feb  1 11:48:40 apwtest52 kernel: SCSI error : <2 0 1 1> return code
>     = 0x20000
> 
>     Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc,
>     sector 935384
> 
>     Feb  1 11:48:40 apwtest52 kernel: device-mapper: dm-multipath:
>     Failing path 8:32.  <================= after disabling/enabling  the
>     H2 host port 2
> 
>     Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc,
>     sector 935392
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116924    <============All paths to the device
>     /dev/dm-0 failed
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116925
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116926
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116927
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116928
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116929
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116930
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116931
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116932
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116933
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116934
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116935
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116936
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116937
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116938
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116939
> 
>     Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0,
>     logical block 116940
> 
>                             
> 
>     *Observations :*
> 
>     When we fail a port on the other host, paths of the dm device are
>     failed, and each subsequent disabling/enabling of a port (i.e. A2 or
>     H2 ports) results in more path failures. This leads to an
>     all-paths-failed condition, which in turn results in an IO error on
>     RHEL4.0 U2/U3.
> 
>     Using the device-mapper debug driver, we find that there is no valid
>     path in */__choose_pgpath()/* and that */m->current_pgpath/* (m is a
>     pointer to struct multipath) is NULL when it reaches map_io() in
>     dm-mpath.c.
> 
>     Another observation is that we are not seeing any IO errors when the
>     same test is executed on SLES9 SP3/SP4.
> 
>     Please provide some pointers on why we are seeing this behavior, or
>     let us know whether this is a known issue at this point.
> 
>     Thanks and regards
> 
>     -Murthy
> 
>     -- 
>     dm-devel mailing list
>     dm-devel at redhat.com
>     https://www.redhat.com/mailman/listinfo/dm-devel
> 
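
For anyone reading the quoted log, the major:minor pairs in the 
"Failing path" lines can be matched to the sd devices in the 
neighbouring end_request lines. Below is a small Python sketch of that 
mapping. It assumes the conventional numbering for SCSI disk major 8 
(16 minors per disk, so it only covers sda through sdp); the function 
names sd_name and failed_paths are hypothetical.

```python
import re

SD_MAJOR = 8          # first SCSI disk major number (assumed convention)
MINORS_PER_DISK = 16  # minor 0 is the whole disk, 1-15 are partitions


def sd_name(major, minor):
    """Map a major:minor pair under major 8 to an sd device name."""
    if major != SD_MAJOR or not 0 <= minor <= 255:
        raise ValueError("only major 8 with minors 0-255 is handled")
    disk, part = divmod(minor, MINORS_PER_DISK)
    name = "sd" + chr(ord("a") + disk)
    return name + (str(part) if part else "")


def failed_paths(log_text):
    """Return the device name for each 'Failing path M:N' entry, in order."""
    pairs = re.findall(r"Failing path (\d+):(\d+)", log_text)
    return [sd_name(int(major), int(minor)) for major, minor in pairs]


log = """
device-mapper: dm-multipath: Failing path 8:0.
device-mapper: dm-multipath: Failing path 8:96.
device-mapper: dm-multipath: Failing path 8:64.
device-mapper: dm-multipath: Failing path 8:32.
"""
print(failed_paths(log))  # ['sda', 'sdg', 'sde', 'sdc']
```

The result agrees with the log: the four failed paths are sda, sdg, sde 
and sdc, after which dm-0 has no usable path left.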
