
Re: [dm-devel] IO error on DM device





On 3/29/06, Murthy, Narasimha Doraswamy (STSD) <narasimha murthy hp com> wrote:

Hi Alasdair,

We are seeing I/O errors on a DM device when the HBA ports of another host, seen through the same switch, are disabled and re-enabled. We do not understand why paths on our host fail when ports on another host are disabled. Please explain.

Below is the problem description and steps to reproduce.

Problem     :  I/O Error on DM device on one host when HBA ports of another host are disabled.

OS distros :  RHEL4.0 U2/U3.

HOW-TO reproduce the problem :

1. Configure two storage arrays (A1, A2) and two hosts (H1, H2) in the same zone, so that both hosts can see both arrays. Create and present LUNs (L1, L2) from array A1 to host H1.

2. Stop the multipathd daemon. (This is only for testing, to investigate why I/O errors occur when ports of other hosts fail; with the daemon running, the problem may take a long time to reproduce.)

3. Start I/O on the DM device representing LUNs L1 and L2 on host H1. We used the dt tool to exercise I/O.

4. Disable the host ports of host H2, or any ports of array A2, one after the other (a few times), OR disable and enable the same port of the other host a few times (perhaps 4-5 times).

5. The application (dt tool) aborts with an I/O error on host H1.


=====

Snippet of syslog output (while doing I/O on /dev/dm-0):


Feb  1 11:47:14 apwtest52 kernel: SCSI error : <2 0 0 1> return code = 0x20000           

Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector 1584600

Feb  1 11:47:14 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:0.     <=================path failed, after disabling/enabling the H2 host port 1

Feb  1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector 1584608

Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 1 1> return code = 0x20000                

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector 861400

Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:96.   <=================path failed, after disabling/enabling  the H2 host port 2

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector 861408

Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 452760

Feb  1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:64.  <=================path failed after disabling/enabling the H2 host port 1

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 452768

Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 453784

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 453792

Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 454808

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 454816

Feb  1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 863960

Feb  1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 863968

Feb  1 11:48:40 apwtest52 kernel: SCSI error : <2 0 1 1> return code = 0x20000

Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector 935384

Feb  1 11:48:40 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:32.  <================= after disabling/enabling  the H2 host port 2

Feb  1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector 935392

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116924    <============ All paths to the device /dev/dm-0 failed

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116925

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116926

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116927

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116928

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116929

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116930

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116931

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116932

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116933

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116934

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116935

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116936

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116937

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116938

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116939

Feb  1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116940
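
As a reading aid for the log above: the "Failing path" messages identify each path by a major:minor pair, while the end_request lines use sd names. Assuming the conventional Linux numbering, where major 8 is the first SCSI disk major and each disk owns 16 minors, the two can be matched up. This is only a minimal sketch; the helper name sd_name is hypothetical and it covers just the 16 disks on major 8.

```python
import string

SD_MAJOR = 8          # first SCSI disk major (assumed conventional numbering)
MINORS_PER_DISK = 16  # whole disk plus up to 15 partitions


def sd_name(major, minor):
    """Map a major:minor pair on major 8 to an sd device name."""
    if major != SD_MAJOR or not 0 <= minor < 256:
        raise ValueError("only minors 0-255 on major 8 are handled here")
    disk, part = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk]
    # Partition 0 means the whole disk, so no numeric suffix is added.
    return name + (str(part) if part else "")


if __name__ == "__main__":
    for minor in (0, 96, 64, 32):
        print("8:%d -> %s" % (minor, sd_name(8, minor)))
```

With this mapping, the four failed paths 8:0, 8:96, 8:64 and 8:32 line up with sda, sdg, sde and sdc, the same devices named in the adjacent end_request messages.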

                        

Observations :

As ports are failed on the other host, paths of the DM device on H1 are failed, and each subsequent disabling/enabling of a port (i.e. A2 or H2 ports) results in more path failures. This eventually leads to an all-paths-failed condition, which in turn results in I/O errors on RHEL4.0 U2/U3.

Using the device-mapper debug driver, we found that there is no valid path in __choose_pgpath() and that m->current_pgpath (m is a pointer to struct multipath) is NULL when execution reaches map_io() in dm-mpath.c.

Another observation is that we do not see any I/O errors when the same test is run on SLES9 SP3/SP4.


Are you using the QLogic or Emulex driver with your RHEL4 and SLES9 systems?

The 0x20000 errors you are seeing correspond to the DID_BUS_BUSY error. This is considered one of the 'retryable' errors by the SCSI layer. As far as I know, the QLogic driver uses the DID_BUS_BUSY return code to force a retry of I/O for various fabric events. (I think they are planning to clean this up in their driver, or may already have, removing this hack in favor of the block/unblock interface. This could be double-checked by looking at the QLogic source code for the driver version you're using.) Since dm I/O uses a failfast flag, these retryable errors are not retried by the SCSI layer and are propagated immediately up to dm, which is probably why you're getting errors even on paths that should be okay. If this is the case, I would expect the same problem to occur on your SLES9 systems too if you're using the QLogic driver there.
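
To make the 0x20000 decoding concrete, here is a small sketch that splits a SCSI result word into its four bytes. It assumes the usual layout in which the host byte occupies bits 16-23 (as the kernel's host_byte() macro extracts it); the function name split_result and the partial name table are illustrative only, with the DID_* values taken from the kernel's SCSI headers.

```python
# Partial table of host-byte codes; only a few entries are listed here.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
}


def split_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "status": result & 0xFF,          # SCSI status from the target
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter / LLD code (DID_*)
        "driver": (result >> 24) & 0xFF,  # mid-layer driver byte
    }


if __name__ == "__main__":
    parts = split_result(0x20000)
    print(parts, HOST_BYTE_NAMES.get(parts["host"], "unknown"))
```

For the value in the log, only the host byte is non-zero, and it equals 0x02, which is DID_BUS_BUSY.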

I assume you're not using queue_if_no_path? From a dm and user perspective, this is the only workaround I can think of for this issue until the patch to propagate error codes up the stack is included and/or QLogic stops using DID_BUS_BUSY to force retries.
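
To illustrate why queue_if_no_path would mask the problem, here is a toy model of the behaviour described in this thread. It is not the dm-mpath.c code: the class ToyMultipath and its methods are hypothetical, and the model only captures the idea that I/O submitted with no valid path is either failed immediately or held until a path is reinstated.

```python
class ToyMultipath:
    """Toy model of multipath path selection; not the kernel implementation."""

    def __init__(self, paths, queue_if_no_path=False):
        self.valid = list(paths)              # paths currently usable
        self.queue_if_no_path = queue_if_no_path
        self.queued = []                      # I/O held while no path is valid

    def fail_path(self, path):
        """Mark a path as failed, e.g. after an error returned on it."""
        if path in self.valid:
            self.valid.remove(path)

    def reinstate_path(self, path):
        """Bring a path back and resubmit anything that was queued."""
        if path not in self.valid:
            self.valid.append(path)
        pending, self.queued = self.queued, []
        return [self.submit(io) for io in pending]

    def submit(self, io):
        """Dispatch on the first valid path, else queue or fail the I/O."""
        if self.valid:
            return ("dispatched", self.valid[0], io)
        if self.queue_if_no_path:
            self.queued.append(io)
            return ("queued", None, io)
        return ("error", None, io)


if __name__ == "__main__":
    paths = ["8:0", "8:96", "8:64", "8:32"]
    for queue in (False, True):
        mp = ToyMultipath(paths, queue_if_no_path=queue)
        for p in paths:
            mp.fail_path(p)
        print(queue, mp.submit("write-1"))
```

In this model, once all four paths have been failed, a submission returns an error without queue_if_no_path (the situation in the log), whereas with it the I/O waits and is dispatched when a path comes back.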

Thanks,
lan

Please provide some pointers on why we are seeing this behavior. Is this a known issue at this point?

Thanks and regards

-Murthy







 

 



--
dm-devel mailing list
dm-devel redhat com
https://www.redhat.com/mailman/listinfo/dm-devel


