[linux-cluster] multipath issue... Smells of hardware issue.

Thu Jul 5 20:01:28 UTC 2007

On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote:
>    Hi,
> 
>    I have a setup with two identical RX200s3 FuSi servers talking to a SAN
>    (SX60 + extra controller), and that works fine with gfs1.
> 
>    I do, however, see some errors on one of the servers. They show up in my
>    message log only now and then (always under load, though I can't
>    reproduce the error by loading the server deliberately).
> 
>    The error says:
>    Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
>    Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector
>    705160231
>    Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
>    Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
>    Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
>    Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector
>    739870727
>    Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32.
>    Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is
>    up
>    Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
>    Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    To me it looks like a fiber that bounces up and down. (There is no switch
>    involved).
> 
>    Sometimes I only get a slightly shorter version:
>    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code =
>    0x00070000
>    Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector
>    2782490295
>    Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
>    Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active
>    paths: 1
>    Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is
>    up
>    Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
>    Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active
>    paths: 2
> 
>    Any suggestions, other than to start swapping hardware?

It's possible that your SCSI device is timing out the SCSI read command issued
by the readsector0 path checker, which is what your setup appears to be using
to check the path status.  This checker has its timeout set to 5 minutes, but I
suppose that a command could take that long if your hardware is flaky. If
you're willing to recompile the code, you can change this default by changing
DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the SCSI command timeout
in milliseconds.
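
Before touching the timeout, it may help to measure how long each path
actually stays failed, so there is a number to compare against the 5-minute
default. The following is only a rough sketch: outage_seconds is a made-up
helper name, the sample lines are the multipathd entries from the log quoted
above with the wrapped lines rejoined, and the year is an assumption, since
syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# multipathd lines copied from the quoted log (wrapped lines rejoined).
LOG = """\
Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
"""

# Timestamp, then hostname, then "multipathd: <major:minor>: <event>".
EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ multipathd: "
    r"(\d+:\d+): (mark as failed|reinstated)$"
)


def outage_seconds(text, year=2007):
    """Return (path, seconds) for each failed -> reinstated pair.

    The year is an assumption; syslog lines do not include it.
    """
    failed_at = {}  # path -> time it was marked as failed
    outages = []
    for line in text.splitlines():
        match = EVENT.match(line.strip())
        if not match:
            continue
        stamp, path, event = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if event == "mark as failed":
            failed_at[path] = when
        elif path in failed_at:
            delta = when - failed_at.pop(path)
            outages.append((path, int(delta.total_seconds())))
    return outages


for path, secs in outage_seconds(LOG):
    print(f"path {path} was down for about {secs}s")
```

On the lines quoted above this reports gaps of 5, 4 and 5 seconds. To run it
against a real log, read /var/log/messages (or wherever your syslog writes)
and pass the text to outage_seconds instead of LOG.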

Otherwise, if you are only seeing this on one server, swapping hardware seems
like a reasonable thing to try.
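
The "return code = 0x00070000" in the kernel lines can also be split into its
parts. As far as I know, the Linux SCSI layer packs four bytes into that
value: status in the lowest byte, then message, host and driver bytes. The
sketch below assumes that layout; decode_scsi_result is a made-up helper name,
and the name table is only a subset of the DID_* host codes from the kernel's
include/scsi/scsi.h.

```python
# Subset of the host byte codes from the kernel's include/scsi/scsi.h.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result into its four bytes (assumed layout)."""
    return {
        "status": result & 0xFF,          # raw SCSI status byte
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter / low-level driver
        "driver": (result >> 24) & 0xFF,  # mid-layer driver byte
    }


fields = decode_scsi_result(0x00070000)
host_name = HOST_BYTE_NAMES.get(fields["host"], "unknown")
print(f"host byte 0x{fields['host']:02x} ({host_name})")
# prints: host byte 0x07 (DID_ERROR)
```

For your log this gives a host byte of 0x07 with everything else zero, that
is, DID_ERROR. That is a fairly generic host-side error and does not say which
component is at fault.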

-Ben

>    Mvh / Kind regards
> 
>    Kristoffer Lippert
>    Systemansvarlig
>    JP/Politiken A/S
>    Online Magasiner
> 
>    Tlf. +45 8738 3032
>    Cell. +45 6062 8703
