
[Linux-cluster] GFS blows up after a SAN path failover



I've been working for several weeks to get my RHEL5 cluster configured
with GFS and multipathing to a Sun 6140 SAN. The multipathing has been
giving me the most trouble and after some tests this morning I concluded
that everything is working great. But then I took a closer look at the
nodes in my cluster and all was not as it seemed.

While running a "write" test to the GFS volume on all my nodes, I
simulated a SAN path failure, causing multipathd to fail over to the
second path. All the multipathing worked as it should, however my GFS
volumes blew up and my cluster is now in disarray.

Details:
- RHEL5-64
- GFS1
- multipathing
- Sun 6140 SAN
- stock RH qlogic driver


This is the typical dmesg output on my cluster nodes:
device-mapper: multipath: Failing path 8:32.
device-mapper: multipath: Failing path 8:80.
Buffer I/O error on device dm-11, logical block 9811906
lost page write due to I/O error on dm-11
GFS: fsid=p1_logging_mdi:mdi_log.4: fatal: I/O error
GFS: fsid=p1_logging_mdi:mdi_log.4:   block = 13073841
GFS: fsid=p1_logging_mdi:mdi_log.4:   function = gfs_logbh_wait
GFS: fsid=p1_logging_mdi:mdi_log.4:   file =
/builddir/build/BUILD/gfs-kmod-0.1.16/_kmod_build_/src/gfs/dio.c, line =
925
GFS: fsid=p1_logging_mdi:mdi_log.4:   time = 1187897620
GFS: fsid=p1_logging_mdi:mdi_log.4: about to withdraw from the cluster
GFS: fsid=p1_logging_mdi:mdi_log.4: telling LM to withdraw
Buffer I/O error on device dm-11, logical block 9811907
lost page write due to I/O error on dm-11
Buffer I/O error on device dm-11, logical block 9811908
lost page write due to I/O error on dm-11

Please help. The GFS volumes should be able to survive a path failover,
especially when the failover itself only takes a few seconds. This
sounds like it's timeout related, but where do I tweak the timeout
values? At the SCSI layer (and if so, how)? At the HBA layer? At the
GFS layer?
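
For what it's worth, on similar dm-multipath setups the usual knobs are
in /etc/multipath.conf and at the HBA driver. The sketch below is an
untested example, not config from my system: the idea is to queue I/O
while paths recover (so GFS never sees the I/O error) rather than fail
it immediately. The device vendor/product strings are placeholders --
check yours with "multipath -ll" before copying anything.

```
# /etc/multipath.conf (sketch -- verify values against multipath -ll)
defaults {
    # Queue I/O instead of returning errors while all paths are down.
    # Equivalent older syntax: features "1 queue_if_no_path"
    no_path_retry    queue
}

# /etc/modprobe.conf (sketch -- QLogic HBA, seconds to wait before
# the driver declares a port dead and errors outstanding I/O)
options qla2xxx qlport_down_retry=60
```

Note that "no_path_retry queue" queues forever if no path ever comes
back, which can hang the node instead; a bounded value like
"no_path_retry 12" (number of retry intervals) is a common compromise.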

Trying to reboot any one of my nodes is now a problem. They all just
hang trying to umount the GFS volumes. 

I'm quite new to this so I'm hoping for some serious hand holding.

cheers,
--james



