[Linux-cluster] Hard lockups when writing a lot to GFS

David Teigland teigland at redhat.com
Fri Dec 10 06:22:49 UTC 2004


On Thu, Dec 09, 2004 at 02:31:29PM -0800, Rick Stevens wrote:
> I have a two-node setup on a dual-port SCSI SAN.  Note this is just
> for test purposes.  Part of the SAN is a GFS filesystem shared between
> the two nodes.
> 
> When we fetch content to the GFS filesystem via an rsync pull (well,
> several rsync pulls) on node 1, it runs for a while and then node 1
> hard-locks (nothing on the console; the network dies, the console dies,
> it's frozen solid).  Of course, node 2 notices and marks node 1 down
> (/proc/cluster/nodes shows an "X" for node 1 under "Sts").  So the
> cluster behaviour is OK.  If I run "fence_ack_manual -n node1" on
> node 2, it carries on happily.  I can reboot node 1 and everything
> returns to normal.
> 
> The problem is, why is node 1 dying like this?  It is important that
> this get sorted out as we have a LOT of data to synchronize (rsync is
> just the test case--we'll probably use a different scheme on
> deployment), and I suspect it's heavy write activity on that node
> that's causing the crash.
> 
> Oh, both nodes have the GFS filesystem mounted with "-o rw,noatime".
> 
> Any ideas would be GREATLY appreciated!
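
For reference, the failure check and manual recovery described above boil
down to something like this on node 2 (node names are illustrative):

    # Check cluster membership -- a dead node shows an "X" under "Sts"
    cat /proc/cluster/nodes

    # Acknowledge that node1 has been fenced by hand so recovery of its
    # GFS journals can proceed
    fence_ack_manual -n node1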

You might want to verify it's not the hardware giving you problems.  SCSI
RAID devices can behave badly when heavily loaded from more than one
initiator at a time.  I'd partition the device or create two LUNs, put
ext3 on each, have each node mount one of the ext3 filesystems, and run
the same kind of load test.  Since you don't see any SCSI errors on
either machine this may seem unlikely, but it's a place to start.
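
A minimal sketch of that test, assuming the shared device is /dev/sdb and
/mnt/scratch exists on both nodes (both names are made up; substitute
your own):

    # Carve the shared device into two partitions (e.g. with fdisk),
    # then put ext3 on each -- done once, from either node:
    mkfs.ext3 /dev/sdb1
    mkfs.ext3 /dev/sdb2

    # Node 1 mounts only the first partition:
    mount -o rw,noatime /dev/sdb1 /mnt/scratch

    # Node 2 mounts only the second:
    mount -o rw,noatime /dev/sdb2 /mnt/scratch

    # Now rerun the same rsync pulls against /mnt/scratch on both nodes.
    # If node 1 still locks up under ext3, suspect the hardware or the
    # SCSI driver rather than GFS.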

-- 
Dave Teigland  <teigland at redhat.com>