
[Linux-cluster] GFS2 becomes non-responsive, no fencing



Hi everyone,

Have run into a strange problem on our RH cluster installation.  We
have a cluster that uses iscsi shared storage for GFS2.  It's been
running for months with no problems.

Today, the app on one node died.  I logged in, assumed things were
fenced, and tried to go about my business of restarting it.  After
some fiddling, I got the box back in the cluster fine.

It just happened again, and I've dug in a bit more.  I was wrong - the
failed node has not been fenced.  The last thing in dmesg on the
failing node is:

GFS2: fsid=: Trying to join cluster "lock_dlm", "sensors:rrd_gfs"
GFS2: fsid=sensors:rrd_gfs.1: Joined cluster. Now mounting FS...
GFS2: fsid=sensors:rrd_gfs.1: jid=1, already locked for use
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Looking at journal...
GFS2: fsid=sensors:rrd_gfs.1: jid=1: Done

Any read or write to the mounted filesystem hangs, as if the DLM can't
get locks.  Connectivity to the storage is good: no interfaces show
dropped packets or errors.  cman_tool reports the node as healthy:

[root sensor01 ~]# cman_tool status
Version: 6.0.1
Config Version: 14
Cluster Name: sensors
Cluster Id: 14059
Cluster Member: Yes
Cluster Generation: 368
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0 11  
Node name: sensor01.dc3
Node ID: 1
Multicast addresses: 239.192.54.34 

The missing vote is a third node that is not yet live, but it's been
in that state for weeks now with no problems.
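
To double-check the vote arithmetic in the status output above, here is a
rough sketch. It assumes the usual majority rule, quorum =
floor(expected_votes / 2) + 1, and parses the "Key: value" lines as cman_tool
prints them. The function names are made up for illustration and are not part
of any cluster tool.

```python
def parse_status(text):
    """Parse 'Key: value' lines (as printed by cman_tool status) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def majority_quorum(expected_votes):
    """Assumed rule: a simple majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(fields):
    """True when the votes present meet the reported quorum."""
    return int(fields["Total votes"]) >= int(fields["Quorum"])


# Values copied from the cman_tool status output in this post.
SAMPLE = """Expected votes: 3
Total votes: 2
Quorum: 2
"""

fields = parse_status(SAMPLE)
print("majority of expected votes:", majority_quorum(int(fields["Expected votes"])))
print("quorate:", is_quorate(fields))
```

With 3 expected votes the majority is 2, which matches the reported Quorum,
so the two live nodes should be quorate even with the third node absent.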

[root sensor01 ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   1   M    360   2008-08-25 16:24:29  sensor01.dc3
       Last fenced:   2008-08-25 16:04:25 by leaf8b-2.dc3
   2   M    364   2008-08-25 16:24:29  sensor02.dc3
   3   X    364                        sensor03.dc3
       Node has not been fenced since it went down

The fencing above is when I rebooted the node - because processes were
hung on GFS I/O, I had to hard reset the box, which caused the other
nodes to fence it.

Cluster LVM operations seem to work fine - I can query all LVM objects
without a problem.  But as soon as I try a filesystem operation, boom,
I hang.
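
One possible starting point for a hang like this is to list the processes
stuck in uninterruptible sleep (state D) along with the kernel symbol they are
waiting in. The following is only a sketch: it assumes a Linux /proc layout,
the helper names are hypothetical, and the wchan file may be empty or
unreadable on some kernels.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc="/proc"):
    """Return (pid, comm, wchan) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan not available
        blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan)
```

If the hung processes all report a wait symbol in the GFS2 or DLM code, that
would at least be consistent with the lock requests never completing.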

Any hints on where I can start looking?

-- 
Ross Vandegrift
ross kallisti us

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

