[Linux-cluster] gfs problem during shutdown of master lock server

Mon Feb 21 14:21:13 UTC 2005

Hi,

We've been using Red Hat GFS 6.0.0-1.2 for a few months now with
generally very good reliability.  We have the system configured with 5
lock servers, 2 of which also physically mount the shared storage
(connected via shared SCSI on an HP DL380 packaged cluster).  We've
taken down slave lock servers before without incident, but recently we
rebooted the master lock server and the filesystem became inaccessible
from the other server.  The logs indicate what _seems_ to be a
successful failover of the master lock server role to one of the other
nodes, but then there are messages indicating that a request for a
journal lock on the filesystem returns 'Busy'.

We then took all the servers down and booted up one of the filesystem
mounting nodes and all was well, no need for fsck.

The 5 nodes are as follows:
cluster1 (lock + fs node, was master, shutdown initiated)
cluster2 (lock + fs node, becomes master on cluster1 shutdown)
lvs1 (lock node)
lvs2 (lock node)
intra4 (lock node)

The locking is done with pool on two volumes, pool_home and
pool_shared.  ccsd uses the shared storage for the cluster config info
on cluster1 and cluster2, and local files on the rest of the lock
servers.  If more information is needed, I'd be happy to post full
config details.
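
As a sanity check on the quorum arithmetic, here is a minimal sketch.
It assumes lock_gulmd needs a simple majority of the configured lock
servers, which is consistent with the "Have 2, need 3 for quorum" line
in the log below.  The helper name quorum_needed is hypothetical and
not part of GFS.

```python
def quorum_needed(num_lock_servers: int) -> int:
    """Smallest number of servers that forms a strict majority.

    Assumption: quorum is a simple majority of configured lock servers.
    """
    return num_lock_servers // 2 + 1


# The five lock servers listed above.
lock_servers = ["cluster1", "cluster2", "lvs1", "lvs2", "intra4"]

needed = quorum_needed(len(lock_servers))
print(f"{len(lock_servers)} lock servers, quorum needs {needed}")
```

Under that assumption, losing cluster1 leaves 4 of 5 lock servers, so
quorum should still be reachable once the slaves log in to the new
master.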

Here are the messages in the logs on cluster2 around the time of the
shutdown.  Any help on how to prevent this lockup from happening (or
pointers on what I'm doing wrong) would be appreciated!

Thanks!

-------------------------
Feb 18 17:02:24 cluster2 kernel: lock_gulm: Checking for journals for
node "cluster1.sonitrol.net"
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: Master Node has logged
out.
Feb 18 17:02:24 cluster2 kernel: lock_gulm: Checking for journals for
node "cluster1.sonitrol.net"
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: ERROR [core_io.c:1029]
Got error from reply: (cluster1.sonitrol.net:172.16.6.131) 1:Unknown
GULM Err
Feb 18 17:02:24 cluster2 lock_gulmd_core[1345]: ERROR [core_io.c:1034]
Errors on xdr: (cluster1.sonitrol.net:172.16.6.131) -104:104:Connection
reset by peer
Feb 18 17:02:33 cluster2 lock_gulmd_core[1345]: I see no Masters, So I
am Arbitrating until enough Slaves talk to me.
Feb 18 17:02:33 cluster2 lock_gulmd_LTPX[1351]: New Master at
cluster2.sonitrol.net:172.16.6.132
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: lvs1.sonitrol.net missed
a heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: lvs2.sonitrol.net missed
a heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:48 cluster2 lock_gulmd_core[1345]: intra4 missed a
heartbeat (time:1108764168893978 mb:1)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Still in Arbitrating:
Have 2, need 3 for quorum.
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: New Client: idx:5 fd:10
from (172.16.6.150:intra4)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message
Logged in about intra4 to lvs1.sonitrol.net is lost because node is in
OM
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message
Logged in about intra4 to lvs2.sonitrol.net is lost because node is in
OM
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Now have Slave quorum,
going full Master.
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: New Client: idx:6 fd:11
from (172.16.6.231:lvs2.sonitrol.net)
Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Member update message
Logged in about lvs2.sonitrol.net to lvs1.sonitrol.net is lost because
node is in OM
Feb 18 17:02:49 cluster2 lock_gulmd_LTPX[1351]: Logged into LT000 at
cluster2.sonitrol.net:172.16.6.132
Feb 18 17:02:49 cluster2 lock_gulmd_LTPX[1351]: Finished resending to
LT000
Feb 18 17:02:50 cluster2 lock_gulmd_LT000[1348]: New Client: idx 2 fd 7
from (172.16.6.132:cluster2.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_LT000[1348]: New Client: idx 3 fd 8
from (172.16.6.231:lvs2.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_core[1345]: New Client: idx:7 fd:12
from (172.16.6.230:lvs1.sonitrol.net)
Feb 18 17:02:50 cluster2 lock_gulmd_core[1345]: Timeout (15000000) on
fd:5 (cluster1.sonitrol.net:172.16.6.131)
Feb 18 17:02:52 cluster2 lock_gulmd_LT000[1348]: Attached slave
lvs2.sonitrol.net:172.16.6.231 idx:4 fd:9 (soff:3 connected:0x8)
Feb 18 17:02:52 cluster2 lock_gulmd_LT000[1348]: New Client: idx 5 fd 10
from (172.16.6.230:lvs1.sonitrol.net)
Feb 18 17:02:54 cluster2 lock_gulmd_LT000[1348]: Attached slave
lvs1.sonitrol.net:172.16.6.230 idx:6 fd:11 (soff:2 connected:0xc)
Feb 18 17:02:54 cluster2 lock_gulmd_LT000[1348]: New Client: idx 7 fd 12
from (172.16.6.150:intra4)
Feb 18 17:03:04 cluster2 lock_gulmd_LT000[1348]: Attached slave
intra4:172.16.6.150 idx:8 fd:13 (soff:1 connected:0xe)
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:05:03 cluster2 lock_gulmd_core[1345]: New Client: idx:1 fd:5
from (172.16.6.131:cluster1.sonitrol.net)
Feb 18 17:05:05 cluster2 lock_gulmd_LT000[1348]: Attached slave
cluster1.sonitrol.net:172.16.6.131 idx:9 fd:14 (soff:0 connected:0xf)
Feb 18 17:05:05 cluster2 lock_gulmd_LT000[1348]: New Client: idx 10 fd
15 from (172.16.6.131:cluster1.sonitrol.net)
Feb 18 17:05:06 cluster2 ypserv[1395]: refused connect from
172.16.6.131:753 to procedure ypproc_all (LTSP,auto.master;-4)
Feb 18 17:05:55 cluster2 login(pam_unix)[1828]: session opened for user
root by LOGIN(uid=0)
Feb 18 17:05:55 cluster2  -- root[1828]: ROOT LOGIN ON tty2
Feb 18 17:06:02 cluster2 lock_gulmd_core[1345]: "cluster1.sonitrol.net"
is logged out. fd:5
Feb 18 17:06:02 cluster2 kernel: lock_gulm: Checking for journals for
node "cluster1.sonitrol.net"
Feb 18 17:06:02 cluster2 lock_gulmd_LT000[1348]: EOF on xdr
(cluster1.sonitrol.net:172.16.6.131 idx:10 fd:15)
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Trying
to acquire journal lock...
Feb 18 17:06:02 cluster2 kernel: GFS: fsid=alpha:home.1: jid=0: Busy

-----------------------------
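
In case it helps anyone reading the log, here is a rough sketch for
pulling only the GFS journal lock messages out of a syslog file.  It
assumes each message is on a single line (the excerpt above is wrapped
by my mail client) and that the timestamp is the first 15 characters.
The function name journal_events is hypothetical.

```python
import re

# Matches kernel messages such as:
#   "... kernel: GFS: fsid=alpha:shared.1: jid=0: Busy"
GFS_RE = re.compile(r"GFS: fsid=(?P<fsid>\S+): jid=(?P<jid>\d+): (?P<msg>.+)$")


def journal_events(lines):
    """Return (timestamp, fsid, jid, message) for each GFS journal line."""
    events = []
    for line in lines:
        line = line.rstrip("\n")
        match = GFS_RE.search(line)
        if match:
            events.append(
                (
                    line[:15],  # syslog timestamp, e.g. "Feb 18 17:03:04"
                    match.group("fsid"),
                    int(match.group("jid")),
                    match.group("msg").strip(),
                )
            )
    return events


sample = [
    "Feb 18 17:03:04 cluster2 kernel: GFS: fsid=alpha:shared.1: jid=0: Busy",
    "Feb 18 17:02:49 cluster2 lock_gulmd_core[1345]: Now have Slave quorum, going full Master.",
]
for event in journal_events(sample):
    print(event)
```

Filtering for events whose message is "Busy" shows the same journal
(jid=0) being refused on both filesystems each time recovery is
attempted.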

-- 
----------------------------------
Marc Swanson, Software Engineer
Sonitrol Communications Corp.
Hartford, CT

Email: mswanson at sonitrol.net
Phone: (860) 616-7036
Pager: (860) 948-6713
 Cell: (603) 512-1267
  Fax: (860) 616-7589
----------------------------------