[Linux-cluster] test hung after 36 hours

Daniel McNeil daniel at osdl.org
Tue Apr 12 00:13:06 UTC 2005


I started my mount/tar/rm tests on Apr  4 17:41 and hit
a problem at Apr  6 05:30, so the test ran for about 36 hours.
cl030 and cl031 were getting "SM: process_reply invalid"
messages, and cl032 removed the other two nodes with "Missed too
many heartbeats" and "No response to messages".
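
For reference, the run time can be checked from the two timestamps above. This is only a small sketch; the variable names are made up for illustration and the year is taken from the log marks.

```python
from datetime import datetime

# Start and failure times as given in the message (year from the log marks).
start = datetime(2005, 4, 4, 17, 41)
failure = datetime(2005, 4, 6, 5, 30)

elapsed = failure - start
hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
minutes = remainder // 60

# Just under 36 hours between test start and the reported problem.
print(f"{hours}h {minutes}m")
```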


cl032:
[-- MARK -- Wed Apr  6 05:15:00 2005]
CMAN: removing node cl030a from the cluster : Missed too many heartbeats
CMAN: removing node cl031a from the cluster : No response to messages
CMAN: quorum lost, blocking activity
[-- MARK -- Wed Apr  6 05:30:00 2005]
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"

cl030:
[-- MARK -- Wed Apr  6 05:15:00 2005]
CMAN: removing node cl032a from the cluster : Missed too many heartbeats
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
GFS: fsid=gfs_cluster:stripefs.0: Joined cluster. Now mounting FS...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Done
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295

cl031:
[-- MARK -- Wed Apr  6 05:15:00 2005]
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
SM: process_reply invalid id=20505 nodeid=4294967295
GFS: fsid=gfs_cluster:stripefs.1: Joined cluster. Now mounting FS...
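
One thing that stands out in the SM lines: nodeid=4294967295 is 2^32 - 1, which is what -1 looks like when printed as an unsigned 32-bit value. Whether SM actually uses -1 to mean "no node" is an assumption here, not something the logs confirm. A minimal sketch for tallying these messages per id follows; the function name is hypothetical and the sample lines are a subset copied from the cl031 log above.

```python
import re
from collections import Counter

# Matches the SM message format seen in the logs above.
SM_INVALID = re.compile(r"SM: process_reply invalid id=(\d+) nodeid=(\d+)")

def tally_invalid_replies(lines):
    """Count 'process_reply invalid' messages per id and collect the nodeids seen."""
    per_id = Counter()
    nodeids = set()
    for line in lines:
        match = SM_INVALID.search(line)
        if match:
            per_id[int(match.group(1))] += 1
            nodeids.add(int(match.group(2)))
    return per_id, nodeids

# Subset of the cl031 console output, for illustration only.
sample = [
    "SM: process_reply invalid id=20496 nodeid=4294967295",
    "SM: process_reply invalid id=20496 nodeid=4294967295",
    "SM: process_reply invalid id=20496 nodeid=4294967295",
    "SM: process_reply invalid id=20497 nodeid=4294967295",
    'GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"',
]

per_id, nodeids = tally_invalid_replies(sample)
print(dict(per_id))

# True if every nodeid seen is the all-ones 32-bit value (i.e. -1 as unsigned).
print(nodeids == {2**32 - 1})
```

Running the same function over the full console logs from each node would show whether every invalid reply carries that same nodeid.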

A bit more info is available here:
http://developer.osdl.org/daniel/GFS/test.04apr2005/

Any ideas on what is going on?

Daniel





