[Linux-cluster] cluster failure ...

Ivan Pantovic ivanp at yu.net
Sun Sep 17 23:36:44 UTC 2006


In the past couple of weeks we have been seeing the following:

The cluster fences a node for missing too many heartbeats, and the node
goes away. No other node in the cluster tries to acquire its part of the
lock. The fenced node does come back up and rejoins the cluster, but in
the meantime there is a lock on a shared fs, and it ends in a load so
high that nobody can log in.

> Sep 16 15:06:37 clu-V kernel: CMAN: node clu-III has been removed from 
> the cluster : Missed too many heartbeats
> Sep 16 15:09:07 clu-V kernel: CMAN: node clu-III rejoining
After a cluster restart everything is fine.

However, when I manually issue fence_node <nodename>, I do get these
messages of other nodes trying to acquire its part of the dlm.

> tail /var/log/messages
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Looking 
> at journal...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Trying 
> to acquire journal lock...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:mailbox.1: jid=4: Busy
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Busy
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Acquiring 
> the transaction lock...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replaying 
> journal...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replayed 
> 1 of 2 blocks
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: replays = 
> 1, skips = 1, sames = 0
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Journal 
> replayed in 1s
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Done
Has anyone had this kind of problem?

I should mention that this happened over the weekend or at night, when
there is no significant load on the cluster.
The GFS version is cvs-20060714.
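
To tell the two cases apart in /var/log/messages, something like the
following could be used. It is only a minimal sketch: the patterns are
copied from the log lines quoted above and may differ between CMAN/GFS
versions, and the function name find_unrecovered_removals is made up for
illustration. It reports nodes that were removed and then rejoined
without any GFS journal recovery attempt being logged in between.

```python
import re

# Patterns based on the log lines quoted in this post (assumption: other
# CMAN/GFS versions may word these messages differently).
REMOVED = re.compile(r"CMAN: node (\S+) has been removed from the cluster")
REJOIN = re.compile(r"CMAN: node (\S+) rejoining")
RECOVERY = re.compile(r"GFS: fsid=\S+ jid=\d+: Trying to acquire journal lock")


def find_unrecovered_removals(lines):
    """Return names of nodes that were removed and rejoined while no
    journal recovery attempt was logged between the two events."""
    pending = {}      # node name -> recovery attempt seen since removal?
    suspicious = []
    for line in lines:
        removed = REMOVED.search(line)
        if removed:
            pending[removed.group(1)] = False
            continue
        if RECOVERY.search(line):
            # The log does not say which node is being recovered, so
            # count it for every node that is currently removed.
            for node in pending:
                pending[node] = True
            continue
        rejoin = REJOIN.search(line)
        if rejoin and rejoin.group(1) in pending:
            if not pending.pop(rejoin.group(1)):
                suspicious.append(rejoin.group(1))
    return suspicious
```

It takes any iterable of unwrapped log lines, for example
find_unrecovered_removals(open("/var/log/messages")) on one of the
surviving nodes.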

-- 
Ivan Pantovic, System Engineer
-----
YUnet International  http://www.eunet.yu
Dubrovacka 35/III,   11000 Belgrade
Tel: +381 11 311 9901;  Fax: +381 11 311 9901; Mob: +381 63 302 288
-----