[Linux-cluster] cluster failure ...
Ivan Pantovic
ivanp at yu.net
Sun Sep 17 23:36:44 UTC 2006
Over the past couple of weeks, the following has been happening:
The cluster fences a node for missing too many heartbeats, and the node goes
away. No other node in the cluster tries to acquire its part of the lock.
The fenced node comes back up and rejoins the cluster, but in the meantime a
lock is held on the shared filesystem, which ends in such high load that
nobody can log in.
> Sep 16 15:06:37 clu-V kernel: CMAN: node clu-III has been removed from the cluster : Missed too many heartbeats
> Sep 16 15:09:07 clu-V kernel: CMAN: node clu-III rejoining
After a cluster restart everything is fine.
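For anyone who wants to check how long a fenced node stays out before it rejoins, the CMAN lines can be scanned with a small script. This is only a rough sketch, assuming raw syslog lines in the `Mon DD HH:MM:SS host kernel: CMAN: ...` form and the message wording shown in the excerpt; the names `fence_gaps` and `parse_ts` and the dummy year are illustrative, not part of any cluster tool.

```python
import re
from datetime import datetime

# Patterns modelled on the CMAN messages quoted above; other CMAN versions
# may word these differently.
REMOVED = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) \S+ kernel: CMAN: node (\S+) has been removed from"
)
REJOIN = re.compile(r"^(\w{3} +\d+ [\d:]{8}) \S+ kernel: CMAN: node (\S+) rejoining")


def parse_ts(stamp):
    # Syslog lines carry no year. A fixed dummy leap year (2000) is assumed
    # purely so that differences between timestamps can be computed.
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


def fence_gaps(lines):
    """Return (node, seconds) for each removal later followed by a rejoin of that node."""
    removed_at = {}
    gaps = []
    for line in lines:
        m = REMOVED.match(line)
        if m:
            removed_at[m.group(2)] = parse_ts(m.group(1))
            continue
        m = REJOIN.match(line)
        if m and m.group(2) in removed_at:
            start = removed_at.pop(m.group(2))
            gaps.append((m.group(2), (parse_ts(m.group(1)) - start).total_seconds()))
    return gaps


log = [
    "Sep 16 15:06:37 clu-V kernel: CMAN: node clu-III has been removed from "
    "the cluster : Missed too many heartbeats",
    "Sep 16 15:09:07 clu-V kernel: CMAN: node clu-III rejoining",
]
print(fence_gaps(log))  # [('clu-III', 150.0)]
```

For the two lines above this gives 150 seconds between removal and rejoin. The same function can be fed an open file such as /var/log/messages line by line; it does not handle a log that spans a year boundary.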
However, when I manually issue fence_node <nodename>, I do get these messages
showing the other nodes trying to acquire the fenced node's part of the DLM:
> tail /var/log/messages
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Looking at journal...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Trying to acquire journal lock...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:mailbox.1: jid=4: Busy
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Busy
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Acquiring the transaction lock...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replaying journal...
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replayed 1 of 2 blocks
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: replays = 1, skips = 1, sames = 0
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Journal replayed in 1s
> Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Done
Has anyone had this kind of problem?
I should mention that this happened over the weekend or at night, when there
is no significant load on the cluster.
The GFS version is cvs-20060714.
--
Ivan Pantovic, System Engineer
-----
YUnet International http://www.eunet.yu
Dubrovacka 35/III, 11000 Belgrade
Tel: +381 11 311 9901; Fax: +381 11 311 9901; Mob: +381 63 302 288
-----