
[Linux-cluster] cluster failure ...



In the past couple of weeks we have been seeing the following problem.

The cluster fences a node for missing too many heartbeats, and the node goes away. No other node in the cluster tries to acquire its share of the locks. The fenced node does come back up and rejoins the cluster, but in the meantime there is a lock held on a shared filesystem, and it ends in a load so high that nobody can log in.

Sep 16 15:06:37 clu-V kernel: CMAN: node clu-III has been removed from the cluster : Missed too many heartbeats
Sep 16 15:09:07 clu-V kernel: CMAN: node clu-III rejoining
After a cluster restart everything is fine.
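
To see how often this happens and how long a node stays out, the removal and rejoin messages can be pulled out of syslog. This is only a minimal sketch: the regular expression is guessed from the two CMAN lines quoted above, the function names are made up for illustration, and the year has to be supplied by hand because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Pattern assumed from the two sample CMAN kernel messages above;
# other CMAN versions may word these messages differently.
CMAN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"CMAN:\s+node\s+(?P<node>\S+)\s+"
    r"(?P<event>has been removed from the cluster|rejoining)"
)


def parse_cman_events(lines, year):
    """Return (timestamp, reporting_host, node, kind) for each CMAN line."""
    events = []
    for line in lines:
        m = CMAN_RE.match(line)
        if not m:
            continue  # not a removal/rejoin message
        # Syslog omits the year, so the caller must provide it.
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        kind = "removed" if m.group("event").startswith("has been") else "rejoining"
        events.append((ts, m.group("host"), m.group("node"), kind))
    return events


def outage_durations(events):
    """Pair each removal with the next rejoin of the same node (seconds)."""
    pending = {}
    durations = []
    for ts, _host, node, kind in events:
        if kind == "removed":
            pending[node] = ts
        elif node in pending:
            durations.append((node, (ts - pending.pop(node)).total_seconds()))
    return durations


if __name__ == "__main__":
    sample = [
        "Sep 16 15:06:37 clu-V kernel: CMAN: node clu-III has been removed "
        "from the cluster : Missed too many heartbeats",
        "Sep 16 15:09:07 clu-V kernel: CMAN: node clu-III rejoining",
    ]
    # The year here is a placeholder, not taken from the log.
    print(outage_durations(parse_cman_events(sample, year=2006)))
    # prints [('clu-III', 150.0)]
```

Run against the full /var/log/messages of each node, this would show whether the removals cluster around particular times.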

On the other hand, when I manually issue fence_node <nodename>, I do get these messages from the other nodes trying to acquire their part of the DLM locks:

tail /var/log/messages
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Looking at journal...
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Trying to acquire journal lock...
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:mailbox.1: jid=4: Busy
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:shared.1: jid=4: Busy
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Acquiring the transaction lock...
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replaying journal...
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Replayed 1 of 2 blocks
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: replays = 1, skips = 1, sames = 0
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Journal replayed in 1s
Sep 18 01:22:02 clu-X kernel: GFS: fsid=mail:homes.1: jid=4: Done
Has anyone had this kind of problem?

I should mention that this happened over the weekend or at night, when there is no significant load on the cluster.
The GFS version is cvs-20060714.

--
Ivan Pantovic, System Engineer
-----
YUnet International  http://www.eunet.yu
Dubrovacka 35/III,   11000 Belgrade
Tel: +381 11 311 9901;  Fax: +381 11 311 9901; Mob: +381 63 302 288
-----
This  e-mail  is confidential and intended only for the recipient.
Unauthorized  distribution,  modification  or  disclosure  of  its
contents is prohibited. If you have received this e-mail in error,
please notify the sender by telephone  +381 11 311 9901.
-----

