[Linux-cluster] Freeze with cluster-2.03.11

Ben Yarwood ben.yarwood at juno.co.uk
Fri Mar 27 02:21:01 UTC 2009


Replaying a journal as shown below usually indicates, I believe, that a node
has withdrawn from that file system.  You should grep the messages log on all
nodes for 'GFS'; if any node is reporting errors for this fs, then it will
need rebooting/fencing before access to that fs can be restored.
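
The per-node check can be scripted. What follows is only a rough sketch, not a tested tool: the keyword list in SUSPECT is my guess at what a withdrawing or erroring GFS node might log (adjust it to what your version actually prints), and the function name is made up for illustration.

```python
import re

# Matches kernel lines such as "GFS: fsid=kfki:home.1: jid=3: Replaying journal..."
GFS_LINE = re.compile(r"GFS: fsid=(?P<fsid>\S+?): (?P<msg>.*)")

# ASSUMPTION: keywords that hint at trouble; not an authoritative list of GFS messages.
SUSPECT = re.compile(r"withdr|error|fatal|assert", re.IGNORECASE)


def suspect_gfs_lines(lines):
    """Return (fsid, message) pairs for GFS lines whose message looks like an error."""
    hits = []
    for line in lines:
        m = GFS_LINE.search(line)
        if m and SUSPECT.search(m.group("msg")):
            hits.append((m.group("fsid"), m.group("msg").strip()))
    return hits
```

Feed it the lines of the messages log on each node (commonly /var/log/messages, but that path depends on your syslog setup). Any fsid it reports on a node is a hint that the node may need fencing.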

Ben


-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsef
Sent: 26 March 2009 22:47
To: linux clustering
Subject: [Linux-cluster] Freeze with cluster-2.03.11

Hi,

Freshly built cluster-2.03.11 reproducibly freezes as soon as mailman is started.
The versions are:

linux-2.6.27.21
cluster-2.03.11
openais from svn, subrev 1152 version 0.80
LVM2.2.02.44

This is a five-node cluster which was just upgraded from cluster-2.01.00,
node by node. All nodes went fine except the last one, which runs the
mailman queue manager: after the upgrade, as soon as the manager is
started, the system freezes completely. There is no error message on the
screen or in the kernel log. The system responds to ping, but that's all;
nothing can be done at the console except rebooting. Usually, when this
node is fenced off, the fencing node freezes as well shortly afterwards.
What I could find in the kernel log of this second machine is as follows:

Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to acquire journal lock...
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Trying to acquire journal lock...
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Looking at journal...
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Looking at journal...
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Acquiring the transaction lock...
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Acquiring the transaction lock...
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replaying journal...
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replaying journal...
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replayed 65 of 85 blocks
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: replays = 65, skips = 12, sames = 8
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replayed 888 of 994 blocks
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: replays = 888, skips = 66, sames = 40
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Journal replayed in 1s
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Done

Does this indicate anything that could help to fix the cluster?

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster