[Linux-cluster] Instability troubles



Hi all,

I'm having some major stability problems with my three-node CS/GFS cluster. Every two or three days, one of the nodes fences another, and I have to hard-reboot the entire cluster to recover. I have had this happen twice today. I don't know what's triggering the fencing, since all the nodes appear to me to be up and running when it happens. In fact, I was logged on to node3 just now, running 'top', when node2 fenced it.

When the nodes come back up, they don't automatically mount their GFS filesystems, even with "_netdev" specified as a mount option; however, whichever node comes up first mounts them all as part of bringing the services up.

I did notice a couple of disconcerting things earlier today. First, I was running "watch clustat" (I prefer to see the time updating, which I can't with "clustat -i"). At one point, "clustat" crashed as follows:

Jan 2 15:19:54 node2 kernel: clustat[17720]: segfault at 0000000000000024 rip 0000003629e75bc0 rsp 00007fff18827178 error 4
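
For anyone reading that log line: as far as I understand it, the trailing "error 4" is the x86 page-fault error code (printed in hex on this kernel), so it can be decoded mechanically. Below is a minimal Python sketch of that decoding. The function name decode_segfault and the regular expression are my own illustration, not part of any cluster tool, and the bit meanings assume the standard x86 page-fault layout.

```python
import re

# Low three bits of the x86 page-fault error code (assumed standard layout).
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access,      1: write access
PF_USER = 0x4   # 0: kernel mode,      1: user mode

# Matches the 2.6.18-era x86_64 "segfault at ... rip ... rsp ... error N" format.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the error code is treated as hex
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "address": int(m.group("addr"), 16),
        "cause": "protection violation" if err & PF_PROT else "page not present",
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
    }

line = ("Jan 2 15:19:54 node2 kernel: clustat[17720]: segfault at "
        "0000000000000024 rip 0000003629e75bc0 rsp 00007fff18827178 error 4")
info = decode_segfault(line)
print(info)
```

If I'm reading it right, this is a user-mode read of an unmapped page at address 0x24, which would be consistent with clustat dereferencing a NULL pointer at a small offset, though I can't confirm that from the log alone.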

Shortly thereafter, clustat reported node3 as "Online, Estranged, rgmanager". Can anyone shed light on what that means? Google's not telling me much.

At the moment, all three nodes are running CentOS 5.1, with kernel 2.6.18-53.1.4.el5. Can anyone point me in the right direction to resolve these problems? I wasn't having trouble like this when I was running a CentOS 4 CS/GFS cluster. Is it possible to downgrade from CentOS 5 CS/GFS to CentOS 4, presumably via a full rebuild of all the nodes? Or should I instead consider setting up a single node to mount the GFS filesystems and serve them out, to get around these fencing issues?

Thanks,

James

