Oops! Here is the cluster.conf file.

On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo ttmail com> writes:
> We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> single DS4700 SAN with IBM 2005-B16 fence devices. The system is
> configured as a high-availability system for database systems. We are
> facing serious non-deterministic problems (they can happen anywhere, at
> any time, without a single clue).
>
> One of the most frequently recurring problems is fence_tool related.
>
>   # service cman start
>   Starting cluster:
>      Loading modules... done
>      Mounting configfs... done
>      Starting ccsd... done
>      Starting cman... done
>      Starting daemons... done
>      Starting fencing... fence_tool: can't communicate with fenced -1
>
>   # fenced -D
>   1204556546 cman_init error 0 111
>
>   # clustat
>   CMAN is not running.
>
>   # cman_tool join
>
>   # clustat
>   msg_open: Connection refused
>   Member Status: Quorate
>     Member Name        ID   Status
>     ------ ----        ---- ------
>     mobilizc1          1    Online, Local
>     mobilizc2          2    Offline
>
>   # groupd -D
>   1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
>   1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
>   1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
>   1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
>
> Sometimes this problem gets solved if the two machines are rebooted at
> the same time. But in the current HA configuration, I cannot guarantee
> that the two systems will be rebooted at the same time for every problem
> we face. At least one of them should start without a problem.
>
> Moreover, we were facing problems with rgmanager. Below are the
> related /var/log/messages lines:
>
>   kernel: clurgmgrd: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
>   clurgmgrd: <crit> Watchdog: Daemon died, rebooting...
>
> We contacted our RH support and they asked us for a clurgmgrd
> backtrace. But unfortunately, we couldn't manage to start the cman
> service to be able to start clurgmgrd. (You are asking why we couldn't
> start cman? Really dunno. Same "fence_tool: can't communicate with
> fenced -1" problem. As I said previously, it sometimes works, sometimes
> doesn't.) Later, they sent us a new, not-yet-released
> rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we managed to
> start cman on both machines and then started the rgmanager service with
> this new rgmanager RPM. (Couldn't reproduce the clurgmgrd SegFault.) And
> this solved the clurgmgrd SegFault problem. But we are still having
> "can't communicate with fenced -1" errors occasionally.
>
> Sorry for the long post, but I am trying to help the people who will try
> to help figure out the problem. I also attach my cluster.conf file with
> the post. Any kind of help will be really, really appreciated! Thanks so
> much for your kind interest in reading this far.
>
>
> Regards.
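
For anyone comparing notes: the groupd debug output quoted above names the leftover kernel objects directly on its "found uncontrolled kernel object" lines. Below is a minimal sketch that pulls those names out of saved debug output. It assumes the output has been captured to a file first; the function name uncontrolled_objects and the file name groupd-debug.log are hypothetical and not part of the cluster suite.

```shell
#!/bin/sh
# Hypothetical helper, not part of the cluster suite.
# Reads saved `groupd -D` debug output on stdin and prints the name of each
# kernel object reported as "uncontrolled", one per line.
uncontrolled_objects() {
    # Match lines of the form:
    #   <timestamp> found uncontrolled kernel object <name> in <path>
    # and print only <name>; all other lines are dropped.
    sed -n 's/^[0-9][0-9]* found uncontrolled kernel object \([^ ][^ ]*\) in .*$/\1/p'
}

# Example (file name is an assumption):
#   uncontrolled_objects < groupd-debug.log
```

This only reports what groupd already printed. It does not clear anything, and per the last groupd line the node still has to be reset.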