
RE: [Linux-cluster] Node fencing problem



Check whether the fence agent is configured correctly. Run this command to see whether it shuts down the node in question:

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work, check whether access is enabled. By default, the telnet interface on the DRAC is disabled. To enable it, you will need the racadm command from the racser-devel RPM available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm racreset


-----Original Message-----
From: linux-cluster-bounces redhat com [mailto:linux-cluster-bounces redhat com] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas
-----Original Message-----
From: linux-cluster-bounces redhat com [mailto:linux-cluster-bounces redhat com] On Behalf Of Jeremy Carroll
Sent: Wednesday, August 22, 2007 15:47
To: linux clustering
Subject: RE: [Linux-cluster] Node fencing problem

What type of fencing method are you using on your cluster?

Also, can you run "cman_tool services" on both nodes to make sure fenced is running?

-----Original Message-----
From: linux-cluster-bounces redhat com [mailto:linux-cluster-bounces redhat com] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster redhat com
Subject: [Linux-cluster] Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I'm simulating a network failure on a node I expect it to be fenced by the other node but that doesn't happen for some reason:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 would detect that test-db1 failed
2. test-db1 get fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which will probably result in filesystem corruption.

I think the fencedevice part of cluster.conf is correct, since nodes are sometimes fenced at cluster startup when one node doesn't join fast enough.
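For comparison, a typical fence_drac setup in cluster.conf ties each clusternode to a fencedevice entry roughly like this (a sketch, not the actual config from the pastebin link: the device name and method name are placeholders, and the IP/login are taken from the fence_drac command earlier in the thread):

```xml
<clusternode name="test-db1" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <!-- "drac-db1" must match a fencedevice name below -->
      <device name="drac-db1"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice agent="fence_drac" name="drac-db1"
               ipaddr="10.100.2.40" login="testdb" passwd="..."/>
</fencedevices>
```

If the agent works when run by hand but fencing still never fires, the mismatch is usually between the device name referenced in the clusternode's method block and the fencedevice name.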

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster



