
Re: [Linux-cluster] SAN with GFS2 on RHEL 6 beta: STONITH right after start



OK, with the help of Andrew, I tried it again.

Some important logs from the problem:

~snip~

1317 Jul 28 00:46:31 pcmknode-1 corosync[2618]: [TOTEM ] A processor failed, forming new configuration.
1318 Jul 28 00:46:32 pcmknode-1 kernel: dlm: closing connection to node -1147763583
1319 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 624: memb=1, new=0, lost=1
1320 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] info: pcmk_peer_update: memb: pcmknode-1 3130426497
1321 Jul 28 00:46:32 pcmknode-1 corosync[2618]: [pcmk ] info: pcmk_peer_update: lost: pcmknode-2 3147203713

~snip~

1338 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: crm_update_peer: Node pcmknode-2: id=3147203713 state=lost (new) addr=r(0) ip(192.168.150.187) votes=1 born=620 seen=620 proc=00000000000000000000000000111312
1339 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: erase_node_from_join: Removed node pcmknode-2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
1340 Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: crm_update_quorum: Updating quorum status to false (call=45)

~snip~
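(A side note on reading these logs: the kernel's "closing connection to node -1147763583" and corosync's "lost: pcmknode-2 3147203713" refer to the same node; the kernel dlm simply prints the 32-bit node id as a signed integer. A quick illustration, in Python just for the arithmetic:)

```python
# corosync prints the node id of pcmknode-2 as an unsigned 32-bit value,
# while the kernel dlm prints the same id as a signed 32-bit value.
unsigned_id = 3147203713

# Reinterpret the unsigned 32-bit value as two's-complement signed.
signed_id = unsigned_id - 2**32 if unsigned_id >= 2**31 else unsigned_id

print(signed_id)  # -1147763583, matching the "closing connection" line
```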

1351 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: pe_fence_node: Node pcmknode-2 will be fenced because it is un-expectedly down
1352 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
1353 Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: determine_online_status: Node pcmknode-2 is unclean

~snip~
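(For context on the "Updating quorum status to false" line: in a two-node cluster the surviving node always loses quorum when its peer disappears, so pacemaker has to be told explicitly how to behave without quorum. On this stack that is usually done with a cluster property like the one below; whether this matches your configuration is an assumption on my part:)

```
# pacemaker cluster property, set via the crm shell; "ignore" is the
# value customarily used for two-node clusters on corosync 1.x:
crm configure property no-quorum-policy=ignore
```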




I then removed the LVM from /dev/sdb2 and created the GFS2 filesystem directly 
on /dev/sdb2 (without LVM). That did not solve the problem.

Starting corosync on only one node works fine; the GFS2 disk can even be 
mounted. But as soon as the GFS2 disk is mounted on the second node, that node 
gets fenced immediately. I set WebFSClone's target role to Stopped, and as soon 
as I manually started it again, the node got fenced.
Manually mounting the GFS2 disk (with mount -t gfs2 ...) on the second node also 
triggers the STONITH.

A word about my STONITH setup: it is SBD, running via /dev/sdb1. I took the 
cluster-glue SRPM from clusterlabs, extracted it, and manually compiled and 
installed only SBD (nothing else). So my system runs all its packages from 
these repositories: RHEL 6 beta, EPEL, and Clusterlabs.



I monitored the network traffic with tcpdump and analyzed it afterwards with 
Wireshark. The two DLM instances are communicating, but I don't know whether 
something goes wrong there. I see a packet going from pcmknode-2 to 
pcmknode-1 with the following content (as decoded by Wireshark; I omitted 
some lines that I think are not interesting, but can provide them if needed):
Command: message (1)
Message Type: lookup message (11)
External Flags: 0x08, Return the contents of the lock value block
Status: Unknown (0)
Granted Mode: invalid (-1)
Request Mode: exclusive (5)

And then the response from pcmknode-1 to pcmknode-2:
Command: message (1)
Message Type: request reply (5)
External Flags: 0x08, Return the contents of the lock value block
Status: granted (2)
Granted Mode: exclusive (5)
Request Mode: invalid (-1)

I wonder why pcmknode-1 replies "Granted Mode: exclusive" to pcmknode-2.
Immediately after the request reply, pcmknode-2 logs "Now mounting 
FS..." and gets fenced and shut down.
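(For reference, the mode numbers in these decodes are the standard DLM lock modes from the kernel's dlm.h; 5 is exclusive and -1 means "no mode". A small sketch of the mode table, plus the classic rule that an exclusive grant is compatible only with the null mode; the constant names are from dlm.h, the helper function is my own illustration:)

```python
# DLM lock modes as defined in the Linux kernel's dlm.h
DLM_LOCK_IV = -1  # invalid / no mode
DLM_LOCK_NL = 0   # null
DLM_LOCK_CR = 1   # concurrent read
DLM_LOCK_CW = 2   # concurrent write
DLM_LOCK_PR = 3   # protected read
DLM_LOCK_PW = 4   # protected write
DLM_LOCK_EX = 5   # exclusive

MODE_NAMES = {
    DLM_LOCK_IV: "invalid",
    DLM_LOCK_NL: "null",
    DLM_LOCK_CR: "concurrent read",
    DLM_LOCK_CW: "concurrent write",
    DLM_LOCK_PR: "protected read",
    DLM_LOCK_PW: "protected write",
    DLM_LOCK_EX: "exclusive",
}

# Classic (VMS-style) compatibility rule: EX is compatible only with NL,
# so a granted EX lock means no other node holds a real lock on that
# resource at the same time.
def compatible_with_ex(mode):
    return mode == DLM_LOCK_NL

print(MODE_NAMES[5])   # "exclusive", the Granted Mode in the reply
print(MODE_NAMES[-1])  # "invalid", the Request Mode in the reply
```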


So, could there be something wrong with the DLM?



Regards,
Benedikt

