
Re: [Linux-cluster] pull plug on node, service never relocates





On Mon, May 17, 2010 at 4:56 PM, Corey Kovacs <corey kovacs gmail com> wrote:
The service scripts you have in the config above look made up. Are
those scripts you wrote, or are you actually using SysV init scripts?

I wrote the resource scripts. They all respond to {start|status|stop} as necessary.
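For reference, here is a minimal skeleton of such a resource script (a sketch only; the real /ha/bin scripts are not shown in this thread, and the daemon commands below are placeholders). The important part is the exit status: rgmanager treats a non-zero return from "status" as a failure and recovers the service, and "stop" must return 0 or the service is marked failed.

```shell
#!/bin/sh
# Sketch of an rgmanager-compatible <script> resource.
# rgmanager invokes the file with a single argument (start|stop|status);
# the exit status is what matters, not the output.

service_action() {
    case "$1" in
        start)
            # launch the managed daemon here (placeholder)
            echo "starting"
            ;;
        stop)
            # shut the daemon down here; must exit 0 on success (placeholder)
            echo "stopping"
            ;;
        status)
            # return 0 only if the daemon is actually healthy (placeholder)
            echo "running"
            ;;
        *)
            echo "usage: $0 {start|stop|status}" >&2
            return 1
            ;;
    esac
}

service_action "${1:-status}"
```

Anything more elaborate (pid-file checks, retries) goes inside the case arms; the shape above is all rgmanager requires.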
 
Also, can you include a complete log segment? It's quite hard to debug
someone's problem with only partial information.

Here's a segment with the APC PDU as the fencing device:
May 12 10:50:00 c1n2 openais[26524]: [TOTEM] The token was lost in the OPERATIONAL state.
May 12 10:50:00 c1n2 openais[26524]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 12 10:50:00 c1n2 openais[26524]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 12 10:50:00 c1n2 openais[26524]: [TOTEM] entering GATHER state from 2.
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering GATHER state from 0.
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Saving state aru 2c3 high seq received 2c3
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Storing new sequence id for ring ae4
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering COMMIT state.
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering RECOVERY state.
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [0] member 192.168.1.103:
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [1] member 192.168.1.104:
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [2] member 192.168.1.105:
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] position [3] member 192.168.1.102:
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] previous ring seq 2784 rep 192.168.1.103
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] aru 2c3 high delivered 2c3 received flag 1
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] Did not need to originate any messages in recovery.
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] CLM CONFIGURATION CHANGE
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] New Configuration:
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.103) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.104) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.105) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.102) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] Members Left:
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.101) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] Members Joined:
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] CLM CONFIGURATION CHANGE
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] New Configuration:
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.103) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.104) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.105) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.102) 
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] Members Left:
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] Members Joined:
May 12 10:50:05 c1n2 openais[26524]: [SYNC ] This node is within the primary component and will provide service.
May 12 10:50:05 c1n2 openais[26524]: [TOTEM] entering OPERATIONAL state.
May 12 10:50:05 c1n2 kernel: dlm: closing connection to node 1
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.103
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.104
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.105
May 12 10:50:05 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.102
May 12 10:50:05 c1n2 openais[26524]: [CPG  ] got joinlist message from node 5
May 12 10:50:05 c1n2 openais[26524]: [CPG  ] got joinlist message from node 2
May 12 10:50:05 c1n2 openais[26524]: [CPG  ] got joinlist message from node 3
May 12 10:50:05 c1n2 openais[26524]: [CPG  ] got joinlist message from node 4
May 12 10:50:08 c1n2 fenced[26544]: 192.168.1.101 not a cluster member after 3 sec post_fail_delay
May 12 10:50:08 c1n2 fenced[26544]: fencing node "192.168.1.101"
May 12 10:50:12 c1n2 fenced[26544]: fence "192.168.1.101" success
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering GATHER state from 11.
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Saving state aru 3e high seq received 3e
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Storing new sequence id for ring ae8
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering COMMIT state.
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering RECOVERY state.
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [0] member 192.168.1.103:
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [1] member 192.168.1.104:
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [2] member 192.168.1.105:
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [3] member 192.168.1.101:
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.101
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru a high delivered a received flag 1
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] position [4] member 192.168.1.102:
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] previous ring seq 2788 rep 192.168.1.103
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] aru 3e high delivered 3e received flag 1
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] Did not need to originate any messages in recovery.
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] CLM CONFIGURATION CHANGE
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] New Configuration:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.103) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.104) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.105) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.102) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] Members Left:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] Members Joined:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] CLM CONFIGURATION CHANGE
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] New Configuration:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.103) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.104) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.105) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.101) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.102) 
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] Members Left:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] Members Joined:
May 12 10:54:04 c1n2 openais[26524]: [CLM  ]     r(0) ip(192.168.1.101) 
May 12 10:54:04 c1n2 openais[26524]: [SYNC ] This node is within the primary component and will provide service.
May 12 10:54:04 c1n2 openais[26524]: [TOTEM] entering OPERATIONAL state.
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.103
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.104
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.105
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.101
May 12 10:54:04 c1n2 openais[26524]: [CLM  ] got nodejoin message 192.168.1.102
May 12 10:54:04 c1n2 openais[26524]: [CPG  ] got joinlist message from node 5
May 12 10:54:04 c1n2 openais[26524]: [CPG  ] got joinlist message from node 2
May 12 10:54:04 c1n2 openais[26524]: [CPG  ] got joinlist message from node 3
May 12 10:54:04 c1n2 openais[26524]: [CPG  ] got joinlist message from node 4

Note in the log above that the APC PDU reported back to node2 (192.168.1.102), and node2 logged, that fencing was successful.
Also note that no relocation occurred for the service node1 had been running during the four minutes it took node1 to come back online.

Here's another log segment, taken after removing the APC PDU and substituting fence_manual as the fencing device:
May 18 11:34:12 c1n2 root: MARK I begin test. doing ifcfg eth0 down && ifcfg eth1 down on node c1n1 
May 18 11:35:03 c1n2 openais[25546]: [TOTEM] The token was lost in the OPERATIONAL state.
May 18 11:35:03 c1n2 openais[25546]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 18 11:35:03 c1n2 openais[25546]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 18 11:35:03 c1n2 openais[25546]: [TOTEM] entering GATHER state from 2.
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering GATHER state from 0.
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Saving state aru 1ec high seq received 1ec
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Storing new sequence id for ring c5c
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering COMMIT state.
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering RECOVERY state.
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [0] member 192.168.1.103:
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [1] member 192.168.1.104:
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [2] member 192.168.1.105:
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] position [3] member 192.168.1.102:
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] previous ring seq 3160 rep 192.168.1.103
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] aru 1ec high delivered 1ec received flag 1
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] Did not need to originate any messages in recovery.
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] CLM CONFIGURATION CHANGE
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] New Configuration:
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.103) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.104) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.105) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.102) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] Members Left:
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.101) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] Members Joined:
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] CLM CONFIGURATION CHANGE
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] New Configuration:
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.103) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.104) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.105) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.102) 
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] Members Left:
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] Members Joined:
May 18 11:35:08 c1n2 openais[25546]: [SYNC ] This node is within the primary component and will provide service.
May 18 11:35:08 c1n2 openais[25546]: [TOTEM] entering OPERATIONAL state.
May 18 11:35:08 c1n2 kernel: dlm: closing connection to node 1
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.103
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.104
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.105
May 18 11:35:08 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.102
May 18 11:35:08 c1n2 openais[25546]: [CPG  ] got joinlist message from node 4
May 18 11:35:08 c1n2 openais[25546]: [CPG  ] got joinlist message from node 5
May 18 11:35:08 c1n2 openais[25546]: [CPG  ] got joinlist message from node 2
May 18 11:35:08 c1n2 openais[25546]: [CPG  ] got joinlist message from node 3
May 18 11:35:11 c1n2 fenced[25566]: 192.168.1.101 not a cluster member after 3 sec post_fail_delay
May 18 11:35:11 c1n2 fenced[25566]: fencing node "192.168.1.101"
May 18 11:35:11 c1n2 fence_manual: Node 192.168.1.101 needs to be reset before recovery can procede.  Waiting for 192.168.1.101 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 192.168.1.101)
May 18 11:37:30 c1n2 ccsd[25540]: Attempt to close an unopened CCS descriptor (5280).
May 18 11:37:30 c1n2 ccsd[25540]: Error while processing disconnect: Invalid request descriptor
May 18 11:37:30 c1n2 fenced[25566]: fence "192.168.1.101" success
May 18 11:41:31 c1n2 root: MARK II node c1n1 up now, no service relocation of service core1 occurred
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering GATHER state from 11.
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Saving state aru 3f high seq received 3f
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Storing new sequence id for ring c60
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering COMMIT state.
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering RECOVERY state.
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [0] member 192.168.1.103:
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [1] member 192.168.1.104:
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [2] member 192.168.1.105:
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [3] member 192.168.1.101:
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.101
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru a high delivered a received flag 1
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] position [4] member 192.168.1.102:
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] previous ring seq 3164 rep 192.168.1.103
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] aru 3f high delivered 3f received flag 1
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] Did not need to originate any messages in recovery.
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] CLM CONFIGURATION CHANGE
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] New Configuration:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.103) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.104) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.105) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.102) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] Members Left:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] Members Joined:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] CLM CONFIGURATION CHANGE
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] New Configuration:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.103) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.104) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.105) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.101) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.102) 
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] Members Left:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] Members Joined:
May 18 11:41:41 c1n2 openais[25546]: [CLM  ]     r(0) ip(192.168.1.101) 
May 18 11:41:41 c1n2 openais[25546]: [SYNC ] This node is within the primary component and will provide service.
May 18 11:41:41 c1n2 openais[25546]: [TOTEM] entering OPERATIONAL state.
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.103
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.104
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.105
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.101
May 18 11:41:41 c1n2 openais[25546]: [CLM  ] got nodejoin message 192.168.1.102
May 18 11:41:41 c1n2 openais[25546]: [CPG  ] got joinlist message from node 5
May 18 11:41:41 c1n2 openais[25546]: [CPG  ] got joinlist message from node 2
May 18 11:41:41 c1n2 openais[25546]: [CPG  ] got joinlist message from node 3
May 18 11:41:41 c1n2 openais[25546]: [CPG  ] got joinlist message from node 4
May 18 11:41:48 c1n2 kernel: dlm: connecting to 1

I wonder what these two lines from that segment indicate:
May 18 11:37:30 c1n2 ccsd[25540]: Attempt to close an unopened CCS descriptor (5280).
May 18 11:37:30 c1n2 ccsd[25540]: Error while processing disconnect: Invalid request descriptor

Really, I don't want to hear "It should work, I've done everything
right" since clearly something is wrong. I, like many people here,
have built several if not dozens of these clusters, and we make
suggestions based on where we have seen the most problems.

OK, but this is my seventh cluster. All of my previous clusters, built on RHEL5U3 (this is the first I've built on RHEL5U4), functioned perfectly. I appreciate the suggestions; I'm just saying that I'd already stated three times that the APC unit is fencing properly, and I have tested with fence_tool across all nodes.
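To make that test concrete: fence_node(8) drives the same agent and options that fenced itself uses, so running it by hand from a surviving node exercises the real fencing path. A sketch (the node IPs are this cluster's; the loop below only prints what would be run, rather than actually power-cycling anything):

```shell
# fence_node runs the configured agent exactly as fenced would, so
# invoking it manually from a surviving node is a faithful test, e.g.:
#
#   fence_node 192.168.1.101
#
# Dry-run: list the command for every node instead of fencing it.
cmds=$(for ip in 192.168.1.101 192.168.1.102 192.168.1.103 \
                 192.168.1.104 192.168.1.105; do
    echo "would run: fence_node $ip"
done)
echo "$cmds"
```

Running the real fence_node against each node in turn (and watching it power-cycle) rules out per-node agent or credential problems that a single test would miss.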
 

It's always funny how the software engineers are never to blame for
their software not working until I prove to them that it's their
fault. I'm not trying to be a jerk or point fingers, but open your mind a bit.

My mind is open. I keep saying that the APC PDU is successfully fencing, I keep getting asked whether it is, and I keep reporting that it is.
 

Sometimes fencing can be hindered simply by being logged into the device while
the cluster is trying to talk to it. First-generation iLOs and some APC
firmware have problems with this.

Understood - I've been making sure to log out when testing.
 


What's the output of ...

cman_tool status
# cman_tool status
Version: 6.2.0
Config Version: 102
Cluster Name: hpss1
Cluster Id: 3299
Cluster Member: Yes
Cluster Generation: 3168
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 9
Flags: Dirty
Ports Bound: 0 11 177
Node name: 192.168.222.86
Node ID: 2
Multicast addresses: 239.192.12.239
Node addresses: 192.168.2.86

 
cman_tool nodes
  cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   3168   2010-05-18 11:41:41  192.168.1.101
   2   M   3140   2010-05-18 11:32:12  192.168.1.102
   3   M   3156   2010-05-18 11:32:12  192.168.1.103
   4   M   3156   2010-05-18 11:32:12  192.168.1.104
   5   M   3160   2010-05-18 11:32:12  192.168.1.105

cman_tool services
  cman_tool services
type             level name              id       state
fence            0     default           00010004 none
[1 2 3 4 5]
dlm              1     clvmd             00010005 none
[1 2 3 4 5]
dlm              1     rgmanager         00020004 none
[1 2 3 4 5]
dlm              1     m1_hpssSource     00020002 none
[2]
dlm              1     m1_varHpss        00040002 none
[2]
dlm              1     m1_varHpssAdmCor  00060002 none
[2]
gfs              2     m1_hpssSource     00010002 none
[2]
gfs              2     m1_varHpss        00030002 none
[2]
gfs              2     m1_varHpssAdmCor  00050002 none
[2]


Are you sure that all the nodes have the right cluster config?
ccs_tool update /etc/cluster/cluster.conf ?
 
Yes, absolutely positive.
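For what it's worth, one way to double-check that (a sketch, assuming the stock RHEL5 cluster tools; the hostnames are this thread's c1n1..c1n5): after bumping config_version and running ccs_tool update, "cman_tool status" on every node should report the same Config Version. The helper below just parses that field out of the status output:

```shell
# After editing cluster.conf (and bumping config_version), propagate
# from one node and then verify that every node agrees, e.g.:
#
#   ccs_tool update /etc/cluster/cluster.conf
#   for n in c1n1 c1n2 c1n3 c1n4 c1n5; do
#       ssh "$n" 'cman_tool status | grep "Config Version"'
#   done
#
# Helper: extract the version number from "cman_tool status" output.
config_version() {
    printf '%s\n' "$1" | awk -F': ' '/^Config Version/ { print $2 }'
}

# Sample taken from the cman_tool status output earlier in this thread.
sample='Version: 6.2.0
Config Version: 102
Cluster Name: hpss1'

config_version "$sample"
```

If any node reports a stale version, that node never received the update and will behave according to the old config.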


What are you using to manage the config? ricci/luci,
system-config-cluster, vi?

Sometimes luci, sometimes vi.
 

Can we see a "real" cluster.conf?

The IPs have been changed throughout this email (as thoroughly and consistently as possible) per security directives.

<?xml version="1.0"?>
<cluster config_version="102" name="hpss7">
    <fence_daemon clean_start="0" post_fail_delay="3" post_join_delay="60"/>
    <clusternodes>
        <clusternode name="192.168.1.101" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="manual_fence_c1n1" nodename="192.168.1.101"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.102" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="manual_fence_c1n2" nodename="192.168.1.102"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.103" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device name="manual_fence_c1n3" nodename="192.168.1.103"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.104" nodeid="4" votes="1">
            <fence>
                <method name="1">
                    <device name="manual_fence_c1n4" nodename="192.168.1.104"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.105" nodeid="5" votes="1">
            <fence>
                <method name="1">
                    <device name="manual_fence_c1n5" nodename="192.168.1.105"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman/>
    <fencedevices>
        <fencedevice agent="fence_manual" name="manual_fence_c1n3"/>
        <fencedevice agent="fence_manual" name="manual_fence_c1n1"/>
        <fencedevice agent="fence_manual" name="manual_fence_c1n2"/>
        <fencedevice agent="fence_manual" name="manual_fence_c1n4"/>
        <fencedevice agent="fence_manual" name="manual_fence_c1n5"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="fd_core1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.101" priority="1"/>
                <failoverdomainnode name="192.168.1.102" priority="2"/>
                <failoverdomainnode name="192.168.1.103" priority="3"/>
                <failoverdomainnode name="192.168.1.104" priority="4"/>
                <failoverdomainnode name="192.168.1.105" priority="5"/>
            </failoverdomain>
            <failoverdomain name="fd_mover1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.101" priority="5"/>
                <failoverdomainnode name="192.168.1.102" priority="1"/>
                <failoverdomainnode name="192.168.1.103" priority="2"/>
                <failoverdomainnode name="192.168.1.104" priority="3"/>
                <failoverdomainnode name="192.168.1.105" priority="4"/>
            </failoverdomain>
            <failoverdomain name="fd_mover2" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.101" priority="4"/>
                <failoverdomainnode name="192.168.1.102" priority="5"/>
                <failoverdomainnode name="192.168.1.103" priority="1"/>
                <failoverdomainnode name="192.168.1.104" priority="2"/>
                <failoverdomainnode name="192.168.1.105" priority="3"/>
            </failoverdomain>
            <failoverdomain name="fd_vfs1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.101" priority="3"/>
                <failoverdomainnode name="192.168.1.102" priority="4"/>
                <failoverdomainnode name="192.168.1.103" priority="5"/>
                <failoverdomainnode name="192.168.1.104" priority="1"/>
                <failoverdomainnode name="192.168.1.105" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="192.168.2.40" monitor_link="1"/>
            <ip address="10.10.1.74" monitor_link="1"/>
            <ip address="192.168.2.41" monitor_link="1"/>
            <ip address="10.10.1.75" monitor_link="1"/>
            <ip address="192.168.2.42" monitor_link="1"/>
            <ip address="10.10.1.76" monitor_link="1"/>
            <ip address="192.168.2.43" monitor_link="1"/>
            <ip address="10.10.1.77" monitor_link="1"/>
            <script file="/ha/bin/ha-hpss-core1" name="ha-hpss-core1"/>
            <script file="/ha/bin/ha-hpss-mover1" name="ha-hpss-mover1"/>
            <script file="/ha/bin/ha-hpss-mover2" name="ha-hpss-mover2"/>
            <script file="/ha/bin/ha-hpss-vfs1" name="ha-hpss-vfs1"/>
            <clusterfs device="/dev/mapper/c1_hpss_vg-c1_db2Backup" force_unmount="1" fsid="34722" fstype="gfs2" mountpoint="/ha/c1_db2Backup" name="c1_db2Backup" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpss_vg-c1_hpssSource" force_unmount="1" fsid="41961" fstype="gfs2" mountpoint="/ha/c1_hpssSource" name="c1_hpssSource" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpss_vg-c1_varHpss" force_unmount="1" fsid="31374" fstype="gfs2" mountpoint="/ha/c1_varHpss" name="c1_varHpss" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpss_vg-c1_varHpssAdmCor" force_unmount="1" fsid="46145" fstype="gfs2" mountpoint="/ha/c1_varHpssAdmCor" name="c1_varHpssAdmCor" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdb2v95_vg-c1_optIbmDb2V95" force_unmount="1" fsid="38858" fstype="gfs2" mountpoint="/ha/c1_optIbmDb2V95" name="c1_optIbmDb2V95" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdb_vg-c1_hpssdbUserSp1" force_unmount="1" fsid="17090" fstype="gfs2" mountpoint="/ha/c1_hpssdbUserSp1" name="c1_hpssdbUserSp1" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdb_vg-c1_varHpssHpssdb" force_unmount="1" fsid="55384" fstype="gfs2" mountpoint="/ha/c1_varHpssHpssdb" name="c1_varHpssHpssdb" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdblog_vg-c1_db2LogCfg" force_unmount="1" fsid="7401" fstype="gfs2" mountpoint="/ha/c1_db2LogCfg" name="c1_db2LogCfg" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdblog_vg-c1_db2LogSubs1" force_unmount="1" fsid="65529" fstype="gfs2" mountpoint="/ha/c1_db2LogSubs1" name="c1_db2LogSubs1" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdblogm_vg-c1_db2LogMCfg" force_unmount="1" fsid="30927" fstype="gfs2" mountpoint="/ha/c1_db2LogMCfg" name="c1_db2LogMCfg" self_fence="1"/>
            <clusterfs device="/dev/mapper/c1_hpssdblogm_vg-c1_db2LogMSubs1" force_unmount="1" fsid="47193" fstype="gfs2" mountpoint="/ha/c1_db2LogMSubs1" name="c1_db2LogMSubs1" self_fence="1"/>
            <clusterfs device="/dev/mapper/m1_hpss_vg-m1_varHpss" force_unmount="1" fsid="19005" fstype="gfs2" mountpoint="/ha/m1_varHpss" name="m1_varHpss" self_fence="1"/>
            <clusterfs device="/dev/mapper/m1_hpss_vg-m1_varHpssAdmCor" force_unmount="1" fsid="17130" fstype="gfs2" mountpoint="/ha/m1_varHpssAdmCor" name="m1_varHpssAdmCor" self_fence="1"/>
            <clusterfs device="/dev/mapper/m2_hpss_vg-m2_hpssSource" force_unmount="1" fsid="32169" fstype="gfs2" mountpoint="/ha/m2_hpssSource" name="m2_hpssSource" self_fence="1"/>
            <clusterfs device="/dev/mapper/m2_hpss_vg-m2_varHpss" force_unmount="1" fsid="30456" fstype="gfs2" mountpoint="/ha/m2_varHpss" name="m2_varHpss" self_fence="1"/>
            <clusterfs device="/dev/mapper/m2_hpss_vg-m2_varHpssAdmCor" force_unmount="1" fsid="10387" fstype="gfs2" mountpoint="/ha/m2_varHpssAdmCor" name="m2_varHpssAdmCor" self_fence="1"/>
            <clusterfs device="/dev/mapper/v1_hpss_vg-v1_hpssSource" force_unmount="1" fsid="46624" fstype="gfs2" mountpoint="/ha/v1_hpssSource" name="v1_hpssSource" self_fence="1"/>
            <clusterfs device="/dev/mapper/v1_hpss_vg-v1_varHpss" force_unmount="1" fsid="46980" fstype="gfs2" mountpoint="/ha/v1_varHpss" name="v1_varHpss" self_fence="1"/>
            <clusterfs device="/dev/mapper/v1_hpss_vg-v1_varHpssAdmCor" force_unmount="1" fsid="22473" fstype="gfs2" mountpoint="/ha/v1_varHpssAdmCor" name="v1_varHpssAdmCor" self_fence="1"/>
            <clusterfs device="/dev/mapper/m1_hpss_vg-m1_hpssSource" force_unmount="1" fsid="44889" fstype="gfs2" mountpoint="/ha/m1_hpssSource" name="m1_hpssSource" self_fence="1"/>
        </resources>
        <service autostart="1" domain="fd_core1" exclusive="1" name="core1" recovery="relocate">
            <ip ref="192.168.2.40"/>
            <ip ref="10.10.1.74"/>
            <script ref="ha-hpss-core1"/>
            <clusterfs fstype="gfs" ref="c1_db2Backup"/>
            <clusterfs fstype="gfs" ref="c1_hpssSource"/>
            <clusterfs fstype="gfs" ref="c1_varHpss"/>
            <clusterfs fstype="gfs" ref="c1_varHpssAdmCor"/>
            <clusterfs fstype="gfs" ref="c1_optIbmDb2V95"/>
            <clusterfs fstype="gfs" ref="c1_hpssdbUserSp1"/>
            <clusterfs fstype="gfs" ref="c1_varHpssHpssdb"/>
            <clusterfs fstype="gfs" ref="c1_db2LogCfg"/>
            <clusterfs fstype="gfs" ref="c1_db2LogSubs1"/>
            <clusterfs fstype="gfs" ref="c1_db2LogMCfg"/>
            <clusterfs fstype="gfs" ref="c1_db2LogMSubs1"/>
        </service>
        <service autostart="1" domain="fd_mover1" exclusive="1" name="mover1" recovery="relocate">
            <ip ref="192.168.2.41"/>
            <ip ref="10.10.1.75"/>
            <script ref="ha-hpss-mover1"/>
            <clusterfs fstype="gfs" ref="m1_hpssSource"/>
            <clusterfs fstype="gfs" ref="m1_varHpss"/>
            <clusterfs fstype="gfs" ref="m1_varHpssAdmCor"/>
        </service>
        <service autostart="1" domain="fd_mover2" exclusive="1" name="mover2" recovery="relocate">
            <ip ref="192.168.2.42"/>
            <ip ref="10.10.1.76"/>
            <script ref="ha-hpss-mover2"/>
            <clusterfs fstype="gfs" ref="m2_hpssSource"/>
            <clusterfs fstype="gfs" ref="m2_varHpss"/>
            <clusterfs fstype="gfs" ref="m2_varHpssAdmCor"/>
        </service>
        <service autostart="1" domain="fd_vfs1" exclusive="1" name="vfs1" recovery="relocate">
            <ip ref="192.168.2.43"/>
            <ip ref="10.10.1.77"/>
            <script ref="ha-hpss-vfs1"/>
            <clusterfs fstype="gfs" ref="v1_hpssSource"/>
            <clusterfs fstype="gfs" ref="v1_varHpss"/>
            <clusterfs fstype="gfs" ref="v1_varHpssAdmCor"/>
        </service>
    </rm>
</cluster>

 

Finally, are you by chance using HA-LVM to manage disks
across nodes, or are you using GFS?

The filesystems are all GFS2.
 


As I said above, and you clearly agree, something is not right and the
more information you can share, the better.




Occam was right, for the most part.

For the most part, yes.

Thanks man.


Corey


On Mon, May 17, 2010 at 10:00 PM, Dusty <dhoffutt gmail com> wrote:
> Addendum: a symptom of this issue is that because node1 does not reboot
> and rejoin the cluster, the service it was running never relocates.
>
> I left it in this state over the weekend. Came back Monday morning, the
> service had still not relocated.
>
> It is not fencing. It is not fencing. Fencing works. Fencing works.
>
> On Mon, May 17, 2010 at 3:58 PM, Dusty <dhoffutt gmail com> wrote:
>>
>> I appreciate the help - but I'm saying the same thing for like the fourth
>> or fifth time now.
>>
>> Fencing is working well. All cluster nodes are able to communicate to the
>> fence device (APC PDU).
>>
>> Another example. The cluster is quorate with five nodes and four running
>> services. I am pulling the plug on node1. I need service1, which happens to be
>> running on node1 right now, to relocate ASAP - not after node1 has rebooted.
>> Node3 is a member of the cluster and is available to accept a service
>> relocation.
>>
>> I have cluster ssh logged into all nodes and am tailing their
>> /var/log/messages file.
>>
>> 14:57 "Pulling the plug" on node1 now (really just turning off the
>> electrical port on the APC).
>> About five seconds later....
>> 14:58:00 node2 fenced[7584] fencing node "192.168.1.1"
>> 14:58:05 node2 fenced[7584] fence "192.168.1.1" success
>> --- Right now the service SHOULD be relocating - but it doesn't! ---
>> -- A few minutes later, node1 has rebooted after being successfully fenced
>> via node2 operating the APC PDU.
>> 15:03:43 node3 clurgmgrd[4813]: <notice> Recovering failed service
>> service:service1
>>
>> Second test now - doing the exact same thing, but this time really pulling
>> the plug on node1.
>>
>> Everything happens the same except node2 fencing node1 has no effect
>> because I've simulated a complete node failure on node1. It is not going to
>> boot.
>>
>>
>> On Sat, May 15, 2010 at 1:25 PM, Corey Kovacs <corey kovacs gmail com>
>> wrote:
>>>
>>> The reason I was pointing at the fencing config is that the service
>>> will only relocate when fenced is able to confirm that the offending
>>> node has been fenced. If this can't happen, then services will not
>>> relocate since the cluster doesn't know the state of all the nodes. If
>>> a node gets an anvil dropped on it, then it should stop responding
>>> and the cluster should then try to invoke the fence on that node to
>>> make sure that it is indeed dead, even if it only cycles the power
>>> port for an already dead node.
>>>
>>> Given your description, you should experience the same "problem" if you
>>> simply turn the node off. Normally, when you turn the power off (not
>>> pull the plug) and then boot the node, the cluster either should have
>>> already fenced the node, or it will fence it as it's booting. It looks odd,
>>> but it's correct since the cluster has to get things to a known state.
>>>
>>> After the fence and before the node boots, services should start
>>> migrating. All of this you probably know, but it's worth saying anyway.
>>>
>>> Basically, if your services only migrate after the node boots up, then
>>> I believe fencing is not working properly. The services should migrate
>>> while the node is booting or even before.
>>>
>>> So it appears to me that when you turn off the APC port yourself, or pull
>>> the plug on the node, you have the same condition.
>>>
>>> The way to really test fencing is to watch the logs on a node and
>>> issue
>>>
>>> cman_tool kill <cluster member> to tell cman to fence the node.
>>>
>>> One thought: can all your cluster nodes talk to the APC at all times?
>>>
>>>
>>> -Corey
>>>
>>>
>>>
>>>
>>> On Sat, May 15, 2010 at 5:50 PM, Dusty <dhoffutt gmail com> wrote:
>>> > Fencing works great - no problems there. The APC PDU responds
>>> > beautifully to
>>> > any node's attempt to fence.
>>> >
>>> > The issue is this:
>>> >
>>> > The service only relocates after the fenced node reboots and rejoins
>>> > the cluster. Then the service relocates to another node. This happens
>>> > reliably and without fail.
>>> >
>>> > But what if the node that was fenced refuses to boot back up because,
>>> > say, an anvil fell out of the sky and smashed it, or its motherboard fried?
>>> >
>>> > This is what I am simulating by pulling the plug on a node that happens
>>> > to
>>> > be running a service. The service will not relocate until the failed
>>> > node
>>> > has rebooted.
>>> >
>>> > I don't want that. I want the service to relocate ASAP regardless of
>>> > whether the failed node reboots or not.
>>> >
>>> > Thank you so much for your consideration.
>>> >
>>> > --
>>> > Linux-cluster mailing list
>>> > Linux-cluster redhat com
>>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>>> >
>>>
>>
>
>
>


