[Linux-cluster] pull plug on node, service never relocates

Dustin Henry Offutt dhoffutt at gmail.com
Thu May 20 19:20:23 UTC 2010


Thank you for that -- excellent tip.

Yesterday evening I forced a re-install of all cluster-associated RPMs, just
in case there was some sort of binary corruption... Still getting the same
result. This log is from yesterday, after increasing the log level of rgmanager.

This is the log from the node that did the fencing. The "spare" machine did
not pick up the service until after the "failed" node was noticed by all the
other nodes with a "clurgmgrd[5234]: <info> State change: 192.168.1.101 UP" -
which is, of course, after the failed node had been fenced, rebooted, and
rejoined the cluster... Really weird issue.
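
For reference, the verbose rgmanager logging seen below was enabled per
Alfredo's suggestion, so the rm block in /etc/cluster/cluster.conf now opens
roughly like this (a minimal sketch: the service name and recovery policy are
illustrative placeholders, though the script path is the real one):

    <rm log_level="7" log_facility="local4">
      <failoverdomains/>
      <resources/>
      <service autostart="1" name="hpss-mover1" recovery="relocate">
        <script file="/ha/bin/ha-hpss-mover1" name="ha-hpss-mover1"/>
      </service>
    </rm>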

May 19 16:19:13 c1n2 root: MARK I fail c1n1 running core1 by ifconfigging
its ethernet ports off
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] The token was lost in the
OPERATIONAL state.
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] Receive multicast socket recv
buffer size (288000 bytes).
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] Transmit multicast socket send
buffer size (262142 bytes).
May 19 16:19:35 c1n2 openais[4660]: [TOTEM] entering GATHER state from 2.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering GATHER state from 11.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Saving state aru 8b high seq
received 8b
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Storing new sequence id for ring
d0c
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering COMMIT state.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering RECOVERY state.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [0] member
192.168.1.103:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep
192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b
received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [1] member
192.168.1.104:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep
192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b
received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [2] member
192.168.1.105:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep
192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b
received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] position [3] member
192.168.1.102:
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] previous ring seq 3336 rep
192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] aru 8b high delivered 8b
received flag 1
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] Did not need to originate any
messages in recovery.
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] CLM CONFIGURATION CHANGE
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] New Configuration:
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.103)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.104)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.105)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.102)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] Members Left:
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.101)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] Members Joined:
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] CLM CONFIGURATION CHANGE
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] New Configuration:
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.103)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.104)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.105)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.102)
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] Members Left:
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] Members Joined:
May 19 16:19:40 c1n2 openais[4660]: [SYNC ] This node is within the primary
component and will provide service.
May 19 16:19:40 c1n2 openais[4660]: [TOTEM] entering OPERATIONAL state.
May 19 16:19:40 c1n2 kernel: dlm: closing connection to node 1
May 19 16:19:40 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101
DOWN
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.103
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.104
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.105
May 19 16:19:40 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.102
May 19 16:19:40 c1n2 openais[4660]: [CPG  ] got joinlist message from node 5
May 19 16:19:40 c1n2 openais[4660]: [CPG  ] got joinlist message from node 2
May 19 16:19:40 c1n2 openais[4660]: [CPG  ] got joinlist message from node 3
May 19 16:19:40 c1n2 openais[4660]: [CPG  ] got joinlist message from node 4
May 19 16:19:43 c1n2 fenced[4680]: 192.168.1.101 not a cluster member after
3 sec post_fail_delay
May 19 16:19:43 c1n2 fenced[4680]: fencing node "192.168.1.101"
May 19 16:19:45 c1n2 clurgmgrd[5234]: <info> Waiting for node #1 to be
fenced
May 19 16:19:47 c1n2 fenced[4680]: fence "192.168.1.101" success
May 19 16:19:47 c1n2 clurgmgrd[5234]: <info> Node #1 fenced; continuing
May 19 16:20:05 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:22:37 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:23:27 c1n2 last message repeated 3 times
May 19 16:24:57 c1n2 last message repeated 3 times
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering GATHER state from 11.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Saving state aru 3e high seq
received 3e
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Storing new sequence id for ring
d10
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering COMMIT state.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering RECOVERY state.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [0] member
192.168.1.103:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep
192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e
received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [1] member
192.168.1.104:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep
192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e
received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [2] member
192.168.1.105:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep
192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e
received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [3] member
192.168.1.101:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep
192.168.1.101
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru a high delivered a received
flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] position [4] member
192.168.1.102:
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] previous ring seq 3340 rep
192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] aru 3e high delivered 3e
received flag 1
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] Did not need to originate any
messages in recovery.
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] CLM CONFIGURATION CHANGE
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] New Configuration:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.103)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.104)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.105)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.102)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] Members Left:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] Members Joined:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] CLM CONFIGURATION CHANGE
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] New Configuration:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.103)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.104)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.105)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.101)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.102)
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] Members Left:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] Members Joined:
May 19 16:25:17 c1n2 openais[4660]: [CLM  ]     r(0) ip(192.168.1.101)
May 19 16:25:17 c1n2 openais[4660]: [SYNC ] This node is within the primary
component and will provide service.
May 19 16:25:17 c1n2 openais[4660]: [TOTEM] entering OPERATIONAL state.
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.103
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.104
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.105
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.101
May 19 16:25:17 c1n2 openais[4660]: [CLM  ] got nodejoin message
192.168.1.102
May 19 16:25:17 c1n2 openais[4660]: [CPG  ] got joinlist message from node 2
May 19 16:25:17 c1n2 openais[4660]: [CPG  ] got joinlist message from node 3
May 19 16:25:17 c1n2 openais[4660]: [CPG  ] got joinlist message from node 4
May 19 16:25:17 c1n2 openais[4660]: [CPG  ] got joinlist message from node 5
May 19 16:25:24 c1n2 kernel: dlm: connecting to 1
May 19 16:25:27 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:25:57 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:26:00 c1n2 clurgmgrd[5234]: <info> State change: 192.168.1.101 UP
May 19 16:26:27 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:26:56 c1n2 xinetd[9002]: Exiting...
May 19 16:26:56 c1n2 xinetd[2236]: xinetd Version 2.3.14 started with
libwrap loadavg labeled-networking options compiled in.
May 19 16:26:56 c1n2 xinetd[2236]: Started working: 1 available service
May 19 16:26:57 c1n2 clurgmgrd: [5234]: <info> Executing
/ha/bin/ha-hpss-mover1 status
May 19 16:28:57 c1n2 last message repeated 2 times
May 19 16:28:58 c1n2 root: MARK II - end of test
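
(For clarity, "ifconfigging its ethernet ports off" at MARK I means running
something along these lines on c1n1 -- the interface names are assumptions:

    ifconfig eth0 down
    ifconfig eth1 down

The original report was with the power cables pulled; this just simulates the
same failure at the network level.)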

On Wed, May 19, 2010 at 2:42 PM, Alfredo Moralejo <amoralej at redhat.com> wrote:

>  What is the state of the service that was running on the node after
> pulling the power cables? Stopped? Failed?
>
> Set rgmanager to verbose mode with <rm log_level="7"
> log_facility="local4">
>
> Regards
>
> Alfredo
>
>
>
> On 05/19/2010 07:08 PM, Dusty wrote:
>
> In the interest of troubleshooting, I've taken all the failover domains out
> of the configuration.
>
> This resulted in no change:
>
> Service on a failed node does not relocate until the failed node reboots.
>
> To reiterate: Similar cluster configuration on similar hardware worked
> perfectly on RHEL5U3.
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
> --
>
> Alfredo Moralejo
> Red Hat - Senior consultant
>
> Office: +34 914148838
> Cell: +34 607909535
> Email: alfredo.moralejo at redhat.com
>
> Dirección Comercial: C/Jose Bardasano Baos, 9, Edif. Gorbea 3, planta 3ºD,
> 28016 Madrid, Spain
> Dirección Registrada: Red Hat S.L., C/ Velazquez 63, Madrid 28001, Spain
> Inscrita en el Reg. Mercantil de Madrid – C.I.F. B82657941
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>