[Cluster-devel] [PATCH] dlm_controld.pcmk: Fix membership change judging issue

Fri May 14 11:28:59 UTC 2010

I'll test the patch later, but I may not finish the testing today
because I'm having trouble accessing the hardware at the moment.

Thanks a lot ;-)
Jiaju

On Fri, May 14, 2010 at 6:15 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Fri, May 14, 2010 at 5:04 AM, Tim Serong <tserong at novell.com> wrote:
>> On 5/14/2010 at 06:19 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>
>>> Does the behavior still occur with pacemaker 1.1.2?
>>>
>>
>> Yes.
>>
>> For the record, the most minimal testcase I've managed for this
>> so far is as follows (substitute "/etc/init.d/corosync start" or
>> whatever for "rcopenais start" if you're not on something SUSE-based):
>>
>> 1) Configure corosync/openais on two nodes.
>>   Do not start the cluster yet.
>>
>> 2) On one node:
>>
>>     # rm /var/lib/heartbeat/crm/*
>>     # rcopenais start
>>     # while ! crm_mon -1 | grep -qi online; do \
>>         echo -n "." ; sleep 5 ; done
>>
>> 3) Now we have one node online, configure Pacemaker:
>>
>>     # cat <<CONF | crm configure
>>     primitive dlm ocf:pacemaker:controld
>>     primitive clvm ocf:lvm2:clvmd
>>     group g dlm clvm
>>     clone c g meta interleave="true"
>>     property stonith-enabled="false"
>>     property no-quorum-policy="ignore"
>>     commit
>>     CONF
>>
>>   Watch "crm_mon -r" until that clone comes online.
>>   Should only take a few seconds.
>>
>> 4) On the other node:
>>
>>     # rm /var/lib/heartbeat/crm/*
>>     # rcopenais start
>>
>> The first node will now either wedge up spectacularly, and/or
>> dlm_recoverd and clvmd will be stuck in D state on both nodes.
>
> Presumably each thinks the other node isn't a member?
> Perhaps something like this will help:
>
> diff -r b59c27dc114a lib/ais/plugin.c
> --- a/lib/ais/plugin.c  Wed May 12 10:51:56 2010 +0200
> +++ b/lib/ais/plugin.c  Fri May 14 12:12:33 2010 +0200
> @@ -498,9 +498,8 @@ static void *pcmk_wait_dispatch (void *a
>                    ais_notice("Respawning failed child process: %s",
>                               pcmk_children[lpc].name);
>                    spawn_child(&(pcmk_children[lpc]));
> -               } else {
> -                   send_cluster_id();
>                }
> +               send_cluster_id();
>            }
>        }
>        sched_yield ();
> @@ -661,6 +660,7 @@ int pcmk_startup(struct corosync_api_v1
>            }
>        }
>     }
> +    send_cluster_id();
>
>     return 0;
>  }
>
>