
Re: [Cluster-devel] [PATCH] Fix pacemaker's wrong quorum view in a CMAN+pacemaker cluster



On Sun, Mar 13, 2011 at 2:17 PM, Simone Gotti <simone gotti gmail com> wrote:
> Hi all,
>
> While testing a cman+pacemaker cluster on RHEL6 I noticed a very nasty
> behavior when nodes leave and rejoin the cluster. Once a node starts
> leaving and rejoining, pacemaker's view of quorum sometimes diverges
> from cman's view. The one not telling the truth was pacemaker.

Do they ever start agreeing again?  In other words, is the situation
transient or is Pacemaker always 1 (or more) behind after that?

>
> I reproduced the problem with a simple test case made of 2 nodes using
> cman (no two_nodes flag) and pacemaker (started only on the first node:
> pcmk01).
>
> For the tests I was using the latest version of pacemaker (1.1.5) while
> keeping the original versions of the corosync and cluster (cman)
> packages provided by RHEL6 (corosync-1.2.3-21.el6.x86_64,
> cman-3.0.12-23.el6.4.x86_64).
>
> The problem is that when a node joins the cluster (starting cman), cman
> on the other nodes emits not one but two events (I didn't investigate
> whether this is normal or present only in some versions of cman). When
> crmd calls cman_dispatch it passes the flag CMAN_DISPATCH_ONE, so only
> one of the two events is dequeued. The leftover event is then dequeued
> on the next cluster event, in place of the new one.
>
> The fix I tried uses CMAN_DISPATCH_ALL instead of CMAN_DISPATCH_ONE, and
> it looks like it's working.
>
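
To make the mechanism described above concrete, here is a toy model of it. It does not call libcman. ToyClient, replay, CHANGES and the two DISPATCH_* constants are made-up stand-ins for illustration only, and the model assumes exactly one dispatch call per cluster change, which is what the crmd log lines in the test case below suggest. The event counts (one event on a leave, two on a join) are taken from those logs.

```python
from collections import deque

# Hypothetical stand-ins for CMAN_DISPATCH_ONE / CMAN_DISPATCH_ALL.
DISPATCH_ONE = "one"
DISPATCH_ALL = "all"


class ToyClient:
    """Toy model of a client that learns quorum state from queued events."""

    def __init__(self, mode):
        self.mode = mode
        self.queue = deque()
        self.quorate = True  # the test starts with both nodes up

    def dispatch(self):
        """Consume one queued event, or all of them, depending on mode."""
        while self.queue:
            self.quorate = self.queue.popleft()
            if self.mode == DISPATCH_ONE:
                break


def replay(mode, changes):
    """Return (real quorum, client's view, backlog) after each change."""
    client = ToyClient(mode)
    views = []
    for quorate, n_events in changes:
        client.queue.extend([quorate] * n_events)
        client.dispatch()  # assumption: one wakeup per cluster change
        views.append((quorate, client.quorate, len(client.queue)))
    return views


# (real quorum state, events emitted): stop, start, stop, start on pcmk02.
CHANGES = [(False, 1), (True, 2), (False, 1), (True, 2)]

for mode in (DISPATCH_ONE, DISPATCH_ALL):
    print(mode)
    for cman, crmd, backlog in replay(mode, CHANGES):
        print(f"  cman quorate={cman!s:5} crmd quorate={crmd!s:5} backlog={backlog}")
```

Under these assumptions the dispatch-one client disagrees with the real state from the third change onwards and its backlog grows by one per stop/start cycle, so in this model the lag would not clear by itself. The dispatch-all client stays in sync with an empty queue.
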
> I'm CCing the cluster-devel list as they may be interested in the double
> event emitted by cman.
>
>
> Thanks.
>
> Bye!
>
>
> == Test case ==
>
> === Without the patch ===
>
> Start with both nodes with cman started (so the cluster is quorate).
>
>
> Now stop cman on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[16793]:   [CMAN  ] quorum lost, blocking activity
> pcmk01 corosync[16793]:   [QUORUM] This node is within the non-primary
> component and will NOT provide any services.
> pcmk01 corosync[16793]:   [QUORUM] Members[1]: 1
> pcmk01 corosync[16793]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 1
> pcmk01 corosync[16793]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[16793]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
> pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 668:
> quorum lost
>
> Only one event is enqueued.
>
> Now start cman again on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[16793]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[16793]:   [CMAN  ] quorum regained, resuming activity
> pcmk01 corosync[16793]:   [QUORUM] This node is within the primary
> component and will provide service.
> pcmk01 corosync[16793]:   [QUORUM] Members[2]: 1 2
> pcmk01 corosync[16793]:   [QUORUM] Members[2]: 1 2
> pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 672:
> quorum acquired
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[16793]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[16793]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
>
> As you can see, two events are enqueued and only one is dequeued (due to
> the CMAN_DISPATCH_ONE flag passed to cman_dispatch).
>
> Quorum is regained on both cman and crmd, but a second event saying
> that quorum is regained is still in the queue.
>
>
> Now stop cman again on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[16793]:   [CMAN  ] quorum lost, blocking activity
> pcmk01 corosync[16793]:   [QUORUM] This node is within the non-primary
> component and will NOT provide any services.
> pcmk01 corosync[16793]:   [QUORUM] Members[1]: 1
> pcmk01 corosync[16793]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 1
> pcmk01 corosync[16793]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[16793]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
> pcmk01 crmd: [16993]: info: cman_event_callback: Membership 676: quorum
> retained
>
> CMAN says that the quorum is lost and only one event is dispatched. But
> crmd dequeued the previous event and thinks that we have the quorum.
>
>
> Now start cman again on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[16793]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[16793]:   [CMAN  ] quorum regained, resuming activity
> pcmk01 corosync[16793]:   [QUORUM] This node is within the primary
> component and will provide service.
> pcmk01 corosync[16793]:   [QUORUM] Members[2]: 1 2
> pcmk01 corosync[16793]:   [QUORUM] Members[2]: 1 2
> pcmk01 crmd: [16993]: notice: cman_event_callback: Membership 680:
> quorum lost
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[16793]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[16793]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[16793]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
>
> CMAN says that the quorum is regained, but crmd again dequeued an old
> event and now says that the quorum is lost. And so on...
>
>
>
> === With the patch ===
>
> Stop cman on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[13149]:   [CMAN  ] quorum lost, blocking activity
> pcmk01 corosync[13149]:   [QUORUM] This node is within the non-primary
> component and will NOT provide any services.
> pcmk01 corosync[13149]:   [QUORUM] Members[1]: 1
> pcmk01 corosync[13149]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[13149]:   [CPG   ] downlist received left_list: 1
> pcmk01 corosync[13149]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[13149]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
>
> pcmk01 crmd: [13351]: notice: cman_event_callback: Membership 648:
> quorum lost
>
> Only one event is enqueued.
>
>
> Now start cman again on pcmk02. Output on pcmk01:
>
> pcmk01 corosync[13149]:   [TOTEM ] A processor joined or left the
> membership and a new membership was formed.
> pcmk01 corosync[13149]:   [CMAN  ] quorum regained, resuming activity
> pcmk01 corosync[13149]:   [QUORUM] This node is within the primary
> component and will provide service.
> pcmk01 corosync[13149]:   [QUORUM] Members[2]: 1 2
> pcmk01 corosync[13149]:   [QUORUM] Members[2]: 1 2
> pcmk01 crmd: [13351]: notice: cman_event_callback: Membership 652:
> quorum acquired
> pcmk01 corosync[13149]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[13149]:   [CPG   ] downlist received left_list: 0
> pcmk01 corosync[13149]:   [CPG   ] chosen downlist from node r(0)
> ip(192.168.200.71)
> pcmk01 corosync[13149]:   [MAIN  ] Completed service synchronization,
> ready to provide service.
> pcmk01 crmd: [13351]: info: cman_event_callback: Membership 652: quorum
> retained
>
> As you can see, two events are enqueued and both are dequeued.
>
>
>
> --
> Simone Gotti
>
>
>

