[Linux-cluster] qdiskd master election and loss of quorum

Wed Nov 11 17:06:31 UTC 2009

On Wed, 2009-11-11 at 11:49 -0500, Lon H. Hohberger wrote:
> On Thu, 2009-11-05 at 15:28 +0100, Gianluca Cecchi wrote:
> 
> > Nov  5 12:52:53 mork clurgmgrd[2633]: <notice> Member 2 shutting down 
> > Nov  5 12:52:57 mork qdiskd[2214]: <info> Node 2 shutdown 
> 
> > Nov  5 12:55:41 mork openais[2185]: [TOTEM] The token was lost in the
> > OPERATIONAL state. 
> 
> That's very interesting.  It looks like what caused the state change
> failures was the huge lag between when rgmanager sent its "good bye
> kiss" and the time openais noticed the node was offline.
> The timeout was large enough that rgmanager gave up.
> 
> This isn't actually the quorum disk master election problem at all...
> It's also very strange.
> 
> - rgmanager should have known this was unnecessary.  The other node said
> it was going away.
> - cman probably should have caused a transition sooner, I think (??)

So... rgmanager treats a node which sends the 'EXITING' message as
offline.  It makes no sense that it would do this and then fail
to update the cluster state.

        case RG_EXITING:
                if (!member_online(msg_hdr->gh_arg1))
                        break;

                logt_print(LOG_NOTICE, "Member %d shutting down\n",
                       msg_hdr->gh_arg1);
                member_set_state(msg_hdr->gh_arg1, 0);
                node_event_q(0, msg_hdr->gh_arg1, 0, 1);
                break;
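
To make the expected behaviour concrete, here is a minimal stand-alone
sketch of that case.  Only the names member_online(), member_set_state()
and node_event_q() come from the snippet above; their bodies, the
online[] table, the queued_events counter, the parameter names and the
handle_exiting() wrapper are hypothetical stand-ins, not rgmanager's
real implementation.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for rgmanager's membership table and event queue. */
static int online[MAX_NODES];   /* 1 = member considered online */
static int queued_events;       /* number of node events queued so far */

static int member_online(int nodeid)
{
        return nodeid >= 0 && nodeid < MAX_NODES && online[nodeid];
}

static void member_set_state(int nodeid, int state)
{
        if (nodeid >= 0 && nodeid < MAX_NODES)
                online[nodeid] = state;
}

/* Parameter names are assumptions; the stub only counts queued events. */
static void node_event_q(int local, int nodeid, int state, int clean)
{
        (void)local;
        (void)state;
        (void)clean;
        assert(nodeid >= 0 && nodeid < MAX_NODES);
        queued_events++;
}

/* Mirrors the RG_EXITING case: act once per online member, then ignore. */
static void handle_exiting(int nodeid)
{
        if (!member_online(nodeid))
                return;         /* already offline: nothing to do */

        printf("Member %d shutting down\n", nodeid);
        member_set_state(nodeid, 0);    /* treat the node as offline now */
        node_event_q(0, nodeid, 0, 1);  /* queue a clean "node down" event */
}
```

With nodes 1 and 2 marked online, calling handle_exiting(2) twice logs
once and queues one event; the second call hits the early return.  If
the real code behaves like this, the membership state should already be
updated by the time openais notices the node is gone.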

You said in your previous mail that mindy shut down cleanly -- so I'm
really stumped...

-- Lon