
Re: [Linux-cluster] CMAN: got WAIT barrier not in phase 1 TRANSITION.96 (2)

> > Oct 13 04:17:18 ey00-s00017 kernel: CMAN: got WAIT barrier not in phase
> > 1 TRANSITION.96 (2)
> That message should be harmless. does it prevent the cluster reaching quorum ?

Hello Patrick / list, I've been working with Tom on this problem.

It doesn't prevent quorum, although after this point the nodes
mysteriously can't seem to join the fence domain.  I've checked and it
doesn't appear that anyone is trying to fence anyone else, so I'm at a
bit of a loss to explain what's going on.

The really bizarre thing is that the old nodes don't seem to play with
the new ones despite them being joined into the cluster (i.e. fence
domain on old nodes shows running, fence domain on new node says joining
indefinitely).  If you prod it enough (start enough new nodes),
eventually the existing cluster will blow apart (nodes start kicking
each other for inconsistency and the like).

Let me explain a few things about our cluster:

We are running Xen.

The control VM for each node is in the cluster with 1 vote.

The application VMs are dynamically spawned and are entered into the

The application VMs have 0 votes (so as to prevent one physical machine
from accidentally grabbing a quorum of votes if it has too many
application VMs running on it).
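
To make the vote reasoning concrete, here is a rough sketch of the arithmetic. It assumes quorum is a strict majority of the expected votes, and the host and VM counts are made-up illustrative numbers, not our real layout. The function names are hypothetical too.

```python
def quorum_threshold(expected_votes):
    # Assumption: quorum is a strict majority of the expected votes.
    return expected_votes // 2 + 1


def single_host_is_quorate(hosts, app_vms_on_host, app_vm_votes,
                           control_vm_votes=1):
    """Could one physical machine reach quorum on its own?

    hosts            -- number of physical machines (one control VM each)
    app_vms_on_host  -- application VMs on the busiest machine
                        (the other machines are assumed to run none)
    app_vm_votes     -- votes given to each application VM
    """
    votes_on_host = control_vm_votes + app_vms_on_host * app_vm_votes
    expected = hosts * control_vm_votes + app_vms_on_host * app_vm_votes
    return votes_on_host >= quorum_threshold(expected)


# Illustrative numbers only: 20 machines, 30 application VMs on one of them.
print(single_host_is_quorate(20, 30, app_vm_votes=0))  # 0-vote application VMs
print(single_host_is_quorate(20, 30, app_vm_votes=1))  # 1-vote application VMs
```

With 0-vote application VMs the busy machine only ever holds its control VM's single vote, so it stays far below the threshold no matter how many VMs it spawns.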

We are currently using fence_manual for debugging purposes (we have an
APC MasterSwitch to eventually use for fencing).

We are experiencing the following problems:

Past a certain size (about 20 cluster members), the cluster has serious
trouble holding together.  Nodes are sometimes kicked
for having an inconsistent view.  There is often a complaint about the
count of members not matching between nodes as well.  Right now we have
the 1.03 version of everything installed (it was packaged and we are
trying to avoid building too much from scratch).

When a node starts up with an old cluster.conf, it never seems to
automatically update to the newer version.  If the file is updated while
a node is down, must it be manually synced before the node rejoins?
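
For reference, this is roughly how a stale copy can be spotted by hand. It is only a sketch: it assumes the version number is the config_version attribute on the top-level cluster element, and the cluster name and version numbers in the sample strings are invented.

```python
import xml.etree.ElementTree as ET


def config_version(xml_text):
    # Assumption: the version is the config_version attribute
    # on the root <cluster> element of cluster.conf.
    root = ET.fromstring(xml_text)
    return int(root.attrib["config_version"])


def is_stale(local_xml, reference_xml):
    # A local copy is stale if its version is lower than the reference copy's.
    return config_version(local_xml) < config_version(reference_xml)


# Invented sample contents standing in for two nodes' cluster.conf files.
old_conf = '<cluster name="example" config_version="3"/>'
new_conf = '<cluster name="example" config_version="5"/>'

print(is_stale(old_conf, new_conf))
```

This only compares two copies of the file; it says nothing about whether the cluster software is supposed to push the newer one out on its own, which is the part I'm asking about.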

Finally, a random question.  When I'm debugging this stuff, I use
"cman_tool services" to keep tabs on some things.  What does the stuff
in the Code column mean?

Jayson Vantuyl
Quality Humans, Inc.
