
Re: [Linux-cluster] I give up

Kevin Anderson wrote:
Not sure what you mean by 3 to 1 using IP tie breaker.  How are you maintaining quorum without qdisk as a voting entity?

I have three nodes. If one fails, the other two are expected to maintain quorum and continue. I would really like the cluster to survive a second failure, with the remaining node carrying on by itself (last man standing). For this to work I would need to set expected votes to 1 and make sure the correct node wins the ensuing fencing race.
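To make the vote arithmetic concrete, here is a minimal sketch. It assumes the usual simple-majority rule (quorum = expected votes // 2 + 1) and one vote per node; the function names are illustrative only and are not part of any cluster tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under a simple-majority rule."""
    return expected_votes // 2 + 1


def is_quorate(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving nodes hold enough votes to keep quorum."""
    return live_votes >= quorum_threshold(expected_votes)


# Three nodes with one vote each (assumed): quorum needs 2 votes.
for live in (3, 2, 1):
    print(live, "live votes, quorate:", is_quorate(live, expected_votes=3))

# With expected votes lowered to 1, a lone survivor stays quorate.
print("last man standing:", is_quorate(1, expected_votes=1))
```

Under these assumptions two survivors keep quorum, but a single survivor only does so once expected votes is reduced to 1, which is why the outcome of the fencing race then matters.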

Case two. I remove one node from the cluster to maintain it. Now I have a two-node cluster, with the same issues as above. Luci wants to set two_node = 1 in this case instead of just dealing with expected votes = 1. I haven't tested this because I'm testing all of this with node 2 and node 3, while the future node 1 is currently our production server.

The ping gateway test/IP tie-breaker was my way of reliably running down to last man standing.
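The decision I have in mind can be sketched roughly as follows. This only illustrates the idea, not how the cluster software implements it: `may_fence` and `gateway_reachable` are hypothetical names, and the probe is passed in as a callable so the sketch runs without issuing a real ping.

```python
from typing import Callable


def may_fence(gateway_reachable: Callable[[], bool], attempts: int = 3) -> bool:
    """Decide whether this node should enter the fencing race.

    The node probes the gateway up to `attempts` times. If no probe
    succeeds, the node is assumed to be on the isolated side of the
    partition and stands down instead of fencing its peer.
    """
    return any(gateway_reachable() for _ in range(attempts))


# Hypothetical partition: node A still sees the gateway, node B does not.
node_a_fences = may_fence(lambda: True)
node_b_fences = may_fence(lambda: False)
print("node A fences:", node_a_fences)
print("node B fences:", node_b_fences)
```

Only the node that can still reach the gateway goes on to fence, so the outcome of the race is decided by network reachability rather than by timing.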

During a network partition test, where I expected a fencing race whose outcome I control, one node would not fence the other and did not take over the service until the other node attempted to rejoin the cluster (way too late).

Is this resolved with the 5.1 release we did a few weeks ago?

I'm using the latest release.

Another poster stated that he could not get the cluster to function properly since the switch to OpenAIS. Hence I'm speculating that the two problems are related.
Doubtful.  There have been issues with Cisco switch configurations not passing multicast properly.  All of those have been resolved with a switch configuration setting change.

I don't know why it "stared at me" instead of recovering the service, because debugging is lacking. I really think that even if "verbose debugging" were a compile-time option and users had to install "testing" rpms, all the problems would have been flushed out long ago.

Both of these are part of the bigger-picture resource monitoring work that Lon and some of the Linux-HA guys are jointly working on converging to a single base.  See this page -

Which again, not very visible :-(.
From a distance, it seems that 5.0 and 5.1 are less stable than 4.4 and 4.5 (I've only tried the current ones). If big changes were made and released prematurely, they are being shaken out by production clusters instead of test clusters.

How much of this "not very visible" work is being tested by a larger group?

3. Time for Cluster Summit again - location preferences, timeframe, funding, etc?

Summits are better than closed development, but users like me are never going to attend. A community-based site is a good foundation.

By the way, I am a C programmer. (From Windows land, though we use RH on all of our servers.) I've spent a month trying to get this to work. It's open source, and given enough time I can make it go. I don't have any more time. It's supposed to be production quality.

I have a failure case staring at me, but debugging is lacking, so I have to look elsewhere for a solution. I can't sit here dangling my feet waiting, and I can't spend weeks fixing it myself.
