[Linux-cluster] redhat cluster running on debian 6.

Sat Oct 15 15:47:51 UTC 2011

On 10/15/2011 09:30 AM, Joey L wrote:
>>
>> I don't use apache, so I can't speak to that resource agent's config. I can
>> say though that overall it looks okay with two exceptions.
>>
>> You *must* configure fencing for the cluster to work properly. Even without
>> shared storage, a node failure will trigger a fence call which, because it
>> can't succeed, will leave your cluster hung hard.
>>
>> Change the cluster names to the output of `uname -n` (should be the FQDN).
>>
> 
> i looked at the docs - i think it says fencing is not required.

What docs? That is misleading.

Consider:
* Node 1 wants to start Service A.
* Node 1 requests a DLM lock, gets it, starts Service A.
* Meanwhile, Node 2 wants to start Service A.
* Node 2 requests a DLM lock, is refused because the lock is out to Node 1.
* Node 1 finishes starting Service A and tells the cluster.
* Node 1 releases the lock.
* Node 2, having seen now that Service A is running, no longer tries to
start Service A.

Time passes, and suddenly Node 1 fails.

* After a short period of time, the cluster will detect Node 1's death.
* The cluster enters an unknown state (is Node 1 dead, hung, ?).
* The cluster will call a fence and DLM will block. With DLM blocked,
nothing can get a lock and, without a lock, services can not be recovered.
* The fence call completes successfully and tells the cluster that
things are back into a known state.
* DLM unblocks.
* RGManager sorts out what services were lost (Service A) and figures out
who can recover the lost service (Node 2).
* Node 2 requests a lock from DLM and you know the rest of the story.
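
The sequence above can be sketched as a toy model in Python. This is only
an illustration of the blocking behaviour; the class, method and service
names are made up and are not the DLM or rgmanager API. It assumes a fence
call either succeeds at once or never succeeds.

```python
class ToyCluster:
    """Toy model only. All names here are hypothetical."""

    def __init__(self, nodes, fence_works):
        self.members = set(nodes)       # nodes in a known, healthy state
        self.fence_works = fence_works  # is there a usable fence device?
        self.dlm_blocked = False        # while True, nobody can get a lock
        self.services = {}              # service name -> node running it

    def start_service(self, node, service):
        """Try to take the lock and start a service. Returns True on success."""
        if self.dlm_blocked:
            return False                # no lock available, nothing can start
        if service in self.services:
            return False                # already running somewhere else
        self.services[service] = node
        return True

    def node_goes_silent(self, node):
        """The cluster notices a node has stopped responding."""
        self.dlm_blocked = True         # unknown state: block all locking
        if not self.fence_works:
            return False                # fence can't succeed, cluster stays hung
        # Fence succeeded: the node is now known to be out of the cluster.
        self.members.discard(node)
        lost = [s for s, n in self.services.items() if n == node]
        for s in lost:
            del self.services[s]        # these services now need recovery
        self.dlm_blocked = False
        return True


def demo(fence_works):
    cluster = ToyCluster(["node1", "node2"], fence_works)
    cluster.start_service("node1", "service_a")   # node1 gets the lock
    cluster.start_service("node2", "service_a")   # refused, already running
    cluster.node_goes_silent("node1")             # node1 fails
    recovered = cluster.start_service("node2", "service_a")
    return recovered, cluster.services.get("service_a")


print(demo(fence_works=True))    # (True, 'node2')
print(demo(fence_works=False))   # (False, 'node1')
```

With a working fence, Node 2 recovers the service. Without one, locking
stays blocked and the service is stuck on a node nobody can account for.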

> I do not have any sofisticated fencing devices on my network - so it
> would not help anyways ??

Wrong. Without fencing, you can not have a stable cluster. In clustering:

 "The only thing you know is what you don't know."

As soon as a node goes silent, you can only know that it has stopped
responding. Has it hung and will it come back? Has it completely powered
off? You can't guess.

Fencing puts the silent node into a known state. That is, it is either
disconnected from the cluster's network or forced off. Only then can the
state of the silent node be known. Until its state is known, the
cluster can not operate safely.

This is a general high-availability cluster concept.

> Or do you have a solution in that scenerio ??

Yup, you can get a switched PDU. The APC brand PDUs are very good and
very well supported using the 'fence_apc_snmp' fence agent.

I've used this one in many clusters (as a backup to iLO/IPMI based
fencing). It's just fine as a primary fence agent as well.

http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900

> Have you used Heartbeat ?

A long time ago. The heartbeat project is effectively deprecated.
Linbit, the company behind DRBD, has taken over the project and has
announced that it plans no further development. They are maintaining it
as bugs are found, but that is it.

Both RHCS and the Pacemaker project are primarily built on corosync now.

> It looks a lot less complicated then RH Cluster and seems like there
> are more docs and vidoes on the net.

I think you are referring to Pacemaker. That is the resource management
layer. Whether it is simpler or not is, of course, a matter of
perspective. That said, Pacemaker is a perfectly good clustered
resource manager and I have no reason to argue against it. I just can't
help with it as I'm mostly familiar with Red Hat's current cluster suite.

> Do you have a simple cluster.conf file that I can use to see if I am
> setting this up correctly?

"Simple", no. I do have an extensive tutorial though.

https://alteeve.com/w/Red_Hat_Cluster_Service_2_Tutorial

It's for EL5 and RHCS Stable 2, and the current version is stable 3, but
the configuration is all but the same. The only (visible) changes are the
way the config file is validated (ccs_config_validate instead of the
xmllint call) and how updated versions are pushed out to the rest of the
cluster ('cman_tool version -r' instead of 'ccs_tool update
/etc/cluster/cluster.conf').

> I do not see the any of my shared services when i look at my node in
> the members tab.

Is rgmanager running? What does 'clustat' show?

-- 
Digimer
E-Mail:              digimer at alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"