
RE: [Linux-cluster] Recommended HP servers for cluster suite



Cosimo, fencing takes place any time a condition exists where "the cluster"
cannot communicate with a node, or cannot guarantee the state of a particular
node. Others can surely do the question more justice, but in a nutshell that's
it. To test this, simply pull the network cable from a node. The others will
not be able to status it and it will get fenced. The same thing happens if you
call 'fence_node node01' or whatever your nodes are named. In that case the
machine will actually be booted _twice_: once from the command, then again
when the cluster decides it's no longer talking. I think the fence command
should at least have an option to inform the cluster that a node was fenced,
but it's not a big deal.
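As a quick sketch of the test described above (the node name is just an
example; use whatever names appear in your cluster.conf):

```shell
# Manually fence a node by its cluster name; its fence device
# (iLO, in this thread) will power-cycle it.
fence_node node01

# Or simulate a failure: pull the network cable on node01, then
# watch the surviving nodes lose contact and fence it. fenced logs
# its activity to syslog on the remaining members:
tail -f /var/log/messages | grep -i fence
```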


If oom-killer is killing your cluster nodes, I think you're out of luck. GFS
can gobble memory, in my experience; more is better. Also, in GFS 6.0x there
is a bug that causes system RAM to be exhausted by GFS locks. The newest
release has a tunable parameter, "inoded_purge", which lets you set the
percentage of locks to try to purge on each periodic pass. This helped me a
LOT. I was having nodes hang because they could not fork.
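For reference, GFS tunables are set per mount point with gfs_tool; the mount
point and percentage below are only examples, so check the release notes for
your version before copying them:

```shell
# Purge up to 50% of unused inode locks on each inoded pass
# (mount point and value are illustrative examples).
gfs_tool settune /mnt/gfs inoded_purge 50

# Show the current tunables to verify the setting took effect.
gfs_tool gettune /mnt/gfs
```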

BTW, if the GFS folks are reading this, I'd like to make a suggestion. I have
not gone code diving yet, but it seems that if the mechanism for a node to
respond actually spawned a thread, or did anything else that required the
system to be able to fork, then systems that are starved of memory would
indeed get fenced, since the "OK" response would never get back to the
cluster. I realize that doesn't FIX anything per se, but it would prevent the
system from hanging for any length of time.

On the start/stop of SAN resources, what exactly do you mean? It sounds like
you are talking about what happens when the qlogic drivers load and unload.
If that's the case, you need to properly set up zoning on your fibre switch.
Loading or unloading the qlogic drivers causes a SCSI reset to be sent along
the bus, which in the case of fibre channel means every device in the fabric.
You need to set up an individual zone for your storage ports, then zones
which each pair one host's ports with the storage. So on a 5-node cluster
you'd end up with 6 zones: one for the storage alone, and 5 host/storage
combos. Then make them all part of the active config. That way any SCSI
resets are not seen by the other nodes' HBAs. I had lost connections to the
storage from SCSI resets that were causing nodes to go down, not good....
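On a Brocade-style switch, the layout described above might look something
like this (the alias names are made up for illustration, and the exact CLI
varies by switch vendor and firmware):

```shell
# One zone containing only the storage ports...
zonecreate "z_storage", "storage_p0; storage_p1"

# ...and one zone per host, pairing that host's HBA with the storage.
zonecreate "z_node01", "node01_hba0; storage_p0; storage_p1"
zonecreate "z_node02", "node02_hba0; storage_p0; storage_p1"
# (repeat for the remaining hosts)

# Put all the zones into one config and activate it.
cfgcreate "cluster_cfg", "z_storage; z_node01; z_node02"
cfgenable "cluster_cfg"
```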


Heartbeat should not need any tweaking if everything else is working. Not to
say you can't tune it to your situation, just that it should be fine with the
default settings while you get things stable.



Hope this helps


Corey



-----Original Message-----
From: linux-cluster-bounces redhat com
[mailto:linux-cluster-bounces redhat com] On Behalf Of Cosimo Streppone
Sent: Monday, May 08, 2006 3:31 PM
To: linux clustering
Subject: Re: [Linux-cluster] Recommended HP servers for cluster suite

Kovacs, Corey J. wrote:

> iLO fencing works just fine.
> [...]
> If you are using RHEL4 + GFS 6.1, then it is simpler since the
> config is expected to be in the same file etc.
>
> [...]

I seem to have got past the SSL modules installation, so that is not the
problem.

Thanks for sharing your experience, but I admit I still haven't understood
when fencing takes place. What are the conditions that trigger fencing?

> Any specific problem you are having?

Yes.
The main problem is that I'm now beginning to find my way through RHCS4. :-)
Other random problems that I had:

- oom-killer kernel thread killed my ccs daemon, causing the
   entire two-node cluster to suddenly become unmanageable;
- start/stop of shared filesystem resources (SAN) is causing errors
   and is therefore not managed properly;
- don't know how to properly configure heartbeat;

I know these are not iLO problems. In fact, I'm trying to solve one problem
at a time, and I don't know if iLO fencing can be the cause of these problems.

I need to do some more research. I'll be back with more useful info.

--
Cosimo

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster

