Re: [Linux-cluster] RHEL3 Cluster network hangup

On Fri, 2005-07-08 at 08:27 +0200, Gunther Schlegel wrote:

> I have been running 1.2.22.

Yup, that fixed the status problem, but...

> > Also, the most recent errata fixed a signal handling problem which
> > broke JVMs from running under it.

> There have not been any log messages.
> > I'd try the latest release from RHN (clumanager-

... it is very important to note that JVMs weren't the only thing that
broke because of the signal bug.

The signal bug was not fixed until the latest errata.  Some
processes use signals to communicate and avoid deadlocks or blocking,
but if the signals are blocked, they don't help much with either.

As an example - a process which calls alarm(5) to set a timer to wake
itself up right before it calls, say, a blocking select().  5 seconds
later, SIGALRM comes in - but because it is blocked, the process gets
stuck in select() forever.
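
To make that failure mode concrete, here is a minimal sketch in Python
(not clumanager code, just an illustration).  The names wait_with_alarm
and AlarmFired are made up for this example, the timer is shortened to
1 second, and select() gets a safety timeout so the demo ends instead
of hanging forever.  It assumes a POSIX system and is run from the main
thread.

```python
import os
import select
import signal


class AlarmFired(Exception):
    """Hypothetical exception raised by the SIGALRM handler to leave select()."""


def _on_alarm(signum, frame):
    # Raising here makes the interrupted select() call propagate the exception.
    raise AlarmFired()


def wait_with_alarm(block_sigalrm, alarm_secs=1, safety_timeout=2.0):
    """Arm an alarm, then wait in select() on a pipe that never becomes readable.

    Returns "woken" if SIGALRM interrupted select(), or "stuck" if select()
    only came back because of the safety timeout (an assumption added so the
    demo terminates; the process described above has no such timeout).
    """
    read_fd, write_fd = os.pipe()  # nothing is ever written to this pipe
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    how = signal.SIG_BLOCK if block_sigalrm else signal.SIG_UNBLOCK
    old_mask = signal.pthread_sigmask(how, {signal.SIGALRM})
    try:
        signal.alarm(alarm_secs)
        select.select([read_fd], [], [], safety_timeout)
        return "stuck"
    except AlarmFired:
        return "woken"
    finally:
        signal.alarm(0)
        # Ignoring the signal discards a still-pending SIGALRM, so restoring
        # the old mask below does not deliver it late.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(
            signal.SIGALRM,
            old_handler if old_handler is not None else signal.SIG_DFL,
        )
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("SIGALRM unblocked:", wait_with_alarm(block_sigalrm=False))
    print("SIGALRM blocked:  ", wait_with_alarm(block_sigalrm=True))
```

With SIGALRM unblocked the handler runs and the wait ends after about a
second.  With it blocked the signal just sits pending, and only the
safety timeout gets the process out of select().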

> > If that doesn't work, I'd call Red Hat Support...
> While calling support is always an option, I am pretty much sure that it 
> will not lead to a solution. In the end they will not be able to 
> reproduce it and I can't test on a customer's production system.

I suspect that the first thing they would have you do is try the latest
errata from RHN (which fixes the signal problem):


(Yeah, it's that bad.)

... which is why I recommended trying it *before* calling support.

> Do not point me to test systems -- they are there, but they do not have 
> the problem. Seems to be related to the workload of the machine, which 
> is hard to simulate.

> Hmm, I will probably not start up the cluster again... :(

(snipped from earlier)

Use your own judgment, and make the choices that are right for you and
your customer, whatever they are.  I am sorry I could not be more
helpful.

Good luck.

-- Lon

