Re: [Linux-cluster] RHEL3 Cluster network hangup

On Wed, 2005-07-06 at 08:30 +0200, Gunther Schlegel wrote:
> The clustered application does a lot of printing (lprng),
> faxing (hylafax) and mailing (sendmail). It uses shell scripts to pass the
> jobs to the operating system's daemons.

> The client programs of these daemons, which pass jobs to the daemons
> using network connections to localhost, start to behave irregularly when
> the cluster has been up for about 2 weeks.

> Examples:
> - hylafax faxstat stops listing the transmitted faxes in the middle of 
> the list ( but always at the same job )
> - sendmail opens a connection to the local daemon but does not transfer
> the message. Both processes sit there and wait; after some time the
> server closes the connection because of missing input from the client's side.
> - same with lpr.
> I assume that something locks up in the IP stack. Not all services are
> affected at the same time.
> I guess this is related to the cluster software, as we run that
> application on a lot of servers, none of which are clustered and none
> of which show this behaviour.

I doubt it, but it's not out of the realm of possibility.  The cluster
software does three things mostly:

(a) figures out who's online
(b) shoots nodes
(c) manages services using shell scripts

The shell scripts call standard utilities (ifconfig, route, etc.).

Now -- here's the thing.  Earlier versions of clumanager (<1.2.22) had a
problem where sometimes (and randomly!), services would get a bogus
status return and restart on the same node.  Also, the most recent
errata fixed a signal handling problem which kept JVMs from running
under it.  Either of these may have caused the problems on your cluster,
I don't know.  The former would have associated log messages; the latter
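
To check whether a node predates the 1.2.22 fix mentioned above, something
like the following rough sketch could be used.  It assumes Python 3 (so it
would run on a workstation, not on the RHEL3 box itself) and that you paste
in the version string reported by "rpm -q clumanager" by hand.  The function
names are made up for illustration and are not part of clumanager.

```python
# Versions below this had the bogus-status-return problem described above.
FIXED_IN = (1, 2, 22)


def parse_version(text):
    """Turn a dotted version such as '1.2.16' into (1, 2, 16).

    Assumption: the string holds only dot-separated integers, i.e. the
    RPM release suffix has already been stripped off.
    """
    return tuple(int(part) for part in text.strip().split("."))


def predates_fix(version_text):
    """Return True if this clumanager version is older than 1.2.22."""
    return parse_version(version_text) < FIXED_IN


if __name__ == "__main__":
    # Example version strings only; substitute what your nodes report.
    for candidate in ("1.2.16", "1.2.22"):
        print(candidate, "predates fix:", predates_fix(candidate))
```

This only says whether the package is older than the fix; it says nothing
about whether that bug is what you are actually hitting.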

I'd try the latest release from RHN (clumanager-

If that doesn't work, I'd call Red Hat Support...

-- Lon

