[Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2

Fri Jun 5 17:26:37 UTC 2009

On Fri, 2009-06-05 at 13:20 -0400, Charlie Brady wrote:
> On Fri, 5 Jun 2009, Steven Dake wrote:
> 
> > On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote:
> >> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote:
> >>>
> >>> On Fri, 5 Jun 2009, David Teigland wrote:
> >>>
> >>>> On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote:
> >>>>>
> >>>>> On Fri, 5 Jun 2009, David Teigland wrote:
> >>>>>
> >>>>>> They are all complaining that the cluster is down, which is a polite way
> >>>>>> of saying that aisexec has died/crashed/failed/killed/gone-away.
> >>>>>
> >>>>> Thanks. Why might that have occurred? Where would I look for clues? How
> >>>>> can I increase logging output from aisexec?
> >>>>
> >>>> If you're lucky it'll leave a core file, otherwise aisexec is notorious for
> >>>> disappearing without leaving any clues about why.
> >>>
> >>> That's very disconcerting to hear. Doesn't sound like HA. :-(
> >>
> >> To clarify, aisexec does not often disappear, it's very reliable.  The point
> >> was that in the rare case when it does, it's notorious for not leaving any
> >> reasons behind.
> >>
> >> Dave
> >>
> >
> > 99.9% of the time there would be a core file in /var/lib/openais/core*
> > if aisexec faults.
> 
> Only file I have there is named.
> 
> ringid_10.39.171.212
> 
> >  We have not seen faults during normal operations for
> > years in a released version under typical gfs2 usage scenarios.  If
> > there is no core, it means some other component failed, exited, and
> > caused that node to be fenced, or the core file could not be written by
> > the OS because of some other OS specific failure.  Another option is
> > that the OOM killer killed aisexec.
> 
> No sign of the oom killer in the log I quoted yesterday.
> 
> >  I would have a hard time believing
> > aisexec would crash without a core file while the operating system was
> > still functional.
> >
> > In the trunk we are enhancing our failure analysis to do fulltime event
> > tracing so failures can be debugged more rapidly than looking at a core
> > file.  I hope that helps.
> 
> Thanks.
> 
> I'll try to reproduce the scenario. Meanwhile I'm still looking for hints 
> as to how to get more visibility of what is happening.

Some users change their default core file storage location, which
overrides the default used by openais.  Another possibility is that
selinux is enabled.  aisexec integration with selinux needs more work,
and selinux might prevent a core file from being written.
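
To see whether the core file location has been changed, something along
these lines may help.  It is only a rough sketch: it assumes a Linux
kernel that exposes /proc/sys/kernel/core_pattern, and the function
names are made up for illustration, not part of openais.

```python
def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value (semantics per core(5))."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # Core dumps are piped to a helper program instead of a file.
        return "piped"
    if pattern.startswith("/"):
        # Core dumps are written to a fixed absolute location.
        return "absolute"
    # Core dumps are written relative to the crashing process's
    # current working directory.
    return "relative"


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the current system-wide core pattern (assumed Linux path)."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    current = read_core_pattern()
    print(current, "->", describe_core_pattern(current))
```

If the result is "piped" or "absolute", the core file may have ended up
somewhere other than /var/lib/openais.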

You can check selinux by looking at /etc/selinux/config.  If SELINUX is
set to enforcing, that may be your culprit.  In permissive mode selinux
only logs denials instead of blocking them, so it is a much less likely
cause.
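
A minimal sketch of that check, assuming the usual KEY=value layout of
/etc/selinux/config (selinux_mode is a hypothetical helper name).  Note
that the file only gives the mode configured for boot; the getenforce
command reports the mode currently in effect.

```python
def selinux_mode(config_text):
    """Return the SELINUX= value from config text, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a KEY=value line.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Match SELINUX exactly, so SELINUXTYPE= is not picked up.
        if key.strip() == "SELINUX":
            return value.strip().lower()
    return None


if __name__ == "__main__":
    with open("/etc/selinux/config") as f:
        print("configured selinux mode:", selinux_mode(f.read()))
```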

Regards
-steve