Re: [Linux-cluster] "dlm_controld[nnnn]: cluster is down, exiting" on node1 when starting node2



On Fri, 2009-06-05 at 11:49 -0500, David Teigland wrote:
> On Fri, Jun 05, 2009 at 12:50:57PM -0400, Charlie Brady wrote:
> > 
> > On Fri, 5 Jun 2009, David Teigland wrote:
> > 
> > >On Fri, Jun 05, 2009 at 11:42:59AM -0400, Charlie Brady wrote:
> > >>
> > >>On Fri, 5 Jun 2009, David Teigland wrote:
> > >>
> > >>>On Thu, Jun 04, 2009 at 04:23:13PM -0400, Charlie Brady wrote:
> > >>>>Jun  4 10:55:34 sun4150node1 dlm_controld[7916]: cluster is down, exiting
> > >>>>Jun  4 10:55:34 sun4150node1 fenced[7910]: cluster is down, exiting
> > >>>>Jun  4 10:55:34 sun4150node1 gfs_controld[7922]: cluster is down, exiting
> > >>>>Jun  4 10:55:35 sun4150node1 qdiskd[8128]: <err> cman_dispatch: Host is down
> > >>>
> > >>>They are all complaining that the cluster is down, which is a polite way
> > >>>of saying that aisexec has died/crashed/failed/killed/gone-away.
> > >>
> > >>Thanks. Why might that have occurred? Where would I look for clues? How
> > >>can I increase logging output from aisexec?
> > >
> > >If you're lucky it'll leave a core file, otherwise aisexec is notorious for
> > >disappearing without leaving any clues about why.
> > 
> > That's very disconcerting to hear. Doesn't sound like HA. :-(
> 
> To clarify, aisexec does not often disappear, it's very reliable.  The point
> was that in the rare case when it does, it's notorious for not leaving any
> reasons behind.
> 
> Dave
> 

99.9% of the time there would be a core file in /var/lib/openais/core*
if aisexec faults.  We have not seen faults during normal operations for
years in a released version under typical gfs2 usage scenarios.  If
there is no core, it means some other component failed, exited, and
caused that node to be fenced, or the core file could not be written by
the OS because of some other OS-specific failure.  Another option is
that the OOM killer killed aisexec.  I would have a hard time believing
aisexec would crash without a core file while the operating system was
still functional.
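
The two checks above (look for a core file, then look for the OOM killer)
can be sketched roughly as follows. This is only a minimal illustration,
not a tool shipped with openais: the helper names find_core_files and
find_oom_lines are made up here, the log location /var/log/messages is an
assumption about the distribution, and the "Out of memory ... (name)"
wording is an assumption about how the kernel reports an OOM kill.

```python
import glob
import os
import re


def find_core_files(core_dir="/var/lib/openais"):
    """Return files matching core* in core_dir, newest first."""
    paths = glob.glob(os.path.join(core_dir, "core*"))
    return sorted(paths, key=os.path.getmtime, reverse=True)


def find_oom_lines(log_lines, process="aisexec"):
    """Return log lines that look like OOM-killer reports naming process."""
    # Assumed wording: "Out of memory: Kill(ed) process <pid> (<name>)".
    pattern = re.compile(
        r"out of memory.*\(" + re.escape(process) + r"\)", re.IGNORECASE
    )
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


if __name__ == "__main__":
    cores = find_core_files()
    print("core files:", cores or "none found")

    log_path = "/var/log/messages"  # assumption: syslog location on this node
    try:
        with open(log_path, errors="replace") as log:
            hits = find_oom_lines(log)
    except OSError as exc:
        hits = []
        print("could not read", log_path, "-", exc)
    print("OOM-killer lines naming aisexec:", hits or "none found")
```

If both checks come back empty, that points toward the remaining
possibilities listed above: another component failing and the node being
fenced, or the OS being unable to write the core file.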

In the trunk we are enhancing our failure analysis to do full-time event
tracing so failures can be debugged more rapidly than by looking at a core
file.  I hope that helps.

regards
-steve
> --
> Linux-cluster mailing list
> Linux-cluster redhat com
> https://www.redhat.com/mailman/listinfo/linux-cluster

