
Re: [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...



On Tue, 2006-06-20 at 14:43 -0500, Robert Peterson wrote:
> David Teigland wrote:
> > Might be a good idea, I don't really know.  I'm not even sure we'd need to
> > save much or any additional state that couldn't be pulled from the gfs/dlm
> > instances themselves.  It seems to me the challenge would be writing the
> > daemons so they could put all the pieces and interconnections back
> > together again.
> >
> > If this ends up being a big enough problem to get more attention, I think
> > the first practical improvement we could make is something like
> > blocking/clearing i/o from the residual fs's (like we do in withdraw) and
> > adding the ability to fully purge instances of gfs/dlm from the kernel
> > without rebooting the node.  Then the machines could all start from
> > scratch without rebooting or fencing.
> Here's another idea that came to me:
> 
> For critical cluster processes like cman and fenced, maybe we could
> use init's ability to restart processes, i.e. the "respawn" option in
> /etc/inittab.  Maybe we can use "respawn" or something similar to
> ensure that if a critical process like fenced dies, it gets restarted
> automatically and immediately.  Of course, that might cause problems
> for shutdown, etc., and it would probably make it harder to test
> certain things...
> 

These daemons should have zero segv bugs and should never crash.  If
they do crash, I believe we want to hear about it from the community
and understand why, so the bugs can be fixed, rather than papering
over them with an init restart workaround.
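
For reference, the respawn mechanism Bob describes would amount to an
/etc/inittab entry roughly like the one below.  This is only a sketch:
the id field, runlevels, and daemon path are guesses, and fenced would
need to stay in the foreground (the -D debug/no-fork option is assumed
here) so that init can track the process.

  # hypothetical inittab entry: restart fenced whenever it exits
  # (id, runlevels, path, and the -D foreground flag are assumptions)
  fen:345:respawn:/sbin/fenced -D

Note too that sysvinit throttles an entry that keeps dying ("respawning
too fast"), so a crash loop would still leave the node without a fence
daemon for several minutes.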

cman runs as a component within aisexec, and it is our goal with that
project to ensure there are zero crash bugs in the code.  If a node
crashes, we will surely hear complaints about it from the community.
Of course that is difficult with 55k lines of C code, but achievable.

The one problem we tend to have is OOM, since handling OOM gracefully in
a distributed system is exceedingly complex.
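
As a rough illustration of why: locally, an allocation failure is easy
to detect, but by that point the rest of the cluster may already depend
on state this node can no longer build, so the only safe local reaction
is usually to back off or leave the cluster.  The sketch below is
hypothetical C, not code from cman or aisexec; the names and structure
are assumptions.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical membership record; not taken from the real sources. */
  struct member {
      int nodeid;
      struct member *next;
  };

  static struct member *members;

  /* The local part of "handling OOM" is trivial: check the allocation
   * and report failure.  The hard, distributed part is what the caller
   * does next, since peers already believe this node tracks the new
   * member. */
  static int add_member(int nodeid)
  {
      struct member *m = calloc(1, sizeof(*m));
      if (!m)
          return -1;
      m->nodeid = nodeid;
      m->next = members;
      members = m;
      return 0;
  }

  int main(void)
  {
      if (add_member(42) < 0) {
          /* In practice many daemons simply exit here and let
           * cluster recovery (and fencing) clean up. */
          fprintf(stderr, "out of memory, leaving cluster\n");
          return 1;
      }
      return 0;
  }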

Regards
-steve

> Bob Peterson
> Red Hat Cluster Suite
> 

