[Linux-cluster] Using cman,etc for a non-gfs app

Thu Jun 23 15:23:39 UTC 2005

On Wed, 2005-06-22 at 18:36 -0400, Lon Hohberger wrote:
> On Wed, 2005-06-22 at 15:17 -0400, Olivier Crete wrote:

> > Our application is asymmetric: we have a (duplicated active-passive)
> > master server and work nodes. What I need from the cman is to know the
> > state of each node and notification when the state changes. The policy
> > decision (as to the fail-over, etc) would be taken by our master server.
> > From what I can see, cman/ucman can already do that.
> > 
> > But, I need to monitor the application (have some kind of application
> > heartbeat) so I can know if the app has deadlocked or segfaulted. And
> > inform the masters (active/passive) of what happened so they can take
> > the proper decision. 
> 
> If you use CMAN's service manager, you will be able to tell if the app
> has crashed (all nodes in that service group will be notified of the
> state change).

In the RHEL4 branch there does not seem to be a userspace API for the
service manager, apart from the ioctl and libmagma. Is libmagma your
long-term API? Also, can libmagma be used in non-GPL apps? I saw some
scary comments in magmamsg.h...

> Internal deadlocks are harder to detect from the cluster infrastructure
> perspective.  I'd consider using the kernel watchdog timer.
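
For reference, a minimal sketch of what petting the kernel watchdog from
the app could look like. It assumes the standard Linux /dev/watchdog
character device and the WDIOC_SETTIMEOUT ioctl from <linux/watchdog.h>;
the kernel_watchdog_* helper names are made up for illustration. If the
app stops kicking the device, the machine reboots once the timeout
expires (on drivers that support the "magic close" convention, an
unclean exit leaves the timer armed as well).

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open the watchdog device and try to set its timeout.
 * Returns the file descriptor, or -1 if the device cannot be opened.
 * 'path' would normally be "/dev/watchdog". */
int kernel_watchdog_open(const char *path, int timeout_sec)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* Not every driver lets userspace change the timeout, so a
     * failure here is ignored and the driver default stays in effect. */
    (void)ioctl(fd, WDIOC_SETTIMEOUT, &timeout_sec);
    return fd;
}

/* Reset the countdown. Call this from the application's main loop,
 * so that a deadlocked loop stops kicking. Returns 0 on success. */
int kernel_watchdog_kick(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Clean shutdown: write the magic character 'V' before closing so
 * that drivers supporting magic close disarm the timer. */
int kernel_watchdog_disarm(int fd)
{
    int rc = write(fd, "V", 1) == 1 ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Note that this reboots the local node on its own; it does not tell the
masters anything, so it complements a cluster-level notification rather
than replacing it.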

An easy way would be to have a cluster watchdog (i.e. the app must
"ping" the cman daemon at least once every X seconds, and if it doesn't,
it's considered deadlocked).
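
The bookkeeping for such a ping timeout is small. Below is a rough
sketch of the monitor side only, not an existing cman interface: the
app_watchdog struct and function names are hypothetical, and how the
ping actually reaches the daemon (socket, ioctl, ...) is left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-application watchdog state kept by the monitor. */
struct app_watchdog {
    time_t timeout_sec;         /* the "X seconds" from the text */
    struct timespec last_ping;  /* monotonic time of the last ping */
};

/* Record a ping at an explicit time (keeps the logic easy to test). */
void app_watchdog_ping_at(struct app_watchdog *wd, struct timespec now)
{
    wd->last_ping = now;
}

/* Record a ping now. The monotonic clock is used so that wall-clock
 * adjustments cannot cause a false timeout. */
void app_watchdog_ping(struct app_watchdog *wd)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    app_watchdog_ping_at(wd, now);
}

/* Start watching; the app is considered alive at registration time. */
void app_watchdog_init(struct app_watchdog *wd, time_t timeout_sec)
{
    wd->timeout_sec = timeout_sec;
    app_watchdog_ping(wd);
}

/* True once strictly more than timeout_sec has elapsed between the
 * last ping and 'now'. */
bool app_watchdog_expired_at(const struct app_watchdog *wd,
                             struct timespec now)
{
    time_t sec = now.tv_sec - wd->last_ping.tv_sec;
    long nsec = now.tv_nsec - wd->last_ping.tv_nsec;
    if (nsec < 0) {             /* borrow one second */
        sec -= 1;
        nsec += 1000000000L;
    }
    return sec > wd->timeout_sec ||
           (sec == wd->timeout_sec && nsec > 0);
}
```

The monitor would call app_watchdog_expired_at() periodically with the
current CLOCK_MONOTONIC time and report a state change to the masters
when it returns true.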

> > It would also be nice to have a library version of fence, but for now I
> > guess I can just system() fence_node (that does not use fenced, right?).
> > Or something like stonithd (from the linux-ha folks) where the fencing
> > equipment can be connected to different nodes, but be controlled in a
> > transparent way. And, I want to retain control from my app...
> 
> First off:  An application crashing shouldn't generally cause
> an eviction of the node from the cluster.  There should be other
> cleanup/coordination mechanisms in place.  Ok, that said:

Our application uses semi-shared storage, and if it crashes, it may
leave the storage in an unknown state. The easiest way out is just to
reboot the machine and have another machine take over the storage.

> * With libgulm, you can register as an "important" service: "If this
> process dies, evict & fence me."

But gulm is going away, right?

> * libmagma provides cp_fence() / clu_fence() which work on both CMAN and
> gulm.
> 
> * You can fork/exec the fence_node command.
> 
> > Oh yea, and I need something relatively stable before September too...
> > Can I do that with your stuff? 
>
>
> The other caveat was that you didn't want to be controlled by resource
> scripts / managers, right?

Ideally, I'd want to reduce the amount of forking, especially when a
critical event happens.
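
For comparison, the fork/exec option from the list above costs one
fork per fencing request. A minimal sketch, assuming fence_node takes
the node name as its only argument and signals success with a zero exit
status; run_fence_command() is a made-up helper, and the command is a
parameter so the helper can be tried without fencing anything.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "cmd node" as a child process and wait for it.
 * Returns the child's exit status (127 if the exec failed), or -1 if
 * the child could not be started or did not exit normally. */
int run_fence_command(const char *cmd, const char *node)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace ourselves with the fence command. */
        execlp(cmd, cmd, node, (char *)NULL);
        _exit(127);  /* only reached if execlp() failed */
    }

    /* Parent: wait for the child, retrying if a signal interrupts. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call would look like run_fence_command("fence_node", "node3"), where
"node3" is a placeholder node name.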

-- 
Olivier Crête
ocrete at max-t.com
Maximum Throughput Inc.