[Linux-cluster] CS4 script resource is looping on error

Thu Mar 30 16:00:47 UTC 2006

On Wed, 2006-03-29 at 12:59 +0200, Marco Lusini wrote:
>  
> As a test I have created a script that always returns an error on
> status, and set up a service with the failing script as the single
> resource.
>  
> ------- snip ------
> #!/bin/bash
> case "$1" in
>     start)
>         exit 0
>         ;;
>     stop)
>         exit 0
>         ;;
>     status)
>         exit 1
>         ;;
> esac
> ------ snip -------
>  
> Now my cluster keeps restarting that resource or, if I change the
> recovery policy, keeps relocating it forever. If I choose disable as
> the recovery policy, the resource gets disabled without even trying
> to restart/recover.
>  
> Is this the expected behaviour?

Yes.

> Is there a way to configure CS4 to first try to restart locally, then
> to try relocate and finally disable the service?

If it fails to start after a failure, it is relocated to another node. 

If it fails to start on all nodes, it is placed in the 'stopped' state.

Tracking the history of nodes where a service has started but at some
point failed is slightly difficult since node IDs are not guaranteed to
be static in linux-cluster right now: a node can leave and a different
node can join and take the vacant node ID.  It is, however, possible to
configure static node IDs in cluster.conf, which would help.

If you are worried about this particular state (where the service is
horribly broken and keeps moving around or restarting), you can have
your script check whether the service has just started and then crashed
locally; this will help.
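
A rough sketch of such a check, under stated assumptions: the state
directory, the MIN_UPTIME and HOLDOFF thresholds, and the function names
are all made up for illustration, and real_status is a placeholder for
your actual health check. The idea is to record when the service was
started, note when status fails shortly afterwards, and then refuse to
start on that node for a while so the start failure is handled instead
of another local restart.

```shell
#!/bin/bash
# Hypothetical wrapper script. STATE_DIR, MIN_UPTIME, HOLDOFF and the
# function names are assumptions for illustration, not rgmanager features.
STATE_DIR="${STATE_DIR:-/var/run/myservice}"
MIN_UPTIME="${MIN_UPTIME:-60}"   # failing sooner than this counts as "just started + crashed"
HOLDOFF="${HOLDOFF:-600}"        # how long to refuse local starts after such a crash

# Placeholder for the real health check; always fails, like the test script.
real_status() { return 1; }

do_start() {
    mkdir -p "$STATE_DIR" || return 1
    if [ -r "$STATE_DIR/crashed_early" ]; then
        now=$(date +%s)
        crashed=$(cat "$STATE_DIR/crashed_early")
        if [ $((now - crashed)) -lt "$HOLDOFF" ]; then
            # Recently started and crashed here: fail the start on this node.
            return 1
        fi
        rm -f "$STATE_DIR/crashed_early"
    fi
    # Remember when this start happened.
    date +%s > "$STATE_DIR/last_start"
    return 0
}

do_status() {
    real_status && return 0
    # Status failed: if the service was started only moments ago,
    # leave a timestamped marker for the next start attempt.
    if [ -r "$STATE_DIR/last_start" ]; then
        now=$(date +%s)
        started=$(cat "$STATE_DIR/last_start")
        if [ $((now - started)) -lt "$MIN_UPTIME" ]; then
            echo "$now" > "$STATE_DIR/crashed_early"
        fi
    fi
    return 1
}

do_stop() {
    # Keep the crash marker; only forget the start time.
    rm -f "$STATE_DIR/last_start"
    return 0
}

case "${1:-}" in
    start)  do_start;  exit $? ;;
    stop)   do_stop;   exit $? ;;
    status) do_status; exit $? ;;
esac
```

The marker lives on local disk on purpose, so each node keeps its own
record. Whether the hold-off should expire on its own or be cleared by
hand is a judgment call for your setup.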

You can also file a bugzilla / feature request -- it's certainly not
impossible to implement at some point.

-- Lon