[Linux-cluster] CS4 Update 2 & Patch watchdog on

Wed Sep 13 14:18:10 UTC 2006

On Wed, 2006-09-13 at 09:51 +0200, Alain Moulle wrote:
> >> The self-watchdog patch adds a process which monitors the "real"
> >> clurgmgrd.  The monitoring process should be the lower-numbered PID
> >> (it's the parent of the one doing the work).
> 
> >> The monitoring process watches for crash signals (SIGBUS, SIGSEGV,
> >> etc.), and will simply exit if you kill the child with SIGKILL.
> 
> >> So, basically, killing the higher-numbered PID with something like
> >> SIGSEGV should cause the node to reboot.
> 
> >> -- Lon
> 
> Thanks Lon, I understand.
> And if I kill -9 (SIGKILL) the higher-numbered PID for test purposes,
> is the node expected to reboot or not?
> 
> I see in the code:
>                 case SIGCHLD:
>                 case SIGILL:
>                 case SIGFPE:
>                 case SIGSEGV:
>                 case SIGBUS:
>                         setup_signal(i, SIG_DFL);
>                         break;
>                 default:
>                         setup_signal(i, signal_handler);
> but I can't tell what happens for a SIGKILL on the higher-numbered PID process ...

No, SIGKILL on the child will just cause the watchdog to kill itself too, without rebooting:

                if (waitpid(child, &status, 0) <= 0)
                        continue;

                if (WIFEXITED(status))
                        exit(WEXITSTATUS(status));

                if (WIFSIGNALED(status)) {
                        if (WTERMSIG(status) == SIGKILL) {
                                clulog(LOG_CRIT, "Watchdog: Daemon killed, exiting\n");
                                raise(SIGKILL);

Use something like SIGSEGV (e.g. to simulate a crash) and the
nanny/watchdog process should reboot the node.

-- Lon