[Linux-cluster] qdiskd + cman: trying to fix the use of quorumdev_poll.

Patrick Caulfield pcaulfie at redhat.com
Tue Jan 9 13:34:47 UTC 2007


Simone Gotti wrote:
> Hi all,
> 
> I'm using the openais based cman-2.0.35.el5 and I'm trying to understand
> how the quorum disk concept is implemented in RHCS. After various
> experiments I think I have found at least 2 problems:
> 
> Problem 1)
> 
> A small bug in the quorum disk polling mechanism:
> 
> Looking at the code in cman/daemon/commands.c, the variable
> quorumdev_poll = 10000 is expressed in milliseconds and is used to call
> "quorum_device_timer_fn" every quorumdev_poll interval to check whether
> qdiskd is informing cman that the node can use the quorum votes.
> 
> The same variable is then used in quorum_device_timer_fn, but here it's
> used as seconds:
> 
> if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {
> 
> So, when qdiskd dies or access to the quorum disk is lost, it takes
> 10000 seconds (more than 2 hours) to notice this and recalculate the quorum.
> 
> After changing the line:
> ========================================================================
> --- cman-2.0.35.orig/cman/daemon/commands.c  2007-01-07 21:01:30.000000000 +0100
> +++ cman-2.0.35.patched/cman/daemon/commands.c  2007-01-05 18:12:33.000000000 +0100
> @@ -1038,15 +1037,12 @@ static void ccsd_timer_fn(void *arg)
> 
>  static void quorum_device_timer_fn(void *arg)
>  {
>         struct timeval now;
>         if (!quorum_device || quorum_device->state == NODESTATE_DEAD)
>                 return;
> 
>         gettimeofday(&now, NULL);
> -       if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {
> +       if (quorum_device->last_hello.tv_sec + quorumdev_poll/1000 < now.tv_sec) {
>                 quorum_device->state = NODESTATE_DEAD;
>                 log_msg(LOG_INFO, "lost contact with quorum device\n");
>                 recalculate_quorum(0);
> ========================================================================
> 

Thanks. I've committed that version for now.
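
To make the unit mismatch concrete, here is a minimal standalone sketch of
the two comparisons from the patch. It is not the cman code itself: the
function names and the plain time_t arguments are assumptions made only for
illustration.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the unpatched check: poll_ms is in
 * milliseconds but is added to a value in seconds, so the default of
 * 10000 behaves like a 10000 second (about 2h46m) window. */
static bool device_stale_unpatched(time_t last_hello_sec, time_t now_sec,
                                   int poll_ms)
{
        return last_hello_sec + poll_ms < now_sec;
}

/* Hypothetical stand-in for the patched check: poll_ms is converted to
 * whole seconds first, so the default of 10000 gives a 10 second window. */
static bool device_stale_patched(time_t last_hello_sec, time_t now_sec,
                                 int poll_ms)
{
        return last_hello_sec + poll_ms / 1000 < now_sec;
}
```

With quorumdev_poll = 10000 and a last hello 11 seconds old, the patched
check reports the device as lost while the unpatched one does not.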

> it worked. A more precise fix would be the use of tv_usec/1000 instead
> of tv_sec.

True, it needs to take both into account. For the sake of time I've left the
granularity at seconds.
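
For reference, a rough sketch of what a millisecond-granularity check could
look like if both tv_sec and tv_usec are taken into account. This is not
what was committed; timeval_to_ms and hello_expired are hypothetical helper
names, and poll_ms stands in for quorumdev_poll.

```c
#include <stdbool.h>
#include <sys/time.h>

/* Hypothetical helper: convert a struct timeval to whole milliseconds,
 * using both the seconds and the microseconds fields. */
static long long timeval_to_ms(const struct timeval *tv)
{
        return (long long)tv->tv_sec * 1000 + tv->tv_usec / 1000;
}

/* Hypothetical helper: true when more than poll_ms milliseconds have
 * passed between last_hello and now. */
static bool hello_expired(const struct timeval *last_hello,
                          const struct timeval *now, int poll_ms)
{
        return timeval_to_ms(last_hello) + poll_ms < timeval_to_ms(now);
}
```

Compared with the seconds-only test, this would notice a missed hello up to
a second sooner, which probably matters little with a 10 second poll
interval.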
-- 

patrick



