[Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

Steven Whitehouse swhiteho at redhat.com
Tue Dec 20 10:39:08 UTC 2011


Hi,

On Mon, 2011-12-19 at 12:47 -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > >  struct lm_lockstruct {
> > >  	int ls_jid;
> > >  	unsigned int ls_first;
> > > -	unsigned int ls_first_done;
> > >  	unsigned int ls_nodir;
> > Since ls_flags and ls_first are also only boolean flags, they could
> > potentially be moved into the flags, though we can always do that later.
> 
> yes, I can use a flag in place of ls_first.
> 
> > > +	int ls_recover_jid_done; /* read by gfs_controld */
> > > +	int ls_recover_jid_status; /* read by gfs_controld */
> >                                           ^^^^^^^^^^^ this isn't
> > actually true any more. All recent gfs_controld versions take their cue
> > from the uevents, so this is here only for backwards compatibility
> > reasons and these two will be removed at some future date.
> 
> I'll add a longer comment saying something like that.
> 
> > > +	/*
> > > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > > +	 * before the recover_done and control_lock will be ready for us below.
> > > +	 * A delay here is not required but often avoids having to retry.
> > > +	 */
> > > +
> > > +	msleep(500);
> > Can we get rid of this then? I'd rather just wait for the lock, rather
> > than adding delays of arbitrary time periods into the code.
> 
> I dislike arbitrary delays also, so I'm hesitant to add them.
> The choices here are:
> - removing NOQUEUE from the requests below, but with NOQUEUE you have a
>   much better chance of killing a mount command, which is a fairly nice
>   feature, I think.
> - removing the delay, which results in nodes often doing fast+repeated
>   lock attempts, which could get rather excessive.  I'd be worried about
>   having that kind of unlimited loop sitting there.
> - using some kind of delay.
> 
> While I don't like the look of the delay, I like the other options less.
> Do you have a preference, or any other ideas?
> 
Well, I'd prefer to just remove the NOQUEUE flag in that case, so
that we don't spin here. The dlm request is async anyway, so we should
be able to wait for it in an interruptible manner and send a cancel if
required, something along the lines of the sketch below.
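Roughly what I have in mind (just a sketch, not tested; ls_control_lksb,
ls_sync_wait and sync_wait_cb are illustrative names here, substitute
whatever the patch already uses):

static void sync_wait_cb(void *arg)
{
	struct lm_lockstruct *ls = arg;

	complete(&ls->ls_sync_wait);
}

static int control_lock_wait(struct gfs2_sbd *sdp, int mode, uint32_t flags)
{
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
	int error;

	/* same request as before, but without DLM_LKF_NOQUEUE */
	error = dlm_lock(ls->ls_dlm, mode, &ls->ls_control_lksb,
			 flags & ~DLM_LKF_NOQUEUE,
			 "control_lock", strlen("control_lock"), 0,
			 sync_wait_cb, ls, NULL);
	if (error)
		return error;

	if (wait_for_completion_interruptible(&ls->ls_sync_wait)) {
		/*
		 * mount was killed: cancel the queued request and wait
		 * for the cancel ast before bailing out.  (If the cancel
		 * races with a grant, the lock would need dropping here.)
		 */
		dlm_unlock(ls->ls_dlm, ls->ls_control_lksb.sb_lkid,
			   DLM_LKF_CANCEL, &ls->ls_control_lksb, ls);
		wait_for_completion(&ls->ls_sync_wait);
		return -EINTR;
	}

	return ls->ls_control_lksb.sb_status;
}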

> 
> > > +static int control_first_done(struct gfs2_sbd *sdp)
> > > +{
> > > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > > +	char lvb_bits[GDLM_LVB_SIZE];
> > > +	uint32_t start_gen, block_gen;
> > > +	int error;
> > > +
> > > +restart:
> > > +	spin_lock(&ls->ls_recover_spin);
> > > +	start_gen = ls->ls_recover_start;
> > > +	block_gen = ls->ls_recover_block;
> > > +
> > > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > > +		/* sanity check, should not happen */
> > > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > > +		       start_gen, block_gen, ls->ls_recover_flags);
> > > +		spin_unlock(&ls->ls_recover_spin);
> > > +		control_unlock(sdp);
> > > +		return -1;
> > > +	}
> > > +
> > > +	if (start_gen == block_gen) {
> > > +		/*
> > > +		 * Wait for the end of a dlm recovery cycle to switch from
> > > +		 * first mounter recovery.  We can ignore any recover_slot
> > > +		 * callbacks between the recover_prep and next recover_done
> > > +		 * because we are still the first mounter and any failed nodes
> > > +		 * have not fully mounted, so they don't need recovery.
> > > +		 */
> > > +		spin_unlock(&ls->ls_recover_spin);
> > > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > > +		msleep(500);
> > Again - I don't want to add arbitrary delays into the code. Why is this
> > waiting for half a second? Why not some other length of time? We should
> > figure out how to wait for the end of the first mounter recovery some
> > other way if that is what is required.
> 
> This msleep slows down a rare loop to wake up a couple times vs once with
> a proper wait mechanism.  It's waiting for the next recover_done()
> callback, which the dlm will call when it is done with recovery.  We do
> have the option here of using a standard wait mechanism, wait_on_bit() or
> something.  I'll see if any of those would work here without adding too
> much to the code.
> 
Ok. That would be a better option I think.
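For example (again just a sketch; DFL_DLM_RECOVERY is a flag bit I've
invented for the purpose - recover_prep would set it and recover_done
would clear it):

static int dlm_recovery_wait(void *word)
{
	schedule();
	return 0;
}

	/* in recover_prep: a dlm recovery cycle has started */
	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);

	/* in recover_done: recovery has finished, wake any waiters */
	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
	smp_mb__after_clear_bit();
	wake_up_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY);

	/* in control_first_done, in place of the msleep(500): */
	wait_on_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY,
		    dlm_recovery_wait, TASK_UNINTERRUPTIBLE);
	/* then goto restart to re-check the generation numbers */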

> 
> > > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > > +				 unsigned int result)
> > > +{
> > > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > > +
> > > +	/* don't care about the recovery of own journal during mount */
> > > +	if (jid == ls->ls_jid)
> > > +		return;
> > > +
> > > +	/* another node is recovering the journal, give it a chance to
> > > +	   finish before trying again */
> > > +	if (result == LM_RD_GAVEUP)
> > > +		msleep(1000);
> > Again, lets put in a proper wait for this condition. If the issue is one
> > of races between cluster nodes (thundering herd type problem), then we
> > might need some kind of back off, but in that case, it should probably
> > be for a random time period.
> 
> In this case, while one node is recovering a journal, the other nodes will
> all try to recover the same journal (and fail), as quickly as they can.  I
> looked at using queue_delayed_work here, but couldn't tell if that was ok
> with zero delay... I now see others use 0, so I'll try it.
> 
Yes, that should be fine. Using queue_delayed_work with zero delay is
just the same as queuing a non-delayed work item. We use it all the time
for the glock workqueue.
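So the recovery_result path could become something like this (sketch
only; gfs2_control_wq and sd_control_work stand for whatever the patch
calls the control workqueue and work item):

static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
				 unsigned int result)
{
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;

	/* we don't care about the recovery of our own journal during mount */
	if (jid == ls->ls_jid)
		return;

	/* ...record the result for the control work to look at, as before... */

	/*
	 * LM_RD_GAVEUP means another node is recovering this journal,
	 * so back off for a second before the control work retries it,
	 * otherwise requeue immediately.
	 */
	queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work,
			   result == LM_RD_GAVEUP ? HZ : 0);
}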

> 
> > > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > > +				  &ops, &ls->ls_dlm);
> > > +
> > > +	if (error == -EOPNOTSUPP) {
> > > +		/*
> > > +		 * dlm does not support ops callbacks,
> > > +		 * old dlm_controld/gfs_controld are used, try without ops.
> > > +		 */
> > > +		fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > > +		free_recover_size(ls);
> > > +
> > > +		error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > > +					  NULL, &ls->ls_dlm);
> > > +		if (error)
> > > +			fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > > +		return error;
> > > +	}
> > > +
> > Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> > If we do need to know whether the dlm supports the recovery ops, then
> > let's just make it signal that somehow (e.g. returns 1 so that >= 0 means
> > success and -ve means error). It doesn't matter if we don't call
> > free_recover_size until umount time I think, even if the dlm doesn't
> > support that since the data structures are fairly small.
> 
> I went with this because I thought it was simpler than adding a second
> return value for the ops status.  It would also let us simply drop the
> special case in the future.  The alternative is:
> 
> int dlm_new_lockspace(const char *name, const char *cluster,
>                       uint32_t flags, int lvblen,
>                       struct dlm_lockspace_ops *ops, void *ops_arg,
>                       int *ops_error, dlm_lockspace_t **lockspace);
> 
> I'm willing to try that if you think it's clearer to understand.
> 
> Dave
> 

Yes, that's fine. Adding an extra arg to dlm_new_lockspace is not a big
deal since it is not in the fast path at all.
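The caller side would then presumably look something like this (sketch;
I've assumed sdp as the ops_arg, an out parameter called ops_result, and
that a negative value there means the ops are not supported):

	int ops_result;

	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
				  &ops, sdp, &ops_result, &ls->ls_dlm);
	if (error) {
		fs_err(sdp, "dlm_new_lockspace error %d\n", error);
		return error;
	}

	if (ops_result < 0) {
		/*
		 * dlm does not support the ops callbacks (old
		 * dlm_controld/gfs_controld in use), so fall back to the
		 * compatibility path; the lockspace itself is usable.
		 */
		fs_info(sdp, "dlm lockspace ops not used %d\n", ops_result);
		free_recover_size(ls);
	}

which also lets us drop the second dlm_new_lockspace call.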

Steve.





