[dm-devel] [PATCH v3 01/19] libmultipath: fix tur checker timeout

Martin Wilck mwilck at suse.com
Mon Oct 1 19:51:22 UTC 2018


On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> The code previously was timing out mode if ct->thread was 0 but
> ct->running wasn't. This combination never happens.  The idea was to
> timeout if for some reason the path checker tried to kill the thread,
> but it didn't die.  The correct thing to check for this is ct-
> >holders.
> ct->holders will always be at least one when libcheck_check() is
> called,
> since libcheck_free() won't get called until the device is no longer
> being checked. So, if ct->holders is 2, that means that the tur
> thread
> is has not shut down yet.
> 
> Also, instead of returning PATH_TIMEOUT whenever the thread hasn't
> died,
> it should only time out if the thread didn't successfully get a
> value,
> which means the previous state was already PATH_TIMEOUT.

What about PATH_UNCHECKED?

> 
> Signed-off-by: Benjamin Marzinski <bmarzins at redhat.com>
> ---
>  libmultipath/checkers/tur.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/libmultipath/checkers/tur.c
> b/libmultipath/checkers/tur.c
> index bf8486d..275541f 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
>  		}
>  		pthread_mutex_unlock(&ct->lock);
>  	} else {
> -		if (uatomic_read(&ct->running) != 0) {
> -			/* pthread cancel failed. continue in sync mode
> */
> -			pthread_mutex_unlock(&ct->lock);
> -			condlog(3, "%s: tur thread not responding",
> -				tur_devt(devt, sizeof(devt), ct));
> -			return PATH_TIMEOUT;
> +		if (uatomic_read(&ct->holders) > 1) {
> +			/* pthread cancel failed. If it didn't get the
> path
> +			   state already, timeout */
> +			if (ct->state == PATH_PENDING) {
> +				pthread_mutex_unlock(&ct->lock);
> +				condlog(3, "%s: tur thread not
> responding",
> +					tur_devt(devt, sizeof(devt),
> ct));
> +				return PATH_TIMEOUT;
> +			}
>  		}

I'm not quite getting it yet. We're entering this code path if 
ct->thread is 0. As you point out, if there are >1 holders, a
previously cancelled TUR thread must be still "alive". But now we want
to check *again*. The previous code refused to do that - a single
hanging TUR thread could block further checks from happening, forever.
The PATH_TIMEOUT return code was actually questionable - in this
libcheck_check() invocation, we didn't even try to check, so nothing
could actually time out (but that's obviously independent of your
patch).

With your patch, if ct->state != PATH_PENDING, we'd try to start
another TUR thread although there's still a hanging one. That's not
necessarily wrong, but I fail to see why it depends on ct->state.
Anyway, if we do this, we take the risk of starting more and more TUR
threads which might all end up hanging; I wonder if we should stop
creating more threads if ct->holders grows further (e.g. > 5).

Regards,
Martin


>  		/* Start new TUR checker */
>  		ct->state = PATH_UNCHECKED;

-- 
Dr. Martin Wilck <mwilck at suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)





More information about the dm-devel mailing list