
Re: [dm-devel] RFC for multipath queue_if_no_path timeout.



On Thu, Oct 17 2013 at  3:03pm -0400,
Frank Mayhar <fmayhar google com> wrote:

> Dragging this back up into the light...
> 
> On Thu, 2013-09-26 at 19:49 -0400, Mike Snitzer wrote:
> > Frank, I had a look at your patch.  It leaves a lot to be desired, I was
> > starting to clean it up but ultimately found myself agreeing with
> > Alasdair's original point: that this policy should be implemented in the
> > userspace daemon.
> 
> I've found and fixed a couple of bugs but I would still like to know
> what issues you had with the patch.  As I said before, I would be more
> than happy to clean it up.

I don't recall; I will let you know if/when I have time to look again.
 
> In the time since we had this discussion, by the way, we ran into a
> problem that a userspace daemon can't solve:  That of shutdown.  We ran
> into a number of failures in which systems were hung for hours.  It
> turned out that they were caused by a regular system shutdown.  Our
> backing store is network-based and networking was getting killed before
> applications (as is usually the case), leaving I/O outstanding on the
> device.  Since queue_if_no_path was set, the I/O wasn't dumped and our
> daemon was killed by shutdown very shortly thereafter so it couldn't
> recover (otherwise it would have cleaned things up).
> 
> With those I/Os sitting queued in multipath, with no network and no
> daemon to turn off queue_if_no_path, the systems just sat.  When we
> finally diagnosed this, we realized that the timeout would work
> perfectly to solve the problem, automatically turning queue_if_no_path
> off shortly after the network went away without depending on the
> intervention of the no-longer-running daemon.
> 
> So how do you guys deal with this failure scenario?

Shouldn't you wait for the application to shut down before ripping the
network out?  Seems odd to just throw away queued IO.

A proper shutdown sequence really should avoid this problem in general:
the multipath daemon would only be shut down once all mpath devices are
deactivated.

Then, if you still want to gracefully handle the case where there is no
network (and hence no paths) on shutdown, multipathd would still be
around to transition to a table that doesn't have queue_if_no_path.
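
As a rough illustration of what such a userspace policy could look like,
here is a minimal sketch (not multipathd's actual code) that waits a
bounded time for a path to return and then turns queueing off by sending
the dm-multipath "fail_if_no_path" message through dmsetup.  The function
names, the has_paths callback and the map name "mpatha" are assumptions
made up for the example; how path state is really detected is left to the
caller.

```python
import subprocess
import time


def fail_if_no_path_argv(map_name):
    """Build the dmsetup call that turns queueing off for one multipath map."""
    # "fail_if_no_path" is the dm-multipath target message that undoes
    # "queue_if_no_path"; the "0" is the sector used to select the target.
    return ["dmsetup", "message", map_name, "0", "fail_if_no_path"]


def disable_queueing_after(map_name, has_paths, timeout_s, poll_s=1.0,
                           run=subprocess.run, clock=time.monotonic,
                           sleep=time.sleep):
    """Wait up to timeout_s for a path to return; if none does, stop queueing.

    has_paths is a hypothetical caller-supplied callable that returns True
    when the map has at least one usable path again.  run, clock and sleep
    are injectable so the policy can be exercised without root or dmsetup.
    Returns True if queueing was disabled, False if a path came back first.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if has_paths():
            return False          # paths recovered, keep queueing enabled
        sleep(poll_s)
    # Timed out with no paths: queued I/O will now be failed, not held.
    run(fail_if_no_path_argv(map_name), check=True)
    return True
```

Calling disable_queueing_after("mpatha", my_path_check, timeout_s=30)
from a shutdown hook would be one way to use it, where my_path_check is
whatever path test the site already has.  It only helps if the process
running it outlives the network, which is the ordering point made above.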

