
Re: [dm-devel] RFC for multipath queue_if_no_path timeout.

Dragging this back up into the light...

On Thu, 2013-09-26 at 19:49 -0400, Mike Snitzer wrote:
> Frank, I had a look at your patch.  It leaves a lot to be desired, I was
> starting to clean it up but ultimately found myself agreeing with
> Alasdair's original point: that this policy should be implemented in the
> userspace daemon.

I've found and fixed a couple of bugs but I would still like to know
what issues you had with the patch.  As I said before, I would be more
than happy to clean it up.

In the time since we had this discussion, by the way, we ran into a
problem that a userspace daemon can't solve: shutdown.  We had a number
of failures in which systems hung for hours, and it turned out that
they were caused by a regular system shutdown.  Our backing store is
network-based and networking was getting killed before applications (as
is usually the case), leaving I/O outstanding on the device.  Since
queue_if_no_path was set, the I/O wasn't failed.  Our daemon, which
would otherwise have cleaned things up, was killed by shutdown very
shortly thereafter, so it couldn't recover.

With those I/Os sitting queued in multipath, with no network and no
daemon to turn off queue_if_no_path, the systems just sat.  When we
finally diagnosed this, we realized that the timeout would work
perfectly to solve the problem, automatically turning queue_if_no_path
off shortly after the network went away without depending on the
intervention of the no-longer-running daemon.
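
The timeout behavior described above can be sketched as a small model.
This is a toy illustration in Python, not the kernel patch: the class
name, the method names and the timeout_s parameter are all hypothetical,
and time is passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NoPathTimeout:
    """Toy model of a queue_if_no_path timeout (all names are hypothetical)."""

    timeout_s: float                       # assumed tunable, not taken from the patch
    queue_if_no_path: bool = True          # queue I/O while no path is usable
    no_path_since: Optional[float] = None  # when the last usable path went away

    def all_paths_down(self, now: float) -> None:
        # Start the clock only on the transition to "no usable paths".
        if self.no_path_since is None:
            self.no_path_since = now

    def path_restored(self) -> None:
        # A path came back before expiry: cancel the pending timeout.
        self.no_path_since = None

    def tick(self, now: float) -> bool:
        """Return True if queued I/O should now be failed instead of held."""
        expired = (
            self.no_path_since is not None
            and now - self.no_path_since >= self.timeout_s
        )
        if self.queue_if_no_path and expired:
            # Nobody needs to be alive to do this: the timer itself
            # turns queueing off once the paths have been gone too long.
            self.queue_if_no_path = False
        return self.no_path_since is not None and not self.queue_if_no_path
```

The point of the model is that the decision to stop queueing depends only
on elapsed time since the last path failed, so it still fires after the
daemon and the network are both gone.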

So how do you guys deal with this failure scenario?

Frank Mayhar
