[dm-devel] LSF: Multipathing and path checking question

Fri Apr 17 07:50:37 UTC 2009

Hi Mike,

Mike Christie wrote:
> Hey,
> 
> For this topic:
> 
> -----------------------
> Next-Gen Multipathing
> ---------------------
> Dr. Hannes Reinecke
> 
> ......
> 
> Should path checkers use sd->state to check for errors or availability?
> ----------------------
> 
> What was decided?
> 
> Could this problem be fixed or helped if multipath tools always sets the
> fast io fail tmo for FC or the replacement_timeout for iscsi?
> 
No, I already do this for FC (should be checking the replacement_timeout, too ...)

> If those are set then IO in the blocked queue and in the driver will get
> failed after fast io fail tmo/replacement_timeout seconds (driver has to
> implement a terminate rport IO callback and only mptfc does not now). So
> at this time, do we want to fail the path?
> 
> Or are people thinking that we want to fail the path when the problem is
> initially detected like when the LLD deletes the rport for fc for example?
> 
Well, the idea is the following:

The primary purpose of the path checkers is to check the availability of
the paths (my, that was easy :-).

And the main problem we have with the path checkers is that they use
actual SCSI commands to determine this, thereby incurring unrelated errors
(disk errors, delayed responses due to blocked-path behaviour or error
handling, etc.). So we have to invest quite a bit of logic to separate the
'true' path condition from unrelated errors, simply because we're checking
at the wrong level; the path state is maintained by the transport layer,
not by the SCSI layer.

So the suggestion here is to check the transport layer for the path states
and do away with the existing path_checker SG_IO mechanism.

The secondary use of the path checkers (determining inactive paths) will
have to be delegated to the priority callouts, which then have to arrange
the paths correctly.

FC Transport already maintains an attribute for the path state, and even
sends netlink events if and when this attribute changes. For iSCSI I have
to defer to your superior knowledge; of course it would be easiest if
iSCSI could send out the very same message FC does.
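
A minimal sketch of what "checking the transport layer" could look like
from user space, assuming the FC remote ports show up under
/sys/class/fc_remote_ports with a 'port_state' attribute whose value is
"Online" for a usable port. The function names and the rule that only
"Online" counts as usable are illustrative assumptions, not what
multipath-tools does; a real checker would rather listen for the netlink
events than poll.

```python
from pathlib import Path

# Assumed sysfs location of the FC remote-port objects.
FC_RPORT_DIR = Path("/sys/class/fc_remote_ports")


def rport_states(base=FC_RPORT_DIR):
    """Return {rport name: port_state string} for every rport under base."""
    base = Path(base)
    states = {}
    if not base.is_dir():
        return states
    for rport in sorted(base.iterdir()):
        try:
            states[rport.name] = (rport / "port_state").read_text().strip()
        except OSError:
            # Attribute missing, or the rport vanished while we were looking.
            continue
    return states


def path_usable(port_state):
    """Hypothetical policy: only an "Online" port is usable for new I/O."""
    return port_state == "Online"
```

Calling rport_states() and feeding each value to path_usable() gives a
path state without sending a single SCSI command down the path.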

> 
> 
> Also for this one:
> -----------------------
> How to communicate that a device went away:
> 1) send event to udev (uses netlink)
> -----------------------
> 
> Is this an event when dev_loss_tmo fires or when the LLD first detects
> something like a link down (or any event it might block the rport for),
> or would it be for when the fast io fail tmo fires (when the fc class is
> going to fail running IO and incoming IO), or would we have events for
> all of them?
> 
Currently the event is sent when the device itself is removed from sysfs.
And only then can we actually update the path maps and (possibly) switch
to another path. We cannot do anything while the path is blocked (i.e. while
dev_loss_tmo is active), as we require this interval to capture jitter on
the line.

So we have this state diagram:

sdev state:   RUNNING  <-> BLOCKED -> CANCEL
mpath state:  path up  <-> <stall> -> path down / remove from map

Notice the '<stall>' here; we cannot check the path state when the
sdev is blocked as all I/O will be queued. And also note that we
now lump two different multipath path states together; a path down
is basically always followed immediately by a path remove event.

However, when all paths are down (and queue_if_no_path is active) we might
run into a deadlock when a path comes back, as we might not have enough
memory to actually create the required structures.

The idea was to modify the state machine so that fast_io_fail_tmo is
made mandatory; when it fires, the sdev transitions into an intermediate
state 'DISABLED' and a netlink message is sent out.

sdev state:   RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
mpath state:  path up <-> <stall> <-> path down -> remove from map

This will allow us to switch paths early, i.e. when the sdev moves into
the 'DISABLED' state. But the path structures themselves are still alive,
so when a path comes back between 'DISABLED' and 'CANCEL' we won't
have an issue reconnecting it. And we could even allow dev_loss_tmo to be
set to infinity, thereby simulating the 'old' behaviour.
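
The proposed state machine can be written down as a small table-driven
sketch. The transitions and the sdev-to-mpath mapping are taken straight
from the diagram; the names TRANSITIONS, MPATH_STATE, step() and
should_switch_paths() are made up for illustration and do not exist in
the kernel or in multipath-tools.

```python
# Allowed sdev transitions: RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
TRANSITIONS = {
    "RUNNING": {"BLOCKED"},
    "BLOCKED": {"RUNNING", "DISABLED"},
    "DISABLED": {"BLOCKED", "CANCEL"},
    "CANCEL": set(),  # terminal: the device is gone
}

# What multipath would see for each sdev state.
MPATH_STATE = {
    "RUNNING": "path up",
    "BLOCKED": "<stall>",
    "DISABLED": "path down",
    "CANCEL": "remove from map",
}


def step(state, new_state):
    """Return new_state if the transition is allowed, else raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state


def should_switch_paths(state):
    """Switch away as soon as the sdev is DISABLED, not only on CANCEL."""
    return state in ("DISABLED", "CANCEL")
```

The point of the extra state is visible in should_switch_paths(): the
path can be failed at 'DISABLED' while the path structures are only torn
down at 'CANCEL'.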

However, this proposal didn't go through.

Instead it was proposed to do away with the unlimited queue_if_no_path
setting and _always_ have a timeout there, so that the machine is able
to recover after a certain period of time.
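
To make the alternative concrete, here is a toy model of a bounded
queue_if_no_path: I/O is queued while no path is available, but only for
a fixed window, after which everything queued is failed. The class and
its methods are hypothetical and only model the no-path situation; this
is not how dm-multipath implements queueing.

```python
import time


class NoPathQueue:
    """Queue I/O while all paths are down, but only for `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the sketch can be tested
        self.pending = []
        self.no_path_since = None   # when the last path went away

    def all_paths_down(self):
        # Start the window only once, on the first report.
        if self.no_path_since is None:
            self.no_path_since = self.clock()

    def expired(self):
        return (self.no_path_since is not None
                and self.clock() - self.no_path_since >= self.timeout)

    def submit(self, io):
        """Queue io; return False once the window has expired."""
        if self.expired():
            return False
        self.pending.append(io)
        return True

    def fail_expired(self):
        """Return the queued I/O to be failed if the window has expired."""
        if not self.expired():
            return []
        failed, self.pending = self.pending, []
        return failed

    def path_up(self):
        """A path came back: hand the queued I/O back for resubmission."""
        self.no_path_since = None
        ready, self.pending = self.pending, []
        return ready
```

With a finite timeout the queue always drains eventually, either through
path_up() or through fail_expired(), which is what lets the machine
recover.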

I still like my original proposal, though.

Maybe we can do the EU referendum thing and just ask again and again
until everyone becomes tired of it and just says 'yes' to get rid
of this issue ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare at suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)