[dm-devel] what is the current utility in testing active paths from multipathd?

Lan transter at gmail.com
Wed Apr 27 18:36:27 UTC 2005


On 4/27/05, Lars Marowsky-Bree <lmb at suse.de> wrote:
> On 2005-04-27T12:27:32, "goggin, edward" <egoggin at emc.com> wrote:
> 
> > Although I know it sounds a bit radical and counterintuitive,
> > I'm not sure of the utility gained in the current multipathing
> > implementation by multipathd periodically testing paths which
> > are known to be in an active state in the multipath target driver.
> > Possibly someone can convince me otherwise.
> 
> Because user-space doesn't know whether any IO has actually gone down a
> given path, and that would be the only time the kernel would detect the
> error.
> 
> > If not, it may be possible to significantly reduce the cpu&io
> > resource utilization consumed by multipathd path testing on
> > enterprise scale configurations by only testing those paths
> > which the kernel thinks are in a failed state -- obviously a
> > much smaller set of paths.
> 
> I could see not testing paths if we knew IO was hitting them; as an
> approximation, the active paths from the active PG might be omitted.
> However, the paths in the inactive PG all need to be tested, or else you
> are never going to find out that the paths have gone bad on you until
> you try to failover.
> 
> The best way to minimize path (re-)testing needed is to figure in the
> hierarchy of components involved; as long as the FC switch is still bad,
> there's no point testing any target which we could reach through it,
> etc; testing whether the switch itself is healthy would round-robin
> through our various connections to the switch, to make sure we don't
> declare the switch down because we got hung up on one failed path.
> 
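A minimal sketch of how such a hierarchy-aware filter could look; the
structs and the per-switch health flag below are hypothetical, not
anything that exists in multipathd today:

struct fc_switch {
	int healthy;		/* cleared once the switch is known bad */
};

struct path {
	struct fc_switch *sw;	/* switch this path runs through, if any */
};

/*
 * No point testing a target we can only reach through a switch that is
 * already known to be down; skip it until the switch checks out again.
 */
static int worth_testing(const struct path *p)
{
	return !p->sw || p->sw->healthy;
}

The healthy flag would be refreshed by the round-robin check over our
various links to the switch, as described above.
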
> Another option would be to not mechanically test every N seconds, but to
> retest a failed path after 1s - 2s - 4s - ... 32s max as a cascading
> back-off, and maybe start at 2 - 64s for paths in inactive PGs.
> 
> Not testing paths however isn't a real option.
> 
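
To make the cascading back-off concrete, here is a rough sketch,
assuming a hypothetical per-path retest interval kept in seconds and
reset to 0 once the path tests good again:

/*
 * Purely illustrative: failed paths in the active PG start at 1s and
 * double up to a 32s ceiling; paths in inactive PGs start at 2s and
 * cap at 64s.
 */
static unsigned int next_retest_interval(unsigned int cur, int in_active_pg)
{
	unsigned int start = in_active_pg ? 1 : 2;
	unsigned int max = in_active_pg ? 32 : 64;

	if (cur == 0)
		return start;	/* first retest after the failure */
	return cur * 2 > max ? max : cur * 2;
}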

I think it's a good idea to make a distinction between testing paths
for probing (i.e. making sure active paths have not gone dead) and for
reclamation (retesting failed paths so they can be brought back).
Possibly this would mean having two separate testing threads. That way
users could decide which policy they want for each type of testing.
Some users may not care much about probing. For example, if they have
large configurations and are willing to trade immediate knowledge of
system degradation for saved cycles, they may decide not to probe at
all and live with paths only being marked failed when I/O fails on
them. Or they could use a probing policy that consumes fewer
resources, e.g. a lower probing frequency than reclamation testing.
Reclamation is more crucial, I think, and would be of more concern to
users. Letting users set the reclamation policy, e.g. the testing
frequency or a cascading back-off, would be good, since that decision
depends on knowledge users have of their own configuration and data
load.
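
To make that concrete, here is a rough sketch of what per-type
policies might look like; the structure and field names are made up
for illustration, not existing multipathd configuration:

struct test_policy {
	unsigned int interval;	/* seconds between tests, 0 = disabled */
	int cascade_backoff;	/* back off the interval on repeated failure */
};

struct checker_config {
	struct test_policy probing;	/* can be relaxed or turned off */
	struct test_policy reclamation;	/* normally left enabled */
};

/* e.g. probe rarely on a large configuration, reclaim aggressively */
static const struct checker_config example = {
	.probing	= { .interval = 60, .cascade_backoff = 0 },
	.reclamation	= { .interval = 5,  .cascade_backoff = 1 },
};

A daemon built this way could run one checker loop (or thread) per
policy and leave the choice of intervals entirely to the administrator.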


> > multipathd, this will no longer be true.  This seems unlikely
> > apparently due to the difficulty in implementing consistently
> > accurate path testing in user space.
> 
> Uh? How is path testing in user-space difficult?
> 
> Sincerely,
>     Lars Marowsky-Brée <lmb at suse.de>
> 
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business
> 
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>



