[dm-devel] 2.6.10-rc1-udm1: multipath work in progress

Fri Nov 5 22:05:04 UTC 2004

On 2004-10-29T23:23:02, Alasdair G Kergon <agk at redhat.com> wrote:

Hi Alasdair and list members,

I'd like to revisit the discussion on the "bypassed" flag to priority
groups. I've come to the conclusion that this is a suboptimal and
inelegant way of reaching the desired goal and would like to hear your
comments on these thoughts. (They came up while I was thinking about how
to configure this from user-space.)

First, let's start with the goal definition. What motivated the bypassed
flag?

The idea was (at least on my side) to minimize unnecessary switches
between priority groups by the kernel itself, as these might have
undesirable side effects (on performance, or on other nodes). In
short, to move the policy decision of when to switch the priority group
out of the kernel to user-space.

The only time when the kernel was 'authorized' to switch PGs internally
was in response to all paths failing or when the PG was "pulled away"
from under us (in a non-fatal condition). In short, when immediate and
unavoidable action was necessary to keep the IO moving.

Admins may have different policies for controlling the switching of PGs.
In a cluster, it may need to be coordinated across multiple nodes
(either by actual communication or via the timing based model I've been
discussing on the list). Even on a single node, we may want to stick to
the PG which has the most healthy paths, for redundancy and/or load
balancing. Or we may want to switch back to some PG as soon as it
becomes available again, because it has 10GB/s FC and the other one is
on a 1GB/s link. I believe such policy decisions would complicate the
kernel too much.
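
To illustrate the kind of policy that would then live in user-space, here
is a minimal sketch of the two single-node policies mentioned above. The
struct and function names are hypothetical and exist only for this
example; nothing here is taken from dm-mpath or multipath-tools.

```c
#include <stddef.h>

/* Hypothetical user-space view of one priority group (not a kernel struct). */
struct pg_view {
    unsigned healthy_paths; /* paths currently considered usable */
    unsigned link_speed;    /* relative link speed, higher is faster */
};

/* Policy A: use the PG with the most healthy paths.
 * On a tie with the current PG, stay put to avoid a needless switch. */
size_t policy_most_healthy(const struct pg_view *pgs, size_t n, size_t current)
{
    size_t best = current;

    for (size_t i = 0; i < n; i++)
        if (pgs[i].healthy_paths > pgs[best].healthy_paths)
            best = i;
    return best;
}

/* Policy B: use the fastest PG that has at least one healthy path.
 * If no PG is usable at all, keep the current one. */
size_t policy_fastest_usable(const struct pg_view *pgs, size_t n, size_t current)
{
    size_t best = current;
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (pgs[i].healthy_paths == 0)
            continue;
        if (!found || pgs[i].link_speed > pgs[best].link_speed) {
            best = i;
            found = 1;
        }
    }
    return best;
}
```

Either function returns the index of the PG user-space would then ask the
kernel to use; which one is right depends entirely on the site.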

Now, why do I think the bypassed flag isn't the way to go?

- It makes switching PGs icky from user-space. If you want the kernel to
  switch to some PG, you need to set bypassed on all others and clear it
  on just that one.
  -> excessive communication.

- It makes error recovery in the kernel annoying. When we _must_ switch
  PGs in response to errors, we'd eventually ignore the bypassed flag
  anyway and switch to the one which still has healthy paths, so why
  have it in the first place?

- User-space can't tell us right now which PGs it might want bypassed
  when loading the table in the first place, but it has to send
  additional messages afterwards, when it might already be too late.

  (Unless we want user-space to have to sort the table differently on
  each load, which adds unnecessary complexity and also means that
  admins using dmsetup status to look at it will have to pay more
  attention, which they won't.)

- Following Alasdair's description of how table loads should be very
  rare, and that the modus operandi should instead be changed by atomic
  messages, I think this can be implemented better.

So, I believe the goals would be better served by making it easy for
user-space to tell us which PG to use, rather than having to flag which
PGs _not_ to use and then clear those flags again.

(At least that is how my design sensors work. Yours may be different.)
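
The difference in communication can be sketched as follows. The message
strings "set_bypass" and "clear_bypass" are made up for this example
(the actual syntax for toggling the flag is not given here), and the
1-based PG numbering is likewise an assumption.

```c
#include <stdio.h>
#include <stddef.h>

#define MSG_LEN 32

/* Bypassed-flag scheme: every PG needs its flag put into the right state,
 * so selecting one PG out of num_pgs costs num_pgs messages.
 * "out" must have room for num_pgs entries. Returns the message count. */
size_t msgs_with_bypass(unsigned target, unsigned num_pgs, char out[][MSG_LEN])
{
    size_t count = 0;

    for (unsigned pg = 1; pg <= num_pgs; pg++)
        snprintf(out[count++], MSG_LEN, "%s %u",
                 pg == target ? "clear_bypass" : "set_bypass", pg);
    return count;
}

/* switch_pg scheme: a single message names the PG to use. */
size_t msgs_with_switch_pg(unsigned target, char out[][MSG_LEN])
{
    snprintf(out[0], MSG_LEN, "switch_pg %u", target);
    return 1;
}
```

With the flag scheme the device also passes through intermediate states
while the messages arrive one by one; a single switch_pg message has no
such window.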

My proposal would thus be

- drop the bypassed flag completely.

- Allow user-space to send us a "switch_pg" message, specifying the
  number of the PG to switch to instead.

- Report the number of the active PG in the status report.

  (Random remark: Maybe a good idea to have a multipath -q command to
  query the state of the multipath device in some more human readable
  format.)

- To keep the mapping itself stable, allow user-space to specify the
  number of the PG which should be active initially at table load time.

- The kernel keeps track of the "current_pg" already and never
  gratuitously changes PGs. Only in response to errors (all paths
  failed), or when the error handler tells us to SWITCH_PG do we switch,
  and then we switch to the first other PG with healthy paths.

  And send an event to user-space to allow it to find out that we did,
  and then user-space can coordinate the switch-back at its discretion.

- (Actually, I briefly entertained the idea of querying user-space to
  tell us which "other" PG we should switch to, or figuring it out in
  some fancy way ourselves. But then it occurred to me that in 95% of all
  scenarios, I believe we'll be dealing with exactly two PGs anyway,
  and thus there'll be just that other PG to switch to. If we eventually
  need more, we can always introduce a feature flag.)
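
The proposed behaviour can be sketched as a small state machine. This is
plain user-space C meant only to illustrate the rules above, not dm-mpath
code; all names (mp_state, mp_init, mp_switch_pg, mp_failover) and the
0-based PG indices are assumptions of the sketch.

```c
#include <stddef.h>

#define MAX_PGS 8

struct mp_state {
    unsigned num_pgs;
    unsigned healthy[MAX_PGS]; /* healthy path count per PG */
    unsigned current_pg;       /* index of the active PG */
    unsigned events;           /* events raised to user-space so far */
};

/* Table load: user-space names the PG that is active initially. */
int mp_init(struct mp_state *m, unsigned num_pgs, unsigned initial_pg)
{
    if (num_pgs == 0 || num_pgs > MAX_PGS || initial_pg >= num_pgs)
        return -1;
    m->num_pgs = num_pgs;
    m->current_pg = initial_pg;
    m->events = 0;
    for (unsigned i = 0; i < MAX_PGS; i++)
        m->healthy[i] = 0;
    return 0;
}

/* "switch_pg" message: user-space commands the PG to use.
 * Whether to refuse a PG without healthy paths is left open here;
 * this sketch only checks that the index is valid. */
int mp_switch_pg(struct mp_state *m, unsigned pg)
{
    if (pg >= m->num_pgs)
        return -1;
    m->current_pg = pg;
    return 0;
}

/* All paths of the current PG failed, or the error handler asked for a
 * switch: move to the first other PG with healthy paths and raise an
 * event. Returns 0 if a switch happened, -1 if no other PG is usable. */
int mp_failover(struct mp_state *m)
{
    for (unsigned i = 0; i < m->num_pgs; i++) {
        if (i != m->current_pg && m->healthy[i] > 0) {
            m->current_pg = i;
            m->events++;
            return 0;
        }
    }
    return -1;
}

/* Status report: expose the number of the active PG. */
unsigned mp_status_active_pg(const struct mp_state *m)
{
    return m->current_pg;
}
```

Note that there is no code path which changes current_pg on its own:
only the message handler and the error path touch it, and the error path
always leaves a trace in the event counter for user-space to pick up.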

I believe this captures the goals better and will also be cleaner to
implement.

Comments?

If you approve, I can do a patch for dm-mpath on Monday.

Sincerely,
    Lars Marowsky-Brée <lmb at suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business