Wouldn't it be practical to bypass MPIO completely and submit your I/O to the paths directly instead?
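For illustration, a rough userspace sketch of that idea - the path device
names, slot offset, and message payload below are all made up - writing the
same sector down each path with O_DIRECT instead of going through the map:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical: the SCSI path devices backing one multipath map. */
static const char *paths[] = { "/dev/sdb", "/dev/sdc" };

#define MSG_SIZE   512         /* assumed: one sector per message slot    */
#define MSG_OFFSET (3 * 512)   /* assumed: offset of the target node slot */

int main(void)
{
    void *buf;
    size_t i;
    int ok = 0;

    /* O_DIRECT requires a sector-aligned buffer. */
    if (posix_memalign(&buf, 512, MSG_SIZE))
        return 1;
    memset(buf, 0, MSG_SIZE);
    memcpy(buf, "reset", 5);   /* made-up poison-pill payload */

    /* Submit the same write down every path, bypassing dm-multipath. */
    for (i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        int fd = open(paths[i], O_WRONLY | O_DIRECT);
        if (fd < 0) {
            perror(paths[i]);
            continue;
        }
        if (pwrite(fd, buf, MSG_SIZE, MSG_OFFSET) == (ssize_t)MSG_SIZE)
            ok++;              /* count paths that accepted the write */
        close(fd);
    }
    free(buf);

    /* "Success" here means at least one path took the write. */
    return ok ? 0 : 1;
}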
----- Original Message -----
> On 2010-11-06T11:51:02, Alasdair G Kergon <agk redhat com> wrote:
> Hi Neil, Alasdair,
> thanks for the feedback. Answering your points in reverse order -
> > > Might it make sense to configure a range of the device where writes
> > > always went down all paths? That would seem to fit with your
> > > problem description and might be easiest??
> > Indeed - a persistent property of the device (even another interface
> > with a different minor number) not the I/O.
> I'm not so sure that would be required though. The equivalent of our
> "mkfs" tool wouldn't need this. Also, typically, this would be a
> partition (kpartx) on top of a regular MPIO mapping (that we want to be
> managed by multipathd).
> Handling this completely differently would complicate setup, no?
> > And what is the nature of the data being written, given that I/O to
> > one path might get delayed and arrive long after it was sent,
> > overwriting data sent later. Successful stale writes will always be
> > recognised as such by readers - how?
> The very particular use case I am thinking of is the "poison pill" for
> node-level fencing. Nodes constantly monitor their slot (using direct
> IO, bypassing all caching, etc.) and either read it successfully or
> commit suicide (assisted by a hardware watchdog to protect against hangs).
> The writer knows that, once the message has been successfully written,
> the target node will either have read it (and committed suicide), or
> been self-fenced because of a timeout/read error.
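>
> For illustration, the reader side could look roughly like the sketch
> below - just a sketch, not the actual implementation, with a made-up
> device name, slot offset, and message payload:
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> #define SLOT_SIZE   512        /* assumed: one sector per node slot   */
> #define SLOT_OFFSET (3 * 512)  /* assumed: offset of this node's slot */
>
> int main(void)
> {
>     void *buf;
>     int fd;
>
>     /* O_DIRECT so every read really hits the shared disk, no caching. */
>     fd = open("/dev/mapper/shared-disk", O_RDONLY | O_DIRECT);
>     if (fd < 0) {
>         perror("open");
>         return 1;
>     }
>     /* O_DIRECT requires a sector-aligned buffer. */
>     if (posix_memalign(&buf, 512, SLOT_SIZE))
>         return 1;
>
>     for (;;) {
>         if (pread(fd, buf, SLOT_SIZE, SLOT_OFFSET) != (ssize_t)SLOT_SIZE) {
>             /* Read failed: self-fence (the hardware watchdog catches
>              * the case where we hang before even getting here). */
>             fprintf(stderr, "slot read failed, self-fencing\n");
>             abort();
>         }
>         if (memcmp(buf, "reset", 5) == 0) {
>             /* Poison pill found: commit suicide as requested. */
>             fprintf(stderr, "poison pill received, self-fencing\n");
>             abort();
>         }
>         sleep(1);  /* poll interval (assumed); must stay under the fencing timeout */
>     }
> }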
> Allowing for the additional timeouts incurred by MPIO here really slows
> this mechanism down to the point of being unusable.
> Now, even if a write were delayed - which is not very likely; it's more
> likely that some of the IO would simply fail if one of the paths did go
> down, and we would not resubmit it to the other paths - the worst that
> could happen would be a double fence. (That would require the write to
> land after the node has already cycled once and cleared its message slot,
> which implies a significant delay, since servers take a while to boot.)
> For the 'heartbeat' mechanism and others (if/when we get around to
> adding them), we could ignore the exact contents that have been written
> and just watch for changes; at worst, node death detection will take a
> bit longer.
> Basically, the thing we need to get around is the possible IO latency in
> MPIO, for things like poison pill fencing ("storage-based death") or
> qdisk-style plugins. I'm open to other suggestions as well.
> Architect Storage/HA, OPS Engineering, Novell, Inc.
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde