[dm-devel] [PATCH 0/5] request-based dm-multipath (v3)
Kiyoshi Ueda
k-ueda at ct.jp.nec.com
Tue Jun 2 06:57:56 UTC 2009
Hi,
This patch-set is the updated version of request-based dm-multipath.
Alasdair, please consider to include this for 2.6.31.
The patches are based on the combination of the following trees and patches:
- Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
- Alasdair's linux-next patches on June 2nd (UTC).
- Mike Snitzer's dm-topology patch:
http://marc.info/?l=dm-devel&m=124345842831206&w=2
(Actually, no functional dependencies on this patch.)
This patch-set doesn't have barrier support yet.
Some basic function/performance testings are done with NEC iStorage
(active-active multipath), and no problem was found.
Also heavy stress testing with a lot of path up/down and table
switching is done to check panic, stall and memory leak don't occur.
Please review and apply if no problem.
NOTE:
Recently, __sector and __data_len of struct request are commented
as 'internal, NEVER access directly' because they reflect the latest
status of request processing and change as BIOs are added/completed.
However, since request-based dm needs to copy the latest status
of the original request to clone, those fields are accessed directly
in this patch-set.
Further block layer change will be needed to clean up this situation.
I'm going to discuss about it separately in linux-kernel.
Summary of the patch-set
========================
1/5: dm core: add core functions for request-based dm
2/5: dm core: enable request-based dm
3/5: dm core: don't set QUEUE_ORDERED_DRAIN for request-based dm
4/5: dm core: disable interrupt when taking map_lock
5/5: dm-mpath: convert to request-based
drivers/md/dm-ioctl.c | 13
drivers/md/dm-mpath.c | 193 +++++---
drivers/md/dm-table.c | 130 +++++
drivers/md/dm.c | 947 ++++++++++++++++++++++++++++++++++++++++--
drivers/md/dm.h | 27 +
include/linux/device-mapper.h | 9
6 files changed, 1218 insertions(+), 101 deletions(-)
CHANGES
=======
v3:
- Rebased to
* Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
* Alasdair's linux-next patches on June 2nd (UTC)
* Mike Snitzer's dm-topology patch:
http://marc.info/?l=dm-devel&m=124345842831206&w=2
(Actually, no functional dependencies on this patch.)
- Removed the patch which rejects I/Os violating new queue limits,
since Hannes pointed out that the patch seems to cause unexpected
-EIO in no-path situation.
* This unexpected -EIO problem is still under investigation.
* The patch is for dangerous table swapping like shrinking
the queue limits. Generally, such table swapping shouldn't
be done. So removing the patch as of now shouldn't cause
a problem.
- Merged the integrity patch into the core function patch (PATCH 1)
- Fixed types of variables (PATCH 1, PATCH 2)
* Changed to use 'unsigned' instead of 'int' for 'i' in
dm_table_any_busy_target() and dm_table_set_type(),
since 'i' is always positive value and compared with
't->num_targets', which is 'unsigned'.
* Changed to use 'unsigned' instead of 'int' for 'bio_based'
and 'request_based' in dm_table_set_type(), since they are
always positive values.
- Moved table type from struct io_restrictions to struct dm_table
(PATCH 2)
* In v2, table type was stored in struct io_restrictions as
'no_request_stacking'. But Mike Snitzer's dm-topology patch
replaced the dm-specific io_restrictions with generic I/O
topology framework (queue_limits).
* Since table type is dm-specific, it cannot be a part of
queue_limits. So 'type' member is added to struct dm_table.
- Dropped cast in multipath_busy(), since it's from void (PATCH 5)
v2: (http://marc.info/?l=dm-devel&m=124056068208687&w=2)
- Rebased to 2.6.30-rc3
- Fixed memory leak when bio_integrity_clone() fails (PATCH 2)
* Added missing bio_free() when bio_integrity_clone() fails.
- Added a patch not to set QUEUE_ORDERED_DRAIN for request-based dm
(PATCH 4)
* This version of request-based dm doesn't have barrier support
yet. So set QUEUE_ORDERED_DRAIN only for bio-based dm.
Since the device type is decided at the first table binding
time, the flag set is deferred until then.
v1: (http://marc.info/?l=dm-devel&m=123736590931733&w=2)
- Rebased to 2.6.29-rc8
- Fixed oops on table swapping because of broken suspend (PATCH 1)
http://marc.info/?l=dm-devel&m=123667308218868&w=2
* dm_suspend() recognized that suspend completed while some
clones were still in flight. So swapping/freeing table
was possible against the in-use table, and it caused oops
* The problem was a race condition on in-flight checking:
new I/O could become in-flight while suspend code found
no I/O was in-flight and went on
* Fixed it by protecting the in-flight counting and queue
suspension with queue lock. Also using q->in_flight
instead of md->pending since it is straightforward to
understand the meaning of the in-flight clones
- Fixed NULL pointer reference when allocation for clone bio fails
(From Hannes Reinecke) (PATCH 1)
http://marc.info/?l=dm-devel&m=123666946315335&w=2
* free_bio_clone() used clone->end_io_data to get md
* But clone->end_io_data can be NULL when free_bio_clone()
is called from error handling path in clone_request_bios()
* Fixed it by passing md directly to free_bio_clone()
* Removed the work around code to check bio->bi_private:
http://marc.info/?l=dm-devel&m=123666946315335&w=2
- Fixed unnecessary -EIO on table swapping by removing
the map-existence check in dm_make_request() (PATCH 1)
http://marc.info/?l=dm-devel&m=123322573410050&w=2
* The check was there to reject BIOs until a table is loaded
and the device and the queue are initialized correctly
* But the check is not needed because BIOs are rejected
in __generic_make_request() by device size checking
until the device and the queue are initialized
- Fixed freeing of in-use md (PATCH 1)
* Upper layer can close the device just after all BIOs are
completed by blk_update_request() or blk_end_request(),
even if the request is still in the process of completion.
It creates a small window where dm_put() can free the md
which is needed for the process of completion
* Fixed it by incrementing md->holders while I/O is in-flight
- Fixed memory leak (missing unmap) on failure of dispatching
a mapped clone (PATCH 1)
* dm_kill_request() was used in dm_dispatch_request() when
the dispatching fails
* But dm_kill_requst() is for not-mapped clone, so target's
rq_end_io() function is not called. This caused memory leak.
* Fixed it by creating dm_complete_request() for mapped clone
and calling it in dm_dispatch_request()
- Changed exported function names to prevent misuse like above
(PATCH 1, PATCH 3, PATCH 5)
* dm_kill_request() => dm_kill_unmapped_request()
* dm_requeue_request() => dm_requeue_unmapped_request()
- Removed the REQ_QUIET flag in dm_kill_unmapped_request(), since
"I/O error" message is not printed for failed I/O (PATCH 1)
- Removed 'inline' for simple static functions, since compiler
will do inline anyway (PATCH 1)
- Removed if-request-based check in dm_prep_fn(), since the device
is always request-based when dm_prep_fn() is called (PATCH 1)
* The check was for the case where the table is switched from
request-based to bio-based. But currently no plan to support
such switching
- Fixed the problem that switching device type is unexpectedly
available on table swapping (PATCH 2)
* table_load() checks md->map for the current table type
to prevent the device type switching
* But table_load() can run in parallel with dm_swap_table(),
which has a small no-map window.
So dm_init_md_mempool() in table_load() could switch
the mempool type for an in-use device in this window
* Fixed it by placing the pre-allocated mempools in table
and binding them on resume. The resume is rejected if
the device is in-use with different type
- Fixed the problem that BIO is submitted to a request-based device
in which some settings may not be done yet (PATCH 2)
* The QUEUE_FLAG_STACKABLE flag was set in the middle of
dm_table_set_restrictions()
* But the flag needs to be set after all queue settings are
done and they become visible to other CPUs, because:
o BIOs can be submitted during dm_table_set_restrictions()
o Once the QUEUE_FLAG_STACKABLE flag is set in the queue,
BIOs are passed to request-based dm and eventually to
__make_request() where the queue settings are used
* Fixed it by re-ordering the flag set and putting a barrier
before the flag set.
- Fixed lockdep warnings (From Christof Schmitt) (PATCH 4)
http://marc.info/?l=dm-devel&m=123501258326935&w=2
* lockdep warns some self-deadlock possibilities since:
o map_lock is taken without disabling irq
o request-based dm takes map_lock after taking
queue_lock with disabling irq in dm_request_fn()
o so logically a deadlock may happen:
write_lock(map_lock)
<interrupt>
spin_lock_irqsave(queue_lock)
dm_request_fn()
=> dm_get_table()
=> read_lock(map_lock)
* dm_request_fn() is never called in interrupt context, so
such self-deadlocks don't happen actually
* But fixed the unneeded warnings by disabling irq when
taking map_lock
Summary of the design and request-based dm-multipath are below.
(No changes since the previous post.)
BACKGROUND
==========
Currently, device-mapper (dm) is implemented as a stacking block device
at bio level. This bio-based implementation has an issue below
on dm-multipath.
Because hook for I/O mapping is above block layer __make_request(),
contiguous bios can be mapped to different underlying devices
and these bios aren't merged into a request.
Dynamic load balancing could happen this situation, though
it has not been implemented yet.
Therefore, I/O mapping after bio merging is needed for better
dynamic load balancing.
The basic idea to resolve the issue is to move multipathing layer
down below the I/O scheduler, and it was proposed from Mike Christie
as the block layer (request-based) multipath:
http://marc.info/?l=linux-scsi&m=115520444515914&w=2
Mike's patch added new block layer device for multipath and didn't
have dm interface. So I modified his patch to be used from dm.
It is request-based dm-multipath.
DESIGN OVERVIEW
===============
While currently dm and md stacks block devices at bio level,
request-based dm stacks at request level and submits/completes
struct request instead of struct bio.
Overview of the request-based dm patch:
- Mapping is done in a unit of struct request, instead of struct bio
- Hook for I/O mapping is at q->request_fn() after merging and
sorting by I/O scheduler, instead of q->make_request_fn().
- Hook for I/O completion is at bio->bi_end_io() and rq->end_io(),
instead of only bio->bi_end_io()
bio-based (current) request-based (this patch)
------------------------------------------------------------------
submission q->make_request_fn() q->request_fn()
completion bio->bi_end_io() bio->bi_end_io(), rq->end_io()
- Whether the dm device is bio-based or request-based is determined
at table loading time
- Keep user interface same (table/message/status/ioctl)
- Any bio-based devices (like dm/md) can be stacked on request-based
dm device.
Request-based dm device *cannot* be stacked on any bio-based device.
Expected benefit:
- better load balancing
Additional explanations:
Why does request-based dm use bio->bi_end_io(), too?
Because:
- dm needs to keep not only the request but also bios of the request,
if dm target drivers want to retry or do something on the request.
For example, dm-multipath has to check errors and retry with other
paths if necessary before returning the I/O result to the upper layer.
- But rq->end_io() is called at the very late stage of completion
handling where all bios in the request have been completed and
the I/O results are already visible to the upper layer.
So request-based dm hooks bio->bi_end_io() and doesn't complete the bio
in error cases, and gives over the error handling to rq->end_io() hook.
Thanks,
Kiyoshi Ueda
More information about the dm-devel
mailing list