[dm-devel] [PATCH 0/8] request-based dm-multipath (v1)

Kiyoshi Ueda k-ueda at ct.jp.nec.com
Wed Mar 18 08:44:36 UTC 2009


Hi,

This patch-set is the updated version of request-based dm-multipath,
including many bug fixes.  (Most of them address corner cases found
in heavy stress testing and code review.  Thanks also to everyone who
tested the previous patches and reported bugs.)

 drivers/md/dm-ioctl.c         |   13 
 drivers/md/dm-mpath.c         |  194 +++++---
 drivers/md/dm-table.c         |  122 +++++
 drivers/md/dm.c               |  973 ++++++++++++++++++++++++++++++++++++++++--
 drivers/md/dm.h               |   21 
 fs/bio-integrity.c            |    5 
 fs/bio.c                      |    2 
 include/linux/bio.h           |    5 
 include/linux/device-mapper.h |   10 
 9 files changed, 1238 insertions(+), 107 deletions(-)


The patches are meant to replace the current versions in Alasdair's
editing tree:
    PATCH 1: replacing dm-add-request-based-facility.patch
    PATCH 2: replacing dm-permit-request-based-facility.patch
    PATCH 3: replacing dm-enforce-new-queue-limits.patch
    PATCH 4: newly added to prevent lockdep warnings
    PATCH 5: replacing dm-mpath-change-to-be-request-based.patch
    PATCH 6: replacing block-add-gfp_mask-to-bio_integrity_clone.patch
    PATCH 7: replacing dm-add-gfp_mask-to-bio_integrity_clone.patch
    PATCH 8: replacing dm-retain-integrity-request-based-clones.patch

The patches have some cosmetic dependencies on the previously posted
dynamic load balancer patches:
    http://marc.info/?l=dm-devel&m=123736539330980&w=2
but no functional dependency on them.  (In fact, the dynamic load
balancers rely on this patch-set to be effective.)


Changes from the version in Alasdair's editing tree:

    - Fixed an oops on table swapping caused by broken suspend (PATCH 1)
      http://marc.info/?l=dm-devel&m=123667308218868&w=2
        * dm_suspend() could decide that suspend had completed while
          some clones were still in flight, so the in-use table could
          be swapped or freed underneath them, causing an oops
        * The problem was a race condition in the in-flight check:
          new I/O could become in-flight after the suspend code had
          found no I/O in flight and proceeded
        * Fixed it by protecting the in-flight counting and the queue
          suspension with the queue lock.  Also switched from
          md->pending to q->in_flight, since it expresses the number
          of in-flight clones more directly
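        * A minimal sketch of the resulting pattern (the function name
          here is illustrative, not the actual patch code):

              /* The block layer updates q->in_flight under
               * q->queue_lock, so reading it under the same lock
               * closes the race with new I/O becoming in-flight
               * while suspend is checking. */
              static int in_flight_under_lock(struct request_queue *q)
              {
                      unsigned long flags;
                      int in_flight;

                      spin_lock_irqsave(q->queue_lock, flags);
                      in_flight = q->in_flight;
                      spin_unlock_irqrestore(q->queue_lock, flags);

                      return in_flight;
              }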

    - Fixed a NULL pointer dereference when allocation of a clone bio
      fails (from Hannes Reinecke) (PATCH 1)
      http://marc.info/?l=dm-devel&m=123666946315335&w=2
        * free_bio_clone() used clone->end_io_data to get md
        * But clone->end_io_data can be NULL when free_bio_clone()
          is called from the error-handling path in clone_request_bios()
        * Fixed it by passing md directly to free_bio_clone()
        * Removed the workaround code that checked bio->bi_private:
          http://marc.info/?l=dm-devel&m=123666946315335&w=2
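        * In sketch form, the change is just to the signature (bodies
          elided):

              /* Before: md was derived from clone->end_io_data, which
               * is NULL on the clone_request_bios() error path. */
              static void free_bio_clone(struct request *clone);

              /* After: every caller already knows md, so it is passed
               * in explicitly instead. */
              static void free_bio_clone(struct request *clone,
                                         struct mapped_device *md);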

    - Fixed an unnecessary -EIO on table swapping by removing
      the map-existence check from dm_make_request() (PATCH 1)
      http://marc.info/?l=dm-devel&m=123322573410050&w=2
        * The check was there to reject BIOs until a table is loaded
          and the device and the queue are correctly initialized
        * But the check is not needed, because until the device and
          the queue are initialized, BIOs are already rejected by the
          device size check in __generic_make_request()

    - Fixed freeing of an in-use md (PATCH 1)
        * The upper layer can close the device as soon as all BIOs
          have been completed by blk_update_request() or
          blk_end_request(), even while the request itself is still
          being completed.  This creates a small window in which
          dm_put() can free the md that the rest of the completion
          handling still needs
        * Fixed it by incrementing md->holders while I/O is in flight
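        * The fix follows the usual reference-counting pattern;
          roughly (the placement shown is illustrative):

              dm_get(md);   /* when a clone becomes in-flight:
                             * md->holders++ keeps md alive */

              /* ... BIOs complete; the upper layer may close the
               * device and drop its own reference here ... */

              dm_put(md);   /* dropped only after the request's
                             * completion handling has finished, so
                             * the final free cannot happen under it */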

    - Fixed a memory leak (missing unmap) on failure to dispatch
      a mapped clone (PATCH 1)
        * dm_kill_request() was used in dm_dispatch_request() when
          dispatching fails
        * But dm_kill_request() is meant for not-yet-mapped clones, so
          the target's rq_end_io() function was never called, which
          caused the memory leak
        * Fixed it by creating dm_complete_request() for mapped clones
          and calling it from dm_dispatch_request()
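        * A sketch of the corrected dispatch path (the submission
          helper below is hypothetical):

              void dm_dispatch_request(struct request *clone)
              {
                      int r;

                      /* hypothetical helper: queue the mapped clone
                       * on the underlying device's queue */
                      r = submit_clone(clone);
                      if (r)
                              /* The clone is already mapped, so it
                               * must be completed via
                               * dm_complete_request(), which runs the
                               * target's rq_end_io() and lets it
                               * unmap.  dm_kill_unmapped_request()
                               * would skip rq_end_io() and leak the
                               * mapping. */
                              dm_complete_request(clone, r);
              }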

    - Renamed exported functions to prevent misuse like the above
      (PATCH 1, PATCH 3, PATCH 5)
        * dm_kill_request()    => dm_kill_unmapped_request()
        * dm_requeue_request() => dm_requeue_unmapped_request()

    - Removed the REQ_QUIET flag in dm_kill_unmapped_request(), since
      it suppressed the "I/O error" message for failed I/O (PATCH 1)

    - Removed 'inline' from simple static functions, since the
      compiler will inline them anyway (PATCH 1)

    - Removed the is-request-based check from dm_prep_fn(), since the
      device is always request-based when dm_prep_fn() is called
      (PATCH 1)
        * The check was for the case where a table is switched from
          request-based to bio-based, but there is currently no plan
          to support such switching

    - Fixed the problem that switching the device type was
      unexpectedly possible on table swapping (PATCH 2)
        * table_load() checks md->map for the current table type
          to prevent the device type from being switched
        * But table_load() can run in parallel with dm_swap_table(),
          which has a small window with no map.
          So dm_init_md_mempool() in table_load() could switch the
          mempool type of an in-use device within this window
        * Fixed it by placing the pre-allocated mempools in the table
          and binding them on resume.  The resume is rejected if the
          device is in use with a different type
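        * In sketch form, the resume-side check (all names below are
          illustrative, not the actual patch code):

              /* Mempools are now pre-allocated per table at load time
               * and only bound to md on resume: */
              if (dm_device_in_use(md) &&            /* hypothetical */
                  dm_table_get_type(t) != md->type)  /* hypothetical */
                      return -EINVAL;  /* no type switch for in-use md */

              bind_mempools(md, t);  /* hypothetical: md takes over
                                      * the table's mempools */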

    - Fixed the problem that a BIO could be submitted to a
      request-based device before all of its queue settings were
      completed (PATCH 2)
        * The QUEUE_FLAG_STACKABLE flag was set in the middle of
          dm_table_set_restrictions()
        * But the flag must not be set until all queue settings are
          done and visible to other CPUs, because:
            o BIOs can be submitted during dm_table_set_restrictions()
            o Once the QUEUE_FLAG_STACKABLE flag is set on the queue,
              BIOs are passed to request-based dm and eventually to
              __make_request(), where the queue settings are used
        * Fixed it by moving the flag set to the end and putting a
          memory barrier before it
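        * The resulting order, roughly (smp_wmb() and
          queue_flag_set_unlocked() are stock kernel primitives; the
          settings shown are just examples):

              blk_queue_max_sectors(q, max_sectors);
              /* ... and the rest of the queue settings ... */

              smp_wmb();  /* make the settings visible to other CPUs
                           * before opening the gate */

              /* Only now may BIOs be routed to request-based dm and
               * on to __make_request(), which reads the settings: */
              queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);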

    - Fixed lockdep warnings (from Christof Schmitt) (PATCH 4)
      http://marc.info/?l=dm-devel&m=123501258326935&w=2
        * lockdep warns about possible self-deadlocks since:
            o map_lock is taken without disabling interrupts
            o request-based dm takes map_lock after taking
              queue_lock with interrupts disabled in dm_request_fn()
            o so on paper the following deadlock is possible:
                write_lock(map_lock)
                <interrupt>
                spin_lock_irqsave(queue_lock)
                dm_request_fn()
                    => dm_get_table()
                        => read_lock(map_lock)
        * dm_request_fn() is never called in interrupt context, so
          such self-deadlocks cannot actually happen
        * But silenced the false-positive warnings by disabling
          interrupts while taking map_lock
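        * The fix is mechanical; dm_get_table() becomes roughly:

              struct dm_table *dm_get_table(struct mapped_device *md)
              {
                      struct dm_table *t;
                      unsigned long flags;

                      /* was: read_lock(&md->map_lock) */
                      read_lock_irqsave(&md->map_lock, flags);
                      t = md->map;
                      if (t)
                              dm_table_get(t);
                      read_unlock_irqrestore(&md->map_lock, flags);

                      return t;
              }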


Basic functional and performance testing has been done with NEC
iStorage (active-active multipath), and no problem was found.
Heavy stress testing with many path up/down events and table swaps
was also done to confirm that no panic, stall or memory leak occurs.
Please review, and apply if there is no problem.


A summary of the design of request-based dm-multipath is below.
(No changes since the previous post.)

BACKGROUND
==========
Currently, device-mapper (dm) is implemented as a stacking block
device at the bio level.  This bio-based implementation has the
following issue for dm-multipath:

    Because the hook for I/O mapping sits above the block layer's
    __make_request(), contiguous bios can be mapped to different
    underlying devices, and such bios are not merged into a single
    request.
    Dynamic load balancing, though not implemented yet, would make
    this situation common.
    Therefore, mapping I/O after bio merging is needed for better
    dynamic load balancing.

The basic idea for resolving the issue is to move the multipathing
layer down below the I/O scheduler.  This was proposed by Mike
Christie as block layer (request-based) multipath:
    http://marc.info/?l=linux-scsi&m=115520444515914&w=2

Mike's patch added a new block layer device for multipath and had no
dm interface, so I modified his patch to be usable from dm.
The result is request-based dm-multipath.


DESIGN OVERVIEW
===============
While dm and md currently stack block devices at the bio level,
request-based dm stacks at the request level and submits/completes
struct request instead of struct bio.


Overview of the request-based dm patch:
  - Mapping is done in units of struct request, instead of struct bio
  - The hook for I/O mapping is at q->request_fn(), after merging and
    sorting by the I/O scheduler, instead of at q->make_request_fn()
  - The hook for I/O completion is at both bio->bi_end_io() and
    rq->end_io(), instead of only bio->bi_end_io():

                  bio-based (current)     request-based (this patch)
      ------------------------------------------------------------------
      submission  q->make_request_fn()    q->request_fn()
      completion  bio->bi_end_io()        bio->bi_end_io(), rq->end_io()

  - Whether a dm device is bio-based or request-based is determined
    at table loading time
  - The user interface (table/message/status/ioctl) is unchanged
  - Any bio-based device (like dm/md) can be stacked on a
    request-based dm device.
    A request-based dm device *cannot* be stacked on any bio-based
    device.

Expected benefit:
  - better load balancing


Additional explanation:

Why does request-based dm use bio->bi_end_io() as well?
Because:
  - dm needs to keep not only the request but also the bios of the
    request, in case a dm target driver wants to retry or otherwise
    act on the request.
    For example, dm-multipath has to check errors and retry on other
    paths if necessary before returning the I/O result to the upper
    layer.

  - But rq->end_io() is called at the very last stage of completion
    handling, when all bios in the request have been completed and
    the I/O results are already visible to the upper layer.

So request-based dm hooks bio->bi_end_io(), does not complete the bio
there in error cases, and hands error handling over to the
rq->end_io() hook.
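In sketch form (struct and field names below approximate the patch
and are not guaranteed to match it exactly):

    static void end_clone_bio(struct bio *clone, int error)
    {
            struct dm_rq_clone_bio_info *info = clone->bi_private;
            struct dm_rq_target_io *tio = info->tio;
            unsigned int nr_bytes = info->orig->bi_size;

            bio_put(clone);         /* the clone bio itself is done */

            if (error) {
                    /* Don't complete the original bio: the target may
                     * retry on another path.  Record the error and
                     * let rq->end_io() decide the final result. */
                    tio->error = error;
                    return;
            }

            /* Success: propagate the completed bytes of the original
             * request to the upper layer as early as possible. */
            blk_update_request(tio->orig, 0, nr_bytes);
    }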


Thanks,
Kiyoshi Ueda



