
[dm-devel] [DOC] Linux multipath implementation



Hello,

I would like to submit this doc to your criticism.
Beware, the English will certainly be awful, as I'm not a native English speaker.
I will take any guidance about what to do to make it useful for the widest
audience.

 <<multipath.html>> 
regards,
cvaroqui

The Linux multipath implementation

Original author : Christophe Varoqui, Feb 2004

This document is shared under the OpenContent Licence (http://www.opencontent.org/opl.shtml)

Introduction

The most common multipathed environment today is a Fibre Channel (FC) Storage Area Network (SAN). These beasts can be found in most datacenters. The Lego blocks forming a SAN are :

  • FC switches : core switches (multiprotocol chassis and FC boards), or stacked switches linked by Inter Switch Links (ISL). This layer, the Fabric layer, can be subdivided into two major fabric types :

    • Simple fabrics : all storage ports can be routed to all host ports. Hosts and storage controllers can have a single attachment to the fabric.

    • Dual independent fabrics : two sets of switches are completely segregated (no ISL). They form two independent naming domains. The hosts and storage controllers must be attached to both fabrics to ensure redundancy. This technology is used to provide maximum availability, as one fabric can be shut down, for planned or unplanned reasons, without any perturbation seen on the other one.

  • FC storage controllers : most of them provide multiple ports to attach to the switch layer. The physical storage they drive is arranged in virtual drives we will refer to as Logical Units (LU). Each LU is given a unique identifier by the controller that hosts it, as per the SCSI standard. We will refer to this identifier as the World Wide Identifier (WWID) or World Wide Name (WWN).

  • Host Bus Adapters (HBA) : the PCI / FC coupling adapters. A server can embed multiple HBAs.

The term multipath simply means that a host can access a LU by multiple paths, a path being a route from one host HBA port to one storage controller port.

Examples :

  • A host with 2 HBAs attached to a single fabric is presented a LU by a 4-port storage controller. The host then sees 2 x 4 = 8 paths to the LU.

  • A host with 2 HBAs attached to a dual independent fabric (1 HBA on each fabric) is presented a LU by a 4-port storage controller (2 ports on each fabric). The host then sees 4 paths to the LU : 2 paths through fabric A, plus 2 through fabric B.
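
The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration, and it assumes that every HBA port on a fabric can reach every controller port on the same fabric (no zoning or LUN masking); the function name count_paths is made up for this sketch.

```python
def count_paths(fabrics):
    """Count the paths to one LU.

    fabrics is a list of (host_hba_ports, controller_ports) tuples, one
    tuple per independent fabric. Within a fabric every HBA port is assumed
    to reach every controller port, so paths multiply inside a fabric and
    add up across fabrics.
    """
    return sum(hba_ports * ctrl_ports for hba_ports, ctrl_ports in fabrics)


# Single fabric: 2 HBAs, 4 controller ports.
single_fabric = count_paths([(2, 4)])

# Dual independent fabrics: 1 HBA and 2 controller ports on each fabric.
dual_fabric = count_paths([(1, 2), (1, 2)])

print(single_fabric, dual_fabric)
```

The dual fabric setup shows fewer paths for the same hardware because no path can cross from one fabric to the other.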



The Linux kernel chooses not to mask the individual paths, which appear as normal SCSI Disks (SD).

Multipath awareness and support for an operating system can be described as :

  • Provide a single block device node for a multipathed LU

  • Ensure that IOs are re-routed to available paths when a path is lost, with no userspace process disruption other than a short pause.

  • Ensure that failed paths get revalidated as soon as possible

  • Ensure stability of the naming of that node

  • Configure the multipaths to maximize performance : spread IO when path switching is free, and do not spread when it is costly.

  • Configure the multipaths automatically early at boot to permit OS install on a multipathed LU

  • Reconfigure the multipaths automatically when events occur

  • The multipath must be partitionable

  • In the Linux way : simple and hardware vendor agnostic

All these goals are met by leveraging a set of userspace tools and kernel subsystems :

  • the kernel device mapper

  • the hotplug kernel subsystem

  • the udev device naming tool

  • the multipath userspace configuration tool

  • the kpartx userspace configuration tool

  • the early userspace Linux kernel boot environment

The rest of this document describes these individual tools and subsystems and their interactions.

Device Mapper

Starting with Linux kernel 2.6, a new lightweight block subsystem named Device Mapper enables advanced storage management with style. This component features a pluggable design. At the time of this writing, the available plugins are :

  • segment concatenation

  • segment striping

  • segment snapshotting

  • segment mirroring, with and without persistence

  • segment on-the-fly encryption

  • segment multipathing

This last policy is the core component of the multipath tool chain. It is not included in the main kernel tree as of linux-2.6.3. It is part of a patchset maintained by Joe Thornber (thornber at redhat dot com) that can be downloaded at http://people.sistina.com/~thornber/dm/

This component fulfills the following requirements :

  • Provide a single block device node for a multipathed LU

  • Ensure that IOs are re-routed to available paths when a path is lost, with no userspace process disruption other than a short pause.

  • Ensure that failed paths get revalidated as soon as possible

So, let's see how it works.

The Device Mapper is configured one map at a time. A device map, also referred to as a table, is a list of segments in the form of :

0 35258368 linear 8:48 65920

35258368 35258368 linear 8:32 65920

70516736 17694720 linear 8:16 17694976

88211456 17694720 linear 8:16 256



The first 2 parameters of each line are the starting block of the segment in the virtual device and the length of the segment, both expressed in 512-byte sectors. The next keyword is the target policy (linear). The rest of the line holds the target parameters.
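
To make the layout concrete, here is a minimal Python sketch that splits such a table into its fields and checks that the segments follow each other without holes. It only handles text; it does not talk to the Device Mapper, and the names Segment, parse_table and is_contiguous are invented for this illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int         # first block of the segment in the virtual device
    length: int        # length of the segment
    target: str        # target policy, e.g. "linear"
    params: List[str]  # remaining target-specific parameters


def parse_table(text):
    """Parse a device map given as text, one segment per non-empty line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        segments.append(
            Segment(int(fields[0]), int(fields[1]), fields[2], fields[3:])
        )
    return segments


def is_contiguous(segments):
    """True if the segments start at 0 and leave no gap or overlap."""
    expected_start = 0
    for seg in segments:
        if seg.start != expected_start:
            return False
        expected_start = seg.start + seg.length
    return True


TABLE = """\
0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256
"""

segments = parse_table(TABLE)
total_length = segments[-1].start + segments[-1].length
print(is_contiguous(segments), total_length)
```

For the linear target, the two remaining parameters are the underlying device (major:minor) and the offset on that device where the segment begins.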

The Device Mapper can be fed its tables through a library : libdevmapper. Dmsetup, LVM2, the multipath configuration tool and kpartx all link against this library. A table setup boils down to sprintf'ing the right segment definitions into a char * buffer. Should the DM user-kernel interface change from being ioctl based to a pseudo filesystem, the libdevmapper API should remain stable.

Here is an example of a multipath target :

0 27262976 multipath 10 2 1 round-robin 2 0 /dev/sda /dev/sdk 1 round-robin 2 0 /dev/sdc /dev/sdm



The multipath target parameters are :

  • $0 : the "multipath" keyword

  • $1 : the test IO posting interval, used to revalidate failed paths, in seconds

  • $2 : the number of priority groups for the segment

  • $3 : the first priority group parameters :

    • $3.1 : the priority of this PG

    • $3.2 : the scheduler to be used to spread IO inside the PG

    • $3.3 : the number of paths in the PG

    • $3.4 : the number of path parameters (usually 0)

    • $3.5 : the paths list for this PG

  • $4 : next priority group
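
The parameter list above can be walked mechanically. The following Python sketch does so for the example target shown earlier. It follows the layout described in this document only, which is not guaranteed to match other versions of the multipath target, and it assumes that a non-zero $3.4 means each path is followed by that many extra arguments. The function name parse_multipath_params is made up for this illustration.

```python
def parse_multipath_params(tokens):
    """Parse the tokens that follow the "multipath" keyword.

    Returns (test_interval, groups) where groups is a list of dicts with
    the keys "priority", "scheduler" and "paths".
    """
    tokens = list(tokens)
    test_interval = int(tokens[0])  # $1 : test IO interval, in seconds
    group_count = int(tokens[1])    # $2 : number of priority groups
    pos = 2
    groups = []
    for _ in range(group_count):
        priority = int(tokens[pos])           # $3.1
        scheduler = tokens[pos + 1]           # $3.2
        path_count = int(tokens[pos + 2])     # $3.3
        args_per_path = int(tokens[pos + 3])  # $3.4 (assumed per-path count)
        pos += 4
        paths = []
        for _ in range(path_count):
            paths.append(tokens[pos])         # $3.5 : one path
            pos += 1 + args_per_path          # skip its extra arguments
        groups.append(
            {"priority": priority, "scheduler": scheduler, "paths": paths}
        )
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens in multipath target")
    return test_interval, groups


LINE = ("0 27262976 multipath 10 2 "
        "1 round-robin 2 0 /dev/sda /dev/sdk "
        "1 round-robin 2 0 /dev/sdc /dev/sdm")

fields = LINE.split()
test_interval, groups = parse_multipath_params(fields[3:])
for group in groups:
    print(group["priority"], group["scheduler"], group["paths"])
```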

For completeness, here is an example of a pure failover target definition for the same LU :

0 27262976 multipath 10 4 1 round-robin 1 0 /dev/sda 1 round-robin 1 0 /dev/sdc 1 round-robin 1 0 /dev/sdk 1 round-robin 1 0 /dev/sdm

And a full spread (multibus) target one :

0 27262976 multipath 10 1 1 round-robin 4 0 /dev/sda /dev/sdc /dev/sdk /dev/sdm



Upon device map creation, a new kernel block object named dm-[0-9]* is instantiated, and a hotplug call is triggered. Each device map can be assigned a symbolic name when created through libdevmapper, but this name is not available anywhere except through a libdevmapper request.

hotplug subsystem and udev

Starting with Linux kernel 2.6, the hotplug callbacks are unified through a new pseudo filesystem : sysfs. This filesystem presents kernel objects such as buses, driver instances or block devices to userspace in a hierarchical and homogeneous manner. The hotplug subsystem is leveraged by triggering a /sbin/hotplug call upon file creation and deletion in the sysfs filesystem.

For our needs this facility provides :

  • userspace callbacks upon path additions and removals

  • userspace callbacks upon device map additions and removals

In the future it may also provide callbacks upon FC transport events like a "Port Database Rescan". These callbacks could then be used to trigger a SCSI bus rescan, bringing a fully dynamic storage layer.

Here is how we use these callbacks for the multipath implementation :

  • The path addition and removal callbacks are routed to the multipath userspace configuration tool described later. This tool ensures the multipath maps are always up-to-date with the fabric topology, and thus ensures optimal performance by adding new paths to the existing maps as soon as they become available.

  • The udev userspace tool is triggered upon every block sysfs entry creation and removal, and takes responsibility for creating and naming the associated device node. Udev default naming policies can be complemented by add-on scripts or binaries. As it does not currently have a default policy for naming device maps, we plug in a little tool named devmap_name that resolves the sysfs dm-[0-9]* names into the map names set at map creation time. Provided the maps are named sensibly, this plugin provides the naming stability and meaningfulness required for a proper multipath implementation.

  • The userspace callbacks upon device map additions and removals also trigger the kpartx tool, which creates device maps over any partitions found.

Udev is a reimplementation in userspace of the devfs kernel facility. It provides a dynamic /dev space, with an agnostic naming policy. Greg Kroah-Hartman is the main developer and maintainer of this package. It can be found at http://ftp.kernel.org/pub/linux/utils/kernel/hotplug/

To summarize, the requirements these subsystems fulfill are :

  • Ensure stability of the naming of that node

  • Reconfigure the multipaths automatically when events occur

  • The multipath must be partitionable

multipath userspace config tool

This tool is responsible for path coalescing and device map creation. As seen earlier, it is triggered by the hotplug calls on path additions and removals. It must deal with hardware specifics and abstract them for the other subsystems.

Here is how it works :

  • Build a list of all available devices in the system through a sysfs scan. For each device, get a bunch of information :

    • Host / Bus / Target / Lun tuple

    • SCSI Device Strings : Vendor / Model / Revision

    • SCSI Serial String

  • Based on the information fetched, elect a LU WWID method and an IO spreading policy, i.e. deal with hardware specifics.

  • Get the LU WWID with the elected method. This method defaults to the standard 128-bit EUID found in the EVPD 0x83 inquiry page of the device.

  • Coalesce the paths to form the multipath structs

  • Create and name the device maps associated with the multipath structs with the selected IO spreading policy
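
The coalescing step can be pictured as a simple grouping of paths by WWID. The Python sketch below is only an illustration of that idea, not the code of the multipath tool; the function name coalesce and the tuple layout are assumptions, and the sample paths are a subset of a real listing.

```python
def coalesce(paths):
    """Group paths sharing the same LU WWID.

    paths is an iterable of (wwid, hbtl, devnode) tuples, where hbtl is the
    Host / Bus / Target / Lun tuple. Returns a dict mapping each WWID to the
    list of (hbtl, devnode) paths leading to that LU, in discovery order.
    """
    multipaths = {}
    for wwid, hbtl, devnode in paths:
        multipaths.setdefault(wwid, []).append((hbtl, devnode))
    return multipaths


PATHS = [
    ("600508b4000156d700012000000b0000", (0, 0, 1, 1), "/dev/sda"),
    ("600508b4000156c30001200000210000", (0, 0, 1, 2), "/dev/sdb"),
    ("600508b4000156d700012000000b0000", (0, 0, 2, 1), "/dev/sdc"),
    ("60001fe1000bdad0000903507109004b", (0, 0, 3, 1), "/dev/sde"),
    ("600508b4000156d700012000000b0000", (1, 0, 1, 1), "/dev/sdk"),
]

multipaths = coalesce(PATHS)
for wwid, lu_paths in multipaths.items():
    print(wwid, [devnode for _, devnode in lu_paths])
```

A LU seen through a single path still yields an entry, which is what allows a map to be created for it.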

There are currently 3 spreading policies implemented :

  • failover : 1 path per priority group. IO thus gets routed to one path only.

  • multibus : 1 priority group containing all paths to the LU. Brings the maximum spreading, but assumes that all paths can be used without penalty.

  • group_by_serial : 1 priority group per storage controller (serial); paths through one controller are assigned to the associated PG. This policy applies to controllers that impose a latency penalty on LU management hand-over between a pair of redundant controllers.
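
As a rough sketch of how the three policies translate into multipath targets, the Python function below arranges a list of paths into priority groups and formats the resulting table line, using the target layout shown in the Device Mapper section. It is not the tool's actual code : the function name build_table is invented, the controller serials "A" and "B" are placeholders, and every PG is given priority 1 and the round-robin scheduler as in the examples of this document.

```python
def build_table(size, paths, policy, test_interval=10):
    """Format a one-segment multipath table for the given policy.

    paths is a list of (devnode, controller_serial) tuples.
    """
    if policy == "failover":
        # One path per priority group.
        groups = [[dev] for dev, _ in paths]
    elif policy == "multibus":
        # A single priority group holding every path.
        groups = [[dev for dev, _ in paths]]
    elif policy == "group_by_serial":
        # One priority group per controller serial, in discovery order.
        by_serial = {}
        for dev, serial in paths:
            by_serial.setdefault(serial, []).append(dev)
        groups = list(by_serial.values())
    else:
        raise ValueError("unknown policy: %s" % policy)

    pg_defs = " ".join(
        "1 round-robin %d 0 %s" % (len(group), " ".join(group))
        for group in groups
    )
    return "0 %d multipath %d %d %s" % (
        size, test_interval, len(groups), pg_defs)


# Serials "A" and "B" are placeholders for the two controllers.
PATHS = [("/dev/sda", "A"), ("/dev/sdc", "B"),
         ("/dev/sdk", "A"), ("/dev/sdm", "B")]

failover = build_table(27262976, PATHS, "failover")
multibus = build_table(27262976, PATHS, "multibus")
by_serial = build_table(27262976, PATHS, "group_by_serial")
for table in (failover, multibus, by_serial):
    print(table)
```

With these inputs the three lines printed match the failover, multibus and two-PG example targets given in the Device Mapper section.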

The device map naming policy is "name by LU WWID".

To illustrate this synopsis, here is an example verbose output :

xa-s03:~/udev-016/extras# multipath -v
600508b4000156d700012000000b0000 (0 0 1 1) /dev/sda [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (0 0 1 2) /dev/sdb [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000 (0 0 2 1) /dev/sdc [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (0 0 2 2) /dev/sdd [HSV110 (C)COMPAQ]
60001fe1000bdad0000903507109004b (0 0 3 1) /dev/sde [HSG80 ]
60001fe1000bdad000090371312100bf (0 0 3 2) /dev/sdf [HSG80 ]
60001fe1000bdad000090371312100c2 (0 0 3 3) /dev/sdg [HSG80 ]
60001fe1000bdad00009037131210067 (0 0 4 1) /dev/sdh [HSG80 ]
60001fe1000bdad000090371312100b3 (0 0 4 2) /dev/sdi [HSG80 ]
60001fe1000bdad00009035071090024 (0 0 4 3) /dev/sdj [HSG80 ]
600508b4000156d700012000000b0000 (1 0 1 1) /dev/sdk [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (1 0 1 2) /dev/sdl [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000 (1 0 2 1) /dev/sdm [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000 (1 0 2 2) /dev/sdn [HSV110 (C)COMPAQ]
600508b4000156d700012000000b0000
\_(0 0 1 1) /dev/sda [HSV110 (C)COMPAQ]
\_(0 0 2 1) /dev/sdc [HSV110 (C)COMPAQ]
\_(1 0 1 1) /dev/sdk [HSV110 (C)COMPAQ]
\_(1 0 2 1) /dev/sdm [HSV110 (C)COMPAQ]
600508b4000156c30001200000210000
\_(0 0 1 2) /dev/sdb [HSV110 (C)COMPAQ]
\_(0 0 2 2) /dev/sdd [HSV110 (C)COMPAQ]
\_(1 0 1 2) /dev/sdl [HSV110 (C)COMPAQ]
\_(1 0 2 2) /dev/sdn [HSV110 (C)COMPAQ]
60001fe1000bdad0000903507109004b
\_(0 0 3 1) /dev/sde [HSG80 ]
60001fe1000bdad000090371312100bf
\_(0 0 3 2) /dev/sdf [HSG80 ]
60001fe1000bdad000090371312100c2
\_(0 0 3 3) /dev/sdg [HSG80 ]
60001fe1000bdad00009037131210067
\_(0 0 4 1) /dev/sdh [HSG80 ]
60001fe1000bdad000090371312100b3
\_(0 0 4 2) /dev/sdi [HSG80 ]
60001fe1000bdad00009035071090024
\_(0 0 4 3) /dev/sdj [HSG80 ]
U:600508b4000156d700012000000b0000:0 27262976 multipath 10 2 1 round-robin 2 0 /dev/sda /dev/sdk 1 round-robin 2 0 /dev/sdc /dev/sdm
U:600508b4000156c30001200000210000:0 31457280 multipath 10 2 1 round-robin 2 0 /dev/sdb /dev/sdl 1 round-robin 2 0 /dev/sdd /dev/sdn
U:60001fe1000bdad0000903507109004b:0 106669167 multipath 10 1 1 round-robin 1 0 /dev/sde
U:60001fe1000bdad000090371312100bf:0 142229246 multipath 10 1 1 round-robin 1 0 /dev/sdf
U:60001fe1000bdad000090371312100c2:0 142229246 multipath 10 1 1 round-robin 1 0 /dev/sdg
U:60001fe1000bdad00009037131210067:0 213338334 multipath 10 1 1 round-robin 1 0 /dev/sdh
U:60001fe1000bdad000090371312100b3:0 213338334 multipath 10 1 1 round-robin 1 0 /dev/sdi
U:60001fe1000bdad00009035071090024:0 71114623 multipath 10 1 1 round-robin 1 0 /dev/sdj



The first section shows the list of all paths detected on the host. The second shows the multipath structs produced by the coalescing logic. The third shows the device maps submitted to the Device Mapper.

Of interest is the creation of device maps for single-path LUs : this enables the system to operate normally when booted in a degraded SAN context. The missing paths will be added to the maps when they become available.

This tool is packaged with udev, in the extras/ section. The devmap_name tool is distributed in the same tree.

The implementation requirements fulfilled by this tool are :

  • Ensure stability of the naming of that node (in complement of udev)

  • Configure the multipaths to maximize performance : spread IO when path switching is free, and do not spread when it is costly.

  • Configure the multipaths automatically at boot

  • Reconfigure the multipaths automatically when events occur (in complement of hotplug)

kpartx userspace config tool

This tool, derived from util-linux' partx, reads the partition table on the specified device and creates device maps over the partition segments detected. It is called from hotplug upon device map creation and deletion.
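
Conceptually, each partition becomes a one-segment linear map over the parent device. The Python sketch below illustrates that idea only and is not kpartx code : the function name partition_maps, the "p<number>" naming suffix, the 253:0 device number and the partition layout are all hypothetical values chosen for the example.

```python
def partition_maps(parent_name, parent_devno, partitions):
    """Build one linear table per partition.

    partitions is a list of (number, start_sector, sector_count) tuples as
    they would be read from a partition table. Returns a dict mapping a map
    name to its table line: the segment starts at 0 in the new map and
    points at start_sector on the parent device.
    """
    maps = {}
    for number, start, count in partitions:
        if count == 0:
            continue  # skip empty partition slots
        name = "%sp%d" % (parent_name, number)  # hypothetical naming scheme
        maps[name] = "0 %d linear %s %d" % (count, parent_devno, start)
    return maps


# Hypothetical partition layout on a multipath map whose device number
# is assumed to be 253:0.
PARTITIONS = [(1, 63, 1048576), (2, 1048639, 2097152), (3, 0, 0)]

maps = partition_maps("600508b4000156d700012000000b0000", "253:0", PARTITIONS)
for name, table in maps.items():
    print(name, ":", table)
```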

For now kpartx can be found at http://dsit.free.fr/

Early userspace

Starting with Linux kernel 2.6, an early userspace execution environment is available under the name initramfs. The grand plan is to package a set of tools in a cpio archive concatenated to the kernel. This archive is expanded into an in-memory filesystem early at boot, and the tools are called to take over logic that previously belonged in the kernel : DHCP requests and setup, nfsroot handling ...

Being concatenated to the kernel, the size of this archive matters a lot. A slim libc implementation is required, and is provided under the name klibc, maintained by Hans Peter Anvin.

The multipath implementation toolchain fits in this early userspace definition. Udev, multipath and kpartx are linked against klibc and can be packaged in the cpio archive to bring up the multipathed device early enough to boot from it.

The last multipath implementation requirement is thus met.

Quick installation guide

  • Download and unpack a recent 2.6 kernel

  • Download, unpack and apply the associated udm patchset from http://people.sistina.com/~thornber/dm/

  • Configure, compile and install the kernel, reconfigure the bootloader if needed

  • Download udev, at least 017, from http://ftp.kernel.org/pub/linux/utils/kernel/hotplug/

  • Compile udev with klibc linking (edit the Makefile to uncomment KLIBC=true), then run make install.

  • In udev/extras/multipath, run make and make install

  • Reboot under the new kernel and see the magic operate

