
[dm-devel] dm-multipath (multipathd) not removing/adding channels back on one device in a multipath array


I'm seeing a repeatable problem in which multipathd fails to add and/or remove paths for a single device on a dual-loop FC disk tray. The kernel's own path detection and a manual run of 'multipath' are not affected. If I stop multipathd, the kernel sees the paths as unreachable and marks them as 'failed' in the 'multipath -l' output. If I run 'multipath' manually, it _always_ picks up or removes the appropriate channels for all devices.

The failure only shows up when multipathd is left to auto-correct for path failures. Only a /single/ device (the first FC drive in the array) is affected, and it is affected reliably.

When multipathd is running, the drive enumerated as /dev/sdb and /dev/sdp (14-drive enclosure, sdb-sdo, with drive naming restarting at /dev/sdp for the second path) is skipped at least 50% of the time when a path is removed or added. However long I wait, multipathd never makes another attempt at fixing the path. Running 'multipath' by hand, on the other hand, immediately cleans up the straggler and prints output to that effect. Of note: if I leave multipathd off and do not run 'multipath' manually before reconnecting the FC channel, then disconnecting the channel again makes the system oops repeatedly and hard-crash.
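To make the device pairing explicit, here is a minimal Python sketch that computes which sd name should show up on the second loop for a given first-loop name. It assumes the kernel hands out names in plain sequence (sda ... sdz, sdaa, ...) and that the second loop enumerates the same 14 drives in the same order; the helper names are hypothetical and not part of multipath-tools.

```python
# Sketch only: maps a first-loop sd name to its expected second-loop name,
# assuming sequential enumeration and a fixed number of drives per loop.

def sd_index(name):
    """Zero-based position of an sd name: sda -> 0, sdz -> 25, sdaa -> 26."""
    suffix = name[2:]
    n = 0
    for ch in suffix:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1


def sd_name(index):
    """Inverse of sd_index: 0 -> sda, 25 -> sdz, 26 -> sdaa."""
    n = index + 1
    suffix = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        suffix = chr(ord("a") + rem) + suffix
    return "sd" + suffix


def second_loop_name(first_loop_name, drives_per_loop=14):
    """Expected name of the same drive as seen through the second loop."""
    return sd_name(sd_index(first_loop_name) + drives_per_loop)


if __name__ == "__main__":
    for dev in ("sdb", "sdj", "sdo"):
        print(dev, "->", second_loop_name(dev))
```

With 14 drives per loop this pairs sdb with sdp and sdj with sdx, which agrees with the path listings further down.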

// Example of multipathd leaving the drive (mpath3) with only one path "up", while the others have both paths present:

mpath10 (32000000c50e8df4b)
[size=136 GB][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:8:0  sdj  8:144  [active][undef]
 \_ 3:0:8:0  sdx  65:112 [active][undef]
mpath3 (320000011c6bdfbd5)
[size=136 GB][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:0  sdb  8:16   [active][undef]
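
For anyone wanting to spot the short map without reading the listing by eye, a rough Python sketch follows. It parses text in the layout shown above and reports maps with fewer paths than expected. The regular expressions are assumptions based only on this output from multipath 0.4.7 and may not match other versions; the function names are hypothetical.

```python
import re

# Map header, e.g. "mpath3 (320000011c6bdfbd5)"
MAP_RE = re.compile(r"^(?P<alias>\S+) \((?P<wwid>[0-9a-fA-F]+)\)")
# Path line, e.g. " \_ 2:0:0:0  sdb  8:16   [active][undef]"
PATH_RE = re.compile(
    r"^\s*\\_ (?P<hctl>\d+:\d+:\d+:\d+)\s+(?P<dev>\S+)\s+"
    r"(?P<majmin>\d+:\d+)\s+\[(?P<dm_state>\w+)\]\[(?P<chk_state>\w+)\]"
)


def parse_maps(text):
    """Return {alias: [device names]} for each map in the listing."""
    maps = {}
    current = None
    for line in text.splitlines():
        header = MAP_RE.match(line)
        if header:
            current = header.group("alias")
            maps[current] = []
            continue
        path = PATH_RE.match(line)
        if path and current is not None:
            maps[current].append(path.group("dev"))
    return maps


def short_maps(maps, expected_paths=2):
    """Aliases of maps that have fewer paths than expected."""
    return sorted(a for a, devs in maps.items() if len(devs) < expected_paths)


SAMPLE = r"""mpath10 (32000000c50e8df4b)
[size=136 GB][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:8:0  sdj  8:144  [active][undef]
 \_ 3:0:8:0  sdx  65:112 [active][undef]
mpath3 (320000011c6bdfbd5)
[size=136 GB][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:0  sdb  8:16   [active][undef]
"""

if __name__ == "__main__":
    print(short_maps(parse_maps(SAMPLE)))
```

In practice the text would come from saved 'multipath -l' output rather than the embedded sample.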

// Example output (snippet) of 'multipath -v4' after 'multipathd' fails to fix it:

mpath3: set ACT_RELOAD (path group topology change)
reload: mpath3 (320000011c6bdfbd5)
[size=136 GB][features=0][hwhandler=0]
\_ round-robin 0 [prio=2][undef]
 \_ 2:0:0:0  sdb  8:16   [active][ready]
 \_ 3:0:0:0  sdp  8:240  [undef][ready]

The ACT_RELOAD line is the difference at this point: all the other, fully populated multipath devices report, for instance, "mpath0: set ACT_NOTHING (map unchanged)". It seems that whatever criteria multipathd uses to test a device and adjust its settings fail on this first-enumerated disk once multipathd starts looking at the drives through the second FC loop.
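
A small Python sketch for pulling these action lines out of saved 'multipath -v4' output, so the maps that 'multipath' would still change stand out. The line format is assumed from the two lines quoted in this message only, and the function names are hypothetical.

```python
import re

# Action line, e.g. "mpath3: set ACT_RELOAD (path group topology change)"
ACTION_RE = re.compile(
    r"^(?P<alias>\S+): set (?P<action>ACT_\w+) \((?P<reason>[^)]*)\)"
)


def planned_actions(text):
    """Return {alias: (action, reason)} for each action line found."""
    actions = {}
    for line in text.splitlines():
        m = ACTION_RE.match(line.strip())
        if m:
            actions[m.group("alias")] = (m.group("action"), m.group("reason"))
    return actions


def changed_maps(actions):
    """Aliases whose planned action is anything other than ACT_NOTHING."""
    return sorted(a for a, (act, _) in actions.items() if act != "ACT_NOTHING")


SAMPLE = """mpath0: set ACT_NOTHING (map unchanged)
mpath3: set ACT_RELOAD (path group topology change)
"""

if __name__ == "__main__":
    print(changed_maps(planned_actions(SAMPLE)))
```

Any map listed here after multipathd has had time to react is one that multipathd left in the wrong state.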

I've attached a typescript of "multipathd -d" in one file, and the 'multipath -l' and 'multipath -v' output in a second file. Together they show in detail the sequence of events when a loop is added to and removed from the system. dmesg output is also attached.

I'd love to be of as much assistance as I can, as I currently have eight systems with this problem and can't do much with them yet. I also have a set of QLogic qla2300 controllers and some different disk trays, which I'll be testing to see whether this is a controller-specific or enclosure-specific issue.

Please let me know what more I can do/provide/try to make myself useful.




Some info:

2x Opteron 248, 8GB RAM

Tyan S2882 and Arima HDAMA boards tested.

kernel 2.6.18 (64-bit, SMP, NUMA)

dm-multipath v0.4.7 (03/12, 2006)

2 LSI FC adapters single-port 2G

14-drive LSI FC JBOD tray (model 2600/0834) dual-controller 2G

defaults {
        polling_interval        5
        path_grouping_policy    multibus
        rr_min_io               100
        failback                15
        no_path_retry           2
}

Attachment: multipathd-debug.out.bz2
Description: Binary data

Attachment: multipath-debug.out.bz2
Description: Binary data

Attachment: dmesg.2006-10-09.bz2
Description: Binary data
