
[dm-devel] [PATCH]: multipath: fail path on zero size devices



Hi,

 

I am sending this as a follow-up to the situation described in the following patch:

https://patchwork.kernel.org/patch/62094

 

We are using MD32xxi storage arrays (iSCSI, RDAC) in a cluster environment in which two machines are logged into both controllers of the storage array.

 

This is the output of multipath -ll on one of the machines when all is well:

lun1 (36842b2b00063c13d000003594ce9f82b) dm-1 DELL,MD32xxi
[size=1.0T][features=2 pg_init_retries 50][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
\_ 6:0:0:1  sdf 8:80  [active][ready]
\_ 5:0:0:1  sdh 8:112 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 8:0:0:1  sdj 8:144 [active][ghost]
\_ 7:0:0:1  sdk 8:160 [active][ghost]
lun0 (36842b2b000571923000003933b4c0d04) dm-0 DELL,MD32xxi
[size=1.0T][features=2 pg_init_retries 50][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=200][active]
\_ 8:0:0:0  sdd 8:48  [active][ready]
\_ 7:0:0:0  sdc 8:32  [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 6:0:0:0  sdb 8:16  [active][ghost]
\_ 5:0:0:0  sde 8:64  [active][ghost]

 

When I disconnect the switch connecting the machines to the storage array (while there is I/O to the devices) and rescan the storage during this period, I get READ_CAPACITY failures, and as a result the devices end up with a size of zero. It seems that in this case we not only get a ping-pong between path groups on the same machine, but also a ping-pong of LUN ownership between the two machines, which wreaks havoc on our system.
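To see the zero size that the failed READ_CAPACITY leaves behind, the capacity the kernel currently holds for a path device can be read from sysfs. The following is a minimal sketch, assuming the usual Linux layout in which /sys/block/<dev>/size holds the capacity in 512-byte sectors. The helpers read_size_file and read_block_size are hypothetical illustrations, not multipath-tools' own sysfs_get_size used in the patch below.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs "size" attribute (capacity in 512-byte sectors).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * A disk whose READ CAPACITY failed during a rescan reads back as 0.
 */
static int read_size_file(const char *path, unsigned long long *sectors)
{
    FILE *f;
    int ok;

    assert(path != NULL && sectors != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%llu", sectors) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical wrapper for a whole disk such as "sdf": /sys/block/<dev>/size */
static int read_block_size(const char *dev, unsigned long long *sectors)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "/sys/block/%s/size", dev);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* name too long or formatting error */
    return read_size_file(path, sectors);
}
```

For scale, a healthy 1 TiB LUN would read back as 2147483648 sectors (2^40 / 512), while each path of a LUN hit by the failed rescan reads back as 0.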

 

This was the output of multipath -ll during the failure:

lun1 (36842b2b00063c13d000003594ce9f82b) dm-1 ,
[size=1.0T][features=2 pg_init_retries 50][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=0][active]
\_ #:#:#:# sdf 8:80  [active][undef]
\_ #:#:#:# sdh 8:112 [active][undef]
\_ round-robin 0 [prio=0][enabled]
\_ #:#:#:# sdj 8:144 [active][undef]
\_ #:#:#:# sdk 8:160 [active][undef]
lun0 (36842b2b000571923000003933b4c0d04) dm-0 ,
[size=1.0T][features=2 pg_init_retries 50][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=0][enabled]
\_ #:#:#:# sdd 8:48  [failed][undef]
\_ #:#:#:# sdc 8:32  [failed][undef]
\_ round-robin 0 [prio=0][enabled]
\_ #:#:#:# sdb 8:16  [failed][undef]
\_ #:#:#:# sde 8:64  [failed][undef]

 

Instead of refusing to let zero-size paths into the map, we should let them in but fail them, even though the path checker reports that the path is up.

The following patch applies on top of multipathd in device-mapper-multipath-0.4.7-42.el5 (RHEL 5.6):

 

--- a/multipathd/main.c	2010-12-07 08:02:23.000000000 +0200
+++ b/multipathd/main.c	2011-02-10 10:51:57.000000000 +0200
@@ -1050,4 +1050,11 @@
 							&(pp->checker.timeout));
 				newstate = checker_check(&pp->checker);
+				if (newstate != PATH_DOWN) {
+					unsigned long long size = 0;
+
+					sysfs_get_size(sysfs_path, pp->dev, &size);
+					if (size == 0)
+						newstate = PATH_DOWN;
+				}
 		}

With the path marked down, the path groups no longer compete for LUN ownership, and things remain stable.

In addition, when the ping-pong runs wild I need to rescan again after the storage comes back up; with this fix I do not.
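The check the patch adds can be pulled out of the sysfs I/O as a pure decision: whatever the checker reports, a path whose device has no capacity is treated as down. This is a minimal sketch only; the enum is a reduced, hypothetical stand-in for multipathd's path states (the real constants and their values live in libmultipath), and effective_state is an illustrative name, not a multipathd function.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for multipathd's path states. */
enum path_state { PATH_DOWN, PATH_UP, PATH_GHOST };

/*
 * Combine the checker verdict with the device size in sectors.
 * A zero-size path is failed even when the checker reports it up or
 * ghost, so multipathd stops using it; otherwise the verdict stands.
 */
static enum path_state effective_state(enum path_state checker_state,
                                       unsigned long long size_sectors)
{
    if (checker_state != PATH_DOWN && size_sectors == 0)
        return PATH_DOWN;
    return checker_state;
}
```

Keeping the size test inside the `newstate != PATH_DOWN` branch, as the patch does, means the sysfs read is skipped for paths the checker has already failed.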

 

Menny

 

 

 

Menny Hamburger

Engineer

Dell | IDC

office +972 97698789,  fax +972 97698889

Dell IDC. 4 Hacharoshet St, Raanana 43657, Israel

 

