
[dm-devel] [PATCH 27/42] Update 'no_path_retry' correctly for failed paths



The bug is triggered if a path-failed event is received by multipathd after all
paths have already been marked as failed. Surprisingly, this seems to happen
quite often; a colleague of mine who tested this hit the bug every time.

Here is the event sequence that explains this bug. I left some messages in for
clarity; the full log is available on request. We have completed initialization
and set the feature queue_if_no_path for map CX_201 by virtue of using
no_path_retry > 0.

Aug 31 10:49:09 | CX_201: devmap event #18
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)
Aug 31 10:49:09 | 65:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 3
Aug 31 10:49:09 | 8:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 2
Aug 31 10:49:09 | CX_201: devmap event #19
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)

Two paths were failed by the driver; multipathd marked them as failed.

Aug 31 10:49:09 | checker failed path 66:0 in map CX_201
Aug 31 10:49:09 | CX_201: remaining active paths: 1

The checker failed the third path.

Aug 31 10:49:09 | checker failed path 8:96 in map CX_201
Aug 31 10:49:09 | CX_201: Entering recovery mode: max_retries=2
Aug 31 10:49:09 | CX_201: remaining active paths: 0

The checker failed the last path; multipathd entered the retry loop.

Aug 31 10:49:10 | CX_201: devmap event #20

We got a late event about a failed path.

Aug 31 10:49:10 | CX_201: discover

Discovery starts. The call chain is update_multipath -> setup_multipath ->
update_multipath_strings -> update_multipath_table -> disassemble_map.

Now disassemble_map tries to set the no_path_retry value from the kernel. This
obviously is not going to work, as the kernel can remember only a Boolean
(queue/fail), while no_path_retry is an arbitrary integer. So no_path_retry is
set to NO_PATH_RETRY_QUEUE from the kernel.
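
To illustrate why the value cannot survive the round trip, here is a minimal
sketch, not the libmultipath code. All model_* names and the MODEL_* numeric
values are hypothetical stand-ins chosen for this example, and the mapping of a
cleared flag to "fail" is an assumption based on the queue/fail description
above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the daemon's special values; the numeric
 * values are assumptions made for this sketch only. */
enum { MODEL_RETRY_FAIL = -1, MODEL_RETRY_QUEUE = -2 };

/* What the kernel table can record: only whether queueing is enabled. */
static bool model_to_kernel(int no_path_retry)
{
	/* Any positive retry count, or "queue forever", enables queueing. */
	return no_path_retry > 0 || no_path_retry == MODEL_RETRY_QUEUE;
}

/* What can be read back from the kernel: queue forever or fail. */
static int model_from_kernel(bool queue_if_no_path)
{
	return queue_if_no_path ? MODEL_RETRY_QUEUE : MODEL_RETRY_FAIL;
}

/* Round trip: configured value -> kernel flag -> value read back. */
static int model_round_trip(int configured)
{
	assert(configured != 0); /* 0 (undefined) is outside this model */
	return model_from_kernel(model_to_kernel(configured));
}
```

In this model, model_round_trip(2) yields MODEL_RETRY_QUEUE: the retry count of
2 is lost and only "queue" remains.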

Aug 31 10:49:10 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:10 | CX_201: pgfailback = -2 (controller setting)

At this point we call set_no_path_retry:

set_no_path_retry(struct multipath *mpp)
{
        mpp->retry_tick = 0;
        mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
        if (mpp->nr_active > 0)
                select_no_path_retry(mpp);

So:

1) retry_tick is reset
2) nr_active = 0 (no active paths)
3) we do not set no_path_retry from the config file because nr_active == 0, so
we are left with NO_PATH_RETRY_QUEUE.
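
The three steps above can be reproduced with a small stand-alone model. This is
a sketch under stated assumptions, not the real function: struct model_map,
model_select and the "patched" flag are hypothetical, and model_select merely
copies a stored config value where the real select_no_path_retry consults the
configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_RETRY_QUEUE = -2 }; /* hypothetical "queue forever" marker */

/* Hypothetical, reduced view of the per-map state used in this sketch. */
struct model_map {
	int configured_retry; /* value from the config file, e.g. 2 */
	int no_path_retry;    /* value the daemon currently acts on */
	int nr_active;        /* number of paths that are up or ghost */
	int retry_tick;       /* remaining retry ticks */
};

/* Simplified stand-in for select_no_path_retry(): re-read the config. */
static void model_select(struct model_map *m)
{
	m->no_path_retry = m->configured_retry;
}

/* patched == false models the old check, true the unconditional call. */
static void model_set_no_path_retry(struct model_map *m, bool patched)
{
	assert(m != NULL);
	m->retry_tick = 0;
	/* Old behaviour: the config is re-read only with active paths. */
	if (patched || m->nr_active > 0)
		model_select(m);
}
```

With nr_active == 0 and no_path_retry already overwritten by
MODEL_RETRY_QUEUE, the unpatched variant leaves the value untouched, while the
patched variant restores the configured 2.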

Aug 31 10:49:10 | pg_timeout = NONE (internal default)

From now on there are no state changes, so the map is hung forever.

Signed-off-by: Martin Wilck <martin wilck ts fujitsu com>
Signed-off-by: Hannes Reinecke <hare suse de>
---
 libmultipath/structs_vec.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
index 384afb7..7073915 100644
--- a/libmultipath/structs_vec.c
+++ b/libmultipath/structs_vec.c
@@ -306,8 +306,7 @@ set_no_path_retry(struct multipath *mpp)
 {
 	mpp->retry_tick = 0;
 	mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
-	if (mpp->nr_active > 0)
-		select_no_path_retry(mpp);
+	select_no_path_retry(mpp);
 
 	switch (mpp->no_path_retry) {
 	case NO_PATH_RETRY_UNDEF:
-- 
1.7.4.2

