
[Cluster-devel] [RFC][PATCH] dlm: Reset fs_notified when check_fs_done


Regarding the issue where dlm_controld and fs_controld sit spinning,
retrying and replying for the fs_notified check, I suspect that
another scenario may also hit that logic:

If node->fs_notified was set to 1 by a previous change, then when a
new change comes and needs to check node->fs_notified, the flag has
not been reset to 0, so check_fs_done will succeed even if
dlm_controld has not received the notification from fs_controld this
time.

For example, given the following membership changes n, n+1, n+2,
let's see what happens on node X:
Step 1: cg n: node Y leaves with CPG_REASON_NODEDOWN reason,
        eventually in node X's ls->node_history, node Y's fs_notified
        = 1
Step 2: cg n+1: node Y joins ...
Step 3: cg n+2: node Y leaves with CPG_REASON_NODEDOWN reason. One
        possible scenario is: before fs_controld's notification
        arrives, dlm_controld already knows from the CPG message
        that node Y is down and has done a lot of work; it sees node
        Y's fs_notified = 1 (set in Step 1) and wrongly passes the
        fs check. So node Y's check_fs is reset to 0.
Step 4: fs_controld's notification arrives; it sees node Y's check_fs
        = 0, assumes dlm_controld does not yet know node Y is down,
        and retries sending the notification. But in fact,
        dlm_controld already knows this and has finished all the
        work, which results in the spinning ...
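
The steps above can be sketched as a small standalone model. This is
not the dlm_controld code: struct node_model, check_fs_done_model()
and run_scenario() are hypothetical names made up for illustration,
and only the two flags discussed here are modelled. It assumes that
nothing else clears fs_notified between changes, which is the point
in question.

```c
/* Hypothetical, simplified stand-in for a per-node history entry.
 * Only the two flags discussed in this mail are modelled. */
struct node_model {
    int check_fs;     /* set when a failed node needs the fs check */
    int fs_notified;  /* set when fs_controld's notification arrives */
};

/* Models the fs check for one node. Returns 1 if the check passes,
 * 0 if it is still waiting for the notification. reset_notified
 * selects the behaviour proposed in the patch. */
static int check_fs_done_model(struct node_model *n, int reset_notified)
{
    if (!n->check_fs)
        return 1;
    if (n->fs_notified) {
        n->check_fs = 0;
        if (reset_notified)
            n->fs_notified = 0;
        return 1;
    }
    return 0;
}

/* Replays Steps 1 to 3 for node Y and returns whether the fs check in
 * cg n+2 passes before any new notification has arrived. */
int run_scenario(int reset_notified)
{
    struct node_model y = { 0, 0 };

    /* Step 1: cg n, node Y fails, the notification arrives and the
     * check passes. */
    y.check_fs = 1;
    y.fs_notified = 1;
    check_fs_done_model(&y, reset_notified);

    /* Step 2: cg n+1, node Y rejoins; the flags are left untouched
     * in this model. */

    /* Step 3: cg n+2, node Y fails again, no notification yet. */
    y.check_fs = 1;
    return check_fs_done_model(&y, reset_notified);
}
```

In this model run_scenario(0) returns 1 (the check passes on the
stale flag), while run_scenario(1) returns 0 (the check waits for the
new notification).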

I'm not sure if I read the code correctly :-) Below is the patch, which
resets node->fs_notified. Review and comments are highly
appreciated.

Signed-off-by: Jiaju Zhang <jjzhang linux gmail com>
 group/dlm_controld/cpg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
index d5245ce..b257595 100644
--- a/group/dlm_controld/cpg.c
+++ b/group/dlm_controld/cpg.c
@@ -636,6 +636,7 @@ static int check_fs_done(struct lockspace *ls)
 		if (node->fs_notified) {
 			node->check_fs = 0;
+			node->fs_notified = 0;
 		} else {
 			log_group(ls, "check_fs nodeid %d needs fs notify",
