[Cluster-devel] Cluster Project branch, master, updated. gfs-kernel_0_1_22-67-g966f6d0

Tue Mar 11 18:49:33 UTC 2008

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "Cluster Project".

http://sources.redhat.com/git/gitweb.cgi?p=cluster.git;a=commitdiff;h=966f6d098ec0576be68603852ea29e38fe12e7fc

The branch, master has been updated
       via  966f6d098ec0576be68603852ea29e38fe12e7fc (commit)
      from  e2c8836c385b24f1e56674aa458e3b298b7a1cb9 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 966f6d098ec0576be68603852ea29e38fe12e7fc
Author: David Teigland <teigland at redhat.com>
Date:   Tue Mar 11 12:18:23 2008 -0500

    groupd: purge messages from dead nodes
    
    bz 436984
    
    In the fix for bug 258121 (commit 70294dd8b717de89f2d168c0837c011648908558),
    we began taking nodedown events via the groupd cpg instead of via the
    per-group cpg.  Messages still come in via the per-group cpg.  I believe
    that change opened the possibility of processing a message from a node
    after processing the nodedown for it.
    
    In Nate's revolver test, we saw it happen; revolver killed nodes 1,2,3,
    leaving just node 4:
    1205198713 0:default confchg left 3 joined 0 total 1
    1205198713 0:default confchg removed node 1 reason 3
    1205198713 0:default confchg removed node 2 reason 3
    1205198713 0:default confchg removed node 3 reason 3
    ...
    1205198713 0:default mark_node_started: event not starting 12 from 2

-----------------------------------------------------------------------

Summary of changes:
 group/daemon/app.c         |   25 +++++++++++++++++++++++++
 group/daemon/cpg.c         |    1 +
 group/daemon/gd_internal.h |    1 +
 3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/group/daemon/app.c b/group/daemon/app.c
index f73d3bc..93f5c21 100644
--- a/group/daemon/app.c
+++ b/group/daemon/app.c
@@ -691,6 +691,31 @@ int queue_app_message(group_t *g, struct save_msg *save)
 	return 0;
 }
 
+/* This is called when we get the nodedown for the per-group cpg; we know
+   that after the cpg nodedown we won't get any further messages. bz 436984
+   It's conceivable but unlikely that the nodedown processing (initiated by
+   the groupd cpg nodedown) could begin before the per-group cpg nodedown
+   is received where this purging occurs.  If it does, then we may need to
+   add code to wait for the nodedown to happen in both the groupd cpg and the
+   per-group cpg before processing the nodedown. */
+
+void purge_node_messages(group_t *g, int nodeid)
+{
+	struct save_msg *save, *tmp;
+
+	list_for_each_entry_safe(save, tmp, &g->messages, list) {
+		if (save->nodeid != nodeid)
+			continue;
+
+		log_group(g, "purge msg from dead node %d", nodeid);
+
+		list_del(&save->list);
+		if (save->msg_long)
+			free(save->msg_long);
+		free(save);
+	}
+}
+
 static void del_app_nodes(app_t *a)
 {
 	node_t *node, *tmp;
diff --git a/group/daemon/cpg.c b/group/daemon/cpg.c
index ecdf418..3ac42f9 100644
--- a/group/daemon/cpg.c
+++ b/group/daemon/cpg.c
@@ -402,6 +402,7 @@ void process_confchg(void)
 		case CPG_REASON_NODEDOWN:
 		case CPG_REASON_PROCDOWN:
 			/* process_node_down(g, saved_left[i].nodeid); */
+			purge_node_messages(g, saved_left[i].nodeid);
 			break;
 		default:
 			log_error(g, "unknown leave reason %d node %d",
diff --git a/group/daemon/gd_internal.h b/group/daemon/gd_internal.h
index 8562801..404e769 100644
--- a/group/daemon/gd_internal.h
+++ b/group/daemon/gd_internal.h
@@ -263,6 +263,7 @@ void groupd_down(int nodeid);
 char *msg_type(int type);
 int process_app(group_t *g);
 int is_our_join(event_t *ev);
+void purge_node_messages(group_t *g, int nodeid);
 
 /* main.c */
 void app_stop(app_t *a);
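
The purge added in app.c above can be tried outside groupd with the
following minimal sketch.  It is not the daemon's code: struct saved_msg,
queue_msg(), purge_node() and queue_len() are hypothetical stand-ins, and a
plain singly linked list replaces the list_for_each_entry_safe() macro so
that the example builds on its own.  Only the idea is the same: walk the
saved-message queue and drop every entry whose nodeid matches the dead node.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for groupd's struct save_msg; only the fields
   the purge touches are modeled here. */
struct saved_msg {
	int nodeid;               /* sender of the queued message */
	char *msg_long;           /* optional heap buffer, may be NULL */
	struct saved_msg *next;
};

/* Hypothetical helper: push a message from nodeid onto the front of the
   queue and return the new head (the old head if allocation fails). */
struct saved_msg *queue_msg(struct saved_msg *head, int nodeid)
{
	struct saved_msg *m = calloc(1, sizeof(*m));

	if (!m)
		return head;
	m->nodeid = nodeid;
	m->next = head;
	return m;
}

/* Unlink and free every queued message sent by nodeid.
   Returns the number of messages purged. */
int purge_node(struct saved_msg **head, int nodeid)
{
	struct saved_msg **pp = head;
	int purged = 0;

	while (*pp) {
		struct saved_msg *m = *pp;

		if (m->nodeid != nodeid) {
			pp = &m->next;    /* keep this entry, move on */
			continue;
		}
		*pp = m->next;            /* unlink before freeing */
		free(m->msg_long);        /* free(NULL) is a no-op */
		free(m);
		purged++;
	}
	return purged;
}

/* Count the messages still queued. */
int queue_len(const struct saved_msg *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

Queueing messages from nodes 2, 4 and 2 and then calling purge_node(&q, 2)
should leave only the message from node 4.  The pointer-to-pointer walk
plays the role that the _safe list iterator plays in the real patch: the
next entry is saved before the current one is freed.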


hooks/post-receive
--
Cluster Project