[Cluster-devel] [PATCH] fs/dlm: fix connection close handling.

For some reason, connections which were closed were being put back on
the work queue, causing a hang in trying to connect to a blocked node,
or a crash trying to access a closed connection.

David provided a fix which introduced the CF_CLOSE flag, but which still
could trigger the crash. Chrissie provided a fix which cleared the
CONNECT_ and WRITE_PENDING flags, but which still could trigger the hang
(I think because send_to_sock() would still attempt to connect in its
retry path). I added a fix which avoided the unconditional call to
send_to_sock() and also cancelled any work which might still be on the
work queue.

Combined, these three fix the hangs and crashes I have been seeing when
a node was killed (bugzilla.novell.com #52422). 
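
To illustrate how the three pieces are meant to interact, here is a small
user-space model. It is only a sketch: the model_* names, the struct, and the
"queued" counter are hypothetical stand-ins for the kernel workqueue and
cancel_work_sync(), and the bit values other than CF_CLOSE are assumed rather
than taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* CF_CLOSE matches the patch; the other two bit numbers are assumed. */
enum { CF_CONNECT_PENDING = 1, CF_WRITE_PENDING = 2, CF_CLOSE = 6 };

struct model_con {
	atomic_ulong flags;
	int queued;	/* stands in for con->swork sitting on send_workqueue */
	int sends;	/* counts calls that would reach send_to_sock() */
};

static bool model_test_bit(int nr, struct model_con *c)
{
	return (atomic_load(&c->flags) & (1UL << nr)) != 0;
}

static void model_set_bit(int nr, struct model_con *c)
{
	atomic_fetch_or(&c->flags, 1UL << nr);
}

static void model_clear_bit(int nr, struct model_con *c)
{
	atomic_fetch_and(&c->flags, ~(1UL << nr));
}

/* Returns the previous state of the bit, like the kernel helpers. */
static bool model_test_and_set_bit(int nr, struct model_con *c)
{
	return (atomic_fetch_or(&c->flags, 1UL << nr) & (1UL << nr)) != 0;
}

static bool model_test_and_clear_bit(int nr, struct model_con *c)
{
	return (atomic_fetch_and(&c->flags, ~(1UL << nr)) & (1UL << nr)) != 0;
}

/* Piece 1: a closed connection is never put back on the queue. */
static void model_connect_sock(struct model_con *c)
{
	if (model_test_bit(CF_CLOSE, c))
		return;
	if (!model_test_and_set_bit(CF_CONNECT_PENDING, c))
		c->queued = 1;
}

/* Piece 2: close clears the pending bits, marks the connection closed,
 * and drops queued work (modelling cancel_work_sync()). */
static void model_close(struct model_con *c)
{
	model_clear_bit(CF_CONNECT_PENDING, c);
	model_clear_bit(CF_WRITE_PENDING, c);
	model_set_bit(CF_CLOSE, c);
	c->queued = 0;
}

/* Piece 3: the worker only sends when a write is actually pending. */
static void model_process_send(struct model_con *c)
{
	c->queued = 0;
	if (model_test_and_clear_bit(CF_CONNECT_PENDING, c))
		model_set_bit(CF_WRITE_PENDING, c);
	if (model_test_and_clear_bit(CF_WRITE_PENDING, c))
		c->sends++;
}

/* One send before close, none afterwards. */
static int model_demo(void)
{
	struct model_con c = { .queued = 0, .sends = 0 };

	atomic_init(&c.flags, 0UL);
	model_connect_sock(&c);
	model_process_send(&c);
	model_close(&c);
	model_connect_sock(&c);		/* ignored: CF_CLOSE is set */
	model_process_send(&c);		/* no pending bits, so no send */
	assert(c.queued == 0);
	return c.sends;
}
```

In this model, a worker that runs after close finds no pending bits and so
never reaches the send path, which is the behaviour the combined fix is after.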

I'm not perfectly happy with this patch; it feels as if it is fixing
symptoms. In particular, I don't quite understand where
lowcomms_connect_to_sock() ends up being called from with the connection
closed, but I've resisted the urge to insert a BUG() in the if clause
there so far. Maybe someone else is inspired by this patch to reevaluate
the connection handling completely ;-)

Acked-by: teigland redhat com
Acked-by: ccaulfie redhat com

Index: dlm/lowcomms.c
--- dlm.orig/lowcomms.c
+++ dlm/lowcomms.c
@@ -106,6 +106,7 @@ struct connection {
 #define CF_INIT_PENDING 4
 #define CF_IS_OTHERCON 5
+#define CF_CLOSE 6
 	struct list_head writequeue;  /* List of outgoing writequeue_entries */
 	spinlock_t writequeue_lock;
 	int (*rx_action) (struct connection *);	/* What to do when active */
@@ -299,6 +300,8 @@ static void lowcomms_write_space(struct
 static inline void lowcomms_connect_sock(struct connection *con)
+	if (test_bit(CF_CLOSE, &con->flags))
+		return;
 	if (!test_and_set_bit(CF_CONNECT_PENDING, &con->flags))
 		queue_work(send_workqueue, &con->swork);
@@ -1370,6 +1373,15 @@ int dlm_lowcomms_close(int nodeid)
 	log_print("closing connection to node %d", nodeid);
 	con = nodeid2con(nodeid, 0);
 	if (con) {
+		clear_bit(CF_CONNECT_PENDING, &con->flags);
+		clear_bit(CF_WRITE_PENDING, &con->flags);
+		set_bit(CF_CLOSE, &con->flags);
+		if (cancel_work_sync(&con->swork)) {
+			log_print("swork cancelled for node %d", nodeid);
+		}
+		if (cancel_work_sync(&con->rwork)) {
+			log_print("rwork cancelled for node %d", nodeid);
+		}
 		close_connection(con, true);
@@ -1395,9 +1407,11 @@ static void process_send_sockets(struct
 	if (test_and_clear_bit(CF_CONNECT_PENDING, &con->flags)) {
+		set_bit(CF_WRITE_PENDING, &con->flags);
+	}
+	if (test_and_clear_bit(CF_WRITE_PENDING, &con->flags)) {
+		send_to_sock(con);
-	clear_bit(CF_WRITE_PENDING, &con->flags);
-	send_to_sock(con);

Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

