[Linux-cluster] dlm patch to fix referencing freed memory

Daniel McNeil daniel at osdl.org
Thu Dec 9 23:47:02 UTC 2004


I turned on CONFIG_DEBUG_SLAB and was hitting slab corruption.
I added a debug magic number to the rw_semaphore struct,
added a BUG_ON() to down_write() and up_write(), and got a
stack trace showing that up_write() was referencing freed
memory:

EIP is at dlm_unlock_stage2+0x126/0x2a0 [dlm]
eax: 00000025   ebx: e5b34d98   ecx: c0456c0c   edx: 00000286
esi: e5b34d34   edi: e5b2eb28   ebp: ea3f2e64   esp: ea3f2e48
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 3693, threadinfo=ea3f2000 task=e8c20ed0)
Stack: e5b2eb28 00000005 00000000 00000000 e5b2eb28 e7ae85a4 e7ae8654 ea3f2ea0
       f8b305dd e5b2eb28 e5b34d34 00000000 000e00a3 00040000 00000000 00000000
       e5b34dcd ffffffea e5b34d34 00000000 e5b40678 e5b33d38 ea3f2ecc f8b4fcd6
Call Trace:
 [<c010626f>] show_stack+0x7f/0xa0
 [<c010641e>] show_registers+0x15e/0x1d0
 [<c010663e>] die+0xfe/0x190
 [<c0105e59>] error_code+0x2d/0x38
 [<f8b305dd>] dlm_unlock+0x2ad/0x3c0 [dlm]
 [<f8b4fcd6>] do_dlm_unlock+0x86/0x120 [lock_dlm]
 [<f8b50138>] lm_dlm_unlock+0x18/0x30 [lock_dlm]
 [<f8af2ba3>] gfs_glock_drop_th+0x93/0x1a0 [gfs]
 [<f8af1e0b>] rq_demote+0xbb/0xe0 [gfs]
 [<f8af1f18>] run_queue+0x88/0xe0 [gfs]
 [<f8af208b>] unlock_on_glock+0x2b/0x40 [gfs]
 [<f8af49d2>] gfs_reclaim_glock+0x132/0x1b0 [gfs]
 [<f8ae46ea>] gfs_glockd+0x11a/0x130 [gfs]
 [<c0103325>] kernel_thread_helper+0x5/0x10
Code: 2a 89 d8 ba ff ff 00 00 f0 0f c1 10 0f 85 4c 12 00 00 ba 9a 3a b4 f8 e8 c9 bf 6e c7 e8 54 b8 ff ff 83 c4 10 31 c0 5b 5e 5f 5d c3 <0f> 0b e8 00 2e 3a b4 f8 eb cc ba 88 3a b4 f8 89 d8 e8 a4 bf 6e
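For reference, here is a rough sketch of the debug instrumentation
described above. The field, macro, and helper names are hypothetical
(this is not the actual debug patch), and the existing rwsem fields
shown are approximate:

#define RWSEM_DEBUG_MAGIC	0x53454d41	/* "SEMA" */

struct rw_semaphore {
	signed long		count;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
	unsigned long		debug_magic;	/* set at init time */
};

/* Called at the top of down_write() and up_write(); fires if the
 * rwsem's memory has been freed and then poisoned or reused. */
static inline void rwsem_debug_check(struct rw_semaphore *sem)
{
	BUG_ON(sem->debug_magic != RWSEM_DEBUG_MAGIC);
}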

Looking through the code, I found that a call to
queue_ast(lkb, AST_COMP | AST_DEL, 0); will lead to
process_asts(), which frees the dlm_rsb.  So there
is a race where the rsb can be freed BEFORE we do the
up_write(&rsb->res_lock);

The fix is simple: do the up_write() before the queue_ast().
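Condensed, the change in ordering looks like this (a minimal sketch of
the two orderings; the real change is in the patch below):

	/* Before: up_write() runs after queue_ast().  With AST_DEL
	 * queued, process_asts() can free the rsb first, so the
	 * up_write() can touch freed memory. */
	queue_ast(lkb, AST_COMP | AST_DEL, 0);
	up_write(&rsb->res_lock);

	/* After: drop res_lock first.  Once queue_ast() has run with
	 * AST_DEL, the rsb must not be referenced again on this path. */
	up_write(&rsb->res_lock);
	queue_ast(lkb, AST_COMP | AST_DEL, 0);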

Here's a patch that fixed this problem:
--- cluster.orig/dlm-kernel/src/locking.c	2004-12-09 15:23:13.789834384 -0800
+++ cluster/dlm-kernel/src/locking.c	2004-12-09 15:24:51.809742940 -0800
@@ -687,8 +687,13 @@ void dlm_lock_stage3(struct dlm_lkb *lkb
 		lkb->lkb_retstatus = -EAGAIN;
 		if (lkb->lkb_lockqueue_flags & DLM_LKF_NOQUEUEBAST)
 			send_blocking_asts_all(rsb, lkb);
+		/*
+		 * up the res_lock before queueing ast, since the AST_DEL will
+		 * cause the rsb to be released and that can happen anytime.
+		 */
+		up_write(&rsb->res_lock);
 		queue_ast(lkb, AST_COMP | AST_DEL, 0);
-		goto out;
+		return;
 	}
 
 	/*
@@ -888,7 +893,13 @@ int dlm_unlock_stage2(struct dlm_lkb *lk
 	lkb->lkb_retstatus = flags & DLM_LKF_CANCEL ? -DLM_ECANCEL:-DLM_EUNLOCK;
 
 	if (!remote) {
+		/*
+		 * up the res_lock before queueing ast, since the AST_DEL will
+		 * cause the rsb to be released and that can happen anytime.
+		 */
+		up_write(&rsb->res_lock);
 		queue_ast(lkb, AST_COMP | AST_DEL, 0);
+		goto out2;
 	} else {
 		up_write(&rsb->res_lock);
 		release_lkb(rsb->res_ls, lkb);

This did not fix my other hang; I'll try out Patrick's
simple patch and see what happens.

Thanks,

Daniel
