[linux-lvm] clvmd leaving kernel dlm uncontrolled lockspace

Thu Jun 6 06:17:17 UTC 2013

On 05.06.13 17:13, David Teigland wrote:

> A few different topics wrapped together there:
>
> - With kill -9 clvmd (possibly combined with dlm_tool leave clvmd),
>    you can manually clear/remove a userland lockspace like clvmd.
>
> - If clvmd is blocked in the kernel in uninterruptible sleep, then
>    the kill above will not work.  To make kill work, you'd locate the
>    particular sleep in the kernel and determine if there's a way to
>    make it interruptible, and cleanly back it out.

I had clvmd instances blocked in the kernel, so how do I "locate the 
sleep and make it interruptible"?
>
> - If clvmd is blocked in the kernel for >120s, you probably want to
>    investigate what is causing that, rather than being too hasty
>    killing clvmd.
This is what the kernel logged for one of the blocked clvmd processes:
INFO: task clvmd:19766 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd           D ffff880058ec4870     0 19766      1 0x00000000
ffff880058ec4870 0000000000000282 0000000000000000 ffff8800698d9590
0000000000013740 ffff880063787fd8 ffff880063787fd8 0000000000013740
ffff880058ec4870 ffff880063786010 0000000000000001 0000000100000000
Call Trace:
[<ffffffff81367f7a>] ? rwsem_down_failed_common+0xda/0x10e
[<ffffffff811c5924>] ? call_rwsem_down_read_failed+0x14/0x30
[<ffffffff813678da>] ? down_read+0x17/0x19
[<ffffffffa059b705>] ? dlm_user_request+0x3a/0x17e [dlm]
[<ffffffffa05a40e4>] ? device_write+0x279/0x5f7 [dlm]
[<ffffffff810f7d7a>] ? __kmalloc+0x104/0x116
[<ffffffffa05a416b>] ? device_write+0x300/0x5f7 [dlm]
[<ffffffff810042c9>] ? xen_mc_flush+0x12b/0x158
[<ffffffff8117489e>] ? security_file_permission+0x18/0x2d
[<ffffffff81106dd5>] ? vfs_write+0xa4/0xff
[<ffffffff81106ee6>] ? sys_write+0x45/0x6e
[<ffffffff8136d652>] ? system_call_fastpath+0x16/0x1b

This is on kernel 3.2.35.
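
For anyone who wants to collect the same kind of information, here is a
minimal sketch for listing tasks stuck in uninterruptible sleep (state D)
and printing their kernel stacks. It assumes a procps-style `ps` and a
kernel that exposes /proc/<pid>/stack (needs root and CONFIG_STACKTRACE).
The function names are made up for illustration and are not part of any
clvmd or dlm tooling. It only shows where a task is sleeping; it does not
make the sleep interruptible.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Read `ps -eo pid,stat,wchan:32,comm` output on stdin and print
# "PID COMM WCHAN" for every task whose state starts with D.
# The first line (the ps header) is skipped.
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $4, $3 }'
}

# For each blocked task, print a header line followed by its kernel
# stack. If /proc/<pid>/stack is unreadable (not root, or the kernel
# lacks CONFIG_STACKTRACE), only the header line is printed.
dump_d_state_stacks() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state |
    while read -r pid comm wchan; do
        echo "== $comm (pid $pid), waiting in $wchan =="
        cat "/proc/$pid/stack" 2>/dev/null
    done
}
```

Running `dump_d_state_stacks` as root on an affected host should give
roughly the same picture as the hung-task trace above, where clvmd is
waiting for a read lock on an rwsem inside dlm_user_request.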

>
> - If corosync or dlm_controld are killed while dlm lockspaces exist,
>    they become "uncontrolled" and would need to be forcibly cleaned up.
>    This cleanup may be possible to implement for userland lockspaces,
>    but it's not been clear that the benefits would greatly outweigh
>    using reboot for this.

On a machine that is a Xen host with 20+ running VMs, I'd clearly prefer 
to clean up that orphaned memory space and carry on. I still have 4 Xen 
hosts to reboot, all of which provide their devices from 
clvmd-controlled (i.e. now uncontrollable) SAN space.
>
> - Killing either corosync or dlm_controld is very unlikely to help
>    anything, and more likely to cause further problems, so it should
>    be avoided as far as possible.

I understand. One reason to upgrade was that I had infrequent 
situations where the corosync 1.4.2 instances on all nodes exited 
simultaneously without logging anything. Having this happen with the new 
corosync 2.3/dlm infrastructure would leave a whole cluster with 
uncontrollable SAN space. So either the lockspace should be 
automatically reclaimed when dlm_controld finds it uncontrolled, or a 
means to clean it up manually should be available.

Regards,
Andreas
>
> Dave