[Linux-cluster] Deadlock when using clvmd + OpenAIS + Corosync

Mon Jan 25 13:28:27 UTC 2010

On Mon, Jan 25, 2010 at 8:13 AM, Evan Broder <broder at mit.edu> wrote:
> On Mon, Jan 25, 2010 at 5:32 AM, Christine Caulfield
> <ccaulfie at redhat.com> wrote:
>> On 23/01/10 17:35, Evan Broder wrote:
>>>
>>> On Fri, Jan 22, 2010 at 6:41 AM, Christine Caulfield
>>> <ccaulfie at redhat.com>  wrote:
>>>>
>>>> On 21/01/10 15:17, Evan Broder wrote:
>>>>>
>>>>> On Wed, Jan 13, 2010 at 4:59 AM, Christine Caulfield
>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>
>>>>>> On 12/01/10 16:21, Evan Broder wrote:
>>>>>>>
>>>>>>> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>
>>>>>>>> On 11/01/10 09:38, Christine Caulfield wrote:
>>>>>>>>>
>>>>>>>>> On 11/01/10 09:32, Evan Broder wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
>>>>>>>>>> <ccaulfie at redhat.com>      wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 08/01/10 22:58, Evan Broder wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> [please preserve the CC when replying, thanks]
>>>>>>>>>>>>
>>>>>>>>>>>> Hi -
>>>>>>>>>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
>>>>>>>>>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the
>>>>>>>>>>>> past
>>>>>>>>>>>> by
>>>>>>>>>>>> crashes leaving DLM state around and forcing us to reboot our
>>>>>>>>>>>> nodes,
>>>>>>>>>>>> so we're specifically looking for a solution that doesn't involve
>>>>>>>>>>>> in-kernel locking.
>>>>>>>>>>>>
>>>>>>>>>>>> We're also running the Pacemaker OpenAIS service, as we're hoping
>>>>>>>>>>>> to
>>>>>>>>>>>> use it for management of some other resources going forward.
>>>>>>>>>>>>
>>>>>>>>>>>> We've managed to form the OpenAIS cluster, and get clvmd running
>>>>>>>>>>>> on
>>>>>>>>>>>> both of our nodes. Operations using LVM succeed, so long as only
>>>>>>>>>>>> one
>>>>>>>>>>>> operation runs at a time. However, if we attempt to run two
>>>>>>>>>>>> operations
>>>>>>>>>>>> (say, one lvcreate on each host) at a time, they both hang, and
>>>>>>>>>>>> both
>>>>>>>>>>>> clvmd processes appear to deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>> When they deadlock, it doesn't appear to affect the other
>>>>>>>>>>>> clustering
>>>>>>>>>>>> processes - both corosync and pacemaker still report a fully
>>>>>>>>>>>> formed
>>>>>>>>>>>> cluster, so it seems the issue is localized to clvmd.
>>>>>>>>>>>>
>>>>>>>>>>>> I've looked at logs from corosync and pacemaker, and I've straced
>>>>>>>>>>>> various processes, but I don't want to blast a bunch of useless
>>>>>>>>>>>> information at the list. What information can I provide to make
>>>>>>>>>>>> it
>>>>>>>>>>>> easier to debug and fix this deadlock?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> To start with, the best logging to produce is the clvmd logs which
>>>>>>>>>>> can be
>>>>>>>>>>> got with clvmd -d (see the man page for details). Ideally these
>>>>>>>>>>> should be
>>>>>>>>>>> from all nodes in the cluster so they can be correlated. If you're
>>>>>>>>>>> still
>>>>>>>>>>> using DLM then a dlm lock dump from all nodes is often helpful in
>>>>>>>>>>> conjunction with the clvmd logs.
>>>>>>>>>>
>>>>>>>>>> Sure, no problem. I've posted the logs from clvmd on both processes
>>>>>>>>>> in
>>>>>>>>>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
>>>>>>>>>> few points with what I was doing - the annotations all start with "
>>>>>>>>>>>>
>>>>>>>>>>>> ", so they should be easy to spot.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ironically it looks like a bug in the clvmd-openais code. I can
>>>>>>>> reproduce
>>>>>>>> it
>>>>>>>> on my systems here. I don't see the problem when using the dlm!
>>>>>>>>
>>>>>>>> Can you try -Icorosync and see if that helps? In the meantime I'll
>>>>>>>> have
>>>>>>>> a
>>>>>>>> look at the openais bits to try and find out what is wrong.
>>>>>>>>
>>>>>>>> Chrissie
>>>>>>>>
>>>>>>>
>>>>>>> I'll see what we can pull together, but the nodes running the clvm
>>>>>>> cluster are also Xen dom0's. They're currently running on (Ubuntu
>>>>>>> Hardy's) 2.6.24, so upgrading them to something new enough to support
>>>>>>> DLM 3 would be...challenging.
>>>>>>>
>>>>>>> It would be much, much better for us if we could get clvmd-openais
>>>>>>> working.
>>>>>>>
>>>>>>> Is there any chance this would work better if we dropped back to
>>>>>>> openais whitetank instead of corosync + openais wilson?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> OK, I've found the bug and it IS in openais. The attached patch will
>>>>>> fix
>>>>>> it.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>
>>>>> Awesome. That patch fixed our problem.
>>>>>
>>>>> We are running into one other problem - performing LVM operations on
>>>>> one node is substantially slower than performing them on the other
>>>>> node:
>>>>>
>>>>> root at black-mesa:~# time lvcreate -n test -L 1G xenvg
>>>>>   Logical volume "test" created
>>>>>
>>>>> real    0m0.309s
>>>>> user    0m0.000s
>>>>> sys     0m0.008s
>>>>> root at black-mesa:~# time lvremove -f /dev/xenvg/test
>>>>>   Logical volume "test" successfully removed
>>>>>
>>>>> real    0m0.254s
>>>>> user    0m0.004s
>>>>> sys     0m0.008s
>>>>>
>>>>>
>>>>> root at torchwood-institute:~# time lvcreate -n test -L 1G xenvg
>>>>>   Logical volume "test" created
>>>>>
>>>>> real    0m7.282s
>>>>> user    0m6.396s
>>>>> sys     0m0.312s
>>>>> root at torchwood-institute:~# time lvremove -f /dev/xenvg/test
>>>>>   Logical volume "test" successfully removed
>>>>>
>>>>> real    0m7.277s
>>>>> user    0m6.420s
>>>>> sys     0m0.292s
>>>>>
>>>>> Any idea why this is happening and if there's anything we can do about
>>>>> it?
>>>>>
>>>>
>>>>
>>>> I'm not at all sure why that should be happening. I suppose the best
>>>> thing
>>>> to do would be to enable clvmd logging (clvmd -d) and see what is taking
>>>> the
>>>> time.
>>>>
>>>> Chrissie
>>>>
>>>
>>> No problem. I've collected another set of logs - they're in
>>> <http://web.mit.edu/broder/Public/clvmd-slow/>.
>>>
>>> After spinning up corosync and clvmd, the commands I ran were, in order:
>>>
>>>   root at black-mesa:~# vgchange -a y xenvg
>>>     0 logical volume(s) in volume group "xenvg" now active
>>>   root at black-mesa:~# time lvcreate -n test1 -L 1G xenvg
>>>     Logical volume "test1" created
>>>
>>>   real    0m0.685s
>>>   user    0m0.004s
>>>   sys     0m0.000s
>>>   root at black-mesa:~# time lvremove -f /dev/xenvg/test1
>>>     Logical volume "test1" successfully removed
>>>
>>>   real    0m0.235s
>>>   user    0m0.004s
>>>   sys     0m0.004s
>>>   root at torchwood-institute:~# time lvcreate -n test2 -L 1G xenvg
>>>     Logical volume "test2" created
>>>
>>>   real    0m8.007s
>>>   user    0m6.396s
>>>   sys     0m0.312s
>>>   root at torchwood-institute:~# time lvremove -f /dev/xenvg/test2
>>>     Logical volume "test2" successfully removed
>>>
>>>   real    0m7.364s
>>>   user    0m6.436s
>>>   sys     0m0.300s
>>>   root at black-mesa:~# vgchange -a n xenvg
>>>     0 logical volume(s) in volume group "xenvg" now active
>>>
>>> (black-mesa is node 1, and torchwood-institute is node 2)
>>>
>>> Thanks again for your help,
>>>
>>
>> Hiya,
>>
>> Oddly I can't find any delays in the clvmd logs at all. There are some
>> 7-second gaps in the log files but those are between commands coming from
>> the lvm command-line and not internal to clvmd itself.
>>
>> What sort of storage are you using for these LVs? The only thing I can think
>> of now is that there is some sort of delay in lvm opening the device for
>> writing as it updates the metadata. An LVM debug log might be helpful here,
>> though I'm not sure, off-hand, how to put time-stamps on that - I'm not
>> really an LVM developer any more.
>>
>> Chrissie
>>
>
> Well, uh, you see...at the moment I'm using a VM on a personal server
> running iscsitarget from the other side of MIT's campus :)
>
> But that's only because our normal RAID is in the process of being
> RMA'd. Normally it's a Fibre channel RAID sitting right next to the
> two servers, and we still had the same timing problems (with the same
> server being slower) then. The two servers are identical hardware and
> sitting right next to each other.
>
> In any case, thanks for the suggestion to look at LVM itself instead
> of clvmd - I sort of forgot my basic debugging techniques before
> asking for help on this one. LVM logs to syslog, and its logging mode
> includes timestamps.
>
> When I cranked up the debug level, I noticed that the lvcreates and
> lvremoves were hanging on "Archiving volume group "xenvg" metadata
> (seqno 48).". If I turn off the metadata archives in lvm.conf on
> torchwood-institute, its performance goes back to what I'd expect, so
> I guess it's having some sort of disk problems.
>
> Thanks again for all of our help,
>  - Evan
>

Actually, the real problem turned out to be much more straightforward:

root at torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
301369

For some reason, torchwood-institute had built up a massive archive of
files named like
/etc/lvm/archive/.lvm_torchwood-institute.mit.edu_17470_993238909.
Listing them all took...about 7 seconds.

Deleting them all put performance back where it should be.

- Evan