[Linux-cluster] problems with clvmd

Terry td3201 at gmail.com
Mon Apr 18 19:46:15 UTC 2011


On Mon, Apr 18, 2011 at 2:17 PM, Terry <td3201 at gmail.com> wrote:
> On Mon, Apr 18, 2011 at 9:49 AM, Terry <td3201 at gmail.com> wrote:
>> On Mon, Apr 18, 2011 at 9:26 AM, Christine Caulfield
>> <ccaulfie at redhat.com> wrote:
>>> On 18/04/11 15:11, Terry wrote:
>>>>
>>>> On Mon, Apr 18, 2011 at 8:57 AM, Christine Caulfield
>>>> <ccaulfie at redhat.com>  wrote:
>>>>>
>>>>> On 18/04/11 14:38, Terry wrote:
>>>>>>
>>>>>> On Mon, Apr 18, 2011 at 3:48 AM, Christine Caulfield
>>>>>> <ccaulfie at redhat.com>    wrote:
>>>>>>>
>>>>>>> On 17/04/11 21:52, Terry wrote:
>>>>>>>>
>>>>>>>> As a result of a strange situation where our storage licensing
>>>>>>>> lapsed, I need to join a CentOS 5.6 node to a now single-node
>>>>>>>> cluster.  I got it joined to the cluster, but I am having issues with
>>>>>>>> clvmd.  Any LVM operation on either box hangs (vgscan, for example).
>>>>>>>> I have increased debugging and I don't see any logs.  The VGs aren't
>>>>>>>> being populated in /dev/mapper.  This WAS working right after I joined
>>>>>>>> the node to the cluster, and now it isn't, for some unknown reason.
>>>>>>>> I'm not sure where to take this at this point.  I did find one odd
>>>>>>>> startup message that I don't understand yet:
>>>>>>>> [root at omadvnfs01a ~]# dmesg | grep dlm
>>>>>>>> dlm: no local IP address has been set
>>>>>>>> dlm: cannot start dlm lowcomms -107
>>>>>>>> dlm: Using TCP for communications
>>>>>>>> dlm: connecting to 2
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That message usually means that dlm_controld has failed to start. Try
>>>>>>> starting the cman daemons (groupd, dlm_controld) manually with the -D
>>>>>>> switch and read the output, which might give some clues as to why it's
>>>>>>> not working.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Chrissie,
>>>>>>
>>>>>> I thought of that, but I can see dlm_controld running on both nodes.  See right below.
>>>>>>
>>>>>>>> [root at omadvnfs01a ~]# ps xauwwww | grep dlm
>>>>>>>> root      5476  0.0  0.0  24736   760 ?        Ss   15:34   0:00
>>>>>>>> /sbin/dlm_controld
>>>>>>>> root      5502  0.0  0.0      0     0 ?        S<        15:34   0:00
>>>>>
>>>>>
>>>>> Well, that's encouraging in a way! But it's evidently not started fully,
>>>>> or the DLM itself would be working. So I still recommend starting it
>>>>> with -D to see how far it gets.
>>>>>
>>>>>
>>>>> Chrissie
>>>>>
>>>>
>>>> I think our posts crossed.  Here's my latest:
>>>>
>>>> OK, I started all the cman elements manually as you suggested, in the
>>>> same order as the init script.  Here's the only error that I see.  I can
>>>> post the other debug messages if you think they'd be useful, but this is
>>>> the only one that stuck out to me.
>>>>
>>>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
>>>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>>>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>>>> 1303134840 set_ccs_options 480
>>>> 1303134840 cman: node 2 added
>>>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>>>> 1303134840 cman: node 3 added
>>>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>>>
>>>
>>> Can I see the whole set, please?  It looks like dlm_controld might be stalled
>>> registering with groupd.
>>>
>>> Chrissie
>>>
>>
>> Here you go, and thank you very much for the help.  The output from each
>> daemon I started is below.
>>
>> [root at omadvnfs01a log]# /sbin/ccsd -n
>> Starting ccsd 2.0.115:
>>  Built: Mar  6 2011 00:47:03
>>  Copyright (C) Red Hat, Inc.  2004  All rights reserved.
>>  No Daemon:: SET
>>
>> cluster.conf (cluster name = omadvnfs01, version = 71) found.
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Remote copy of cluster.conf is from quorate node.
>>  Local version # : 71
>>  Remote version #: 71
>> Initial status:: Quorate
>>
>> [root at omadvnfs01a ~]# /sbin/fenced -D
>> 1303134822 cman: node 2 added
>> 1303134822 cman: node 3 added
>> 1303134822 our_nodeid 3 our_name omadvnfs01a.sec.jel.lc
>> 1303134822 listen 4 member 5 groupd 7
>> 1303134861 client 3: join default
>> 1303134861 delay post_join 3s post_fail 0s
>> 1303134861 added 2 nodes from ccs
>> 1303134861 setid default 65537
>> 1303134861 start default 1 members 2 3
>> 1303134861 do_recovery stop 0 start 1 finish 0
>> 1303134861 finish default 1
>>
>> [root at omadvnfs01a ~]# /sbin/dlm_controld -D
>> 1303134840 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
>> 1303134840 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
>> 1303134840 set_ccs_options 480
>> 1303134840 cman: node 2 added
>> 1303134840 set_configfs_node 2 10.198.1.111 local 0
>> 1303134840 cman: node 3 added
>> 1303134840 set_configfs_node 3 10.198.1.110 local 1
>>
>>
>> [root at omadvnfs01a ~]# /sbin/groupd -D
>> 1303134809 cman: our nodeid 3 name omadvnfs01a.sec.jel.lc quorum 1
>> 1303134809 setup_cpg groupd_handle 6b8b456700000000
>> 1303134809 groupd confchg total 2 left 0 joined 1
>> 1303134809 send_version nodeid 3 cluster 2 mode 2 compat 1
>> 1303134822 client connection 3
>> 1303134822 got client 3 setup
>> 1303134822 setup fence 0
>> 1303134840 client connection 4
>> 1303134840 got client 4 setup
>> 1303134840 setup dlm 1
>> 1303134853 client connection 5
>> 1303134853 got client 5 setup
>> 1303134853 setup gfs 2
>> 1303134861 got client 3 join
>> 1303134861 0:default got join
>> 1303134861 0:default is cpg client 6 name 0_default handle 6633487300000001
>> 1303134861 0:default cpg_join ok
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 client connection 7
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 got client 7 get_group
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 0:default waiting for first cpg event
>> 1303134861 0:default confchg left 0 joined 1 total 2
>> 1303134861 0:default process_node_join 3
>> 1303134861 0:default cpg add node 2 total 1
>> 1303134861 0:default cpg add node 3 total 2
>> 1303134861 0:default make_event_id 300020001 nodeid 3 memb_count 2 type 1
>> 1303134861 0:default queue join event for nodeid 3
>> 1303134861 0:default process_current_event 300020001 3 JOIN_BEGIN
>> 1303134861 0:default app node init: add 3 total 1
>> 1303134861 0:default app node init: add 2 total 2
>> 1303134861 0:default waiting for 1 more stopped messages before JOIN_ALL_STOPPED 3
>> 1303134861 0:default mark node 2 stopped
>> 1303134861 0:default set global_id 10001 from 2
>> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STOPPED
>> 1303134861 0:default action for app: setid default 65537
>> 1303134861 0:default action for app: start default 1 2 2 2 3
>> 1303134861 client connection 7
>> 1303134861 got client 7 get_group
>> 1303134861 0:default mark node 2 started
>> 1303134861 client connection 7
>> 1303134861 got client 7 get_group
>> 1303134861 got client 3 start_done
>> 1303134861 0:default send started
>> 1303134861 0:default mark node 3 started
>> 1303134861 0:default process_current_event 300020001 3 JOIN_ALL_STARTED
>> 1303134861 0:default action for app: finish default 1
>> 1303134862 client connection 7
>> 1303134862 got client 7 get_group
>>
>>
>> [root at omadvnfs01a ~]# /sbin/gfs_controld -D
>> 1303134853 config_no_withdraw 0
>> 1303134853 config_no_plock 0
>> 1303134853 config_plock_rate_limit 100
>> 1303134853 config_plock_ownership 0
>> 1303134853 config_drop_resources_time 10000
>> 1303134853 config_drop_resources_count 10
>> 1303134853 config_drop_resources_age 10000
>> 1303134853 protocol 1.0.0
>> 1303134853 listen 3
>> 1303134853 cpg 6
>> 1303134853 groupd 7
>> 1303134853 uevent 8
>> 1303134853 plocks 10
>> 1303134853 plock need_fsid_translation 1
>> 1303134853 plock cpg message size: 336 bytes
>> 1303134853 setup done
>>
>
> Another gap I just found: I forgot to specify a fencing method for the
> new CentOS node.  I have put that in, and now the RHEL node wants to
> fence it, so I am letting it do that and then I'll see where I end up.
>

The node came up with no problems, and I then started the services manually:
service cman start
service clvmd start  (note that I commented out the vgscan in that init
script, because otherwise it times out)
service rgmanager start
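
For reference, the usual sanity checks at this point would be something
like the following; this is just a sketch, assuming the stock RHEL/CentOS 5
cman tools, and it only reads state:

cman_tool status    # cluster name, quorum state, expected/total votes
cman_tool nodes     # membership list and join status of each node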

The node enters the cluster and everything looks fine, but no clustered
LVM devices show up.  The other node does see the dlm connection from the
CentOS node:
Apr 18 14:37:06 omadvnfs01b kernel: dlm: got connection from 3
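
dmsetup talks to device-mapper directly and doesn't go through clvmd's
locking, so it shouldn't hang; a minimal way to see what is (or isn't)
actually present on the CentOS node:

dmsetup ls          # lists all device-mapper devices, clustered LVs included
ls /dev/mapper      # VG-LV nodes appear here once the LVs are activated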

On a hunch, I tried this on the RHEL node:
[root at omadvnfs01b ~]# clvmd -R
Error resetting node omadvnfs01b.sec.jel.lc: Command timed out
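
When clvmd -R times out like this, the next thing I'd normally look at is
the groupd/dlm state on both nodes; assuming the standard RHEL/CentOS 5
group tools, something like:

group_tool ls       # fence, dlm and gfs groups and their state; the "clvmd"
                    # dlm lockspace should be listed and not stuck mid-transition
group_tool dump     # dumps groupd's internal debug buffer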

I think the RHEL node is broken, even though the services on it still work.
I am OK with stopping all services, but I'm not sure how to get the
clustered devices working on the new CentOS node.  I have every intention
of reformatting the RHEL node, but I need to understand what I am getting
into before I start shutting things down on it.  How can I forcefully make
the CentOS node aware of the existing LVM configuration?
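
Assuming the dlm and clvmd are actually healthy, my understanding is that
this is just the standard LVM steps the commented-out vgscan in the init
script would have done, plus making sure the node is set up for cluster
locking; a sketch:

grep locking_type /etc/lvm/lvm.conf   # should be 3 (clvmd cluster locking);
                                      # lvmconf --enable-cluster sets this
vgscan                                # re-read VG metadata from the PVs
vgchange -aly                         # activate LVs locally; clvmd and the
                                      # dlm coordinate locking across nodes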

Thanks!



