[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Linux-cluster] Cluster 3.0.0.rc3 release

On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager
And it also leads to:

dlm_controld[14981]: fenced_domain_info error -1 

so it's not possible to get the node back without rebooting.
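For reference, here is how the "uncontrolled entry" fenced complains about can be inspected by hand. This is a sketch, not from the original thread: the kernel dlm exports one directory per lockspace under /sys/kernel/dlm, and a lockspace left over from an unclean shutdown (here: rgmanager) is what blocks a restart without a reboot.

```shell
# The kernel dlm keeps one sysfs directory per active lockspace; anything
# still listed here after the daemons were stopped is a leftover that
# fenced/dlm_controld will refuse to adopt on restart.
dlm_dir=/sys/kernel/dlm
if [ -d "$dlm_dir" ]; then
    leftovers=$(ls "$dlm_dir")
    echo "leftover lockspaces: ${leftovers:-none}"
else
    # on a machine without the dlm module loaded the directory is absent
    echo "kernel dlm not loaded, nothing to check"
fi
```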

> It looks to me the node has not been shutdown properly and an attempt to
> restart it did fail. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
> > 
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> > 
> > We're also seeing:
> > 
> > Jun 29 19:29:03 node2 kernel: [   50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [   50.150035] dlm: cannot start dlm lowcomms -107
> hmm this looks like a bad configuration to me or bad startup.
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.
It is. 
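For anyone else hitting the lowcomms -107 error: a quick way to verify that configfs is both supported and mounted before cman starts (a minimal sketch; the standard mount point used by the init scripts is /sys/kernel/config):

```shell
# dlm_controld configures the kernel dlm through configfs; if it is not
# mounted, dlm lowcomms never gets a local IP address and fails with -107.
configfs_supported=no
grep -q configfs /proc/filesystems && configfs_supported=yes
echo "configfs supported: $configfs_supported"

configfs_mounted=no
grep -q ' configfs ' /proc/mounts && configfs_mounted=yes
echo "configfs mounted: $configfs_mounted"

if [ "$configfs_supported" = yes ] && [ "$configfs_mounted" = no ]; then
    echo "hint: mount -t configfs none /sys/kernel/config"
fi
```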

> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> > 
> > Any ideas what causes this?
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.
Tested that without any notable change.

> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
> The usual extra things I'd check are:
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes
These all look OK. However:

> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.
fence_node node1

gives a segfault at the same location as described above, which seems
to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
-a iloip" works as expected.)
The segfault happens in fence/libfence/agent.c's make_args, where the
second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL)
str. Doing this XPath lookup by hand looks fine, so it seems
ccs_get_list is returning corrupted pointers. I've attached the current
cluster.conf.
 -- Guido

<?xml version="1.0"?>
<cluster config_version="5" name="cl">
  <cman two_node="1" expected_votes="2"/>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
    <fencedevice agent="fence_ilo" hostname="rnode2.foo.bar" login="reboot" name="node2" passwd="pass"/>
  </fencedevices>

  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <virt name="test11"/>
      <virt name="test12"/>
    </resources>
    <service name="test11">
      <virt ref="test11"/>
    </service>
    <service name="test12">
      <virt ref="test12"/>
    </service>
  </rm>
</cluster>
