[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Linux-cluster] Cluster 3.0.0.rc3 release

On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager
And it also leads to:

dlm_controld[14981]: fenced_domain_info error -1 

so it's not possible to get the node back without rebooting.
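For reference, here is how the "uncontrolled entry" fenced complains about can be inspected by hand. This is a sketch, not from the original thread: the kernel dlm exports one directory per lockspace under /sys/kernel/dlm, and a lockspace left over from an unclean shutdown (here: rgmanager) is what blocks a restart without a reboot.

```shell
# The kernel dlm keeps one sysfs directory per active lockspace; anything
# still listed here after the daemons were stopped is a leftover that
# fenced/dlm_controld will refuse to adopt on restart.
dlm_dir=/sys/kernel/dlm
if [ -d "$dlm_dir" ]; then
    leftovers=$(ls "$dlm_dir")
    echo "leftover lockspaces: ${leftovers:-none}"
else
    # on a machine without the dlm module loaded the directory is absent
    echo "kernel dlm not loaded, nothing to check"
fi
```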

> It looks to me the node has not been shutdown properly and an attempt to
> restart it did fail. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
> > 
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> > 
> > We're also seeing:
> > 
> > Jun 29 19:29:03 node2 kernel: [   50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [   50.150035] dlm: cannot start dlm lowcomms -107
> hmm this looks like a bad configuration to me or bad startup.
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.
It is. 
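For anyone else hitting the lowcomms -107 error: a quick way to verify that configfs is both supported and mounted before cman starts (a minimal sketch; the standard mount point used by the init scripts is /sys/kernel/config):

```shell
# dlm_controld configures the kernel dlm through configfs; if it is not
# mounted, dlm lowcomms never gets a local IP address and fails with -107.
configfs_supported=no
grep -q configfs /proc/filesystems && configfs_supported=yes
echo "configfs supported: $configfs_supported"

configfs_mounted=no
grep -q ' configfs ' /proc/mounts && configfs_mounted=yes
echo "configfs mounted: $configfs_mounted"

if [ "$configfs_supported" = yes ] && [ "$configfs_mounted" = no ]; then
    echo "hint: mount -t configfs none /sys/kernel/config"
fi
```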

> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> > 
> > Any ideas what causes this?
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.
Tested that without any notable change.

> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
> The usual extra things I'd check are:
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes
These all look OK. However:

> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.
fence_node node1

gives a segfault at the same location as described above, which seems
to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
-a iloip" works as expected.)
The segfault happens in fence/libfence/agent.c's make_args, where the
second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL)
str. Doing this XPath lookup by hand looks fine, so it seems
ccs_get_list is returning corrupted pointers. I've attached the current
cluster.conf.
 -- Guido

<?xml version="1.0"?>
<cluster config_version="5" name="cl">
  <cman two_node="1" expected_votes="2"/>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
    <fencedevice agent="fence_ilo" hostname="rnode2.foo.bar" login="reboot" name="node2" passwd="pass"/>
  </fencedevices>

  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <virt name="test11"/>
      <virt name="test12"/>
    </resources>
    <service name="test11">
      <virt ref="test11"/>
    </service>
    <service name="test12">
      <virt ref="test12"/>
    </service>
  </rm>
</cluster>
