[Linux-cluster] Cluster 3.0.0.rc3 release

Guido Günther agx at sigxcpu.org
Wed Jul 1 11:57:25 UTC 2009


On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager
And it also leads to:

dlm_controld[14981]: fenced_domain_info error -1 

so it's not possible to get the node back without rebooting.
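
(For what it's worth, the leftover lockspace from the previous run is
still visible, i.e.

  ls /sys/kernel/dlm/

still lists rgmanager, which matches the "uncontrolled entry" fenced
complains about above.)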

> It looks to me like the node was not shut down properly and an attempt
> to restart it failed. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
> 
> > 
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> > 
> > We're also seeing:
> > 
> > Jun 29 19:29:03 node2 kernel: [   50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [   50.150035] dlm: cannot start dlm lowcomms -107
> 
> hmm this looks like a bad configuration to me or bad startup.
> 
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.
It is. 
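
(For the configfs side, a quick sanity check is something like:

  grep configfs /proc/mounts
  ls /sys/kernel/config/dlm/cluster/comms/

The mount is fine here; if I read it right, the "no local IP address"
message means dlm_controld had not (yet) written the local comm entry
into configfs when the lockspace was started.)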

> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> > 
> > Any ideas what causes this?
> 
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.
Tested that; it made no notable difference.

> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
> 
> The usual extra things I'd check are:
> 
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes
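(The hostname check on Debian/Ubuntu mostly means making sure /etc/hosts
maps the node name to the cluster interface address and not to 127.0.1.1,
e.g. a line along the lines of

  192.168.1.2   node2.foo.bar node2

with the address made up here.)
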
These all do look o.k. However:

> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.
fence_node node1

gives the segfault at the same location as described above, which seems
to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
-a iloip" works as expected.)
The segfault happens in make_args in fence/libfence/agent.c, where the
second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL)
string. Doing this XPath lookup by hand looks fine, so it seems
ccs_get_list is returning corrupted pointers. I've attached the current
cluster.conf.
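A throwaway test along these lines might help to see whether
ccs_get_list hands back sane strings outside of libfence at all (the
query is just an example, not the actual FENCE_DEVICE_ARGS_PATH):

/* minimal ccs_get_list test against libccs from cluster 3.x */
#include <stdio.h>
#include <stdlib.h>
#include <ccs.h>

int main(void)
{
	int cd, err;
	char *str = NULL;
	/* example query only, not the real FENCE_DEVICE_ARGS_PATH */
	char query[] = "/cluster/clusternodes/clusternode/@name";

	cd = ccs_connect();
	if (cd < 0) {
		fprintf(stderr, "ccs_connect failed: %d\n", cd);
		return 1;
	}

	/* iterate over all matches; libccs hands back malloc'ed strings */
	for (;;) {
		err = ccs_get_list(cd, query, &str);
		if (err || !str)
			break;
		printf("got: %s\n", str);
		free(str);
		str = NULL;
	}

	ccs_disconnect(cd);
	return 0;
}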
Cheers,
 -- Guido

-------------- next part --------------
<?xml version="1.0"?>
<cluster config_version="5" name="cl">
  <cman two_node="1" expected_votes="2">
  </cman>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
    <fencedevice agent="fence_ilo" hostname="rnode2.foo.bar" login="reboot" name="node2" passwd="pass"/>
  </fencedevices>

  <rm log_level="7">
   <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
   </failoverdomains>
   <resources>
       <virt name="test11" />
       <virt name="test12" />
   </resources>
   <service name="test11">
        <virt ref="test11"/>
   </service>
   <service name="test12">
        <virt ref="test12"/>
   </service>
  </rm>
</cluster>

