[Linux-cluster] Cluster 3.0.0.rc3 release

Fabio M. Di Nitto fdinitto at redhat.com
Wed Jul 1 13:23:56 UTC 2009


Hi Guido,

On Wed, 2009-07-01 at 13:57 +0200, Guido Günther wrote:

> > - Before starting any kind of service, such as rgmanager or gfs*, make
> > sure that the fencing configuration is correct. Test by using fence_node
> > $nodename.
> fence_node node1
> 
> gives the segfaults at the same location as described above which seems
> to be the cause of the trouble. (However "fence_ilo -z -l user -p pass
> -a iloip" works as expected).
> The segfault happens in fence/libfence/agent.c's make_args where the
> second XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non NULL)
> str. Doing this xpath lookup by hand looks fine. So it seems
> ccs_get_list is returning corrupted pointers. I've attached the current
> cluster.conf.

I am having trouble reproducing this problem and I'll need your help.

First of all I replicated your configuration:

<?xml version="1.0"?>
<cluster name="fabbione" config_version="1" alias="fabbione">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" votes="1" nodeid="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" votes="1" nodeid="4">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="node1" agent="fence_virsh" port="fedora-rh-node1"
ipaddr="daikengo.int.fabbione.net" login="root" secure="1"
identity_file="/root/.ssh/id_rsa"/>
    <fencedevice name="node2" agent="fence_virsh" port="fedora-rh-node4"
ipaddr="daikengo.int.fabbione.net" login="root" secure="1"
identity_file="/root/.ssh/id_rsa"/>
  </fencedevices>
</cluster>

As you can see, node names and fencing methods are the same.

I don't have ilo but it shouldn't matter.

Now my question is: did you manually mangle the configuration you sent
me? There is no matching entry between the device each node is set to
use and the fencedevices section, and I get:

[root@node2]# fence_node -vv node1
fence node1 dev 0.0 agent none result: error config agent
agent args: 
fence node1 failed

Now if I change device name="fenceX" to name="nodeX", there is a match
and:

[root@node2 cluster]# fence_node -vv node1
fence node1 dev 0.0 agent fence_virsh result: success
agent args: agent=fence_virsh port=fedora-rh-node1
ipaddr=daikengo.int.fabbione.net login=root secure=1
identity_file=/root/.ssh/id_rsa 
fence node1 success

and I still don't see the segfault...
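
The name mismatch above can also be checked mechanically. Here is a
minimal sketch in Python (not part of the cluster tools; the function
name unmatched_fence_devices and the trimmed SAMPLE config are made up
for illustration) that lists every device referenced by a node that has
no fencedevice entry of the same name:

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative copy of the configuration discussed above.
SAMPLE = """
<cluster name="fabbione" config_version="1">
  <clusternodes>
    <clusternode name="node1.foo.bar" votes="1" nodeid="1">
      <fence><method name="1"><device name="fence1"/></method></fence>
    </clusternode>
    <clusternode name="node2.foo.bar" votes="1" nodeid="4">
      <fence><method name="1"><device name="fence2"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="node1" agent="fence_virsh"/>
    <fencedevice name="node2" agent="fence_virsh"/>
  </fencedevices>
</cluster>
"""


def unmatched_fence_devices(root):
    """Return (node name, device name) pairs with no matching <fencedevice>."""
    # Names declared in the <fencedevices> section.
    declared = {fd.get("name")
                for fd in root.findall("./fencedevices/fencedevice")}
    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        # Every device referenced by any fence method of this node.
        for dev in node.findall("./fence/method/device"):
            if dev.get("name") not in declared:
                missing.append((node.get("name"), dev.get("name")))
    return missing


# Print one line per dangling device reference in the sample.
for node_name, dev_name in unmatched_fence_devices(ET.fromstring(SAMPLE)):
    print("%s: device '%s' has no matching fencedevice" % (node_name, dev_name))
```

To run it against a real file, replace ET.fromstring(SAMPLE) with
ET.parse(path).getroot(), where path points at your cluster.conf. An
empty result only means the names match; it says nothing about the
segfault itself.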

Since you can reproduce the problem regularly, I'd really like to see
some debugging output from libfence to start with. I'd really appreciate
it if you could help us.

test 1:

Please add a bunch of fprintf(stderr, ...) calls to agent.c to print the
XPath queries being created and the results coming back from libccs.

Please collect the output and send it to me.

test 2:

Please find this line:

cd = ccs_connect(); (line 287 in agent.c)

and right before it add:

fullxpath=1;

That change will ask libccs to use a different XPath engine internally.

Then re-run test 1.

This should pretty much isolate the problem and give me enough
information to debug the issue.

The next question is: are you running on some fancy architecture? Maybe
something in that environment is not initialized properly (the garbage
string you get back from libccs sounds like that), while on more common
arches like x86/x86_64 gcc takes care of that for us... (really wild
guessing, but still something to fix!).

Thanks
Fabio



