[Linux-cluster] Cluster 3.0.0.rc3 release

Mon Jun 29 20:10:00 UTC 2009

Hi Guido,

On Mon, 2009-06-29 at 20:48 +0200, Guido Günther wrote:
> Hi Fabione,
> Thanks for rolling this rc candidate!
> 
> On Sat, Jun 20, 2009 at 01:19:49PM +0200, Fabio M. Di Nitto wrote:
> [..snip..] 
> > In order to build the 3.0.0.rc3 release you will need:
> > 
> > - corosync 0.98
> > - openais 0.97
> We used these without any patches.
> 
> > - linux kernel 2.6.29
> We were running against 2.6.30.

Shouldn't be a problem. You simply won't be able to build or use gfs1.

> 
> We observed these issues:
> 
> fenced segfaults with:
> 
> (gdb) bt
> #0  0x00007f8e293508fe in fence_node (victim=0x114b510 "node1.foo.bar", log=0x61e0a0, log_size=32, log_count=0x7fff2e46a634) at /var/home/schmitz/3/redhat-cluster/fence/libfence/agent.c:156
> #1  0x000000000040c5cd in fence_victims (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/recover.c:319
> #2  0x0000000000405f27 in apply_changes (fd=0x114f270) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1056
> #3  0x00007f8e2914bcc1 in cpg_dispatch () from /usr/lib/libcpg.so.4
> #4  0x0000000000404588 in process_fd_cpg (ci=4) at /var/home/schmitz/3/redhat-cluster/fence/fenced/cpg.c:1351
> #5  0x000000000040b0f7 in main (argc=<value optimized out>, argv=<value optimized out>) at /var/home/schmitz/3/redhat-cluster/fence/fenced/main.c:818
> 
> this leads to
> 
> 1246297857 fenced 3.0.0.rc3 started
> 1246297857 our_nodeid 1 our_name node2.foo.bar
> 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager

It looks to me as though the node was not shut down properly and the
attempt to restart it failed. The fenced segfault shouldn't happen,
though, so I am CC'ing David. Maybe he has a better idea.

> 
> when trying to restart fenced. Since this is not possible one has to
> reboot the node.
> 
> We're also seeing:
> 
> Jun 29 19:29:03 node2 kernel: [   50.149855] dlm: no local IP address has been set
> Jun 29 19:29:03 node2 kernel: [   50.150035] dlm: cannot start dlm lowcomms -107

Hmm, this looks like a bad configuration or a bad startup to me.

IIRC the dlm kernel module is configured via configfs, and configfs was
probably not mounted by the init script.
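
A quick way to test that theory is to look for a configfs entry in
/proc/mounts right after the error shows up. What follows is only a
rough sketch, not something taken from the init scripts; the function
name configfs_mountpoints is made up for illustration.

```python
def configfs_mountpoints(mounts_text):
    """Return the mount points of all configfs entries in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "configfs":
            points.append(fields[1])
    return points


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        found = configfs_mountpoints(f.read())
    if found:
        print("configfs mounted at:", ", ".join(found))
    else:
        print("configfs is not mounted")
```

If it reports that configfs is not mounted at the time dlm complains,
that would point at the init script ordering rather than at dlm itself.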

> 
> from time to time. Stopping/starting via cman's init script (as from the
> Ubuntu package) several times makes this go away.
> 
> Any ideas what causes this?

Could you please try our upstream init scripts? They work just fine
(unchanged) in Ubuntu/Debian environments and they are certainly a lot
more robust than the ones I originally wrote for Ubuntu many years ago.

Could you also please summarize your setup and config? I assume you did
the normal checks such as cman_tool status, cman_tool nodes and so on...

The usual extra things I'd check are:

- make sure the hostname doesn't resolve to localhost but to the real IP
address of the cluster interface
- cman_tool status
- cman_tool nodes
- Before starting any kind of service, such as rgmanager or gfs*, make
sure that the fencing configuration is correct. Test by using fence_node
$nodename.
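
The hostname check in the list above can be scripted. This is a minimal
sketch under my own assumptions: the helper names resolve and
loopback_addresses are hypothetical, and it only flags loopback
addresses (Debian/Ubuntu installs often map the hostname to 127.0.1.1
in /etc/hosts). It does not verify that the address belongs to the
cluster interface.

```python
import ipaddress
import socket


def loopback_addresses(addresses):
    """Return the subset of the given IP address strings that are loopback."""
    result = []
    for addr in addresses:
        # Drop any IPv6 scope suffix such as "%eth0" before parsing.
        bare = addr.split("%", 1)[0]
        if ipaddress.ip_address(bare).is_loopback:
            result.append(addr)
    return result


def resolve(hostname):
    """Return the sorted, de-duplicated addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    name = socket.gethostname()
    addrs = resolve(name)
    bad = loopback_addresses(addrs)
    print(name, "resolves to", ", ".join(addrs))
    if bad:
        print("WARNING: loopback address(es):", ", ".join(bad))
```

Run it on each node; any warning means the node name should be fixed in
/etc/hosts or DNS before starting cman.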

Cheers
Fabio