[Linux-cluster] Service not relocated after successful fence of its owner

Maykel Moya moya at latertulia.org
Sun Apr 26 06:19:47 UTC 2009


On Fri, 2009-04-24 at 13:41 -0400, Lon Hohberger wrote:

> On Wed, 2009-04-22 at 01:40 -0400, Maykel Moya wrote:
> > I still can't get my service automatically relocated after
> > _successfully_ fencing its owner node.
> > 
> > I have a 4-node cluster n{1,2,3,4} and 4 services s{1,2,3,4}. My fence
> > device uses 'off' as its action, so a successful fence means the node is
> > off.
> > 
> > Say s4 is running on n4 and I do an 'ip link set eth0 down' on n4. n4 gets
> > successfully fenced, but s4 is never relocated to one of the other
> > available nodes, which means s4 is not available.
> > 
> > Find attached the cluster.conf.
> 
> Conf looks okay, what do the logs say?  Any other errors?  It looks like
> things should be working correctly.
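
Sure. For completeness, this is roughly how I trigger the failure and what I
check afterwards; interface and node names are from my setup:

----
# On the victim node (n4/e1b04 in this test): drop its cluster interface.
ip link set eth0 down

# On a surviving node, once fenced reports success:
clustat                # member and service view
cman_tool nodes        # membership as cman sees it
cman_tool services     # fence and rgmanager groups
----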

The relevant part of the log:
----
Apr 26 02:08:29 e1b01 kernel: [345031.041719] dlm: closing connection to
node 4
Apr 26 02:08:29 e1b01 clurgmgrd[3880]: <debug> Membership Change Event
Apr 26 02:08:29 e1b01 clurgmgrd[3880]: <info> State change: e1b04 DOWN
Apr 26 02:08:29 e1b01 clurgmgrd[3880]: <debug> Membership Change Event
Apr 26 02:08:29 e1b01 clurgmgrd[3880]: <debug> Membership Change Event
Apr 26 02:08:29 e1b01 clurgmgrd[3880]: <debug> Membership Change Event
Apr 26 02:08:29 e1b01 fenced[3850]: e1b04 not a cluster member after 0
sec post_fail_delay
Apr 26 02:08:29 e1b01 fenced[3850]: fencing node "e1b04"
Apr 26 02:08:40 e1b01 fenced[3850]: can't get node number for node
�ҋ#010Pҋ#010#020
Apr 26 02:08:40 e1b01 fenced[3850]: fence "e1b04" success
----
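
In case it helps, this is roughly how I look through the rest of syslog on
e1b01 for any sign of clurgmgrd trying to recover the services that were
running on e1b04; the service names are from my cluster.conf:

----
grep clurgmgrd /var/log/syslog | grep -i 'e1b04\|vmail2_svc\|vmail4_svc'
----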

> 'cman_tool services' and 'cman_tool nodes' output would be helpful,
> too.  

It's a bit odd: clustat says that node e1b04 is offline, but service4 is
still shown as owned by e1b04 and started.

e1b01:/var/log# clustat 
Cluster Status for cinfomed @ Sun Apr 26 02:09:07 2009
Member Status: Quorate

 Member Name               ID   Status
 ------ ----               ---- ------
 e1b01                         1 Online, Local, rgmanager
 e1b02                         2 Online, rgmanager
 e1b03                         3 Online, rgmanager
 e1b04                         4 Offline

 Service Name              Owner (Last)             State         
 ------- ----              ----- ------             -----         
 service:vmail1_svc        e1b01                    started       
 service:vmail2_svc        e1b04                    started       
 service:vmail3_svc        e1b03                    started       
 service:vmail4_svc        e1b04                    started       

e1b01:/var/log# cman_tool services
type             level name       id       state       
fence            0     default    00010001 none        
[1 2 3]
dlm              1     rgmanager  00010004 none        
[1 2 3]

e1b01:/var/log# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M   1404   2009-04-22 02:17:09  e1b01
   2   M   1432   2009-04-22 02:51:31  e1b02
   3   M   1412   2009-04-22 02:17:11  e1b03
   4   X   1408                        e1b04
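
I assume I could push the stuck services onto live members by hand with
clusvcadm, something like the commands below, but that of course defeats the
point of automatic failover:

----
# Hypothetical manual relocation; see clusvcadm(8) for the exact syntax.
clusvcadm -r vmail2_svc -m e1b01
clusvcadm -r vmail4_svc -m e1b03
----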


Forgot to mention the software versions:

e1b01:/var/log# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 5.0.1 (lenny)
Release:	5.0.1
Codename:	lenny

e1b01:/var/log# cman_tool -V
cman_tool 2.03.09 (built Nov  3 2008 18:22:25)
Copyright (C) Red Hat, Inc.  2004-2008  All rights reserved.

e1b01:/var/log# uname -r
2.6.26-2-686
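
If more state would help, I can also send the output of:

----
cman_tool status       # quorum and vote summary
group_tool ls          # state of the fence/dlm groups
----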

----
This is the only thing keeping me from deploying. I have tried fencing with
'reboot' and with 'off', and setting the service recovery policy to
'relocate', but nothing solves it. If a node goes down, its service is not
migrated after the node is fenced.
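
For the record, this is how I double-check which recovery policy actually
ends up on each service (rg_test ships with rgmanager, if I read its usage
right):

----
# Quick check of the recovery= attribute as written in the config.
grep 'recovery=' /etc/cluster/cluster.conf

# Let rgmanager itself parse the config and print the resource tree.
rg_test test /etc/cluster/cluster.conf
----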

Regards,
maykel




