[Linux-cluster] Cluster fails after fencing by DRAC

Jorge Gonzalez jorge.gonzalez at degesys.com
Thu Jan 10 16:18:21 UTC 2008


Hi all!

I have a problem with a 3-node cluster. When I run "fence_node node1", 
node1 is rebooted by the DRAC successfully. But when node1 restarts, it freezes:

------------------
starting clvmd: dlm: got connection from 32
dlm: connecting to 33
dlm: got connection from 33
[frozen]

* cman_tool services shows:
type             level name       id       state      
fence            0     default    0001001f none       
[31 32 33]
dlm              1     clvmd      00010020 none       
[31 32 33]
dlm              1     rgmanager  00020020 none       
[32 33]

It seems the rgmanager group does not include node 31 (?)
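
To make that comparison less error-prone, here is a minimal Python sketch that parses text in the format pasted above and reports which node IDs are missing from each group. The helper names (parse_services, missing_members) are hypothetical and not part of cman. The parser assumes each group line has five columns and is followed by a bracketed member list, as in the paste above.

```python
import re

# Pasted `cman_tool services` output from the message above.
SAMPLE = """\
type             level name       id       state
fence            0     default    0001001f none
[31 32 33]
dlm              1     clvmd      00010020 none
[31 32 33]
dlm              1     rgmanager  00020020 none
[32 33]
"""

def parse_services(text):
    """Return one dict per group found in pasted `cman_tool services` text."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the header row
        m = re.fullmatch(r"\[([\d\s]*)\]", line)
        if m:
            # A bracketed line lists the members of the most recent group.
            if groups:
                groups[-1]["members"] = {int(n) for n in m.group(1).split()}
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": set()})
    return groups

def missing_members(groups, expected):
    """Map group name -> sorted node IDs from `expected` that the group lacks."""
    expected = set(expected)
    return {g["name"]: sorted(expected - g["members"])
            for g in groups if expected - g["members"]}

if __name__ == "__main__":
    print(missing_members(parse_services(SAMPLE), [31, 32, 33]))
```

With the output above, only the rgmanager group is reported, with node 31 missing.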

* clustat shows:
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xenr3u1.domain.com                  31 Online
  xenr3u2.domain.com                 32 Online, Local
  xenr3u3.domain.com                 33 Online

-------------------

Then I rebooted node1 again:
Starting cluster
    Loading modules DLM .......
done
starting ccsd
starting cman
starting daemons
starting fencing
[frozen again]

After a long time, "starting fencing" shows [done], but cman_tool services reports failures.

* cman_tool services shows:
type             level name       id       state      
fence            0     default    0001001f FAIL_ALL_STOPPED
[31 32 33]
dlm              1     clvmd      00010020 FAIL_STOP_WAIT
[31 32 33]
dlm              1     rgmanager  00020020 FAIL_STOP_WAIT
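
A small Python sketch for spotting groups stuck in a failure state, using the output just above as input. It assumes that "none" in the state column means the group is idle, which is what the first paste suggests. The function name failed_groups is a hypothetical helper, not a cman tool.

```python
# Pasted `cman_tool services` output from the second boot attempt.
SAMPLE = """\
type             level name       id       state
fence            0     default    0001001f FAIL_ALL_STOPPED
[31 32 33]
dlm              1     clvmd      00010020 FAIL_STOP_WAIT
[31 32 33]
dlm              1     rgmanager  00020020 FAIL_STOP_WAIT
"""

def failed_groups(text):
    """Map group name -> state for every group whose state is not 'none'."""
    failed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("[") or line.startswith("type"):
            continue  # skip blanks, member lists and the header row
        fields = line.split()
        # Assumption: group lines have exactly five columns, state last.
        if len(fields) == 5 and fields[4] != "none":
            failed[fields[2]] = fields[4]
    return failed

if __name__ == "__main__":
    for name, state in failed_groups(SAMPLE).items():
        print(name, state)
```

For the output above it lists all three groups (default, clvmd, rgmanager) as not idle.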

* clustat shows:
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xenr3u1.domain.com                  31 Online
  xenr3u2.domain.com                 32 Online, Local
  xenr3u3.domain.com                 33 Online

/etc/init.d/rgmanager restart
Shutting down Cluster Service Manager...
Waiting for services to stop:
[takes a very long time]
----------------------------------

I found this page (translated to English): 
(http://translate.google.com/translate?u=http%3A%2F%2Fken-etsu-tech.blogspot.com%2F2007%2F11%2Fred-hat-cluster-kernel-xen.html&langpair=ja%7Cen&hl=es&ie=UTF-8). 

It describes exactly the same problem. Is it a kernel bug? A clvmd bug?

Linux xenr3u2 2.6.18-8.1.15.el5xen #1 SMP Mon Oct 22 09:01:12 EDT 2007 
x86_64 x86_64 x86_64 GNU/Linux
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos
lvm2-cluster-2.02.16-3.el5



Sometimes the node starts OK and the cman_tool output is also OK.

* /etc/lvm/lvm.conf:

devices {
    dir = "/dev"
    scan = [ "/dev" ]
    filter = [ "a/.*/" ]   
    cache = "/etc/lvm/.cache"
    write_cache_state = 1
    sysfs_scan = 1   
    md_component_detection = 1
}
log {   
    verbose = 0
    syslog = 1
    overwrite = 0  
    level = 0
    indent = 1
    command_names = 0
    prefix = "  "
}
backup {
    backup = 1
    backup_dir = "/etc/lvm/backup"
    archive = 1
    archive_dir = "/etc/lvm/archive"
    retain_min = 10
    retain_days = 30
}
shell {
    history_size = 100
}
global {
    library_dir = "/usr/lib64"
    umask = 077
    test = 0
    activation = 1
    proc = "/proc"
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
    locking_dir = "/var/lock/lvm"
}
activation {
    missing_stripe_filler = "/dev/ioerror"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    mirror_log_fault_policy = "allocate"
    mirror_device_fault_policy = "remove"
}
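
For clvmd the relevant setting in this file is locking_type = 3 (built-in clustered locking). Here is a minimal Python sketch for checking it on each node without reading the whole file by eye. The read_settings helper is hypothetical and only handles the simple "section { key = value }" layout shown above. It also treats everything after a "#" as a comment, which is a simplification.

```python
import re

# Excerpt of the global section from the config above.
SAMPLE = """\
global {
    library_dir = "/usr/lib64"
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
    locking_dir = "/var/lock/lvm"
}
"""

def read_settings(text):
    """Return {(section, key): value_string} for a flat lvm.conf-style text."""
    settings = {}
    section = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplification: drop comments
        if not line:
            continue
        m = re.fullmatch(r"(\w+)\s*\{", line)
        if m:
            section = m.group(1)  # entering a section
            continue
        if line == "}":
            section = None  # leaving the section
            continue
        m = re.fullmatch(r"(\w+)\s*=\s*(.+)", line)
        if m and section:
            # Values are kept as strings; surrounding quotes are removed.
            settings[(section, m.group(1))] = m.group(2).strip().strip('"')
    return settings

if __name__ == "__main__":
    value = read_settings(SAMPLE).get(("global", "locking_type"))
    print("locking_type =", value, "(clustered)" if value == "3" else "(not clustered)")
```

To check a real node, the file contents could be passed in instead of SAMPLE, for example read_settings(open(path).read()), with path set to wherever lvm.conf lives on that system.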



That's all ;-)
Thanks in advance







