
RE: [Linux-cluster] Cluster fails after fencing by DRAC



Hello,

Sorry to ask, but is the "none" state a normal state for services?
I have issues with cluster services too, and I've been told that this state
is not normal and indicates that the nodes didn't join the fence domain, which causes issues with rgmanager too.

What do clustat and cman_tool services show at startup?


regards, 
 
Mathieu

-----Original Message-----
From: linux-cluster-bounces redhat com [mailto:linux-cluster-bounces redhat com] On Behalf Of Jorge Gonzalez
Sent: Thursday, January 10, 2008 17:18
To: linux-cluster redhat com
Subject: [Linux-cluster] Cluster fails after fencing by DRAC

Hi all!

I have a problem with a 3-node cluster. When I run "fence_node node1",
node1 is rebooted by DRAC successfully. When node1 restarts, it freezes:

------------------
starting clvmd: dlm: got connection from 32
dlm: connecting to 33
dlm: got connection from 33
[frozen]

* cman_tool services shows:
type             level name       id       state      
fence            0     default    0001001f none       
[31 32 33]
dlm              1     clvmd      00010020 none       
[31 32 33]
dlm              1     rgmanager  00020020 none       
[32 33]

It seems the rgmanager group does not include node 31 (?)

* clustat shows:
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xenr3u1.domain.com                  31 Online
  xenr3u2.domain.com                 32 Online, Local
  xenr3u3.domain.com                 33 Online

-------------------
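
The membership gap noted above (rgmanager listing only 32 and 33) can be spotted mechanically. What follows is a minimal sketch, assuming the output layout is exactly as pasted in this message (a five-column row followed by a bracketed member list). The function names parse_services and missing_members are hypothetical helpers, not part of cman, and the expected node IDs are taken from the clustat output above.

```python
import re

def parse_services(text):
    """Parse pasted `cman_tool services` text into {"type:name": (state, [node ids])}.

    Assumes the layout shown in this thread; other cman versions may differ.
    """
    groups = {}
    last = None
    for raw in text.splitlines():
        line = raw.strip()
        members = re.match(r"^\[([\d\s]*)\]$", line)
        if members and last is not None:
            # Member list belonging to the most recent group row.
            state, _ = groups[last]
            groups[last] = (state, [int(n) for n in members.group(1).split()])
            continue
        fields = line.split()
        # Data rows look like: type level name id state (the header row has no numeric level).
        if len(fields) == 5 and fields[1].isdigit():
            last = f"{fields[0]}:{fields[2]}"
            groups[last] = (fields[4], [])
    return groups

def missing_members(groups, expected):
    """Return {group: [expected node ids that the group does not list]}."""
    result = {}
    for name, (_state, nodes) in groups.items():
        absent = sorted(set(expected) - set(nodes))
        if absent:
            result[name] = absent
    return result

SAMPLE = """\
type             level name       id       state
fence            0     default    0001001f none
[31 32 33]
dlm              1     clvmd      00010020 none
[31 32 33]
dlm              1     rgmanager  00020020 none
[32 33]
"""

if __name__ == "__main__":
    # Node IDs 31, 32, 33 come from the clustat listing in this message.
    print(missing_members(parse_services(SAMPLE), [31, 32, 33]))
    # prints {'dlm:rgmanager': [31]}
```

On a live node the text would come from running cman_tool services and capturing its output. The sample here only reuses the listing pasted above.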

Then I rebooted node1 again:
Starting cluster
    Loading modules DLM .......
done
starting ccsd
starting cman
starting daemons
starting fencing
[frozen again]

After a long time, "starting fencing" reports [done], but cman_tool services shows failure states:

* cman_tool services shows:
type             level name       id       state      
fence            0     default    0001001f FAIL_ALL_STOPPED
[31 32 33]
dlm              1     clvmd      00010020 FAIL_STOP_WAIT
[31 32 33]
dlm              1     rgmanager  00020020 FAIL_STOP_WAIT

* clustat shows:
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  xenr3u1.domain.com                  31 Online
  xenr3u2.domain.com                 32 Online, Local
  xenr3u3.domain.com                 33 Online

/etc/init.d/rgmanager restart
Shutting down Cluster Service Manager...
Waiting for services to stop:
[hangs for a very long time]
----------------------------------
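
To tell the two situations apart at a glance, one could flag every group whose state column is not "none". This is only a sketch: treating "none" as the steady state is an assumption based on the first listing in this message, abnormal_groups is a hypothetical helper name, and the sample text is the second listing pasted above.

```python
def abnormal_groups(text, normal_state="none"):
    """Return [(group name, state)] for rows whose state differs from normal_state.

    Assumes five-column rows (type level name id state) as pasted in this thread.
    """
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and the bracketed member lists.
        if len(fields) == 5 and fields[1].isdigit() and fields[4] != normal_state:
            flagged.append((fields[2], fields[4]))
    return flagged

SAMPLE = """\
type             level name       id       state
fence            0     default    0001001f FAIL_ALL_STOPPED
[31 32 33]
dlm              1     clvmd      00010020 FAIL_STOP_WAIT
[31 32 33]
dlm              1     rgmanager  00020020 FAIL_STOP_WAIT
"""

if __name__ == "__main__":
    for name, state in abnormal_groups(SAMPLE):
        print(f"{name}: {state}")
```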

I found this page (machine-translated to English):
http://translate.google.com/translate?u=http%3A%2F%2Fken-etsu-tech.blogspot.com%2F2007%2F11%2Fred-hat-cluster-kernel-xen.html&langpair=ja%7Cen&hl=es&ie=UTF-8

It describes exactly the same symptoms. Is it a kernel bug? A clvmd bug?

Linux xenr3u2 2.6.18-8.1.15.el5xen #1 SMP Mon Oct 22 09:01:12 EDT 2007 
x86_64 x86_64 x86_64 GNU/Linux
cman-2.0.64-1.0.1.el5
rgmanager-2.0.24-1.el5.centos
lvm2-cluster-2.02.16-3.el5



Sometimes the node starts fine and the cman_tool services output is also fine.

* /etc/lvm/lvm.conf:

devices {
    dir = "/dev"
    scan = [ "/dev" ]
    filter = [ "a/.*/" ]   
    cache = "/etc/lvm/.cache"
    write_cache_state = 1
    sysfs_scan = 1   
    md_component_detection = 1
}
log {   
    verbose = 0
    syslog = 1
    overwrite = 0  
    level = 0
    indent = 1
    command_names = 0
    prefix = "  "
}
backup {
    backup = 1
    backup_dir = "/etc/lvm/backup"
    archive = 1
    archive_dir = "/etc/lvm/archive"
    retain_min = 10
    retain_days = 30
}
shell {
    history_size = 100
}
global {
    library_dir = "/usr/lib64"
    umask = 077
    test = 0
    activation = 1
    proc = "/proc"
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
    locking_dir = "/var/lock/lvm"
}
activation {
    missing_stripe_filler = "/dev/ioerror"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    mirror_log_fault_policy = "allocate"
    mirror_device_fault_policy = "remove"
}
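
Since clvmd is one of the suspects, it may help to confirm which locking mode the configuration above selects. The sketch below pulls locking_type out of lvm.conf-style text. The helper name read_locking_type is hypothetical, the parsing is deliberately naive (one setting per line, "#" starts a comment), and the descriptions follow the comments in the stock lvm.conf of this LVM2 generation.

```python
import re

# Meanings as described in the stock lvm.conf comments for LVM2 of this era.
LOCKING_TYPES = {
    1: "local file-based locking",
    2: "external shared library locking",
    3: "built-in clustered locking (clvmd)",
}

def read_locking_type(conf_text):
    """Return the integer locking_type found in lvm.conf-style text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"\s*locking_type\s*=\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return None

SAMPLE = """\
global {
    library_dir = "/usr/lib64"
    umask = 077
    locking_type = 3
    fallback_to_clustered_locking = 1
    fallback_to_local_locking = 1
    locking_dir = "/var/lock/lvm"
}
"""

if __name__ == "__main__":
    value = read_locking_type(SAMPLE)
    print(value, LOCKING_TYPES.get(value, "unknown"))
    # prints 3 built-in clustered locking (clvmd)
```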



That's all ;-)
Thanks in advance









