
[Linux-cluster] info on "A processor failed" message and fencing when going to single user mode



Hello,
I have a 2-node cluster (the nodes are named virtfed and virtfedbis) running F11 x86_64, up to date as of today, without qdisk:
cman-3.0.2-1.fc11.x86_64
openais-1.0.1-1.fc11.x86_64
corosync-1.0.0-1.fc11.x86_64
and kernel 2.6.30.8-64.fc11.x86_64

I was in a situation where both nodes were up, after virtfedbis had just restarted and was starting a service.
One of the service's resources contains a loop that tests for the availability of a file, so the service was still starting, but the cluster infrastructure was up, as shown by these messages:

Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.102)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM] This node is within the primary component and will provide service.
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM] Members[1]:
Oct  5 11:44:39 virtfed corosync[4684]:   [QUORUM]     1
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:44:39 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:44:39 virtfed corosync[4684]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct  5 11:44:39 virtfed kernel: dlm: closing connection to node 2
Oct  5 11:44:39 virtfed corosync[4684]:   [MAIN  ] Completed service synchronization, ready to provide service.

So now they are in this state, as reported by virtfedbis:
[root virtfedbis ~]# clustat
Cluster Status for kvm @ Mon Oct  5 11:49:27 2009
Member Status: Quorate

 Member Name                                                ID   Status
 ------ ----                                                ---- ------
 kvm1                                                           1 Online, rgmanager
 kvm2                                                           2 Online, Local, rgmanager

 Service Name                                      Owner (Last)                                      State        
 ------- ----                                      ----- ------                                      -----        
 service:DRBDNODE1                                 kvm1                                              started      
 service:DRBDNODE2                                 kvm2                                              starting     

I realized that I had forgotten something, so the DRBDNODE2 service would not come up after its 10 attempts, and I decided to put
virtfedbis into single user mode by running on it:

shutdown 0

I would expect virtfedbis to leave the cluster cleanly; instead it is fenced and rebooted (via the fence_ilo agent).

On virtfed these are the messages:
Oct  5 11:49:49 virtfed corosync[4684]:   [TOTEM ] A processor failed, forming new configuration.
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.102)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM] This node is within the primary component and will provide service.
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM] Members[1]:
Oct  5 11:49:54 virtfed corosync[4684]:   [QUORUM]     1
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] New Configuration:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] #011r(0) ip(192.168.16.101)
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Left:
Oct  5 11:49:54 virtfed corosync[4684]:   [CLM   ] Members Joined:
Oct  5 11:49:54 virtfed corosync[4684]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct  5 11:49:54 virtfed corosync[4684]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct  5 11:49:54 virtfed kernel: dlm: closing connection to node 2
Oct  5 11:49:54 virtfed fenced[4742]: fencing node kvm2
Oct  5 11:49:54 virtfed rgmanager[5496]: State change: kvm2 DOWN
Oct  5 11:50:26 virtfed fenced[4742]: fence kvm2 success
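
To make the timing in the excerpt above easier to read, here is a minimal sketch that pulls out the key events and prints their offsets from the first one. The labels in EVENTS and the function names are my own, chosen only for illustration, and the year is assumed from the clustat header because syslog timestamps do not carry one.

```python
from datetime import datetime

# Sample lines copied from the virtfed syslog excerpt above.
LOG = """\
Oct  5 11:49:49 virtfed corosync[4684]:   [TOTEM ] A processor failed, forming new configuration.
Oct  5 11:49:54 virtfed corosync[4684]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct  5 11:49:54 virtfed fenced[4742]: fencing node kvm2
Oct  5 11:50:26 virtfed fenced[4742]: fence kvm2 success
"""

# Hypothetical labels for the message fragments of interest.
EVENTS = {
    "A processor failed": "failure detected",
    "new membership was formed": "membership formed",
    "fencing node": "fencing started",
    "fence kvm2 success": "fencing done",
}


def parse_time(line, year=2009):
    """Parse the leading syslog timestamp; the year is an assumption."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def timeline(text):
    """Return (label, seconds since the first matched event) pairs."""
    found = []
    for line in text.splitlines():
        for needle, label in EVENTS.items():
            if needle in line:
                found.append((label, parse_time(line)))
                break
    if not found:
        return []
    start = found[0][1]
    return [(label, int((t - start).total_seconds())) for label, t in found]


if __name__ == "__main__":
    for label, offset in timeline(LOG):
        print(f"+{offset:3d}s  {label}")
```

On these lines it reports 5 seconds from the failure being detected to the new membership, and 37 seconds until the fence of kvm2 is reported as successful.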

After the restart, this is what I find on virtfedbis in the /var/log/cluster directory:

corosync.log
Oct 05 11:49:49 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 05 11:49:49 corosync [TOTEM ] The network interface is down.
Oct 05 11:49:54 corosync [CLM   ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM   ] New Configuration:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM   ] Members Left:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(192.168.16.102)
Oct 05 11:49:54 corosync [CLM   ] Members Joined:
Oct 05 11:49:54 corosync [QUORUM] This node is within the primary component and will provide service.
Oct 05 11:49:54 corosync [QUORUM] Members[1]:
Oct 05 11:49:54 corosync [QUORUM]     1
Oct 05 11:49:54 corosync [CLM   ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM   ] New Configuration:
Oct 05 11:49:54 corosync [CLM   ]       r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM   ] Members Left:
Oct 05 11:49:54 corosync [CLM   ] Members Joined:
Oct 05 11:49:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 05 11:49:54 corosync [CMAN  ] Killing node kvm2 because it has rejoined the cluster with existing state

I think there is something wrong with this behaviour.
This is a test cluster, so I have no qdisk.
Is the cause inherent in my config, which has:
<cman expected_votes="1" two_node="1"/>
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>

In general, if I run shutdown -r now on one of the two nodes, I do not have this kind of problem.

Thanks for any insight,
Gianluca
