[Linux-cluster] Failover root cause

Sun Nov 11 22:49:43 UTC 2012

Hi,
I plan to implement NTP so that the clocks on both servers are
synchronized. How can I find the cause of the failover? I have already
graphed the sar data, and there is no usage peak around the time
db1svr was fenced by db2svr. Which file (and which specific message)
should I look at to find the root cause of this failover? Thank you.
Regards,

Panji
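
Until NTP is in place, the two log excerpts quoted below can still be put on one timeline if the clock offset between the nodes is measured. The following is only a minimal sketch of that idea. The helper names (parse_syslog_time, merge_logs) are hypothetical, the year 2012 is assumed because syslog timestamps carry no year, and the 5-minute offset and its sign are placeholders: the thread only says the clocks currently differ by about 5 minutes, not which node is ahead or what the offset was on Oct 23.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line.

    Assumption: the year is not in the log, so it is passed in.
    """
    stamp = " ".join(line.split()[:3])  # e.g. "Oct 23 12:54:19"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def merge_logs(reference_lines, other_lines, other_offset):
    """Merge two logs into one list of (time, line) pairs sorted by time.

    other_offset is added to every timestamp from other_lines so that
    they are expressed in the reference node's clock.
    """
    merged = [(parse_syslog_time(line), line) for line in reference_lines]
    merged += [(parse_syslog_time(line) + other_offset, line)
               for line in other_lines]
    merged.sort(key=lambda pair: pair[0])
    return merged


# Lines taken from the excerpts in this thread (wrapped lines rejoined).
node2_lines = [
    "Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed, forming new configuration.",
    "Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1",
]
node1_lines = [
    "Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing /etc/init.d/httpd status",
    "Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started.",
]

# Placeholder: pretend db1svr's clock was 5 minutes behind db2svr's.
# Replace with the measured offset (it may well be negative).
assumed_offset = timedelta(minutes=5)

merged = merge_logs(node2_lines, node1_lines, assumed_offset)
for when, line in merged:
    print(when.strftime("%H:%M:%S"), "|", line)
```

The adjusted time is printed in front of each original line, so the untouched log text stays visible next to it. If the offset was not constant over time, the merged order is only approximate.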

On Fri, Nov 9, 2012 at 10:40 AM, Yu <songyu555 at gmail.com> wrote:
> Regardless of what root cause you find, a cluster requires the NTP service to keep the time on all nodes synchronized. So you have to fix this 5-minute difference now.
>
> Regards
> Yu
>
> On 09/11/2012, at 11:47, Muhammad Panji <sumodirjo at gmail.com> wrote:
>
>> Dear All,
>> I have an Oracle cluster on RHEL 6.2 with 2 servers. Several days ago
>> the service failed over from node1 to node2. In /var/log/messages
>> on node2 I only see these messages:
>>
>> ...
>> Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed,
>> forming new configuration.
>> Oct 23 12:54:21 db2svr corosync[4142]:   [QUORUM] Members[1]: 2
>> Oct 23 12:54:21 db2svr corosync[4142]:   [TOTEM ] A processor joined
>> or left the membership and a new membership was formed.
>> Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1
>> Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN
>> Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1
>> ...
>>
>> Googling the message "[TOTEM ] A processor failed, forming new
>> configuration." I learned that it means node2 could not see node1 and
>> then fenced node1. On node1 I get these messages:
>>
>> Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing
>> /etc/init.d/httpd status
>> Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started.
>> Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd"
>> swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"]
>> (re)start
>> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset
>> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu
>> Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64
>> (mockbuild at x86-004.build.bos.redhat.com) (gcc version 4.4.5 20110214
>> (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
>>
>> At 12:50 rgmanager was still checking the service, and then the node
>> was rebooted. What makes it worse is that the date/time on the two
>> servers differs, so I can't compare the logs directly. The current
>> time difference between the servers is around 5 minutes.
>>
>> I would like to ask where to look for the cause of this failover. I
>> plan to graph the sar data today to see whether there was a bottleneck
>> on CPU etc. that kept node1 from sending its status to node2. But if
>> there is no bottleneck on CPU, RAM, etc., where should I look for the
>> root cause of the failover? Thank you.
>> Regards,
>>
>> --
>> Muhammad Panji
>> http://www.panji.web.id
>> http://www.kurungsiku.com
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
Muhammad Panji
http://www.panji.web.id
http://www.kurungsiku.com