
[Linux-cluster] Fencing when missed too many heartbeats



We already have a premium subscription ticket open on this, but I wanted to put the question to this development list as well, in the hope of hearing from the software engineers and making this scenario clearer for other users:

1)  When one node detects 'missed too many heartbeats', what decision-making process leads to the final outcome of fencing the node?

2)  If a few nodes are down for maintenance and left the cluster with "remove", which adjusted the 'quorum' count but not the 'expected' count, how might this affect question #1?

It would be even better if the responses could use our 11-node RHEL AS 4.5 cluster as the example:

$ cman_tool nodes
Node  Votes Exp Sts  Name
   1    1   19   M   db2
   2    5   19   M   net1
   3    5   19   M   net2
   4    1   19   M   db4
   5    1   19   M   db1
   6    1   19   M   db5
   7    1   19   X   app3
   8    1   19   X   app2
   9    1   19   M   app6
  10    1   19   M   db3
  11    1   19   X   net3

LVS network tier: net1 (5-votes), net2 (5-votes), net3 (remove)
Application tier: app2 (remove), app3 (remove), app6
Database tier: db1, db2, db3, db4, db5

Expected: 19, Quorum: 9, Total votes: 16
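As a sanity check on the figures above, the following minimal sketch recomputes them from the `cman_tool nodes` table. It assumes a simple-majority rule, quorum = floor(votes / 2) + 1. That rule is an assumption for illustration, not confirmed CMAN behavior, although it does reproduce the reported quorum of 9 from the 16 member votes. The helper names `majority` and `member_votes` are hypothetical.

```python
# Node table copied from the `cman_tool nodes` output above: (name, votes, status).
NODES = [
    ("db2", 1, "M"), ("net1", 5, "M"), ("net2", 5, "M"), ("db4", 1, "M"),
    ("db1", 1, "M"), ("db5", 1, "M"), ("app3", 1, "X"), ("app2", 1, "X"),
    ("app6", 1, "M"), ("db3", 1, "M"), ("net3", 1, "X"),
]


def majority(votes):
    """Simple-majority quorum (assumed rule, not taken from the CMAN source)."""
    return votes // 2 + 1


def member_votes(nodes, lost=()):
    """Sum the votes of nodes in state 'M', excluding any names in `lost`."""
    return sum(v for name, v, sts in nodes if sts == "M" and name not in lost)


expected = sum(v for _, v, _ in NODES)          # votes of all 11 configured nodes
total = member_votes(NODES)                     # votes of current members only
quorum = majority(total)                        # quorum derived from member votes
after_net1 = member_votes(NODES, lost={"net1"})

print("expected:", expected)
print("total votes:", total)
print("quorum from total votes:", quorum)
print("quorum from expected votes:", majority(expected))
print("votes left without net1:", after_net1, "quorate:", after_net1 >= quorum)
```

Under that assumed rule, losing net1 and its 5 votes would leave the remaining members above the quorum threshold whether the threshold is derived from the total votes or from the expected votes.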

FYI: the nodes net3, app2, and app3 left this cluster with "remove" for some isolated testing of the RHEL AS 4.6 update, but only net3 was left powered on.  It was in this state for over a week.

The syslog messages from each member show that net1 went 'dark':

Mar 15 16:20:28 net2 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:20:29 net2 fenced[19273]: fencing deferred to db2
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> Magma Event: Membership Change
Mar 15 16:23:05 net2 clurgmgrd[20012]: <info> State change: net1 DOWN

Mar 15 12:29:16 app6 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 12:29:17 app6 fenced[19015]: fencing deferred to db2
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> Magma Event: Membership Change
Mar 15 12:31:53 app6 clurgmgrd[21831]: <info> State change: net1 DOWN

Mar 15 16:29:19 db1 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db1 fenced[19297]: fencing deferred to db2
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> State change: net1 DOWN

Mar 15 16:29:19 db2 kernel: CMAN: removing node net1 from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db2 fenced[14778]: net1 not a cluster member after 0 sec post_fail_delay
Mar 15 16:29:20 db2 fenced[14778]: fencing node "net1"
Mar 15 16:31:48 db2 ccsd[14677]: Attempt to close an unopened CCS descriptor (151704870).
Mar 15 16:31:48 db2 ccsd[14677]: Error while processing disconnect: Invalid request descriptor
Mar 15 16:31:48 db2 fenced[14778]: fence "net1" success

Mar 15 16:29:19 db3 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db3 fenced[19097]: fencing deferred to db2
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db3 clurgmgrd[21315]: <info> State change: net1 DOWN

Mar 15 16:29:19 db4 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db4 fenced[19126]: fencing deferred to db2
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db4 clurgmgrd[21182]: <info> State change: net1 DOWN

Mar 15 16:29:19 db5 kernel: CMAN: node net1 has been removed from the cluster : Missed too many heartbeats
Mar 15 16:29:20 db5 fenced[14508]: fencing deferred to db2
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> Magma Event: Membership Change
Mar 15 16:31:56 db5 clurgmgrd[17187]: <info> State change: net1 DOWN
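To put numbers on the timeline in these excerpts, here is a small sketch that measures the intervals between the removal, the fence operation on db2, and the rgmanager state change on db1. It assumes the db1 and db2 clocks agree, which their matching removal timestamps suggest but do not prove. The helper name `stamp` is hypothetical.

```python
from datetime import datetime

# Lines copied from the db2 and db1 excerpts above.
LOG = [
    'Mar 15 16:29:19 db2 kernel: CMAN: removing node net1 from the cluster : Missed too many heartbeats',
    'Mar 15 16:29:20 db2 fenced[14778]: fencing node "net1"',
    'Mar 15 16:31:48 db2 fenced[14778]: fence "net1" success',
    'Mar 15 16:31:56 db1 clurgmgrd[21436]: <info> State change: net1 DOWN',
]


def stamp(line):
    """Parse the leading syslog timestamp of a log line."""
    # syslog omits the year; 2000 is an arbitrary placeholder so strptime has one.
    return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")


removed, fence_start, fence_done, rgmanager_down = (stamp(line) for line in LOG)

fence_secs = (fence_done - fence_start).total_seconds()
notify_secs = (rgmanager_down - fence_done).total_seconds()
overall_secs = (rgmanager_down - removed).total_seconds()

print("fence operation took", fence_secs, "seconds")
print("rgmanager saw net1 DOWN", notify_secs, "seconds after fence success")
print("removal to state change:", overall_secs, "seconds")
```

On these figures, almost all of the delay between the removal and the state change is the fence operation itself.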

It may be of no consequence, but note that there was clock drift on net2 because of a failed NTP server, and on app6 because its clock was not recalibrated after it had been down for a few weeks for a motherboard swap and memory upgrade.
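The size of the drift can be estimated from the excerpts above. This is a rough sketch that assumes every member logged the net1 removal at practically the same real instant, so any difference from db2's timestamp is clock offset. That assumption is not verified here, and the function name `offsets` is hypothetical.

```python
from datetime import datetime

# Time at which each host logged the net1 removal, copied from the excerpts above.
REMOVAL_LOGGED = {
    "net2": "Mar 15 16:20:28",
    "app6": "Mar 15 12:29:16",
    "db1": "Mar 15 16:29:19",
    "db2": "Mar 15 16:29:19",
    "db3": "Mar 15 16:29:19",
    "db4": "Mar 15 16:29:19",
    "db5": "Mar 15 16:29:19",
}


def offsets(stamps, reference="db2"):
    """Return each host's apparent clock offset in seconds relative to `reference`."""
    # syslog omits the year; 2000 is an arbitrary placeholder so strptime has one.
    parsed = {
        host: datetime.strptime("2000 " + ts, "%Y %b %d %H:%M:%S")
        for host, ts in stamps.items()
    }
    ref = parsed[reference]
    return {host: int((t - ref).total_seconds()) for host, t in parsed.items()}


skew = offsets(REMOVAL_LOGGED)
for host, secs in sorted(skew.items()):
    print(f"{host}: {secs:+d} s relative to db2")
```

Under that assumption, net2 appears to run about 9 minutes behind db2 and app6 about 4 hours behind, while the five db nodes agree to the second.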


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.



