[Linux-cluster] 3 node cluster crashes
Dalton, Maurice
bobby.m.dalton at nasa.gov
Tue Aug 5 21:56:04 UTC 2008
I have a 3-node cluster running cman-2.0.84-2.el5. At times we have
spanning tree events that cause network storms lasting up to 9 seconds.
When these events occur (today we triggered them twice to verify this
issue), all three nodes go down within seconds of the event.
For the second test I added the totem token statement shown
below. Same problem.
<cman>
<multicast addr="225.0.0.11"/>
<totem token="21000"/>
</cman>
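As a quick sanity check on the numbers, here is a minimal Python sketch that reads the token value out of the fragment above and compares it with the 9-second storm. The totem token attribute is in milliseconds; the function name and the STORM_SECONDS constant are made up for illustration. This is only arithmetic on the posted config and says nothing about whether the running openais actually picked the value up.

```python
import xml.etree.ElementTree as ET

# The <cman> fragment exactly as posted above.
FRAGMENT = """
<cman>
<multicast addr="225.0.0.11"/>
<totem token="21000"/>
</cman>
"""

# Longest storm reported in this post (assumption: taken as a fixed 9 s).
STORM_SECONDS = 9


def token_timeout_ms(xml_text):
    """Return the totem token attribute in milliseconds, or None if absent."""
    root = ET.fromstring(xml_text)
    totem = root.find(".//totem")
    if totem is None or "token" not in totem.attrib:
        return None
    return int(totem.attrib["token"])


timeout_ms = token_timeout_ms(FRAGMENT)
margin_ms = timeout_ms - STORM_SECONDS * 1000
print(f"token timeout: {timeout_ms} ms, margin over storm: {margin_ms} ms")
```

On paper the configured timeout is longer than the storm, so the storm length alone would not be expected to exhaust it if the setting were in effect.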
Aug 5 16:41:18 csarcsys2-eth0 ntpd[3484]: kernel time sync enabled 0001
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 2.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 0.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Creating commit token because I am the rep.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Saving state aru 46 high seq received 46
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Storing new sequence id for ring b50
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT state.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY state.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] position [0] member 172.xx.xx.xxx:
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] previous ring seq 2892 rep 172.xx.xxx.xx
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] aru 46 high delivered 46 received flag 1
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Did not need to originate any messages in recovery.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Sending initial ORF token
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] CLM CONFIGURATION CHANGE
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] New Configuration:
Aug 5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 1
Aug 5 16:41:24 csarcsys2-eth0 clurgmgrd[3750]: <emerg> #1: Quorum Dissolved
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] r(0) ip(172.xx.xxx.xx)
Aug 5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 3
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] Members Left:
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] r(0) ip(172.xx.xxx.xx)
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] r(0) ip(172.xx.xxx.xx)
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] Members Joined:
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CMAN ] quorum lost, blocking activity
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] CLM CONFIGURATION CHANGE
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] New Configuration:
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] r(0) ip(172.xx.xxx.xx)
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate. Refusing connection.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] Members Left:
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] Members Joined:
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [SYNC ] This node is within the primary component and will provide service.
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Someone may be attempting something evil.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering OPERATIONAL state.
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing get: Invalid request descriptor
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM ] got nodejoin message 172.24.86.143
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate. Refusing connection.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CPG ] got joinlist message from node 2
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug 5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).
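To make the timing in the log above easier to read, here is a rough Python sketch that pulls the openais events out of a few of those lines and measures the gap between the token loss and the quorum loss. The sample lines are copied from this post with the mail line wrapping undone; the regular expression and helper names are illustrative only and assume the syslog layout seen here.

```python
import re

# A few openais lines from the log above, with mail wrapping undone.
LOG = """\
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug 5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 2.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 0.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT state.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY state.
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [CMAN ] quorum lost, blocking activity
Aug 5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering OPERATIONAL state.
"""

# month, day, HH:MM:SS, host, openais[pid]:, [SUBSYSTEM], message
LINE = re.compile(
    r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d) \S+ openais\[\d+\]: \[(\w+)\s*\] (.*)$"
)


def seconds_of_day(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def events(text):
    """Return (seconds, subsystem, message) for each openais line."""
    out = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            out.append((seconds_of_day(match.group(1)), match.group(2), match.group(3)))
    return out


ev = events(LOG)
token_lost = next(t for t, _, msg in ev if "token was lost" in msg)
quorum_lost = next(t for t, _, msg in ev if "quorum lost" in msg)
delta = quorum_lost - token_lost
print(f"token lost -> quorum lost: {delta} s")
```

For these lines the gap works out to 5 seconds (16:41:19 to 16:41:24), which is well short of the 21000 ms token value in the config above.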