[Linux-cluster] 3 node cluster crashes

Dalton, Maurice bobby.m.dalton at nasa.gov
Tue Aug 5 21:56:04 UTC 2008


 

I have a 3-node cluster running cman-2.0.84-2.el5. At times we have
spanning tree events that cause network storms lasting up to 9 seconds.

When these events occur, all three nodes go down within seconds. Today
we triggered the event twice on purpose to verify this.

The second time we tried it, I added the totem token statement shown
below. Same problem.

<cman>
        <multicast addr="225.0.0.11"/>
        <totem token="21000"/>
</cman>
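
For reference, the totem token value is given in milliseconds, so 21000 is 21 seconds, which on paper is longer than the 9-second storms. A minimal Python sketch of that comparison follows; the fragment is copied from above, while STORM_SECONDS and token_timeout_seconds are names made up for this illustration, and the check says nothing about whether the running cluster actually picked the setting up.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the message above.
CMAN_FRAGMENT = """
<cman>
    <multicast addr="225.0.0.11"/>
    <totem token="21000"/>
</cman>
"""

# Longest storm duration reported above (hypothetical constant name).
STORM_SECONDS = 9


def token_timeout_seconds(fragment: str) -> float:
    """Return the totem token timeout in seconds (the attribute is in ms)."""
    root = ET.fromstring(fragment)
    totem = root.find("totem")
    if totem is None or "token" not in totem.attrib:
        raise ValueError("no totem token attribute found")
    return int(totem.attrib["token"]) / 1000.0


timeout = token_timeout_seconds(CMAN_FRAGMENT)
print(f"token timeout: {timeout:.0f} s, storm: {STORM_SECONDS} s")
print("timeout exceeds storm:", timeout > STORM_SECONDS)
```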

 

 

 

Aug  5 16:41:18 csarcsys2-eth0 ntpd[3484]: kernel time sync enabled 0001
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 2.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 0.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Creating commit token because I am the rep.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Saving state aru 46 high seq received 46
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Storing new sequence id for ring b50
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT state.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY state.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] position [0] member 172.xx.xx.xxx:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] previous ring seq 2892 rep 172.xx.xxx.xx
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] aru 46 high delivered 46 received flag 1
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Did not need to originate any messages in recovery.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] Sending initial ORF token
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:
Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 1
Aug  5 16:41:24 csarcsys2-eth0 clurgmgrd[3750]: <emerg> #1: Quorum Dissolved
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 kernel: dlm: closing connection to node 3
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CMAN ] quorum lost, blocking activity
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] New Configuration:
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ]   r(0) ip(172.xx.xxx.xx)
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate. Refusing connection.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Left:
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] Members Joined:
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [SYNC ] This node is within the primary component and will provide service.
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Someone may be attempting something evil.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering OPERATIONAL state.
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing get: Invalid request descriptor
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CLM  ] got nodejoin message 172.24.86.143
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Cluster is not quorate. Refusing connection.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [CPG  ] got joinlist message from node 2
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Error while processing connect: Connection refused
Aug  5 16:41:24 csarcsys2-eth0 ccsd[3031]: Invalid descriptor specified (-111).
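
To make the sequence in the log above easier to follow, here is a rough Python sketch that pulls out only the TOTEM state changes and the time between the token loss and the return to OPERATIONAL. The sample lines are copied from the log with the mail wrapping undone; the function name totem_events and the TOKEN_LOST label are invented for this illustration. Syslog timestamps only have one-second resolution, so the elapsed time is approximate.

```python
import re
from datetime import datetime

# A few lines from the log above, with the mail line-wrapping undone.
LOG = """\
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug  5 16:41:19 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 2.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering GATHER state from 0.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering COMMIT state.
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering RECOVERY state.
Aug  5 16:41:24 csarcsys2-eth0 clurgmgrd[3750]: <emerg> #1: Quorum Dissolved
Aug  5 16:41:24 csarcsys2-eth0 openais[3096]: [TOTEM] entering OPERATIONAL state.
"""

# Matches only openais TOTEM lines; captures the clock time and the message.
LINE = re.compile(
    r"^\w{3}\s+\d+ (\d\d:\d\d:\d\d) \S+ openais\[\d+\]: \[TOTEM\] (.*)$"
)
STATE = re.compile(r"entering (\w+) state")


def totem_events(log):
    """Return (time, label) pairs for token loss and state transitions."""
    events = []
    for line in log.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # not a TOTEM line (e.g. clurgmgrd, ccsd, kernel)
        clock, msg = m.groups()
        s = STATE.search(msg)
        if s:
            label = s.group(1)
        elif "token was lost" in msg:
            label = "TOKEN_LOST"  # illustrative label, not an openais term
        else:
            continue
        events.append((datetime.strptime(clock, "%H:%M:%S"), label))
    return events


events = totem_events(LOG)
for when, label in events:
    print(when.strftime("%H:%M:%S"), label)

elapsed = (events[-1][0] - events[0][0]).total_seconds()
print(f"token loss to OPERATIONAL: {elapsed:.0f} s")
```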
