
Re: [Linux-cluster] Cluster environment issue



Thank you so much for your reply again.

--- On Tue, 5/31/11, Kaloyan Kovachev <kkovachev varna net> wrote:


> If it is a switch restart you will have in your logs the interface going
> down/up, but more problematic is to find a short drop of the multicast

I checked all nodes and did not find anything about the interface, but all the nodes report server19 (node 12) and server18 (node 11) as the problematic ones. Here are the logs from three nodes (out of 16):

   May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.
   May 24 18:05:01 server7 crond[5068]: (root) CMD (  /opt/hp/hp-health/bin/check-for-restart-requests)
   May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11.

   May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12.
   May 24 18:05:01 server1 crond[2275]: (root) CMD (  /opt/hp/hp-health/bin/check-for-restart-requests)
   May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state from 11.

   May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state from 12.
   May 24 18:05:01 server8 crond[11125]: (root) CMD (  /opt/hp/hp-health/bin/check-for-restart-requests)
   May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state from 11.
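With 16 nodes it may be easier to collect these messages mechanically than to read each log by hand. Below is a minimal sketch. It assumes the standard syslog layout shown above, with the hostname in the 4th field, and that the logs from all nodes have been copied to one place. It counts the `entering GATHER state from N` messages per host, keyed by the trailing number exactly as logged. The script treats that number as an opaque label and does not decide whether it is a node ID or an internal openais code. The function name `gather_summary` is hypothetical.

```bash
#!/usr/bin/env bash
# Count TOTEM "entering GATHER state from N" messages per host.
# Assumes syslog lines like:
#   May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.
# Field 4 is the hostname; the last field is N with a trailing period.
gather_summary() {
  awk '/\[TOTEM\] entering GATHER state from/ {
         host = $4
         n = $NF
         sub(/\.$/, "", n)              # "12." -> "12"
         count[host " from " n]++
       }
       END { for (k in count) print k, count[k] }' "$@" | LC_ALL=C sort
}
```

For example, `gather_summary /var/log/messages` reads one file. With no arguments the function reads standard input, so the logs from several nodes can also be piped in together.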


Here are some lines from node12 (server19) at the same time:
___________________________________________________


May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state.
May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2.
May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11.
May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f high seq received 39a8f
May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id for ring 2af0
May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state.
May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state.


Here are a few lines from node11, i.e. server18:
------------------------------------------

May 24 18:04:48 server18
May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; version='2.0.10'
May 24 18:10:14 server18 Bootdata ok (command line is ro root=/dev/vgroot_xen/lvroot rhgb quiet)


So it seems that node11 rebooted. Its last log entry before the gap is at 18:04:48, and it started up again at 18:10:14, just a few minutes after the problems appear in the logs of all the nodes.
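One way to spot the same pattern on other nodes is to look for syslog-ng restarts and report the timestamp of the line logged just before each one, which roughly brackets the down time. This is a sketch that assumes the syslog-ng start-up message format shown above. `reboot_gaps` is a hypothetical helper name. Run it on one host's log at a time, because the "previous line" carries over between files.

```bash
#!/usr/bin/env bash
# For each "syslog-ng starting up" line, print the timestamp of the line
# logged just before it and the restart time. Use on one host's log at a time.
reboot_gaps() {
  awk '/syslog-ng\[[0-9]+\]: syslog-ng starting up/ {
         ts = $1 " " $2 " " $3
         if (NR > 1) print "last line before restart: " last_ts ", restart at: " ts
         else        print "restart at: " ts " (no earlier line in input)"
       }
       { last_ts = $1 " " $2 " " $3 }' "$@"
}
```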


> You may ask the network people to check for STP changes and double check
> the multicast configuration and you may also try to use broadcast instead
> of multicast or use a dedicated switch.

As for a dedicated switch, the network team does not think it is possible. I asked them about STP changes; their answer is:

"there are no STP changes for the private network as there are no redundant devices in the environment. The multicast config is IGMP snooping with PIM."

I have talked to the network team about using broadcast instead of multicast; according to them, they can set it up.

Please comment on this.

> your interface and multicast address)
>     ping -I ethX -b -L 239.x.x.x -c 1
> and finally run this script until the cluster gets broken

Yes, I have checked it and it is working fine now. I have also set up a cron
job for this script on one node.
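Here is a sketch of one way to make the cron check leave a record only when the probe fails, so that a drop can later be lined up with the TOTEM messages. It wraps the ping command quoted above. `probe_and_log`, the log path, and the ethX / 239.x.x.x placeholders are assumptions to be replaced. ping returns a non-zero exit status when no reply arrives, and the wrapper keys on that status.

```bash
#!/usr/bin/env bash
# Run a probe command once; if it fails, append a timestamped line to a log.
# Hypothetical usage with the multicast ping (placeholders, adjust as needed):
#   probe_and_log /var/tmp/mcast-probe.log ping -I ethX -b -L 239.x.x.x -c 1
probe_and_log() {
  local logfile=$1
  shift
  if "$@" >/dev/null 2>&1; then
    return 0                       # probe succeeded, nothing to record
  fi
  printf '%s probe failed: %s\n' "$(date '+%F %T')" "$*" >>"$logfile"
  return 1
}
```

Cron runs a job at most once a minute, so a short multicast drop can fall between two runs. Calling the function from a `while sleep 1; do ...; done` loop narrows that window.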

I have a few questions regarding the cluster configuration:


   - We are using CLVM in the cluster environment. As I understand it, it is active-active.
     The environment is Xen: all the Xen hosts are in the cluster and each host runs guests.
     We want to keep the option to live-migrate guests from one host to another.

   - I was looking at the Red Hat knowledgebase article https://access.redhat.com/kb/docs/DOC-3068.
     Based on that document, which do you think would be the better choice: CLVM or HA-LVM?
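Before comparing CLVM and HA-LVM, it may help to confirm which volume groups are currently marked as clustered. Here is a minimal sketch, assuming LVM2's `vgs` reporting, where the 6th character of the `vg_attr` field is `c` for a clustered VG. The helper name `clustered_vgs` is hypothetical.

```bash
#!/usr/bin/env bash
# Read "vg_name vg_attr" pairs (e.g. from: vgs --noheadings -o vg_name,vg_attr)
# and report whether each VG is clustered. The 6th vg_attr character is 'c'
# for a clustered VG; the 5th may also be 'c' (contiguous allocation), so
# only position 6 is checked.
clustered_vgs() {
  awk 'NF >= 2 {
         state = (substr($2, 6, 1) == "c") ? "clustered (CLVM)" : "not clustered"
         print $1 ": " state
       }'
}
```

Usage: `vgs --noheadings -o vg_name,vg_attr | clustered_vgs`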

Please advise.
 

Thanks and regards again.

