[Linux-cluster] Network switch problem

Fabio M. Di Nitto fdinitto at redhat.com
Sun Aug 21 05:15:38 UTC 2011


Hi Nicolas,

On 08/19/2011 09:48 PM, Nicolas Ross wrote:
> Hi !
> 
> We have a cluster of 8 nodes that are split across two 24-port gigabit
> network switches. Port one on each server is used for services, and port
> two for the "totem ring", i.e. cluster communications.
> 
> The servers are split 4 on each switch, with each port configured to
> the proper VLAN. We have a VLAN trunk between the switches.
> 
> I need to reboot one or both switches without interrupting the cluster
> services. In the past (i.e. before there were critical services), I
> rebooted a switch and the cluster lost quorum; all services stopped
> and restarted once quorum came back. I can live with a minute or so
> without services while the switch reboots, but not 5 or 10 minutes
> while the services stop and start.
> 
> Now, to reboot the switch, I plan on adding a 3rd temporary switch just
> for the cluster VLAN, and connecting, one by one, the network interfaces
> to that switch.
> 
> So, if I disconnect the cluster network interface on a node, will that
> node immediately be fenced, or do I have some time, let's say 10 seconds,
> to complete the reconnect?
> 
> I also see that each node has a TCP connection to the other nodes. So,
> will the disconnect / reconnect sever that connection completely, or
> will it be retried?
> 
> Thanks for any insights.

Assuming you have the option to add a 3rd switch (or even a 4th one) and
one or two extra network cards available on each server, you can use a
slightly different setup that would allow you to reboot all the switches
without any service interruption.

What most people do is:

serverX -> eth0 -> switch0
        -> eth1 -> switch1
        -> eth2 -> switch2
        -> eth3 -> switch3

eth0 and eth1 are configured in bonding (IIRC bonding mode 1,
active-backup, is the only supported mode for cluster heartbeat, but
check the KB on the Red Hat website), and that's where you allow cluster
heartbeat traffic.
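
As a rough sketch only (the interface names, bond name and addresses
below are made-up examples, and the exact options should be checked
against the Red Hat KB for your release), the heartbeat bond on each
node would look something like this with RHEL-style ifcfg files:

  # /etc/sysconfig/network-scripts/ifcfg-bond0 -- cluster heartbeat bond
  DEVICE=bond0
  IPADDR=10.0.0.11
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none
  # mode=1 is active-backup: one slave carries traffic, the other takes
  # over when its link (or its switch) goes down
  BONDING_OPTS="mode=1 miimon=100"

  # /etc/sysconfig/network-scripts/ifcfg-eth0 -- same again for eth1
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none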

eth2 and eth3 are also configured in bonding, but you have greater
freedom in the choice of mode (load balancing, for example, to increase
bandwidth to 2x) for services.
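
Again only as an illustrative sketch (names and the mode are
assumptions, not a recommendation): since eth2 and eth3 go to two
different switches, a mode that does not need switch-side aggregation,
such as balance-alb (mode 6), is the usual choice; 802.3ad (mode 4)
would need both ports on the same switch or a switch stack/MLAG.

  # /etc/sysconfig/network-scripts/ifcfg-bond1 -- service bond (example)
  DEVICE=bond1
  IPADDR=192.168.1.11
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none
  # balance-alb spreads traffic over both slaves without any special
  # switch-side configuration
  BONDING_OPTS="mode=6 miimon=100"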

switch0 and switch1 / switch2 and switch3 would be configured with a
trunk between them, like you have now.

With such a setup, you can have up to two switches offline at the same
time, as long as they are not on the same bond/trunk.
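
When you actually take a switch down, you can watch the failover from
the node side; the kernel bonding driver exposes the state of each bond
(the bond names here match the sketches above):

  # show mode, currently active slave and per-slave link state
  cat /proc/net/bonding/bond0
  # while switch0 reboots, the active slave should flip from eth0 to
  # eth1 without bond0 ever losing its carrier
  grep -i "currently active slave" /proc/net/bonding/bond0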

A soon-to-be-supported technology in RHEL6 is Redundant Ring, which
allows you to use two separate LANs for cluster heartbeats (one
primary, one backup).
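
Just to give an idea of the shape it takes (a hypothetical skeleton,
not something to deploy as-is; the node names and the rrp_mode choice
are assumptions), the second ring ends up being declared per node in
cluster.conf roughly like this:

  <cluster name="example" config_version="1">
    <!-- passive: ring1 is only used when ring0 fails -->
    <totem rrp_mode="passive"/>
    <clusternodes>
      <clusternode name="node1-ring0" nodeid="1">
        <!-- hostname/IP of this node on the second heartbeat LAN -->
        <altname name="node1-ring1"/>
      </clusternode>
      <!-- remaining nodes declared the same way -->
    </clusternodes>
  </cluster>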

Fabio



