[Linux-cluster] bonding

Thu Apr 12 13:52:47 UTC 2007

I have the same hardware configuration on 11 nodes, but without any of
the spurious failover events.  The main thing I had to do differently was
to increase the bond device count to 2 (the driver defaults to only 1),
since I have two bonds, teamed across dual tg3/e1000 ports on the
motherboard and a PCI card.  bond0 is on a gigabit switch, while bond1
is on a 100 Mbps switch.
In /etc/modprobe.conf:

alias bond0 bonding
alias bond1 bonding
options bonding max_bonds=2 mode=1 miimon=100 updelay=200
alias eth0 e1000
alias eth1 e1000
alias eth2 tg3
alias eth3 tg3

So eth0/eth2 are teamed, and eth1/eth3 are teamed.  In dmesg:

e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
bonding: bond0: making interface eth0 the new active one 0 ms earlier.
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: enslaving eth2 as a backup interface with a down link.
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
bonding: bond0: link status definitely up for interface eth2.
e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
bonding: bond1: making interface eth1 the new active one 0 ms earlier.
bonding: bond1: enslaving eth1 as an active interface with an up link.
bonding: bond1: enslaving eth3 as a backup interface with a down link.
bond0: duplicate address detected!
tg3: eth3: Link is up at 100 Mbps, full duplex.
tg3: eth3: Flow control is off for TX and off for RX.
bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
bonding: bond1: link status definitely up for interface eth3.
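
For anyone comparing logs across nodes, here is a minimal sketch (not
something I run in production) that tallies the "link status definitely
up/down" messages per interface.  It assumes the bonding driver message
format shown in the dmesg output above; the function name
count_link_events is made up for illustration.

```python
import re
from collections import Counter

# Matches bonding driver messages of the form seen in dmesg, e.g.
#   "bonding: bond0: link status definitely up for interface eth2."
# Messages without "definitely" (the updelay notice) are ignored on purpose.
LINK_EVENT = re.compile(
    r"bonding: (?P<bond>\w+): link status definitely (?P<state>up|down) "
    r"for interface (?P<iface>\w+)"
)


def count_link_events(log_text):
    """Return a Counter keyed by (bond, interface, state)."""
    events = Counter()
    for match in LINK_EVENT.finditer(log_text):
        key = (match.group("bond"), match.group("iface"), match.group("state"))
        events[key] += 1
    return events


# Sample lines copied from the dmesg output above.
sample = (
    "bonding: bond0: link status up for interface eth2, enabling it in 200 ms.\n"
    "bonding: bond0: link status definitely up for interface eth2.\n"
)
for (bond, iface, state), n in sorted(count_link_events(sample).items()):
    print(bond, iface, state, n)
```

To use it on a live node, pass the text of dmesg (or the kernel log
file) to count_link_events() instead of the sample string.  A node with
many "down" events on one interface is the one to look at.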

$ uname -srvmpio
Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:54

$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0b

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:53
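
If you want to check the Link Failure Count on many nodes without
reading each file by eye, something like the following could work.  It
is only a sketch that assumes the "Key: value" layout of the
/proc/net/bonding output shown above; the names parse_bonding_status
and slaves_with_failures are hypothetical.

```python
def parse_bonding_status(text):
    """Split /proc/net/bonding/bondX text into (bond_info, slaves).

    bond_info holds the fields that appear before the first
    "Slave Interface" line; slaves is a list of dicts, one per slave.
    """
    bond_info = {}
    slaves = []
    current = bond_info
    for line in text.splitlines():
        # Split on the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "Key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = {"name": value}
            slaves.append(current)
        else:
            current[key] = value
    return bond_info, slaves


def slaves_with_failures(slaves):
    """Return (name, count) for each slave whose failure count is nonzero."""
    result = []
    for slave in slaves:
        count = int(slave.get("Link Failure Count", 0))
        if count > 0:
            result.append((slave["name"], count))
    return result
```

On a node you would feed it the file contents, e.g.
parse_bonding_status(open("/proc/net/bonding/bond0").read()).  With the
bond0 and bond1 output above, slaves_with_failures() returns an empty
list, since every slave shows a failure count of 0.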

On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote:

> I have every node in my four node cluster setup to do active-backup
> bonding and the drivers loaded for the bonded network interfaces vary
> between tg3 and e100.  All interfaces with the e100 driver loaded report
> errors much like what you see here:
> 
> bonding: bond0: link status definitely down for interface eth2, disabling it
> e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> bonding: bond0: link status definitely up for interface eth2.
> 
> This happens all day on every node.  I have configured the bonding
> module to do MII link monitoring at a frequency of 100 milliseconds and
> it is using basic carrier link detection to test if the interface is
> alive or not.  There was no custom building of any modules on these
> nodes and the o/s is CentOS 4.3.
> 
> Some more relevant information is below (this display is consistent
> across all nodes):
> 
> [smccl@tf35 ~]$ uname -srvmpio
> Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686 i386 GNU/Linux
> 
> [smccl@tf35 ~]$ head -5 /etc/modprobe.conf
> alias bond0 bonding
> options bonding miimon=100 mode=1
> alias eth0 tg3
> alias eth1 tg3
> alias eth2 e100
> 
> [smccl@tf35 ~]$ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)
> 
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth0
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 0
> Down Delay (ms): 0
> 
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:10:18:0c:86:a4
> 
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 12
> Permanent HW addr: 00:02:55:ac:a2:ea
> 
> Any idea why these e100 links report failures so often?  They are
> directly plugged into a Cisco Catalyst 4506.  Thanks.

Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.