Re: [Linux-cluster] bonding



I have the same hardware configuration on 11 nodes, but without any of the spurious failover events.  The main thing I had to do differently was raise the bond device count to 2 with max_bonds=2 (the driver defaults to 1), since mine are teamed across dual tg3/e1000 ports on the motherboard and a PCI card.  bond0 is on a gigabit switch, while bond1 is on a 100 Mb/s switch.  In /etc/modprobe.conf:

alias bond0 bonding
alias bond1 bonding
options bonding max_bonds=2 mode=1 miimon=100 updelay=200
alias eth0 e1000
alias eth1 e1000
alias eth2 tg3
alias eth3 tg3

So eth0/eth2 are teamed, and eth1/eth3 are teamed.  In dmesg:

e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
bonding: bond0: making interface eth0 the new active one 0 ms earlier.
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: enslaving eth2 as a backup interface with a down link.
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
bonding: bond0: link status definitely up for interface eth2.
e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
bonding: bond1: making interface eth1 the new active one 0 ms earlier.
bonding: bond1: enslaving eth1 as an active interface with an up link.
bonding: bond1: enslaving eth3 as a backup interface with a down link.
bond0: duplicate address detected!
tg3: eth3: Link is up at 100 Mbps, full duplex.
tg3: eth3: Flow control is off for TX and off for RX.
bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
bonding: bond1: link status definitely up for interface eth3.
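
To compare how often links flap on different nodes, the "link status definitely up/down" messages can be tallied per interface. The following is only a rough sketch, assuming the kernel log has been saved as text and the messages have the same wording as in this thread; the function name count_link_transitions is made up for illustration and is not part of any bonding tool.

```python
import re
from collections import Counter

# Matches bonding driver messages worded like the ones in this thread, e.g.
#   bonding: bond0: link status definitely down for interface eth2, disabling it
#   bonding: bond0: link status definitely up for interface eth2.
# Other driver versions may word these differently (assumption).
LINK_RE = re.compile(
    r"bonding: (?P<bond>\w+): link status definitely "
    r"(?P<state>up|down) for interface (?P<iface>\w+)"
)


def count_link_transitions(log_text):
    """Count confirmed link transitions, keyed by (bond, interface, state)."""
    counts = Counter()
    for match in LINK_RE.finditer(log_text):
        key = (match.group("bond"), match.group("iface"), match.group("state"))
        counts[key] += 1
    return counts


# Sample taken from the messages quoted in this thread.
sample_log = """\
bonding: bond0: link status definitely down for interface eth2,
disabling it
e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
bonding: bond0: link status definitely up for interface eth2.
"""

for (bond, iface, state), n in sorted(count_link_transitions(sample_log).items()):
    print(f"{bond} {iface} {state}: {n}")
```

The "enabling it in 200 ms" lines are deliberately not counted, since they only announce a pending change during the updelay window.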

$ uname -srvmpio
Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:54

$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0b

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:53
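
If you want to check the Link Failure Count on many nodes without reading each file by eye, the /proc output above is regular enough to parse. This is a minimal sketch, assuming the "Key: value" layout shown in this thread; parse_bonding_status and flapping_slaves are hypothetical helper names, and other driver versions may print extra or different fields.

```python
def parse_bonding_status(text):
    """Parse the text of /proc/net/bonding/<bond> into nested dicts.

    Returns {"bond": {...}, "slaves": {iface: {...}}}. Fields that appear
    before the first "Slave Interface" line are treated as bond-level fields.
    """
    bond = {}
    slaves = {}
    current = bond
    for line in text.splitlines():
        # Split on the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "Key: value"
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        else:
            current[key] = value
    return {"bond": bond, "slaves": slaves}


def flapping_slaves(status, threshold=0):
    """Return {iface: failure_count} for slaves above the threshold."""
    result = {}
    for name, fields in status["slaves"].items():
        failures = int(fields.get("Link Failure Count", "0"))
        if failures > threshold:
            result[name] = failures
    return result
```

On a node this could be called as flapping_slaves(parse_bonding_status(open("/proc/net/bonding/bond0").read())); with the output quoted below from the original post it would flag eth2 only.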


On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote:
I have every node in my four-node cluster set up to do active-backup
bonding, and the drivers loaded for the bonded network interfaces vary
between tg3 and e100.  All interfaces using the e100 driver report
errors much like what you see here:

bonding: bond0: link status definitely down for interface eth2, disabling it
e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
bonding: bond0: link status definitely up for interface eth2.

This happens all day on every node.  I have configured the bonding
module to do MII link monitoring every 100 milliseconds, and it is
using basic carrier link detection to test whether the interface is
alive.  No modules were custom built on these nodes, and the OS is
CentOS 4.3.

Some more relevant information is below (this display is consistent
across all nodes):

[smccl tf35 ~]$uname -srvmpio
Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686 i386 GNU/Linux

[smccl tf35 ~]$head -5 /etc/modprobe.conf
alias bond0 bonding
options bonding miimon=100 mode=1
alias eth0 tg3
alias eth1 tg3
alias eth2 e100

[smccl tf35 ~]$cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:0c:86:a4

Slave Interface: eth2
MII Status: up
Link Failure Count: 12
Permanent HW addr: 00:02:55:ac:a2:ea

Any idea why these e100 links report failures so often?  They are
directly plugged into a Cisco Catalyst 4506.  Thanks.

Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.
