[Linux-cluster] IP address missing from interface (sometimes)
Miroslav Zubcic
mvz at nimium.hr
Mon Jul 11 16:45:23 UTC 2005
Hi People.
I have 4 servers under RH Cluster Suite: two clusters, one running Oracle
and one running two very important Java daemons for a local telecom.
Sometimes I see this in the log files (remote syslog outside the cluster(s)):
--------------------------------------------------------
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: IP address 10.100.1.151 missing
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: 0: error fetching interface information: Device not found
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: Check status failed on IP addresses for tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd[17187]: <warning> Restarting locally failed service tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopping service tomcat ...
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Running user script '/etc/init.d/tomcat stop'
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <info> service info: Stopping IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopped service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd[17187]: <notice> Starting stopped service tomcat
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Starting service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Starting IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Sending Gratuitous arp for 10.100.1.151 (00:12:79:D6:7F:30)
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Running user script '/etc/init.d/tomcat start'
Jul 11 11:56:58 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Started service tomcat ...
--------------------------------------------------------
As far as I can see, this "IP address <foo> missing" message comes from the
/usr/lib/clumanager/services/svclib_ip script. Why? Nobody is removing the
service address from the interface, but sometimes the script (or ifconfig)
fails to find the IP address. I haven't hacked anything; everything was
configured with the redhat-config-cluster graphical tool, BTW. The cluster
has two shared raw devices (sda1 and sdb1) from an HP SAN (EVA) for its
configuration. I have WTI NPS power switches ...
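For reference, my understanding of the status check (a sketch of what svclib_ip seems to be doing, not a verbatim copy of the Red Hat script; `check_ip_present` and the captured sample line are my own):

```shell
#!/bin/sh
# Sketch of the kind of check svclib_ip appears to do: fail when the
# service IP does not show up in the interface listing. The function
# name and sample line are mine, not from clumanager.
check_ip_present() {
    # $1 = service IP, $2 = captured `ip -o -4 addr show` output
    echo "$2" | grep -q "inet $1/"
}

sample='2: bond0    inet 10.100.1.151/24 brd 10.100.1.255 scope global secondary bond0:1'
if check_ip_present 10.100.1.151 "$sample"; then
    echo "present"
else
    echo "missing"   # this is the state that triggers the restart
fi
```

If the interface listing transiently omits the secondary address, a check like this would report "missing" even though nobody deleted the address.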
The IPv4 configuration of the machine is the following:
--------------------------------------------------------
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
2: bond0: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global bond0
inet 10.100.1.152/24 brd 10.100.1.255 scope global secondary bond0:0
inet 10.100.1.151/24 brd 10.100.1.255 scope global secondary bond0:1
3: eth0: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global eth0
4: eth1: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global eth1
5: eth2: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global eth2
6: eth3: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global eth3
7: bond1: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global bond1
--------------------------------------------------------
Network 10.100.1.0/24 on bond0 is the data and heartbeat link, and network
10.100.252.0/24 on bond1 is a dedicated VLAN for communication with the
WTI NPS network power switches.
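Since both bond0 slaves (eth1, eth2) carry the same MAC, one thing I want to rule out is a bonding failover racing with the cluster's IP check. A small helper to record the active slave next to each check (`active_slave` is my own helper; the "Currently Active Slave" line is what the bonding driver prints in /proc/net/bonding/bondX when the bond runs in active-backup mode, if I read the driver docs right):

```shell
#!/bin/sh
# Extract the active slave from a /proc/net/bonding/bondX snapshot so it
# can be logged alongside each IP check. active_slave is my own sketch,
# not part of clumanager; the snapshot below is a hand-made example.
active_slave() {
    # $1 = captured contents of /proc/net/bonding/bondX
    echo "$1" | sed -n 's/^Currently Active Slave: //p'
}

snapshot='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth1
MII Status: up'
active_slave "$snapshot"   # prints the slave name, here eth1
```

If the active slave changes at the same moment the service IP "disappears", that would point at the bonding driver rather than at clumanager.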
Every machine in the cluster has two network cards: a gigabit Broadcom
(bcm5700 driver) and a gigabit Intel (e1000 driver).
The cluster nodes run Red Hat Enterprise Linux Advanced Server 3. All
machines are updated from RHN to Update 5. The clumanager package is 1.2.26.1-1.
This is cluster.xml on the first cluster, dumped from the shared raw device:
------------------------------------------------------------------
# strings /dev/sda1
uszgtr01.tel.local
0/usr/sbin/clusvcmgrd
0/usr/sbin/clusvcmgrd
p/usr/sbin/clusvcmgrd
<?xml version="1.0"?>
<cluconfig version="3.0">
<clumembd broadcast="no" interval="750000" loglevel="6" multicast="yes" multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
<cluquorumd loglevel="6" pinginterval="" tiebreaker_ip="10.100.1.1"/>
<clurmtabd loglevel="6" pollinterval="4"/>
<clusvcmgrd loglevel="6"/>
<clulockd loglevel="6"/>
<cluster config_viewnumber="10" key="55c6b6814c16718ea1728bdfcea5cf78" name="java"/>
<sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" rawshadow="/dev/raw/raw2" type="raw"/>
<members>
<member id="0" name="szgtr01" watchdog="no">
<powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxx" port="1" type="wti_nps" user=""/>
<powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxx" port="1" type="wti_nps" user=""/>
</member>
<member id="1" name="szgtr02" watchdog="no">
<powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
<powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
</member>
</members>
<services>
<service checkinterval="8" failoverdomain="javadom" id="0" maxfalsestarts="0" maxrestarts="0" name="tomcat" userscript="/etc/init.d/tomcat">
<service_ipaddresses>
<service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.151" netmask="255.255.255.0"/>
</service_ipaddresses>
</service>
<service checkinterval="8" failoverdomain="javadom" id="1" maxfalsestarts="0" maxrestarts="0" name="rad" userscript="/etc/init.d/radiusd">
<service_ipaddresses>
<service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.152" netmask="255.255.255.0"/>
</service_ipaddresses>
</service>
</services>
<failoverdomains>
<failoverdomain id="0" name="javadom" ordered="yes" restricted="yes">
<failoverdomainnode id="0" name="szgtr01"/>
<failoverdomainnode id="1" name="szgtr02"/>
</failoverdomain>
</failoverdomains>
</cluconfig>
------------------------------------------------------------------------
lsmod:
------------------------------------------------------------------------
Module Size Used by
iptable_filter 2412 0 (autoclean) (unused)
ip_tables 16544 1 [iptable_filter]
cpqci 28612 3
audit 90808 3
bonding1 25156 1
e1000 83784 2
bcm5700 110564 2
bonding 25156 1
microcode 6912 0 (autoclean)
keybdev 2976 0 (unused)
mousedev 5688 0 (unused)
hid 22532 0 (unused)
input 6176 0 [keybdev mousedev hid]
ehci-hcd 20776 0 (unused)
usb-uhci 26860 0 (unused)
usbcore 81152 1 [hid ehci-hcd usb-uhci]
ext3 89960 3
jbd 55156 3 [ext3]
sg 37324 0
qla2300 590844 9
qla2300_conf 301560 0
cciss 45188 4
sd_mod 14128 8
scsi_mod 115496 3 [sg qla2300 cciss sd_mod]
------------------------------------------------------------------------
I set up this script to watch the ifconfig output every half second:
while usleep 500000
do
	ifconfig bond0:1; echo "----------------------------"
done >> /tmp/ifconfig.log &
After 4-5 hours, I had one failure:
# grep addr:10.100.1.151 ifconfig.log | wc -l
23207
# grep 'HWaddr 00:12:79:D6:7F:30' ifconfig.log | wc -l
23208
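The plain loop floods the log; a variant that timestamps and records only the misses might make the failure window easier to correlate with clusvcmgrd's 8-second check interval (`watch_ip` and the iteration cap are mine, not a clumanager tool; in practice you would run it with a huge count):

```shell
#!/bin/sh
# Log a timestamped line only when the service IP is absent from the
# interface listing. watch_ip is my own sketch; the count argument caps
# the iterations so the loop terminates.
watch_ip() {
    addr="$1"; count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! ip -o -4 addr show 2>/dev/null | grep -q "inet $addr/"; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') $addr MISSING"
        fi
        i=$((i + 1))
        # usleep 500000   # same 0.5 s pace as the original loop
    done
}

# Usage, in the background like the original script:
# watch_ip 10.100.1.151 999999999 >> /tmp/ipwatch.log &
```

That way the log only grows when the address actually vanishes, and the timestamps can be lined up against the clusvcmgrd restart messages.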
Can somebody help me with this IP network and the occasionally missing
service IP address?
Thanks ...
P.S.
Service failures are very random. Sometimes 2-3 in one day, sometimes
only one a week ... but this is not acceptable to my customer. :-(
P.P.S.
The situation (bug) is the same on all 4 cluster nodes. The hardware is
HP ProLiant DL380 with hot-swappable SCSI disks in hardware RAID1, 2 CPUs
each, and 12 GB RAM.
--
Miroslav Zubcic, RHCE, Nimium d.o.o., email: <mvz at nimium.hr>
Tel: +385 01 4852 639, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Mrazoviceva 12, 10000 Zagreb, Hrvatska