[Linux-cluster] IP address missing from interface (sometimes)
Miroslav Zubcic
mvz at nimium.hr
Mon Jul 11 16:45:23 UTC 2005
Hi People.
I have 4 servers under RH Cluster Suite: two clusters, one running Oracle
and one running two very important Java daemons for a local telecom.
Sometimes I see this in the log files (remote syslog outside the cluster(s)):
--------------------------------------------------------
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: IP address 10.100.1.151 missing
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: 0: error fetching interface information: Device not found
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: Check status failed on IP addresses for tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd[17187]: <warning> Restarting locally failed service tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopping service tomcat ...
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Running user script '/etc/init.d/tomcat stop'
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <info> service info: Stopping IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopped service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd[17187]: <notice> Starting stopped service tomcat
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Starting service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Starting IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Sending Gratuitous arp for 10.100.1.151 (00:12:79:D6:7F:30)
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Running user script '/etc/init.d/tomcat start'
Jul 11 11:56:58 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Started service tomcat ...
--------------------------------------------------------
As far as I can see, this "IP address <foo> missing" message comes from the
/usr/lib/clumanager/services/svclib_ip script. Why? Nobody is removing the
service address from the interface, but sometimes the script (or ifconfig)
fails to find the IP address. I haven't hacked anything; everything was
configured with the redhat-config-cluster graphical tool, BTW. The cluster
has two shared raw devices (sda1 and sdb1) from an HP SAN (EVA) for its
configuration. I have WTI NPS power switches ...
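For reference, my understanding of the status check (a sketch of what svclib_ip seems to be doing, not a verbatim copy of the Red Hat script; `check_ip_present` and the captured sample line are my own):

```shell
#!/bin/sh
# Sketch of the kind of check svclib_ip appears to do: fail when the
# service IP does not show up in the interface listing. The function
# name and sample line are mine, not from clumanager.
check_ip_present() {
    # $1 = service IP, $2 = captured `ip -o -4 addr show` output
    echo "$2" | grep -q "inet $1/"
}

sample='2: bond0    inet 10.100.1.151/24 brd 10.100.1.255 scope global secondary bond0:1'
if check_ip_present 10.100.1.151 "$sample"; then
    echo "present"
else
    echo "missing"   # this is the state that triggers the restart
fi
```

If the interface listing transiently omits the secondary address, a check like this would report "missing" even though nobody deleted the address.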
The IPv4 configuration of the machine is the following:
--------------------------------------------------------
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
2: bond0: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global bond0
inet 10.100.1.152/24 brd 10.100.1.255 scope global secondary bond0:0
inet 10.100.1.151/24 brd 10.100.1.255 scope global secondary bond0:1
3: eth0: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global eth0
4: eth1: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global eth1
5: eth2: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.20/24 brd 10.100.1.255 scope global eth2
6: eth3: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global eth3
7: bond1: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
inet 10.100.252.20/24 brd 10.100.252.255 scope global bond1
--------------------------------------------------------
Network 10.100.1.0/24 on bond0 is the data and heartbeat link, and network
10.100.252.0/24 on bond1 is a dedicated VLAN for communication with the
WTI NPS network power switches.
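Since both bond0 slaves (eth1, eth2) carry the same MAC, one thing I want to rule out is a bonding failover racing with the cluster's IP check. A small helper to record the active slave next to each check (`active_slave` is my own helper; the "Currently Active Slave" line is what the bonding driver prints in /proc/net/bonding/bondX when the bond runs in active-backup mode, if I read the driver docs right):

```shell
#!/bin/sh
# Extract the active slave from a /proc/net/bonding/bondX snapshot so it
# can be logged alongside each IP check. active_slave is my own sketch,
# not part of clumanager; the snapshot below is a hand-made example.
active_slave() {
    # $1 = captured contents of /proc/net/bonding/bondX
    echo "$1" | sed -n 's/^Currently Active Slave: //p'
}

snapshot='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth1
MII Status: up'
active_slave "$snapshot"   # prints the slave name, here eth1
```

If the active slave changes at the same moment the service IP "disappears", that would point at the bonding driver rather than at clumanager.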
Every machine in the cluster has two network cards: a gigabit Broadcom
(bcm5700 driver) and a gigabit Intel (e1000 driver).
The cluster nodes run Red Hat Enterprise Linux Advanced Server 3. All
machines are updated from RHN to Update 5. The clumanager package is 1.2.26.1-1.
This is cluster.xml on the first cluster, dumped from the shared raw device:
------------------------------------------------------------------
# strings /dev/sda1
uszgtr01.tel.local
0/usr/sbin/clusvcmgrd
0/usr/sbin/clusvcmgrd
p/usr/sbin/clusvcmgrd
<?xml version="1.0"?>
<cluconfig version="3.0">
<clumembd broadcast="no" interval="750000" loglevel="6" multicast="yes" multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
<cluquorumd loglevel="6" pinginterval="" tiebreaker_ip="10.100.1.1"/>
<clurmtabd loglevel="6" pollinterval="4"/>
<clusvcmgrd loglevel="6"/>
<clulockd loglevel="6"/>
<cluster config_viewnumber="10" key="55c6b6814c16718ea1728bdfcea5cf78" name="java"/>
<sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" rawshadow="/dev/raw/raw2" type="raw"/>
<members>
<member id="0" name="szgtr01" watchdog="no">
<powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxx" port="1" type="wti_nps" user=""/>
<powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxx" port="1" type="wti_nps" user=""/>
</member>
<member id="1" name="szgtr02" watchdog="no">
<powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
<powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
</member>
</members>
<services>
<service checkinterval="8" failoverdomain="javadom" id="0" maxfalsestarts="0" maxrestarts="0" name="tomcat" userscript="/etc/init.d/tomcat">
<service_ipaddresses>
<service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.151" netmask="255.255.255.0"/>
</service_ipaddresses>
</service>
<service checkinterval="8" failoverdomain="javadom" id="1" maxfalsestarts="0" maxrestarts="0" name="rad" userscript="/etc/init.d/radiusd">
<service_ipaddresses>
<service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.152" netmask="255.255.255.0"/>
</service_ipaddresses>
</service>
</services>
<failoverdomains>
<failoverdomain id="0" name="javadom" ordered="yes" restricted="yes">
<failoverdomainnode id="0" name="szgtr01"/>
<failoverdomainnode id="1" name="szgtr02"/>
</failoverdomain>
</failoverdomains>
</cluconfig>
------------------------------------------------------------------------
lsmod:
------------------------------------------------------------------------
Module Size Used by
iptable_filter 2412 0 (autoclean) (unused)
ip_tables 16544 1 [iptable_filter]
cpqci 28612 3
audit 90808 3
bonding1 25156 1
e1000 83784 2
bcm5700 110564 2
bonding 25156 1
microcode 6912 0 (autoclean)
keybdev 2976 0 (unused)
mousedev 5688 0 (unused)
hid 22532 0 (unused)
input 6176 0 [keybdev mousedev hid]
ehci-hcd 20776 0 (unused)
usb-uhci 26860 0 (unused)
usbcore 81152 1 [hid ehci-hcd usb-uhci]
ext3 89960 3
jbd 55156 3 [ext3]
sg 37324 0
qla2300 590844 9
qla2300_conf 301560 0
cciss 45188 4
sd_mod 14128 8
scsi_mod 115496 3 [sg qla2300 cciss sd_mod]
------------------------------------------------------------------------
I set up this script to watch the ifconfig output every half second:
while usleep 500000
do
	ifconfig bond0:1; echo "----------------------------"
done >> /tmp/ifconfig.log &
After 4-5 hours, I had one failure:
# grep addr:10.100.1.151 ifconfig.log | wc -l
23207
# grep 'HWaddr 00:12:79:D6:7F:30' ifconfig.log | wc -l
23208
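The plain loop floods the log; a variant that timestamps and records only the misses might make the failure window easier to correlate with clusvcmgrd's 8-second check interval (`watch_ip` and the iteration cap are mine, not a clumanager tool; in practice you would run it with a huge count):

```shell
#!/bin/sh
# Log a timestamped line only when the service IP is absent from the
# interface listing. watch_ip is my own sketch; the count argument caps
# the iterations so the loop terminates.
watch_ip() {
    addr="$1"; count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! ip -o -4 addr show 2>/dev/null | grep -q "inet $addr/"; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') $addr MISSING"
        fi
        i=$((i + 1))
        # usleep 500000   # same 0.5 s pace as the original loop
    done
}

# Usage, in the background like the original script:
# watch_ip 10.100.1.151 999999999 >> /tmp/ipwatch.log &
```

That way the log only grows when the address actually vanishes, and the timestamps can be lined up against the clusvcmgrd restart messages.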
Can somebody help me with this IP network and the occasionally missing
service IP address?
Thanks ...
P.S.
Service failures are very random. Sometimes 2-3 in one day, sometimes
only one a week ... but this is not acceptable to my customer. :-(
P.P.S.
The situation (bug) is the same on all 4 cluster nodes. The hardware is
HP ProLiant DL380 with hot-swappable SCSI disks in hardware RAID1, 2 CPUs
each, and 12 GB RAM.
--
Miroslav Zubcic, RHCE, Nimium d.o.o., email: <mvz at nimium.hr>
Tel: +385 01 4852 639, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Mrazoviceva 12, 10000 Zagreb, Hrvatska