Send Linux-cluster mailing list submissions to
linux-cluster redhat com
To subscribe or unsubscribe via the World Wide Web, visit
https://www.redhat.com/mailman/listinfo/linux-cluster
or, via email, send a message with subject or body 'help' to
linux-cluster-request redhat com
You can reach the person managing the list at
linux-cluster-owner redhat com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-cluster digest..."
1. Successful installation on CentOS 5.3 with live KVM migration (Robert Verspuy)
2. CentOS 5.3 X64 & luci (akapp)
3. Re: Linux-cluster Digest, Vol 64, Issue 10 (Bob Peterson)
4. RHEL 4.7 fenced fails -- stuck join state: S-2,2,1 (Robert Hurst)
5. Re: do I have a fence DRAC device? (bergman merctech com)
Date: Tue, 11 Aug 2009 14:59:13 +0200
From: Robert Verspuy <robert exa-omicron nl>
Subject: [Linux-cluster] Successful installation on CentOS 5.3 with
live KVM migration
To: linux clustering <linux-cluster redhat com>
Message-ID: <4A816B21 2040907 exa-omicron nl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Getting cluster software, including KVM virtual machines with live
migration, working can be a very difficult task, with many obstacles.
But I would like to mention to the mailing list that I have just achieved success.
And because nobody is around to tell the wonderful news,
I would like to share my happiness here ;)
My setup: 2 NAS servers and 1 Supermicro blade server with 5 blades.
The NAS servers are running Openfiler 2.3
Both NAS servers have:
1 Transcend IDE 4Gbyte flashcard (on the ide port on the mainboard).
3 x Transcend 4Gbyte usb sticks
8 SATA disks.
The IDE flashcard is set up in a RAID-1 mirror (md0) with one USB
stick, providing the root FS for Openfiler.
The other 2 USB sticks have 5 partitions: 4 x 500 MB and 1 x 2 GB.
Those are mirrored together with RAID-1 (md5 through md8 are the 500 MB
partitions, and md9 is the 2 GB partition).
Then the 8 hard disks are also tied together in pairs as RAID-1 mirrors
(md1 through md4).
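Each of those disk pairs can be assembled with mdadm roughly like this (a sketch only; the device names are my examples, not taken from the post):

```
# Mirror the first pair of SATA disks as md1 (device names are illustrative)
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
# ...and repeat for md2 through md4 with the remaining pairs
```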
Then I used DRBD (8.2.7) to mirror the 4 RAID-1's of the disks (md1
through md4) and the 2 GB mirror (md9)
over the network to the other NAS server (drbd1, drbd2, drbd3 and drbd4).
The 500 MB RAID-1's are used to store the metadata of the 4 disk RAID-1's.
The 2 GB DRBD (drbd0) has internal metadata.
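In drbd.conf terms, the two metadata layouts look roughly like this abbreviated sketch (resource names and the exact md-to-drbd mapping are my guesses):

```
resource r1 {                  # one of the four disk mirrors
  device    /dev/drbd1;
  disk      /dev/md1;
  meta-disk /dev/md5[0];       # external metadata on a 500 MB mirror
}
resource r0 {                  # the shared 2 GB config volume
  device    /dev/drbd0;
  disk      /dev/md9;
  meta-disk internal;          # metadata stored inside the device itself
}
```

In DRBD 8.2 these statements normally sit inside per-host "on <hostname> { }" sections; they are flattened here for brevity.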
The 2 GB DRBD (drbd0) is mounted as ext3 on only one server and is used
to store all kinds
of Openfiler information that is needed on both NAS servers,
like the Openfiler config (mostly), the DHCP leases database, and OpenLDAP.
Heartbeat makes sure that one NAS server is running all the
software; if there are any problems,
it can switch over very easily.
drbd1 through drbd4 are set up as LVM PV's and bound together in one big VG.
From that VG, I created 5 x 5 GB LV's to be used as root devices for
blade1 through blade5.
These LV's are striped across 2 PV's for speed (although that's still the
only bottleneck at the moment, but more about this later...).
These LV's are exported over iSCSI.
I also created one big LV of around 600 GB, which can be mounted as ext3.
Then a few more LV's (around 10 GB each, also iSCSI) are created for every blade.
For every iSCSI LV I create a separate target.
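The LVM layout described above corresponds roughly to these standard commands (a sketch only; the stripe size and the LV names beyond blade1/data0 are my assumptions):

```
# The four DRBD devices become PVs in one big VG
pvcreate /dev/drbd1 /dev/drbd2 /dev/drbd3 /dev/drbd4
vgcreate vg0 /dev/drbd1 /dev/drbd2 /dev/drbd3 /dev/drbd4
# 5 GB root LV per blade, striped over 2 PVs (-i 2); 64 KB stripe is a guess
lvcreate -i 2 -I 64 -L 5G -n blade1 vg0
# one big ~600 GB data LV, striped over all 4 PVs
lvcreate -i 4 -I 64 -L 600G -n data0 vg0
```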
The Supermicro blades can boot from an iSCSI device.
The exact iSCSI target is given through a DHCP option.
I only set up an initiator name in the iSCSI BIOS of the blade.
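If the Supermicro iSCSI BIOS follows the usual convention, the target is handed out via the standard DHCP root-path option (RFC 4173 format). A hypothetical dhcpd.conf fragment (MAC, portal address and IQN are all placeholders) might look like:

```
host blade1 {
  hardware ethernet 00:11:22:33:44:55;
  # root-path format: "iscsi:<server>:<protocol>:<port>:<LUN>:<targetname>"
  option root-path "iscsi:10.0.0.1::3260:0:iqn.2009-08.nl.example:blade1";
}
```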
On the blade LV's I installed CentOS 5.3 (latest updates).
But with a few modifications.
I changed a few things in the initrd to bind eth0 to br0 during the
initrd stage, before Linux takes over the iSCSI session from the BIOS, because
when you have a Linux root over iSCSI and try to attach eth0 to a bridge,
you lose network connectivity for a moment, and that could crash Linux,
because everything it uses comes from the network (iSCSI root).
I also added a little script to the initrd to call iscsiadm with a
hard-coded target, because unfortunately iscsiadm can't read the iSCSI settings
from the BIOS or the Supermicro firmware.
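That initrd script presumably boils down to something like the following open-iscsi calls (the portal address and target name are placeholders, since the original hard-codes its own):

```
# Discover and log in to the hard-coded target before switching to the real root
iscsiadm -m discovery -t sendtargets -p 10.0.0.1:3260
iscsiadm -m node -T iqn.2009-08.nl.example:blade1 -p 10.0.0.1:3260 --login
```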
When the blades are booted, they all join one Red Hat cluster, which needs 3
nodes to be quorate.
Because I have 5 blades, two can fail before everything stops working.
Then I compiled the following software myself, because the versions in the
default and testing repos didn't function correctly:
libvirt 0.7.0 (./configure --prefix=/usr)
kvm-88 (./configure --prefix=/usr --disable-xen)
The /usr/share/cluster/vm.sh from the default CentOS repo is still not
usable for this. I downloaded the latest from upstream,
but it appears that that one is not working correctly either.
So I made some changes myself.
And now it's all working together very nicely.
I just ran a VM on blade1, and while this VM was running bonnie++ on an
NFS mount to the NAS server,
I live-migrated it about 10 times to blade2 and back.
During this bonnie++ run and the live migrations, I pinged the VM.
Where the normal ping times are around 20-35 ms (I pinged over a
VPN line from my home to the data center),
I only saw one or two pings, right around the end of a live migration, that
were around 40-60 ms.
But no drops, and no errors in bonnie++.
I will write some more information about the complete setup and publish it
somewhere on my blog or something.
But I just wanted to let everybody know that it can be done ;)
If you have any questions, let me know.
The only 'problem' I still have is the speed to and from the disks.
When I update any settings on the blade server, I always do this on
blade1 first. Then I shut it down, and on the NAS server I copy the content
of its iSCSI LV to an image file on the ext3 LV.
Then I can power up blade1, wait until it re-enters the cluster,
and then, one by one, shut down each next blade, on the NAS copy the image
from the ext3 LV to that blade's LV,
and start the blade again.
I use drbd1 through drbd4 as 4 PV's for a VG.
The speed (hdparm -t on the NAS) of all PV's is around 75 MB/sec
(except for one, which is 45 MB/sec).
The blade LV (/dev/vg0/blade1 for example) is striped over 2 PV's.
The Speed (hdparm -t) of /dev/vg0/blade1 is 122MB/sec.
The ext3 LV (/dev/vg0/data0) is striped over 4 PV's.
The Speed (hdparm -t) of /dev/vg0/data0 is 227 MB/sec.
But when copying from the blade LV to the ext3 LV:
dd if=/dev/vg0/blade1 of=/mnt/vg0/data0/vm/image/blade_v2.7.img
it takes about 70 seconds, which is about 75MB/sec.
but when copying back:
dd if=/mnt/vg0/data0/vm/image/blade_v2.7.img of=/dev/vg0/blade1
It takes about 390 seconds, which is about 13MB/sec
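Those rates are consistent with a roughly 5 GB image (the size of a blade root LV); a quick check of the arithmetic:

```shell
# Throughput check: ~5 GiB moved in 70 s vs. 390 s (figures from the post)
awk 'BEGIN {
  size_mb = 5 * 1024                       # image size in MiB
  printf "LV -> file: %.0f MB/s\n", size_mb / 70
  printf "file -> LV: %.0f MB/s\n", size_mb / 390
}'
```

This prints about 73 MB/s for the fast direction and 13 MB/s for the slow one, matching the quoted figures.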
I think it has something to do with the ext3 LV being striped over 4 PV's.
So I will try to create a new ext3 LV striped across 2 PV's and see if
this is faster.
3892 DB Zeewolde
Tel.: 088-OMICRON (66 427 66)
Date: Tue, 11 Aug 2009 14:58:38 +0200
From: akapp <akapp fnds3000 com>
Subject: [Linux-cluster] CentOS 5.3 X64 & luci
To: linux-cluster redhat com
Message-ID: <34CB6BB9-6856-4E52-B946-5465F90B081C fnds3000 com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
I have a Sun x4100 server running CentOS 5.3 X64, patched to the latest.
When trying to start luci, it simply fails, no error in /var/log and
nothing in /var/lib/luci/log
I have re-installed luci and ricci a couple of times now, cleaning
/var/lib/luci & /var/lib/ricci between installations.
I have even tried the complete yum groupremove "Clustering" "Cluster
Storage" and re-installed the complete package group again.
I used the ricci/luci combination with great success in 5.2, but both
servers are giving the same problem.
Any pointers will be appreciated.
Here is a screen snippet of the problem:
Installed: luci.x86_64 0:0.12.1-7.3.el5.centos.1
[root clu1 luci]# luci_admin init
Initializing the luci server
Creating the 'admin' user
The admin password has been successfully set.
Generating SSL certificates...
The luci server has been successfully initialized
You must restart the luci server for changes to take effect.
Run "service luci restart" to do so
[root clu1 luci]# service luci restart
Shutting down luci: [ OK ]
Starting luci: Generating https SSL certificates... done
[root clu1 luci]#
Date: Tue, 11 Aug 2009 09:24:14 -0400 (EDT)
From: Bob Peterson <rpeterso redhat com>
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 64, Issue 10
To: Wendell Dingus <wendell bisonline com>
Cc: linux-cluster redhat com
Message-ID: <1891903567 443711249997054068 JavaMail root zmail06 collab prod int phx2 redhat com>
Content-Type: text/plain; charset=utf-8
----- "Wendell Dingus" <wendell bisonline com> wrote:
| Well, here's the entire list of blocks it ignored and the entire
| message section.
| Perhaps I'm just overlooking it but I'm not seeing anything in the
| messages that appears to be a block number. Maybe 1633350398, but if so it is
| not a match.
Your assumption is correct. The block number was 1633350398, which
is labeled "bh = " for some reason.
| Anyway, since you didn't specifically say a new/fixed version of gfs_fsck was
| imminent and that it would likely fix this, we began plan B today. We
Yesterday I pushed a newer gfs_fsck and fsck.gfs2 to their appropriate
git source repositories. So you can build that version from source if
you need it right away. But it sounds like it wouldn't have helped
your problem anyway. What would really be nice is if there is a way
to recreate the problem in our lab. In theory, this error could be
caused by a hardware problem too.
| in another drive, placed a GFS2 filesystem on it and am actively
| copying files
| off to it now. Fingers crossed that nothing will hit a disk block like
| this again, but I couldn't be so lucky probably...
It's hard to say whether you'll hit it again.
Red Hat File Systems
Date: Tue, 11 Aug 2009 10:55:48 -0400
From: "Robert Hurst" <rhurst bidmc harvard edu>
Subject: [Linux-cluster] RHEL 4.7 fenced fails -- stuck join state: S-2,2,1
To: "linux clustering" <linux-cluster redhat com>
Message-ID: <1250002549 2782 36 camel WSBID06223 bidmc harvard edu>
Content-Type: text/plain; charset="us-ascii"
Simple 4-node cluster; 2 nodes have had a GFS shared home directory mounted
for over a month. Today, I wanted to mount /home on a 3rd node, so:
# service fenced start [failed]
Weird. Checking /var/log/messages show:
Aug 11 10:19:06 cerberus kernel: Lock_Harness 2.6.9-80.9.el4_7.10 (built
Jan 22 2009 18:39:16) installed
Aug 11 10:19:06 cerberus kernel: GFS 2.6.9-80.9.el4_7.10 (built Jan 22
2009 18:39:32) installed
Aug 11 10:19:06 cerberus kernel: GFS: Trying to join cluster "lock_dlm", "ccc_cluster47:home"
Aug 11 10:19:06 cerberus kernel: Lock_DLM (built Jan 22 2009 18:39:18)
Aug 11 10:19:06 cerberus kernel: lock_dlm: fence domain not found;
Aug 11 10:19:06 cerberus kernel: GFS: can't mount proto = lock_dlm,
table = ccc_cluster47:home, hostdata =
# cman_tool services
Service Name GID LID State
Fence Domain: "default" 0 2 join
So, a fenced process is now hung:
root 28302 0.0 0.0 3668 192 ? Ss 10:19 0:00
Q: Any idea how to "recover" from this state, without rebooting?
The other two servers are unaffected by this (thankfully) and show
$ cman_tool services
Service Name GID LID State
Fence Domain: "default" 2 2 run -
DLM Lock Space: "home" 5 5 run -
GFS Mount Group: "home" 6 6 run -
Date: Tue, 11 Aug 2009 11:39:39 -0400
From: bergman merctech com
Subject: Re: [Linux-cluster] do I have a fence DRAC device?
To: linux clustering <linux-cluster redhat com>
Message-ID: <27649 1250005179 mirchi>
Content-Type: text/plain; charset=us-ascii
In the message dated: Tue, 11 Aug 2009 14:14:03 +0200,
The pithy ruminations from Juan Ramon Martin Blanco on
<Re: [Linux-cluster] do I have a fence DRAC device?> were:
=> On Tue, Aug 11, 2009 at 2:03 PM, ESGLinux <esggrupos gmail com>
=> > Thanks
=> > I'll check it when I can reboot the server.
=> > greetings,
=> You have a BMC IPMI on the first network interface; it can be configured at
=> boot time (I don't remember if inside the BIOS or by pressing a control-key
=> combination during boot)
Based on my notes, here's how I configured the DRAC interface on a Dell server
for use as a fence device:
Configuring the card from Linux depends on having the Dell OMSA
package installed. Once that's installed, use the following commands:
racadm config -g cfgSerial -o cfgSerialTelnetEnable 1
racadm config -g cfgLanNetworking -o cfgDNSRacName RACNAME_FOR_INTERFACE
racadm config -g cfgLanNetworking -o cfgDNSDomainName DOMAINNAME_FOR_INTERFACE
racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 PASSWORD
racadm config -g cfgLanNetworking -o cfgNicEnable 1
racadm config -g cfgLanNetworking -o cfgNicIpAddress WWW.XXX.YYY.ZZZ
racadm config -g cfgLanNetworking -o cfgNicNetmask WWW.XXX.YYY.ZZZ
racadm config -g cfgLanNetworking -o cfgNicGateway WWW.XXX.YYY.ZZZ
racadm config -g cfgLanNetworking -o cfgNicUseDhcp 0
I also save a backup of the configuration with:
racadm getconfig -f ~/drac_config
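A DRAC configured this way is then typically referenced from /etc/cluster/cluster.conf roughly as follows (a sketch only; the node name, device name, address, and credentials are placeholders, not values from this thread):

```xml
<!-- inside <clusternodes> -->
<clusternode name="node1" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <device name="node1-drac"/>
    </method>
  </fence>
</clusternode>

<!-- inside <cluster>, alongside <clusternodes> -->
<fencedevices>
  <fencedevice agent="fence_drac" name="node1-drac"
               ipaddr="WWW.XXX.YYY.ZZZ" login="root" passwd="PASSWORD"/>
</fencedevices>
```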
Hope this helps,
Mark Bergman voice: 215-662-7310
mark bergman uphs upenn edu fax: 215-614-0266
System Administrator Section of Biomedical Image Analysis
Department of Radiology University of Pennsylvania
PGP Key: https://www.rad.upenn.edu/sbia/bergman
=> > ESG
=> > 2009/8/10 Paras pradhan <pradhanparas gmail com>
=> > On Mon, Aug 10, 2009 at 5:24 AM, ESGLinux<esggrupos gmail com>
=> >> > Hi all,
=> >> > I was designing a 2 node cluster and I was going to use 2
=> >> > PowerEdge 1950. I was going to buy a DRAC card to use for
=> >> > running several commands in the servers I have noticed that
when I run
=> >> this
=> >> > command:
=> >> > #ipmitool lan print
=> >> > Set in Progress : Set Complete
=> >> > Auth Type Support : NONE MD2 MD5 PASSWORD
=> >> > Auth Type Enable : Callback : MD2 MD5
=> >> > : User : MD2 MD5
=> >> > : Operator : MD2 MD5
=> >> > : Admin : MD2 MD5
=> >> > : OEM : MD2 MD5
=> >> > IP Address Source : Static Address
=> >> > IP Address : 0.0.0.0
=> >> > Subnet Mask : 0.0.0.0
=> >> > MAC Address : 00:1e:c9:ae:6f:7e
=> >> > SNMP Community String : public
=> >> > IP Header : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
=> >> > Default Gateway IP : 0.0.0.0
=> >> > Default Gateway MAC : 00:00:00:00:00:00
=> >> > Backup Gateway IP : 0.0.0.0
=> >> > Backup Gateway MAC : 00:00:00:00:00:00
=> >> > 802.1q VLAN ID : Disabled
=> >> > 802.1q VLAN Priority : 0
=> >> > RMCP+ Cipher Suites : 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
=> >> > Cipher Suite Priv Max : aaaaaaaaaaaaaaa
=> >> > : X=Cipher Suite Unused
=> >> > : c=CALLBACK
=> >> > : u=USER
=> >> > : o=OPERATOR
=> >> > : a=ADMIN
=> >> > : O=OEM
=> >> > does this mean that I already have an ipmi card (not a DRAC) that I can
=> >> > use for fencing? If the answer is yes, where the hell must I configure it? I
=> >> > don't see where I can do it.
=> >> > If I haven't got a fencing device, which one do you recommend?
=> >> > Thanks in advance
=> >> > ESG
=> >> >
=> >> > --
=> >> > Linux-cluster mailing list
=> >> > Linux-cluster redhat com
=> >> > https://www.redhat.com/mailman/listinfo/linux-cluster
=> >> >
=> >> Yes you have IPMI, and if you are using a Dell 1950, DRAC should be there
=> >> too. You can see if you have DRAC or not when the server boots,
=> >> before the loading of the OS.
=> >> I have 1850s and I am using DRAC for fencing.
=> >> Paras.
Linux-cluster mailing list
Linux-cluster redhat com
End of Linux-cluster Digest, Vol 64, Issue 12