
[Linux-cluster] Backups with GFS and RAM sizing

My backups are taking a very long time, and I am looking for ways to
optimize the process, so suggestions are welcome.

The cluster consists of:
2 x IBM x335, dual 3.6 GHz Xeon processors

RHEL4U4 runs off local 73GB U320 SCSI disks in a hardware mirror. We are
running RHCS with GFS and DLM. The data being backed up is on an IBM
SAN; the slow backups are on an IBM DS4800 attached over 2Gbps Fibre
Channel. The DS4800 has 16GB of cache and is not being over-utilized.

The data is divided into ten 2TB chunks, mounted under /data/T1
through /data/T10. The backup software is IBM Tivoli Storage Manager.
When a backup starts, the client processes every file to determine
whether it needs to be backed up. Last night it took 15 hours to process
6.7 million files and then back up 4200 files (9GB) in total. I do not
currently know how long the 9GB transfer itself takes, but a standard
copy would finish relatively quickly over gigabit Ethernet. The Tivoli
backup server caches the data on separate SAN disks before writing to
tape, so the slowdown is not there.

From what I can tell, the slowness is only on the Red Hat servers,
during processing. Comparing this to some other servers with large
backups: our AIX servers can scan 12 million files in about 5 hours, and
a Netware server scanned 17.2 million files in 16 hours. The AIX servers
are hard to compare, since they run on totally different hardware, but
the Netware server is the same model, with only 1GB of RAM, using 1.5TB
of FC SATA on an IBM FastT 100. If the Netware server takes the same
time to scan more files with less RAM and slower disks, why are my Linux
servers so slow? I know Netware has excellent disk I/O, but this seems
to be more of a processing issue. I don't think the content or size of
the files should matter, but according to our backup admin, Tivoli
checks certain attributes (file size, date, permissions, etc.) to see
if anything has changed.
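For what it's worth, that per-file attribute check can be sketched in
shell (the /tmp path and stat fields below are just for illustration,
not how TSM actually does it). The point is that every file costs at
least one stat(), and on GFS each stat() can mean acquiring a
cluster-wide lock, which is where metadata-heavy scans lose time
compared to a local filesystem:

```shell
# Illustrative stand-in for an incremental-scan pass: stat() every
# file and print the attributes an incremental backup would compare
# (name, size, mtime, mode). GNU stat format specifiers.
mkdir -p /tmp/scan-demo
echo hello > /tmp/scan-demo/a.txt
find /tmp/scan-demo -type f | while read -r f; do
    stat --format '%n %s %Y %a' "$f"
done
```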

I am looking for backup-client optimizations, but would also like to
hear what others are doing or can suggest. CPU usage ranges between
80-100%, so I assume it is hitting both processors. If I try a manual
copy from this server during a backup, a copy that should take 10
seconds takes 10 minutes. If I move the share to the server not
performing backups, still using the same GFS storage, the copy takes 10
seconds, so the SAN does not appear to be the problem. The slowness
appears to be in the file-scan stage that determines what needs to be
backed up.

Is there any way I can optimize disk access, RAM, or CPU that might
help? I am considering adding a server to split the load: two servers
with two Samba shares each, and a third providing failover and backup
services. If adding RAM would help, I am open to that as well.
Additional CPUs might help, since utilization is 80-100% during backups,
but that would mean purchasing new servers and moving everything, which
is not appealing.

If I run free:

             total       used       free     shared    buffers
Mem:       4040864    4020716      20148          0      20512
-/+ buffers/cache:    3817192     223672
Swap:      2097144        224    2096920
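In case it helps anyone reading the numbers: on this RHEL4-era free,
the "-/+ buffers/cache" line already backs the buffer and page cache
out of "used", so its last column is what is genuinely free for
applications. A quick awk over the output above (pasted into a here-doc
here so the snippet is self-contained) pulls that figure out:

```shell
# Extract the application-level free memory (kB) from the
# "-/+ buffers/cache" line of the free output quoted above.
free_kb=$(awk '/buffers\/cache/ {print $NF}' <<'EOF'
             total       used       free     shared    buffers
Mem:       4040864    4020716      20148          0      20512
-/+ buffers/cache:    3817192     223672
Swap:      2097144        224    2096920
EOF
)
echo "available to applications: ${free_kb} kB"
```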

Since swap is barely used, I assume most of the RAM is going to file
cache, which makes it hard to tell how much is actually available for
processing. Are there any guidelines I can use to size the server
(specifically RAM) based on the number of files or the amount of data?
I recently upgraded from 4GB to 8GB because the servers would
occasionally run out of memory.
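As a very rough rule of thumb (the ~1 kB-per-file figure below is my
own assumption, not a documented number): if each cached inode, dentry,
and GFS glock costs on the order of 1 kB, then holding the metadata for
an entire scan in RAM needs roughly one kB per file, so a 6.7
million-file scan would want several GB of cache on top of everything
else:

```shell
# Back-of-the-envelope: files * ~1 kB each, expressed in GB.
# The 1 kB/file cost is an assumption, not a measured value.
files=6700000
echo "~${files} files * ~1 kB/file = ~$((files / 1024 / 1024)) GB of metadata cache"
```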

Since Tivoli compares various attributes during processing, is it
possible I am seeing problems specific to the clustered file system
(e.g., du -sh on /data/T1 takes minutes the first time)? Is there any
way to speed this up? Are others using snapshot pools or some other
backup method? Thanks
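P.S. Regarding the du -sh observation above, here is a quick way to see
the cold-vs-warm metadata-cache effect (using a throwaway /tmp
directory in place of /data/T1 so anyone can run it; on GFS the cold
pass also has to acquire a lock per inode, so the gap should be much
larger there):

```shell
# Time the same metadata walk twice: the first pass must fetch every
# inode from disk (and its cluster lock, on GFS); the second pass is
# served from cache and is typically far faster.
mkdir -p /tmp/dudemo
for i in 1 2 3; do echo x > /tmp/dudemo/f$i; done
time du -sh /tmp/dudemo      # cold: stat() may hit disk/glocks
time du -sh /tmp/dudemo      # warm: metadata served from cache
```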
