
Re: [Linux-cluster] GFS + CORAID Performance Problem



Wendy Cheng wrote:

bigendian+gfs gmail com wrote:

I've just set up a new two-node GFS cluster on a CORAID sr1520 ATA-over-Ethernet array. Each node is a quad dual-core Opteron system with 32GB of RAM. The CORAID unit exports a 1.6TB block device that I have a GFS file system on.

I seem to be having performance issues where certain read system calls take up to three seconds to complete. My test app is bonnie++, and the slowdowns appear to happen in the "Rewriting" portion of the test, though I'm not sure they are exclusive to it. If I watch top and iostat for the device in question, I see activity on the device, then long (up to three-second) periods of no apparent I/O. During these idle periods the bonnie++ process is blocked on disk I/O, so it seems that the system is trying to do something. Network traces seem to show that the host machine is not waiting on the RAID array, and the packet following each dead period always seems to be sent from the host to the CORAID device. Unfortunately, I don't know how to dig any deeper to figure out what the problem is.
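To narrow down which calls stall, a rough Python sketch of a rewrite-style pass (read a chunk, dirty it, seek back, write it back) with a timer around each read() may help. This is not bonnie++'s code; the function name `rewrite_with_timing`, the 8192-byte chunk size, and the 1-second threshold are hypothetical choices for illustration. Pointed at a scratch file on the GFS mount, it would report the offset of any read that blocks longer than the threshold, which is roughly the per-call information `strace -T` gives.

```python
import os
import time


def rewrite_with_timing(path, chunk_size=8192, slow_threshold=1.0):
    """Hypothetical helper: read each chunk, dirty it, seek back and rewrite
    it in place, timing every read() call.

    Returns (total_reads, slow_reads), where slow_reads is a list of
    (offset, seconds) for reads that took longer than slow_threshold.
    """
    slow_reads = []
    total_reads = 0
    fd = os.open(path, os.O_RDWR)
    try:
        offset = 0
        while True:
            start = time.perf_counter()
            buf = os.read(fd, chunk_size)
            elapsed = time.perf_counter() - start
            if not buf:
                break  # end of file reached
            total_reads += 1
            if elapsed > slow_threshold:
                slow_reads.append((offset, elapsed))
            # "Dirty" the chunk by flipping its first byte, then seek back
            # to where the chunk started and write it over the original.
            dirtied = bytes([buf[0] ^ 0xFF]) + buf[1:]
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, dirtied)
            offset += len(buf)
    finally:
        os.close(fd)
    return total_reads, slow_reads
```

Note that it modifies the file in place, so it should only be run against a throwaway test file.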

Wait... sorry, I didn't read carefully. Now I see the 3-second delay in the strace. That doesn't look like a bonnie++ issue. Does bonnie++ run on a single node, or do you dispatch it on both nodes (in different directories)? This is more complicated than I originally expected (since this is a network block device?). I need to think about how to catch the culprit... it could be a memory issue, though. Could you try running bonnie++ with 4G of memory to see whether the 3-second read delay still shows up?

-- Wendy


I think we know about this issue. Note that bonnie++ doesn't keep the file size in the benchmark's local memory; it always invokes a "stat" system call to poll for the file size before it can do the read and rewrite. GFS1 has a known performance issue with the stat system call (which we hope GFS2 will address), and since file sizes in bonnie++ tend to be small, the stat() call overhead becomes very obvious. It will be worse in your case due to the filesystem size.
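The stat-then-read pattern described above can be sketched roughly as follows, timing the two steps separately to see which one dominates. This is a hypothetical Python illustration, not bonnie++'s actual implementation; `stat_then_read` is an invented name. With small files each read moves little data, so if stat() is slow on the filesystem, its cost would show up as a large share of the total time.

```python
import os
import time


def stat_then_read(paths):
    """Hypothetical helper: for each file, stat() it to learn its size,
    then read exactly that many bytes, timing the two steps separately.

    Returns (stat_seconds, read_seconds, bytes_read).
    """
    stat_seconds = 0.0
    read_seconds = 0.0
    bytes_read = 0
    for path in paths:
        start = time.perf_counter()
        size = os.stat(path).st_size  # size is polled each time, never cached
        stat_seconds += time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read(size)
        read_seconds += time.perf_counter() - start
        bytes_read += len(data)
    return stat_seconds, read_seconds, bytes_read
```

Comparing `stat_seconds` against `read_seconds` on the GFS mount and on a local filesystem would give a rough sense of how much of the time goes to stat().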

Hmm, wait... I didn't check your strace carefully. Now I see the 3-second delay...

