
Re: [Linux-cluster] GFS + CORAID Performance Problem



My thanks to Jayson and especially Wendy for providing so much help with this issue.  With a little help from Coraid, I've traced the performance problem to one of the two ports on the Coraid device.  In the end, I was able to move the problem from one of my two hosts to the other just by swapping ports.  I'll follow up with Coraid to see if I have a hardware problem.

It's really nice to have such a great level of community support.  Wendy, I'd be happy to share the particulars on my deployment once I get things stabilized.

Thanks again!
Tom

On 12/11/06, Wendy Cheng <wcheng redhat com> wrote:
Jayson Vantuyl wrote:
> Tom,
>
> I currently administer a system running a similar but larger setup, so
> I may be able to help you.
>
> First, make sure you contact Coraid.  They are really good about
> helping with this stuff.
Yes, this is another big area that needs to be looked into. Network
block devices are so new (at least on Linux) that they require some
fine-tuning. If folks have working experience and are willing to share
it, we would be very happy to learn from them.

-- Wendy
>
> Second, have you looked at /dev/etherd/err?  There is usually a lot of
> good debugging there.
>
> Third, have you upgraded the firmware in the Coraid and built the
> newest AoE driver?  These are absolutely critical in getting the best
> performance / reliability and generally the plain kernel driver has
> fallen behind.  They assure me they're working on this and I can vouch
> for the fact that this driver is essentially the one in the kernel
> with development necessary to make it work--not some sort of vendor
> supplied out-of-tree driver.
>
> Finally, make sure you have good switches.  I have had a number of
> switches that drop a packet here and there.  These are death to AoE
> performance.  Gigabit is generally a must as well.
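A cheap host-side check for the packet-loss point above is to snapshot the NIC's error and drop counters before and after a test run. This is only a rough sketch: it assumes a Linux host that exposes the standard counters under /sys/class/net/<iface>/statistics, and the interface name "eth1" in the note below is a placeholder, not something from this thread. These counters show what the local NIC and driver saw, so a clean result does not rule out drops inside a switch.

```python
from pathlib import Path

# Standard per-interface counters exposed by Linux sysfs.
COUNTERS = ("rx_errors", "rx_dropped", "tx_errors", "tx_dropped")


def read_nic_counters(iface, sysfs_root="/sys/class/net"):
    """Read the error/drop counters for one interface into a dict."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}


def counter_deltas(before, after):
    """Return only the counters that increased between two snapshots."""
    return {k: after[k] - before[k] for k in before if after[k] > before[k]}
```

Typical use would be before = read_nic_counters("eth1"), then run the benchmark, then counter_deltas(before, read_nic_counters("eth1")). An empty dict means none of the four counters moved during the run.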
>
> On Dec 10, 2006, at 2:03 AM, bigendian+gfs gmail com wrote:
>
>> I've just set up a new two-node GFS cluster on a CORAID sr1520
>> ATA-over-Ethernet.  My nodes are each quad dual-core Opteron CPU
>> systems with 32GB RAM each.  The CORAID unit exports a 1.6TB block
>> device that I have a GFS file system on.
>>
>> I seem to be having performance issues where certain read system
>> calls take up to three seconds to complete.  My test app is bonnie++,
>> and the slow-downs appear to happen in the "Rewriting" portion of
>> the test, though I'm not sure if this is exclusive.  If I watch top
>> and iostat for the device in question, I see activity on the device,
>> then long (up to three second) periods of no apparent I/O.  During
>> the periods of no I/O the bonnie++ process is blocked on disk I/O, so
>> it seems that the system is trying to do something.  Network traces
>> seem to show that the host machine is not waiting on the RAID array,
>> and the packet following the dead-period seems to always be sent from
>> the host to the coraid device.  Unfortunately, I don't know how to
>> dig in any deeper to figure out what the problem is.
>>
>> Below are strace and tcpdump snippets that show what I'm talking
>> about.  Notice the time stamps and the time spent in system calls in
>> <> brackets after the call.  I'm quite far from a GFS expert, so
>> please let me know if other data would be helpful.
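For anyone wanting to pull the slow calls out of a long trace like the one described above, here is a minimal sketch. It assumes the trace was captured with strace -T, which appends the time spent in each call in <> brackets at the end of the line. The function name and the one-second default threshold are my own choices for illustration, not anything from the original post.

```python
import re

# Trailing "<seconds>" that `strace -T` appends to each completed call.
_DURATION = re.compile(r"<(\d+\.\d+)>\s*$")
# Syscall name, either from "name(" or from a "<... name resumed>" line.
_SYSCALL = re.compile(r"<\.\.\. (\w+) resumed>|(\w+)\(")


def slow_syscalls(lines, threshold=1.0):
    """Return (syscall, seconds, line) for calls taking at least `threshold` seconds."""
    slow = []
    for line in lines:
        line = line.rstrip("\n")
        duration = _DURATION.search(line)
        if duration is None:
            continue  # no timing on this line (unfinished call, signal, exit status)
        seconds = float(duration.group(1))
        if seconds < threshold:
            continue
        match = _SYSCALL.search(line)
        name = (match.group(1) or match.group(2)) if match else "?"
        slow.append((name, seconds, line))
    return slow
```

Passing an open trace file (or any iterable of lines) to slow_syscalls() gives back just the calls at or above the threshold, which makes it easier to line their timestamps up against the tcpdump output.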
>>
>> Any help is much appreciated.
>>
>> Thanks!
>
> --
> Jayson Vantuyl
> Systems Architect
> *Engine Yard*
> jvantuyl engineyard com
>
>
> ------------------------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster redhat com
> https://www.redhat.com/mailman/listinfo/linux-cluster


