[Linux-cluster] GFS + CORAID Performance Problem

Tue Dec 12 05:55:05 UTC 2006

My thanks to Jayson and especially Wendy for providing so much help with
this issue.  With a little help from Coraid, I've troubleshot the
performance issues down to one of the two ports on the Coraid device.  In
the end, I was able to move the performance problem from one of my two hosts
to the other just by swapping ports.  I'll follow up with Coraid to see if I
have a hardware problem.

It's really nice to have such a great level of community support.  Wendy,
I'd be happy to share the particulars on my deployment once I get things
stabilized.

Thanks again!
Tom

On 12/11/06, Wendy Cheng <wcheng at redhat.com> wrote:
>
> Jayson Vantuyl wrote:
> > Tom,
> >
> > I currently administer a system running a similar but larger setup, so
> > I may be able to help you.
> >
> > First, make sure you contact Coraid.  They are really good about
> > helping with this stuff.
> Yes, this is another big area that needs to get looked into. Network
> block device is so new (at least on Linux) that it requires some
> fine-tuning. If folks have working experience and are willing to share, we
> would be very happy to learn from them.
>
> -- Wendy
> >
> > Second, have you looked at /dev/etherd/err?  There is usually a lot of
> > good debugging there.
> >
> > Third, have you upgraded the firmware in the Coraid and built the
> > newest AoE driver?  These are absolutely critical in getting the best
> > performance / reliability and generally the plain kernel driver has
> > fallen behind.  They assure me they're working on this and I can vouch
> > for the fact that this driver is essentially the one in the kernel
> > with development necessary to make it work--not some sort of vendor
> > supplied out-of-tree driver.
> >
> > Finally, make sure you have good switches.  I have had a number of
> > switches that drop a packet here and there.  These are death to AoE
> > performance.  Gigabit is generally a must as well.
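
One rough way to look for dropped packets from the host side is to sample
the kernel's per-interface counters before and after a test run. The sketch
below assumes the standard Linux sysfs layout under /sys/class/net; the
interface name "eth1" in the usage note is only a placeholder. It sees only
what the host NIC reports, so frames lost inside a switch or on the storage
side may not show up here.

```python
from pathlib import Path

# Counter files the kernel exposes under /sys/class/net/<iface>/statistics/.
COUNTERS = ("rx_errors", "rx_dropped", "tx_errors", "tx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Return the current error/drop counters for one interface as a dict."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}
```

Usage would be along the lines of: before = read_counters("eth1"), run the
benchmark, after = read_counters("eth1"), then print counter_deltas(before,
after). Any non-zero delta on the AoE-facing interface is worth chasing.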
> >
> > On Dec 10, 2006, at 2:03 AM, bigendian+gfs at gmail.com wrote:
> >
> >> I've just set up a new two-node GFS cluster on a CORAID sr1520
> >> ATA-over-Ethernet.  My nodes are each quad dual-core Opteron CPU
> >> systems with 32GB RAM each.  The CORAID unit exports a 1.6TB block
> >> device that I have a GFS file system on.
> >>
> >> I seem to be having performance issues where certain read system
> >> calls take up to three seconds to complete.  My test app is bonnie++,
> >> and the slow-downs appear to happen in the "Rewriting" portion of
> >> the test, though I'm not sure if this is exclusive.  If I watch top
> >> and iostat for the device in question, I see activity on the device,
> >> then long (up to three-second) periods of no apparent I/O.  During
> >> the periods of no I/O the bonnie++ process is blocked on disk I/O, so
> >> it seems that the system is trying to do something.  Network traces
> >> seem to show that the host machine is not waiting on the RAID array,
> >> and the packet following the dead-period seems to always be sent from
> >> the host to the coraid device.  Unfortunately, I don't know how to
> >> dig in any deeper to figure out what the problem is.
> >>
> >> Below are strace and tcpdump snippets that show what I'm talking
> >> about.  Notice the time stamps and the time spent in system calls in
> >> <> brackets after the call.  I'm quite far from a GFS expert, so
> >> please let me know if other data would be helpful.
> >>
> >> Any help is much appreciated.
> >>
> >> Thanks!
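
For picking the slow calls out of a long trace, a small filter can help.
This is a minimal sketch, assuming the trace was captured with strace -T,
which appends the time spent in each call in <> brackets at the end of the
line, as described above. The function name and the one-second threshold
are illustrative choices, not anything from the original trace.

```python
import re

# `strace -T` appends the time spent in the call as "<seconds>" at line end.
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")
# First identifier followed by "(" is taken to be the syscall name.
SYSCALL_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\(")


def slow_calls(lines, threshold=1.0):
    """Yield (syscall, seconds, line) for calls lasting at least `threshold` seconds."""
    for line in lines:
        line = line.rstrip("\n")
        duration = DURATION_RE.search(line)
        if duration is None:
            continue  # no timing suffix: signals, exit markers, unfinished calls
        seconds = float(duration.group(1))
        if seconds < threshold:
            continue
        name = SYSCALL_RE.search(line)
        yield (name.group(1) if name else "?", seconds, line)
```

Given a trace saved with something like strace -T -tt -o bonnie.strace (the
file name is a placeholder), iterating over slow_calls(open("bonnie.strace"))
lists only the calls that stalled, which makes it easier to line their
timestamps up against a packet capture.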
> >
> > --
> > Jayson Vantuyl
> > Systems Architect
> > *Engine Yard*
> > jvantuyl at engineyard.com
> >
> >
> > ------------------------------------------------------------------------
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster