[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

RE: [Linux-cluster] rhcs + gfs performance issues



Thanks so much for the reply, hopefully this will lead to something.

On Fri, 2008-10-03 at 17:25 +0100, Gordan Bobic wrote:
> It sounds like you have a SAN (fibre attached storage) that you are trying to turn into a NAS. That's justifiable if you have multiple mirrored SANs, but makes a mockery of HA if you only have one storage device since it leaves you with a single point of failure regardless of the number of front end nodes.

Understood on the SAN single point of failure.  We're addressing HA on
the front end; we don't have the money to address the back end yet.
Storage is something you set up once and don't have to mess with again,
and it doesn't have things like application issues — it's just storage.
So barring a hardware failure not covered by redundant power supplies,
spare disks, etc, it shouldn't have problems.  Having a cluster on the
front end lets us survive a software failure on one node, or reboot one
node, while providing zero downtime to the clients.

> Do you have a separate gigabit interface/vlan just for cluster communication? RHCS doesn't use a lot of sustained bandwidth but performance is sensitive to latencies for DLM comms. If you only have 2 nodes, a direct crossover connection would be ideal.

Not sure how to accomplish that.  How do you get certain services of the
cluster environment (the cluster/DLM traffic) to talk over one
interface, and other services (such as the shares) over another?  The
only other interface I have configured is for the fence devices (Dell
DRAC cards).
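[For context, a common way to do this: cman binds its cluster traffic to
whichever interface the node name in cluster.conf resolves to, so you
can point the node names at a private subnet.  A sketch, assuming
hypothetical node names and a made-up 10.0.0.x crossover subnet — verify
against the cluster.conf documentation for your release:

```
# /etc/hosts on both nodes (10.0.0.x = private crossover link)
10.0.0.1   node1-priv
10.0.0.2   node2-priv

# /etc/cluster/cluster.conf: use the private names as the node names
<clusternode name="node1-priv" nodeid="1">
    ...
</clusternode>
<clusternode name="node2-priv" nodeid="2">
    ...
</clusternode>
```

NFS and other client-facing services then keep using the public
interface addresses as before.]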

> How big is your data store? Are files large or small? Are they in few directories with lots of files (e.g. Maildir)?

Very much mixed.  We have SAS and SATA in the same SAN device, carved
out based on application performance needs.  Some volumes are large
(7TB), some small (2GB).  Files range from large (video) down to
millions of 1k user files.

> Load averages will go up - that's normal, since there is added latency (round trip time) from locking across nodes. Unless your CPUs are at 0% idle, the servers aren't running out of steam. So don't worry about it.

Understood.  That was just the measure I used for comparison.  There is
definite performance lag during these higher load averages.  What I was
trying (and failing) to communicate was that all we are doing here is
serving files over NFS — we're not running apps on the cluster itself —
so it's difficult for me to understand why file serving would be so
slow, or would ever drive the load on a box that high.  And the old file
server did not have these performance issues doing the same tasks with
less hardware, bandwidth, etc.
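[One quick way to tell whether that load average is real CPU work or
threads stuck waiting on locks/IO — a generic sketch, nothing specific
to this setup:

```shell
# Count processes in uninterruptible sleep (D state).  On a busy
# GFS/NFS server these are typically threads waiting on DLM locks or
# disk I/O rather than burning CPU; lots of them inflates the load
# average without the CPUs being busy.
ps -eo state,comm | awk '$1 ~ /^D/ {print; n++} END {print n+0, "processes in D state"}'

# A high "b" (blocked) column in `vmstat 1` with low "us"/"sy" CPU
# would point the same way: latency, not a box out of steam.
```
]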

> Also note that a clustered FS will _ALWAYS_ be slower than a non-clustered one, all things being equal. No exceptions. Also, if you are load sharing across the nodes, and you have Maildir-like file structures, it'll go slower than a purely fail-over setup, even on a clustered FS (like GFS), since there is no lock bouncing between head nodes. For extra performance, you can use a non-clustered FS as a failover resource, but be very careful with that since dual mounting a non-clustered FS will destroy the volume virtually instantly.

Agreed.  That's not the comparison, though.  Our old file server was
running a clustered file system from Tru64 (AdvFS).  Our expectation was
that newer technology, plus a major upgrade in hardware, would perform
at least as well as what we had.  It has not; it is far worse.

> Provided that your data isn't fundamentally unsuitable for being handled by a clustered load sharing setup, you could try increasing lock trimming and increasing the number of resource groups. Search through the archives for details on that.

Can you point me in the direction of the archives?  I can't seem to find
them.
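[For reference, the lock-trimming knobs mentioned above are GFS1
tunables set per mount point.  A sketch — the mount point and values are
placeholders, and the tunable names should be checked against
`gfs_tool settune` on your release:

```shell
# Purge a larger percentage of unused glocks from the glock cache
# (placeholder value of 50%).
gfs_tool settune /mnt/gfs glock_purge 50

# Demote unused locks sooner than the default (placeholder seconds).
gfs_tool settune /mnt/gfs demote_secs 200
```

Resource-group size, by contrast, is fixed when the filesystem is made
(the `-r <MB>` option to mkfs.gfs), so changing the number of resource
groups means re-creating the filesystem.]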

> More suggestions when you provide more details on what your data is like.

My apologies for the lack of detail; I'm a bit lost as to what to
provide.  It's basic files, large and small: user volumes, webserver
volumes, Postfix mail volumes, etc.  Thanks so much!

> Gordan
> 


