[Linux-cluster] GFS2 performance on large files

Gordan Bobic gordan at bobich.net
Thu Apr 23 16:04:50 UTC 2009


On Thu, 23 Apr 2009 17:24:18 +0200, Christopher Smith
<csmith at nighthawkrad.net> wrote:

>>> Don't get me wrong, I'm 100% behind using COTS stuff wherever possible,
>>> and setups with DRBD, et al, have worked very well for us in several
>>> locations.  But there are some situations where it just doesn't (eg: SAN
>>> LUNs shared between multiple servers - unless you want to forego the
>>> performance benefits of write caching and DIY with multiple machines,
>>> DRBD and iscsi-target).
>> 
>> I'm not sure write-caching is that big a deal - your SAN will be caching
>> all the writes anyway. Granted, the cache will be about 0.05ms further
>> away than it would be on a local controller, but then again, the
>> clustering overheads will relegate that into the realm of irrelevance.
>> I have yet to see a shared-SAN file system that doesn't introduce
>> performance penalties big enough to make the ping time to SAN a drop
>> in the ocean.
> 
> I was actually thinking of the DIY-SAN scenario.  Eg: you get a couple 
> of 2U machines with a few TB of internal disk, mirror them with DRBD, 
> then export the disk as an iSCSI target.  Set up something (we used 
> heartbeat) to fail over between the two and voila, you have your own 
> redundant iSCSI SAN.
> 
> Unfortunately you then can't get the best benefit from the dirt cheap 
> gigabytes of RAM you can stuff into those machines for write caching, 
> since there's no way to synchronise between the two - so if one machine 
> dies there's data loss.

Using replication protocol B instead of C in DRBD offsets some of that:
writes are acknowledged as soon as they reach the peer's RAM rather than
its disk, so you keep most of the latency benefit of caching while a
single node failure no longer loses the cached data.
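
For reference, a minimal drbd.conf sketch of that sort of pairing might
look something like the below (hostnames, devices and addresses are
invented, and the rest would need tuning for a real deployment):

  resource r0 {
    protocol B;                  # ack once the write has reached the peer's RAM

    on nodeA {                   # hypothetical hostname
      device    /dev/drbd0;
      disk      /dev/sdb1;       # hypothetical backing device
      address   10.0.0.1:7789;
      meta-disk internal;
    }

    on nodeB {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.2:7789;
      meta-disk internal;
    }
  }

You would then export /dev/drbd0 from whichever node is primary via your
iSCSI target of choice and let heartbeat move the target and a floating IP
to the other node on failure, much as described above.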

> The same applies if you want to DIY an SMB or NFS NAS - either no write 
> caching, or a high risk of data corruption.

It depends on how much of your stack is fsync() safe, and even then
you'll still get data loss if the client machine crashes (everything
cached there gets lost); the most fsync() will buy you is an FS consistent
enough for a journal replay/rollback. The only way to ensure 100% recovery
is to make sure that any failed transactions can be replayed from the very
top of the application stack.
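
By way of illustration, this is roughly what "fsync() safe" means at the
bottom of the stack (untested sketch; the path and record below are made
up):

  /* Append a record and force it to stable storage before carrying on.
   * If the machine dies before fsync() returns, the record may be lost;
   * if it dies afterwards, the data should survive a journal replay. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char record[] = "txn 42 committed\n";          /* made-up record */
      int fd = open("/mnt/shared/journal.dat",             /* made-up path   */
                    O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      if (write(fd, record, strlen(record)) != (ssize_t)strlen(record)) {
          perror("write");
          return 1;
      }
      /* Block until the kernel reports the data as being on stable storage
       * (on NFS this forces a flush/commit to the server). */
      if (fsync(fd) != 0) {
          perror("fsync");
          return 1;
      }
      close(fd);
      return 0;
  }

Anything the application had written but not yet synced when the client
dies is gone regardless, which is why the replay has to come from the top
of the stack.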

>> Just being a devil's advocate. ;)
> 
> Me too, to a degree.  We have a couple of SANs, primarily to keep 
> higher-ups feeling warm and fuzzy, and I'm not convinced any of them 
> have delivered anything close to proportionally better performance and 
> reliability than something we could have built ourselves.
> 
> With that said, I wouldn't want to be the guy who (for example) DIYed an 
> NFS NAS to run critical Oracle DBs on, when Oracle support comes back 
> with "until you're running on supported storage, we won't help you with 
> your Oracle problems".

Database on NFS? That's just asking for trouble any way you look at it.

The point is that people who are likely to DIY the solution are the ones
that will do it for performance reasons before cost reasons, and they
are also the ones likely to be more knowledgeable about all the
components in the application stack than vendor support will ever be.
When something goes disastrously wrong, passing the buck
to the vendor is usually not going to help one's job prospects or
reputation - fixing the problem will. The more opaque the components
in the stack are, the lower the chance that it can actually be fixed
in a timely manner.

On a separate note, I think that we have strayed off the topic
sufficiently far to be told off by the list admins, so I'm going
to withdraw from this thread.

Gordan
