[Linux-cluster] Starter Cluster / GFS

Gordan Bobic gordan at bobich.net
Thu Nov 11 08:56:09 UTC 2010


Nicolas Ross wrote:
>>> The volume will be composed of 7 x 1TB disks in RAID5, so 6 TB.
>>
>> Be careful with that arrangement. You are right up against the ragged
>> edge in terms of data safety.
>>
>> 1TB disks are consumer grade SATA disks with non-recoverable read error
>> rates of about 10^-14 per bit read. That is one non-recoverable error
>> per 11TB read.
>>
>> Now consider what happens when one of your disks fails. You have to read
>> 6TB to reconstruct the failed disk. With an error rate of 1 in 11TB, the
>> chances of another failure occurring in 6TB of reads are about 53%. So
>> the chances are that during this operation you are going to have another
>> failure, and the chances are that your RAID layer will kick the disk out
>> as faulty - at which point you will find yourself with 2 failed disks in
>> a RAID5 array and in need of a day or two of downtime to copy your data
>> to a fresh array and hope for the best.
>>
>> RAID5 is ill suited to arrays over 5TB. Using enterprise grade disks will
>> gain you an improved error rate (10^-15), which makes it good enough - if
>> you also have regular backups. But enterprise grade disks are much 
>> smaller
>> and much more expensive.
>>
>> Not to mention that your performance on small writes (smaller than the
>> stripe width) will be appalling with RAID5 due to the read-modify-write
>> operation required to update the parity, which will reduce your
>> effective performance to that of a single disk.
> 
> Wow...
> 
> The enclosure I will use (and already have) is an ActiveStorage
> ActiveRAID in a 16 x 1TB config
> (http://www.getactivestorage.com/activeraid.php).

I've dealt with them before. All I'm going to say is - disregard any and 
all performance figures they claim and work out what the performance is 
likely to be from basic principles. Provided you stick to that and 
ignore the marketing specsmanship, as far as enterprisey storage 
appliances go, those are reasonably good value for money.
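
If it helps, this is the sort of basic-principles estimate I mean - a 
rough sketch, where the per-disk numbers are illustrative assumptions 
for a generic 7200rpm SATA disk, not measurements of this (or any) 
particular box:

    # Rough performance model for a RAID5 array. The per-disk numbers
    # are illustrative assumptions, not measured figures.
    DISK_SEQ_MBPS = 100    # assumed sequential MB/s per disk
    DISK_RAND_IOPS = 75    # assumed random IOPS per 7200rpm disk

    def raid5_estimate(n_disks):
        data_disks = n_disks - 1        # one disk's worth is parity
        return {
            # large sequential I/O scales with the data disks
            "seq_MBps": DISK_SEQ_MBPS * data_disks,
            # each small random write costs ~4 disk ops (read old data
            # and old parity, write new data and new parity)
            "rand_write_iops": DISK_RAND_IOPS * n_disks / 4.0,
        }

    print(raid5_estimate(7))
    # {'seq_MBps': 600, 'rand_write_iops': 131.25}

Compare the vendor's quoted figures against something like that and the 
specsmanship becomes obvious.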

> The
> drives are Hitachi model HDE721010SLA33. From what I could find, the
> error rate is 1 in 10^15.

That makes it less bad than my figures above, but still, be careful.
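
To make the arithmetic behind my figures explicit (a quick sketch; 
10^-14 and 10^-15 are the per-bit rates from the respective data 
sheets):

    # Expected non-recoverable read errors while rebuilding a degraded
    # 7 x 1TB RAID5 array: the remaining 6TB has to be read in full.
    bytes_read = 6e12

    print(bytes_read * 8 * 1e-14)   # consumer disks: ~0.48 expected
                                    # errors, i.e. roughly the coin
                                    # toss I described above
    print(bytes_read * 8 * 1e-15)   # your 10^-15 disks: ~0.05, i.e.
                                    # about a 5% chance per rebuild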

>>> It will host many, many small files, and some bigger files. But the
>>> files that change the most often will most likely be smaller than the
>>> block size.
>>
>> That sounds like a scenario from hell for RAID5 (or RAID6).
> 
> What do you suggest to achieve sizes in the range of 6-7 TB, maybe more?

RAID10 if you need more performance than that of a single disk, unless 
your I/O operations are always very big (bigger than the RAID stripe width).

stripe_width = chunk_size * number_of_data_disks

(all disks minus one for RAID5, since one chunk per stripe holds the 
parity; half the disks for RAID10)

Smaller disks are good for reducing rebuild times, and a larger number 
of smaller disks will give you better performance. It all depends on 
the nature of the I/O and the performance you require.
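
To put numbers on that (the chunk size here is an assumed example, not 
a recommendation):

    # Stripe width for the formula above, and the read-modify-write
    # penalty a sub-stripe-width write incurs on RAID5.
    chunk_size_kb = 64
    n_disks = 7

    # data portion of a stripe; one chunk per stripe holds parity
    stripe_width_kb = chunk_size_kb * (n_disks - 1)
    print(stripe_width_kb)    # 384 - anything smaller gets RMW-ed

    # the parity update behind that penalty: read old data and old
    # parity, recompute, write new data and new parity - 4 disk ops
    # where RAID10 would need 2 and a plain disk 1
    def new_parity(old_parity, old_data, new_data):
        return bytes(p ^ a ^ b
                     for p, a, b in zip(old_parity, old_data, new_data))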

>>> The gfs will not be used for I/O-intensive tasks, that's where the
>>> standalone volumes come into play. It'll be used to access many files,
>>> often. Specifically, apache will run from it, with document root,
>>> session store, etc. on the gfs.
>>
>> Performance-wise, GFS should be OK for that if you are running with
>> noatime and the operations are all reads. If you end up with write
>> contention without partitioning the access to directory subtrees on a
>> per-server basis, the performance will fall off a cliff pretty quickly.
> 
> Can you explain a little bit more? I'm not sure I fully understand the
> partitioning into directories?

Make sure that only one node accesses a particular directory subtree 
(until it gets failed over, that is). If you have multiple nodes 
simultaneously writing to the same directory with any regularity, you 
will experience performance issues.
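
For reference, the noatime mount from earlier would look something like 
this in /etc/fstab (device and mount point are placeholders for 
whatever your setup uses):

    /dev/vg_cluster/lv_gfs  /var/www  gfs  defaults,noatime  0 0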

Gordan



