[dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics

Tue Jan 24 18:05:50 UTC 2012

Andreas Dilger <adilger at dilger.ca> writes:

> On 2012-01-24, at 9:56, Christoph Hellwig <hch at infradead.org> wrote:
>> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>>> https://lkml.org/lkml/2011/12/13/326
>>> 
>>> This patch is another example, although for a slightly different reason.
>>> I really have no idea yet what the right answer is in a generic sense,
>>> but you don't need a 512K request to see higher latencies from merging.
>> 
>> That assumes the 512k request is created by merging.  We have enough
>> workloads that create large I/O from the get go, and not splitting them
>> and eventually merging them again would be a big win.  E.g. I'm
>> currently looking at a distributed block device which uses internal 4MB
>> chunks, and increasing the maximum request size to that dramatically
>> increases the read performance.
>
> (sorry about last email, hit send by accident)
>
> I don't think we can have a "one size fits all" policy here. In most
> RAID devices the IO size needs to be at least 1MB, and with newer
> devices 4MB gives better performance.

Right, and there's more to it than just I/O size.  There's access
pattern, and more importantly, workload and related requirements
(latency vs throughput).

> One of the reasons that Lustre used to hack so much around the VFS and
> VM APIs is exactly to avoid the splitting of read/write requests into
> pages and then depend on the elevator to reconstruct a good-sized IO
> out of it.
>
> Things have gotten better with newer kernels, but there is still a
> ways to go w.r.t. allowing large IO requests to pass unhindered
> through to disk (or at least as far as ensuring that the IO is aligned
> to the underlying disk geometry).

I've been wondering if it's gotten better, so decided to run a few quick
tests.

Setup: kernel version 3.2.0; storage: HP EVA FC array; I/O scheduler:
cfq; max_sectors_kb: 1024; test program: dd
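
For anyone wanting to check the same knobs before repeating the tests,
here is a minimal sketch that reads them from sysfs.  The queue/scheduler,
queue/max_sectors_kb and queue/max_hw_sectors_kb files are standard block
layer attributes; the function names and the "sda" device name are
hypothetical and only there for illustration.

```python
from pathlib import Path


def active_scheduler(raw: str) -> str:
    """Return the bracketed entry of a queue/scheduler string.

    The kernel lists the available schedulers and marks the active one
    with brackets, e.g. "noop deadline [cfq]".
    """
    for name in raw.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    # No brackets found: fall back to the raw contents.
    return raw.strip()


def queue_settings(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Collect the request-size related settings for one block device.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default is what you want.
    """
    queue = Path(sysfs_root) / dev / "queue"
    return {
        "scheduler": active_scheduler((queue / "scheduler").read_text()),
        "max_sectors_kb": int((queue / "max_sectors_kb").read_text()),
        "max_hw_sectors_kb": int((queue / "max_hw_sectors_kb").read_text()),
    }


# Example (device name is an assumption):
#   print(queue_settings("sda"))
```

max_sectors_kb is the soft limit on request size and cannot be raised
above max_hw_sectors_kb, so both are worth recording alongside results.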

ext3:
- buffered writes and buffered O_SYNC writes, all with a 1MB block size,
  show 4KB I/Os passed down to the I/O scheduler
- buffered 1MB reads are a little better, typically in the 128KB-256KB
  range when they hit the I/O scheduler.

ext4:
- buffered writes: 512KB I/Os show up at the elevator
- buffered O_SYNC writes: data is again 512KB, journal writes are 4KB
- buffered 1MB reads get down to the scheduler in 128KB chunks

xfs:
- buffered writes: 1MB I/Os show up at the elevator
- buffered O_SYNC writes: 1MB I/Os
- buffered 1MB reads: 128KB chunks show up at the I/O scheduler

So, ext4 is doing better than ext3, but still not perfect.  xfs is
kicking ass for writes, but reads are still split up.
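
To put those sizes in perspective, the short sketch below works out how
many requests the I/O scheduler sees for each 1MB block that dd submits,
using the request sizes reported above.  It is plain arithmetic, not a
measurement: it ignores journal writes and any merging that happens
after the requests reach the scheduler, and the helper name is
hypothetical.

```python
KB = 1024
MB = 1024 * KB


def requests_per_block(block_bytes: int, request_bytes: int) -> int:
    """Number of requests needed to cover one block (rounded up)."""
    return -(-block_bytes // request_bytes)


# Request sizes taken from the results above (data I/O only).
observed = {
    "ext3 buffered write": 4 * KB,
    "ext3 buffered read (low end)": 128 * KB,
    "ext3 buffered read (high end)": 256 * KB,
    "ext4 buffered write": 512 * KB,
    "ext4 buffered read": 128 * KB,
    "xfs buffered write": 1 * MB,
    "xfs buffered read": 128 * KB,
}

for label, size in observed.items():
    count = requests_per_block(1 * MB, size)
    print(f"{label:32s} {size // KB:5d}KB -> {count:3d} requests per 1MB")
```

That is 256 requests per 1MB block at 4KB, 8 at 128KB, 2 at 512KB, and a
single request only in the xfs write case.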

Cheers,
Jeff