[dm-devel] [Lsf-pc] [LSF/MM TOPIC] a few storage topics

Wed Jan 25 22:46:14 UTC 2012

On Wed, Jan 25, 2012 at 03:06:13PM -0500, Chris Mason wrote:
> We can talk about scaling up how big the RA windows get on their own,
> but if userland asks for 1MB, we don't have to worry about futile RA, we
> just have to make sure we don't oom the box trying to honor 1MB reads
> from 5000 different procs.

:) That's for sure, if the read has a 1M buffer as its destination.
However, even cp /dev/sda reads and writes through a 32KB buffer, so
reading into 1M buffers is not that common.

But I would also prefer to stay on the simple side (as a side note, I
think we have already run out of page flags on 32bit, as I had to nuke
PG_buddy).

Overall I think the risk of the pages being evicted before they can be
copied to userland is quite minor. A 16G system with 100 readers all
hitting the disk at the same time using 1M readahead would still only
create 100M of memory pressure... So it'd sure be ok, 100M is less
than what kswapd always keeps free, for example. Think of a 4TB
system. Especially if a fixed 128k has been ok so far on a 1G system.
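
To put numbers on that, a quick userland sketch in Python (not kernel
code; the helper name is made up, and it assumes every reader has
exactly one 1M readahead window in flight, which is what gives the
100M total):

```python
MIB = 1024 * 1024

def readahead_pressure(readers, window_bytes):
    """Worst-case page cache held by in-flight readahead,
    assuming one window per reader (an assumption of this sketch)."""
    return readers * window_bytes

ram = 16 * 1024 * MIB                      # the 16G box from the example
pressure = readahead_pressure(100, 1 * MIB)

print(pressure // MIB, "MiB")              # 100 MiB
print(f"{100 * pressure / ram:.2f}% of RAM")  # 0.61% of RAM
```

So even with every window fully populated and nothing copied to
userland yet, the readahead pages stay well under 1% of RAM here.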

If we really want to be more dynamic than a setting at boot depending
on ram size, we could limit it to a fraction of freeable memory (using
math similar to determine_dirtyable_memory, maybe recomputing it over
time but not too frequently, to reduce the overhead). Like if there's
0 memory freeable, keep it low. If there's 1G freeable out of that
math (and we assume the readahead hit rate is near 100%), raise the
maximum readahead to 1M even if the total ram is only 1G. So we allow
up to 1000 readers before we even start recycling the readahead pages.
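
Roughly what I have in mind, as a Python sketch rather than real
kernel code. The function name, the 128k floor, and the budget of 1000
concurrent readers are all assumptions picked to match the numbers
above; the freeable figure would come from math similar to
determine_dirtyable_memory, which is not modelled here:

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

MIN_RA = 128 * KIB        # floor: the fixed 128k used so far
MAX_RA = 1 * MIB          # ceiling discussed in this thread
ASSUMED_READERS = 1000    # hypothetical: concurrent readers to budget for

def max_readahead(freeable_bytes):
    """Cap the readahead window at freeable / ASSUMED_READERS,
    clamped to the range [MIN_RA, MAX_RA]."""
    budget = freeable_bytes // ASSUMED_READERS
    return max(MIN_RA, min(MAX_RA, budget))

print(max_readahead(0) // KIB, "KiB")        # 128 KiB: nothing freeable, keep it low
print(max_readahead(1 * GIB) // KIB, "KiB")  # 1024 KiB: 1G freeable, allow 1M
```

A real version would cache the result and recompute it only every so
often, so the fast path pays nothing.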

I doubt the complexity of tracking exactly how many pages are getting
recycled before they're copied to userland would be worth it; besides,
it'd be 0% for 99% of systems and workloads.

Way more important is to have feedback on the readahead hits, to be
sure that when readahead is raised to the maximum the hit rate is near
100%, and to fall back to lower readahead sizes if we don't get that
hit rate. But that's not a VM problem, it's a readahead issue only.
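
One possible shape for that feedback loop, again only as a Python
sketch. The class, the 95% threshold, and the double/halve policy are
assumptions for illustration, not how the kernel readahead code works;
sizes are in pages, assuming 4k pages so 32 pages is 128k and 256
pages is 1M:

```python
class ReadaheadWindow:
    """Hypothetical per-file readahead state with hit-rate feedback."""

    def __init__(self, min_pages=32, max_pages=256, target_hit_rate=0.95):
        self.min_pages = min_pages
        self.max_pages = max_pages
        self.target = target_hit_rate
        self.pages = min_pages      # start at the small window

    def update(self, pages_read_ahead, pages_used):
        """Grow the window while nearly all readahead pages get used,
        shrink it when the hit rate drops. Returns the new size."""
        if pages_read_ahead == 0:
            return self.pages       # nothing to learn from
        hit_rate = pages_used / pages_read_ahead
        if hit_rate >= self.target:
            self.pages = min(self.max_pages, self.pages * 2)
        else:
            self.pages = max(self.min_pages, self.pages // 2)
        return self.pages

w = ReadaheadWindow()
print(w.update(32, 32))     # 64: every page was used, grow
print(w.update(64, 64))     # 128
print(w.update(128, 40))    # 64: most pages wasted, fall back
```

The window only reaches the maximum after several windows in a row
were almost entirely consumed, and a single bad window halves it.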

The actual VM pressure side of it sounds like a minor issue if the hit
rate of the readahead cache is close to 100%.

The config option is also ok with me, but I think it'd be nicer to set
it at boot depending on ram size (one less option to configure
manually and zero overhead).