[dm-devel] [PATCH 1/6] dm raid45 target: export region hash functions and add a needed one
Doug Ledford
dledford at redhat.com
Tue Jul 7 18:38:32 UTC 2009
On Jul 5, 2009, at 11:21 PM, Neil Brown wrote:
> Here your code seems to be 2-3 times faster!
> Can you check which function xor_block is using?
> If it is :
> xor: automatically using best checksumming function: ....
> then it might be worth disabling that test in calibrate_xor_blocks and
> see if it picks one that ends up being faster.
>
> There is still the fact that by using the cache for data that will be
> accessed once, we are potentially slowing down the rest of the system.
> i.e. the reason to avoid the cache is not just because it won't
> benefit the xor much, but because it will hurt other users.
> I don't know how to measure that effect :-(
> But if avoiding the cache makes xor 1/3 the speed of using the cache
> even though it is cold, then it would be hard to justify not using the
> cache I think.
So, Heinz and I are actually both looking at xor speed issues, but
from two different perspectives. While he's comparing some of the
dmraid45 xor stuff to the xor_blocks routine in crypto/, I'm
specifically looking at that "automatically using best checksumming
function" routine. For the last 9 or so years, we've automatically
opted for the SSE + non-temporal store routine specifically because
it's not supposed to pollute cache. However, after even just a
cursory reading of the current Intel architecture optimization guide,
it's obvious that our SSE routine is getting rather aged, and I think
the routine is in serious need of an overhaul. This is something I'm
currently looking into. But, that raises the question of how to
decide whether or not to use it, either in its current form or any new
form it might take. As you point out, the tradeoff between cache
polluting and non-cache polluting is hard to quantify.
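To make the cache-polluting versus non-polluting distinction concrete, here is a minimal userspace sketch, not the kernel's routine. It assumes an x86 compiler with SSE2 and a separate destination buffer; the function name xor_nt is hypothetical. The non-temporal path writes the result with streaming stores so the output does not displace other data in cache, and the fallback uses ordinary stores.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * dst = a ^ b. Hypothetical helper for illustration only.
 * Returns 1 if the non-temporal (streaming store) path ran,
 * 0 if the ordinary cached-store fallback ran.
 */
static int xor_nt(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  size_t words)
{
#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned destination. */
    if (((uintptr_t)dst % 16) == 0 && (words % 2) == 0) {
        for (size_t i = 0; i < words; i += 2) {
            __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i y = _mm_loadu_si128((const __m128i *)(b + i));
            /* Write around the cache instead of allocating a line. */
            _mm_stream_si128((__m128i *)(dst + i), _mm_xor_si128(x, y));
        }
        _mm_sfence(); /* order the streaming stores before later accesses */
        return 1;
    }
#endif
    for (size_t i = 0; i < words; i++)
        dst[i] = a[i] ^ b[i]; /* ordinary stores: result lands in cache */
    return 0;
}
```

Which path wins depends on whether the result is read again soon, which is exactly the tradeoff that is hard to quantify in isolation.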
We made a significant error when we originally wrote the SSE routines,
and Heinz just duplicated it. Specifically, we tested performance on
a quiescent system. For the SSE routines, I think this is a *major*
error. The prefetch instructions need to be timed such that the
prefetch happens at roughly the right point in time to compensate for
the memory latency in getting the data to L1/L2 cache prior to use by
the CPU. Unfortunately, memory latency in a system that is quiescent
is drastically different than latency in a system with several CPUs
actively competing for RAM resources on top of 100MB/s+ of DMA
traffic, etc. When we optimized the routines in a quiescent state, I
think we got our prefetches too close to when the data was needed by
the CPU under real world use conditions and that's impacting the
operation of the routines today. (Or maybe we did get it right, but
changes in CPU speed relative to memory latency have moved the best
prefetch point over time. Either way, the current SSE xor routine
appears to be seriously underperforming in my benchmark tests.)
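The prefetch-timing point can be sketched as an xor loop with a tunable prefetch distance. This is an illustrative userspace sketch, not the kernel's SSE routine: it assumes GCC or Clang (for __builtin_prefetch) and a 64-byte cache line, and the name xor_prefetch and its distance parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * dst ^= src, one assumed 64-byte cache line per outer iteration,
 * prefetching `distance` lines ahead of the line being processed.
 * distance == 0 disables prefetching.
 */
static void xor_prefetch(uint64_t *dst, const uint64_t *src,
                         size_t words, size_t distance)
{
    const size_t line_words = 64 / sizeof(uint64_t);
    size_t i, j;

    for (i = 0; i + line_words <= words; i += line_words) {
        size_t ahead = i + distance * line_words;

        if (distance && ahead < words) {
            /* Locality hint 0: data is not expected to be reused. */
            __builtin_prefetch(src + ahead, 0, 0);
            __builtin_prefetch(dst + ahead, 1, 0);
        }
        for (j = 0; j < line_words; j++)
            dst[i + j] ^= src[i + j];
    }
    for (; i < words; i++) /* leftover words that do not fill a line */
        dst[i] ^= src[i];
}
```

A distance that hides memory latency on an idle machine may be too short once other CPUs and DMA traffic are competing for RAM, so any tuning of it would need to be repeated under load.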
Likewise, Heinz's tests were comparing cold cache to hot cache and
trying to find a crossover point where we switch from one to the
other. But that question necessarily depends on other factors in the
system including what other cores on the same die are doing as that
impacts the same cache.
So if the error was to not test and optimize these routines under
load, then the right course of action would be to do the opposite.
And that leads me to believe that the best way to quantify the
difference between cache polluting and non-cache polluting should
likewise not be done on a quiescent system with a micro benchmark.
Instead, we need a holistic performance test to get the truly best xor
algorithm. In my current setup, the disks are so much faster than the
single threaded xor thread that the bottleneck is the xor speed. So,
what does it matter if the xor routine doesn't pollute cache if the
raid is so slow that programs are stuck in I/O wait all the time as
the raid5 thread runs non-stop? Likewise, who cares what the top
speed of a cache polluting xor routine is if in the process it evicts
so many cache lines belonging to the processes doing real work on the
system that cache reload becomes the bottleneck? The ultimate
goal of either approach is overall *system* speed, not micro benchmark
speed. I would suggest a specific, system wide workload test that
involves a filesystem on a device that uses the particular raid level
and parity routine you want to test, and then you need to run that
system workload and get a total time required to perform that specific
work set, CPU time versus idle+I/O wait time in completing that work
set, etc. Repeat the test for the various algorithms you wish to
test, then analyze the results and go from there. I don't think
you're going to get a valid run time test for this; instead, we would
likely need to create a few heuristic rules that, combined with
specific CPU properties, cause us to choose the right routine for the
machine.
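The measurement suggested above (total time for a fixed work set, split into CPU time and idle+I/O wait) could start from a harness like the following. It is a minimal sketch under stated assumptions: a POSIX system, and a workload expressed as a callback. The names run_stats, measure and spin_workload are hypothetical, and the "blocked" figure covers only the calling process; system-wide idle and I/O wait would have to come from the kernel's own accounting.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

struct run_stats {
    double wall_s;    /* elapsed wall-clock time */
    double cpu_s;     /* CPU time charged to this process */
    double blocked_s; /* wall - cpu: rough stand-in for idle + I/O wait */
};

static double ts_to_s(struct timespec t)
{
    return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
}

/* Run one workload and report where the time went. */
static struct run_stats measure(void (*workload)(void *), void *arg)
{
    struct timespec w0, w1, c0, c1;
    struct run_stats r;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
    workload(arg);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    r.wall_s = ts_to_s(w1) - ts_to_s(w0);
    r.cpu_s = ts_to_s(c1) - ts_to_s(c0);
    r.blocked_s = r.wall_s - r.cpu_s;
    if (r.blocked_s < 0.0) /* e.g. multi-threaded work or clock granularity */
        r.blocked_s = 0.0;
    return r;
}

/* Placeholder workload: a real test would drive filesystem I/O on the raid. */
static void spin_workload(void *arg)
{
    size_t n = *(size_t *)arg;
    volatile uint64_t sink = 0;

    for (size_t i = 0; i < n; i++)
        sink ^= i;
}
```

Running the same work set once per candidate xor routine and comparing the three numbers gives the kind of whole-system comparison described above, rather than a micro benchmark figure.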
--
Doug Ledford <dledford at redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband