[dm-devel] dm-thin vs lvm performance

Wed Jan 18 19:30:54 UTC 2012

Joe,
Thanks for looking into the issue, running the tests, and suggesting the "direct" flag. With dd I do see a difference when using the "direct" flag. However, the difference is much larger with bs=64M than with bs=4k.

dd-blocksize  flags   dm-thin    LVM
64M           none    179 MB/s   1 GB/s
64M           direct  2.4 GB/s   3.6 GB/s
4k            none    179 MB/s   965 MB/s
4k            direct  193 MB/s   1.7 GB/s

In all the tests, dm-thin performance is below lvm. I used 8M pages as we are not planning to use snapshots right now.  

We are planning to use the dm-thin module on a high-end flash device that supports up to 300k IOPS, hence the performance tests. I wanted to try more parallel async I/O instead of dd. A good program I found on the web is

fsbench.filesystems.org/bench/aio-stress.c

(If you are aware of other tools for testing parallel async I/O using multiple threads, please let me know and I will give them a try.)

I compiled it with libaio and libpth. 

Then I ran the test on an LVM LUN: throughput is 9563.56MB/s. On a fully allocated TP LUN it is 5399.81MB/s.

I used "perf top" to see where most of the time is spent. A lot of time is spent in _raw_spin_lock. I poked at the code a little and I see that dm_thin_find_block returns EWOULDBLOCK, and hence the bio gets deferred. It looks like spin locks are used when putting the bio on the deferred list and when getting it back from the deferred list. EWOULDBLOCK is returned from dm_thin_find_block -> dm_btree_lookup -> btree_lookup_raw -> ro_step -> bn_read_lock -> dm_tm_read_lock -> dm_bm_read_try_lock -> bl_down_read_nonblock.
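
To make the deferral path described above concrete, here is a minimal user-space sketch in C. It is not the dm-thin code: struct fake_bio, struct deferred_list, try_lookup() and map_bio() are hypothetical names, the metadata_busy flag stands in for whatever makes the non-blocking read lock fail, and the spinlock is built from a C11 atomic_flag rather than the kernel's spinlock_t. It only shows the shape of the pattern: a lookup that returns -EWOULDBLOCK pushes the bio onto a list that is locked both when queuing and when draining.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: a virtual block number plus a list link. */
struct fake_bio {
    unsigned long long block;
    struct fake_bio *next;
};

/* Deferred list guarded by a spinlock built from a C11 atomic_flag. */
struct deferred_list {
    atomic_flag lock;
    struct fake_bio *head;
    size_t len;
};

static void deferred_init(struct deferred_list *d)
{
    atomic_flag_clear(&d->lock);
    d->head = NULL;
    d->len = 0;
}

static void spin_lock(atomic_flag *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Non-blocking lookup: refuses to wait if the metadata is busy. */
static int try_lookup(int metadata_busy, unsigned long long block,
                      unsigned long long *result)
{
    if (metadata_busy)
        return -EWOULDBLOCK;
    *result = block;  /* identity mapping, purely illustrative */
    return 0;
}

/* Returns 1 if remapped on the fast path, 0 if pushed onto the deferred list. */
static int map_bio(struct deferred_list *d, struct fake_bio *bio,
                   int metadata_busy, unsigned long long *mapped)
{
    if (try_lookup(metadata_busy, bio->block, mapped) == 0)
        return 1;

    spin_lock(&d->lock);      /* every deferral takes this lock */
    bio->next = d->head;
    d->head = bio;
    d->len++;
    spin_unlock(&d->lock);
    return 0;
}

/* Worker side: take one deferred bio back off the list (NULL if empty). */
static struct fake_bio *pop_deferred(struct deferred_list *d)
{
    spin_lock(&d->lock);
    struct fake_bio *bio = d->head;
    if (bio) {
        d->head = bio->next;
        d->len--;
    }
    spin_unlock(&d->lock);
    return bio;
}
```

In this model each deferred bio costs two lock acquisitions (one to queue, one to dequeue), which is at least consistent with lock time showing up in a profile when many threads hit the slow path at once.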

I tried changing the code a little so that dm_thin_find_block uses an array as a cache and looks up the btree only if the block is not found in the array.
With that change, I see TP performance go way up.

Fully allocated TP LUN with array lookup is 9387.17MB/s. With the array, no bios are being deferred. However, when I used the array, I saw a kernel hang after several runs (yet to debug the reason).

The code changes are: initialize the array in pool_ctr() in dm-thin.c:

extern dm_block_t blockmap[1000];

printk("blockmap array inited");
for(i=0;i<1000;i++)
{
        blockmap[i]=0xffffffff;
}

Then, in dm_thin_find_block() in dm-thin-metadata.c,

at the beginning:
        if(blockmap[block]==0xffffffff)
        {
                printk("found unassigned block %llu\n",block);
        }
        else
        {
                result->block = blockmap[block];
                result->shared = 0;
                return 0;
        }

at the end, before returning:
        if(r == 0)
        {
                blockmap[block] = result->block ;
        }
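
One thing visible in the snippets above is that blockmap[block] is indexed without checking that block is below 1000, and that 0xffffffff is a 32-bit sentinel stored in a dm_block_t (a 64-bit type in the kernel). Whether either of these is related to the hang is not established here. The following is a small user-space sketch of the same array-cache idea with a bounds check and a full-width sentinel. The names cache_init(), cache_lookup() and cache_store() are made up for illustration, and the typedef only mirrors the kernel's dm_block_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dm_block_t;       /* mirrors the kernel typedef */

#define CACHE_SLOTS 1000           /* same size as the test array above */
#define UNMAPPED    UINT64_MAX     /* full-width "no entry" sentinel */

static dm_block_t blockmap[CACHE_SLOTS];

/* Mark every slot as empty. */
static void cache_init(void)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        blockmap[i] = UNMAPPED;
}

/* Returns 1 and fills *result on a hit, 0 on a miss or out-of-range block. */
static int cache_lookup(dm_block_t block, dm_block_t *result)
{
    if (block >= CACHE_SLOTS || blockmap[block] == UNMAPPED)
        return 0;
    *result = blockmap[block];
    return 1;
}

/* Remember a mapping found by the slow (btree) path; ignore blocks that do not fit. */
static void cache_store(dm_block_t block, dm_block_t data_block)
{
    if (block < CACHE_SLOTS)
        blockmap[block] = data_block;
}
```

A caller would try cache_lookup() first, fall back to the btree lookup on a miss, and call cache_store() when that lookup succeeds. Like the original, this is a single global array, so it ignores multiple thin devices and never invalidates entries when mappings change.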

Note that the array changes are only for my testing and are not generic at all. The command I used to get the throughput numbers is:

i=0; while [ $i -lt 100 ] ; do ./aio-stress.exe  -O -o 1 -c 16 -t 16 -d 256 /dev/mapper/thin1 2>&1 | grep throughput | cut -f3 -d ' ' | cut -f2 -d '(' | cut -f1 -d ' ' ; let i=$i+1; done  | awk 'BEGIN{i=0.0} {i+=$0} END{print i/100}'

Thanks,
Jagan.

________________________________
 From: Joe Thornber <thornber at redhat.com>
To: Jagan Reddy <gjmsreddy at yahoo.com>; device-mapper development <dm-devel at redhat.com> 
Sent: Monday, January 16, 2012 4:42 AM
Subject: Re: [dm-devel] dm-thin vs lvm performance

Hi Jagan,

On Thu, Jan 12, 2012 at 04:55:16PM -0800, Jagan Reddy wrote:
> Hi,

> I  recently started using dm-thin module and first of all, thank you
> for the good work. It works great, well documented and the comments
> in the code are very useful.  I tried to run some performance tests
> using a high performance flash device and the performance of dm thin
> is about 30% of LVM performance on a full allocated thin provisioned
> volume. ( ie after all pages/blocks are allocated using
> dd). Performance test does only reads and no writes. 

Thanks very much for taking the time to try thinp out properly.
People testing different scenarios is very useful to us.

You're probably aware that the thinp test suite is available here:

    https://github.com/jthornber/thinp-test-suite

I've added a little set of tests that recreate your scenario here:

    https://github.com/jthornber/thinp-test-suite/blob/master/ramdisk_tests.rb

I used a 2G ramdisk for these tests, and a variety of thinp block
sizes and 'dd' options.  I'll just summarise the main results.  I
should also point out that my testing was done on a VM hosted on a 4G
machine, so the machine was under a lot of memory pressure and there
was a lot of variance in the benchmarks.

writes across various volumes
-----------------------------

write1
------

Testing write performance.

dd if=/dev/zero of=/dev/mapper/<thin> oflag=direct bs=64M

zeroing new blocks turned on.

thinp block size = 64k

| Linear              | 2.2 G/s |
| Unprovisioned thin  | 1.4 G/s |
| Provisioned thin    | 1.9 G/s |
| Snap totally shared | 1.5 G/s |
| Snap no sharing     | 1.9 G/s |

Pretty good.  Not showing the drastic drop that you were seeing.  The
small thinp block size means the snaps perform nicely (not many
copy-on-writes).

write2
------

As write1, but with 8M thinp block size as in your tests.

| Linear              | 2.2 G/s |
| Unprovisioned thin  | 1.5 G/s |
| Provisioned thin    | 2.2 G/s |
| Snap totally shared | 882 M/s |
| Snap no sharing     | 2.2 G/s |

Good results.  Performance when breaking sharing is down because the
large block size means there will be more actual copying incurred.

write3
------

As write2, but without the oflag=direct option to dd.

| Linear              | 900 M/s |
| Unprovisioned thin  | 579 M/s |
| Provisioned thin    | 694 M/s |
| Snap totally shared | 510 M/s |
| Snap no sharing     | 654 M/s | 

Alarming.  Results are similar for thinp block size of 64k.

read1
-----

Testing read performance.

dd if=/dev/mapper/<thin> of=/dev/null iflag=direct bs=64M

thinp block size = 64k

| Linear              | 3.3 G/s |
| Provisioned thin    | 2.7 G/s |
| Snap no sharing     | 2.8 G/s | 

read2
-----

As read1, but with 8M thinp block size.

| Linear              | 3.3 G/s |
| Provisioned thin    | 3.2 G/s |
| Snap no sharing     | 3.3 G/s | 

read3
-----

As read2, but without the iflag=direct option to 'dd'.

| Linear              | 1.0 G/s |
| Provisioned thin    | 594 M/s |
| Snap no sharing     | 605 M/s | 

I think there are a couple of conclusions we can draw from this:

i) dd isn't great for benchmarking block devices
ii) if you are going to use it, then make sure you use O_DIRECT
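
For readers who want to see what point (ii) amounts to outside of dd, here is a minimal C sketch of a sequential read with and without O_DIRECT. The function name read_device() and the 4096-byte buffer alignment are assumptions made for illustration; the alignment a given device actually requires may differ, and some filesystems reject O_DIRECT altogether, in which case this sketch simply returns -1.

```c
#define _GNU_SOURCE            /* needed for O_DIRECT on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `path` to EOF in `chunk`-byte requests and return the number of
 * bytes read, or -1 on any error.  With use_direct set, the file is
 * opened with O_DIRECT so reads bypass the page cache.
 */
static int64_t read_device(const char *path, size_t chunk, int use_direct)
{
    int flags = O_RDONLY | (use_direct ? O_DIRECT : 0);
    int fd = open(path, flags);
    if (fd < 0)
        return -1;

    /* O_DIRECT wants an aligned buffer; 4096 is an assumed alignment. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, chunk) != 0) {
        close(fd);
        return -1;
    }

    int64_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        total += n;

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Timing two calls, for example read_device("/dev/mapper/thin1", 64 << 20, 1) against the same call with use_direct set to 0, would separate page cache effects from the cost of the device-mapper target itself, which is the distinction the tables above are drawing.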

Using an instrumented kernel, I've confirmed that these read tests
are taking the fast path through the code.  The mapping tables are all
in memory.  The bio_map function is always returning DM_MAPIO_REMAPPED.

Will you be using a filesystem on these block devices?

- Joe