
[dm-devel] Performance testing of related dm-crypt patches


This mail is fairly long, so I'm adding a table of contents for easier navigation:

[1] Short resume of performance results
[2] Descriptions of test systems
[3] Detailed tests description
[4] Description of dm-crypt modules involved in testing
[5] dm-zero based test results
[6] spin drive based results
[7] spin drive based results (heavy load)

[1] Short resume of performance results

Results for a dm-crypt target mapped over a dm-zero target (testing the pure performance of dm-crypt alone) show that unbounding the workqueue is hugely beneficial for very fast devices. Offloading the requests to a separate thread (before sorting them) has some cost (~10% compared to the result with only the unbound workqueue patch applied), but nothing that would seriously hurt performance. The results also show that the CPU cost of sorting the requests before submitting them to the lower layer is negligible. Note that with the dm-zero backend no I/O scheduler steps in.

With spin drives it's not so straightforward, but overall there are still nice performance gains visible. Especially with larger block sizes (and deeper queues), the sorting patch improves performance significantly and sometimes matches the performance of the raw block device!

Unfortunately, there are workloads where even unbounding the queue or the subsequent offloading of requests to a separate thread can hurt performance, which is why we decided to introduce 2 switches in the dm-crypt target constructor. A more detailed explanation is in [6] and [7].

[2] Descriptions of test systems

numa_1 : single socket Intel system with 6 cores CPU and hyper-threading enabled (12 logical cores), 12GiB ram

numa_2 : two socket Intel system with 2x8 cores with HT enabled (32 logical cores), 128 GiB ram

numa_4 : 4 socket AMD system with 4x4 cores no HT (16 logical cores), 8GiB ram

numa_8 : 8 socket Intel system with 8x10 cores and HT enabled (160 logical cores), 1 TiB ram

- All systems had additional storage attached so that spin drives were not shared with the system (with rootfs, swap, whatever)

- CPU throttling was disabled: especially all sleep states (except c-state 0) and turbo modes (if available)

- read/write caching disabled on spin drives

- test OS was RHEL7 with upstream kernel and custom dm-crypt patches (more on that in section [4])

[3] Detailed tests description

tested cipher passed to dm-crypt target: aes-xts-plain64

Tests performed async sequential writes using fio and the libaio library. Each test scenario was run repeatedly (5 to 10 iterations per scenario) to rule out measurement error as much as possible and to detect whether results for a particular job were highly volatile (some were).
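As a rough illustration of the kind of job described above, here is a minimal sketch that renders a fio job file for asynchronous sequential writes through libaio. The option names (ioengine, rw, direct, bs, iodepth, filename) are standard fio options, but the helper name, the device path and the use of direct=1 are assumptions made for this example; the job files actually used are in the linked archives.

```python
def fio_job(name, device, block_size, iodepth):
    """Render a fio job file for async sequential writes (illustrative only)."""
    options = {
        "ioengine": "libaio",  # asynchronous I/O through the libaio library
        "rw": "write",         # sequential writes
        "direct": 1,           # assumption: bypass the page cache
        "bs": block_size,      # block size, e.g. "32k"
        "iodepth": iodepth,    # maximum I/O queue depth
        "filename": device,    # hypothetical dm-crypt mapping path
    }
    lines = [f"[{name}]"] + [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(fio_job("disk", "/dev/mapper/crypt_test", "32k", 256))
```

Sweeping block_size and iodepth over the tested values would yield one job per result line.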

Tests were based on two backends for dm-crypt mapping: spin drive or dm-zero target for measuring pure dm-crypt performance.

I used three basic scenarios:
"disk": single fio process writing sequentially to dm-crypt mapped over a spin drive (starting at the device's origin)

"zero": single fio process writing sequentially to dm-crypt mapped over dm-zero

"disk_heavy_load": sequential writes issued from multiple fio processes, with each set of processes bound to a different CPU socket, writing to a spin drive (under a dm-crypt mapping). The device is divided uniformly between all sockets (and thus also between all fio processes).

Example of a disk_heavy_load test with 3 fio processes per socket:
CPU0: a whole socket, not a single core
f0 f1 f2: set of three individual fio processes bound to CPU0
r0: device region (linear segment) written by f0

   CPU0    |   CPU1   |   CPU2   |
 f0 f1 f2  | f3 f4 f5 | f6 f7 f8 |
  |  |  |  |  |  |  | |  |  |  | |
 r0 r1 r2  | r3 r4 r5 | r6 r7 r8 |
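The uniform division of the device can be sketched as follows. This is only an illustration of the layout in the diagram: the function name, the 4 KiB alignment and the device size are assumptions, not taken from the actual test scripts. The resulting offset and size values are what one would pass to fio's offset and size options for each process.

```python
def split_device(device_bytes, sockets, procs_per_socket, align=4096):
    """Divide a device into equal, non-overlapping regions, one per fio process."""
    total = sockets * procs_per_socket
    # Round each region down to the alignment so regions never overlap.
    region = (device_bytes // total) // align * align
    layout = []
    for i in range(total):
        layout.append({
            "proc": f"f{i}",                  # fio process label from the diagram
            "region": f"r{i}",                # device region label from the diagram
            "socket": i // procs_per_socket,  # CPU socket the process is bound to
            "offset": i * region,             # start of the region in bytes
            "size": region,                   # length of the region in bytes
        })
    return layout


if __name__ == "__main__":
    # Hypothetical 9 GiB device, 3 sockets, 3 fio processes per socket.
    for entry in split_device(9 * 1024**3, sockets=3, procs_per_socket=3):
        print(entry)
```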

Result tables are composed of multiple lines that look like the following:

D iodepth=256, 32k, mode: write: 698461.10 14795.64 2.12 %
-    -----     ---                -----      -----   ----
|      |        |                   |          |      |
|      |        |                   |          |      v
|      |        |                   |          |  standard deviation
|      |        |                   |          v
|      |        |                   |     average deviation (KiB/s)
|      |        |                   v
|      |        |       sum of bandwidth all fio's (KiB/s)
|      |        v
|      v     block size
| max I/O queue depth
dm-crypt module name (see following section)
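For anyone post-processing the stats files, a line in this format can be parsed with a small regular expression. The sketch below is based only on the sample line above; the field names are made up for this example, and the real files may contain variations it does not handle. For the sample line, the last column equals the second value expressed as a percentage of the first.

```python
import re

# Pattern derived from the single sample line shown above (an assumption).
LINE_RE = re.compile(
    r"^(?P<module>\S)\s+iodepth=(?P<iodepth>\d+),\s*(?P<bs>\w+),\s*"
    r"mode:\s*(?P<mode>\w+):\s*"
    r"(?P<bw>[\d.]+)\s+(?P<dev>[\d.]+)\s+(?P<pct>[\d.]+)\s*%$"
)


def parse_result_line(line):
    """Split one result line into typed fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "module": fields["module"],         # dm-crypt module letter
        "iodepth": int(fields["iodepth"]),  # max I/O queue depth
        "bs": fields["bs"],                 # block size
        "mode": fields["mode"],
        "bw_kib_s": float(fields["bw"]),    # summed bandwidth of all fio processes
        "dev_kib_s": float(fields["dev"]),  # deviation in KiB/s
        "dev_pct": float(fields["pct"]),    # deviation in percent
    }


if __name__ == "__main__":
    row = parse_result_line("D iodepth=256, 32k, mode: write: 698461.10 14795.64 2.12 %")
    print(row)
    # Deviation relative to bandwidth, in percent.
    print(round(row["dev_kib_s"] / row["bw_kib_s"] * 100, 2))
```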

[4] Description of dm-crypt modules involved in testing

Each line in the result tables is prefixed with a single letter indicating which dm-crypt module was used in the test.

'_' stands for the raw block device (used only within one "disk" test)

'A' stands for the upstream kernel

'D' stands for the following patches:
- dm crypt: remove unused io_pool and _crypt_io_pool
- dm crypt: avoid deadlock in mempools
- dm crypt: don't allocate pages for a partial request

'E' stands for the following patch:
- dm crypt: use unbound workqueue for request processing (the option 'same_cpu_crypt' turned off)

'F' stands for the following patches:
- dm crypt: offload writes to thread
- dm crypt: add 'submit_from_crypt_cpus' option (but turned off)

'G' stands for the following patch:
- dm crypt: sort writes ('submit_from_crypt_cpus' turned off)
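To make the two switches concrete, here is a hedged sketch of how a dm-crypt table line with the optional parameters could be assembled, following the documented dm-crypt table layout (start, length, "crypt", cipher, key, IV offset, device, offset, then an optional parameter count and the parameters). The helper names, the all-zero placeholder key, the sector count and the backing device path are assumptions for illustration only. A dm-zero table line, as used for the "zero" backend, is included for completeness.

```python
def zero_table(num_sectors):
    """Table line for a dm-zero target covering num_sectors 512-byte sectors."""
    return f"0 {num_sectors} zero"


def crypt_table(num_sectors, device, key_hex,
                same_cpu_crypt=False, submit_from_crypt_cpus=False,
                cipher="aes-xts-plain64"):
    """Table line for a dm-crypt target, with the optional switches appended."""
    opts = []
    if same_cpu_crypt:
        opts.append("same_cpu_crypt")          # keep encryption on the submitting CPU
    if submit_from_crypt_cpus:
        opts.append("submit_from_crypt_cpus")  # do not offload writes to a thread
    # <start> <length> crypt <cipher> <key> <iv_offset> <device> <offset>
    fields = ["0", str(num_sectors), "crypt", cipher, key_hex, "0", device, "0"]
    if opts:
        fields += [str(len(opts))] + opts      # <#opt_params> <opt_params...>
    return " ".join(fields)


if __name__ == "__main__":
    key = "00" * 64  # placeholder 512-bit key for aes-xts, not a real key
    print(zero_table(2097152))
    # Both switches on: closest to the pre-patch behaviour.
    print(crypt_table(2097152, "/dev/mapper/zero_test", key,
                      same_cpu_crypt=True, submit_from_crypt_cpus=True))
```

Either line could then be fed to dmsetup create on its standard input.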

[5] dm-zero based test results

"zero" test on single socket system: http://okozina.fedorapeople.org/dm-crypt-for-3.20/zero/numa_1/stats

"zero" test on 8 socket system: http://okozina.fedorapeople.org/dm-crypt-for-3.20/zero/numa_8/stats

full test results including fio job files and logs:

[6] spin drive based results

"disk" test single socket system with cfq scheduler: http://okozina.fedorapeople.org/dm-crypt-for-3.20/disk/numa_1/stats
full test results including fio job files and logs:

Usually there's a noticeable performance improvement starting with patch E at iodepth=8 and a reasonably set bsize (4KiB and larger), but as you can see there are a few examples where offloading (and sorting) hurts performance (iodepth=32, various block sizes).

With iodepth=256 there are some examples where unbounding the workqueue without offloading to a single thread can hurt performance (bsize=16KiB and 32KiB).

But in most cases we can say dm-crypt performance is now pretty close to that of the raw block device.

[7] spin drive based results (heavy load)

These tests were the most complex. Both the cfq and deadline schedulers were tested, with different nr_request settings for the device's scheduler queue.

Tests spawned 1, 5 or 8 fio processes per CPU socket in the system (8, 40 or 64 processes in the case of numa_8) and performed I/O on the same number of non-overlapping disk regions.

subdir /numj_1/ means a single process per CPU socket, /numj_5/ means 5 processes per socket, and so on.

Unfortunately, there are workloads where unbounding the workqueue causes a performance drop and the subsequent offloading to a single thread makes it even worse (see 8 socket system, cfq, numj_1: http://okozina.fedorapeople.org/dm-crypt-for-3.20/disk_heavy_load/numa_8/cfq/nr_req_128/numj_1/stats).

Similar observations on the 2 socket system, cfq, numj_1: http://okozina.fedorapeople.org/dm-crypt-for-3.20/disk_heavy_load/numa_2/cfq/nr_req_128/numj_1/stats

On both the 2 socket and 8 socket systems this effect fades away as more fio processes per socket are added.

Only the 4 socket system (not so up-to-date AMD CPUs w/o HT) didn't show this pattern.

Generally, with higher load, deeper I/O queues and larger block sizes, the sorting that takes place in the offload thread does its job well.

*cfq* scheduler, nr_request=128:

2 sockets system:

4 sockets system:

8 sockets system:

*deadline* scheduler, nr_request=128:

2 sockets system:

4 sockets system:

8 sockets system:

full test results including fio job files and logs (beware that the unpacked archive is about 500MiB):

