Customers are always looking to gain performance improvements from their servers. One persistent bottleneck has been the speed of disk access. Until fairly recently, servers have usually been configured with banks of hard disk drives or attached to Storage Area Networks (SANs), which are themselves huge banks of hard drives. Solid State Drives (SSDs) and NVMe devices offer better performance for most users than spindle-based hard disk drives. However, SSDs and NVMe devices are considerably more expensive.

One solution for improving disk I/O performance is to combine the capacity offered by spindle-based HDDs with the speed of access offered by SSDs. Some storage vendors sell hybrid drives combining these two storage technologies. It is possible to achieve the same result in Red Hat Enterprise Linux by configuring an SSD to act as a cache device for a larger HDD. This has the added benefit of allowing you to choose your storage vendor without relying on their cache implementation. As SSD prices drop and capacities increase, the cache devices can be replaced without worrying about the underlying data devices.

A supported solution in Red Hat Enterprise Linux is to use a dm-cache device. Since this is part of the device mapper, we don’t need to worry about kernel modules and kernel configuration options, and no tuning has been necessary for the tests performed.


However, it is worth knowing that dm-cache has been engineered to target particular use cases - it is a ‘hot-spot’ cache and is slow filling. This design choice means that data is promoted to the cache only after multiple accesses, so the cache populates slowly. As a result, streaming data will not be cached, and purely random access will see little benefit. Likewise, where files are created and destroyed on a frequent basis, dm-cache is unlikely to be of benefit.

This behavior is in contrast to the more familiar kernel filesystem cache which uses physical RAM to cache file access. The kernel filesystem cache will be populated quickly, but it is also more volatile, and cannot be targeted at a specific volume in the manner that dm-cache can be.

Setting up the performance testing environment

Given these criteria for use cases, testing will not use dd, will focus on read speeds, and will require multiple runs before significant performance benefits can be realized. My hardware for testing has been a PC with three storage devices present:

  1. 120GB mSATA 'disk' (/dev/sdc). This is where the OS has been installed, and for the purposes of testing is a 'fast' disk offering similar speeds to an SSD.

  2. 500GB SATA 2.5" HDD (/dev/sda). This is my 'slow' disk and will be the target location for data to be read from.

  3. 130GB SATA 2.5" SSD (/dev/sdb). This is my 'fast' disk and will be used as my cache device.

Red Hat Enterprise Linux 7.3 has been installed on the mSATA disk (identified as /dev/sdc) with the filesystem formatted as xfs. I then created a single partition on the 500GB HDD and an LVM volume group called data, containing a single 400GB logical volume called 'slowdisk'. The logical volume was then formatted as xfs.
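The exact preparation commands are not shown in the original setup; a minimal sketch of how it might be done (device names as listed above, LV size matching the lvs output below, and the /testing mount point matching the path used in the tests that follow) would be:

 [root@rhel-test ~]# parted -s /dev/sda mklabel gpt mkpart primary 0% 100%
 [root@rhel-test ~]# pvcreate /dev/sda1
 [root@rhel-test ~]# vgcreate data /dev/sda1
 [root@rhel-test ~]# lvcreate -L 400G -n slowdisk data
 [root@rhel-test ~]# mkfs.xfs /dev/data/slowdisk
 [root@rhel-test ~]# mkdir -p /testing && mount /dev/data/slowdisk /testing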

I will create a 2.5GB file on the slowdisk logical volume, and for testing will measure the time it takes to copy this file from the slowdisk to the root filesystem (this is hosted on the 'fast' mSATA disk, and it is therefore reasonable to assume that the time overhead will be predominantly caused by the read from the slow disk).
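How the 2.5GB test file was created is not shown; one way to produce such a file (dd is used here only to generate the file, not as the benchmark, and the filename matches the one used in the copy tests below) would be:

 [root@rhel-test ~]# dd if=/dev/urandom of=/testing/2.5GB.testfile bs=1M count=2560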

As can be seen, the logical volumes are as follows:

[root@rhel-test ~]# lvs -a
   LV       VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
   slowdisk data -wi-ao---- 400.00g
   home     rhel -wi-ao----  60.29g
   root     rhel -wi-ao----  50.00g
   swap     rhel -wi-ao----   7.75g

Above is a diagram depicting the filesystem configuration.

First test

The first test is to copy the file from the slow disk to the root filesystem (hosted on the fast mSATA drive).

[root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m21.464s
 user    0m0.018s
 sys     0m1.719s

In the command above, the kernel file cache is cleared before the copy of the test file is timed. As was noted earlier, the Linux kernel file cache can make file operations appear much faster than the underlying disks can actually perform. Given that this article is specifically focussed on testing the performance of the underlying disks, it is important to drop the file cache before running each iteration of the test.

It is also recommended to repeat each test multiple times - for my testing, I repeated the test five times and have taken mean values of the copy times.
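A small loop like the following could automate the repeated runs and compute the mean. This is a sketch rather than the exact method used for the article: it uses GNU time's -f option to log only the elapsed (real) time, and /root/copy-times.txt is just an illustrative filename.

 [root@rhel-test ~]# for i in $(seq 5); do echo 3 > /proc/sys/vm/drop_caches; /usr/bin/time -f "%e" -a -o /root/copy-times.txt cp /testing/2.5GB.testfile /root/; done
 [root@rhel-test ~]# awk '{ sum += $1 } END { printf "mean real time: %.3f seconds\n", sum/NR }' /root/copy-times.txt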

Slow disk to fast disk mean copy time:

real 21.467 seconds
user 0.0144 seconds
sys 1.7394 seconds

These values represent the baseline performance. Hopefully, by putting a cache disk in place, these values can be improved.

Setting up the cache

The 130GB SSD (/dev/sdb) is added to the data volume group, and a cachedisk logical volume is created, along with a smaller metadata volume. Instructions for creating the cache volumes are available at the Red Hat Customer Portal.
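The SSD first needs to be made available to the data volume group as a physical volume. A minimal sketch of that step, assuming a single partition /dev/sdb1 has already been created on the SSD, would be:

 [root@rhel-test ~]# pvcreate /dev/sdb1
 [root@rhel-test ~]# vgextend data /dev/sdb1

With /dev/sdb1 in the volume group, the cache data volume and the smaller metadata volume are created on it: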

 [root@rhel-test ~]# lvcreate -L 100G -n cachedisk data /dev/sdb1
   Logical volume "cachedisk" created.
 [root@rhel-test ~]# lvcreate -L 4G -n metadisk data /dev/sdb1
   Logical volume "metadisk" created.

The two volumes are first converted into a cache pool, which is then attached to the slowdisk logical volume.

[root@rhel-test ~]# lvconvert --type cache-pool /dev/data/cachedisk --poolmetadata /dev/data/metadisk
   Using 128.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less then 1000000 chunks.
   WARNING: Converting logical volume data/cachedisk and data/metadisk to cache pool's data and metadata volumes with metadata wiping.
   THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
 Do you really want to convert data/cachedisk and data/metadisk? [y/n]: y
   Converted data/cachedisk to cache pool.
 [root@rhel-test ~]# lvconvert --type cache /dev/data/slowdisk --cachepool /dev/data/cachedisk
 Do you want wipe existing metadata of cache pool volume data/cachedisk? [y/n]: y
   Logical volume data/slowdisk is now cached.


The logical volume configuration can be checked using the lvs command:

[root@rhel-test ~]# lvs -a
   LV                VG   Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert
   [cachedisk]       data Cwi---C--- 100.00g                              0.00   0.16            0.00
   [cachedisk_cdata] data Cwi-ao---- 100.00g
   [cachedisk_cmeta] data ewi-ao----   4.00g
   [lvol0_pmspare]   data ewi-------   4.00g
   slowdisk          data Cwi-aoC--- 400.00g [cachedisk] [slowdisk_corig] 0.00   0.16            0.00
   [slowdisk_corig]  data owi-aoC--- 400.00g
   home              rhel -wi-ao----  60.29g
   root              rhel -wi-ao----  50.00g
   swap              rhel -wi-ao----   7.75g

From the above output, the slowdisk logical volume now has the cachedisk logical volume attached as its cache. Cache utilization is currently 0.00%, although a small amount of metadata has already been written.
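As mentioned at the start, one attraction of this layered approach is that the cache device can be swapped out later without disturbing the data on the slow disk. Should the SSD ever need replacing, the cache could be detached with one of the following lvconvert operations (a sketch only; run one or the other, not both):

 [root@rhel-test ~]# lvconvert --splitcache data/slowdisk   # detach the cache, keeping the cache pool LV
 [root@rhel-test ~]# lvconvert --uncache data/slowdisk      # detach the cache and delete the cache pool LV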

Above is a diagram depicting the updated volume configuration, with the cache attached to the slowdisk logical volume. The /dev/sdc device that hosts the root filesystem has been omitted from this diagram but remains the same as in the previous one.

First cache test

With the cache in place, the tests can be re-run:

[root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m21.560s
 user    0m0.009s
 sys     0m1.804s

“These numbers are rubbish! There’s no improvement by using the cache device, and I’ve wasted my money buying this expensive SSD!”

This is the first run since adding the cache device. There is currently no data cached, and it is therefore normal and expected that there would be no performance improvement. The test needs to be repeated.

Second cache test

[root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m22.312s
 user    0m0.009s
 sys     0m1.607s

“You’re joking?! The performance has actually dropped! This ‘caching’ just doesn’t work!”

Initially, the figures don’t look very promising at all. This second run has taken slightly longer than the first run. Looking at the logical volume properties:

[root@rhel-test ~]# lvs -a
   LV                VG   Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert
   [cachedisk]       data Cwi---C--- 100.00g                              0.15   0.16            0.00
   [cachedisk_cdata] data Cwi-ao---- 100.00g
   [cachedisk_cmeta] data ewi-ao----   4.00g
   [lvol0_pmspare]   data ewi-------   4.00g
   slowdisk          data Cwi-aoC--- 400.00g [cachedisk] [slowdisk_corig] 0.15   0.16            0.00
   [slowdisk_corig]  data owi-aoC--- 400.00g
   home              rhel -wi-ao----  60.29g
   root              rhel -wi-ao----  50.00g
   swap              rhel -wi-ao----   7.75g

Although the copy job has taken slightly longer, we can see that the cachedisk is now beginning to be utilized. It is only 0.15% utilized, which equates to roughly 150MB of the 100GB cache. Given that 2.5GB of data is copied each time, the data being accessed has certainly not all been promoted to the cache.

This, again, is by design. dm-cache has been designed as a hot-spot cache targeted towards read caching. The cache builds up gradually over time, promoting the most frequently accessed data; it does not fill up quickly with recently accessed data. This behavior means there should be less ‘cache thrashing’, with items being regularly added to and dropped from the cache, and it delivers greater long-term performance benefit.

However, for our testing purposes, this means the tests must be rerun several times to populate the cache before the performance benefit can be measured.
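Between runs, the state of the cache can also be inspected in more detail than the Data% column alone. A sketch, assuming the device-mapper name data-slowdisk (VG-LV) and the cache report fields provided by lvs (check 'lvs -o help' for the exact field names in your lvm2 version), might be:

 [root@rhel-test ~]# lvs -o lv_name,data_percent,cache_used_blocks,cache_total_blocks,cache_read_hits,cache_read_misses data/slowdisk
 [root@rhel-test ~]# dmsetup status data-slowdisk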

Third cache test

 [root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m23.641s
 user    0m0.006s
 sys     0m1.614s

“Still slow, more doubts are beginning to creep in…” Patience and persistence will be rewarded, though.

Fourth cache test

 [root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m22.452s
 user    0m0.004s
 sys     0m1.620s

After four runs with the cache device, performance is back to where we started. However, the cache device is increasing in utilization. Hopefully, performance payback is just around the corner.

[root@rhel-test ~]# lvs -a
   LV                VG   Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert
   [cachedisk]       data Cwi---C--- 100.00g                              0.43   0.16            0.00
   [cachedisk_cdata] data Cwi-ao---- 100.00g
   [cachedisk_cmeta] data ewi-ao----   4.00g
   [lvol0_pmspare]   data ewi-------   4.00g
   slowdisk          data Cwi-aoC--- 400.00g [cachedisk] [slowdisk_corig] 0.43   0.16            0.00
   [slowdisk_corig]  data owi-aoC--- 400.00g
   home              rhel -wi-ao----  60.29g
   root              rhel -wi-ao----  50.00g
   swap              rhel -wi-ao----   7.75g

Fifth cache test

 [root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m20.279s
 user    0m0.004s
 sys     0m1.605s

The fastest copy time yet! Albeit only approximately 1 second faster than with no cache at all. There should be room for further improvement though, as the cache is still being populated.

Sixth cache test

 [root@rhel-test ~]# echo 3 > /proc/sys/vm/drop_caches && time cp /testing/2.5GB.testfile /root/

 real    0m19.778s
 user    0m0.005s
 sys     0m1.549s

More time has been shaved off the copy job. At this point, we’ll skip ahead … many test runs later

Table of results

 Test Run Number    Mean (real) copy time (s)
       1                 21.47
       2                 21.53
      ...                  ...
      13                 14.80
      ...                  ...
      21                  5.80
      22                  5.02
      23                  5.02

Table of mean time spent in real mode during the file copy operation.

Test Run Number 1 was performed with no cache device configured. Test Run Number 2 and subsequent tests were performed with the cache device present.


Above is a graph depicting how the copy time decreased as more iterations of the test were run.

After running the test 22 times, no performance improvement was found with further iterations of the test. From the output of lvs, it is obvious that the cache has now been filled with 2.5GB of data (the file that has been copied across 20+ times).

[root@rhel-test ~]# lvs -a
 LV                VG   Attr       LSize   Pool        Origin           Data%  Meta%  Move Log Cpy%Sync Convert
 [cachedisk]       data Cwi---C--- 100.00g                              2.50   0.16            0.00
 [cachedisk_cdata] data Cwi-ao---- 100.00g
 [cachedisk_cmeta] data ewi-ao----   4.00g
 [lvol0_pmspare]   data ewi-------   4.00g
 slowdisk          data Cwi-aoC--- 400.00g [cachedisk] [slowdisk_corig] 2.50   0.16            0.00
 [slowdisk_corig]  data owi-aoC--- 400.00g
 home              rhel -wi-ao----  60.29g
 root              rhel -wi-ao----  50.00g
 swap              rhel -wi-ao----   7.75g



Graph depicting how the cache utilization increased as more test runs were performed.

Conclusions and summary

The results above demonstrate that implementing dm-cache on a fast device in front of a larger, slower disk can provide significant performance gains for specific use cases. Data that is frequently read is promoted to the cache, and the tests have shown that this can deliver a significant increase in read performance. If data is read only once, dm-cache does not offer any improvement.

Using dm-cache can provide a performance benefit for file access on a server, but it does not replace the kernel file cache, and it is not a good fit for random file access. It excels as a read cache where frequently accessed files (hot-spots) are promoted to the cache over multiple accesses (a slow-fill cache). Once the cache is populated, read performance should increase.

The Linux kernel file cache will generally perform considerably faster than an SSD (or NVMe) based dm-cache device, as physical RAM is still significantly faster than solid state storage. However, an SSD-based dm-cache survives a server reboot and is not as ephemeral as the kernel file cache. The kernel also frees its in-memory cache as processes demand memory allocations, whereas a device-backed dm-cache provides a defined cache capacity.

References

The inspiration for this blog entry came from a customer case. Subsequent research led to a three-year-old post that Richard Jones made on his blog while trying to optimise the performance of his virtual machines, and to the discussions he had with the dm-cache developers on the linux-lvm mailing list.

It is worth noting that, as always, software has moved on: the tuning parameters that Richard was advised to alter to change the dm-cache characteristics have now mostly been deprecated. This is a result of the change in the default cache policy used by dm-cache, which simplifies its use by removing those parameters.

A more recent discussion, this time on the dm-devel mailing list, details the newer cache policy (smq) and the tuning options available. For the tests performed for this blog, I didn’t alter any of the default settings for dm-cache. I have also displayed the cache utilization simply by using the output of the ‘lvs -a’ command; however, there are other tools and scripts available that people have put together. One such example that I found helpful was created by Armin Hammer.
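For completeness, the policy and settings in use on a cache LV can be inspected, and changed if required, through lvm2. The following is a sketch only (field names per 'lvs -o help', and the migration_threshold value is purely illustrative); as noted above, none of this was necessary for the tests in this article:

 [root@rhel-test ~]# lvs -o+cache_policy,cache_settings data/slowdisk
 [root@rhel-test ~]# lvchange --cachepolicy smq data/slowdisk
 [root@rhel-test ~]# lvchange --cachesettings 'migration_threshold=2048' data/slowdisk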

Jonathan Ervine is a TAM from Hong Kong. He is providing support to enterprise customers in the financial, logistics, and technology sectors in the APAC region. Recently, Jonathan has been helping his customers deploy private cloud infrastructure and maintaining their existing platform deployments on a supported platform. More about Jonathan.

A Red Hat Technical Account Manager (TAM) is a specialized product expert who works collaboratively with IT organizations to strategically plan for successful deployments and help realize optimal performance and growth. The TAM is part of Red Hat’s world class Customer Experience and Engagement organization and provides proactive advice and guidance to help you identify and address potential problems before they occur. Should a problem arise, your TAM will own the issue and engage the best resources to resolve it as quickly as possible with minimal disruption to your business.
