[dm-devel] [PATCH for-3.14] Add dm-writeboost (log-structured caching target)

Akira Hayakawa ruby.wktk at gmail.com
Sat Jan 11 07:36:02 UTC 2014


dm-writeboost is another caching target, like dm-cache and bcache.
The biggest difference from the existing cache software is that
it focuses on bursty writes.

dm-writeboost first writes the data to a RAM buffer and builds a
log containing both the data and its metadata.
The log is then written to the cache device in a log-structured manner.
Because the log contains the metadata of the data blocks,
dm-writeboost is robust against power failure: it can replay the log
after a crash.
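
A minimal usage sketch (device and variable names here are only
illustrative; see Documentation/device-mapper/dm-writeboost.txt for the
full constructor and message syntax):

  # ${BACKING} is the slow backing device, ${CACHE} the fast SSD
  sz=$(blockdev --getsz ${BACKING})
  dmsetup create wbdev --table "0 ${sz} writeboost 0 ${BACKING} ${CACHE}"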

Signed-off-by: Akira Hayakawa <ruby.wktk at gmail.com>
---
 Documentation/device-mapper/dm-writeboost.txt |  161 +++
 drivers/md/Kconfig                            |    8 +
 drivers/md/Makefile                           |    3 +
 drivers/md/dm-writeboost-daemon.c             |  520 ++++++++++
 drivers/md/dm-writeboost-daemon.h             |   40 +
 drivers/md/dm-writeboost-metadata.c           | 1352 +++++++++++++++++++++++++
 drivers/md/dm-writeboost-metadata.h           |   51 +
 drivers/md/dm-writeboost-target.c             | 1258 +++++++++++++++++++++++
 drivers/md/dm-writeboost.h                    |  464 +++++++++
 9 files changed, 3857 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-writeboost.txt
 create mode 100644 drivers/md/dm-writeboost-daemon.c
 create mode 100644 drivers/md/dm-writeboost-daemon.h
 create mode 100644 drivers/md/dm-writeboost-metadata.c
 create mode 100644 drivers/md/dm-writeboost-metadata.h
 create mode 100644 drivers/md/dm-writeboost-target.c
 create mode 100644 drivers/md/dm-writeboost.h

diff --git a/Documentation/device-mapper/dm-writeboost.txt b/Documentation/device-mapper/dm-writeboost.txt
new file mode 100644
index 0000000..0161663
--- /dev/null
+++ b/Documentation/device-mapper/dm-writeboost.txt
@@ -0,0 +1,161 @@
+dm-writeboost
+=============
+Writeboost target provides log-structured caching.
+It batches random writes into a big sequential write to a cache device.
+
+Like dm-cache, it is a caching target, but Writeboost focuses on
+bursty writes and the lifetime of the SSD cache device.
+
+More documentation and tests are available at
+https://github.com/akiradeveloper/dm-writeboost
+
+Design
+======
+Writeboost consists of one foreground process and six background processes.
+
+Foreground
+----------
+It accepts bios and stores the write data in a RAM buffer.
+When the buffer is full, it creates a "flush job" and queues it.
+
+Background
+----------
+* wbflusher (Writeboost flusher)
+Executes flush jobs.
+wbflusher uses the workqueue mechanism and may run in parallel.
+It exposes a sysfs interface (/sys/bus/workqueue/devices/wbflusher)
+to control its behavior.
+
+* Barrier deadline worker
+Bios with barrier flags such as REQ_FUA and REQ_FLUSH are acked lazily,
+because handling them immediately badly deteriorates throughput.
+Such bios are queued and forcibly processed at the latest
+within the `barrier_deadline_ms` period.
+
+* Migrate Daemon
+It migrates, or writes back, cached data to the backing store.
+
+If `allow_migrate` is true, it migrates even when there is no impending
+situation. An impending situation is one in which there is no room left
+in the cache device for writing further flush jobs.
+
+Migration is batched, up to `nr_max_batched_migration` segments at a
+time. Thus, unlike an ordinary I/O scheduler, two dirty writes that are
+close in position but distant in time can be merged. In this sense,
+Writeboost is also an extension of the I/O scheduler.
+
+* Migration Modulator
+Migrating while the backing store is heavily loaded lengthens the device
+queue and hurts reads from the backing store.
+The migration modulator watches the load of the backing store and turns
+migration on/off by switching `allow_migrate`.
+
+* Superblock Recorder
+The superblock record is the last sector of the first 1MB region of the
+cache device and holds the id of the most recently migrated segment.
+This daemon updates the record every `update_record_interval` seconds.
+
+* Sync Daemon
+This daemon forcibly makes all the dirty data persistent every
+`sync_interval` seconds, for careful users who want all writes
+made persistent periodically.
+
+Target Interface
+================
+All operations are performed via the dmsetup command.
+
+Constructor
+-----------
+<type>
+<essential args>*
+<#optional args> <optional args>*
+<#tunable args> <tunable args>* (see 'Message')
+
+Optional args and tunable args are unordered lists of key-value pairs.
+
+Essential args and optional args differ depending on the buffer type.
+
+<type> (The type of the RAM buffer)
+0: volatile RAM buffer (DRAM)
+1: non-volatile buffer with a block I/F
+2: non-volatile buffer with PRAM I/F
+
+Currently, only type 0 is supported.
+
+Type 0
+------
+<essential args>
+backing_dev        : Slow device holding original data blocks.
+cache_dev          : Fast device holding cached data and its metadata.
+
+<optional args>
+segment_size_order : The size of RAM buffer
+                     1 << n (sectors), 4 <= n <= 10
+                     default 7
+rambuf_pool_amount : The amount of the RAM buffer pool (kB).
+                     Too small an amount may cause waits for a new
+                     buffer to become available again, while too much
+                     does not benefit the performance.
+                     default 2048
+
+Note that the cache device is re-formatted if the first sector of the
+cache device is zeroed out.
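+For example (assuming ${CACHE} is the cache device and you really intend
+to discard its current contents):
+dd if=/dev/zero of=${CACHE} bs=512 count=1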
+
+Status
+------
+<cursor pos>
+<#cache blocks>
+<#segments>
+<current id>
+<last flushed id>
+<last migrated id>
+<#dirty cache blocks>
+<stat (w/r) x (hit/miss) x (on buffer?) x (fullsize?)>
+<#not full flushed>
+<#tunable args> [tunable args]
+
+Messages
+--------
+You can tune the behavior of Writeboost via the message interface;
+see the example below this list.
+
+* barrier_deadline_ms (ms)
+Default: 3
+All the bios with barrier flags like REQ_FUA or REQ_FLUSH
+are guaranteed to be acked within this deadline.
+
+* allow_migrate (bool)
+Default: 1
+Set to 1 to allow migration; set to 0 to stop it.
+
+* enable_migration_modulator (bool) and
+  migrate_threshold (%)
+Default: 1 and 70
+Set to 1 to run the migration modulator.
+The migration modulator watches the load of the backing store and
+allows migration to start when the load is lower than `migrate_threshold`.
+
+* nr_max_batched_migration (int)
+Default: 1MB / segment size
+Number of segments to migrate at a time.
+Set a higher value to fully exploit the capability of the backing store.
+Even a single HDD can process 1MB/sec of random writes, so the
+default value is set to 1MB / segment size. Set a higher value if
+you use a RAID-ed drive as the backing store.
+
+* update_record_interval (sec)
+Default: 60
+The superblock record is updated every update_record_interval seconds.
+
+* sync_interval (sec)
+Default: 60
+All the dirty writes are guaranteed to be made persistent within this
+interval.
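+
+For example, assuming the mapped device was created with the name
+"writeboost-vol", a tunable can be changed at runtime with a message
+of the form:
+dmsetup message writeboost-vol 0 allow_migrate 0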
+
+Example
+=======
+dmsetup create writeboost-vol --table "0 ${sz} writeboost 0 ${BACKING} ${CACHE}"
+dmsetup create writeboost-vol --table "0 ${sz} writeboost 0 ${BACKING} ${CACHE} \
+                                       4 rambuf_pool_amount 8192 segment_size_order 8 \
+                                       2 allow_migrate 1"
+dmsetup create writeboost-vol --table "0 ${sz} writeboost 0 ${BACKING} ${CACHE} \
+                                       0 \
+                                       2 allow_migrate 1"
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index f2ccbc3..65a6d95 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -290,6 +290,14 @@ config DM_CACHE_CLEANER
          A simple cache policy that writes back all data to the
          origin.  Used when decommissioning a dm-cache.
 
+config DM_WRITEBOOST
+	tristate "Log-structured Caching (EXPERIMENTAL)"
+	depends on BLK_DEV_DM
+	default n
+	---help---
+	  A cache layer that batches random writes into a big sequential
+	  write to a cache device in log-structured manner.
+
 config DM_MIRROR
        tristate "Mirror target"
        depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 2acc43f..6db61ce 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -14,6 +14,8 @@ dm-thin-pool-y	+= dm-thin.o dm-thin-metadata.o
 dm-cache-y	+= dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o
 dm-cache-mq-y   += dm-cache-policy-mq.o
 dm-cache-cleaner-y += dm-cache-policy-cleaner.o
+dm-writeboost-y	+= dm-writeboost-target.o dm-writeboost-metadata.o \
+			dm-writeboost-daemon.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o
 
@@ -52,6 +54,7 @@ obj-$(CONFIG_DM_VERITY)		+= dm-verity.o
 obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
 obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
+obj-$(CONFIG_DM_WRITEBOOST)	+= dm-writeboost.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-writeboost-daemon.c b/drivers/md/dm-writeboost-daemon.c
new file mode 100644
index 0000000..5ea1300
--- /dev/null
+++ b/drivers/md/dm-writeboost-daemon.c
@@ -0,0 +1,520 @@
+/*
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm-writeboost.h"
+#include "dm-writeboost-metadata.h"
+#include "dm-writeboost-daemon.h"
+
+/*----------------------------------------------------------------*/
+
+static void update_barrier_deadline(struct wb_device *wb)
+{
+	mod_timer(&wb->barrier_deadline_timer,
+		  jiffies + msecs_to_jiffies(ACCESS_ONCE(wb->barrier_deadline_ms)));
+}
+
+void queue_barrier_io(struct wb_device *wb, struct bio *bio)
+{
+	mutex_lock(&wb->io_lock);
+	bio_list_add(&wb->barrier_ios, bio);
+	mutex_unlock(&wb->io_lock);
+
+	if (!timer_pending(&wb->barrier_deadline_timer))
+		update_barrier_deadline(wb);
+}
+
+void barrier_deadline_proc(unsigned long data)
+{
+	struct wb_device *wb = (struct wb_device *) data;
+	schedule_work(&wb->barrier_deadline_work);
+}
+
+void flush_barrier_ios(struct work_struct *work)
+{
+	struct wb_device *wb = container_of(
+		work, struct wb_device, barrier_deadline_work);
+
+	if (bio_list_empty(&wb->barrier_ios))
+		return;
+
+	atomic64_inc(&wb->count_non_full_flushed);
+	flush_current_buffer(wb);
+}
+
+/*----------------------------------------------------------------*/
+
+static void
+process_deferred_barriers(struct wb_device *wb, struct flush_job *job)
+{
+	int r = 0;
+	bool has_barrier = !bio_list_empty(&job->barrier_ios);
+
+	/*
+	 * Make all the data until now persistent.
+	 */
+	if (has_barrier)
+		IO(blkdev_issue_flush(wb->cache_dev->bdev, GFP_NOIO, NULL));
+
+	/*
+	 * Ack the chained barrier requests.
+	 */
+	if (has_barrier) {
+		struct bio *bio;
+		while ((bio = bio_list_pop(&job->barrier_ios))) {
+			LIVE_DEAD(
+				bio_endio(bio, 0),
+				bio_endio(bio, -EIO)
+			);
+		}
+	}
+
+	if (has_barrier)
+		update_barrier_deadline(wb);
+}
+
+void flush_proc(struct work_struct *work)
+{
+	int r = 0;
+
+	struct flush_job *job = container_of(work, struct flush_job, work);
+
+	struct wb_device *wb = job->wb;
+	struct segment_header *seg = job->seg;
+
+	struct dm_io_request io_req = {
+		.client = wb_io_client,
+		.bi_rw = WRITE,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = job->rambuf->data,
+	};
+	struct dm_io_region region = {
+		.bdev = wb->cache_dev->bdev,
+		.sector = seg->start_sector,
+		.count = (seg->length + 1) << 3,
+	};
+
+	/*
+	 * The actual write requests to the cache device are not serialized.
+	 * They may be performed in parallel.
+	 */
+	IO(dm_safe_io(&io_req, 1, &region, NULL, false));
+
+	/*
+	 * Deferred ACK for barrier requests.
+	 * To serialize the ACKs of barriers in logging order, we wait for
+	 * the previous segment to be persistently written (if needed).
+	 */
+	wait_for_flushing(wb, SUB_ID(seg->id, 1));
+
+	process_deferred_barriers(wb, job);
+
+	/*
+	 * We can count up the last_flushed_segment_id only after segment
+	 * is written persistently. Counting up the id is serialized.
+	 */
+	atomic64_inc(&wb->last_flushed_segment_id);
+	wake_up_interruptible(&wb->flush_wait_queue);
+
+	mempool_free(job, wb->flush_job_pool);
+}
+
+void wait_for_flushing(struct wb_device *wb, u64 id)
+{
+	wait_event_interruptible(wb->flush_wait_queue,
+		atomic64_read(&wb->last_flushed_segment_id) >= id);
+}
+
+/*----------------------------------------------------------------*/
+
+static void migrate_endio(unsigned long error, void *context)
+{
+	struct wb_device *wb = context;
+
+	if (error)
+		atomic_inc(&wb->migrate_fail_count);
+
+	if (atomic_dec_and_test(&wb->migrate_io_count))
+		wake_up_interruptible(&wb->migrate_io_wait_queue);
+}
+
+/*
+ * Asynchronously submit the data of the segment at position k in the
+ * migrate buffer. Batched migration first collects all the segments to
+ * migrate into the migrate buffer, so the buffer holds the data of
+ * several segments; this function submits the writes for the k-th one.
+ */
+static void submit_migrate_io(struct wb_device *wb, struct segment_header *seg,
+			      size_t k)
+{
+	int r = 0;
+
+	size_t a = wb->nr_caches_inseg * k;
+	void *p = wb->migrate_buffer + (wb->nr_caches_inseg << 12) * k;
+
+	u8 i;
+	for (i = 0; i < seg->length; i++) {
+		unsigned long offset = i << 12;
+		void *base = p + offset;
+
+		struct metablock *mb = seg->mb_array + i;
+		u8 dirty_bits = *(wb->dirtiness_snapshot + (a + i));
+		if (!dirty_bits)
+			continue;
+
+		if (dirty_bits == 255) {
+			void *addr = base;
+			struct dm_io_request io_req_w = {
+				.client = wb_io_client,
+				.bi_rw = WRITE,
+				.notify.fn = migrate_endio,
+				.notify.context = wb,
+				.mem.type = DM_IO_VMA,
+				.mem.ptr.vma = addr,
+			};
+			struct dm_io_region region_w = {
+				.bdev = wb->origin_dev->bdev,
+				.sector = mb->sector,
+				.count = 1 << 3,
+			};
+			IO(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
+		} else {
+			u8 j;
+			for (j = 0; j < 8; j++) {
+				struct dm_io_request io_req_w;
+				struct dm_io_region region_w;
+
+				void *addr = base + (j << SECTOR_SHIFT);
+				bool bit_on = dirty_bits & (1 << j);
+				if (!bit_on)
+					continue;
+
+				io_req_w = (struct dm_io_request) {
+					.client = wb_io_client,
+					.bi_rw = WRITE,
+					.notify.fn = migrate_endio,
+					.notify.context = wb,
+					.mem.type = DM_IO_VMA,
+					.mem.ptr.vma = addr,
+				};
+				region_w = (struct dm_io_region) {
+					.bdev = wb->origin_dev->bdev,
+					.sector = mb->sector + j,
+					.count = 1,
+				};
+				IO(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
+			}
+		}
+	}
+}
+
+static void memorize_data_to_migrate(struct wb_device *wb,
+				     struct segment_header *seg, size_t k)
+{
+	int r = 0;
+
+	void *p = wb->migrate_buffer + (wb->nr_caches_inseg << 12) * k;
+	struct dm_io_request io_req_r = {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_VMA,
+		.mem.ptr.vma = p,
+	};
+	struct dm_io_region region_r = {
+		.bdev = wb->cache_dev->bdev,
+		.sector = seg->start_sector + (1 << 3),
+		.count = seg->length << 3,
+	};
+	IO(dm_safe_io(&io_req_r, 1, &region_r, NULL, false));
+}
+
+/*
+ * We first take a snapshot of the dirtiness of the segments.
+ * The snapshot is at least as dirty as any future state because
+ * dirtiness only decreases monotonically after a segment is flushed.
+ * Therefore, migrating according to the snapshot migrates the dirtiest
+ * possible state of the segments and never loses dirty data.
+ */
+static void memorize_metadata_to_migrate(struct wb_device *wb, struct segment_header *seg,
+					 size_t k, size_t *migrate_io_count)
+{
+	u8 i, j;
+
+	struct metablock *mb;
+	size_t a = wb->nr_caches_inseg * k;
+
+	/*
+	 * We first record the dirtiness of the metablocks.
+	 * The dirtiness may decrease while we run through the migration
+	 * code, which could otherwise cause corruption.
+	 */
+	for (i = 0; i < seg->length; i++) {
+		mb = seg->mb_array + i;
+		*(wb->dirtiness_snapshot + (a + i)) = read_mb_dirtiness(wb, seg, mb);
+	}
+
+	for (i = 0; i < seg->length; i++) {
+		u8 dirty_bits = *(wb->dirtiness_snapshot + (a + i));
+
+		if (!dirty_bits)
+			continue;
+
+		if (dirty_bits == 255) {
+			(*migrate_io_count)++;
+		} else {
+			for (j = 0; j < 8; j++) {
+				if (dirty_bits & (1 << j))
+					(*migrate_io_count)++;
+			}
+		}
+	}
+}
+
+/*
+ * Memorize the dirtiness snapshot and count up the number of io to migrate.
+ */
+static void memorize_dirty_state(struct wb_device *wb, struct segment_header *seg,
+				 size_t k, size_t *migrate_io_count)
+{
+	memorize_data_to_migrate(wb, seg, k);
+	memorize_metadata_to_migrate(wb, seg, k, migrate_io_count);
+}
+
+static void cleanup_segment(struct wb_device *wb, struct segment_header *seg)
+{
+	u8 i;
+	for (i = 0; i < seg->length; i++) {
+		struct metablock *mb = seg->mb_array + i;
+		cleanup_mb_if_dirty(wb, seg, mb);
+	}
+}
+
+static void transport_emigrates(struct wb_device *wb)
+{
+	int r;
+	struct segment_header *seg;
+	size_t k, migrate_io_count = 0;
+
+	for (k = 0; k < wb->num_emigrates; k++) {
+		seg = *(wb->emigrates + k);
+		memorize_dirty_state(wb, seg, k, &migrate_io_count);
+	}
+
+migrate_write:
+	atomic_set(&wb->migrate_io_count, migrate_io_count);
+	atomic_set(&wb->migrate_fail_count, 0);
+
+	for (k = 0; k < wb->num_emigrates; k++) {
+		seg = *(wb->emigrates + k);
+		submit_migrate_io(wb, seg, k);
+	}
+
+	LIVE_DEAD(
+		wait_event_interruptible(wb->migrate_io_wait_queue,
+					 !atomic_read(&wb->migrate_io_count)),
+		atomic_set(&wb->migrate_io_count, 0));
+
+	if (atomic_read(&wb->migrate_fail_count)) {
+		WBWARN("%u writebacks failed. retry",
+		       atomic_read(&wb->migrate_fail_count));
+		goto migrate_write;
+	}
+	BUG_ON(atomic_read(&wb->migrate_io_count));
+
+	/*
+	 * We clean up the metablocks because there is no reason
+	 * to leave them dirty.
+	 */
+	for (k = 0; k < wb->num_emigrates; k++) {
+		seg = *(wb->emigrates + k);
+		cleanup_segment(wb, seg);
+	}
+
+	/*
+	 * Strictly, we only need to write back a segment persistently
+	 * if the segment itself was written persistently.
+	 * However, remembering which segments are persistent is too
+	 * expensive and of little benefit.
+	 * So we treat all segments as persistent and write them back
+	 * persistently.
+	 */
+	IO(blkdev_issue_flush(wb->origin_dev->bdev, GFP_NOIO, NULL));
+}
+
+static void do_migrate_proc(struct wb_device *wb)
+{
+	u32 i, nr_mig_candidates, nr_mig, nr_max_batch;
+	struct segment_header *seg;
+
+	bool start_migrate = ACCESS_ONCE(wb->allow_migrate) ||
+			     ACCESS_ONCE(wb->urge_migrate)  ||
+			     ACCESS_ONCE(wb->force_drop);
+
+	if (!start_migrate) {
+		schedule_timeout_interruptible(msecs_to_jiffies(1000));
+		return;
+	}
+
+	nr_mig_candidates = atomic64_read(&wb->last_flushed_segment_id) -
+			    atomic64_read(&wb->last_migrated_segment_id);
+
+	if (!nr_mig_candidates) {
+		schedule_timeout_interruptible(msecs_to_jiffies(1000));
+		return;
+	}
+
+	nr_max_batch = ACCESS_ONCE(wb->nr_max_batched_migration);
+	if (wb->nr_cur_batched_migration != nr_max_batch)
+		try_alloc_migration_buffer(wb, nr_max_batch);
+	nr_mig = min(nr_mig_candidates, wb->nr_cur_batched_migration);
+
+	/*
+	 * Store emigrates
+	 */
+	for (i = 0; i < nr_mig; i++) {
+		seg = get_segment_header_by_id(wb,
+			atomic64_read(&wb->last_migrated_segment_id) + 1 + i);
+		*(wb->emigrates + i) = seg;
+	}
+	wb->num_emigrates = nr_mig;
+	transport_emigrates(wb);
+
+	atomic64_add(nr_mig, &wb->last_migrated_segment_id);
+	wake_up_interruptible(&wb->migrate_wait_queue);
+}
+
+int migrate_proc(void *data)
+{
+	struct wb_device *wb = data;
+	while (!kthread_should_stop())
+		do_migrate_proc(wb);
+	return 0;
+}
+
+/*
+ * Wait for a segment to be migrated.
+ * Once it is migrated, the metablocks in the segment are clean.
+ */
+void wait_for_migration(struct wb_device *wb, u64 id)
+{
+	wb->urge_migrate = true;
+	wake_up_process(wb->migrate_daemon);
+	wait_event_interruptible(wb->migrate_wait_queue,
+		atomic64_read(&wb->last_migrated_segment_id) >= id);
+	wb->urge_migrate = false;
+}
+
+/*----------------------------------------------------------------*/
+
+int modulator_proc(void *data)
+{
+	struct wb_device *wb = data;
+
+	struct hd_struct *hd = wb->origin_dev->bdev->bd_part;
+	unsigned long old = 0, new, util;
+	unsigned long intvl = 1000;
+
+	while (!kthread_should_stop()) {
+		new = jiffies_to_msecs(part_stat_read(hd, io_ticks));
+
+		if (!ACCESS_ONCE(wb->enable_migration_modulator))
+			goto modulator_update;
+
+		util = div_u64(100 * (new - old), 1000);
+
+		if (util < ACCESS_ONCE(wb->migrate_threshold))
+			wb->allow_migrate = true;
+		else
+			wb->allow_migrate = false;
+
+modulator_update:
+		old = new;
+
+		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
+	}
+	return 0;
+}
+
+/*----------------------------------------------------------------*/
+
+static void update_superblock_record(struct wb_device *wb)
+{
+	int r = 0;
+
+	struct superblock_record_device o;
+	void *buf;
+	struct dm_io_request io_req;
+	struct dm_io_region region;
+
+	o.last_migrated_segment_id =
+		cpu_to_le64(atomic64_read(&wb->last_migrated_segment_id));
+
+	buf = mempool_alloc(wb->buf_1_pool, GFP_NOIO | __GFP_ZERO);
+	memcpy(buf, &o, sizeof(o));
+
+	io_req = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = WRITE_FUA,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region = (struct dm_io_region) {
+		.bdev = wb->cache_dev->bdev,
+		.sector = (1 << 11) - 1,
+		.count = 1,
+	};
+	IO(dm_safe_io(&io_req, 1, &region, NULL, false));
+
+	mempool_free(buf, wb->buf_1_pool);
+}
+
+int recorder_proc(void *data)
+{
+	struct wb_device *wb = data;
+
+	unsigned long intvl;
+
+	while (!kthread_should_stop()) {
+		/* sec -> ms */
+		intvl = ACCESS_ONCE(wb->update_record_interval) * 1000;
+
+		if (!intvl) {
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));
+			continue;
+		}
+
+		update_superblock_record(wb);
+		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
+	}
+	return 0;
+}
+
+/*----------------------------------------------------------------*/
+
+int sync_proc(void *data)
+{
+	int r = 0;
+
+	struct wb_device *wb = data;
+	unsigned long intvl;
+
+	while (!kthread_should_stop()) {
+		/* sec -> ms */
+		intvl = ACCESS_ONCE(wb->sync_interval) * 1000;
+
+		if (!intvl) {
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));
+			continue;
+		}
+
+		flush_current_buffer(wb);
+		IO(blkdev_issue_flush(wb->cache_dev->bdev, GFP_NOIO, NULL));
+		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
+	}
+	return 0;
+}
diff --git a/drivers/md/dm-writeboost-daemon.h b/drivers/md/dm-writeboost-daemon.h
new file mode 100644
index 0000000..7e913db
--- /dev/null
+++ b/drivers/md/dm-writeboost-daemon.h
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_WRITEBOOST_DAEMON_H
+#define DM_WRITEBOOST_DAEMON_H
+
+/*----------------------------------------------------------------*/
+
+void flush_proc(struct work_struct *);
+void wait_for_flushing(struct wb_device *, u64 id);
+
+/*----------------------------------------------------------------*/
+
+void queue_barrier_io(struct wb_device *, struct bio *);
+void barrier_deadline_proc(unsigned long data);
+void flush_barrier_ios(struct work_struct *);
+
+/*----------------------------------------------------------------*/
+
+int migrate_proc(void *);
+void wait_for_migration(struct wb_device *, u64 id);
+
+/*----------------------------------------------------------------*/
+
+int modulator_proc(void *);
+
+/*----------------------------------------------------------------*/
+
+int sync_proc(void *);
+
+/*----------------------------------------------------------------*/
+
+int recorder_proc(void *);
+
+/*----------------------------------------------------------------*/
+
+#endif
diff --git a/drivers/md/dm-writeboost-metadata.c b/drivers/md/dm-writeboost-metadata.c
new file mode 100644
index 0000000..54a94f5
--- /dev/null
+++ b/drivers/md/dm-writeboost-metadata.c
@@ -0,0 +1,1352 @@
+/*
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm-writeboost.h"
+#include "dm-writeboost-metadata.h"
+#include "dm-writeboost-daemon.h"
+
+#include <linux/crc32c.h>
+
+/*----------------------------------------------------------------*/
+
+struct part {
+	void *memory;
+};
+
+struct large_array {
+	struct part *parts;
+	u64 nr_elems;
+	u32 elemsize;
+};
+
+#define ALLOC_SIZE (1 << 16)
+static u32 nr_elems_in_part(struct large_array *arr)
+{
+	return div_u64(ALLOC_SIZE, arr->elemsize);
+}
+
+static u64 nr_parts(struct large_array *arr)
+{
+	u64 a = arr->nr_elems;
+	u32 b = nr_elems_in_part(arr);
+	return div_u64(a + b - 1, b);
+}
+
+static struct large_array *large_array_alloc(u32 elemsize, u64 nr_elems)
+{
+	u64 i;
+
+	struct large_array *arr = kmalloc(sizeof(*arr), GFP_KERNEL);
+	if (!arr) {
+		WBERR("failed to allocate arr");
+		return NULL;
+	}
+
+	arr->elemsize = elemsize;
+	arr->nr_elems = nr_elems;
+	arr->parts = kmalloc(sizeof(struct part) * nr_parts(arr), GFP_KERNEL);
+	if (!arr->parts) {
+		WBERR("failed to allocate parts");
+		goto bad_alloc_parts;
+	}
+
+	for (i = 0; i < nr_parts(arr); i++) {
+		struct part *part = arr->parts + i;
+		part->memory = kmalloc(ALLOC_SIZE, GFP_KERNEL);
+		if (!part->memory) {
+			u8 j;
+
+			WBERR("failed to allocate part memory");
+			for (j = 0; j < i; j++) {
+				part = arr->parts + j;
+				kfree(part->memory);
+			}
+			goto bad_alloc_parts_memory;
+		}
+	}
+	return arr;
+
+bad_alloc_parts_memory:
+	kfree(arr->parts);
+bad_alloc_parts:
+	kfree(arr);
+	return NULL;
+}
+
+static void large_array_free(struct large_array *arr)
+{
+	size_t i;
+	for (i = 0; i < nr_parts(arr); i++) {
+		struct part *part = arr->parts + i;
+		kfree(part->memory);
+	}
+	kfree(arr->parts);
+	kfree(arr);
+}
+
+static void *large_array_at(struct large_array *arr, u64 i)
+{
+	u32 n = nr_elems_in_part(arr);
+	u32 k;
+	u64 j = div_u64_rem(i, n, &k);
+	struct part *part = arr->parts + j;
+	return part->memory + (arr->elemsize * k);
+}
+
+/*----------------------------------------------------------------*/
+
+/*
+ * Get the in-core metablock of the given index.
+ */
+static struct metablock *mb_at(struct wb_device *wb, u32 idx)
+{
+	u32 idx_inseg;
+	u32 seg_idx = div_u64_rem(idx, wb->nr_caches_inseg, &idx_inseg);
+	struct segment_header *seg =
+		large_array_at(wb->segment_header_array, seg_idx);
+	return seg->mb_array + idx_inseg;
+}
+
+static void mb_array_empty_init(struct wb_device *wb)
+{
+	u32 i;
+	for (i = 0; i < wb->nr_caches; i++) {
+		struct metablock *mb = mb_at(wb, i);
+		INIT_HLIST_NODE(&mb->ht_list);
+
+		mb->idx = i;
+		mb->dirty_bits = 0;
+	}
+}
+
+/*
+ * Calc the starting sector of the k-th segment
+ */
+static sector_t calc_segment_header_start(struct wb_device *wb, u32 k)
+{
+	return (1 << 11) + (1 << wb->segment_size_order) * k;
+}
+
+static u32 calc_nr_segments(struct dm_dev *dev, struct wb_device *wb)
+{
+	sector_t devsize = dm_devsize(dev);
+	return div_u64(devsize - (1 << 11), 1 << wb->segment_size_order);
+}
+
+/*
+ * Get the relative index in a segment of the mb_idx-th metablock
+ */
+u32 mb_idx_inseg(struct wb_device *wb, u32 mb_idx)
+{
+	u32 tmp32;
+	div_u64_rem(mb_idx, wb->nr_caches_inseg, &tmp32);
+	return tmp32;
+}
+
+/*
+ * Calc the starting sector of the mb_idx-th cache block
+ */
+sector_t calc_mb_start_sector(struct wb_device *wb, struct segment_header *seg, u32 mb_idx)
+{
+	return seg->start_sector + ((1 + mb_idx_inseg(wb, mb_idx)) << 3);
+}
+
+/*
+ * Get the segment that contains the passed mb
+ */
+struct segment_header *mb_to_seg(struct wb_device *wb, struct metablock *mb)
+{
+	struct segment_header *seg;
+	seg = ((void *) mb)
+	      - mb_idx_inseg(wb, mb->idx) * sizeof(struct metablock)
+	      - sizeof(struct segment_header);
+	return seg;
+}
+
+bool is_on_buffer(struct wb_device *wb, u32 mb_idx)
+{
+	u32 start = wb->current_seg->start_idx;
+	if (mb_idx < start)
+		return false;
+
+	if (mb_idx >= (start + wb->nr_caches_inseg))
+		return false;
+
+	return true;
+}
+
+static u32 segment_id_to_idx(struct wb_device *wb, u64 id)
+{
+	u32 idx;
+	div_u64_rem(id - 1, wb->nr_segments, &idx);
+	return idx;
+}
+
+static struct segment_header *segment_at(struct wb_device *wb, u32 k)
+{
+	return large_array_at(wb->segment_header_array, k);
+}
+
+/*
+ * Get the segment from the segment id.
+ * The index of the segment is calculated from the segment id.
+ */
+struct segment_header *
+get_segment_header_by_id(struct wb_device *wb, u64 id)
+{
+	return segment_at(wb, segment_id_to_idx(wb, id));
+}
+
+/*----------------------------------------------------------------*/
+
+static int __must_check init_segment_header_array(struct wb_device *wb)
+{
+	u32 segment_idx;
+
+	wb->segment_header_array = large_array_alloc(
+			sizeof(struct segment_header) +
+			sizeof(struct metablock) * wb->nr_caches_inseg,
+			wb->nr_segments);
+	if (!wb->segment_header_array) {
+		WBERR("failed to allocate segment header array");
+		return -ENOMEM;
+	}
+
+	for (segment_idx = 0; segment_idx < wb->nr_segments; segment_idx++) {
+		struct segment_header *seg = large_array_at(wb->segment_header_array, segment_idx);
+
+		seg->id = 0;
+		seg->length = 0;
+		atomic_set(&seg->nr_inflight_ios, 0);
+
+		/*
+		 * Const values
+		 */
+		seg->start_idx = wb->nr_caches_inseg * segment_idx;
+		seg->start_sector = calc_segment_header_start(wb, segment_idx);
+	}
+
+	mb_array_empty_init(wb);
+
+	return 0;
+}
+
+static void free_segment_header_array(struct wb_device *wb)
+{
+	large_array_free(wb->segment_header_array);
+}
+
+/*----------------------------------------------------------------*/
+
+struct ht_head {
+	struct hlist_head ht_list;
+};
+
+/*
+ * Initialize the Hash Table.
+ */
+static int __must_check ht_empty_init(struct wb_device *wb)
+{
+	u32 idx;
+	size_t i, nr_heads;
+	struct large_array *arr;
+
+	wb->htsize = wb->nr_caches;
+	nr_heads = wb->htsize + 1;
+	arr = large_array_alloc(sizeof(struct ht_head), nr_heads);
+	if (!arr) {
+		WBERR("failed to allocate arr");
+		return -ENOMEM;
+	}
+
+	wb->htable = arr;
+
+	for (i = 0; i < nr_heads; i++) {
+		struct ht_head *hd = large_array_at(arr, i);
+		INIT_HLIST_HEAD(&hd->ht_list);
+	}
+
+	/*
+	 * Our hashtable has one special bucket called null head.
+	 * Orphan metablocks are linked to the null head.
+	 */
+	wb->null_head = large_array_at(wb->htable, wb->htsize);
+
+	for (idx = 0; idx < wb->nr_caches; idx++) {
+		struct metablock *mb = mb_at(wb, idx);
+		hlist_add_head(&mb->ht_list, &wb->null_head->ht_list);
+	}
+
+	return 0;
+}
+
+static void free_ht(struct wb_device *wb)
+{
+	large_array_free(wb->htable);
+}
+
+struct ht_head *ht_get_head(struct wb_device *wb, struct lookup_key *key)
+{
+	u32 idx;
+	div_u64_rem(key->sector, wb->htsize, &idx);
+	return large_array_at(wb->htable, idx);
+}
+
+static bool mb_hit(struct metablock *mb, struct lookup_key *key)
+{
+	return mb->sector == key->sector;
+}
+
+/*
+ * Remove the metablock from the hashtable
+ * and link the orphan to the null head.
+ */
+void ht_del(struct wb_device *wb, struct metablock *mb)
+{
+	struct ht_head *null_head;
+
+	hlist_del(&mb->ht_list);
+
+	null_head = wb->null_head;
+	hlist_add_head(&mb->ht_list, &null_head->ht_list);
+}
+
+void ht_register(struct wb_device *wb, struct ht_head *head,
+		 struct metablock *mb, struct lookup_key *key)
+{
+	hlist_del(&mb->ht_list);
+	hlist_add_head(&mb->ht_list, &head->ht_list);
+
+	mb->sector = key->sector;
+}
+
+struct metablock *ht_lookup(struct wb_device *wb, struct ht_head *head,
+			    struct lookup_key *key)
+{
+	struct metablock *mb, *found = NULL;
+	hlist_for_each_entry(mb, &head->ht_list, ht_list) {
+		if (mb_hit(mb, key)) {
+			found = mb;
+			break;
+		}
+	}
+	return found;
+}
+
+/*
+ * Remove all the metablocks in the segment from the lookup table.
+ */
+void discard_caches_inseg(struct wb_device *wb, struct segment_header *seg)
+{
+	u8 i;
+	for (i = 0; i < wb->nr_caches_inseg; i++) {
+		struct metablock *mb = seg->mb_array + i;
+		ht_del(wb, mb);
+	}
+}
+
+/*----------------------------------------------------------------*/
+
+static int read_superblock_header(struct superblock_header_device *sup,
+				  struct wb_device *wb)
+{
+	int r = 0;
+	struct dm_io_request io_req_sup;
+	struct dm_io_region region_sup;
+
+	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	io_req_sup = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region_sup = (struct dm_io_region) {
+		.bdev = wb->cache_dev->bdev,
+		.sector = 0,
+		.count = 1,
+	};
+	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
+	if (r) {
+		WBERR("I/O failed");
+		goto bad_io;
+	}
+
+	memcpy(sup, buf, sizeof(*sup));
+
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+/*
+ * Check if the cache device is already formatted.
+ * Returns 0 iff this routine runs without failure.
+ */
+static int __must_check
+audit_cache_device(struct wb_device *wb, bool *need_format, bool *allow_format)
+{
+	int r = 0;
+	struct superblock_header_device sup;
+	r = read_superblock_header(&sup, wb);
+	if (r) {
+		WBERR("failed to read superblock header");
+		return r;
+	}
+
+	*need_format = true;
+	*allow_format = false;
+
+	if (le32_to_cpu(sup.magic) != WB_MAGIC) {
+		*allow_format = true;
+		WBERR("superblock header: magic number invalid");
+		return 0;
+	}
+
+	if (sup.segment_size_order != wb->segment_size_order) {
+		WBERR("superblock header: segment order not same %u != %u",
+		      sup.segment_size_order, wb->segment_size_order);
+	} else {
+		*need_format = false;
+	}
+
+	return r;
+}
+
+static int format_superblock_header(struct wb_device *wb)
+{
+	int r = 0;
+
+	struct dm_io_request io_req_sup;
+	struct dm_io_region region_sup;
+
+	struct superblock_header_device sup = {
+		.magic = cpu_to_le32(WB_MAGIC),
+		.segment_size_order = wb->segment_size_order,
+	};
+
+	void *buf = kzalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	memcpy(buf, &sup, sizeof(sup));
+
+	io_req_sup = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = WRITE_FUA,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region_sup = (struct dm_io_region) {
+		.bdev = wb->cache_dev->bdev,
+		.sector = 0,
+		.count = 1,
+	};
+	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
+	if (r) {
+		WBERR("I/O failed");
+		goto bad_io;
+	}
+
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+struct format_segmd_context {
+	int err;
+	atomic64_t count;
+};
+
+static void format_segmd_endio(unsigned long error, void *__context)
+{
+	struct format_segmd_context *context = __context;
+	if (error)
+		context->err = 1;
+	atomic64_dec(&context->count);
+}
+
+static int zeroing_full_superblock(struct wb_device *wb)
+{
+	int r = 0;
+	struct dm_dev *dev = wb->cache_dev;
+
+	struct dm_io_request io_req_sup;
+	struct dm_io_region region_sup;
+
+	void *buf = kzalloc(1 << 20, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	io_req_sup = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = WRITE_FUA,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region_sup = (struct dm_io_region) {
+		.bdev = dev->bdev,
+		.sector = 0,
+		.count = (1 << 11),
+	};
+	r = dm_safe_io(&io_req_sup, 1, &region_sup, NULL, false);
+	if (r) {
+		WBERR("I/O failed");
+		goto bad_io;
+	}
+
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+static int format_all_segment_headers(struct wb_device *wb)
+{
+	int r = 0;
+	struct dm_dev *dev = wb->cache_dev;
+	u32 i, nr_segments = calc_nr_segments(dev, wb);
+
+	struct format_segmd_context context;
+
+	void *buf = kzalloc(1 << 12, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	atomic64_set(&context.count, nr_segments);
+	context.err = 0;
+
+	/*
+	 * Submit all the writes asynchronously.
+	 */
+	for (i = 0; i < nr_segments; i++) {
+		struct dm_io_request io_req_seg = {
+			.client = wb_io_client,
+			.bi_rw = WRITE,
+			.notify.fn = format_segmd_endio,
+			.notify.context = &context,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		struct dm_io_region region_seg = {
+			.bdev = dev->bdev,
+			.sector = calc_segment_header_start(wb, i),
+			.count = (1 << 3),
+		};
+		r = dm_safe_io(&io_req_seg, 1, &region_seg, NULL, false);
+		if (r) {
+			WBERR("I/O failed");
+			break;
+		}
+	}
+	kfree(buf);
+
+	if (r)
+		return r;
+
+	/*
+	 * Wait for all the writes complete.
+	 */
+	while (atomic64_read(&context.count))
+		schedule_timeout_interruptible(msecs_to_jiffies(100));
+
+	if (context.err) {
+		WBERR("I/O failed at last");
+		return -EIO;
+	}
+
+	return r;
+}
+
+/*
+ * Format superblock header and
+ * all the segment headers in a cache device
+ */
+static int __must_check format_cache_device(struct wb_device *wb)
+{
+	int r = 0;
+	struct dm_dev *dev = wb->cache_dev;
+	r = zeroing_full_superblock(wb);
+	if (r)
+		return r;
+	r = format_superblock_header(wb); /* first 512B */
+	if (r)
+		return r;
+	r = format_all_segment_headers(wb);
+	if (r)
+		return r;
+	r = blkdev_issue_flush(dev->bdev, GFP_KERNEL, NULL);
+	return r;
+}
+
+/*
+ * First check if the superblock and the passed arguments
+ * are consistent and re-format the cache structure if they are not.
+ * If you want to re-format the cache device you must zeroed out
+ * the first one sector of the device.
+ *
+ * After this, the segment_size_order is fixed.
+ */
+static int might_format_cache_device(struct wb_device *wb)
+{
+	int r = 0;
+
+	bool need_format, allow_format;
+	r = audit_cache_device(wb, &need_format, &allow_format);
+	if (r) {
+		WBERR("failed to audit cache device");
+		return r;
+	}
+
+	if (need_format) {
+		if (allow_format) {
+			r = format_cache_device(wb);
+			if (r) {
+				WBERR("failed to format cache device");
+				return r;
+			}
+		} else {
+			r = -EINVAL;
+			WBERR("cache device not allowed to format");
+			return r;
+		}
+	}
+
+	return r;
+}
+
+/*----------------------------------------------------------------*/
+
+static int __must_check
+read_superblock_record(struct superblock_record_device *record,
+		       struct wb_device *wb)
+{
+	int r = 0;
+	struct dm_io_request io_req;
+	struct dm_io_region region;
+
+	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
+	if (!buf) {
+		WBERR();
+		return -ENOMEM;
+	}
+
+	io_req = (struct dm_io_request) {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	region = (struct dm_io_region) {
+		.bdev = wb->cache_dev->bdev,
+		.sector = (1 << 11) - 1,
+		.count = 1,
+	};
+	r = dm_safe_io(&io_req, 1, &region, NULL, false);
+	if (r) {
+		WBERR("I/O failed");
+		goto bad_io;
+	}
+
+	memcpy(record, buf, sizeof(*record));
+
+bad_io:
+	kfree(buf);
+	return r;
+}
+
+/*
+ * Read a whole segment from the cache device into a pre-allocated buffer.
+ */
+static int __must_check
+read_whole_segment(void *buf, struct wb_device *wb, struct segment_header *seg)
+{
+	struct dm_io_request io_req = {
+		.client = wb_io_client,
+		.bi_rw = READ,
+		.notify.fn = NULL,
+		.mem.type = DM_IO_KMEM,
+		.mem.ptr.addr = buf,
+	};
+	struct dm_io_region region = {
+		.bdev = wb->cache_dev->bdev,
+		.sector = seg->start_sector,
+		.count = 1 << wb->segment_size_order,
+	};
+	return dm_safe_io(&io_req, 1, &region, NULL, false);
+}
+
+/*
+ * We compute the checksum of a segment from the valid data
+ * in the segment, excluding its first sector.
+ */
+static u32 calc_checksum(void *rambuffer, u8 length)
+{
+	unsigned int len = (4096 - 512) + 4096 * length;
+	return crc32c(WB_CKSUM_SEED, rambuffer + 512, len);
+}
+
+/*
+ * Complete metadata in a segment buffer.
+ */
+void prepare_segment_header_device(void *rambuffer,
+				   struct wb_device *wb,
+				   struct segment_header *src)
+{
+	struct segment_header_device *dest = rambuffer;
+	u32 i;
+
+	BUG_ON((src->length - 1) != mb_idx_inseg(wb, wb->cursor));
+
+	for (i = 0; i < src->length; i++) {
+		struct metablock *mb = src->mb_array + i;
+		struct metablock_device *mbdev = dest->mbarr + i;
+
+		mbdev->sector = cpu_to_le64(mb->sector);
+		mbdev->dirty_bits = mb->dirty_bits;
+	}
+
+	dest->id = cpu_to_le64(src->id);
+	dest->checksum = cpu_to_le32(calc_checksum(rambuffer, src->length));
+	dest->length = src->length;
+}
+
+static void
+apply_metablock_device(struct wb_device *wb, struct segment_header *seg,
+		       struct segment_header_device *src, u8 i)
+{
+	struct lookup_key key;
+	struct ht_head *head;
+	struct metablock *found = NULL, *mb = seg->mb_array + i;
+	struct metablock_device *mbdev = src->mbarr + i;
+
+	mb->sector = le64_to_cpu(mbdev->sector);
+	mb->dirty_bits = mbdev->dirty_bits;
+
+	/*
+	 * A metablock is usually dirty, but an exception is one
+	 * inserted by a forced flush.
+	 * In that case, the first metablock in a segment is clean.
+	 */
+	if (!mb->dirty_bits)
+		return;
+
+	key = (struct lookup_key) {
+		.sector = mb->sector,
+	};
+	head = ht_get_head(wb, &key);
+	found = ht_lookup(wb, head, &key);
+	if (found) {
+		bool overwrite_fullsize = (mb->dirty_bits == 255);
+		invalidate_previous_cache(wb, mb_to_seg(wb, found), found,
+					  overwrite_fullsize);
+	}
+
+	inc_nr_dirty_caches(wb);
+	ht_register(wb, head, mb, &key);
+}
+
+/*
+ * Read the on-disk metadata of the segment and
+ * update the in-core cache metadata structure.
+ */
+static void
+apply_segment_header_device(struct wb_device *wb, struct segment_header *seg,
+			    struct segment_header_device *src)
+{
+	u8 i;
+
+	seg->length = src->length;
+
+	for (i = 0; i < src->length; i++)
+		apply_metablock_device(wb, seg, src, i);
+}
+
+/*
+ * If the RAM buffers are non-volatile,
+ * we first write back all the valid data in them.
+ * By doing this, the replay algorithm only needs to consider
+ * the logs on the cache device.
+ */
+static int writeback_non_volatile_buffers(struct wb_device *wb)
+{
+	return 0;
+}
+
+static int find_max_id(struct wb_device *wb, u64 *max_id)
+{
+	int r = 0;
+
+	void *rambuf = kmalloc(1 << (wb->segment_size_order + SECTOR_SHIFT),
+			       GFP_KERNEL);
+	u32 k;
+
+	if (!rambuf)
+		return -ENOMEM;
+
+	*max_id = 0;
+	for (k = 0; k < wb->nr_segments; k++) {
+		struct segment_header *seg = segment_at(wb, k);
+		struct segment_header_device *header;
+		r = read_whole_segment(rambuf, wb, seg);
+		if (r) {
+			kfree(rambuf);
+			return r;
+		}
+
+		header = rambuf;
+		if (le64_to_cpu(header->id) > *max_id)
+			*max_id = le64_to_cpu(header->id);
+	}
+	kfree(rambuf);
+	return r;
+}
+
+static int apply_valid_segments(struct wb_device *wb, u64 *max_id)
+{
+	int r = 0;
+	struct segment_header *seg;
+	struct segment_header_device *header;
+
+	void *rambuf = kmalloc(1 << (wb->segment_size_order + SECTOR_SHIFT),
+			       GFP_KERNEL);
+
+	u32 i, start_idx = segment_id_to_idx(wb, *max_id + 1);
+
+	if (!rambuf)
+		return -ENOMEM;
+
+	*max_id = 0;
+	for (i = start_idx; i < (start_idx + wb->nr_segments); i++) {
+		u32 checksum1, checksum2, k;
+		div_u64_rem(i, wb->nr_segments, &k);
+		seg = segment_at(wb, k);
+
+		r = read_whole_segment(rambuf, wb, seg);
+		if (r) {
+			kfree(rambuf);
+			return r;
+		}
+
+		header = rambuf;
+
+		if (!le64_to_cpu(header->id))
+			continue;
+
+		checksum1 = le32_to_cpu(header->checksum);
+		checksum2 = calc_checksum(rambuf, header->length);
+		if (checksum1 != checksum2) {
+			DMWARN("checksum inconsistent id:%llu checksum:%u != %u",
+			       (long long unsigned int) le64_to_cpu(header->id),
+			       checksum1, checksum2);
+			continue;
+		}
+
+		apply_segment_header_device(wb, seg, header);
+		*max_id = le64_to_cpu(header->id);
+	}
+	kfree(rambuf);
+	return r;
+}
+
+static int infer_last_migrated_id(struct wb_device *wb)
+{
+	int r = 0;
+
+	u64 record_id;
+	struct superblock_record_device uninitialized_var(record);
+	r = read_superblock_record(&record, wb);
+	if (r)
+		return r;
+
+	atomic64_set(&wb->last_migrated_segment_id,
+		atomic64_read(&wb->last_flushed_segment_id) > wb->nr_segments ?
+		atomic64_read(&wb->last_flushed_segment_id) - wb->nr_segments : 0);
+
+	record_id = le64_to_cpu(record.last_migrated_segment_id);
+	if (record_id > atomic64_read(&wb->last_migrated_segment_id))
+		atomic64_set(&wb->last_migrated_segment_id, record_id);
+
+	return r;
+}
+
+/*
+ * Replay all the logs on the cache device to reconstruct
+ * the in-memory metadata.
+ *
+ * Algorithm:
+ * 1. find the maximum id
+ * 2. start from the segment right after it and iterate over all the logs
+ * 3. skip a log if its id is 0 or its checksum is invalid
+ * 4. apply the log otherwise
+ *
+ * This algorithm is robust against a sloppy SSD that may write
+ * a segment only partially or lose data in its buffer on power failure.
+ *
+ * Even if a number of threads flush segments in parallel and
+ * some of them lose atomicity because of a power failure,
+ * this algorithm still works.
+ */
+static int replay_log_on_cache(struct wb_device *wb)
+{
+	int r = 0;
+	u64 max_id;
+
+	r = find_max_id(wb, &max_id);
+	if (r) {
+		WBERR("failed to find max id");
+		return r;
+	}
+	r = apply_valid_segments(wb, &max_id);
+	if (r) {
+		WBERR("failed to apply valid segments");
+		return r;
+	}
+
+	/*
+	 * Setup last_flushed_segment_id
+	 */
+	atomic64_set(&wb->last_flushed_segment_id, max_id);
+
+	/*
+	 * Setup last_migrated_segment_id
+	 */
+	infer_last_migrated_id(wb);
+
+	return r;
+}
+
+/*
+ * Acquire and initialize the first segment header for our caching.
+ */
+static void prepare_first_seg(struct wb_device *wb)
+{
+	u64 init_segment_id = atomic64_read(&wb->last_flushed_segment_id) + 1;
+	acquire_new_seg(wb, init_segment_id);
+
+	/*
+	 * We always keep the integrity between the cursor
+	 * and seg->length.
+	 */
+	wb->cursor = wb->current_seg->start_idx;
+	wb->current_seg->length = 1;
+}
+
+/*
+ * Recover all the cache state from the
+ * persistent devices (non-volatile RAM and SSD).
+ */
+static int __must_check recover_cache(struct wb_device *wb)
+{
+	int r = 0;
+
+	r = writeback_non_volatile_buffers(wb);
+	if (r) {
+		WBERR("failed to write back all the persistent data on non-volatile RAM");
+		return r;
+	}
+
+	r = replay_log_on_cache(wb);
+	if (r) {
+		WBERR("failed to replay log");
+		return r;
+	}
+
+	prepare_first_seg(wb);
+	return 0;
+}
+
+/*----------------------------------------------------------------*/
+
+static int __must_check init_rambuf_pool(struct wb_device *wb)
+{
+	size_t i;
+	sector_t alloc_sz = 1 << wb->segment_size_order;
+	u32 nr = div_u64(wb->rambuf_pool_amount * 2, alloc_sz);
+
+	if (!nr)
+		return -EINVAL;
+
+	wb->nr_rambuf_pool = nr;
+	wb->rambuf_pool = kmalloc(sizeof(struct rambuffer) * nr,
+				  GFP_KERNEL);
+	if (!wb->rambuf_pool)
+		return -ENOMEM;
+
+	for (i = 0; i < wb->nr_rambuf_pool; i++) {
+		size_t j;
+		struct rambuffer *rambuf = wb->rambuf_pool + i;
+
+		rambuf->data = kmalloc(alloc_sz << SECTOR_SHIFT, GFP_KERNEL);
+		if (!rambuf->data) {
+			WBERR("failed to allocate rambuf data");
+			for (j = 0; j < i; j++) {
+				rambuf = wb->rambuf_pool + j;
+				kfree(rambuf->data);
+			}
+			kfree(wb->rambuf_pool);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+static void free_rambuf_pool(struct wb_device *wb)
+{
+	size_t i;
+	for (i = 0; i < wb->nr_rambuf_pool; i++) {
+		struct rambuffer *rambuf = wb->rambuf_pool + i;
+		kfree(rambuf->data);
+	}
+	kfree(wb->rambuf_pool);
+}
+
+/*----------------------------------------------------------------*/
+
+/*
+ * Try to allocate a new migration buffer sized for nr_batch segments.
+ * On success, it frees the old buffer.
+ *
+ * A bad user may set a number of batches that can hardly be allocated;
+ * this function is robust in that case.
+ */
+int try_alloc_migration_buffer(struct wb_device *wb, size_t nr_batch)
+{
+	int r = 0;
+
+	struct segment_header **emigrates;
+	void *buf;
+	void *snapshot;
+
+	emigrates = kmalloc(nr_batch * sizeof(struct segment_header *), GFP_KERNEL);
+	if (!emigrates) {
+		WBERR("failed to allocate emigrates");
+		r = -ENOMEM;
+		return r;
+	}
+
+	buf = vmalloc(nr_batch * (wb->nr_caches_inseg << 12));
+	if (!buf) {
+		WBERR("failed to allocate migration buffer");
+		r = -ENOMEM;
+		goto bad_alloc_buffer;
+	}
+
+	snapshot = kmalloc(nr_batch * wb->nr_caches_inseg, GFP_KERNEL);
+	if (!snapshot) {
+		WBERR("failed to allocate dirty snapshot");
+		r = -ENOMEM;
+		goto bad_alloc_snapshot;
+	}
+
+	/*
+	 * Free old buffers
+	 */
+	kfree(wb->emigrates); /* kfree(NULL) is safe */
+	if (wb->migrate_buffer)
+		vfree(wb->migrate_buffer);
+	kfree(wb->dirtiness_snapshot);
+
+	/*
+	 * Swap by new values
+	 */
+	wb->emigrates = emigrates;
+	wb->migrate_buffer = buf;
+	wb->dirtiness_snapshot = snapshot;
+	wb->nr_cur_batched_migration = nr_batch;
+
+	return r;
+
+bad_alloc_snapshot:
+	vfree(buf);
+bad_alloc_buffer:
+	kfree(emigrates);
+
+	return r;
+}
+
+static void free_migration_buffer(struct wb_device *wb)
+{
+	kfree(wb->emigrates);
+	vfree(wb->migrate_buffer);
+	kfree(wb->dirtiness_snapshot);
+}
+
+/*----------------------------------------------------------------*/
+
+#define CREATE_DAEMON(name) \
+	do { \
+		wb->name##_daemon = kthread_create( \
+				name##_proc, wb,  #name "_daemon"); \
+		if (IS_ERR(wb->name##_daemon)) { \
+			r = PTR_ERR(wb->name##_daemon); \
+			wb->name##_daemon = NULL; \
+			WBERR("couldn't spawn " #name " daemon"); \
+			goto bad_##name##_daemon; \
+		} \
+		wake_up_process(wb->name##_daemon); \
+	} while (0)
+
+/*
+ * Setup the core info relevant to the cache format and geometry.
+ */
+static void setup_geom_info(struct wb_device *wb)
+{
+	wb->nr_segments = calc_nr_segments(wb->cache_dev, wb);
+	wb->nr_caches_inseg = (1 << (wb->segment_size_order - 3)) - 1;
+	wb->nr_caches = wb->nr_segments * wb->nr_caches_inseg;
+}
+
+/*
+ * Harmless init
+ * - allocate memory
+ * - setup the initial state of the objects
+ */
+static int harmless_init(struct wb_device *wb)
+{
+	int r = 0;
+
+	setup_geom_info(wb);
+
+	wb->buf_1_pool = mempool_create_kmalloc_pool(16, 1 << SECTOR_SHIFT);
+	if (!wb->buf_1_pool) {
+		r = -ENOMEM;
+		WBERR("failed to allocate 1 sector pool");
+		goto bad_buf_1_pool;
+	}
+	wb->buf_8_pool = mempool_create_kmalloc_pool(16, 8 << SECTOR_SHIFT);
+	if (!wb->buf_8_pool) {
+		r = -ENOMEM;
+		WBERR("failed to allocate 8 sector pool");
+		goto bad_buf_8_pool;
+	}
+
+	r = init_rambuf_pool(wb);
+	if (r) {
+		WBERR("failed to allocate rambuf pool");
+		goto bad_init_rambuf_pool;
+	}
+	wb->flush_job_pool = mempool_create_kmalloc_pool(
+				wb->nr_rambuf_pool, sizeof(struct flush_job));
+	if (!wb->flush_job_pool) {
+		r = -ENOMEM;
+		WBERR("failed to allocate flush job pool");
+		goto bad_flush_job_pool;
+	}
+
+	r = init_segment_header_array(wb);
+	if (r) {
+		WBERR("failed to allocate segment header array");
+		goto bad_alloc_segment_header_array;
+	}
+
+	r = ht_empty_init(wb);
+	if (r) {
+		WBERR("failed to allocate hashtable");
+		goto bad_alloc_ht;
+	}
+
+	return r;
+
+bad_alloc_ht:
+	free_segment_header_array(wb);
+bad_alloc_segment_header_array:
+	mempool_destroy(wb->flush_job_pool);
+bad_flush_job_pool:
+	free_rambuf_pool(wb);
+bad_init_rambuf_pool:
+	mempool_destroy(wb->buf_8_pool);
+bad_buf_8_pool:
+	mempool_destroy(wb->buf_1_pool);
+bad_buf_1_pool:
+
+	return r;
+}
+
+static void harmless_free(struct wb_device *wb)
+{
+	free_ht(wb);
+	free_segment_header_array(wb);
+	mempool_destroy(wb->flush_job_pool);
+	free_rambuf_pool(wb);
+	mempool_destroy(wb->buf_8_pool);
+	mempool_destroy(wb->buf_1_pool);
+}
+
+static int init_migrate_daemon(struct wb_device *wb)
+{
+	int r = 0;
+	size_t nr_batch;
+
+	atomic_set(&wb->migrate_fail_count, 0);
+	atomic_set(&wb->migrate_io_count, 0);
+
+	/*
+	 * Default number of batched migration is 1MB / segment size.
+	 * An ordinary HDD can afford at least 1MB/sec.
+	 */
+	nr_batch = 1 << (11 - wb->segment_size_order);
+	wb->nr_max_batched_migration = nr_batch;
+	if (try_alloc_migration_buffer(wb, nr_batch))
+		return -ENOMEM;
+
+	init_waitqueue_head(&wb->migrate_wait_queue);
+	init_waitqueue_head(&wb->wait_drop_caches);
+	init_waitqueue_head(&wb->migrate_io_wait_queue);
+
+	wb->allow_migrate = false;
+	wb->urge_migrate = false;
+	CREATE_DAEMON(migrate);
+
+	return r;
+
+bad_migrate_daemon:
+	free_migration_buffer(wb);
+	return r;
+}
+
+static int init_flusher(struct wb_device *wb)
+{
+	int r = 0;
+	wb->flusher_wq = alloc_workqueue(
+		"%s", WQ_MEM_RECLAIM | WQ_SYSFS, 1, "wbflusher");
+	if (!wb->flusher_wq) {
+		WBERR("failed to allocate wbflusher");
+		return -ENOMEM;
+	}
+	init_waitqueue_head(&wb->flush_wait_queue);
+	return r;
+}
+
+static void init_barrier_deadline_work(struct wb_device *wb)
+{
+	wb->barrier_deadline_ms = 3;
+	setup_timer(&wb->barrier_deadline_timer,
+		    barrier_deadline_proc, (unsigned long) wb);
+	bio_list_init(&wb->barrier_ios);
+	INIT_WORK(&wb->barrier_deadline_work, flush_barrier_ios);
+}
+
+static int init_migrate_modulator(struct wb_device *wb)
+{
+	int r = 0;
+	/*
+	 * EMC's textbook on storage systems teaches us that
+	 * storage should keep its load at no more than 70%.
+	 */
+	wb->migrate_threshold = 70;
+	wb->enable_migration_modulator = true;
+	CREATE_DAEMON(modulator);
+	return r;
+
+bad_modulator_daemon:
+	return r;
+}
+
+static int init_recorder_daemon(struct wb_device *wb)
+{
+	int r = 0;
+	wb->update_record_interval = 60;
+	CREATE_DAEMON(recorder);
+	return r;
+
+bad_recorder_daemon:
+	return r;
+}
+
+static int init_sync_daemon(struct wb_device *wb)
+{
+	int r = 0;
+	wb->sync_interval = 60;
+	CREATE_DAEMON(sync);
+	return r;
+
+bad_sync_daemon:
+	return r;
+}
+
+int __must_check resume_cache(struct wb_device *wb)
+{
+	int r = 0;
+
+	r = might_format_cache_device(wb);
+	if (r)
+		goto bad_might_format_cache;
+	r = harmless_init(wb);
+	if (r)
+		goto bad_harmless_init;
+	r = init_migrate_daemon(wb);
+	if (r) {
+		WBERR("failed to init migrate daemon");
+		goto bad_migrate_daemon;
+	}
+	r = recover_cache(wb);
+	if (r) {
+		WBERR("failed to recover cache metadata");
+		goto bad_recover;
+	}
+	r = init_flusher(wb);
+	if (r) {
+		WBERR("failed to init wbflusher");
+		goto bad_flusher;
+	}
+	init_barrier_deadline_work(wb);
+	r = init_migrate_modulator(wb);
+	if (r) {
+		WBERR("failed to init migrate modulator");
+		goto bad_migrate_modulator;
+	}
+	r = init_recorder_daemon(wb);
+	if (r) {
+		WBERR("failed to init superblock recorder");
+		goto bad_recorder_daemon;
+	}
+	r = init_sync_daemon(wb);
+	if (r) {
+		WBERR("failed to init sync daemon");
+		goto bad_sync_daemon;
+	}
+
+	return r;
+
+bad_sync_daemon:
+	kthread_stop(wb->recorder_daemon);
+bad_recorder_daemon:
+	kthread_stop(wb->modulator_daemon);
+bad_migrate_modulator:
+	cancel_work_sync(&wb->barrier_deadline_work);
+	destroy_workqueue(wb->flusher_wq);
+bad_flusher:
+bad_recover:
+	kthread_stop(wb->migrate_daemon);
+	free_migration_buffer(wb);
+bad_migrate_daemon:
+	harmless_free(wb);
+bad_harmless_init:
+bad_might_format_cache:
+
+	return r;
+}
+
+void free_cache(struct wb_device *wb)
+{
+	/*
+	 * kthread_stop() wakes up the thread,
+	 * so we don't need to wake the daemons up ourselves.
+	 */
+	kthread_stop(wb->sync_daemon);
+	kthread_stop(wb->recorder_daemon);
+	kthread_stop(wb->modulator_daemon);
+
+	cancel_work_sync(&wb->barrier_deadline_work);
+
+	destroy_workqueue(wb->flusher_wq);
+
+	kthread_stop(wb->migrate_daemon);
+	free_migration_buffer(wb);
+
+	harmless_free(wb);
+}
diff --git a/drivers/md/dm-writeboost-metadata.h b/drivers/md/dm-writeboost-metadata.h
new file mode 100644
index 0000000..860e4bf
--- /dev/null
+++ b/drivers/md/dm-writeboost-metadata.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_WRITEBOOST_METADATA_H
+#define DM_WRITEBOOST_METADATA_H
+
+/*----------------------------------------------------------------*/
+
+struct segment_header *
+get_segment_header_by_id(struct wb_device *, u64 segment_id);
+sector_t calc_mb_start_sector(struct wb_device *, struct segment_header *,
+			      u32 mb_idx);
+u32 mb_idx_inseg(struct wb_device *, u32 mb_idx);
+struct segment_header *mb_to_seg(struct wb_device *, struct metablock *);
+bool is_on_buffer(struct wb_device *, u32 mb_idx);
+
+/*----------------------------------------------------------------*/
+
+struct lookup_key {
+	sector_t sector;
+};
+
+struct ht_head;
+struct ht_head *ht_get_head(struct wb_device *, struct lookup_key *);
+struct metablock *ht_lookup(struct wb_device *,
+			    struct ht_head *, struct lookup_key *);
+void ht_register(struct wb_device *, struct ht_head *,
+		 struct metablock *, struct lookup_key *);
+void ht_del(struct wb_device *, struct metablock *);
+void discard_caches_inseg(struct wb_device *, struct segment_header *);
+
+/*----------------------------------------------------------------*/
+
+void prepare_segment_header_device(void *rambuffer, struct wb_device *,
+				   struct segment_header *src);
+
+/*----------------------------------------------------------------*/
+
+int try_alloc_migration_buffer(struct wb_device *, size_t nr_batch);
+
+/*----------------------------------------------------------------*/
+
+int __must_check resume_cache(struct wb_device *);
+void free_cache(struct wb_device *);
+
+/*----------------------------------------------------------------*/
+
+#endif
diff --git a/drivers/md/dm-writeboost-target.c b/drivers/md/dm-writeboost-target.c
new file mode 100644
index 0000000..4cbf579
--- /dev/null
+++ b/drivers/md/dm-writeboost-target.c
@@ -0,0 +1,1258 @@
+/*
+ * Writeboost
+ * Log-structured Caching for Linux
+ *
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#include "dm-writeboost.h"
+#include "dm-writeboost-metadata.h"
+#include "dm-writeboost-daemon.h"
+
+/*----------------------------------------------------------------*/
+
+struct safe_io {
+	struct work_struct work;
+	int err;
+	unsigned long err_bits;
+	struct dm_io_request *io_req;
+	unsigned num_regions;
+	struct dm_io_region *regions;
+};
+
+static void safe_io_proc(struct work_struct *work)
+{
+	struct safe_io *io = container_of(work, struct safe_io, work);
+	io->err_bits = 0;
+	io->err = dm_io(io->io_req, io->num_regions, io->regions,
+			&io->err_bits);
+}
+
+int dm_safe_io_internal(struct wb_device *wb, struct dm_io_request *io_req,
+			unsigned num_regions, struct dm_io_region *regions,
+			unsigned long *err_bits, bool thread, const char *caller)
+{
+	int err = 0;
+
+	if (thread) {
+		struct safe_io io = {
+			.io_req = io_req,
+			.regions = regions,
+			.num_regions = num_regions,
+		};
+
+		INIT_WORK_ONSTACK(&io.work, safe_io_proc);
+
+		queue_work(safe_io_wq, &io.work);
+		flush_work(&io.work);
+
+		err = io.err;
+		if (err_bits)
+			*err_bits = io.err_bits;
+	} else {
+		err = dm_io(io_req, num_regions, regions, err_bits);
+	}
+
+	/*
+	 * err_bits can be NULL.
+	 */
+	if (err || (err_bits && *err_bits)) {
+		char buf[BDEVNAME_SIZE];
+		dev_t dev = regions->bdev->bd_dev;
+
+		unsigned long eb;
+		if (!err_bits)
+			eb = (~(unsigned long)0);
+		else
+			eb = *err_bits;
+
+		format_dev_t(buf, dev);
+		WBERR("%s() I/O error(%d), bits(%lu), dev(%s), sector(%llu), rw(%d)",
+		      caller, err, eb,
+		      buf, (unsigned long long) regions->sector, io_req->bi_rw);
+	}
+
+	return err;
+}
+
+sector_t dm_devsize(struct dm_dev *dev)
+{
+	return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
+}
+
+/*----------------------------------------------------------------*/
+
+static u8 count_dirty_caches_remained(struct segment_header *seg)
+{
+	u8 i, count = 0;
+	struct metablock *mb;
+	for (i = 0; i < seg->length; i++) {
+		mb = seg->mb_array + i;
+		if (mb->dirty_bits)
+			count++;
+	}
+	return count;
+}
+
+/*
+ * Prepare the kmalloc-ed RAM buffer for a segment write.
+ *
+ * The dm_io routine requires a RAM buffer for its I/O buffer.
+ * Even if we use non-volatile RAM we have to copy the data to
+ * this volatile buffer when we come to submit the I/O.
+ */
+static void prepare_rambuffer(struct rambuffer *rambuf,
+			      struct wb_device *wb,
+			      struct segment_header *seg)
+{
+	prepare_segment_header_device(rambuf->data, wb, seg);
+}
+
+static void init_rambuffer(struct wb_device *wb)
+{
+	memset(wb->current_rambuf->data, 0, 1 << 12);
+}
+
+/*
+ * Acquire a new RAM buffer for the new segment.
+ */
+static void acquire_new_rambuffer(struct wb_device *wb, u64 id)
+{
+	struct rambuffer *next_rambuf;
+	u32 tmp32;
+
+	wait_for_flushing(wb, SUB_ID(id, wb->nr_rambuf_pool));
+
+	div_u64_rem(id - 1, wb->nr_rambuf_pool, &tmp32);
+	next_rambuf = wb->rambuf_pool + tmp32;
+
+	wb->current_rambuf = next_rambuf;
+
+	init_rambuffer(wb);
+}
+
+/*
+ * Acquire the new segment and RAM buffer for the following writes.
+ * Guarantees that all dirty caches in the segment are migrated and all
+ * metablocks in it are invalidated (linked to the null head).
+ */
+void acquire_new_seg(struct wb_device *wb, u64 id)
+{
+	struct segment_header *new_seg = get_segment_header_by_id(wb, id);
+
+	/*
+	 * We wait until all requests to the new segment are consumed.
+	 * The mutex we hold guarantees that no new I/O to this segment comes in.
+	 */
+	size_t rep = 0;
+	while (atomic_read(&new_seg->nr_inflight_ios)) {
+		rep++;
+		if (rep == 1000)
+			WBWARN("too long to process all requests");
+		schedule_timeout_interruptible(msecs_to_jiffies(1));
+	}
+	BUG_ON(count_dirty_caches_remained(new_seg));
+
+	wait_for_migration(wb, SUB_ID(id, wb->nr_segments));
+
+	discard_caches_inseg(wb, new_seg);
+
+	/*
+	 * We must not set the new id on the new segment before
+	 * all wait_* events are done, since they use the id for waiting.
+	 */
+	new_seg->id = id;
+	wb->current_seg = new_seg;
+
+	acquire_new_rambuffer(wb, id);
+}
+
+static void prepare_new_seg(struct wb_device *wb)
+{
+	u64 next_id = wb->current_seg->id + 1;
+	acquire_new_seg(wb, next_id);
+
+	/*
+	 * Set the cursor to the last of the flushed segment.
+	 */
+	wb->cursor = wb->current_seg->start_idx + (wb->nr_caches_inseg - 1);
+	wb->current_seg->length = 0;
+}
+
+static void
+copy_barrier_requests(struct flush_job *job, struct wb_device *wb)
+{
+	bio_list_init(&job->barrier_ios);
+	bio_list_merge(&job->barrier_ios, &wb->barrier_ios);
+	bio_list_init(&wb->barrier_ios);
+}
+
+static void init_flush_job(struct flush_job *job, struct wb_device *wb)
+{
+	job->wb = wb;
+	job->seg = wb->current_seg;
+	job->rambuf = wb->current_rambuf;
+
+	copy_barrier_requests(job, wb);
+}
+
+static void queue_flush_job(struct wb_device *wb)
+{
+	struct flush_job *job;
+	size_t rep = 0;
+
+	while (atomic_read(&wb->current_seg->nr_inflight_ios)) {
+		rep++;
+		if (rep == 1000)
+			WBWARN("too long to process all requests");
+		schedule_timeout_interruptible(msecs_to_jiffies(1));
+	}
+	prepare_rambuffer(wb->current_rambuf, wb, wb->current_seg);
+
+	job = mempool_alloc(wb->flush_job_pool, GFP_NOIO);
+	init_flush_job(job, wb);
+	INIT_WORK(&job->work, flush_proc);
+	queue_work(wb->flusher_wq, &job->work);
+}
+
+static void queue_current_buffer(struct wb_device *wb)
+{
+	queue_flush_job(wb);
+	prepare_new_seg(wb);
+}
+
+/*
+ * Flush out all the transient data at once but _NOT_ persistently.
+ * Cleaning up the writes before termination is an example use case.
+ */
+void flush_current_buffer(struct wb_device *wb)
+{
+	struct segment_header *old_seg;
+
+	mutex_lock(&wb->io_lock);
+	old_seg = wb->current_seg;
+
+	queue_current_buffer(wb);
+
+	wb->cursor = wb->current_seg->start_idx;
+	wb->current_seg->length = 1;
+	mutex_unlock(&wb->io_lock);
+
+	wait_for_flushing(wb, old_seg->id);
+}
+
+/*----------------------------------------------------------------*/
+
+static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
+{
+	bio->bi_bdev = dev->bdev;
+	bio->bi_sector = sector;
+}
+
+static u8 io_offset(struct bio *bio)
+{
+	u32 tmp32;
+	div_u64_rem(bio->bi_sector, 1 << 3, &tmp32);
+	return tmp32;
+}
+
+static sector_t io_count(struct bio *bio)
+{
+	return bio->bi_size >> SECTOR_SHIFT;
+}
+
+static bool io_fullsize(struct bio *bio)
+{
+	return io_count(bio) == (1 << 3);
+}
+
+/*
+ * We use the 4KB-aligned address of the original request for the lookup key.
+ */
+static sector_t calc_cache_alignment(sector_t bio_sector)
+{
+	return div_u64(bio_sector, 1 << 3) * (1 << 3);
+}
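+
+/*
+ * Worked example (for illustration only): a 1KB write starting at sector 21
+ * has io_offset() == 5 and io_count() == 2, and calc_cache_alignment(21)
+ * returns 16, so the lookup key refers to the 4KB block starting at sector 16.
+ */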
+
+/*----------------------------------------------------------------*/
+
+static void inc_stat(struct wb_device *wb,
+		     int rw, bool found, bool on_buffer, bool fullsize)
+{
+	atomic64_t *v;
+
+	int i = 0;
+	if (rw)
+		i |= (1 << STAT_WRITE);
+	if (found)
+		i |= (1 << STAT_HIT);
+	if (on_buffer)
+		i |= (1 << STAT_ON_BUFFER);
+	if (fullsize)
+		i |= (1 << STAT_FULLSIZE);
+
+	v = &wb->stat[i];
+	atomic64_inc(v);
+}
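+
+/*
+ * For example (illustration only): a full-size write hit on a metablock that
+ * is still on the RAM buffer increments
+ * stat[(1 << STAT_WRITE) | (1 << STAT_HIT) | (1 << STAT_ON_BUFFER) | (1 << STAT_FULLSIZE)],
+ * i.e. stat[15], while a partial read miss increments stat[0].
+ */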
+
+static void clear_stat(struct wb_device *wb)
+{
+	size_t i;
+	for (i = 0; i < STATLEN; i++) {
+		atomic64_t *v = &wb->stat[i];
+		atomic64_set(v, 0);
+	}
+}
+
+/*----------------------------------------------------------------*/
+
+void inc_nr_dirty_caches(struct wb_device *wb)
+{
+	BUG_ON(!wb);
+	atomic64_inc(&wb->nr_dirty_caches);
+}
+
+static void dec_nr_dirty_caches(struct wb_device *wb)
+{
+	BUG_ON(!wb);
+	if (atomic64_dec_and_test(&wb->nr_dirty_caches))
+		wake_up_interruptible(&wb->wait_drop_caches);
+}
+
+/*
+ * Increase the dirtiness of a metablock.
+ */
+static void taint_mb(struct wb_device *wb, struct segment_header *seg,
+		     struct metablock *mb, struct bio *bio)
+{
+	unsigned long flags;
+
+	bool was_clean = false;
+
+	spin_lock_irqsave(&wb->lock, flags);
+	if (!mb->dirty_bits) {
+		seg->length++;
+		BUG_ON(seg->length > wb->nr_caches_inseg);
+		was_clean = true;
+	}
+	if (likely(io_fullsize(bio))) {
+		mb->dirty_bits = 255;
+	} else {
+		u8 i;
+		u8 acc_bits = 0;
+		for (i = io_offset(bio); i < (io_offset(bio) + io_count(bio)); i++)
+			acc_bits += (1 << i);
+
+		mb->dirty_bits |= acc_bits;
+	}
+	BUG_ON(!io_count(bio));
+	BUG_ON(!mb->dirty_bits);
+	spin_unlock_irqrestore(&wb->lock, flags);
+
+	if (was_clean)
+		inc_nr_dirty_caches(wb);
+}
+
+void cleanup_mb_if_dirty(struct wb_device *wb, struct segment_header *seg,
+			 struct metablock *mb)
+{
+	unsigned long flags;
+
+	bool was_dirty = false;
+
+	spin_lock_irqsave(&wb->lock, flags);
+	if (mb->dirty_bits) {
+		mb->dirty_bits = 0;
+		was_dirty = true;
+	}
+	spin_unlock_irqrestore(&wb->lock, flags);
+
+	if (was_dirty)
+		dec_nr_dirty_caches(wb);
+}
+
+/*
+ * Read the dirtiness of a metablock at this moment.
+ *
+ * In fact, it is not clear whether the read really needs to be surrounded
+ * by the spinlock. The concern is reading an intermediate value (neither
+ * the before-write nor the after-write value). Intel CPUs guarantee this
+ * cannot happen but other CPUs may not.
+ * If every supported CPU guarantees it, the spinlock can be removed.
+ */
+u8 read_mb_dirtiness(struct wb_device *wb, struct segment_header *seg,
+		     struct metablock *mb)
+{
+	unsigned long flags;
+	u8 val;
+
+	spin_lock_irqsave(&wb->lock, flags);
+	val = mb->dirty_bits;
+	spin_unlock_irqrestore(&wb->lock, flags);
+
+	return val;
+}
+
+/*
+ * Migrate the caches in a metablock that is on the SSD (i.e. already flushed).
+ * The caches on the SSD are considered persistent, so we need to
+ * write them back with the WRITE_FUA flag.
+ */
+static void migrate_mb(struct wb_device *wb, struct segment_header *seg,
+		       struct metablock *mb, u8 dirty_bits, bool thread)
+{
+	int r = 0;
+
+	if (!dirty_bits)
+		return;
+
+	if (dirty_bits == 255) {
+		void *buf = mempool_alloc(wb->buf_8_pool, GFP_NOIO);
+		struct dm_io_request io_req_r, io_req_w;
+		struct dm_io_region region_r, region_w;
+
+		io_req_r = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = READ,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		region_r = (struct dm_io_region) {
+			.bdev = wb->cache_dev->bdev,
+			.sector = calc_mb_start_sector(wb, seg, mb->idx),
+			.count = (1 << 3),
+		};
+		IO(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));
+
+		io_req_w = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = WRITE_FUA,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+		region_w = (struct dm_io_region) {
+			.bdev = wb->origin_dev->bdev,
+			.sector = mb->sector,
+			.count = (1 << 3),
+		};
+		IO(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));
+
+		mempool_free(buf, wb->buf_8_pool);
+	} else {
+		void *buf = mempool_alloc(wb->buf_1_pool, GFP_NOIO);
+		u8 i;
+		for (i = 0; i < 8; i++) {
+			struct dm_io_request io_req_r, io_req_w;
+			struct dm_io_region region_r, region_w;
+
+			bool bit_on = dirty_bits & (1 << i);
+			if (!bit_on)
+				continue;
+
+			io_req_r = (struct dm_io_request) {
+				.client = wb_io_client,
+				.bi_rw = READ,
+				.notify.fn = NULL,
+				.mem.type = DM_IO_KMEM,
+				.mem.ptr.addr = buf,
+			};
+			region_r = (struct dm_io_region) {
+				.bdev = wb->cache_dev->bdev,
+				.sector = calc_mb_start_sector(wb, seg, mb->idx) + i,
+				.count = 1,
+			};
+			IO(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));
+
+			io_req_w = (struct dm_io_request) {
+				.client = wb_io_client,
+				.bi_rw = WRITE_FUA,
+				.notify.fn = NULL,
+				.mem.type = DM_IO_KMEM,
+				.mem.ptr.addr = buf,
+			};
+			region_w = (struct dm_io_region) {
+				.bdev = wb->origin_dev->bdev,
+				.sector = mb->sector + i,
+				.count = 1,
+			};
+			IO(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));
+		}
+		mempool_free(buf, wb->buf_1_pool);
+	}
+}
+
+/*
+ * Migrate the caches on the RAM buffer.
+ * Calling this function is really rare so the code is not optimized.
+ *
+ * Since the caches are in either of these two states
+ * - not yet flushed and thus not persistent (volatile buffer)
+ * - already acked to a barrier request but still held on the
+ *   non-volatile buffer (non-volatile buffer)
+ * there is no reason to write them back with the FUA flag.
+ */
+static void migrate_buffered_mb(struct wb_device *wb,
+				struct metablock *mb, u8 dirty_bits)
+{
+	int r = 0;
+
+	sector_t offset = ((mb_idx_inseg(wb, mb->idx) + 1) << 3);
+	void *buf = mempool_alloc(wb->buf_1_pool, GFP_NOIO);
+
+	u8 i;
+	for (i = 0; i < 8; i++) {
+		struct dm_io_request io_req;
+		struct dm_io_region region;
+		void *src;
+		sector_t dest;
+
+		bool bit_on = dirty_bits & (1 << i);
+		if (!bit_on)
+			continue;
+
+		src = wb->current_rambuf->data +
+		      ((offset + i) << SECTOR_SHIFT);
+		memcpy(buf, src, 1 << SECTOR_SHIFT);
+
+		io_req = (struct dm_io_request) {
+			.client = wb_io_client,
+			.bi_rw = WRITE,
+			.notify.fn = NULL,
+			.mem.type = DM_IO_KMEM,
+			.mem.ptr.addr = buf,
+		};
+
+		dest = mb->sector + i;
+		region = (struct dm_io_region) {
+			.bdev = wb->origin_dev->bdev,
+			.sector = dest,
+			.count = 1,
+		};
+
+		IO(dm_safe_io(&io_req, 1, &region, NULL, true));
+	}
+	mempool_free(buf, wb->buf_1_pool);
+}
+
+void invalidate_previous_cache(struct wb_device *wb, struct segment_header *seg,
+			       struct metablock *old_mb, bool overwrite_fullsize)
+{
+	u8 dirty_bits = read_mb_dirtiness(wb, seg, old_mb);
+
+	/*
+	 * First clean up the previous cache and migrate the cache if needed.
+	 */
+	bool needs_cleanup_prev_cache =
+		!overwrite_fullsize || !(dirty_bits == 255);
+
+	/*
+	 * Migration works in the background and may have already cleaned up
+	 * the metablock. If the metablock is clean we need not migrate.
+	 */
+	if (!dirty_bits)
+		needs_cleanup_prev_cache = false;
+
+	if (overwrite_fullsize)
+		needs_cleanup_prev_cache = false;
+
+	if (unlikely(needs_cleanup_prev_cache)) {
+		wait_for_flushing(wb, seg->id);
+		migrate_mb(wb, seg, old_mb, dirty_bits, true);
+	}
+
+	cleanup_mb_if_dirty(wb, seg, old_mb);
+
+	ht_del(wb, old_mb);
+}
+
+static void
+write_on_buffer(struct wb_device *wb, struct segment_header *seg,
+		struct metablock *mb, struct bio *bio)
+{
+	sector_t start_sector = ((mb_idx_inseg(wb, mb->idx) + 1) << 3) +
+				io_offset(bio);
+	size_t start_byte = start_sector << SECTOR_SHIFT;
+	void *data = bio_data(bio);
+
+	/*
+	 * Write data block to the volatile RAM buffer.
+	 */
+	memcpy(wb->current_rambuf->data + start_byte, data, bio->bi_size);
+}
+
+static void advance_cursor(struct wb_device *wb)
+{
+	u32 tmp32;
+	div_u64_rem(wb->cursor + 1, wb->nr_caches, &tmp32);
+	wb->cursor = tmp32;
+}
+
+struct per_bio_data {
+	void *ptr;
+};
+
+static int writeboost_map(struct dm_target *ti, struct bio *bio)
+{
+	struct wb_device *wb = ti->private;
+	struct dm_dev *origin_dev = wb->origin_dev;
+	int rw = bio_data_dir(bio);
+	struct lookup_key key = {
+		.sector = calc_cache_alignment(bio->bi_sector),
+	};
+	struct ht_head *head = ht_get_head(wb, &key);
+
+	struct segment_header *uninitialized_var(found_seg);
+	struct metablock *mb, *new_mb;
+
+	bool found,
+	     on_buffer, /* is the metablock found on the RAM buffer? */
+	     needs_queue_seg; /* need to queue the current seg? */
+
+	struct per_bio_data *map_context;
+	map_context = dm_per_bio_data(bio, ti->per_bio_data_size);
+	map_context->ptr = NULL;
+
+	DEAD(bio_endio(bio, -EIO); return DM_MAPIO_SUBMITTED);
+
+	/*
+	 * We discard sectors only on the backing store because
+	 * blocks on the cache device are unlikely to be discarded.
+	 * A discard is usually issued long after writing;
+	 * the block has likely been migrated back by then.
+	 *
+	 * Moreover, it is very hard to implement discarding of cache blocks.
+	 */
+	if (bio->bi_rw & REQ_DISCARD) {
+		bio_remap(bio, origin_dev, bio->bi_sector);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	/*
+	 * Deferred ACK for flush requests
+	 *
+	 * In device-mapper, a bio with REQ_FLUSH is guaranteed to have no data,
+	 * so we can simply defer it for lazy execution.
+	 */
+	if (bio->bi_rw & REQ_FLUSH) {
+		BUG_ON(bio->bi_size);
+		queue_barrier_io(wb, bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	mutex_lock(&wb->io_lock);
+	mb = ht_lookup(wb, head, &key);
+	if (mb) {
+		found_seg = mb_to_seg(wb, mb);
+		atomic_inc(&found_seg->nr_inflight_ios);
+	}
+
+	found = (mb != NULL);
+	on_buffer = false;
+	if (found)
+		on_buffer = is_on_buffer(wb, mb->idx);
+
+	inc_stat(wb, rw, found, on_buffer, io_fullsize(bio));
+
+	/*
+	 * (Locking)
+	 * Cache data is placed either on the RAM buffer or, once flushed,
+	 * on the SSD. To ease the locking, we establish a simple rule for
+	 * the dirtiness of cache data.
+	 *
+	 * If the data is on the RAM buffer, the dirtiness (dirty_bits of the
+	 * metablock) only increases. The justification for this design is that
+	 * the cache on the RAM buffer is seldom migrated.
+	 * If the data is, on the other hand, on the SSD after being flushed,
+	 * the dirtiness only decreases.
+	 *
+	 * This simple rule frees us from fluctuating dirtiness and thus
+	 * simplifies the locking design.
+	 */
+
+	if (!rw) {
+		u8 dirty_bits;
+
+		mutex_unlock(&wb->io_lock);
+
+		if (!found) {
+			bio_remap(bio, origin_dev, bio->bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+
+		dirty_bits = read_mb_dirtiness(wb, found_seg, mb);
+		if (unlikely(on_buffer)) {
+			if (dirty_bits)
+				migrate_buffered_mb(wb, mb, dirty_bits);
+
+			atomic_dec(&found_seg->nr_inflight_ios);
+			bio_remap(bio, origin_dev, bio->bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+
+		/*
+		 * We must wait for the (possibly) queued segment to be flushed
+		 * to the cache device.
+		 * Without this, we would read wrong data from the cache device.
+		 */
+		wait_for_flushing(wb, found_seg->id);
+
+		if (likely(dirty_bits == 255)) {
+			bio_remap(bio, wb->cache_dev,
+				  calc_mb_start_sector(wb, found_seg, mb->idx) +
+				  io_offset(bio));
+			map_context->ptr = found_seg;
+		} else {
+			migrate_mb(wb, found_seg, mb, dirty_bits, true);
+			cleanup_mb_if_dirty(wb, found_seg, mb);
+
+			atomic_dec(&found_seg->nr_inflight_ios);
+			bio_remap(bio, origin_dev, bio->bi_sector);
+		}
+		return DM_MAPIO_REMAPPED;
+	}
+
+	if (found) {
+		if (unlikely(on_buffer)) {
+			mutex_unlock(&wb->io_lock);
+			goto write_on_buffer;
+		} else {
+			invalidate_previous_cache(wb, found_seg, mb,
+						  io_fullsize(bio));
+			atomic_dec(&found_seg->nr_inflight_ios);
+			goto write_not_found;
+		}
+	}
+
+write_not_found:
+	/*
+	 * If wb->cursor is 254, 509, ..., i.e. the last cache line in
+	 * the segment, we must flush the current segment and get a new one.
+	 */
+	needs_queue_seg = !mb_idx_inseg(wb, wb->cursor + 1);
+
+	if (needs_queue_seg)
+		queue_current_buffer(wb);
+
+	advance_cursor(wb);
+
+	new_mb = wb->current_seg->mb_array + mb_idx_inseg(wb, wb->cursor);
+	BUG_ON(new_mb->dirty_bits);
+	ht_register(wb, head, new_mb, &key);
+
+	atomic_inc(&wb->current_seg->nr_inflight_ios);
+	mutex_unlock(&wb->io_lock);
+
+	mb = new_mb;
+
+write_on_buffer:
+	taint_mb(wb, wb->current_seg, mb, bio);
+
+	write_on_buffer(wb, wb->current_seg, mb, bio);
+
+	atomic_dec(&wb->current_seg->nr_inflight_ios);
+
+	/*
+	 * Deferred ACK for FUA requests
+	 *
+	 * A bio with the REQ_FUA flag carries data, so it must run through
+	 * the path for a usual bio. At this point the data is stored in the
+	 * RAM buffer.
+	 */
+	if (bio->bi_rw & REQ_FUA) {
+		queue_barrier_io(wb, bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	LIVE_DEAD(bio_endio(bio, 0),
+		  bio_endio(bio, -EIO));
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int writeboost_end_io(struct dm_target *ti, struct bio *bio, int error)
+{
+	struct segment_header *seg;
+	struct per_bio_data *map_context =
+		dm_per_bio_data(bio, ti->per_bio_data_size);
+
+	if (!map_context->ptr)
+		return 0;
+
+	seg = map_context->ptr;
+	atomic_dec(&seg->nr_inflight_ios);
+
+	return 0;
+}
+
+static int consume_essential_argv(struct wb_device *wb, struct dm_arg_set *as)
+{
+	int r = 0;
+	struct dm_target *ti = wb->ti;
+
+	static struct dm_arg _args[] = {
+		{0, 0, "invalid buffer type"},
+	};
+	unsigned tmp;
+
+	r = dm_read_arg(_args, as, &tmp, &ti->error);
+	if (r)
+		return r;
+	wb->type = tmp;
+
+	r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+			  &wb->origin_dev);
+	if (r) {
+		ti->error = "failed to get origin dev";
+		return r;
+	}
+
+	r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+			  &wb->cache_dev);
+	if (r) {
+		ti->error = "failed to get cache dev";
+		goto bad;
+	}
+
+	return r;
+
+bad:
+	dm_put_device(ti, wb->origin_dev);
+	return r;
+}
+
+#define consume_kv(name, nr) { \
+	if (!strcasecmp(key, #name)) { \
+		if (!argc) \
+			break; \
+		r = dm_read_arg(_args + (nr), as, &tmp, &ti->error); \
+		if (r) \
+			break; \
+		wb->name = tmp; \
+	 } }
+
+static int consume_optional_argv(struct wb_device *wb, struct dm_arg_set *as)
+{
+	int r = 0;
+	struct dm_target *ti = wb->ti;
+
+	static struct dm_arg _args[] = {
+		{0, 4, "invalid optional argc"},
+		{4, 10, "invalid segment_size_order"},
+		{512, UINT_MAX, "invalid rambuf_pool_amount"},
+	};
+	unsigned tmp, argc = 0;
+
+	if (as->argc) {
+		r = dm_read_arg_group(_args, as, &argc, &ti->error);
+		if (r)
+			return r;
+	}
+
+	while (argc) {
+		const char *key = dm_shift_arg(as);
+		argc--;
+
+		r = -EINVAL;
+
+		consume_kv(segment_size_order, 1);
+		consume_kv(rambuf_pool_amount, 2);
+
+		if (!r) {
+			argc--;
+		} else {
+			ti->error = "invalid optional key";
+			break;
+		}
+	}
+
+	return r;
+}
+
+static int do_consume_tunable_argv(struct wb_device *wb,
+				   struct dm_arg_set *as, unsigned argc)
+{
+	int r = 0;
+	struct dm_target *ti = wb->ti;
+
+	static struct dm_arg _args[] = {
+		{0, 1, "invalid allow_migrate"},
+		{0, 1, "invalid enable_migration_modulator"},
+		{1, 1000, "invalid barrier_deadline_ms"},
+		{1, 1000, "invalid nr_max_batched_migration"},
+		{0, 100, "invalid migrate_threshold"},
+		{0, 3600, "invalid update_record_interval"},
+		{0, 3600, "invalid sync_interval"},
+	};
+	unsigned tmp;
+
+	while (argc) {
+		const char *key = dm_shift_arg(as);
+		argc--;
+
+		r = -EINVAL;
+
+		consume_kv(allow_migrate, 0);
+		consume_kv(enable_migration_modulator, 1);
+		consume_kv(barrier_deadline_ms, 2);
+		consume_kv(nr_max_batched_migration, 3);
+		consume_kv(migrate_threshold, 4);
+		consume_kv(update_record_interval, 5);
+		consume_kv(sync_interval, 6);
+
+		if (!r) {
+			argc--;
+		} else {
+			ti->error = "invalid tunable key";
+			break;
+		}
+	}
+
+	return r;
+}
+
+static int consume_tunable_argv(struct wb_device *wb, struct dm_arg_set *as)
+{
+	int r = 0;
+	struct dm_target *ti = wb->ti;
+
+	static struct dm_arg _args[] = {
+		{0, 14, "invalid tunable argc"},
+	};
+	unsigned argc = 0;
+
+	if (as->argc) {
+		r = dm_read_arg_group(_args, as, &argc, &ti->error);
+		if (r)
+			return r;
+		/*
+		 * Tunables are emitted only if
+		 * they were originally passed.
+		 */
+		wb->should_emit_tunables = true;
+	}
+
+	return do_consume_tunable_argv(wb, as, argc);
+}
+
+static int init_core_struct(struct dm_target *ti)
+{
+	int r = 0;
+	struct wb_device *wb;
+
+	r = dm_set_target_max_io_len(ti, 1 << 3);
+	if (r) {
+		WBERR("failed to set max_io_len");
+		return r;
+	}
+
+	ti->flush_supported = true;
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->discard_zeroes_data_unsupported = true;
+	ti->per_bio_data_size = sizeof(struct per_bio_data);
+
+	wb = kzalloc(sizeof(*wb), GFP_KERNEL);
+	if (!wb) {
+		WBERR("failed to allocate wb");
+		return -ENOMEM;
+	}
+	ti->private = wb;
+	wb->ti = ti;
+
+	mutex_init(&wb->io_lock);
+	spin_lock_init(&wb->lock);
+	atomic64_set(&wb->nr_dirty_caches, 0);
+	clear_bit(WB_DEAD, &wb->flags);
+	wb->should_emit_tunables = false;
+
+	return r;
+}
+
+/*
+ * Create a Writeboost device
+ *
+ * <type>
+ * <essential args>*
+ * <#optional args> <optional args>*
+ * <#tunable args> <tunable args>*
+ * Optionals and tunables are unordered lists of key-value pairs.
+ *
+ * See Documentation for details.
+ */
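+/*
+ * Example (illustration only, hypothetical device paths): with no optional
+ * and no tunable arguments the writeboost target parameters reduce to
+ *   0 /dev/mapper/slow /dev/mapper/fast
+ * in which case the defaults set below (segment_size_order 7,
+ * rambuf_pool_amount 2048) are used.
+ */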
+static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	int r = 0;
+	struct wb_device *wb;
+
+	struct dm_arg_set as;
+	as.argc = argc;
+	as.argv = argv;
+
+	r = init_core_struct(ti);
+	if (r) {
+		ti->error = "failed to init core";
+		return r;
+	}
+	wb = ti->private;
+
+	r = consume_essential_argv(wb, &as);
+	if (r) {
+		ti->error = "failed to consume essential argv";
+		goto bad_essential_argv;
+	}
+
+	wb->segment_size_order = 7;
+	wb->rambuf_pool_amount = 2048;
+	r = consume_optional_argv(wb, &as);
+	if (r) {
+		ti->error = "failed to consume optional argv";
+		goto bad_optional_argv;
+	}
+
+	r = resume_cache(wb);
+	if (r) {
+		ti->error = "failed to resume cache";
+		goto bad_resume_cache;
+	}
+
+	r = consume_tunable_argv(wb, &as);
+	if (r) {
+		ti->error = "failed to consume tunable argv";
+		goto bad_tunable_argv;
+	}
+
+	clear_stat(wb);
+	atomic64_set(&wb->count_non_full_flushed, 0);
+
+	return r;
+
+bad_tunable_argv:
+	free_cache(wb);
+bad_resume_cache:
+bad_optional_argv:
+	dm_put_device(ti, wb->cache_dev);
+	dm_put_device(ti, wb->origin_dev);
+bad_essential_argv:
+	kfree(wb);
+
+	return r;
+}
+
+static void writeboost_dtr(struct dm_target *ti)
+{
+	struct wb_device *wb = ti->private;
+
+	free_cache(wb);
+
+	dm_put_device(ti, wb->cache_dev);
+	dm_put_device(ti, wb->origin_dev);
+
+	kfree(wb);
+
+	ti->private = NULL;
+}
+
+/*
+ * .postsuspend is called before .dtr.
+ * We flush out all the transient data and make them persistent.
+ */
+static void writeboost_postsuspend(struct dm_target *ti)
+{
+	int r = 0;
+	struct wb_device *wb = ti->private;
+
+	flush_current_buffer(wb);
+	IO(blkdev_issue_flush(wb->cache_dev->bdev, GFP_NOIO, NULL));
+}
+
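+/*
+ * Handle dmsetup messages. For example (hypothetical device name):
+ *   dmsetup message wbdev 0 clear_stat
+ *   dmsetup message wbdev 0 drop_caches
+ *   dmsetup message wbdev 0 barrier_deadline_ms 10
+ * The last form updates a single tunable key-value pair.
+ */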
+static int writeboost_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct wb_device *wb = ti->private;
+
+	struct dm_arg_set as;
+	as.argc = argc;
+	as.argv = argv;
+
+	if (!strcasecmp(argv[0], "clear_stat")) {
+		clear_stat(wb);
+		return 0;
+	}
+
+	if (!strcasecmp(argv[0], "drop_caches")) {
+		int r = 0;
+		wb->force_drop = true;
+		r = wait_event_interruptible(wb->wait_drop_caches,
+			     !atomic64_read(&wb->nr_dirty_caches));
+		wb->force_drop = false;
+		return r;
+	}
+
+	return do_consume_tunable_argv(wb, &as, 2);
+}
+
+/*
+ * Since Writeboost is just a cache target and the cache block size is fixed
+ * to 4KB, there is no reason to count the cache device in device iteration.
+ */
+static int
+writeboost_iterate_devices(struct dm_target *ti,
+			   iterate_devices_callout_fn fn, void *data)
+{
+	struct wb_device *wb = ti->private;
+	struct dm_dev *orig = wb->origin_dev;
+	sector_t start = 0;
+	sector_t len = dm_devsize(orig);
+	return fn(ti, orig, start, len, data);
+}
+
+static void
+writeboost_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	blk_limits_io_opt(limits, 4096);
+}
+
+static void emit_tunables(struct wb_device *wb, char *result, unsigned maxlen)
+{
+	ssize_t sz = 0;
+
+	DMEMIT(" %d", 14);
+	DMEMIT(" barrier_deadline_ms %lu",
+	       wb->barrier_deadline_ms);
+	DMEMIT(" allow_migrate %d",
+	       wb->allow_migrate ? 1 : 0);
+	DMEMIT(" enable_migration_modulator %d",
+	       wb->enable_migration_modulator ? 1 : 0);
+	DMEMIT(" migrate_threshold %d",
+	       wb->migrate_threshold);
+	DMEMIT(" nr_cur_batched_migration %u",
+	       wb->nr_cur_batched_migration);
+	DMEMIT(" sync_interval %lu",
+	       wb->sync_interval);
+	DMEMIT(" update_record_interval %lu",
+	       wb->update_record_interval);
+}
+
+static void writeboost_status(struct dm_target *ti, status_type_t type,
+			      unsigned flags, char *result, unsigned maxlen)
+{
+	ssize_t sz = 0;
+	char buf[BDEVNAME_SIZE];
+	struct wb_device *wb = ti->private;
+	size_t i;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%u %u %llu %llu %llu %llu %llu",
+		       (unsigned int)
+		       wb->cursor,
+		       (unsigned int)
+		       wb->nr_caches,
+		       (long long unsigned int)
+		       wb->nr_segments,
+		       (long long unsigned int)
+		       wb->current_seg->id,
+		       (long long unsigned int)
+		       atomic64_read(&wb->last_flushed_segment_id),
+		       (long long unsigned int)
+		       atomic64_read(&wb->last_migrated_segment_id),
+		       (long long unsigned int)
+		       atomic64_read(&wb->nr_dirty_caches));
+
+		for (i = 0; i < STATLEN; i++) {
+			atomic64_t *v = &wb->stat[i];
+			DMEMIT(" %llu", (unsigned long long) atomic64_read(v));
+		}
+		DMEMIT(" %llu", (unsigned long long) atomic64_read(&wb->count_non_full_flushed));
+		emit_tunables(wb, result + sz, maxlen - sz);
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%u", wb->type);
+		format_dev_t(buf, wb->origin_dev->bdev->bd_dev);
+		DMEMIT(" %s", buf);
+		format_dev_t(buf, wb->cache_dev->bdev->bd_dev);
+		DMEMIT(" %s", buf);
+		DMEMIT(" 4 segment_size_order %u rambuf_pool_amount %u",
+		       wb->segment_size_order,
+		       wb->rambuf_pool_amount);
+		if (wb->should_emit_tunables)
+			emit_tunables(wb, result + sz, maxlen - sz);
+		break;
+	}
+}
+
+static struct target_type writeboost_target = {
+	.name = "writeboost",
+	.version = {0, 1, 0},
+	.module = THIS_MODULE,
+	.map = writeboost_map,
+	.end_io = writeboost_end_io,
+	.ctr = writeboost_ctr,
+	.dtr = writeboost_dtr,
+	/*
+	 * .merge is not implemented.
+	 * We split the passed I/O into 4KB cache blocks no matter
+	 * how big the I/O is.
+	 */
+	.postsuspend = writeboost_postsuspend,
+	.message = writeboost_message,
+	.status = writeboost_status,
+	.io_hints = writeboost_io_hints,
+	.iterate_devices = writeboost_iterate_devices,
+};
+
+struct dm_io_client *wb_io_client;
+struct workqueue_struct *safe_io_wq;
+static int __init writeboost_module_init(void)
+{
+	int r = 0;
+
+	r = dm_register_target(&writeboost_target);
+	if (r < 0) {
+		WBERR("failed to register target");
+		return r;
+	}
+
+	safe_io_wq = alloc_workqueue("wbsafeiowq",
+				     WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
+	if (!safe_io_wq) {
+		WBERR("failed to allocate safe_io_wq");
+		r = -ENOMEM;
+		goto bad_wq;
+	}
+
+	wb_io_client = dm_io_client_create();
+	if (IS_ERR(wb_io_client)) {
+		WBERR("failed to allocate wb_io_client");
+		r = PTR_ERR(wb_io_client);
+		goto bad_io_client;
+	}
+
+	return r;
+
+bad_io_client:
+	destroy_workqueue(safe_io_wq);
+bad_wq:
+	dm_unregister_target(&writeboost_target);
+
+	return r;
+}
+
+static void __exit writeboost_module_exit(void)
+{
+	dm_io_client_destroy(wb_io_client);
+	destroy_workqueue(safe_io_wq);
+	dm_unregister_target(&writeboost_target);
+}
+
+module_init(writeboost_module_init);
+module_exit(writeboost_module_exit);
+
+MODULE_AUTHOR("Akira Hayakawa <ruby.wktk at gmail.com>");
+MODULE_DESCRIPTION(DM_NAME " writeboost target");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-writeboost.h b/drivers/md/dm-writeboost.h
new file mode 100644
index 0000000..3e37b53
--- /dev/null
+++ b/drivers/md/dm-writeboost.h
@@ -0,0 +1,464 @@
+/*
+ * Copyright (C) 2012-2014 Akira Hayakawa <ruby.wktk at gmail.com>
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef DM_WRITEBOOST_H
+#define DM_WRITEBOOST_H
+
+#define DM_MSG_PREFIX "writeboost"
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+#include <linux/kthread.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+
+/*----------------------------------------------------------------*/
+
+#define SUB_ID(x, y) ((x) > (y) ? (x) - (y) : 0)
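+/* e.g. SUB_ID(8, 3) == 5 and SUB_ID(3, 8) == 0 (saturating subtraction) */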
+
+/*----------------------------------------------------------------*/
+
+/*
+ * Nice printk macros
+ *
+ * Production code should not include the line number,
+ * but the name of the caller seems to be OK.
+ */
+
+/*
+ * Only for debugging.
+ * Don't include this macro in the production code.
+ */
+#define wbdebug(f, args...) \
+	DMINFO("debug@%s() L.%d " f, __func__, __LINE__, ## args)
+
+#define WBERR(f, args...) \
+	DMERR("err@%s() " f, __func__, ## args)
+#define WBWARN(f, args...) \
+	DMWARN("warn@%s() " f, __func__, ## args)
+#define WBINFO(f, args...) \
+	DMINFO("info@%s() " f, __func__, ## args)
+
+/*----------------------------------------------------------------*/
+
+/*
+ * The Detail of the Disk Format (SSD)
+ * -----------------------------------
+ *
+ * ### Overall
+ * Superblock (1MB) + Segment + Segment ...
+ *
+ * ### Superblock
+ * head <----                                     ----> tail
+ * superblock header (512B) + ... + superblock record (512B)
+ *
+ * ### Segment
+ * segment_header_device (512B) +
+ * metablock_device * nr_caches_inseg +
+ * data[0] (4KB) + data[1] + ... + data[nr_cache_inseg - 1]
+ */
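+
+/*
+ * For illustration only (with the default segment_size_order of 7):
+ * a segment spans 1 << 7 = 128 sectors (64KB). Assuming the segment header
+ * and the metablock array occupy the first 4KB block, that leaves room for
+ * 15 cache blocks of 4KB each.
+ */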
+
+/*----------------------------------------------------------------*/
+
+/*
+ * Superblock Header (Immutable)
+ * -----------------------------
+ * The first sector of the superblock region, whose contents
+ * are unchanged after formatting.
+ */
+#define WB_MAGIC 0x57427374 /* Magic number "WBst" */
+struct superblock_header_device {
+	__le32 magic;
+	__u8 segment_size_order;
+} __packed;
+
+/*
+ * Superblock Record (Mutable)
+ * ---------------------------
+ * The last sector of the superblock region.
+ * Records the current cache status if required.
+ */
+struct superblock_record_device {
+	__le64 last_migrated_segment_id;
+} __packed;
+
+/*----------------------------------------------------------------*/
+
+/*
+ * The size must be a divisor of the sector size to avoid straddling
+ * two neighboring sectors.
+ * Facebook's flashcache does the same thing.
+ */
+struct metablock_device {
+	__le64 sector;
+	__u8 dirty_bits;
+	__u8 padding[16 - (8 + 1)]; /* 16B */
+} __packed;
+
+#define WB_CKSUM_SEED (~(u32)0)
+
+struct segment_header_device {
+	/*
+	 * We assume a 1-sector write is atomic.
+	 * This 1-sector region contains important information
+	 * such as the checksum of the rest of the segment data.
+	 * We use a 32-bit checksum to audit whether the segment was
+	 * correctly written to the cache device.
+	 */
+	/* - FROM ------------------------------------ */
+	__le64 id;
+	/* TODO add timestamp? */
+	__le32 checksum;
+	/*
+	 * The number of metablocks in this segment header
+	 * to be considered in log replay. The rest are ignored.
+	 */
+	__u8 length;
+	__u8 padding[512 - (8 + 4 + 1)]; /* 512B */
+	/* - TO -------------------------------------- */
+	struct metablock_device mbarr[0]; /* 16B * N */
+} __packed;
+
+/*----------------------------------------------------------------*/
+
+struct metablock {
+	sector_t sector; /* The original aligned address */
+
+	u32 idx; /* Index in the metablock array. Const */
+
+	struct hlist_node ht_list; /* Linked to the Hash table */
+
+	u8 dirty_bits; /* 8bit for dirtiness in sector granularity */
+};
+
+#define SZ_MAX (~(size_t)0)
+struct segment_header {
+	u64 id; /* Must be initialized to 0 */
+
+	/*
+	 * The number of metablocks in a segment to flush and then migrate.
+	 */
+	u8 length;
+
+	u32 start_idx; /* Const */
+	sector_t start_sector; /* Const */
+
+	atomic_t nr_inflight_ios;
+
+	struct metablock mb_array[0];
+};
+
+/*----------------------------------------------------------------*/
+
+enum RAMBUF_TYPE {
+	BUF_NORMAL = 0, /* Volatile DRAM */
+	BUF_NV_BLK, /* Non-volatile with block I/F */
+	BUF_NV_RAM, /* Non-volatile with PRAM I/F */
+};
+
+/*
+ * The RAM buffer is the buffer that any dirty data is first written to.
+ * The type member in wb_device indicates the buffer type.
+ */
+struct rambuffer {
+	void *data; /* The DRAM buffer. Used as the buffer to submit I/O */
+};
+
+/*
+ * wbflusher's favorite food.
+ * The foreground queues this object and the wbflusher later pops
+ * one job to submit a journal write to the cache device.
+ */
+struct flush_job {
+	struct work_struct work;
+	struct wb_device *wb;
+	struct segment_header *seg;
+	struct rambuffer *rambuf; /* RAM buffer to flush */
+	struct bio_list barrier_ios; /* List of deferred bios */
+};
+
+/*----------------------------------------------------------------*/
+
+enum STATFLAG {
+	STAT_WRITE = 0,
+	STAT_HIT,
+	STAT_ON_BUFFER,
+	STAT_FULLSIZE,
+};
+#define STATLEN (1 << 4)
+
+enum WB_FLAG {
+	/*
+	 * This flag is set when either of the underlying devices
+	 * returns EIO and we must immediately block up the whole device
+	 * to avoid further damage.
+	 */
+	WB_DEAD = 0,
+};
+
+/*
+ * The context of the cache driver.
+ */
+struct wb_device {
+	enum RAMBUF_TYPE type;
+
+	struct dm_target *ti;
+
+	struct dm_dev *origin_dev; /* Slow device (HDD) */
+	struct dm_dev *cache_dev; /* Fast device (SSD) */
+
+	mempool_t *buf_1_pool; /* 1 sector buffer pool */
+	mempool_t *buf_8_pool; /* 8 sector buffer pool */
+
+	/*
+	 * A mutex is very lightweight, so we chose it to mitigate
+	 * the locking overhead.
+	 * To optimize the read path, an rw_semaphore is an option,
+	 * but it would sacrifice the write path.
+	 */
+	struct mutex io_lock;
+
+	spinlock_t lock;
+
+	u8 segment_size_order; /* Const */
+	u8 nr_caches_inseg; /* Const */
+
+	/*---------------------------------------------*/
+
+	/******************
+	 * Current position
+	 ******************/
+
+	/*
+	 * Current metablock index,
+	 * which is the last place already written,
+	 * *not* the position to write next.
+	 */
+	u32 cursor;
+	struct segment_header *current_seg;
+	struct rambuffer *current_rambuf;
+
+	/*---------------------------------------------*/
+
+	/**********************
+	 * Segment header array
+	 **********************/
+
+	u32 nr_segments; /* Const */
+	struct large_array *segment_header_array;
+
+	/*---------------------------------------------*/
+
+	/********************
+	 * Chained Hash table
+	 ********************/
+
+	u32 nr_caches; /* Const */
+	struct large_array *htable;
+	size_t htsize;
+	struct ht_head *null_head;
+
+	/*---------------------------------------------*/
+
+	/*****************
+	 * RAM buffer pool
+	 *****************/
+
+	u32 rambuf_pool_amount; /* kB */
+	u32 nr_rambuf_pool; /* Const */
+	struct rambuffer *rambuf_pool;
+	mempool_t *flush_job_pool;
+
+	/*---------------------------------------------*/
+
+	/***********
+	 * wbflusher
+	 ***********/
+
+	struct workqueue_struct *flusher_wq;
+	wait_queue_head_t flush_wait_queue; /* wait for a segment to be flushed */
+	atomic64_t last_flushed_segment_id;
+
+	/*---------------------------------------------*/
+
+	/*************************
+	 * Barrier deadline worker
+	 *************************/
+
+	struct work_struct barrier_deadline_work;
+	struct timer_list barrier_deadline_timer;
+	struct bio_list barrier_ios; /* List of barrier requests */
+	unsigned long barrier_deadline_ms; /* tunable */
+
+	/*---------------------------------------------*/
+
+	/****************
+	 * Migrate daemon
+	 ****************/
+
+	struct task_struct *migrate_daemon;
+	int allow_migrate;
+	int urge_migrate; /* Start migration immediately */
+	int force_drop; /* Don't stop migration */
+	atomic64_t last_migrated_segment_id;
+
+	/*
+	 * Data structures used by migrate daemon
+	 */
+	wait_queue_head_t migrate_wait_queue; /* wait for a segment to be migrated */
+	wait_queue_head_t wait_drop_caches; /* wait for drop_caches */
+
+	wait_queue_head_t migrate_io_wait_queue; /* wait for migrate ios */
+	atomic_t migrate_io_count;
+	atomic_t migrate_fail_count;
+
+	u32 nr_cur_batched_migration;
+	u32 nr_max_batched_migration; /* tunable */
+
+	u32 num_emigrates; /* Number of emigrates */
+	struct segment_header **emigrates; /* Segments to be migrated */
+	void *migrate_buffer; /* Memorizes the data blocks of the emigrates */
+	u8 *dirtiness_snapshot; /* Memorizes the dirtiness of the metablocks to be migrated */
+
+	/*---------------------------------------------*/
+
+	/*********************
+	 * Migration modulator
+	 *********************/
+
+	struct task_struct *modulator_daemon;
+	int enable_migration_modulator; /* tunable */
+	u8 migrate_threshold;
+
+	/*---------------------------------------------*/
+
+	/*********************
+	 * Superblock recorder
+	 *********************/
+
+	struct task_struct *recorder_daemon;
+	unsigned long update_record_interval; /* tunable */
+
+	/*---------------------------------------------*/
+
+	/*************
+	 * Sync daemon
+	 *************/
+
+	struct task_struct *sync_daemon;
+	unsigned long sync_interval; /* tunable */
+
+	/*---------------------------------------------*/
+
+	/************
+	 * Statistics
+	 ************/
+
+	atomic64_t nr_dirty_caches;
+	atomic64_t stat[STATLEN];
+	atomic64_t count_non_full_flushed;
+
+	/*---------------------------------------------*/
+
+	unsigned long flags;
+	bool should_emit_tunables; /* should emit tunables in dmsetup table? */
+};
+
+/*----------------------------------------------------------------*/
+
+void acquire_new_seg(struct wb_device *, u64 id);
+void flush_current_buffer(struct wb_device *);
+void inc_nr_dirty_caches(struct wb_device *);
+void cleanup_mb_if_dirty(struct wb_device *, struct segment_header *, struct metablock *);
+u8 read_mb_dirtiness(struct wb_device *, struct segment_header *, struct metablock *);
+void invalidate_previous_cache(struct wb_device *, struct segment_header *,
+			       struct metablock *old_mb, bool overwrite_fullsize);
+
+/*----------------------------------------------------------------*/
+
+extern struct workqueue_struct *safe_io_wq;
+extern struct dm_io_client *wb_io_client;
+
+/*
+ * Wrapper around the dm_io function.
+ * Set thread to true to run dm_io in another thread to avoid a potential deadlock.
+ */
+#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
+	dm_safe_io_internal(wb, (io_req), (num_regions), (regions), \
+			    (err_bits), (thread), __func__);
+int dm_safe_io_internal(struct wb_device *, struct dm_io_request *,
+			unsigned num_regions, struct dm_io_region *,
+			unsigned long *err_bits, bool thread, const char *caller);
+
+sector_t dm_devsize(struct dm_dev *);
+
+/*----------------------------------------------------------------*/
+
+/*
+ * Device blockup
+ * --------------
+ *
+ * An I/O error on either the backing device or the cache device should
+ * block up the whole system immediately.
+ * After the system is blocked up, all I/Os to the underlying devices
+ * are ignored as if they were redirected to /dev/null.
+ */
+
+#define LIVE_DEAD(proc_live, proc_dead) \
+	do { \
+		if (likely(!test_bit(WB_DEAD, &wb->flags))) { \
+			proc_live; \
+		} else { \
+			proc_dead; \
+		} \
+	} while (0)
+
+#define noop_proc do {} while (0)
+#define LIVE(proc) LIVE_DEAD(proc, noop_proc);
+#define DEAD(proc) LIVE_DEAD(noop_proc, proc);
+
+/*
+ * Macro to add failure-handling context to an I/O routine call.
+ * The idea is inherited from the Maybe monad of the Haskell language.
+ *
+ * Policies
+ * --------
+ * 1. Only -EIO blocks up the system.
+ * 2. -EOPNOTSUPP may be returned if the target device is a virtual
+ *    device and we request a discard to it.
+ * 3. -ENOMEM may be returned from blkdev_issue_discard (3.12-rc5),
+ *    for example. Waiting for a while can make room for new allocations.
+ * 4. Other unknown error codes are ignored and users are asked to report them.
+ */
+#define IO(proc) \
+	do { \
+		r = 0; \
+		LIVE(r = proc); /* do nothing after blockup */ \
+		if (r == -EOPNOTSUPP) { \
+			r = 0; \
+		} else if (r == -EIO) { \
+			set_bit(WB_DEAD, &wb->flags); \
+			WBERR("device is marked as dead"); \
+		} else if (r == -ENOMEM) { \
+			WBERR("I/O failed by ENOMEM"); \
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));\
+		} else if (r) { \
+			WARN_ONCE(1, "PLEASE REPORT!!! I/O FAILED FOR UNKNOWN REASON err(%d)", r); \
+			r = 0; \
+		} \
+	} while (r)
+
+/*----------------------------------------------------------------*/
+
+#endif
-- 
1.8.3.4
