[dm-devel] [NOTES] thinp zeroing

Wed Feb 22 14:14:43 UTC 2012

* Requirements

  There are two distinct requirements for zeroing applicable to the
  thin-provisioning target:

  - Avoid data leaks (DATA_LEAK)

    Consumers of thin devices may be using the same pool.  eg, two
    different vm guests provisioned from the same pool.  We must
    ensure that no data from one thin device appears on another in
    newly provisioned areas.

  - Provide guarantees about the presence/absence of sensitive data (ERASE)

    eg, When decomissioning a guest vm, the host (running the pool)
    wishes to guarantee that no data from that guest remains on the
    data device.

* Implementing DATA_LEAK

  Currently the DATA_LEAK requirement is enforced by zeroing every
  newly provisioned thin device block.  This zeroing can often be
  elided if the write io triggering the provisioning completely covers
  the block in question.  This zeroing can be turned off at the pool
  level if data leaks are not a concern (eg, a desktop system).
  Already upstream.

* Implementing ERASE

** Erase on deallocation

  The ERASE requirement is more difficult.  Zeroing data blocks when
  they are deallocated (ie. their ref count drops to zero after
  deleting the device) sounds like a good approach, but this
  introduces certain difficulties, mainly:

  - To retain our crash recovery properties, the zeroing cannot occur
    until after the next commit.  Extra on disk metadata would need to
    be stored to keep track of these blocks that need zeroing.  A
    commit would trigger a storm of io; currently the cost of a
    copy-on-write exception is paid immediately by the io that
    triggers it.  Building up delayed work like this makes it very
    hard to give performance estimates.

** Erasing from userland.

   Zeroing all unshared blocks when deleting a thin device will create
   a lot of io, I'd much rather this was being managed by userland.  I
   really don't want message ioctls that take many minutes to
   complete.

   The 'held_root' facility allows userland to read a snapshot of the
   metadata in a live pool.  [Note: this is a snapshot of the
   metadata, not a snapshot of data].  The following will implement
   ERASE in userland.

   - Deactivate all thin volumes that you wish to erase.

     Failure to deactivate would mean the mappings for the thins could
     be out of date by the time userland reads them.  There is no
     mechanism for enforcing this at the device mapper level; but
     userland can easily do this (eg, lvm2 already has a comprehensive
     locking scheme that will handle this).  It should also be pointed
     out that if you try and erase a volume while you're still using it,
     you are an idiot.

   - Grab a 'held_root'

   - Read the mappings for all of the thins you wish to erase.

   - Work out which data blocks are used exclusively by this subset of
     thins.

   - Write zeroes across these blocks.

   - send a thin-delete message to the pool for each thin.

** Crash recovery

   If we crash during the copy-on-write or provision operation the
   recovery process needs to zero those new, but not committed,
   blocks.  This requires the introduction of an 'erase log' to the
   metadata format.  This log would need to be committed *before* the
   copy/overwrite operation could proceed.

   I've implemented such an erase log [see patch], to get an idea of
   the performance overhead.  Testing in ideal conditions (ie. large
   writes that are triggering many provision/copy operations so costs
   can be amortised), we see a 25% slowdown in throughput of
   provision/copy operations.  Better than I feared.

   We can improve the performance significantly by observing that it's
   harmless to do too much zeroing of unprovisioned blocks on
   recovery.  This suggests a scheme similar to a mirror log, where we
   mark regions of the data volume that we have pending provision/copy
   operations.  When we recover we just zero *all* unallocated blocks
   in these marked regions.  This will result in fewer commits, since
   newly allocated blocks will commonly come from the same region and
   so avoid the need for a commit.  [TODO: get a proof of concept
   patch together].

** Discards

   DISCARDs *must* result in data being zeroed.  Some devices set the
   discard_zeroes_data flag.  This is not good enough; you cannot use
   this flag as a guarantee that the data no longer exists on the
   disk.  So real zeroing must occur.  I suggest we write a separate
   target that zeroes data just before discarding it, and stack it
   under the thin-pool.  The performance impact of this will be
   significant; to the point that we may wish to turn discard within
   the fs off; instead doing periodic tidy-ups.

** Avoid redundant copying

   The calculation to say whether a block is shared or not (and thus
   liable to suffer a copy-on-write exception), is an approximation.
   It sometimes says something is shared when it isn't, which causes
   us a problem wrt ERASE.  To avoid leaving orphaned copies of data,
   we must either tighten up the sharing detection [patch in the
   works], or zero the old block (via discard).

** Summary of work items [0/5]

   Too much for linux 3.4 timeframe.

   - [ ] Change the shared block detection [1 day, worth doing anyway]

   - [ ] Bitmap based erase log [1 week]

   - [ ] Recovery tool that zeroes unallocated blocks in dirty regions [1 week]

   - [ ] Implement the discard-really-zeroes target [1 month]

   - [ ] Write thin_erase userland tool [1 week]

   - [ ] Update lvm2 tools [3 months]