[dm-devel] [PATCH] dm-raid: add RAID discard support

NeilBrown neilb at suse.de
Wed Oct 1 02:56:25 UTC 2014


On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm at redhat.com>
wrote:

> 
> Martin,
> 
> thanks for the good explanation of the state of the discard union.
> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
> 
> I was planning to have a followup patch for dm-raid supporting a dm-raid 
> table
> line argument to prohibit discard passdown.
> 
> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data 
> support
> related to RAID4/5/6, we need that in upstream together with the initial 
> patch.
> 
> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6 
> table
> lines to avoid possible data corruption but can be avoided on RAID1/10 
> table lines,
> because the latter are not suffering from any  discard_zeroes_data flaw.
> 
> 
> Neil,
> 
> are you going to disable discards in RAID4/5/6 shortly
> or rather go with your bitmap solution?

Can I just close my eyes and hope it goes away?

The idea of a bitmap of uninitialised areas is not a short-term solution.
But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
would  mean that people with good sensible hardware wouldn't be able to use
it properly.

I would really rather that discard_zeroes_data were only set on devices where
it was actually true.  Then it wouldn't be my problem any more.

Maybe I could do a loud warning
  "Not enabling DISCARD on RAID5 because we cannot trust committees.
   Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
   sectors as zeros"

and add an appropriate module parameter......

While we are on the topic, maybe I should write down my thoughts about the
bitmap thing in case someone wants to contribute.

  There are 3 states that a 'region' can be in:
    1- known to be in-sync
    2- possibly not in sync, but it should  be
    3- probably not in sync, contains no valuable data.

 A read from '3' should return zeroes.
 A write to '3' should change the region to be '2'.  It could either
    write zeros before allowing the write to start, or it could just start
    a normal resync.

 Here is a question:  if a region has been discarded, are we guaranteed that
 reads are at least stable.  i.e. if I read twice will I definitely get the
 same value?

 It would be simplest to store all the states in the one bitmap. i.e. have a
 new version of the current bitmap (which distinguishes between '1' and '2')
 which allows the 3rd state.  Probably just 2 bits per region, though we could
 squeeze state for 20 regions into 32 bits if we really wanted :-)

 Then we set the bitmap to all '3's when it is created and set regions to
 '3' when discarding.

 One problem with this approach is that it forces the same region size.
 Regions for the current bitmap can conveniently be 10s of Megabytes.  For
 state '3', we ideally want one region per stripe.
 For 4TB drives with a 512K chunk size (and smaller is not uncommon) that is
 8 million regions which is a megabyte of bits.

 I guess that isn't all that much.  One problem with small regions for the
 1/2 distinction is that we end up updating the bitmap more often, but
 that could be avoided by setting multiple bits at one.
 i.e. keep the internal counters with a larger granularity, and when a
 counter (of outstanding writes) becomes non-zero, set quite a few bits in
 the bitmap.

 So that is probably what I would do:
  - new version for bitmap which has 2 bits per region and encodes 3 states
  - bitmap granularity matches chunk size (by default)
  - decouple region size of bitmap from region size for internal 'dirty'
    accounting
  - write to a 'state 3' region sets it to 'state 2' and kicks off resync
  - 'discard' sets state to '3'.

NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 828 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/dm-devel/attachments/20141001/260d840a/attachment.sig>


More information about the dm-devel mailing list