[dm-devel] [PATCH 03/14] dm-multisnap-mikulas-headers

Tue Mar 9 03:30:45 UTC 2010

On Mon, Mar 08 2010 at 10:08pm -0500,
Mikulas Patocka <mpatocka at redhat.com> wrote:

> > > + * The minimum block size is 512, the maximum block size is not specified (it is
> > > + * limited by 32-bit integer size and available memory). All on-disk pointers
> > > + * are in the units of blocks. The pointers are 48-bit, making this format
> > > + * capable of handling 2^48 blocks.
> > 
> > Shouldn't we require the chunk size be at least as big as
> > (and a multiple of) physical_block_size?  E.g. 4096 on a 4K sector
> > device.
> > 
> > This question applies to non-shared snapshots too.
> 
> If the device has a larger physical block size, it will reject smaller 
> chunks. The same for non-shared snapshots.

Correct, but we don't prevent the user from trying to use less than
physical_block_size.  So I think we agree we should.  Hasn't been a
concern but native 4K devices change that.

> > > + * Note: because the disks only support atomic writes of 512 bytes, the commit
> > > + * block has only 512 bytes of valid data. The remaining data in the commit
> > > + * block up to the chunk size is unused.
> > 
> > Are there other places where you assume 512b is beneficial?  My concern
> > is: what will happen on 4K devices?
> 
> With 4K chunk size, it writes 4K, but assumes that only 512b write is 
> atomic. So, if the disk supports atomic write of 4K, it doesn't hurt.

Right, so long as we impose 4K on a native 4K device, etc.  But I was
more wondering if there were other places that assume 512b granularity.
Didn't see any but figured I'd ask.

> > Would making the commit block's size match the physical_block_size give
> > us any multisnapshot benefit?  At a minimum I see a larger commit block
> > would allow us to have more remap entries (larger remap
> > array)..
> >
> > "Remaps" detailed below.  But what does that buy us?
> 
> They reduce the number of blocks written. Without remaps, you'd have to 
> write the path from the root every time.

Sure, I wasn't saying we'd eliminate/reduce rempas.  I was saying we
could increase the number of remaps (think the current limit is 27 per
512b commit block).  Using a 4K commit block would give us 100+?

So I was asking if more remaps help at all.

> > However, and before I get ahead of myself, with blk_stack_limits() we
> > could have a (DM) device that is composed of 4K and 512b devices; with a
> > resulting physical_block_size of 4K.  But 4K wouldn't be atomic to the
> > 512b disk.
> 
> Yes, that's why I must assume only 512b atomic write.

OK, fine by me so long as we impose chunk_size >= physical_block_size.

> > But what if we were to add a checksum to the commit block?  This could
> > give us the ability to have a larger commit block regardless of the
> > physical_block_size. [NOTE: I saw dm_multisnap_commit() is just writing
> > a fixed CB_SIGNATURE]
> 
> That would have to be cryptographic hash --- simple checksum can be 
> fooled.
> 
> Even that wouldn't be correct, because if the hash fails, the commit block 
> is lost.  If you wanted to use full commit blocks, you'd have to:
> 
> - divide the commit block to two.
> - write these two alternatively (so that at least one is valid)
> - hash them or (which is simpler) copy sequence number to each 512b sector 
> (so that if some sectors get written and some not, you find it out by 
> having different sequence number).
> 
> That is possible to do.

Yes, Ric mentioned a comparable strategy was used for database
transactions.

> > And in speaking with Ric Wheeler, using a checksum in the commit block
> > opens up the possibility for optimizing (reducing) the barrier ops
> > associated with:
> > 1) before the commit block is written (flushes journal transaction) 
> > 2) and after the commit block is written.
> 
> No, you have to use barriers. If the data before the commit blocks is not 
> written and the commit block is written (with matching checksum), then the 
> data is corrupted.

Fair enough, though it is used by ext4.  But with ext4 it is used in
conjunction with a checksummed journal.

> Obviously, you can checksum all the data, but SHA1 is slow and it is being 
> phased out already and even slower SHA256 is being recommended...
> 
> > Leaving us with only needing to barrier after the commit block is
> > written.  But this optimization apparently also requires having a
> > checksummed journal.  Ext4 offers this (somewhat controversial yet fast)
> > capability with the 'journal_async_commit' mount option. [NOTE: I'm
> > largely parroting what I heard from Ric]
> > 
> > [NOTE: I couldn't immediately tell if dm_multisnap_commit() is doing
> > multiple barriers when writing out the transaction and commit block]
> 
> It calls dm_bufio_write_dirty_buffers twice and 
> dm_bufio_write_dirty_buffers submits a zero barrier. (there's no point in 
> submitting data-barrier, because that gets split into two zero barriers 
> and non-barrier write anyway)

OK.

> > Taking a step back, any reason you elected to not reuse existing kernel
> > infrastructure (e.g. jbd2) for journaling?  Custom solution needed for
> > the log-nature of the multisnapshot?  [Excuse my naive question(s), I
> > see nilfs2 also has its own journaling... I'm just playing devil's
> > advocate given how important it is that the multisnapshot journal code
> > be correct]
> 
> All the filesystems have their own journaling. jbd is used only by ext3, 
> jbd2 only by ext4. Reiserfs has its own, JFS has its own, XFS has its 
> own... etc.
> 
> I consider the idead of sharing journaling code as inefficient: arguing 
> about the interface would take more time than writing it from scratch.

Well, ocfs2 uses jbd2 too but I understand your point.

> > Above provided, for the benefit of others, to give more context on the
> > role of remap entries (and the commit block's remap array).
> 
> If there were no remaps, change in any B-tree node would require to 
> overwrite all the nodes from the root. Similarly, changing any bitmap 
> would require to overwrite the bitmap directory from the root.
> 
> With remaps, changes to B-tree nodes or bitmaps write just that one block 
> (and commit block, to store the remap). The full write from the root is 
> done later, when the remap table fills up.

Again, I was asking about adding more remap entries in the remaps array
if the commit block was increased from 512b to 4K.

Mike