[dm-devel] [PATCH 03/14] dm-multisnap-mikulas-headers
Mike Snitzer
snitzer at redhat.com
Tue Mar 9 03:30:45 UTC 2010
On Mon, Mar 08 2010 at 10:08pm -0500,
Mikulas Patocka <mpatocka at redhat.com> wrote:
> > > + * The minimum block size is 512, the maximum block size is not specified (it is
> > > + * limited by 32-bit integer size and available memory). All on-disk pointers
> > > + * are in the units of blocks. The pointers are 48-bit, making this format
> > > + * capable of handling 2^48 blocks.
> >
> > Shouldn't we require the chunk size be at least as big as
> > (and a multiple of) physical_block_size? E.g. 4096 on a 4K sector
> > device.
> >
> > This question applies to non-shared snapshots too.
>
> If the device has a larger physical block size, it will reject smaller
> chunks. The same for non-shared snapshots.
Correct, but we don't prevent the user from trying to use less than
physical_block_size. So I think we agree we should. Hasn't been a
concern but native 4K devices change that.
> > > + * Note: because the disks only support atomic writes of 512 bytes, the commit
> > > + * block has only 512 bytes of valid data. The remaining data in the commit
> > > + * block up to the chunk size is unused.
> >
> > Are there other places where you assume 512b is beneficial? My concern
> > is: what will happen on 4K devices?
>
> With 4K chunk size, it writes 4K, but assumes that only 512b write is
> atomic. So, if the disk supports atomic write of 4K, it doesn't hurt.
Right, so long as we impose 4K on a native 4K device, etc. But I was
more wondering if there were other places that assume 512b granularity.
Didn't see any but figured I'd ask.
> > Would making the commit block's size match the physical_block_size give
> > us any multisnapshot benefit? At a minimum I see a larger commit block
> > would allow us to have more remap entries (larger remap
> > array)..
> >
> > "Remaps" detailed below. But what does that buy us?
>
> They reduce the number of blocks written. Without remaps, you'd have to
> write the path from the root every time.
Sure, I wasn't saying we'd eliminate/reduce rempas. I was saying we
could increase the number of remaps (think the current limit is 27 per
512b commit block). Using a 4K commit block would give us 100+?
So I was asking if more remaps help at all.
> > However, and before I get ahead of myself, with blk_stack_limits() we
> > could have a (DM) device that is composed of 4K and 512b devices; with a
> > resulting physical_block_size of 4K. But 4K wouldn't be atomic to the
> > 512b disk.
>
> Yes, that's why I must assume only 512b atomic write.
OK, fine by me so long as we impose chunk_size >= physical_block_size.
> > But what if we were to add a checksum to the commit block? This could
> > give us the ability to have a larger commit block regardless of the
> > physical_block_size. [NOTE: I saw dm_multisnap_commit() is just writing
> > a fixed CB_SIGNATURE]
>
> That would have to be cryptographic hash --- simple checksum can be
> fooled.
>
> Even that wouldn't be correct, because if the hash fails, the commit block
> is lost. If you wanted to use full commit blocks, you'd have to:
>
> - divide the commit block to two.
> - write these two alternatively (so that at least one is valid)
> - hash them or (which is simpler) copy sequence number to each 512b sector
> (so that if some sectors get written and some not, you find it out by
> having different sequence number).
>
> That is possible to do.
Yes, Ric mentioned a comparable strategy was used for database
transactions.
> > And in speaking with Ric Wheeler, using a checksum in the commit block
> > opens up the possibility for optimizing (reducing) the barrier ops
> > associated with:
> > 1) before the commit block is written (flushes journal transaction)
> > 2) and after the commit block is written.
>
> No, you have to use barriers. If the data before the commit blocks is not
> written and the commit block is written (with matching checksum), then the
> data is corrupted.
Fair enough, though it is used by ext4. But with ext4 it is used in
conjunction with a checksummed journal.
> Obviously, you can checksum all the data, but SHA1 is slow and it is being
> phased out already and even slower SHA256 is being recommended...
>
> > Leaving us with only needing to barrier after the commit block is
> > written. But this optimization apparently also requires having a
> > checksummed journal. Ext4 offers this (somewhat controversial yet fast)
> > capability with the 'journal_async_commit' mount option. [NOTE: I'm
> > largely parroting what I heard from Ric]
> >
> > [NOTE: I couldn't immediately tell if dm_multisnap_commit() is doing
> > multiple barriers when writing out the transaction and commit block]
>
> It calls dm_bufio_write_dirty_buffers twice and
> dm_bufio_write_dirty_buffers submits a zero barrier. (there's no point in
> submitting data-barrier, because that gets split into two zero barriers
> and non-barrier write anyway)
OK.
> > Taking a step back, any reason you elected to not reuse existing kernel
> > infrastructure (e.g. jbd2) for journaling? Custom solution needed for
> > the log-nature of the multisnapshot? [Excuse my naive question(s), I
> > see nilfs2 also has its own journaling... I'm just playing devil's
> > advocate given how important it is that the multisnapshot journal code
> > be correct]
>
> All the filesystems have their own journaling. jbd is used only by ext3,
> jbd2 only by ext4. Reiserfs has its own, JFS has its own, XFS has its
> own... etc.
>
> I consider the idead of sharing journaling code as inefficient: arguing
> about the interface would take more time than writing it from scratch.
Well, ocfs2 uses jbd2 too but I understand your point.
> > Above provided, for the benefit of others, to give more context on the
> > role of remap entries (and the commit block's remap array).
>
> If there were no remaps, change in any B-tree node would require to
> overwrite all the nodes from the root. Similarly, changing any bitmap
> would require to overwrite the bitmap directory from the root.
>
> With remaps, changes to B-tree nodes or bitmaps write just that one block
> (and commit block, to store the remap). The full write from the root is
> done later, when the remap table fills up.
Again, I was asking about adding more remap entries in the remaps array
if the commit block was increased from 512b to 4K.
Mike
More information about the dm-devel
mailing list