Re: [dm-devel] A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk

On 21.1.2013 19:49, Mike Snitzer wrote:
On Fri, Jan 18 2013 at  5:19am -0500,
Daniel Browning <db kavod com> wrote:

Why do I get the following error, and what should I do about it? When I
create a raid0 md with a non-power-of-two chunk size (e.g. 1152K instead of
512K), then create a thinly-provisioned volume that is over 256 GiB, I get
the following dmesg error when I try to create a file system on it:

     "make_request bug: can't convert block across chunks or bigger than 1152k 4384 127"

This bubbles up to mkfs.xfs as

     "libxfs_device_zero write failed: Input/output error"

What I find interesting is that it seems to require all three conditions
(chunk size, thin-p, and >256 GiB) in order to fail. Without those, it seems
to work fine:

     * Power-of-two chunk (e.g. 512K), thin-p vol, >256 GiB? Works.
     * Non-power-of-two chunk (e.g. 1152K), thin-p vol, <256 GiB? Works.
     * Non-power-of-two chunk (e.g. 1152K), regular vol, >256 GiB? Works.
     * Non-power-of-two chunk (e.g. 1152K), thin-p vol, >256 GiB? FAIL.
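
For reference, the numbers in the dmesg line can be checked by hand. The sketch below assumes the three trailing fields of the md/raid0 message are the chunk size in KiB, the starting sector of the request, and the request size in KiB (that reading is an assumption here, not something stated in this thread). The helper names are made up for illustration.

```shell
# Assumed field order in the md/raid0 message: chunk KiB, start sector, size KiB.
chunk_sectors=$((1152 * 2))   # 1152 KiB chunk = 2304 sectors of 512 bytes
start=4384                    # starting sector from the message
len_sectors=$((127 * 2))      # 127 KiB = 254 sectors

# Offset of a sector ($1) within its raid0 chunk of $2 sectors.
offset_in_chunk() { echo $(( $1 % $2 )); }

# Succeeds when a request of $2 sectors starting at sector $1
# runs past the end of its chunk of $3 sectors.
crosses_chunk() { [ $(( ($1 % $3) + $2 )) -gt "$3" ]; }

off=$(offset_in_chunk "$start" "$chunk_sectors")
echo "offset in chunk: $off sectors, room left: $((chunk_sectors - off)) sectors"
if crosses_chunk "$start" "$len_sectors" "$chunk_sectors"; then
    echo "request of $len_sectors sectors crosses the chunk boundary"
fi
```

Under that reading, the request starts 2080 sectors into a 2304-sector chunk and is longer than the 224 sectors that remain, which is the condition the message complains about.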

Attached is a self-contained test case to reproduce the error, version
numbers, and an strace. Thank you in advance,
Daniel Browning
Kavod Technologies

Appendix A. Self-contained reproduction script
dd if=/dev/zero of=loop0.img bs=1G count=150; losetup /dev/loop0 loop0.img
dd if=/dev/zero of=loop1.img bs=1G count=150; losetup /dev/loop1 loop1.img
mdadm --create /dev/md99 --verbose --level=0 --raid-devices=2 \
       --chunk=1152K /dev/loop0 /dev/loop1
pvcreate /dev/md99
vgcreate test_vg /dev/md99
lvcreate --size 257G --type thin-pool --thinpool test_thin_pool test_vg
lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
mkfs.xfs /dev/test_vg/test_lv

# That is where the error occurs. Next is cleanup.
lvremove -f /dev/test_vg/test_lv
lvremove -f /dev/mapper/test_vg-test_thin_pool
vgremove -f test_vg
pvremove /dev/md99
mdadm --stop /dev/md99
mdadm --zero-superblock /dev/loop0 /dev/loop1
losetup -d /dev/loop0 /dev/loop1
rm loop*.img

Limits of the raid0 device (/dev/md99):
cat /sys/block/md99/queue/minimum_io_size
cat /sys/block/md99/queue/optimal_io_size

Limits of the thin-pool device (/dev/test_vg/test_thin_pool):
cat /sys/block/dm-9/queue/minimum_io_size
cat /sys/block/dm-9/queue/optimal_io_size

Limits of the thin device (/dev/test_vg/test_lv):
cat /sys/block/dm-10/queue/minimum_io_size
cat /sys/block/dm-10/queue/optimal_io_size
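
The same limits can be collected in one pass. This is only a convenience sketch: the function name show_limits is made up here, and dm-9 and dm-10 are the device numbers from this particular test box, so they will differ elsewhere.

```shell
# Print the two I/O hints for one block device. The sysfs root is a
# parameter (default /sys/block) so the function can be pointed elsewhere.
show_limits() {
    dev=$1
    root=${2:-/sys/block}
    for attr in minimum_io_size optimal_io_size; do
        f="$root/$dev/queue/$attr"
        if [ -r "$f" ]; then
            echo "$dev $attr $(cat "$f")"
        else
            echo "$dev $attr unavailable"
        fi
    done
}

# raid0 device, thin-pool, thin device (names as in the test case above).
for dev in md99 dm-9 dm-10; do
    show_limits "$dev"
done
```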

I notice that lvcreate is not using a thin-pool chunksize that matches
the raid0's chunksize (just uses the lvm2 default of 256K).

Switching the thin-pool lvcreate to use --chunksize 1152K at least
enables me to format the filesystem.
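
A sketch of that change against the reproduce script, with the thin-pool chunksize taken from the raid0 device instead of being typed in. It assumes md exposes the chunk size in bytes at /sys/block/md99/md/chunk_size and falls back to the 1152K from the test case otherwise. The helper name bytes_to_k is made up, and the command is only printed, not run.

```shell
# Convert a chunk size in bytes (as md is assumed to report it in sysfs)
# to the "<n>K" form used on the lvcreate command line.
bytes_to_k() { echo "$(( $1 / 1024 ))K"; }

md_chunk_file=/sys/block/md99/md/chunk_size   # assumed location, value in bytes
if [ -r "$md_chunk_file" ]; then
    chunk=$(bytes_to_k "$(cat "$md_chunk_file")")
else
    chunk=1152K   # value from the test case
fi

# Printed rather than executed, so the command can be reviewed first.
echo lvcreate --size 257G --type thin-pool --thinpool test_thin_pool \
     --chunksize "$chunk" test_vg
```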

And both the thin-pool and thin device have an optimal_io_size that
matches the chunk_size of the underlying raid volume:

cat /sys/block/dm-9/queue/optimal_io_size
cat /sys/block/dm-10/queue/optimal_io_size

I'm still investigating the limits issue when --chunksize 1152K isn't
used for the thin-pool lvcreate.

Just a comment on the selection of the thin chunksize here:

I think there are a couple of aspects here. By default (unless changed via
lvm.conf {allocation/thin_pool_chunk_size}) lvm2 targets a 64K chunksize
and scales it up so that the thin metadata fits within 128MB.
So here lvm2 scaled the chunksize from 64K to 256K (multi-TB case).
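
The scaling can be sketched with a little arithmetic. The model below is an assumption, not taken from the lvm2 source: roughly 64 bytes of thin metadata per pool chunk, a 128 MiB target, and the chunksize doubled from 64K until the estimate fits. The function name is made up for illustration.

```shell
# Assumed model: ~64 bytes of metadata per pool chunk, 128 MiB target,
# chunk size doubled from 64K until the estimate fits (capped at 1G).
default_thin_chunk_k() {
    pool_k=$(( $1 * 1024 * 1024 ))    # pool size in KiB, from GiB
    chunk_k=64
    limit=$(( 128 * 1024 * 1024 ))    # 128 MiB in bytes
    while [ $(( pool_k / chunk_k * 64 )) -gt "$limit" ] && [ "$chunk_k" -lt 1048576 ]; do
        chunk_k=$(( chunk_k * 2 ))
    done
    echo "$chunk_k"
}

for g in 256 257; do
    c=$(default_thin_chunk_k "$g")
    echo "${g}G pool: ${c}K thin chunk; 1152K raid0 chunk leaves remainder $(( 1152 % c ))K"
done
```

Under this assumed model a 257G pool ends up at 256K, which does not divide the 1152K raid0 chunk evenly, while a 256G pool would stay at 128K, which does. That would be consistent with the 256 GiB threshold in the report, but it is only an estimate and nothing in this thread confirms it.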

lvcreate currently doesn't look at the geometry of the underlying PV(s)
during its allocation (somewhat of a chicken-and-egg problem). There are
possible ways to put this into the equation, though the user might not
actually want it, since for snapshots a smaller chunksize is more usable
(>1MB is quite a lot here IMHO). But it is probably worth some thinking.


