[vdo-devel] Rocky Linux 8.7 & LVM-VDO stability?

Ken Raeburn raeburn at redhat.com
Wed Dec 7 04:06:21 UTC 2022


Do you have the rest of the kernel log from that boot session? I'd be 
curious to see what preceded the lockup.

There is a known bug which can result in a lockup of the device, but it 
occurs after the device has been suspended and resumed. That's different 
from shutting it down completely and starting it up again, which is what 
the conversion process does. We've got a fix for it in the RHEL (and 
CentOS) 9 code streams, but for the RHEL 8 version the recommended 
workaround is to fully stop and then restart the device as soon as 
possible after a suspend/resume sequence.
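
For the RHEL 8 case, a minimal sketch of that stop/restart workaround 
could look like the following (the mount point and the vdovg/vdolvol 
names are only assumptions based on what the conversion script 
typically picks; check lvs and mount for the real ones):

    umount /mnt/vdo-data             # hypothetical mount point
    lvchange -an vdovg/vdolvol       # fully stop the VDO LV (and its pool)
    lvchange -ay vdovg/vdolvol       # start it again from scratch
    mount /mnt/vdo-data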

The suspend and resume don't have to be explicit on the part of the 
user; they can happen implicitly as part of adding more physical 
storage or changing some of the configuration parameters, because a 
suspend/resume cycle is done whenever a new configuration is loaded 
into the kernel. So if you made a configuration change after the 
upgrade, that could have tripped the bug.
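
As an illustration (reusing the assumed vdovg/vdolvol names), something 
as routine as growing the volume reloads the device-mapper table, and 
that reload includes a suspend/resume of the VDO device:

    # Any lvextend/lvresize of the VDO LV or its pool reloads the dm
    # table, which suspends and resumes the device underneath.
    lvextend -L +100G vdovg/vdolvol

so after a change like that it would be worth doing the full stop and 
restart described above.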

If that wasn't it, maybe there's some other clue in the kernel log...

If it should come up again, there are a few things to look at (a rough 
command sketch follows this list):

- First, is it really frozen or just slow? The sar or iostat programs 
can show you if I/O is happening.

- Are any of the VDO threads using any CPU time?

- Try running "dmsetup message <vdo-name> 0 dump all" where vdo-name is 
the device name in /dev/mapper, perhaps something like 
vdovg-vdolvol_vpool-vpool if you let the conversion script pick the 
names. Sending this message to VDO will cause it to write a bunch of 
info to the kernel log, which might give us some more insight into the 
problem.
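
Putting those three checks together, a rough sequence might look like 
this (again assuming the vdovg-vdolvol_vpool-vpool name from the 
conversion; "dmsetup ls" will show the real one):

    # 1. Is I/O still completing on the VDO device, or fully stalled?
    iostat -xd 5 /dev/mapper/vdovg-vdolvol_vpool-vpool

    # 2. Are the kvdo worker threads accumulating any CPU time?
    top -b -H -n 1 | grep -i kvdo

    # 3. Dump VDO's internal state to the kernel log and capture it.
    dmsetup message vdovg-vdolvol_vpool-vpool 0 dump all
    dmesg | tail -n 300 > vdo-dump.txt

The dmesg output (or the same window from /var/log/messages) is the 
part that would be most useful to share here.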

Ken

On 12/5/22 19:39, hostalp at post.cz wrote:
> Hello,
> until recently I was running a Rocky Linux 8.5 VM (on the Proxmox 7 
> virtualization platform) with the following config:
>
> kernel-4.18.0-348.23.1.el8_5.x86_64
> lvm2-2.03.12-11.el8_5.x86_64
> vdo-6.2.5.74-14.el8.x86_64
> kmod-kvdo-6.2.5.72-81.el8.x86_64
>
> XFS > VDO > LVM > virtual disk (VirtIO SCSI)
>
> VDO volume was created using the default config, brief summary:
> - logical size 1.2x physical size (based on our past tests on the 
> stored data)
> - compression & deduplication on
> - dense index
> - write mode async
>
> It was mounted using the following options: defaults,noatime,logbsize=128k
> Discards were performed periodically via fstrim.timer.
>
> This was stable for the entire uptime (in fact, ever since the whole 
> system was created).
>
> A few days ago I finally updated it to RL 8.7 and converted the 
> existing "VDO on LVM" setup to the new LVM-VDO solution using the 
> lvm_import_vdo script. The whole process went fine (I had already 
> tested it before) and I ended up with the system running in the 
> desired config.
>
> kernel-4.18.0-425.3.1.el8.x86_64
> lvm2-2.03.14-6.el8.x86_64
> vdo-6.2.7.17-14.el8.x86_64
> kmod-kvdo-6.2.7.17-87.el8.x86_64
>
> The current disk space utilization is around 61% (pretty much the same 
> for physical as well as logical space) and it has never been close to 80%.
>
> However, it "lasted" less than a day. During the following night, all 
> operations on the VDO volume hung (the other, non-VDO volumes were 
> still usable) and I had to perform a hard restart to get it back to 
> work.
>
> The only errors/complaints I found were the blocked-task notifications 
> on the console and in the /var/log/messages log, with the following 
> detail (only the first occurrence is shown).
>
> Dec  4 01:53:01 lts1 kernel: INFO: task xfsaild/dm-4:5148 blocked for 
> more than 120 seconds.
> Dec  4 01:53:01 lts1 kernel:      Tainted: G           OE --------- -  
> - 4.18.0-425.3.1.el8.x86_64 #1
> Dec  4 01:53:01 lts1 kernel: "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  4 01:53:01 lts1 kernel: task:xfsaild/dm-4    state:D stack:    0 
> pid: 5148 ppid:     2 flags:0x80004080
> Dec  4 01:53:01 lts1 kernel: Call Trace:
> Dec  4 01:53:01 lts1 kernel: __schedule+0x2d1/0x860
> Dec  4 01:53:01 lts1 kernel: ? finish_wait+0x80/0x80
> Dec  4 01:53:01 lts1 kernel: schedule+0x35/0xa0
> Dec  4 01:53:01 lts1 kernel: io_schedule+0x12/0x40
> Dec  4 01:53:01 lts1 kernel: limiterWaitForOneFree+0xc0/0xf0 [kvdo]
> Dec  4 01:53:01 lts1 kernel: ? finish_wait+0x80/0x80
> Dec  4 01:53:01 lts1 kernel: kvdoMapBio+0xcc/0x2a0 [kvdo]
> Dec  4 01:53:01 lts1 kernel: __map_bio+0x47/0x1b0 [dm_mod]
> Dec  4 01:53:01 lts1 kernel: dm_make_request+0x1a9/0x4d0 [dm_mod]
> Dec  4 01:53:01 lts1 kernel: generic_make_request_no_check+0x202/0x330
> Dec  4 01:53:01 lts1 kernel: submit_bio+0x3c/0x160
> Dec  4 01:53:01 lts1 kernel: ? bio_add_page+0x46/0x60
> Dec  4 01:53:01 lts1 kernel: _xfs_buf_ioapply+0x2af/0x430 [xfs]
> Dec  4 01:53:01 lts1 kernel: ? xfs_iextents_copy+0xba/0x170 [xfs]
> Dec  4 01:53:01 lts1 kernel: ? 
> xfs_buf_delwri_submit_buffers+0x10c/0x2a0 [xfs]
> Dec  4 01:53:01 lts1 kernel: __xfs_buf_submit+0x63/0x1d0 [xfs]
> Dec  4 01:53:01 lts1 kernel: xfs_buf_delwri_submit_buffers+0x10c/0x2a0 
> [xfs]
> Dec  4 01:53:01 lts1 kernel: ? xfsaild+0x26f/0x8c0 [xfs]
> Dec  4 01:53:01 lts1 kernel: xfsaild+0x26f/0x8c0 [xfs]
> Dec  4 01:53:01 lts1 kernel: ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> Dec  4 01:53:01 lts1 kernel: kthread+0x10b/0x130
> Dec  4 01:53:01 lts1 kernel: ? set_kthread_struct+0x50/0x50
> Dec  4 01:53:01 lts1 kernel: ret_from_fork+0x1f/0x40
>
> I'm now awaiting another occurrence of this and wondering where the 
> issue may be coming from.
> Could it be the new LVM-VDO solution, or the kernel itself?
> Can you perhaps suggest how to collect more information in such a 
> case, or provide any other tips?
>
> Best regards,
> Petr
>
> _______________________________________________
> vdo-devel mailing list
> vdo-devel at redhat.com
> https://listman.redhat.com/mailman/listinfo/vdo-devel


