[vdo-devel] Rocky Linux 8.7 & LVM-VDO stability?

Ken Raeburn raeburn at redhat.com
Thu Dec 8 08:28:30 UTC 2022


On 12/7/22 08:39, hostalp at post.cz wrote:

> Hello.
> log from the previous boot until the freeze attached.
> Speaking of suspend/resume I can see some in there.
>
> First there was a suspend of the original VDO-over-LVM device before 
> the conversion to LVM-VDO via the lvm_import_vdo script - Dec  3 06:19:43
> After the conversion the new device started (resumed) as a different one.
> Not sure if the issue you mention could also affect such cases.

The conversion script should be completely shutting the device down and 
starting it up from scratch. The resume and suspend operations are used 
as part of startup and shutdown respectively, but it's specifically the 
suspend-then-resume sequence after having done any sort of write 
operations (or anything that would create journal entries) that can 
trigger the problem. The conversion script shouldn't do that.

The bug is written up at 
https://bugzilla.redhat.com/show_bug.cgi?id=2109047 and some other 
linked tickets.
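
For illustration, the trigger at the device-mapper level is simply a suspend
followed by a resume after the device has seen writes; lvm issues the
equivalent of this internally when it reloads a table for operations like a
rename or an extend. The device name below is a made-up example, and this
isn't something to run by hand on a volume holding data you care about:

    # rough illustration of the triggering sequence (lvm does the equivalent
    # internally when it reloads a new table for a rename, extend, etc.)
    dmsetup suspend vdovg-vpool0-vpool   # quiesce and flush the vdo target
    dmsetup resume  vdovg-vpool0-vpool   # resume; with the RHEL 8 kvdo, I/O
                                         # can hang some time after this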

> Then there was another suspend/resume cycle (actually 2) related to 
> the renaming of the VDO pool LV. First I set it to a different name 
> (mostly to match the naming convention of newly created devices, as the 
> conversion script uses a slightly different convention than lvcreate), 
> then (after looking at it and thinking about it more thoroughly) I 
> renamed it back (basically I chose my own name that was coincidentally 
> identical to the original one generated by the conversion script)  - 
> Dec  3 06:40:44 - 06:43:37
Yes, this looks like it could well have done it.
>
> The freeze occurred the next day early in the morning - first symptoms 
> visible in the attached log: Dec  4 01:53:01 - e.g. some 19 hours later.
>
> If the freeze was due to those earlier suspend/resume cycles (either 
> related to the conversion to LVM-VDO, or to the later VDO pool LV 
> renaming) - how to properly handle such situations then (without the 
> restart)? Of course I didn't explicitly perform any suspend/resume 
> myself there.
Using "lvchange -an" on the logical volume stored in VDO (post 
conversion) should shut VDO down and clear out the incorrect data 
structures. (You can confirm that with "dmsetup table" -- the "vdo" 
entry should disappear.) Then "lvchange -ay" to make the logical volume 
available will start VDO again with the internal data structures in a 
clean state.
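
In concrete terms (the volume group and LV names here are only placeholders;
substitute your own):

    # deactivate the logical volume stored in VDO; this shuts VDO down completely
    lvchange -an vdovg/vdolvol
    # confirm the vdo target is gone from the device-mapper tables
    dmsetup table | grep ' vdo '     # should print nothing while it's deactivated
    # reactivate; VDO starts again with its internal structures in a clean state
    lvchange -ay vdovg/vdolvol
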
>
> As for the symptoms, the VDO disk was completely frozen, not just slow.
> If it occurs again I'll collect some more information as you suggest.
>
> Best regards,
> Petr


Thanks. Hopefully, if you can avoid operations that involve suspends (we 
documented cases like growing the storage, but renaming didn't occur to 
me), or can stop and restart the device soon afterwards, you shouldn't 
see it again...
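
So if an operation that implies a suspend can't be avoided, the safe pattern
on RHEL 8 is to follow it with a full stop/start of the device as soon as
practical. A minimal sketch, with placeholder names:

    # the rename (or an lvextend, etc.) does a suspend/resume internally
    lvrename vdovg/vpool0 vdovg/vpool1
    # then cycle the device so it comes back up with clean in-kernel state
    lvchange -an vdovg/vdolvol
    lvchange -ay vdovg/vdolvol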

Ken

>
> ---------- Original message ----------
> From: Ken Raeburn <raeburn at redhat.com>
> To: hostalp at post.cz, vdo-devel at redhat.com
> Sent: 7. 12. 2022 5:06:30
> Subject: Re: [vdo-devel] Rocky Linux 8.7 & LVM-VDO stability?
>
>
>     Do you have the rest of the kernel log from that boot session? I'd be
>     curious to see what preceded the lockup.
>
>     There is a known bug which can result in a lockup of the device, but it
>     occurs after the device has been suspended and resumed. That's different
>     from shutting it down completely and starting it up again, which is what
>     the conversion process does. We've got a fix for it in the RHEL (and
>     CentOS) 9 code streams, but for the RHEL 8 version the recommended
>     workaround is to fully stop and then restart the device as soon as
>     possible after a suspend/resume sequence.
>
>     The suspend and resume don't have to be explicit on the part of the
>     user; they can happen implicitly as part of adding more physical storage
>     or changing some of the configuration parameters, as a suspend/resume is
>     done as part of loading a new configuration into the kernel. So if you
>     made a configuration change after the upgrade, that could have tripped
>     the bug.
>
>     If that wasn't it, maybe there's some other clue in the kernel log...
>
>     If it should come up again, there are a few things to look at:
>
>     - First, is it really frozen or just slow? The sar or iostat programs
>     can show you if I/O is happening.
>
>     - Are any of the VDO threads using any CPU time?
>
>     - Try running "dmsetup message <vdo-name> 0 dump all" where vdo-name is
>     the device name in /dev/mapper, perhaps something like
>     vdovg-vdolvol_vpool-vpool if you let the conversion script pick the
>     names. Sending this message to VDO will cause it to write a bunch of
>     info to the kernel log, which might give us some more insight into the
>     problem.
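
Put together, those checks amount to roughly the following; the device name is
just the example from the message above, so substitute whatever "dmsetup ls"
shows on your system:

    # is I/O still completing on the device, or fully stalled?
    iostat -x 5 3
    # are the VDO worker threads using any CPU? (typically named kvdo...)
    top -b -n1 -H | grep kvdo
    # dump VDO's internal state into the kernel log, then read it back
    dmsetup message vdovg-vdolvol_vpool-vpool 0 dump all
    dmesg | tail -n 200
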
>
>     Ken
>
>     On 12/5/22 19:39, hostalp at post.cz wrote:
>     > Hello,
>     > until recently I was running a Rocky Linux 8.5 VM (at Proxmox 7
>     > virtualization solution) with the following config:
>     >
>     > kernel-4.18.0-348.23.1.el8_5.x86_64
>     > lvm2-2.03.12-11.el8_5.x86_64
>     > vdo-6.2.5.74-14.el8.x86_64
>     > kmod-kvdo-6.2.5.72-81.el8.x86_64
>     >
>     > XFS > VDO > LVM > virtual disk (VirtIO SCSI)
>     >
>     > VDO volume was created using the default config, brief summary:
>     > - logical size 1.2x physical size (based on our past tests on the
>     > stored data)
>     > - compression & deduplication on
>     > - dense index
>     > - write mode async
>     >
>     > It was mounted using the following options: defaults,noatime,logbsize=128k
>     > With discards performed periodically via the fstrim.timer.
>     >
>     > This was stable during all the uptime (including the time since the
>     > whole system creation).
>     >
>     > A few days ago I finally updated it to RL 8.7 as well as converted the
>     > "VDO on LVM" to the new LVM-VDO solution using the lvm_import_vdo
>     > script. The whole process went fine (I already tested it before) and I
>     > ended up with the system running in the desired config.
>     >
>     > kernel-4.18.0-425.3.1.el8.x86_64
>     > lvm2-2.03.14-6.el8.x86_64
>     > vdo-6.2.7.17-14.el8.x86_64
>     > kmod-kvdo-6.2.7.17-87.el8.x86_64
>     >
>     > The current disk space utilization is around 61% (pretty much the same
>     > for physical as well as for logical space) and it was never close to 80%.
>     >
>     > However it "lasted" for less than a day. During the following night
>     > all operations on the VDO volume hung (the other non-VDO volumes were
>     > still usable) and I had to perform a hard restart in order to get it
>     > back to work.
>     >
>     > The only errors/complaints that I found were the blocked task
>     > notifications in the console as well as in the /var/log/messages log
>     > with the following detail (only the 1st occurrence shown).
>     >
>     > Dec  4 01:53:01 lts1 kernel: INFO: task xfsaild/dm-4:5148 blocked for more than 120 seconds.
>     > Dec  4 01:53:01 lts1 kernel:      Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8.x86_64 #1
>     > Dec  4 01:53:01 lts1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     > Dec  4 01:53:01 lts1 kernel: task:xfsaild/dm-4    state:D stack:    0 pid: 5148 ppid:     2 flags:0x80004080
>     > Dec  4 01:53:01 lts1 kernel: Call Trace:
>     > Dec  4 01:53:01 lts1 kernel: __schedule+0x2d1/0x860
>     > Dec  4 01:53:01 lts1 kernel: ? finish_wait+0x80/0x80
>     > Dec  4 01:53:01 lts1 kernel: schedule+0x35/0xa0
>     > Dec  4 01:53:01 lts1 kernel: io_schedule+0x12/0x40
>     > Dec  4 01:53:01 lts1 kernel: limiterWaitForOneFree+0xc0/0xf0 [kvdo]
>     > Dec  4 01:53:01 lts1 kernel: ? finish_wait+0x80/0x80
>     > Dec  4 01:53:01 lts1 kernel: kvdoMapBio+0xcc/0x2a0 [kvdo]
>     > Dec  4 01:53:01 lts1 kernel: __map_bio+0x47/0x1b0 [dm_mod]
>     > Dec  4 01:53:01 lts1 kernel: dm_make_request+0x1a9/0x4d0 [dm_mod]
>     > Dec  4 01:53:01 lts1 kernel: generic_make_request_no_check+0x202/0x330
>     > Dec  4 01:53:01 lts1 kernel: submit_bio+0x3c/0x160
>     > Dec  4 01:53:01 lts1 kernel: ? bio_add_page+0x46/0x60
>     > Dec  4 01:53:01 lts1 kernel: _xfs_buf_ioapply+0x2af/0x430 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: ? xfs_iextents_copy+0xba/0x170 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: ? xfs_buf_delwri_submit_buffers+0x10c/0x2a0 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: __xfs_buf_submit+0x63/0x1d0 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: xfs_buf_delwri_submit_buffers+0x10c/0x2a0 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: ? xfsaild+0x26f/0x8c0 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: xfsaild+0x26f/0x8c0 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
>     > Dec  4 01:53:01 lts1 kernel: kthread+0x10b/0x130
>     > Dec  4 01:53:01 lts1 kernel: ? set_kthread_struct+0x50/0x50
>     > Dec  4 01:53:01 lts1 kernel: ret_from_fork+0x1f/0x40
>     >
>     > I'm now awaiting another occurrence of this and wondering where the
>     > issue may be coming from.
>     > Could it be the new LVM-VDO solution, or the kernel itself?
>     > Can you perhaps suggest how to collect more information in such a case,
>     > or provide other tips?
>     >
>     > Best regards,
>     > Petr
>     >
>     > _______________________________________________
>     > vdo-devel mailing list
>     > vdo-devel at redhat.com
>     > https://listman.redhat.com/mailman/listinfo/vdo-devel
>