First, the issue: lvm hangs on the sync command after a snapshot overflows.
How to reproduce the problem
You can do that with the script test.sh, which is in the attachments. It may look rather big, but that's mostly due to debug messages; in fact it's quite simple. First it creates a physical volume on a chosen physical disk, then a volume group and two logical volumes. One of them is the origin LV that we write data to; the other is reserved for a snapshot. It then mounts the origin volume, converts the second LV to a snapshot and writes enough data to the origin LV to make the snapshot overflow. Then sync is executed. Afterwards another logical volume is created and converted to a snapshot, and sync is run again. This second sync hangs.
We advise running the tests in a virtual environment because, among other reasons, you won't be able to reboot normally after the hang. When you run the script again after a reboot, it cleans up the leftovers from the previous run; the required commands are at the very beginning of the script.
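For readers who don't want to open the attachment, the sequence described above can be sketched roughly as follows. This is not the attached test.sh, only an outline of the same steps. The disk path, the VG and LV names, the sizes, the mount point and the choice of ext3 are all placeholders picked for illustration. By default it only prints the commands; set DRY_RUN=0 (as root, in a throwaway VM) to actually run them.

```shell
#!/bin/sh
# Outline of the reproduction steps; NOT the attached test.sh.
# DISK, VG, MNT, the LV names and all sizes are assumed values.
DISK=${DISK:-/dev/sdX}
VG=${VG:-VG}
MNT=${MNT:-/mnt/lvtest}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then "$@"; fi
}

reproduce() {
    run pvcreate "$DISK"
    run vgcreate "$VG" "$DISK"
    run lvcreate -L 512M -n lv_orig "$VG"    # origin LV
    run lvcreate -L 32M -n sn_1 "$VG"        # LV reserved for the snapshot
    run mkfs.ext3 "/dev/$VG/lv_orig"
    run mkdir -p "$MNT"
    run mount "/dev/$VG/lv_orig" "$MNT"
    run lvconvert -s "$VG/lv_orig" "$VG/sn_1"
    # Write more than the snapshot can hold so that it overflows.
    run dd if=/dev/zero of="$MNT/fill" bs=1M count=64
    run sync
    run lvcreate -L 32M -n sn_2 "$VG"
    run lvconvert -s "$VG/lv_orig" "$VG/sn_2"
    run sync    # the hang is reported at this point
}

reproduce
```

With the defaults this just lists the thirteen commands in order, which makes it easy to compare against the verbose output log.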
And this is what we have so far
We started off here: http://www.redhat.com/archives/dm-devel/2011-May/msg00059.html, but after a bunch of tests we came to the conclusion that neither the kernel version, nor its configuration, nor the file system has any impact on the hang. By now we know that the issue occurs on all versions of lvm later than 2.02.56 (2.02.57 already fails). An interesting fact: when we built the most verbose kernel possible (in terms of the amount of kernel logging) and the system became really slow, version 2.02.57, which had previously hung, passed! So the failure looks timing-dependent, and based on this we think there might be an overrun that leads to a deadlock.
For now there are two basic errors:
lvconvert device-mapper: suspend ioctl failed: Input/output error
lvconvert Unable to suspend VG-sn_x (252:3)
lvconvert Failed to suspend origin lv
------- and --------
LV VG/sn_x in use: not deactivating
Couldn't deactivate LV sn_x
The first one always precedes the hang. The second one doesn't appear every time, but when it does, it comes before the first error and can appear multiple times. In both cases _lock_vol returns 0.
As for the second error: the function lvconvert_snapshot fails, reporting "Couldn't deactivate LV sn_x", because info.open_count is not equal to zero. That is indicated by the "LV VG/sn_x in use: not deactivating" error. The value of info.open_count is clearly set to 1 by the lv_info function, but it seems never to be cleared. It is copied from a field stored in the dm_ioctl struct, which is a member of the dm_task struct, but I couldn't find where that field is assigned.
Things get much more complicated because we cannot use a debugger, so a related question is: how do you properly build lvm with debugging symbols? Right now lvm does not build with debug symbols even though the configure script is given the appropriate option and the option is evidently applied during the build (the configuration log says it's on, and the corresponding flag (-g) is added to the list of flags passed to gcc).
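One way to cross-check that value from outside lvm is to ask device-mapper directly. As far as I understand it, open_count in struct dm_ioctl is an output field that the kernel fills in when it answers the ioctl, which would explain why no assignment shows up in the userspace code; dmsetup reports the same counter. The sketch below assumes dmsetup is installed and run as root, and that the device-mapper name is VG-sn_x as in the error messages above. The helper names are made up for this example.

```shell
#!/bin/sh
# Print the open count device-mapper reports for one device.
# $1 is the device-mapper name, e.g. VG-sn_x (assumed from the errors above).
dm_open_count() {
    dmsetup info -c --noheadings -o open "$1" | tr -d ' '
}

# Succeed if the given open count means the device is still held open,
# which is the condition under which lvm refuses to deactivate the LV.
is_in_use() {
    [ "$1" -gt 0 ]
}
```

Calling dm_open_count VG-sn_x right before each lvconvert in the test script and passing the result to is_in_use would show whether the kernel itself still counts a holder at the moment the "in use" error appears.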
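Since -g does reach gcc, it may be worth ruling out that the symbols are lost after compilation (for example if the binary is stripped at install time) or that a different lvm binary than the freshly built one ends up being run. Both are guesses, not something we have confirmed. A small check along these lines, assuming readelf from binutils is available, tells whether a given binary still carries DWARF debug information. The function names are invented for this sketch.

```shell
#!/bin/sh
# Read an ELF section listing on stdin (as printed by `readelf -S`);
# succeed if it contains a .debug_info section.
has_debug_info() {
    grep -q '\.debug_info'
}

# Succeed if the binary given as $1 still carries debug information.
binary_has_debug_info() {
    readelf -S "$1" | has_debug_info
}
```

Running binary_has_debug_info on both the binary in the build tree and the installed one (whatever `command -v lvm` resolves to) would show at which step, if any, the symbols disappear.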
In the attached archive you will find the following files:
kernel_logs - kernel logs after each tool invocation, retrieved by dmesg -c.
lvm.conf - lvm configuration file that we have used
lvm2.log - lvm logs with debug level set to 7
output_logs - 2 versions: neat and verbose. The difference is that verbose contains commands performed (set -x)
test.sh - the main test-script
remove.sh - a portion of test.sh responsible for cleanup (sometimes convenient to have separate)
We continue to study the problem, but any help or guidance from people who are familiar with the structure and code of lvm would be highly appreciated. Thanks a lot!