[linux-lvm] Random file system errors

f-lvm at media.mit.edu
Wed Apr 29 03:52:43 UTC 2009


Btw, one way to test your hardware without yanking disks or even
opening the case (which could turn this into a heisenbug if the cause
really -is- something like cabling) is to run something like this:

   dd if=/dev/hda bs=1M count=1000 | md5sum

for each of hdX and sdX or whatever describes the raw physical
devices.  Do this with the LVM -completely deactivated- so you
know that absolutely nothing can be writing to the disks; you
should probably boot from a LiveCD to ensure this.

Run each test at least twice for the same disk and record the results;
I'll bet that at least one of your disks will return inconsistent
data; perhaps all disks on one IDE channel or one SATA channel will,
or perhaps every single disk will if you've got RAM, PSU, or
bridge-chip troubles, etc.
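
The read-twice-and-compare step could be scripted roughly as below.
This is only a sketch: read_sum and check_device are hypothetical
names, it assumes GNU dd (for bs=1M) and md5sum, and it only ever
reads (dd gets "if" and never "of").

```shell
#!/bin/sh
# Hypothetical helper: print the md5 of the first COUNT MiB of a source.
# Read-only: dd is given if= and no of=.
read_sum() {
    dd if="$1" bs=1M count="${2:-1000}" 2>/dev/null | md5sum | cut -d' ' -f1
}

# Read the same source PASSES times (default 2) and compare checksums.
# Returns 0 if every pass matched the first one, 1 otherwise.
check_device() {
    dev=$1
    count=${2:-1000}
    passes=${3:-2}
    first=$(read_sum "$dev" "$count")
    i=1
    while [ "$i" -lt "$passes" ]; do
        next=$(read_sum "$dev" "$count")
        if [ "$next" != "$first" ]; then
            echo "$dev: INCONSISTENT ($first vs $next)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$dev: consistent ($first)"
}
```

Usage would be something like

   for d in /dev/hda /dev/sda; do check_device "$d" 1000 2; done

where the device names are placeholders for whatever your system
actually has.  On a box with plenty of RAM the later passes may be
partly served from the page cache rather than the disk; GNU dd's
iflag=direct, or dropping caches between passes, is one way around
that, though neither is in the sketch.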

If you're seeing a very low frequency of bit flips, raise the count on
the dd to something larger, say count=10000 (roughly 10 GB per pass
with bs=1M); that'll slow down the test but raise your confidence in it.

Either way, try it on a USB device as well.  Very different hardware
and software paths.  Might be illuminating.

Just make -damned- sure that your dd is using "if" and not "of"!

If you -can't- make it fail, you might get fancier and try something
that forces lots of head seeking (since that will consume more power
and maybe stress your PSU), or try running all the disk tests in
parallel (since that will chew up more CPU) or perhaps run something
that runs your CPU flat out in one process while doing the dd in
another.
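
The parallel and CPU-load variants might look something like this.
Again just a sketch under assumptions: parallel_sums, start_cpu_burn
and stop_cpu_burn are hypothetical names, GNU dd and md5sum are
assumed, and the busy loop only pegs one core.  It does not attempt
the head-seeking variant.

```shell
#!/bin/sh
# Checksum the first COUNT MiB of every listed source at the same time.
# Prints one "source checksum" line per source, in whatever order they finish.
parallel_sums() {
    count=$1
    shift
    pids=""
    for dev in "$@"; do
        (
            sum=$(dd if="$dev" bs=1M count="$count" 2>/dev/null | md5sum | cut -d' ' -f1)
            echo "$dev $sum"
        ) &
        pids="$pids $!"
    done
    # Wait only for our own readers, not for any other background job.
    for p in $pids; do
        wait "$p"
    done
}

# Start a background busy loop to keep one CPU core flat out.
start_cpu_burn() {
    ( while :; do :; done ) &
    BURN_PID=$!
}

# Stop the busy loop started by start_cpu_burn.
stop_cpu_burn() {
    kill "$BURN_PID" 2>/dev/null
    wait "$BURN_PID" 2>/dev/null || true
}
```

Something like

   start_cpu_burn
   parallel_sums 1000 /dev/hda /dev/sda | sort > pass1
   parallel_sums 1000 /dev/hda /dev/sda | sort > pass2
   stop_cpu_burn
   diff pass1 pass2

would then show any disk whose checksum changed between passes (the
device names and the pass1/pass2 files are placeholders).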

If you still can't make it fail, try activating the LVM -from a LiveCD-
(i.e., -not- booted from the LVM itself) and then repeat the tests on
the LVs.  If it fails on LVs that have no mounted filesystems and aren't being
touched, but works on the raw devices, -then- you're starting to point
a finger at LVM...  (And if you have to mount a FS to start getting
failures, only then might we start thinking about write barriers or
whatever...)

If nothing you do makes it fail, but it fails when you're
booted and running from that LVM, I'd start to suspect LVM and/or
kernel issues in the actual software you're running.  But I'll bet
that you'll see a failure before that point.

And report back; it'd be good to close the loop on this if it's proven
-not- to be an LVM issue.



