Re: [dm-devel] kernel update and dmraid causing grub errors

On 11/09/2010 11:55 AM, David C. Rankin wrote:
> On 11/04/2010 11:17 AM, David C. Rankin wrote:
>> Hi Heinz,
>> 	No, grub is (grub-0.97-17) and it hasn't changed since April 25, 2010. So
>> whatever is happening, isn't due to a grub change. Some of the Arch devs think
>> it might be a kernel issue. Last night, I posted the issue to the kernel list at
>> kernel.org and we will see what response we get back. The post to kernel.org was
>> pretty much the complete history of the issue, so I'll include the additional
>> information posted to the kernel list below for completeness:
> Heinz,
> 	Just as a follow-up, I didn't get a response from the kernel.org list on the
> issue. In fact the only dm related post on the list in the past week was the CFQ
> dm-crypt post that I also see was cc'ed here. I'll try the grub list and see if
> they have any ideas. If I get a response, I'll let you know. If you have any
> epiphanies on the issue, please let me know. Thanks.


    I have one more piece of input and one more question. The issue may be more
than just this one box. I have two x86_64 nv dmraid boxes at the house
(primary/backup servers). The one I have had the boot problems with (MSI K9N2
SLI Platinum - Award BIOS) (running and the other one is based on a
Tyan Tomcat K8e (Model: S2865 - Phoenix BIOS/Opteron 180) (running
Both have similar nv dmraid setups. (MSI box has 2 RAID 1 arrays, Tyan box has 1
RAID 1 array)

    What I have noticed recently is that the Tyan box boots and experiences what sounds
like disk/drive controller "confusion." What is weird is that it depends on how
the box inits. The problem is either "there" or it "isn't".

    What I mean is that when the problem occurs on the Tyan box -- it affects
the box from boot until shutdown. It behaves just like there is an interrupt
conflict or drive/controller fault. I can hear consistent read/write head
excursions (once every 2-3 secs.) and I get 15-30-60 second delays with
everything (type ls -- then wait 30,60 seconds for the listing or rt-click on
the desktop and wait, and wait... for the context menu). It doesn't matter
whether I have a desktop running or boot to runlevel 3 -- it's a low-level

    Normally that is a "Hey stupid, you have a drive failing... go fix it"
issue. But it's not. smartctl is fine on all drives -- "no errors logged".
Nothing in syslog or dmesg, and the disks are clean.

    A shutdown or reboot will completely "fix" the problem. Although today I had
to shut down/restart 3 times before it "fixed" itself. When the box "inits"
without having this problem - it never exhibits *any* problem until the next
boot when whatever it is strikes again.

    Since I rarely boot the box, I don't exactly know when this started, but it
has been within the past month -- which is consistent with the latest round of
boot failures on the MSI box moving from kernel to .8.

    I don't know what to make of it. It seems like something has just gone
"flaky" with how dmraid is working (or grub or kernel or whatever), and it's
like some part of the setup is just confused. On the MSI box, it appears as some
attempt to read beyond the partition boundary or the box thinking there is a
corrupt partition table and booting fails with the latest kernels. On the Tyan
box, it appears as something that causes read/write head excursions and causes
the 15-60 second hangs like there is an interrupt conflict or some hardware
thing waiting on a timeout.

    One item that did catch my eye on the kernel list was a dmraid issue
concerning a "CFQ dm-crypt" problem. I have no idea what that is other than
gleaning it had to do with some type of dmraid queue/scheduler that was causing
problems. I don't know if that could point to some area of dmraid that might be
the culprit.

    If you have any ideas of any type of test and/or diagnostic I could use the
next time the Tyan box exhibits the problem -- to look at where the hang/timeout
issue is, I would appreciate your ideas. (that's an area where I have no clue...
how or what to look for)
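
    As a starting point for such a diagnostic, here is a minimal sketch of
one check that could be run while the Tyan box is hanging. It is only an
assumption about what might be useful, not something suggested on the list:
it samples /proc/diskstats twice and reports, per block device, how many I/Os
are in flight and what share of the interval the device had I/O outstanding.
The function names (parse_diskstats, busy_report, sample) and the 5 second
interval are made up for illustration. A device that stays near 100% busy
with requests stuck in flight would point at the disk/controller path rather
than at userspace.

```python
import time


def parse_diskstats(text):
    """Return {device: (ios_in_progress, ms_doing_io)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip lines that lack the full set of counters
        # fields[2] = device name, fields[11] = I/Os currently in progress,
        # fields[12] = total milliseconds spent doing I/O
        stats[fields[2]] = (int(fields[11]), int(fields[12]))
    return stats


def busy_report(before, after, interval_ms):
    """Return {device: (ios_in_progress_now, percent_of_interval_busy)}."""
    report = {}
    for dev, (inflight, ms_io) in after.items():
        if dev in before:
            util = 100.0 * (ms_io - before[dev][1]) / interval_ms
            report[dev] = (inflight, round(util, 1))
    return report


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and compare them."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return busy_report(before, after, interval * 1000.0)
```

    Calling sample() on the affected box and printing the result for the
underlying sd* devices and the dm-* devices side by side should show whether
the delay sits below or above the device-mapper layer.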

    Thanks for all your continued help and willingness to provide ideas. I know
this is a weird issue, but now that I have two boxes showing some signs of a
similar problem -- hopefully that will help me narrow it down.

