RAID drive failed, but SMART shows no errors?

Tue Mar 13 23:08:52 UTC 2007

Mogens Kjaer writes:

> Sam Varshavchik wrote:
> ...
>> But smartctl gives this drive a clean bill of health:
>> 
>> [root@headache ~]# smartctl -H /dev/sda
>> smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
>> Home page is http://smartmontools.sourceforge.net/
>> 
>> SMART Health Status: OK
> 
> Try running a SMART test on the drive:
> 
> smartctl -t long /dev/sda
> 
> It will tell you how long the test takes to run;
> you'll have to check once in a while with
> 
> smartctl -a /dev/sda
> 
> to get the result of the test. It will be at the end:
> 
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Completed                   -   12641                 - [-   -    -]

Came up clean.  Nothing shown for LBA_first_err.  But the fact remains that 
the drive did err out.  smartctl -a shows "Elements in grown defect list: 
3", so I suppose that it remapped 3 sectors.  I can't find anything in the 
output that tells me how many spare sectors are available for remapping.

Also, despite the 3 defects, in the "Error counter log" portion, both read 
and write show "0" for "Total uncorrected errors", so I'm not sure how to 
reconcile that.  Sounds to me like the drive successfully remapped a few 
defects and informed the host about it, but the kernel interpreted the 
result as a permanent error, and took the partition out of RAID.
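
For anyone who wants to compare the same two numbers on their own drive, here 
is a rough sketch that pulls them out of saved "smartctl -a" output.  It 
assumes the SCSI-style report layout discussed above, where the last column of 
the "read:" and "write:" rows in the error counter log is "Total uncorrected 
errors"; ATA drives print a different report.  The function names are made up 
for illustration.

```shell
#!/bin/sh
# Sketch: read a `smartctl -a` report on stdin and extract two figures.
# Assumes the SCSI report layout; function names are hypothetical.

# Print the count that follows "Elements in grown defect list:".
grown_defects() {
    sed -n 's/^[[:space:]]*Elements in grown defect list:[[:space:]]*\([0-9][0-9]*\).*$/\1/p'
}

# Print "read N" and "write N", where N is the last column of the
# error counter log rows (assumed to be "Total uncorrected errors").
uncorrected_errors() {
    awk '$1 == "read:" || $1 == "write:" { sub(":", "", $1); print $1, $NF }'
}

# Usage on a live system (needs root):
#   smartctl -a /dev/sda | grown_defects
#   smartctl -a /dev/sda | uncorrected_errors
```

A non-zero grown defect count next to zero uncorrected errors would match the 
reading above: sectors were remapped, but no read or write was ultimately lost.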

>> I have three RAID-1 partitions on these disks.  The one that reported an 
>> error was the largest one.  I dropped the degraded partition, and 
>> hot-added it back.  Immediately, another error was logged to 
>> /var/log/messages, for the same block, but despite the error, the kernel 
>> started resyncing the array:
> ...
> 
> If it were me, I would replace this disk. The next time you
> run into this read error could be when sdb fails and you try
> to resync a new sdb :-(

Yeah, I'm going to do that.  But, with a clean long test, I think I have 
some breathing room to wait a few days for a convenient time to do it.
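
For the swap itself, the drop and hot-add sequence can be sketched with mdadm 
in manage mode.  The script below only prints the commands so they can be 
reviewed first.  The array and partition names are placeholders, not the real 
layout of this machine; the actual ones come from /proc/mdstat.

```shell
#!/bin/sh
# Sketch: print (do not run) the mdadm commands for replacing one member
# of a RAID-1 array. All device names passed in are hypothetical.

# swap_member ARRAY OLD_PARTITION NEW_PARTITION
swap_member() {
    array=$1
    old=$2
    new=$3
    echo "mdadm $array --fail $old"     # mark the old member as faulty
    echo "mdadm $array --remove $old"   # detach it from the array
    echo "mdadm $array --add $new"      # add the replacement; resync begins
}

# Example: same partition name before and after a physical drive swap.
swap_member /dev/md2 /dev/sda3 /dev/sda3
```

With three RAID-1 arrays on the disk, the same three steps would be repeated 
once per array, and the resync watched in /proc/mdstat.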

>> If I cannot do this, my third question is what do I need to do, 
>> grub-wise, to be able to swap sdb with sda?  sda is the one that's 
>> failing the RAID-1 array.  If I can't hot-swap it, I'll need to replace 
>> it with the sdb drive, but right now grub is installed only on sda, so 
>> how do I install a copy of all the grub boot-related stuff on sdb?
> 
> Hm? If you have used the GUI to create the RAID partitions during
> installation GRUB should be on both drives.

No, I don't believe I used a GUI; I believe this was originally a text 
install.  grub-install takes a parameter, according to its man page.  It's 
not clear, but I think that passing it /dev/sdb will install it to the 
second drive.  But, reading the man page's description of the 
--root-directory parameter muddled things a bit.  Not sure I understand its 
purpose.
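
As far as I can tell from the GRUB legacy documentation, --root-directory=DIR 
makes grub-install place the GRUB images under DIR instead of under /, which 
only matters when the target filesystem is mounted somewhere other than /.  An 
alternative that is often used with RAID-1 is to drive the grub shell directly 
and map the second drive as (hd0), so that it can boot on its own once it 
becomes the first disk.  The sketch below only generates the batch commands.  
It assumes /boot lives on the first partition of the drive; the function name 
and the partition index are illustrative.

```shell
#!/bin/sh
# Sketch: emit a GRUB legacy batch script that installs the boot loader
# on the given disk, treating that disk as (hd0).
# Assumption: /boot is on partition index 0 unless a second argument says
# otherwise. The function name is hypothetical.

# grub_batch_for DISK [PARTITION_INDEX]
grub_batch_for() {
    disk=$1
    part=${2:-0}
    cat <<EOF
device (hd0) $disk
root (hd0,$part)
setup (hd0)
quit
EOF
}

# Review the generated commands first:
grub_batch_for /dev/sdb

# To apply them for real (as root):
#   grub_batch_for /dev/sdb | grub --batch
```

Mapping sdb as (hd0) is the point of the exercise: after sda is pulled, the 
BIOS will see the surviving drive as the first disk.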
