
Re: fixing a box where the hard disc may have failed



On Thu, 2006-09-21 at 13:24 +0100, James Wilkinson wrote: 
> Since you say that this is a "scratch" test PC, I'd do a 
> smartctl -H /dev/hda 
> (which is probably what I should have told you in the first 
> place). If that says "PASSED", I'd do a combination of 
> dd if=/dev/zero of=/dev/hda 
> to blank the drive (that should remap all the bad sectors), and 
> dd if=/dev/hda of=/dev/null 
> to read them all back. Then check for any more errors. If you 
> don't get any, I'd trust the drive for testing purposes.
> 
> Those dd commands will probably take several hours.

Um, no, actually.  Under an hour; 'twas only a 15 gig drive.  I did a
quick test to see what would happen if I ran dd against the drive that
the computer had booted from.  Watched it working, went away, came back
to a black screen (about what I expected).  Then I took the drive out
and put it into another box; results below.

[root box ~]# dd if=/dev/zero of=/dev/hdc
dd: writing to `/dev/hdc': Input/output error
23953097+0 records in
23953096+0 records out

Above is as I'd expect.  Below seems about right (same output count as
input, nearly the same count as was written above, and an error).  I'm
not sure at what stage a bad block gets mapped out of use.  In the past,
I'd have done that while prepping/formatting a drive.

[root box ~]# dd if=/dev/hdc of=/dev/null
dd: reading `/dev/hdc': Input/output error
23952864+0 records in
23952864+0 records out
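
As a rough cross-check on those counts: dd uses 512-byte blocks when no
bs= is given, so each record is one sector.  The sketch below turns the
record counts into byte offsets and a percentage of the 15,393,079,296
byte capacity that smartctl reports further below.  The function names
are made up for illustration, and it assumes a shell with 64-bit
arithmetic.

```shell
#!/bin/sh
# Sketch: convert dd record counts into byte offsets.
# Assumes dd's default 512-byte block size (no bs= was given).

SECTOR_BYTES=512
DISK_BYTES=15393079296   # "User Capacity" line from smartctl -a

# records_to_bytes N: byte offset reached after N records
records_to_bytes() {
    echo $(( $1 * $SECTOR_BYTES ))
}

# percent_of_disk N: whole-number percentage of the disk covered by N records
percent_of_disk() {
    echo $(( $1 * $SECTOR_BYTES * 100 / $DISK_BYTES ))
}

records_to_bytes 23953096          # records written before the error
percent_of_disk 23953096           # how far into the disk that is
records_to_bytes 23952864          # records read before the error
echo $(( 23953096 - 23952864 ))    # gap between the two counts, in sectors
```

Both runs stop well short of the end of the disk, and the read stops a
couple of hundred sectors before the write did.  That gap might simply
be the kernel reading ahead in larger chunks and failing the whole
chunk, but that is a guess rather than something checked here.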

Then I ran "smartctl -t short /dev/hdc" and looked at the results,
followed by "smartctl -t long /dev/hdc"; the results after both are
further below.  The basic health check came back fine:

[root box ~]# smartctl -H /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

So that looks okay.  But the "smartctl -a /dev/hdc" output is less inspiring:

[root box ~]# smartctl -a /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD153AA-00BAA0
Serial Number:    WD-WMA2L2483801
Firmware Version: 10.09K11
User Capacity:    15,393,079,296 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 23 19:13:27 2006 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (1040) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  14) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   197   098   051    Pre-fail  Always       -       45
  3 Spin_Up_Time            0x0006   109   104   000    Old_age   Always       -       1150
  4 Start_Stop_Count        0x0012   098   098   040    Old_age   Always       -       2524
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
  9 Power_On_Hours          0x0012   065   065   000    Old_age   Always       -       26136
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   098   098   000    Old_age   Always       -       2297
196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 572 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 572 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 cd 7e 6d e1  Error: UNC 24 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 18 c8 7e 6d e1 00      00:57:28.650  READ DMA
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA

Error 571 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 cd 7e 6d e1  Error: UNC 32 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA

Error 570 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 28 cd 7e 6d e1  Error: UNC 40 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA

Error 569 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 30 cd 7e 6d e1  Error: UNC 48 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA

Error 568 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 cd 7e 6d e1  Error: UNC 56 sectors at LBA = 0x016d7ecd = 23953101

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA
  c8 00 58 88 7e 6d e1 00      00:56:41.550  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      1014         23953101
# 2  Short offline       Completed: read failure       90%      1013         23953101
# 3  Extended offline    Completed: read failure       30%       990         23953101
# 4  Short offline       Completed without error       00%       990         -
# 5  Short offline       Completed without error       00%       327         -
# 6  Short offline       Completed without error       00%        93         -
# 7  Short captive       Completed without error       00%         0         -

Device does not support Selective Self Tests/Logging

Tests #1 & #2 are from after the dd experiment; the rest are from
before.  A quick perusal of the available information doesn't give me
any clues as to what the Remaining and LifeTime columns mean.  Predicted
failure time, uptime?
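
The LBA that smartctl prints for each error can be rebuilt from the
register dump.  For a 28-bit LBA command such as READ DMA, the low four
bits of DH hold LBA bits 27-24, and CH, CL and SN supply the remaining
three bytes.  A minimal sketch of that decoding follows; the function
name is hypothetical, and the dd line is only printed, not run.

```shell
#!/bin/sh
# Sketch: rebuild a 28-bit LBA from ATA task-file registers.
# Arguments are hex bytes without a 0x prefix, in the order SN CL CH DH
# (the same order as the smartctl register columns).

regs_to_lba() {
    sn=$1 cl=$2 ch=$3 dh=$4
    # Low nibble of DH is LBA[27:24]; CH, CL, SN are the lower three bytes.
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from error 572: SN=cd CL=7e CH=6d DH=e1
lba=$(regs_to_lba cd 7e 6d e1)
echo "$lba"

# A read of just that one sector would look like this (printed only):
echo "dd if=/dev/hdc of=/dev/null bs=512 skip=$lba count=1"
```

The decoded value should agree with both the "LBA =" figure in each
error entry and the LBA_of_first_error column of the self-test log,
which all point at the same sector.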

-- 
(Currently running FC4, occasionally trying FC5.)

Don't send private replies to my address, the mailbox is ignored.
I read messages from the public lists.

