Advice for dealing with bad sectors on /

Sun Jan 2 04:16:39 UTC 2005

All,

Comments in-line.

--- Larry McVoy <lm at bitmover.com> wrote:

> The one thing I'd add to Joseph's good advice is that when I see stuff like
> this (which I do, I manage a lot of Linux boxes) I tend to start swapping
> things.  Put the drive in a known good system with a known good cable on
> the cable by itself and then see if you get errors.  If you don't get
> errors in that situation it is likely your drive is fine and you have
> some bad hardware elsewhere.
> 
> Hardware debugging is basically swapping parts until you find the guilty
> party.

Thanks to both you and Joseph for making me think about things that I simply
wouldn't have (or, at least not without first fixing something that wasn't
broke).  I would have immediately suspected the hard drive, not cables or other
hardware.  But, I guess that comes with experience, so thanks for sharing.  The
first thing I'll do is make sure that the cables are secure, and swapping
cables as a quick test is easy enough to do.  The controller is integrated into
the MB, so that would be more problematical :)

> On Sat, Jan 01, 2005 at 01:28:39PM -0600, Joseph D. Wagner wrote:
> > > Getting errors similar to:
> > > 
> > > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: status=0x51 { DriveReady
> > > SeekComplete Error }
> > > Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 {
> > > UncorrectableError },
> > > LBAsect=163423, high=0, low=163423, sector=163360
> > > Dec 31 20:44:30 mybox kernel: end_request: I/O error, dev 03:41 (hdb),
> > > sector
> > > 163360
> > 
> > This may not be the disk; it could also be the controller.  I've seen it go
> > both ways.  Any problems on hda?

No problems on hda.  But, if it's the controller, that's built into the MB, so
that wouldn't be good.

I didn't just get DMA-type errors; there are others, like the one below.  I
can't say that this is a complete list, though:

Dec 29 16:40:40 mybox kernel: hdb: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Dec 29 16:40:40 mybox kernel: hdb: read_intr: error=0x40 { UncorrectableError
}, LBAsect=163423, high=0, low=163423, sector=163360
Dec 29 16:40:40 mybox kernel: end_request: I/O error, dev 03:41 (hdb), sector
163360
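
To check whether all of these messages point at the same spot on the disk, the
log lines can be pulled apart with a short script.  What follows is only a
rough sketch in Python, not a tested tool: the function names and regular
expressions are made up for illustration, and the sample text is just the
lines above rejoined after mail wrapping.  It assumes the usual IDE numbering
(major 3, 64 minors per drive), under which dev 03:41 would be partition 1 on
the second drive, i.e. hdb1.

```python
import re

# Sample lines copied from the log excerpts in this thread, rejoined after
# the mail client wrapped them.
LOG = """\
Dec 29 16:40:40 mybox kernel: hdb: read_intr: error=0x40 { UncorrectableError }, LBAsect=163423, high=0, low=163423, sector=163360
Dec 29 16:40:40 mybox kernel: end_request: I/O error, dev 03:41 (hdb), sector 163360
Dec 31 20:44:30 mybox kernel: hdb: dma_intr: error=0x40 { UncorrectableError }, LBAsect=163423, high=0, low=163423, sector=163360
"""

# Hypothetical patterns written for this message format only.
ERR_RE = re.compile(
    r"kernel: (\w+): (\w+): error=0x[0-9a-f]+ .*?LBAsect=(\d+).*?sector=(\d+)"
)
DEV_RE = re.compile(r"dev ([0-9a-f]{2}):([0-9a-f]{2})")


def bad_sectors(text):
    """Map (drive, absolute LBA, reported sector) to the handlers that hit it."""
    found = {}
    for m in ERR_RE.finditer(text):
        key = (m.group(1), int(m.group(3)), int(m.group(4)))
        found.setdefault(key, []).append(m.group(2))
    return found


def decode_dev(text):
    """Decode 'dev MM:mm' pairs (hex major:minor) into integer tuples."""
    return sorted({(int(a, 16), int(b, 16)) for a, b in DEV_RE.findall(text)})


if __name__ == "__main__":
    for (drive, lba, rel), handlers in bad_sectors(LOG).items():
        print(drive, "LBA", lba, "sector", rel, "offset", lba - rel, handlers)
    for major, minor in decode_dev(LOG):
        # Assumption: 64 minors per IDE drive, so minor // 64 is the drive
        # index on that major and minor % 64 is the partition number.
        print("major", major, "drive index", minor // 64, "partition", minor % 64)
```

On the lines above, both the read_intr and the dma_intr errors land on the
same sector (LBAsect=163423), so so far this looks like one bad spot rather
than errors scattered across the disk.  The difference of 63 between LBAsect
and sector would be consistent with hdb1 starting at sector 63, but that is a
guess from the numbers, not something I have checked against the partition
table.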

> > Try adding ide=nodma to the kernel parameters.  If the problem goes away,
> > the problem is in the kernel driver for the controller or motherboard
> > chipset.

Excellent suggestion.  Will give that a try.

> > > When I rebooted, the system threw me into a shell, to get me to "fix"
> > > things. So, I did an e2fsck -c -v /dev/hdb1 to attempt to fix things.
> > > The badblocks checking took 20 hours (it's a 200GB disk).  Then I went
> > > through the question/answer session, hoping to get through the
> > > problems...
> > 
> > A better way to go about this is booting off the rescue CD and doing the
> > e2fsck scan there.  Otherwise, there could be leftover problems from running
> > the scan off of the partition you are scanning.

Ah, now that advice is something to put in my back pocket to remember.  Never
gave that a thought, since it "booted enough" to get me to a shell prompt to
run e2fsck.  Guess I wasn't forced to think "Rescue CD".

> > > Some questions:
> > 
> > Best advice to all 3 questions: get some sort of disk imaging software.
> > 
> > The disk imaging software may copy the bad sectors (i.e. sectors marked bad
> > now may also be marked bad on the new drive), but you can force e2fsck to
> > rescan bad sectors.
> > 

[SNIP]

Thanks a million for all of the advice.  I really appreciate it!  About to go
try these suggestions.
Steve
