[dm-devel] Hard drives shutting themselves off in RAID mode

Thu Jun 15 20:29:12 UTC 2006

Tom,

I did not read your e-mail in full, but using lots of SATA drives
in a big RAID array is not something I would attempt with 2.6.17 or
older kernels.  (I know 2.6.17 is not even out yet.)

In 2.6.17-mm there is a huge SATA error handler (EH) rewrite.  It is
planned to hit the stable Linus kernel with 2.6.18 towards the end of
the summer, but even then only a few of the actual drivers will have
been modified to use the new EH infrastructure.

I would repost your problem to the linux-ide list and ask whether they
think the new EH should help you, and when/if your controller's driver
will be using the new EH infrastructure.

FYI: that is linux-ide at vger.kernel.org.  SATA is discussed there,
and there is no need to subscribe; they will cc you on responses.

Also, there is a ton of testing going on with the new EH, so if you're
willing to be a guinea pig, I'm sure you will get a lot of support
from the dev team and get your specific driver updated ASAP.
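
As a rough first check before reposting, the running kernel's release
string can be compared against 2.6.18.  The following is only a sketch:
the function names are made up for illustration, the 2.6.18 threshold
is the planned merge target mentioned above rather than a confirmed
fact, and the check says nothing about -mm trees or about whether a
particular driver has been converted to the new EH.

```python
import platform
import re

# Assumption from this message: the new EH is planned for mainline 2.6.18.
NEW_EH_MIN = (2, 6, 18)


def parse_kernel_version(release):
    """Return the leading (major, minor, patch) of a string like '2.6.17-mm1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing patch level (e.g. '5.15') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def may_have_new_eh(release):
    """True if the mainline version number is at least 2.6.18."""
    return parse_kernel_version(release) >= NEW_EH_MIN


def running_kernel_may_have_new_eh():
    """Apply the same check to the kernel this script is running on."""
    return may_have_new_eh(platform.release())
```

A True result only means the kernel is new enough to possibly carry the
rewrite; whether your controller's driver uses it is still a question
for linux-ide.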

HTH
Greg
-- 
Greg Freemyer

On 6/14/06, Tom Wirschell <Tom at wirschell.nl> wrote:
> On 14 Jun 2006, Rune Saetre wrote:
> >
> > I always thought the loud click came from the disks parking their
> > heads before spinning down.
>
> Well, it's most certainly loud. The same type of loud that you get when
> the machine shuts down and removes the power from the drives. I thought
> recalibration ticks weren't particularly loud.
>
> > Anyway, it can take several seconds before a disk responds to
> > commands after having spun down.
>
> The problem isn't that it takes time to come back up after a spin down.
> The drive isn't spinning down. It's turning itself off completely
> (note the 'no device found' bit in the error). And it does this while
> it's actively being used.
>
> > On Wed, 14 Jun 2006, Molle Bestefich wrote:
> > >
> > > Does the drive's SMART log say anything interesting?
>
> That's a damned good question. I didn't even know you could query that,
> so I just recreated the array and started my test again. Took about 90
> minutes for one of the drives to die. Unfortunately when it dies it
> refuses to respond to anything.
>
> When I try the smartctl program on the failed drive I get:
> Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
> When I issue the exact same command for another disk on the controller
> I get a nice listing that you would expect from this program.
>
> When I use hdparm -I on the dead drive I get:
> HDIO_DRIVE_CMD(identify) failed: Input/output error
> And again, if I issue the exact same command for another disk on this
> same controller I get a nice bit of info on the drive.
>
> To me at least, this basically says that the drive is actually turned
> off at this point in time. It would explain why SMART isn't getting any
> data. On the other hand, it doesn't explain *WHY* the drive is off.
> Do you know any program that's capable of telling a drive that isn't on
> to activate itself? I don't think it's even possible but might be
> mistaken there.
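
To catch the moment a drive drops out without waiting for the array to
fail, the smartctl check described above can be run against every
member in a loop.  This is a minimal Python sketch, not a tested
monitoring tool: the function names and device paths are illustrative,
and it assumes smartctl's documented exit status, where the low three
bits signal a bad command line, a failed device open or IDENTIFY, or a
failed command to the disk.

```python
import subprocess

# Low three bits of smartctl's exit status: command line error, device
# open/IDENTIFY failure, or a failed SMART/ATA command.
ACCESS_FAILURE_MASK = 0b111


def identify_ok(device, run=subprocess.run):
    """Return True if `smartctl -i DEVICE` could talk to the drive."""
    try:
        result = run(
            ["smartctl", "-i", device],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError:
        # smartctl is not installed or could not be executed.
        return False
    return (result.returncode & ACCESS_FAILURE_MASK) == 0


def silent_drives(devices, run=subprocess.run):
    """Return the devices that did not answer the identify request."""
    return [dev for dev in devices if not identify_ok(dev, run)]
```

Calling silent_drives(["/dev/sda", "/dev/sdb"]) every minute or so (as
root, with placeholder device names replaced by the real ones) would
at least timestamp when a drive stops responding, which can then be
matched against the kernel log.
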
>
> So, I reboot, run smartctl again and I'm presented with a nice sheet
> of output that basically says all is well, nothing ever went wrong with
> this drive and you can feel safe in using it.
>
> This royally sucks...
>
> > > Have you tried poking the IDE driver to reset the bus, might get it
> > > running again?
>
> How would I do this? I've compiled the driver into the kernel. But if
> SMART data is kept even when a drive is off, this won't fix anything.
>
> > > Not a very pretty solution, especially since you might still suffer
> > > two drives going down at once from time to time.  Maybe you can
> > > patch MD to pause the array and poke the IDE driver whenever a disk
> > > is lost? Then you would at least only have intermittent failures /
> > > timeouts on a rare basis rather than a non-redundant array when
> > > something happens.
>
> The problem is that I can't tell if it's really MD that is telling the
> drive to turn itself off. Is there even code in MD that does this?
> Shouldn't it complain VERY LOUDLY that it's unhappy with a drive and
> thus decide to kill it?
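
When MD drops a member it marks the device with (F) in /proc/mdstat, so
the array's own view can be checked directly.  The sketch below parses
that file's text and lists the flagged members; the function name and
the sample array layout are illustrative only, and it shows what MD has
recorded, not why the drive stopped answering.

```python
import re

# A member looks like 'sdd1[3]' optionally followed by flags such as '(F)'.
_MEMBER = re.compile(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)")


def failed_members(mdstat_text):
    """Map each md array name to the member devices flagged (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        # Array lines look like 'md0 : active raid5 sdd1[3](F) sdc1[2] ...'.
        if not line.startswith("md") or " : " not in line:
            continue
        name, _, rest = line.partition(" : ")
        bad = [dev for dev, flags in _MEMBER.findall(rest) if "(F)" in flags]
        if bad:
            failed[name.strip()] = bad
    return failed


def read_failed_members(path="/proc/mdstat"):
    """Read the live mdstat file and report its failed members."""
    with open(path) as handle:
        return failed_members(handle.read())
```

If a drive shows up here with (F) at the same moment the identify
commands start failing, MD is most likely reacting to the drive going
away rather than causing it.
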
>
> > > If the disk never comes up, being patient surely won't help.
> > > Wait for an hour and see if the drive comes up, ask the WD folks
> > > exactly how patient they want you to be? :-)
>
> The assumption was that since the drive took so long to respond, MD is
> telling the drive "You know what, fuck it. Never mind those outstanding
> requests, just shut down and let the rest of us get on with business",
> only thereby killing the array.
>
> > > bonnie++ does random seeks, right?
>
> I think so, yeah.
>
> Kind regards,
>
> Tom Wirschell

-- 
Greg Freemyer
The Norcross Group
Forensics for the 21st Century