Seagate disk problems (NCQ bug???)

D. Hugh Redelmeier hugh at mimosa.com
Mon May 11 03:53:14 UTC 2009


| Date: Tue, 28 Apr 2009 17:59:50 -0700

Sorry for such a slow reply.

| From: Dave Stevens <geek at uniserve.com>
| Subject: Re: Seagate disk problems (NCQ bug???)
| 
| Quoting "Wolfgang S. Rupprecht" <wolfgang.rupprecht+gnus200904 at gmail.com>:
| 
| >
| >After running flawlessly for 6+ months I just had my Seagate
| >ST31500343AS (w. SD35 firmware) flake out.  Does this look like the NCQ
| >bug or just a random event?  The final error msg was around the time the
| >machine hung hard.
| 
| There is a specific test you can download from Seagate and burn to a bootable
| cd. The test on the cd will tell you if it is the ncq bug. They are offering
| data recovery if it is indeed a blown disk, they're treating it as a warranty
| issue.

Can you give us a pointer to official and unofficial information about
the NCQ bug?

Seagate had a bug in firmware for 7200.11 drives.  They publicly
disclosed a bit about the problem and offered a firmware upgrade near
the end of January 2009.  If the bug tripped, the drive would lock
up and could not be fixed in place.  See
  http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=11972&view=by_date_ascending&page=1

That firmware fix has left a lot of complaining users.  That forum
thread has 794 messages currently!  I've read all of them and cannot
really see a pattern for the remaining problems.

I started this thread to try to get more coherent reports but it
hasn't worked.
  http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=11184

A number of reports appear to be cases of drives going "offline" for
no reported reason.  One symptom is drives falling out of RAID arrays.
These drives come back after a power cycle.

Perhaps your problem is like this one.  And I have no idea if NCQ is
implicated.  But once your drive gets in a bad state, the driver tries
a reset and still isn't happy.  I'd be surprised if NCQ is used
between the reset and the subsequent failure.

| >Apr 28 04:26:29 arbol kernel: ata1: SATA max UDMA/133 irq_stat 0x00400000,
| >PHY RDY changed irq 22
| >Apr 28 04:26:29 arbol kernel: ata1: softreset failed (device not ready)
| >Apr 28 04:26:29 arbol kernel: ata1: failed due to HW bug, retry pmp=0
| >Apr 28 04:26:29 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 04:26:29 arbol kernel: ata1.00: ATA-8: ST31500343AS, SD35, max
| >UDMA/133
| >Apr 28 04:26:29 arbol kernel: ata1.00: 2930277168 sectors, multi 16: LBA48
| >NCQ (depth 31/32)
| >Apr 28 04:26:29 arbol kernel: ata1.00: configured for UDMA/133

Time passes.  Happily, I assume.

| >Apr 28 06:17:02 arbol kernel: ata1.00: exception Emask 0x50 SAct 0x1 SErr
| >0x90a02 action 0xe frozen
| >Apr 28 06:17:02 arbol kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
| >Apr 28 06:17:02 arbol kernel: ata1: SError: { RecovComm Persist HostInt
| >PHYRdyChg 10B8B }
| >Apr 28 06:17:02 arbol kernel: ata1.00: cmd
| >60/08:00:e1:81:24/00:00:74:00:00/40 tag 0 ncq 4096 in
| >Apr 28 06:17:02 arbol kernel:         res 40/00:00:e1:81:24/00:00:74:00:00/40
| >Emask 0x50 (ATA bus error)
| >Apr 28 06:17:02 arbol kernel: ata1.00: status: { DRDY }

Something has gone wrong (duh!), but I don't know enough to say what.
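
For what it's worth, the hex values on the exception line can at least
be cross-checked against the text the kernel printed.  Here is a
minimal Python sketch, assuming my reading of the SATA SError bit
layout is right (the short names mirror the kernel's own output;
decode_serr and active_tags are made-up helper names, not anything
from the driver):

```python
# Assumed SError bit positions; names copied from the kernel's log style.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items())
            if value & (1 << bit)]


def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    print(decode_serr(0x90a02))  # SErr from the 06:17:02 exception line
    print(active_tags(0x1))      # SAct from the same line
```

With SErr 0x90a02 this gives the same five flags that the kernel
listed, and SAct 0x1 corresponds to the single "tag 0" command shown
in the log.  That only confirms the decoding; it doesn't tell us why
the link dropped.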

| >Apr 28 06:17:02 arbol kernel: ata1: hard resetting link

Here's a reset.  I bet NCQ will not be used for a while (until the
drive appears to be up again after the reset).

| >Apr 28 06:17:04 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 06:17:09 arbol kernel: ata1.00: qc timeout (cmd 0xec)
| >Apr 28 06:17:09 arbol kernel: ata1.00: failed to IDENTIFY (I/O error,
| >err_mask=0x4)

IDENTIFY is a command that asks the drive about its characteristics.
I would be astonished if a driver would be using NCQ at this point.
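
To make that concrete: the command that failed at 06:17:02 had opcode
0x60 (READ FPDMA QUEUED, an NCQ read, which matches the "ncq 4096 in"
in the log), whereas the commands timing out after the reset are 0xec
(IDENTIFY DEVICE, not a queued command).  A small Python sketch of
that lookup; the table only covers the opcodes relevant here, and
opcode_from_log and describe are hypothetical helper names:

```python
import re

# Opcode -> (name, is_ncq).  Deliberately partial: just what this log needs.
ATA_OPCODES = {
    0x60: ("READ FPDMA QUEUED", True),
    0x61: ("WRITE FPDMA QUEUED", True),
    0xEC: ("IDENTIFY DEVICE", False),
}


def opcode_from_log(line):
    """Pull the ATA opcode out of an (unwrapped) libata log line, or None."""
    # Form 1: "qc timeout (cmd 0xec)"
    m = re.search(r"\(cmd 0x([0-9a-fA-F]{2})\)", line)
    if m is None:
        # Form 2: "cmd 60/08:00:..." where the first byte is the opcode
        m = re.search(r"cmd ([0-9a-fA-F]{2})/", line)
    return int(m.group(1), 16) if m else None


def describe(opcode):
    """Return a short description of the opcode and whether it is NCQ."""
    name, is_ncq = ATA_OPCODES.get(opcode, ("unknown", False))
    return "0x%02x %s (%s)" % (opcode, name, "NCQ" if is_ncq else "non-NCQ")


if __name__ == "__main__":
    for line in (
        "ata1.00: cmd 60/08:00:e1:81:24/00:00:74:00:00/40 tag 0 ncq 4096 in",
        "ata1.00: qc timeout (cmd 0xec)",
    ):
        print(describe(opcode_from_log(line)))
```

Note that the quoted log lines are wrapped by the mailer, so they need
to be rejoined before feeding them to something like this.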

| >Apr 28 06:17:09 arbol kernel: ata1.00: revalidation failed (errno=-5)
| >Apr 28 06:17:09 arbol kernel: ata1: hard resetting link

Another reset.

| >Apr 28 06:17:11 arbol kernel: ata1: SATA link up 3.0 Gbps (SStatus 123
| >SControl 300)
| >Apr 28 06:17:21 arbol kernel: ata1.00: qc timeout (cmd 0xec)
| >Apr 28 06:17:21 arbol kernel: ata1.00: failed to IDENTIFY (I/O error,
| >err_mask=0x4)

Another failure.  Again, I would not expect NCQ to be used at this
point.

And so it goes.  I infer that this cycle goes on until the power is
turned off.
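
If anyone wants to check their own logs for the same loop, something
along these lines might do.  It is only a sketch in Python: it assumes
the log lines have been unwrapped, it matches on the message text seen
above (which may differ between kernel versions), and summarize is a
name I made up:

```python
def summarize(lines):
    """Count hard resets and IDENTIFY failures in libata log lines.

    Returns (resets, identify_failures, recovered), where recovered is
    True only if a "configured for" line follows the last hard reset.
    """
    resets = 0
    identify_failures = 0
    recovered = False
    for line in lines:
        if "hard resetting link" in line:
            resets += 1
            recovered = False  # a new reset starts a new attempt
        elif "failed to IDENTIFY" in line:
            identify_failures += 1
        elif "configured for" in line:
            recovered = True
    return resets, identify_failures, recovered


if __name__ == "__main__":
    sample = [
        "ata1: hard resetting link",
        "ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
        "ata1.00: qc timeout (cmd 0xec)",
        "ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
        "ata1.00: revalidation failed (errno=-5)",
        "ata1: hard resetting link",
        "ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
        "ata1.00: qc timeout (cmd 0xec)",
        "ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    ]
    print(summarize(sample))
```

A drive that never shows "configured for" after its last reset is one
that did not come back, which is what the excerpt above looks like.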



