bad blocks... random death : solution ?

Thierry ITTY thierry.itty at besancon.org
Wed Aug 25 19:01:08 UTC 2004


Well, after carefully reading Kenneth's answer, I finally decided to
fight EMI, ESD, RFI and all that kind of trouble.

My disks were mounted on rubber cylinders and thus had NO ground connection
to the chassis. I installed a "ground wire" on every disk (a wire connected
on one side to the disk housing and on the other side to the chassis).
I got fewer errors.

The 2 machines were connected to each other through a small gigabit switch,
and I noticed that this switch had NO ground either (just a small AC/DC
converter as power input). So I installed a ground wire on the switch
going to one of the chassis (I could also have connected this ground wire
to the ground pin of an AC outlet...).
Remember that the 2 machines sustain a continuous 200 Mbps flow for hours
to copy all the data from one to the other!
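
To give a sense of scale, here is a rough sketch of how much data such a
copy moves. Only the 200 Mbps figure comes from the message; the durations
and the function name volume_gb are assumptions made for illustration.

```python
# Rough data-volume estimate for a sustained network copy.
# The 200 Mbps rate is from the message; the durations are illustrative.

def volume_gb(rate_mbps: float, hours: float) -> float:
    """Return gigabytes (10**9 bytes) moved at rate_mbps for the given hours."""
    bytes_per_second = rate_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    return bytes_per_second * hours * 3600 / 1e9


for hours in (1, 4, 8):
    print(f"{hours} h at 200 Mbps: {volume_gb(200, hours):.0f} GB")
```

At roughly 90 GB per hour, even a very low rate of corrupted transfers has
plenty of opportunities to show up.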

now I have NO MORE ERRORS

I can say it's very likely that all the troubles I had (disk errors,
bad blocks, file system corruption, etc.) came from the ungrounded switch
(and possibly the ungrounded disks). So now I "ground" everything I can!
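
A claim like this can be checked by counting error lines in the kernel log
before and after the grounding change. The sketch below is only an
illustration: the patterns and the sample log lines are hypothetical, not
taken from these machines, and should be adapted to whatever your kernel
actually logs.

```python
import re
from collections import Counter

# Hypothetical patterns (assumptions): adjust them to your kernel's messages.
PATTERNS = {
    "dma": re.compile(r"dma.*(error|timeout)", re.IGNORECASE),
    "io_error": re.compile(r"I/O error", re.IGNORECASE),
    "bad_block": re.compile(r"bad (block|sector)", re.IGNORECASE),
}

# Made-up sample lines, for demonstration only.
SAMPLE = [
    "kernel: hda: dma_intr: error=0x40",
    "kernel: hdc: DMA timeout error",
    "kernel: end_request: I/O error, dev hda, sector 12345",
    "kernel: eth0: link up",
]


def count_errors(lines):
    """Count how many lines match each error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


print(dict(count_errors(SAMPLE)))
```

On a real system you could pass it an open log file instead of SAMPLE, for
example count_errors(open("/var/log/messages")), assuming that is where
your kernel messages end up.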

many thanks to Kenneth for this clue !
hth




At 12:44 13/08/2004 -0400, you wrote:
>Two cents' worth, being oblivious to previous discussions in
>this thread.
>See below in-line.
>
>>  -----Original Message-----
>>  From: redhat-list-bounces at redhat.com
>>  [mailto:redhat-list-bounces at redhat.com]On Behalf Of Thierry ITTY
>>  Sent: Friday, August 13, 2004 11:33 AM
>>  To: redhat-list at redhat.com
>>  Subject: bad blocks... random death
>>
>>
>>  this continues the discussions about bad disk blocks that are not
>>  really bad and Red Hat 9 dying randomly
>>
>>  we're now a few on this list experiencing various symptoms (dma
>>  errors, bad blocks on disks, system freeze or death) that look like
>>  hardware problems. after talking together we can now say that those
>>  problems are pure OS problems.
>
>If all are SMP systems, then perhaps there is a spinlock conflict
>(multi-CPU contention) problem with the disk driver.
>But I doubt that the disk drivers in the kernel have changed
>in years.
>I am running RH9 on several heavily used SCSI-based Compaq
>multi-CPU machines with no problems.
>So based on my experience, I don't believe there is a software
>issue here.
>
>>
>>  the disks with bad blocks actually work fine elsewhere (in my case I
>>  ran the manufacturer low-level diags and no disk had any problem.
>>  and, ain't it very strange that 10 disks get the same problems at the
>>  same time ?!!!)
>
>Not if you have an EMI (electromagnetic interference)
>shielding issue. The drives are fine.
>They might be cross-polluting each other, the cables and/or
>the controllers with EMI.
>That will corrupt the bit stream between the drives and the
>controller and give you errors.
>
>The heavier you use the drives, the more the
>magnetic coils that move the heads are used.  Those coils
>put out an EMI field.
>The more you use the drives, the more consistent that EMI
>field is, and without good grounding
>it "leaks" into whatever copper ground path is available,
>including your drive cables,
>power cables, etc.
>Normally EMI is drained off through the drive's ground to
>the chassis: the drive is
>grounded to the chassis, and through the chassis to the
>ground line on the power supply, and from there to earth.
>
>check the following if you haven't already as it applies to
>your system:
>
>1) Get an electrical outlet tester at your local Home
>Depot/Lowe's et al.
>
>2) Check the outlets your systems are plugged into. (If you
>use outlets other than NEMA 5-15R/5-20R (household type),
>then get a tester or electrical testing service in to check
>your grounds.)
>
>3) Make sure you have a good reliable earth ground at the
>outlet. If you don't, get it fixed.
>You would be surprised at how many outlets don't have valid
>earth grounds.
>If you are in a commercial building, your data center
>outlets should have been installed with
>ISOLATED grounds, that is, a separate ground wire between
>the power panel and the receptacle.
>Most commercial electrical wiring uses the metal jacket as a
>ground path, and that tends to come apart over time
>(i.e. NO MORE GROUND).
>
>4) Check the power supply - make sure you are not
>overloading it past its rated maximum output. Make sure
>that it is grounded to the chassis and to the earth ground.
>Normally it grounds the chassis through its case,
>but some have separate ground connections; look for ground
>screw connections.
>
>5) If your drives have ground screws or tabs on them,
>connect them to a reliable chassis ground point.
>Don't assume they have a good ground through the drive
>mounting screws.
>
>6) Use round shielded cables and watch the grounds on them.
>If the shields are grounded at one end only,
>make sure that the connected end is connected to a valid
>ground source.
>
>7) Grounds are normally connected at a single end to prevent
>ground loops, that is, you don't want more than one
>ground path here if you can help it. Multiple ground paths
>won't help and can hurt under the wrong circumstances. Drives
>with ground tabs don't generally ground through the mounting
>screws, but check the drive specs. A cable with the shield
>connected at both ends is also expected to ground the
>drive; the cable
>should be connecting to a ground pin on the drive's
>interface.
>
>8) If you have these drives "dense packed" in your chassis,
>you might want to consider putting
>grounded shields between them if all else fails, grounded
>copper plates for example.
>
>9) Make sure that you route the power cables away from the
>drive controller cables within the chassis.
>
>10) Look for ways that EMI could be crossing.
>
>11) You might just have one really EMI-noisy drive. There
>are EMI meters that can be used
>to measure EMI levels.
>
>12) You can also be subject to a different wavelength of
>radiation known as RFI, or Radio Frequency
>Interference.
>
>
>
>
>>
>>  the problem happens on various machines (gigabyte, asus,
>>  athlon, pentium,
>>  maxtor, western...).
>>
>>  it seems it is related to high load periods (in my case a
>>  heavily used file
>>  server).
>>
>>  we've been advised to change the disks' dma settings. I tried
>>  various things (no dma at all, forcing mdma0 or udma2). the system
>>  behaves differently (either no errors or other errors such as dma
>>  timeouts), but it's not working quite well (for example deactivating
>>  dma on disks lowers the average network throughput from 50 MB/s to
>>  1.5 MB/s, more than 30 times slower !!!)
>>
>>  we really need help to investigate this problem which causes
>>  io errors and fs corruption !
>>
>>  tia
>>
>>
>>  --
>>  redhat-list mailing list
>>  unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
>>  https://www.redhat.com/mailman/listinfo/redhat-list
>>
>
			- * - * - * - * - * - * -
Of course I'm a perfectionist!
But couldn't I be a better one?
	Thierry ITTY
eMail : Thierry.Itty at Besancon.org		FRANCE




