
Re: [rhelv5-list] Replacing disks under 3ware-9550SX safely

On Thu, May 01, 2008 at 09:35:19PM +0300, Ahmed Kamal wrote:

> Hello,
> I'm working on a server with a 3w-9550SX controller, with 3x500G disks in a
> raid-5 and 1x500G hot spare. One night, a disk fails, and the server
> crashes! Working on the server, I see that many filesystems were destroyed
> beyond repair!! This was too bad to hear. Some LVM volumes were repaired,
> others were restored from backup. The bad disk was removed. I learnt that
> 3ware controllers aren't really high quality, and they probably corrupt the
> FSs.
> Since all disks are same age, I thought I'd buy new disks to replace the old
> ones. I bought 4x500G barracuda-ES drives, which should be high quality.
> Here lies my problem. I need to replace the 3 running disks, with 3 new
> disks, and add an extra one as hot spare. I am scared to do that, because
> the standard way is to "fail" a disk, and rebuild on a new one, then repeat
> for the other 2 disks till all 3 are replaced. Now this puts me in a
> vulnerable situation, if I "fail" a disk, and while rebuilding another disk
> naturally fails, all data is gone! Is there any other "wise" way to do what
> I want safely ? I contacted 3w support, and they just insist I should
> fail/rebuild, but since I don't have much faith in their controllers or the
> old disks ... any smarter way to do this ?

I have some 3ware controllers as well, and while I can't say that they
are the best (performance is horrible in many cases), I have never lost
any data except when two disks died in a RAID5.

The most common reason for a rebuild to fail is a fault (bad blocks) on
one of the remaining disks in the RAID. The best way to guard against
this is to have the 3ware card run a verify task every few days, so
that such faults are found before a rebuild needs those blocks. Also
have smartd running to monitor the disks so you get a warning.

So before your rebuilds, *back up* your data if you can. Run a verify
task so the controller/disks have a chance to correct any existing
errors, check your disks with smartctl, and start with the one with the
most bad blocks (if any).
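
To pick which disk to replace first, something like the following
Python sketch could rank the ports by their SMART bad-block counters.
It is only an illustration under a few assumptions: the function names
are made up, the controller is assumed to be /dev/twa0, and the three
attributes it sums are a common but not universal choice (not every
drive reports all of them).

```python
import re
import subprocess

# SMART attributes that commonly indicate bad or suspect sectors.
# (An assumed selection; attribute names can differ between drives.)
BAD_BLOCK_ATTRS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def bad_block_score(smartctl_output):
    """Sum the raw values of the bad-block attributes in `smartctl -A` output."""
    score = 0
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in BAD_BLOCK_ATTRS:
            match = re.match(r"\d+", fields[9])
            if match:
                score += int(match.group())
    return score

def replacement_order(outputs_by_port):
    """Return the ports sorted with the worst disk (highest score) first."""
    return sorted(
        outputs_by_port,
        key=lambda port: bad_block_score(outputs_by_port[port]),
        reverse=True,
    )

def read_port(port, device="/dev/twa0"):
    """Run smartctl for one disk behind the 3ware card (device is an assumption)."""
    # smartctl uses non-zero exit codes as status bits, so don't raise on them.
    result = subprocess.run(
        ["smartctl", "-A", "-d", f"3ware,{port}", device],
        capture_output=True,
        text=True,
    )
    return result.stdout
```

It would be used as replacement_order({p: read_port(p) for p in range(3)})
and has to run as root. A disk with a score of 0 can still fail, so this
only decides the order, not whether a disk is healthy.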

Sadly most SATA disks are specified with an unrecoverable read error
rate of about 1 in 10^14 bits, which means that statistically you'll get
one error for every ~12.5TB that you read. A rebuild of your array has
to read both surviving 500G disks in full (about 1TB), so every rebuild
has roughly a 1/13 chance of losing a block whatever you do :(
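
For anyone who wants to redo that estimate for a different array, here
is a small Python sketch of the arithmetic. It assumes the vendor's
error rate is per bit read, that errors are independent, and that a
RAID5 rebuild reads every surviving disk in full; the function name and
defaults are mine, not anything from 3ware.

```python
import math

def p_read_error(bytes_read, errors_per_bit=1e-14):
    """Probability of at least one unrecoverable read error.

    Computes 1 - (1 - errors_per_bit) ** bits in a numerically
    stable way, assuming independent errors.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))

# Average amount of data read per error: 10^14 bits is 12.5 TB.
bytes_per_error = 1 / 1e-14 / 8

# 3-disk RAID5 of 500 GB disks: a rebuild reads the 2 surviving disks.
disk_bytes = 500e9
surviving_disks = 2
p = p_read_error(surviving_disks * disk_bytes)

print(f"one error per {bytes_per_error / 1e12:.1f} TB read")
print(f"chance of an error during one rebuild: {p:.3f}")
```

With these numbers the rebuild chance comes out at about 0.077, i.e.
roughly 1 in 13. Real disks often do better than the spec sheet, so
treat this as a pessimistic estimate rather than a measurement.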

