Software RAID problem

Thu Jul 13 22:06:03 UTC 2006

On Thu, 2006-07-13 at 14:07 -0600, redhat at buglecreek.com wrote:
> We have a critical system that has Red Hat 8.0 installed.  The system
> uses the older raidtools, not mdadm.  We are in the process of building
> a new box, but in the meantime we have a software raid issue.  The
> system had to be rebooted and we ended up with the following raid
> problem: 
> cat /proc/mdstat shows: 
> 
> Personalities : [raid0] [raid1]
> read_ahead 1024 sectors
> md1 : active raid1 hda2[0]
>       119684160 blocks [2/1] [U_]
> 
> md2 : active raid0 hda3[0] hdb2[1]
>       208640 blocks 64k chunks
> 
> md0 : active raid1 hda1[0] hdb1[1]
>       264960 blocks [2/2] [UU]
> 
> Looks like we have a problem with the md1 device, which is the / partition.
> lsraid -A -a /dev/md1 shows:
> 
> [dev   9,   1] /dev/md1         C27DAE7E.7C02AF01.5143DCC8.62FD07C3 online
> [dev   3,   2] /dev/hda2        C27DAE7E.7C02AF01.5143DCC8.62FD07C3 good
> [dev   ?,   ?] (unknown)        00000000.00000000.00000000.00000000 missing
> 
> The applicable section of /etc/raidtab is:
> 
> raiddev             /dev/md1
> raid-level                  1
> nr-raid-disks               2
> chunk-size                  64k
> persistent-superblock       1
> nr-spare-disks              0
>     device          /dev/hda2
>     raid-disk     0
>     device          /dev/hdb3
>     raid-disk     
> 
> It seems that /dev/hdb3 has issues.  Is there a way to get /dev/hdb3
> back online?  Can you do something with raidhotadd, such as:
> raidhotadd /dev/md1 /dev/hdb3
> 
> This is a very critical system and I want to make sure we don't do
> anything that would totally bring the system down, at least until we can
> build a new system.  Any help would be appreciated.

The FIRST thing you do is back up /dev/md1 (or what's left of it) in
case the remediation doesn't work or does something evil (it shouldn't).
In the meantime you can keep running in the degraded state: md1 is still
active on hda2 alone, just without any redundancy.

You can use raidhotadd to try to bring the partition back into the fold,
but it may not join if the drive is indeed defective.  Try the raidhotadd,
then check /proc/mdstat again.  If you see a "(F)" following the
"hdb3[1]" bit, the kernel has marked that member as failed.  That doesn't
necessarily mean the drive is fried, but SOMETHING is wrong.
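
If you'd rather script that /proc/mdstat check than eyeball it, here is a
minimal sketch.  The function name degraded_arrays is made up for
illustration; it only reads text on stdin and changes nothing.  It assumes
the 2.4-style mdstat layout shown in this thread and reports an array when
its status field contains "_" (a missing member) or when a member carries
the "(F)" marker.

```shell
#!/bin/sh
# Hypothetical helper: list md arrays that look degraded in
# /proc/mdstat-style text read from stdin.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md1 : active raid1 hda2[0]"
        /^md[0-9]+ :/ {
            name = $1
            order[++n] = name
            # A member flagged "(F)" has been marked failed by the kernel
            if ($0 ~ /\(F\)/) bad[name] = 1
            next
        }
        # Status line, e.g. "119684160 blocks [2/1] [U_]";
        # "_" in the last field means a member is missing
        name != "" && /\[[U_]+\]/ {
            if ($NF ~ /_/) bad[name] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in bad) print order[i]
        }
    '
}
```

Run it as "degraded_arrays < /proc/mdstat".  With the mdstat contents you
posted it should print just md1; md2 is raid0 and has no status field, so
it is never reported by this check.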

If that happens, raidhotremove the partition from the array, then run
badblocks on it (/dev/hdb3).  When badblocks completes, try the
raidhotadd again and see if the partition joins and starts the resync.
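
The sequence above, written out as a dry run.  This is only a sketch: the
function name recovery_plan is made up, the -sv flags on badblocks (show
progress, verbose) are my choice rather than anything required, and the
function just prints the commands so you can read them over before typing
any of them as root.  The defaults are the device names from your raidtab.

```shell
#!/bin/sh
# Hypothetical helper: print (do NOT run) the recovery sequence.
# $1 = md device, $2 = partition to re-add; defaults come from this thread.
recovery_plan() {
    md=${1:-/dev/md1}
    part=${2:-/dev/hdb3}
    # Drop the failed member from the array first
    echo "raidhotremove $md $part"
    # Read-only surface scan of the partition (badblocks default mode)
    echo "badblocks -sv $part"
    # Put the partition back and let the mirror resync
    echo "raidhotadd $md $part"
    # Watch for resync progress or a new (F) marker
    echo "cat /proc/mdstat"
}
```

Calling recovery_plan with no arguments prints the four commands for
/dev/md1 and /dev/hdb3.  If badblocks reports bad sectors, I'd skip the
raidhotadd and plan on replacing the drive instead.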

Probably none of my business, but why is such a critical machine still
running RH8?  RH8.0 is farking ancient and, IMHO, the absolute worst
release of RH ever...which is why RH9 came out so quickly after it.

----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-      A day for firm decisions!!!   Well, then again, maybe not!    -
----------------------------------------------------------------------