Re: Ext3 strangeness data loss
- From: "Bodrogi Viktor" <viktor neotek hu>
- To: "Theodore Ts'o" <tytso mit edu>, ext3-users redhat com
- Subject: Re: Ext3 strangeness data loss
- Date: 4 Feb 2003 12:47:06 -0000
Hi!
This morning I booted and, to my horror, found a bad superblock on /var!
fsck reported nothing, but mount complained about a bad superblock.
It's the best thing that can happen just after a project's due date, but
before it's finished, isn't it?
So I decided to switch to reiserfs, which has performance advantages too.
After about the fifth reboot I could mount /var, and I copied it, together
with the root partition, to a new partition.
And, terribly, I had the same problem with /usr/sbin/sshd startup, even though
the binary had not changed, according to a diff against a probably-good backup
(who can be sure after all this...).
So my conclusion is that this possibly has nothing to do with ext3.
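A minimal sketch of that kind of check, assuming the live tree and the backup are both mounted as ordinary directories (the function names and paths here are hypothetical, not from any existing tool): it hashes every regular file under both roots and reports files whose contents differ or that exist on only one side, which is roughly what a recursive diff against a backup tells you.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest.

    Symlinks are skipped so that only real file contents are compared.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(live_root, backup_root):
    """Return (changed, only_live, only_backup) as sorted relative paths."""
    live = tree_digests(live_root)
    backup = tree_digests(backup_root)
    common = live.keys() & backup.keys()
    changed = sorted(p for p in common if live[p] != backup[p])
    only_live = sorted(live.keys() - backup.keys())
    only_backup = sorted(backup.keys() - live.keys())
    return changed, only_live, only_backup
```

For example, `compare_trees("/usr/sbin", "/mnt/backup/usr/sbin")` (hypothetical backup mount point) returning an empty `changed` list would match the "binary unchanged" result described above. Unreadable files raise an exception rather than being silently skipped, so the sketch is best run as a user who can read both trees.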
It's not openssh because I had problems with other files/dirs, too...
Maybe it's evms?
Maybe it's the kernel?
It's a stock 2.4.19 kernel, with only the EVMS and vserver patches.
I don't think it's a distro problem...
Sorry for bringing this up on the ext3 list!
Thanks for all help!
viktor
more comments below...
> >
> > Seems interesting.
> > I forgot to mention (yes, sorry, it's an important piece of information)
> > that I have RAID 1 (mirrored disks), so a HW problem is less likely.
> > And I have reiserfs partition on the mirror too, without any problem.
>
> Raid protects you against disk failures. It does not protect you from
> cable problems causing data corruption, or your RAID controller going
> insane. Unfortunately a lot of people seem to believe that just
> because they have RAID, they are immune from hardware problems, and
> then stop doing backups. I usually hear from them after they've
> gotten screwed, and when they ask if I can perform miracles....
Yes, RAID is completely different from backup.
RAID doesn't protect you from rm -fr / ;))
>
> In any case, the scenarios I described (a controller/cable problem, or
> incorrectly configured IDE DMA settings) are all still possible
> with RAID; RAID does not help you prevent these sorts of problems.
It's SW RAID-1; the disks are on the same controller,
but on different buses/cables.
Am I right that in this case HW errors are *very* unlikely?
That would mean exactly the same bit errors at exactly
the same time on different cables/disks...
> As far as your not noticing the problem with reiserfs that could be
> because you've been lucky, and not noticed because the block addresses
> causing the problem do not (yet) contain data. But the symptoms
> you've described sound very much like hardware induced errors.
>
> > Anyway, do you have an idea how to test for HW errors?
>
> Well, if you have a scratch partition that's not being used, you can
> try using the badblocks program. Try using the -w option, which will
> do a read/write test. This doesn't do a random access test, so it
> might not detect any problems, though.
>
> I'd suggest checking your internal cabling, and replacing the
> controller cable if it looks dubious. Make sure everything is well
> plugged in, too.
>
I use the most expensive twisted, shielded, etc. cables, and they're well
plugged in, at least visually...
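badblocks -w is the tool suggested above; it sweeps the device linearly. To illustrate the random-access pattern it lacks, here is a minimal Python sketch (hypothetical, not part of badblocks or e2fsprogs). It is DESTRUCTIVE and must only be pointed at a scratch file or an unused partition you are prepared to overwrite. It writes deterministic pseudo-random blocks in one shuffled order and verifies them in another. One caveat, assumed rather than handled: after fsync the verify reads may be served from the kernel's page cache instead of the disk, so a real test would also need O_DIRECT or dropped caches to be sure the device itself is read back.

```python
import hashlib
import os
import random


def block_pattern(seed, index, block_size):
    """Deterministic pseudo-random contents for one block, so nothing is stored."""
    out = bytearray()
    counter = 0
    while len(out) < block_size:
        out += hashlib.sha256(f"{seed}:{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:block_size])


def random_access_test(path, num_blocks, block_size=4096, seed=0):
    """DESTRUCTIVE: overwrite the first num_blocks * block_size bytes of path.

    Blocks are written in one shuffled order and verified in another, so
    consecutive I/Os hit scattered offsets rather than a linear sweep.
    Returns the sorted indices of blocks that did not read back correctly.
    """
    rng = random.Random(seed)
    order = list(range(num_blocks))

    rng.shuffle(order)
    # "r+b" opens an existing file or device for update without truncating it.
    with open(path, "r+b") as f:
        for idx in order:
            f.seek(idx * block_size)
            f.write(block_pattern(seed, idx, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to write the data to the device

    rng.shuffle(order)  # a different order for the verify pass
    bad = []
    with open(path, "rb") as f:
        for idx in order:
            f.seek(idx * block_size)
            if f.read(block_size) != block_pattern(seed, idx, block_size):
                bad.append(idx)
    return sorted(bad)
```

Running it with several different seeds varies both the data and the access order; an empty result list means every block verified on that pass.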
Thanks for all answers!
viktor