Re: [Linux-cluster] Shared storage problems with LSI controller

On Wed, 2007-01-24 at 00:11 +0100, Jos Vos wrote:
> Hi,
> I have a configuration with two servers and a shared storage cabinet
> (connected via two *independent* SCSI busses) causing fatal SCSI errors
> when one server is doing a lot of I/O and the other server is rebooting
> (i.e. loading the Linux driver and initializing the controller).
> This problem is fully reproducible with the latest RHEL4 kernel, but
> it is *not* reproducible with RHEL5b2.
> When using this shared device with cluster suite and GFS (I only tried
> this with RHEL4), the GFS filesystem is damaged beyond repair when one
> node reboots!
> I see some Bugzilla entries about this driver (although with different
> errors) and when Googling I found some more complaints about weak error
> handling/recovery in this driver.
> I tried to port the MPT Fusion driver from the RHEL5b2 kernel to the
> RHEL4 kernel, but this seems to require some non-trivial backporting.
> Is this indeed a problem with the LSI driver?  Are there any upgrades
> for the driver that can be compiled for the RHEL4 kernels?

I've seen abysmal performance in some megaraid+jbod configurations (e.g.
50+ seconds to get 2 block reads and 2 block writes on RHEL2.1), but
I've never seen corruption like what you're describing.  Of course, it's
been almost 4 years since I used a host-RAID configuration, and I
haven't ever used one with GFS... ;(

Apparently the SCSI megaraid driver has gone into maintenance mode, so
it's not going to get any better.

Your controllers are in "cluster mode" and/or "have cache entirely
disabled", right?

-- Lon

