[Linux-cluster] Cluster fail over fails when umounting fs

Fri Aug 18 17:03:01 UTC 2006

On Fri, 2006-08-18 at 12:00 -0400, Neil Watson wrote:
> I'm building a cluster that runs a DB2 service.  The cluster has 2 nodes
> in an active standby configuration.  I am now performing fail over
> tests.
> 
> Shared resources:
> DB2 controlled by /etc/init.d/db2 start stop script.
> Floating IP address.
> /db2 ext3 file system located on a SAN and connected via HBA.
> 
> Nodes are fenced with ILO cards.
> 
> Nodes are running AS4 x86_64 with the Red Hat Cluster Suite.  RPMs are up
> to date.
> 
> Procedure:
> 
> 1. Connect to DB2 remotely and begin a long SQL insert program.
> 2. While the inserts are being performed, disconnect the fibre cable
> from the HBA on the active node.
> 3. Examine the system logs and observe for fail over.
> 
> Observations:
> 
> 1. Cluster does not fail over to standby node.  Service becomes
> unavailable.
> 2. The log files of the active node report a 'generic error' about the status
> of the shared file system.
> 
> Aug 16 15:32:37 caesar kernel: qla2300 0000:06:01.0: LOOP DOWN detected (2).
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15839
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15847
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 15855
> Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 1974
> Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:45 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:45 caesar kernel: end_request: I/O error, dev sda, sector 103813199
> Aug 16 15:32:45 caesar kernel: Buffer I/O error on device sda1, logical block 12976642
> Aug 16 15:32:45 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:45 caesar kernel: Aborting journal on device sda1.
> Aug 16 15:32:45 caesar kernel: ext3_abort called.
> Aug 16 15:32:45 caesar kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
> Aug 16 15:32:45 caesar kernel: Remounting filesystem read-only
> Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 8279
> Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 1027
> Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:47 caesar kernel: SCSI error : <0 0 0 1> return code = 0x10000
> Aug 16 15:32:47 caesar kernel: end_request: I/O error, dev sda, sector 103546959
> Aug 16 15:32:47 caesar kernel: Buffer I/O error on device sda1, logical block 12943362
> Aug 16 15:32:47 caesar kernel: lost page write due to I/O error on sda1
> Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> status on fs "db2" returned 1 (generic error)
> Aug 16 15:32:48 caesar clurgmgrd[5159]: <notice> Stopping service db2
> Aug 16 15:32:48 caesar clurgmgrd: [5159]: <info> Executing /etc/rc.d/init.d/db2 stop
> Aug 16 15:32:48 caesar su(pam_unix)[1227]: session opened for user dwapinst by (uid=0)
> Aug 16 15:32:49 caesar su:
> Aug 16 15:32:49 caesar su: Instance  : dwapinst
> Aug 16 15:32:49 caesar su: DB2 State : Available
> Aug 16 15:32:49 caesar su(pam_unix)[1227]: session closed for user dwapinst
> Aug 16 15:32:49 caesar db2:  succeeded
> Aug 16 15:32:49 caesar su(pam_unix)[1322]: session opened for user dwapinst by (uid=0)
> Aug 16 15:36:51 caesar su(pam_unix)[5473]: session opened for user root by nhwatson(uid=0)
> Aug 16 15:36:55 caesar su(pam_unix)[6000]: session opened for user dwapinst by nhwatson(uid=0)
> Aug 16 15:36:55 caesar su:
> Aug 16 15:36:55 caesar su: Instance  : dwapinst
> Aug 16 15:36:55 caesar su: DB2 State : Operable
> Aug 16 15:36:55 caesar su(pam_unix)[6000]: session closed for user dwapinst
> Aug 16 15:36:55 caesar db2:  failed 
> 
> 3. There are no log entries for this event on the standby node.
> 
> Why does the cluster fail during this test?  What does the 'generic error'
> mean?

The OCF RA API defines only a few types of errors, and 'generic error'
(return code 1, as in your log) is the one most often used.
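
For reference, here is a rough sketch of the exit codes a resource agent
can return, written as a small shell lookup.  The function name
ocf_rc_name is made up for illustration and is not part of rgmanager;
the numeric values are the standard OCF RA API ones as I understand
them.

```shell
#!/bin/sh
# Map an OCF resource agent exit status to its symbolic name.
# ocf_rc_name is a hypothetical helper, not something rgmanager ships.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The fs status check in the log above returned 1.
ocf_rc_name 1
```

So "returned 1 (generic error)" only says that the status check failed;
the kernel messages just before it are where the actual cause shows up.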

With self_fence set, the node should reboot itself if it cannot unmount
the file system successfully.  However, because a script is involved, a
bug in rgmanager from linux-cluster 1.02 / RHCS4U3 (and previous
versions) left the file system unstopped whenever the script failed
to stop.

Because the file system was not stopped when it should have been, the
unmount (and thus the reboot) was never attempted, which left the
service in the failed state.
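
To confirm that self_fence is actually set on the fs resource, something
along these lines may help.  It is only a sketch: it assumes the
configuration lives in /etc/cluster/cluster.conf, that each <fs .../>
element sits on a single line, and that the attribute is written as
self_fence="1".  The function name fs_self_fence_enabled is made up for
illustration.

```shell
#!/bin/sh
# Succeed if any <fs> resource in the given cluster.conf-style file
# carries self_fence="1".
# Assumption: each <fs .../> element is written on one line.
fs_self_fence_enabled() {
    grep -Eq '<fs[[:space:]][^>]*self_fence="1"' "$1"
}

# Assumed default location; pass another path as the first argument.
conf="${1:-/etc/cluster/cluster.conf}"
if [ -r "$conf" ]; then
    if fs_self_fence_enabled "$conf"; then
        echo "self_fence is set on at least one fs resource"
    else
        echo "no fs resource with self_fence=\"1\" found"
    fi
fi
```

This only checks the configuration text; it does not tell you whether
the running rgmanager has the fix for the bug described above.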

Here's the bug:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=193859

If upgrading does not fix the problem, please file a bugzilla.

-- Lon