
Re: GFS corruption?



I think this is the issue described in this Bugzilla entry:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175589

Try restarting the cluster services, then run the dmesg command and check its output. It may also be that a large process is putting heavy load on the cluster.


On 3/8/06, Pena, Francisco Javier <francisco_javier pena roche com> wrote:
Hi,

During some stress tests with GFS, I have come across a sequence of
events that could lead to a GFS file system corruption. It is a bit
weird, but I think it reflects a flaw in the way the cluster processes
are handled. The process would be:

1. Set up a 2-node GFS cluster (in my case I used 2x ProLiant servers, iLO
fencing, RHEL 4 AS U1, GFS 6.1)
2. Put some load on the file system from both nodes (bonnie++ should do
the trick)
3. On one of the nodes, kill the following daemons:

        - fenced
        - ccsd
        - clvmd

   Maybe we could skip killing some of these processes, but this is what I
did. The cluster did not complain at all, and FS access was still possible.

4. Kill network connectivity between the nodes. I did this with
"ifconfig down" on the node where I had killed the daemons.

At this point, both servers seem to think they can get the other node's
journal, and in fact they do:

Mar  7 16:01:02 rmamsewlab20 kernel: CMAN: removing node rmamsewlab30
from the cluster : Missed too many heartbeats
Mar  7 16:01:02 rmamsewlab20 kernel: CMANsendmsg failed: -101
Mar  7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Trying to acquire journal lock...
Mar  7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Looking at journal...
Mar  7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Done

Mar  7 16:01:02 rmamsewlab30 kernel: CMAN: removing node rmamsewlab20
from the cluster : Missed too many heartbeats
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Trying to acquire journal lock...
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Looking at journal...
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Acquiring the transaction lock...
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Replaying journal...
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Replayed 0 of 16 blocks
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: replays = 0, skips = 14, sames = 2
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Journal replayed in 1s
Mar  7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Done
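
To spot this pattern in collected logs, something like the following could
help. It is only a rough sketch: it assumes the syslog lines from both nodes
have been unwrapped to one message per line and merged into one list, and the
function name mutual_recovery_hosts is made up for this example. It reports
hosts that removed each other from the cluster and that both started journal
recovery, which is the situation shown in the logs above.

```python
import re

# Assumed line shape (one message per line, as in the excerpts above):
# "Mar  7 16:01:02 <host> kernel: CMAN: removing node <peer> from the cluster : ..."
REMOVED = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+CMAN: removing node (\S+)")
RECOVERY = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+GFS: .*Trying to acquire journal lock")


def mutual_recovery_hosts(lines):
    """Return hosts that removed each other and both began journal recovery."""
    removed = {}        # host -> set of peers that host removed
    recovering = set()  # hosts that tried to acquire a journal lock
    for line in lines:
        match = REMOVED.match(line)
        if match:
            removed.setdefault(match.group(1), set()).add(match.group(2))
            continue
        match = RECOVERY.match(line)
        if match:
            recovering.add(match.group(1))

    suspects = set()
    for host, peers in removed.items():
        for peer in peers:
            removed_each_other = host in removed.get(peer, set())
            if removed_each_other and host in recovering and peer in recovering:
                suspects.add(host)
    return sorted(suspects)
```

An empty result only means this particular pattern was not found in the lines
given; it says nothing about whether fencing actually ran.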

As you can imagine, the file system is corrupted soon after that. So
there are some questions that come to my mind:

- The node where I did not kill the processes did not try to fence the
failing one. Why? And why didn't the other node complain when its fenced
daemon died?

- Is there a way to make the cluster software check its critical
processes, and restart them / reboot the server when they fail? Sure, it
could be done with some scripts or monitoring software, but I believe
it is important enough to do it from within the cluster framework. I
have seen some other cluster solutions solve this using either a kernel
watchdog timer or a watchdog process.
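
As an illustration of the "some scripts" option, here is a minimal sketch of
a check that could run periodically on each node. It is not part of the
cluster suite. The daemon list is just the three processes killed in step 3,
the function names are invented for this example, and reading process names
from /proc/<pid>/stat assumes a Linux system. It only reports what is
missing; whether to restart the daemon or reboot the node is left open.

```python
import os

# Daemons from step 3 above; adjust to whatever the node is expected to run.
REQUIRED = ["fenced", "ccsd", "clvmd"]


def running_process_names(proc_dir="/proc"):
    """Collect process names from /proc/<pid>/stat (Linux only)."""
    names = set()
    if not os.path.isdir(proc_dir):
        return names
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "stat")) as handle:
                stat = handle.read()
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
        # The stat format is "pid (name) state ..."; the name may itself
        # contain parentheses, so take the outermost pair.
        start, end = stat.find("("), stat.rfind(")")
        if 0 <= start < end:
            names.add(stat[start + 1:end])
    return names


def missing_daemons(required, running):
    """Return the required daemon names not present in `running`, in order."""
    return [name for name in required if name not in running]


if __name__ == "__main__":
    missing = missing_daemons(REQUIRED, running_process_names())
    if missing:
        print("missing cluster daemons: " + ", ".join(missing))
```

A script like this shares the weakness mentioned above: it lives outside the
cluster framework, so nothing notices if the check itself stops running.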

Cheers,

Javier

--
nahant-list mailing list
nahant-list redhat com
https://www.redhat.com/mailman/listinfo/nahant-list



--
---------------------------------
:..Best Regards.
:..Waleed Mubark.