Re: GFS corruption?
- From: "Waleed Mubark" <waleed harbi gmail com>
- To: "Red Hat Enterprise Linux 4 (Nahant) Discussion List" <nahant-list redhat com>
- Subject: Re: GFS corruption?
- Date: Wed, 8 Mar 2006 15:03:10 +0300
This looks like the issue tracked in Bugzilla, I think:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175589
Try restarting the services, then run the dmesg command and see what it reports. It may also be that a large process is putting heavy load on the cluster.
On 3/8/06, Pena, Francisco Javier <francisco_javier pena roche com> wrote:
Hi,
During some stress tests with GFS, I have come across a sequence of
events that could lead to a GFS file system corruption. It is a bit
weird, but I think it reflects a flaw in the way the cluster processes
are handled. The process would be:
1. Set up a 2-node GFS cluster (in my case I used 2xProliant servers, iLO
fencing, RHEL 4 AS U1, GFS 6.1)
2. Put some load on the file system from both nodes (bonnie++ should do
the trick)
3. On one of the nodes, kill the following daemons:
- fenced
- ccsd
- clvmd
Maybe we could skip killing some of these processes, but this is what I did. The
cluster did not complain at all, FS access was still possible.
4. Kill network connectivity between the nodes. I did this with
"ifconfig down" on the node where I had killed the daemons.
At this point, both servers seem to think they can get the other node's
journal, and in fact they do:
Mar 7 16:01:02 rmamsewlab20 kernel: CMAN: removing node rmamsewlab30
from the cluster : Missed too many heartbeats
Mar 7 16:01:02 rmamsewlab20 kernel: CMANsendmsg failed: -101
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Trying to acquire journal lock...
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Looking at journal...
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1:
jid=0: Done
Mar 7 16:01:02 rmamsewlab30 kernel: CMAN: removing node rmamsewlab20
from the cluster : Missed too many heartbeats
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Trying to acquire journal lock...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Looking at journal...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Acquiring the transaction lock...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Replaying journal...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Replayed 0 of 16 blocks
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: replays = 0, skips = 14, sames = 2
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Journal replayed in 1s
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0:
jid=1: Done
As you can imagine, the file system is corrupted soon after that. So
there are some questions that come to my mind:
- The node where I did not kill the processes did not try to fence the
failing one. Why? And why didn't the other node complain when its fenced
daemon died?
- Is there a way to make the cluster software check its critical
processes, and restart them / reboot the server when they fail? Sure, it
could be done with some scripts or monitoring software, but I believe
it is important enough to do it from within the cluster framework. I
have seen some other cluster solutions solve this using either a kernel
watchdog timer or a watchdog process.
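As an illustration of the "some scripts" approach mentioned above, here is a minimal Python sketch that checks whether the three daemons killed in step 3 are still running. It is not part of the cluster software: the helper names are hypothetical, it assumes a Linux /proc layout where /proc/<pid>/stat carries the command name in parentheses, and it deliberately leaves out the reaction (restart, reboot, or fencing), which would depend on site policy.

```python
import os

# The daemons killed in step 3 of the reproduction above.
CRITICAL_DAEMONS = ("fenced", "ccsd", "clvmd")


def comm_from_stat(stat_line):
    """Extract the command name from a /proc/<pid>/stat line: 'pid (comm) state ...'."""
    start = stat_line.find("(")
    end = stat_line.rfind(")")  # comm itself may contain parentheses
    if start == -1 or end <= start:
        return ""
    return stat_line[start + 1:end]


def running_process_names(proc_root="/proc"):
    """Return the set of command names of all processes visible under proc_root."""
    names = set()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return names
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as fh:
                names.add(comm_from_stat(fh.read()))
        except OSError:
            continue  # the process exited between listdir() and open()
    return names


def missing_daemons(required, running):
    """Return the required daemon names that are absent from the running set."""
    return [name for name in required if name not in running]


if __name__ == "__main__":
    missing = missing_daemons(CRITICAL_DAEMONS, running_process_names())
    if missing:
        print("critical cluster daemons not running: " + ", ".join(missing))
```

Run periodically (for example from cron), a check like this would at least flag the situation described in step 3, where the cluster kept serving the file system with fenced already dead.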
Cheers,
Javier
--
---------------------------------
:..Best Regards.
:..Waleed Mubark.