RE: GFS corruption?
- From: "Pena, Francisco Javier" <francisco_javier pena roche com>
- To: "Red Hat Enterprise Linux 4 \(Nahant\) Discussion List" <nahant-list redhat com>
- Subject: RE: GFS corruption?
- Date: Wed, 8 Mar 2006 16:15:14 +0100
Well,
I have checked the Bugzilla entry, and I think this is a different issue. All the problems reported there happen while putting a high load on the file system. In this case, the problem is caused by killing processes. FS load is completely optional; it just makes the problem happen faster.
The servers are completely idle during the testing (they're just test servers).
I think this looks like this issue on Bugzilla:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175589
Try restarting the services and check what the dmesg command reports. Also, you may have a large process and heavy load on the cluster.
On 3/8/06, Pena, Francisco Javier <francisco_javier pena roche com> wrote:
Hi,
During some stress tests with GFS, I have come across a sequence of events that could lead to GFS file system corruption. It is a bit weird, but I think it reflects a flaw in the way the cluster processes are handled. The sequence would be:
1. Set up a 2-node GFS cluster (in my case I used 2x Proliant servers, iLO fencing, RHEL 4 AS U1, GFS 6.1)
2. Put some load on the file system from both nodes (bonnie++ should do the trick)
3. On one of the nodes, kill the following daemons:
   - fenced
   - ccsd
   - clvmd
   Maybe we could skip killing some of these processes, but this is what I did. The cluster did not complain at all, and FS access was still possible.
4. Kill network connectivity between the nodes. I did this with "ifconfig down" on the node where I had killed the daemons.
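
The steps above could be scripted roughly as follows. This is only an illustrative sketch, not the procedure actually used in the test: the mount point, the interface name, the bonnie++ arguments, and the use of killall (which sends SIGTERM by default) are all assumptions, and the helper names are hypothetical. It defaults to a dry run because running it for real on a cluster node is destructive.

```python
import subprocess

# Daemons killed in step 3 of the description.
DAEMONS = ["fenced", "ccsd", "clvmd"]


def build_plan(mount_point, iface):
    """Return (load_cmd, steps) for one node.

    mount_point and iface are assumptions; the original message does not
    name the GFS mount point or the interface that was brought down.
    """
    # Step 2: file system load. The bonnie++ options are an assumption.
    load_cmd = ["bonnie++", "-d", mount_point, "-u", "root"]
    # Step 3: kill the cluster daemons on this node.
    steps = [["killall", name] for name in DAEMONS]
    # Step 4: cut network connectivity to the other node.
    steps.append(["ifconfig", iface, "down"])
    return load_cmd, steps


def run_plan(load_cmd, steps, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    if dry_run:
        print("would start in background:", " ".join(load_cmd))
        for cmd in steps:
            print("would run:", " ".join(cmd))
        return
    # Start the load in the background, then run the destructive steps.
    subprocess.Popen(load_cmd)
    for cmd in steps:
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    load, steps = build_plan("/mnt/gfs", "eth0")  # hypothetical values
    run_plan(load, steps, dry_run=True)
```

Step 2 also needs load running on the second node; the sketch only covers the node where the daemons are killed.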
At this point, both servers seem to think they can get the other node's journal, and in fact they do:
Mar 7 16:01:02 rmamsewlab20 kernel: CMAN: removing node rmamsewlab30 from the cluster : Missed too many heartbeats
Mar 7 16:01:02 rmamsewlab20 kernel: CMANsendmsg failed: -101
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1: jid=0: Trying to acquire journal lock...
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1: jid=0: Looking at journal...
Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1: jid=0: Done

Mar 7 16:01:02 rmamsewlab30 kernel: CMAN: removing node rmamsewlab20 from the cluster : Missed too many heartbeats
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Trying to acquire journal lock...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Looking at journal...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Acquiring the transaction lock...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Replaying journal...
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Replayed 0 of 16 blocks
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: replays = 0, skips = 14, sames = 2
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Journal replayed in 1s
Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: jid=1: Done
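
The pattern in these logs (each node starting recovery of the other node's journal) can be spotted mechanically. The following is a minimal sketch, assuming the syslog layout shown above and assuming that the number after the dot in fsid (lvgfs.1, lvgfs.0) is the node's own journal ID; the function names are hypothetical.

```python
import re

# "Mar 7 16:01:04 host kernel: message" as in the excerpt above.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(\S+)\s+kernel:\s+(.*)$"
)
# Captures: file system id, own journal id (assumed), journal being recovered.
JOURNAL_RE = re.compile(
    r"GFS: fsid=(\S+)\.(\d+): jid=(\d+): Trying to acquire journal lock"
)


def recoveries(lines):
    """Return (host, fs, own_jid, target_jid) for each recovery attempt."""
    found = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        host, message = m.group(1), m.group(2)
        j = JOURNAL_RE.search(message)
        if j:
            found.append((host, j.group(1), int(j.group(2)), int(j.group(3))))
    return found


def mutual_recoveries(lines):
    """Return sorted host pairs that each try to recover the other's journal."""
    recs = recoveries(lines)
    pairs = set()
    for host_a, fs_a, own_a, target_a in recs:
        for host_b, fs_b, own_b, target_b in recs:
            same_fs = fs_a == fs_b
            crossed = target_a == own_b and target_b == own_a
            if same_fs and crossed and host_a < host_b:
                pairs.add((host_a, host_b))
    return sorted(pairs)


SAMPLE = [
    "Mar 7 16:01:04 rmamsewlab20 kernel: GFS: fsid=NEWCLUSTER:lvgfs.1: "
    "jid=0: Trying to acquire journal lock...",
    "Mar 7 16:01:04 rmamsewlab30 kernel: GFS: fsid=NEWCLUSTER:lvgfs.0: "
    "jid=1: Trying to acquire journal lock...",
]

if __name__ == "__main__":
    for pair in mutual_recoveries(SAMPLE):
        print("both nodes attempted journal recovery:", pair)
```

A non-empty result means both nodes believed the other had left the cluster, which is the situation described here.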
As you can imagine, the file system is corrupted soon after that. So there are some questions that come to my mind:
- The node where I did not kill the processes did not try to fence the failing one. Why? And why didn't the other node complain when its fenced daemon died?
- Is there a way to make the cluster software check its critical processes, and restart them / reboot the server when they fail? Sure, it could be done with some scripts or monitoring software, but I believe it is important enough to do it from within the cluster framework. I have seen some other cluster solutions solve this using either a kernel watchdog timer or a watchdog process.
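
As an illustration of the "watchdog process" idea, here is a minimal sketch of an external monitor for the three daemons named earlier. It is not part of the cluster software: the function names and the on_failure hook are hypothetical, what the hook does (restart the daemon, reboot the node) is left open, and it assumes a Linux /proc layout and a current Python 3 interpreter rather than the one shipped with RHEL 4.

```python
import os
import time

# The daemons killed in the test described above.
CRITICAL = ("fenced", "ccsd", "clvmd")


def running_names(proc_root="/proc"):
    """Return the set of command names of all running processes."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name sits between the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if start != -1 and end > start:
            names.add(stat[start + 1:end])
    return names


def missing_daemons(names, critical=CRITICAL):
    """Return the critical daemons that are absent from names, in order."""
    return [d for d in critical if d not in names]


def watchdog(on_failure, list_names=running_names, interval=5.0,
             max_checks=None):
    """Poll until a critical daemon is missing, then call on_failure(missing).

    max_checks limits the number of polls (None means poll forever).
    Returns the list of missing daemons, or [] if the limit was reached.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        missing = missing_daemons(list_names())
        if missing:
            on_failure(missing)
            return missing
        checks += 1
        time.sleep(interval)
    return []


if __name__ == "__main__":
    watchdog(lambda missing: print("missing daemons:", missing), max_checks=1)
```

A userspace poller like this can itself be killed, which is one reason a kernel watchdog timer is the other option mentioned.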
Cheers,
Javier
--
nahant-list mailing list
nahant-list redhat com
https://www.redhat.com/mailman/listinfo/nahant-list
--
---------------------------------
:..Best Regards.
:..Waleed Mubark.