[Linux-cluster] Re: write's pausing - which tools to debug?


On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
> We've been having problems with writes to our GFS file 
> system: they pause for long periods (anywhere from 5 to 10 
> seconds, to 30 seconds, and occasionally 5 minutes).  After the pause, it's 
> as if nothing happened; whatever the process is just keeps going, happy 
> as can be.
> Except for these pauses, our GFS is quite zippy, both reads and writes. 
>  But these pauses are holding us back from going full production.
> I need to know what tools I should use to figure out what is causing 
> these pauses.
> Here is the setup.
> All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 
> 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
> I have no ability to do fencing yet, so I chose to use the gulm locking 
> mechanism.  I have it set up so that there are 3 lock servers, for 
> failover.  I have tested the failover, and it works quite well.

If this is a testing environment, use manual fencing: when a node
needs to be fenced, you get a log message asking you to fence it by
hand and then acknowledge that you have done so.

> I have 5 machines in the cluster.  1 isn't connected to the SAN, or 
> using GFS.  It is just a failover gulm lock server in case the other two 
> lock servers go down.
> So I have 4 machines connected to our SAN and using GFS.  3 are 
> read-only, 1 is read-write.  If it is important, the 3 read-only are 
> x86_64, the 1 read-write and the 1 not connected are i386.
> The read/write machine is our master lock server.  Then one of the 
> read-only is a fallback lock server, as is the machine not using GFS.
> Anyway, we're getting these pauses when writing, and I'm having a hard 
> time tracking down where the problem is.  I *think* that we can still 
> read from the other machines.  But since this comes and goes, I haven't 
> been able to verify that.

What SAN hardware is attached to the nodes?

> Anyway, which tools do you think would be best in diagnosing this?

I'd suggest checking and monitoring the network. Also, put the cluster
communication on a separate network from the SAN/LAN traffic. The
cluster heartbeat goes over UDP, and a congested network may delay
these packets or drop them completely. At least that's the CMAN
picture; lock_gulm may be different.
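
To tie the pauses to something in the network or lock server logs, it
may help to timestamp them first. Below is a minimal sketch, not a GFS
tool: the function name probe_writes, the 4 KB block size and the
1-second threshold are my own assumptions, and it needs a current
Python 3. It appends small blocks to a file, fsyncs each one, and
reports every write that takes longer than the threshold.

```python
import os
import time


def probe_writes(path, count=10, interval=0.0, threshold=1.0,
                 payload=b"x" * 4096):
    """Append `count` blocks to `path`; yield (timestamp, seconds) for slow ones.

    All names and defaults here are illustrative assumptions.
    """
    # buffering=0 gives an unbuffered file, so each write() goes to the OS.
    with open(path, "ab", buffering=0) as f:
        for _ in range(count):
            start = time.monotonic()
            f.write(payload)
            # fsync so a stall is not hidden by the local page cache.
            os.fsync(f.fileno())
            elapsed = time.monotonic() - start
            if elapsed >= threshold:
                # Wall-clock stamp, for matching against syslog entries.
                yield time.strftime("%Y-%m-%d %H:%M:%S"), elapsed
            time.sleep(interval)
```

Running it on the read-write node against a scratch file on the GFS
mount (say a hypothetical /mnt/gfs/probe.dat, with interval=1.0 and a
large count) and printing each yielded pair gives a list of stall
times to compare with the lock server and switch logs.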

Also, don't mix RHEL 4 U1 nodes with U2 or FC<N> nodes in the same
cluster, in case you were planning to upgrade to SL 4.2 one node at a
time.

There have been many changes and bug fixes to the cluster bits in RHEL 4 U2,
and there are also some spiffy new features like multipath. Perhaps
it's worth rebasing your testing environment?
Axel.Thimm at ATrpms.net

