
Re: [Linux-cluster] Re: write's pausing - which tools to debug?

Axel Thimm wrote:

On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:

We've been having some problems with writes to our GFS file system: a write will pause for long periods (typically 5 to 10 seconds, sometimes 30 seconds, and occasionally 5 minutes). After the pause, it's like nothing happened; whatever the process is just keeps going, happy as can be. Except for these pauses, our GFS is quite zippy, for both reads and writes. But these pauses are holding us back from going full production. I need to know what tools I should use to figure out what is causing these pauses.

Here is the setup.
All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34

I have no ability to do fencing yet, so I chose to use the gulm locking mechanism. I have it set up so that there are 3 lock servers, for failover. I have tested the failover, and it works quite well.

If this is a testing environment, use manual fencing: if a node
needs to be fenced, you get a log message telling you to fence it
by hand and then acknowledge that you have done so.

I have 5 machines in the cluster. 1 isn't connected to the SAN, or using GFS. It is just a failover gulm lock server in case the other two lock servers go down.

So I have 4 machines connected to our SAN and using GFS. 3 are read-only, 1 is read-write. If it is important, the 3 read-only are x86_64, the 1 read-write and the 1 not connected are i386.

The read/write machine is our master lock server. Then one of the read-only is a fallback lock server, as is the machine not using GFS.

Anyway, we're getting these pauses when writing, and I'm having a hard time tracking down where the problem is. I *think* that we can still read from the other machines. But since this comes and goes, I haven't been able to verify that.
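
Because the pauses come and go, one way to pin them down is a small probe that writes to the GFS mount at a fixed interval and logs a timestamp whenever a write takes too long. The following is only a minimal sketch in Python: the function names, the 4 KB payload, the 2-second threshold, and the example path are assumptions for illustration, not part of the setup described above.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, force it to disk, and return elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # wait until the file system reports the data as written
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, count, interval=1.0, threshold=2.0, log=print):
    """Do `count` timed writes; log and return the ones at or above threshold."""
    slow = []
    for _ in range(count):
        elapsed = timed_write(path)
        if elapsed >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            slow.append((stamp, elapsed))
            log(f"{stamp} write stalled for {elapsed:.1f}s")
        time.sleep(interval)
    return slow


# Example call (hypothetical path; substitute a file on the real GFS mount):
# probe("/mnt/gfs/latency_probe.dat", count=3600)
```

The logged timestamps can then be lined up against the lock server logs and syslog on each node. Running a similar timed read on the read-only nodes over the same period would show whether reads stall at the same moments.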

What SAN hardware is attached to the nodes?

From the switch on down, I don't know. It's a centrally managed SAN, that I have been allowed to plug into and given disk space. I do have Qlogic cards in the machines.

Anyway, which tools do you think would be best in diagnosing this?

I'd suggest checking/monitoring the network. Also place the cluster
communication on a separate network from the SAN/LAN network. The
cluster heartbeat goes over UDP, and a congested network may delay
these packets or drop them completely. At least that's the CMAN
picture; lock_gulm may be different.
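
A rough way to see whether UDP traffic between two nodes is being delayed or dropped is to send numbered datagrams to an echo service on the peer and record round-trip times and losses. The sketch below is only an illustration in Python: it measures test traffic on a port of your choosing, not the actual CMAN or lock_gulm heartbeat, and the function names and defaults are assumptions.

```python
import socket
import time


def udp_echo(sock):
    """Echo every datagram back to its sender, forever (run on the peer node)."""
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)


def udp_probe(host, port, count=10, timeout=1.0, interval=0.0):
    """Send `count` numbered datagrams; return (lost, round_trip_times)."""
    rtts = []
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode()
            start = time.monotonic()
            sock.sendto(payload, (host, port))
            try:
                while True:
                    data, _ = sock.recvfrom(64)
                    if data == payload:  # skip late replies to earlier probes
                        break
                rtts.append(time.monotonic() - start)
            except socket.timeout:
                lost += 1  # no matching reply arrived in time
            time.sleep(interval)
    return lost, rtts
```

On the peer, bind a UDP socket to a free port and pass it to `udp_echo`; on the other node, call `udp_probe` with that host and port. A rising loss count or long round-trip times during a write pause would point toward the network rather than the SAN.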

That sounds like a good idea. All of our machines have two ethernet ports, and I'm not using the second one on any of them. That would actually fix some other problems as well.

Also don't mix RHELU1 and U2 or FC<N>. Just in case you'd like to
upgrade to SL4.2 one by one.

Yup, read that, but thanks for the reminder.

There have been many changes/bug fixes to the cluster bits in RHELU2,
and there are also some new spiffy features like multipath. Perhaps
it's worth rebasing your testing environment?

Don't I wish it was a testing environment. But at least the machines don't HAVE to be 24x7. And I've only got one of them in production right now, so it's only one going down.

Troy Dawson  dawson fnal gov  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
