[Linux-cluster] LOCK_DLM Performance under Fire

Peter Shearer pshearer at lumbermens.net
Wed Apr 6 19:01:02 UTC 2005


Ick...it appears the app's locking mechanism is fcntl.  An strace of
the app is full of...

fcntl64(8, F_SETLK64, {type=F_UNLCK, whence=SEEK_SET, start=2147478526,
len=1024}, 0xbffff5a0) = 0
fcntl64(8, F_SETLK64, {type=F_WRLCK, whence=SEEK_SET, start=2147477478,
len=1}, 0xbffff4f0) = 0

...type messages.
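
For anyone following along, those are plain POSIX byte-range locks.  A
minimal C sketch of what the runtime appears to be doing (the file name
is made up and the offsets are just lifted from the trace) would look
roughly like this -- on 32-bit, glibc issues these as the
fcntl64/F_SETLK64 calls you see in the strace:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* open the data file the runtime is hammering (name is illustrative) */
    int fd = open("testdata.dat", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* take a one-byte write lock, like the F_WRLCK line in the strace */
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 2147477478,   /* offset lifted from the trace above */
        .l_len    = 1,
    };
    if (fcntl(fd, F_SETLK, &fl) < 0)
        perror("fcntl F_SETLK");

    /* ... read/modify the record ... */

    /* drop the lock again, like the F_UNLCK line */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);

    close(fd);
    return 0;
}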

The app itself is a really old COBOL app built on Liant's RM/COBOL -- an
abstraction layer, similar to Java, which allows the same object code to
run on Linux, UNIX, and Windows with very little modification via a
runtime application.  So, while I have access to the source for the
compiled object, I don't have access to the runtime's code, which is
what's really doing all the locking.

This specific testing app is opening one file with locks, but it's
beating that file up.  Essentially, it's going through the file and
performing a series of sorts and searches, which, for the most part,
hammer the processor more than the I/O.  The "real" application for the
most part won't be nearly as intense, but it will probably open around
100 shared files simultaneously with POSIX locking.  Would adjusting
SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h affect this type
of application?  Are there any other tunable parameters that would
help?  I'm not tied to DLM at this point...is there another locking
mechanism which would do this equally well?
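
On the SHRINK_CACHE question above: as far as I can tell the change is
just raising the two #defines in lock_dlm.h and rebuilding the module,
something along these lines -- the numbers here are placeholders I've
made up, not values anyone has recommended:

/* lock_dlm.h -- cached-lock tuning for files that are repeatedly
 * locked/unlocked.  Placeholder values for illustration only; I don't
 * know yet whether raising them actually helps this workload. */
#define SHRINK_CACHE_COUNT  256
#define SHRINK_CACHE_MAX    1024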

As for a test app...I'm not sure I'll be able to provide that.  I'll
look into it, though.

--Peter


-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Tuesday, April 05, 2005 8:48 PM
To: Peter Shearer
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] LOCK_DLM Performance under Fire


On Tue, Apr 05, 2005 at 05:35:01PM -0700, Peter Shearer wrote:

> ext3 on local disk, the test app takes about 3 min 20 sec to complete.
> ext3 on GNBD exported disk (one node only, obviously); completes in
> about 3 min 35 sec.
> GFS on GNBD mounted with the localflocks option; completes in 5 min 30
> sec.
> GFS on GNBD mounted using LOCK_DLM with only one server mounting the
> fs; completes in 50 min 45 sec.
> GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs;
> went over 80 min and wasn't even half done.

It sounds like the app is using fcntl (posix) locks, not flock(2)?
If so, that's a weak spot for lock_dlm which translates posix-lock
requests into multiple dlm lock operations.

That said, it's possible the code may be doing some dumb things that
could be fixed to improve the speed.  If there are hundreds of files
being locked, one simple thing to try is to increase SHRINK_CACHE_COUNT
and SHRINK_CACHE_MAX in lock_dlm.h (sorry, never made them tunable
through proc.)  This relates to some basic caching lock_dlm does for
files that are repeatedly locked/unlocked.

If the app could get by with just using flock() that would certainly be
much faster.  Also, if you could provide the test you use or a
simplified equivalent it would help.
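
For comparison, the flock() path is a single whole-file lock instead of
a stream of per-range fcntl locks -- a rough sketch (illustrative only,
file name made up):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testdata.dat", O_RDWR);   /* illustrative file name */
    if (fd < 0)
        return 1;

    /* one exclusive lock on the whole file; this avoids the per-range
       posix-lock translation described above */
    flock(fd, LOCK_EX);

    /* ... work on the file ... */

    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}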

-- 
Dave Teigland  <teigland at redhat.com>



