[Linux-cluster] GFS2 - F_SETLK fails with "ENOSYS" after umount + mount

Wed Jan 30 16:11:09 UTC 2013

On Wed, 2013-01-30 at 13:34 +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2013-01-30 at 12:31 +0100, Kristian Grønfeldt Sørensen wrote:
> > Hi,
> > 
> > I'm setting up a two-node cluster sharing a single GFS2 filesystem
> > backed by a dual-primary DRBD-device (DRBD on top of LVM, so no CLVM
> > involved).
> > 
> > I am experiencing more or less the same as the OP in this thread:
> > http://www.redhat.com/archives/linux-cluster/2010-July/msg00136.html
> > 
> 
> Well I'm not so sure about that. We never found out what the issue was
> in that case, but in your case it seems that you are doing something
> which should work. Also, in the msg00136 case it seems that the lock
> request didn't work at all, whereas in your case it appears that it does
> work until a umount/mount of one node - at least if I've understood it
> correctly.

Correct. And I am able to bring the system into a working state by
unmounting the file system from all nodes at the same time, and mounting
it again. 

> Which kernel and userspace are you using?

It's Debian testing. The kernel is from experimental
(3.7.1-1~experimental.2), since I had problems deleting files with the
gfs2 module included in the default Debian testing kernel (3.2.x).

cman + libdlm3 is v3.0.12
corosync is v1.4.2

Let me know if you need version numbers of other stuff.

> It would be a good plan to report this as a bug (or via support if you
> are a supported customer and are using RHEL) as it should work
> correctly,

OK, I will probably file a bug report then. It's at least encouraging to
hear that it should work :-)

/Kristian

> Steve.
> 
> 
> > I have an activemq-5.6.0 instance on each server that tries to lock a
> > file on the GFS2-filesystem (using ).  
> > 
> > When I start the cluster, everything works as expected. The first
> > activemq instance that starts up acquires the lock, the lock is released
> > when the activemq exits, and the second instance takes the lock. 
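
The expected handover described above can be checked without activemq.
What follows is only a minimal sketch, assuming Python 3 on Linux; the
names CHILD, child_attempt and handover_demo are hypothetical and not
part of activemq or GFS2. One process holds an exclusive lock on byte 0
of a file, a second process tries a non-blocking lock, and then tries
again once the first has released it.

```python
import fcntl
import os
import subprocess
import sys

# Script run in a second process: try a non-blocking exclusive lock on
# byte 0 of the file given as argv[1] and print the outcome.
CHILD = r"""
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o666)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    print("acquired")
except OSError:
    print("refused")
"""


def child_attempt(path):
    """Run a second process that tries to take the lock; return its output."""
    out = subprocess.run([sys.executable, "-c", CHILD, path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def handover_demo(path):
    """Hold the lock, let a second process try, release, let it try again."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    while_held = child_attempt(path)
    os.close(fd)  # like the first instance exiting: the lock is dropped
    after_release = child_attempt(path)
    return while_held, after_release
```

On a filesystem where locking works, handover_demo(path) should return
("refused", "acquired"). Note that the child reports "refused" for any
OSError, so this sketch alone does not distinguish a busy lock from
ENOSYS.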
> > 
> > The problem shows when I unmount and subsequently mount the GFS2
> > filesystem again on one of the nodes, or reboot one of the nodes (after
> > having started at least one activemq instance).
> > Then I start seeing statements like this in the activemq log files:
> > 
> > Database /srv/activemq/queue#3a#2f#2fstat.#3e/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: Function not implemented | org.apache.activemq.store.kahadb.MessageDatabase
> > 
> > strace -f while that message is logged gives the following:
> > 
> > [pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
> > [pid  3549] stat("/srv/activemq/queue#3a#2f#2fstat.#3e", {st_mode=S_IFDIR|0755, st_size=3864, ...}) = 0
> > [pid  3549] open("/srv/activemq/queue#3a#2f#2fstat.#3e/lock", O_RDWR|O_CREAT, 0666) = 133
> > [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> > [pid  3549] fcntl(133, F_GETFD)         = 0
> > [pid  3549] fcntl(133, F_SETFD, FD_CLOEXEC) = 0
> > [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> > [pid  3549] fstat(133, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> > [pid  3549] fcntl(133, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = -1 ENOSYS (Function not implemented)
> > [pid  3549] dup2(138, 133)              = 133
> > [pid  3549] close(133)
> > 
> > As you can see, the "Function not implemented" originates from the
> > F_SETLK fcntl that the JVM does.
> > The only way to recover from this state seems to be by unmounting the
> > GFS2-filesystem on both nodes, then mounting it again on both
> > nodes.
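
The failing call from the strace can also be issued without a JVM. This
is a hedged sketch, assuming Python 3 on Linux; the function name
try_byte_lock is hypothetical. With LOCK_NB set, CPython's fcntl.lockf()
issues fcntl(fd, F_SETLK, ...), here with a write lock, start=0 and
len=1, which matches the request shown in the trace.

```python
import errno
import fcntl
import os


def try_byte_lock(path):
    """Attempt a non-blocking exclusive lock on byte 0 of `path`.

    Returns "locked", "busy" or "enosys"; any other error is re-raised.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Arguments after the flags are len=1, start=0, whence=SEEK_SET.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
        return "locked"
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return "busy"      # another process holds a conflicting lock
        if exc.errno == errno.ENOSYS:
            return "enosys"    # the "Function not implemented" case
        raise
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

If the failure is independent of the JVM, calling try_byte_lock() on a
file on the GFS2 mount of the affected node would be expected to return
"enosys", and "locked" or "busy" on the other node.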
> > 
> > I've tried to isolate this by using a simpler testcase than starting two
> > activemq instances. I ended up using the java sample from
> > http://www.javabeat.net/2007/10/locking-files-using-java/ . 
> > 
> > I haven't managed to get the system into a state where F_SETLK returns
> > "Function not implemented" by only using the above FileLockTest class (I
> > need activemq in order to trigger the situation), but when the system is
> > in that state, I can run FileLockTest, and it will print out the
> > following stacktrace.
> > 
> > Exception in thread "main" java.io.IOException: Function not implemented
> >         at sun.nio.ch.FileChannelImpl.lock0(Native Method)
> >         at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:871)
> >         at java.nio.channels.FileChannel.tryLock(FileChannel.java:962)
> >         at FileLockTest.main(FileLockTest.java:15)
> > 
> > 
> > If I run this on the other server (where the GFS2 fs was not unmounted
> > and mounted again), it works correctly. 
> > 
> > Any ideas as to what is happening, and why?
> > 
> > BR
> > Kristian Sørensen