[Linux-cluster] no version for "gfs2_unmount_lockproto"

Ferenc Wagner wferi at niif.hu
Wed Feb 13 16:22:11 UTC 2008


Bob Peterson <rpeterso at redhat.com> writes:

> On Wed, 2008-02-13 at 09:23 +0100, Ferenc Wagner wrote:
>
>> Thanks!  This patch indeed fixed the hang.  But of course not the
>> mount:
>> 
>> Trying to join cluster "lock_dlm", "pilot:test"
>> Joined cluster. Now mounting FS...
>> GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
>> GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)

Hi Bob,

Thanks for looking into this.  Find my answers below.

> The "4294967295" is really a -1 which is a bad return code on the
> mount.

Aha.  I expected something like that, though in the output it looks
more like a journal number.  Never mind.
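
(Indeed, a quick shell check confirms that an all-ones 32-bit value
prints exactly like that:

# echo $(( (1 << 32) - 1 ))
4294967295

so the kernel must be pushing the -1 return code through the journal
number format.)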

> So it should be a process of elimination to find out what went
> wrong.  Several possibilities come to mind:
>
> 1. Is it possible that your file system has a different cluster
>    name ("pilot") from the the cluster name in your cluster.conf file?

No: <cluster name="pilot" config_version="3"> in the config.
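
If it's of any use, the on-disk name can also be checked against the
device itself; if I read the gfs_tool man page correctly, the sb
subcommand reads the superblock directly, so unlike jindex it should
not need a mount:

# gfs_tool sb /dev/mapper/gfs-test table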

> 2. Perhaps there is another gfs file system with the same name "test"
>    already mounted?

No, there isn't.  I rebooted the node several times, and it does not
start the cluster infrastructure automatically.

> 3. Perhaps it can't find the locking protocol, lock_dlm (I hope)?
>    Make sure lock_dlm shows up in lsmod.

It does:

# lsmod | grep lock
lock_nolock             3456  0 
lock_dlm               21260  1 
gfs2                  333228  3 gfs,lock_nolock,lock_dlm
dlm                   108564  10 lock_dlm

> 4. Perhaps gfs can't find the rest of the cluster infrastructure?
>    Check to make sure you did "service cman start"

I ran cman_tool join, which started aisexec.

>    and have aisexec running on the system having the problem.

Yes, it's still running.
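
To double-check membership and quorum from the node itself, I believe
these are the right incantations:

# cman_tool status
# cman_tool nodes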

>    Also, check /var/log/messages for messages pertaining to cluster
>    problems.

It starts with the usual stuff, then:

openais[4504]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' 
  <lots of component loads>
openais[4504]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) 
  <lots of technical data>
openais[4504]: [TOTEM] entering GATHER state from 15.
openais[4504]: [SERV ] Initialising service handler 'openais extended virtual synchrony service' 
  <lots of similar lines>
openais[4504]: [CMAN ] CMAN 2.01.00 (built Feb 12 2008 22:08:05) started 
openais[4504]: [SYNC ] Not using a virtual synchrony filter. 
openais[4504]: [TOTEM] Creating commit token because I am the rep. 
  <some state transitions>
openais[4504]: [TOTEM] entering OPERATIONAL state. 
openais[4504]: [CMAN ] quorum regained, resuming activity 
openais[4504]: [CLM  ] got nodejoin message <IP of node1>
openais[4504]: [CLM  ] got nodejoin message <IP of node3>
openais[4504]: [CPG  ] got joinlist message from node 3 
ccsd[4500]: Initial status:: Quorate 

*Here* comes something possibly interesting, after fence_tool join:

fenced[4543]: fencing deferred to prior member

Though it doesn't look as if node3 (which has the filesystem mounted)
would want to fence node1 (which has this message in its syslog).  Is
there a command to query the current fencing status or history?
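
(The closest I could find myself is group_tool from cman, though I'm
not sure it shows any history, only the current state of the fence
domain:

# group_tool ls
# group_tool dump fence

Corrections welcome.)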

Then comes the usual error message:

kernel: dlm: Using TCP for communications
kernel: dlm: connecting to 3
kernel: dlm: got connection from 3
clvmd: Cluster LVM daemon started - connected to CMAN
kernel: Trying to join cluster "lock_dlm", "pilot:test"
kernel: Joined cluster. Now mounting FS...
kernel: GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
kernel: GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)

> It sounds to me like we should have a better error message for
> whatever went wrong.  Let's figure that out first and then we can
> go about improving the error messages with a bugzilla if needed.

Sounds like a plan.  Good error messages always help a lot.

> We have improved the error messages considerably from earlier.
> I don't know what version of the gfs2-utils you have, but that
> will contain the common mount helper (/sbin/mount.gfs2 is a hard
> link to /sbin/mount.gfs) that does some of this error processing
> when mounts fail.  So a newer version of the mount helper may be
> better at pointing out what it doesn't like about your file system.

Maybe, but I'm using cluster-2.01.00, and I've had bad experiences
with CVS versions, like their dependence on a bleeding-edge kernel and
device mapper.
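
For the record, this is how I check what's installed here (package
names assume a Red Hat-style system, adjust as needed):

# rpm -q cman gfs-utils gfs2-utils
# ls -li /sbin/mount.gfs /sbin/mount.gfs2

Matching inode numbers in the ls -li output would confirm the hard
link you mention.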

>> # gfs_tool jindex /dev/mapper/gfs-test 
>> gfs_tool: /dev/mapper/gfs-test is not a GFS file/filesystem
>> 
>> Scary.  What may be the problem?  The other node is using this
>> volume...  It can even unmount/remount it.  Though in dmesg it says:
>
> I wouldn't call it scary at all.  It sounds like gfs_tool may be
> somewhat confused about the mount point.  Try using the mount
> point that was used on the mount command, not the /dev/mapper
> device path, and see if that helps.

Well, it helps on the node which has the filesystem mounted, but of
course not on the other.  Is gfs_tool supposed to work only on
mounted filesystems?  Probably so.
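
Concretely, with a hypothetical mount point /mnt/test on node3:

# gfs_tool jindex /mnt/test

works there, while on node1, which has nothing mounted, there seems to
be nothing for gfs_tool to talk to.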

> I've actually been working on making a better version of that code
> too--both kernel and userland--that improves how gfs_tool finds
> mount points.  For RHEL5, they're bugzillas 431951 (gfs_tool) and
> 431952 (kernel) respectively.  Those changes have not been shipped
> yet, due to code freeze, but patches are in the bugzilla records.

Do you think I should apply them?  It doesn't sound like they would
help with this problem.

> As for all the kernel dmesgs you noted, that's perfectly normal.
> When you mount a gfs file system, it runs through all the journals
> regardless, checking if they are clean or need to be replayed,
> so that's all those kernel messages mean.  They're not locked
> (well, they are, but only for a couple seconds).

Thanks for the clarification.  And what does that deferred fencing
mean?
-- 
Regards,
Feri.



