[Linux-cluster] no version for "gfs2_unmount_lockproto"
Ferenc Wagner
wferi at niif.hu
Wed Feb 13 16:22:11 UTC 2008
Bob Peterson <rpeterso at redhat.com> writes:
> On Wed, 2008-02-13 at 09:23 +0100, Ferenc Wagner wrote:
>
>> Thanks! This patch indeed fixed the hang. But of course not the
>> mount:
>>
>> Trying to join cluster "lock_dlm", "pilot:test"
>> Joined cluster. Now mounting FS...
>> GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
>> GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)
Hi Bob,
Thanks for looking into this. Find my answers below.
> The "4294967295" is really a -1 which is a bad return code on the
> mount.
Aha. I expected something like that, though it looks more like a
journal number in the output. Never mind.
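
For illustration, the arithmetic behind that number can be sketched in
Python. This only shows how -1 looks once it is reinterpreted as an
unsigned 32-bit value. That the journal ID is held in a 32-bit unsigned
field is an assumption drawn from the printed value, not something
checked against the GFS source; the helper names are hypothetical.

```python
import struct


def as_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as an unsigned 32-bit one."""
    # Pack as signed 32-bit, then read the same four bytes back as unsigned.
    return struct.unpack("<I", struct.pack("<i", value))[0]


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit one."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


# The "journal number" from the kernel message, read back as signed.
print(as_signed32(4294967295))  # -1
print(as_unsigned32(-1))        # 4294967295
```

So a -1 error code printed with an unsigned format would show up exactly
as the "journal #4294967295" in the log.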
> So it should be a process of elimination to find out what went
> wrong. Several possibilities of what can be going wrong come to
> mind:
>
> 1. Is it possible that your file system has a different cluster
> name ("pilot") from the cluster name in your cluster.conf file?
No: <cluster name="pilot" config_version="3"> in the config.
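
A minimal sketch of that comparison, in case it helps someone else rule
out the same cause. It assumes the lock table has the form
"clustername:fsname" (as in "pilot:test" above) and that the cluster
name is the name attribute of the root <cluster> element. The function
names and the shortened inline config are hypothetical; no real
cluster.conf is read here.

```python
import xml.etree.ElementTree as ET


def cluster_name_from_conf(conf_xml: str) -> str:
    """Return the name attribute of the root <cluster> element."""
    root = ET.fromstring(conf_xml)
    return root.attrib["name"]


def split_lock_table(lock_table: str):
    """Split 'clustername:fsname' into its two parts."""
    cluster, _, fs_name = lock_table.partition(":")
    return cluster, fs_name


# Shortened stand-in for the real config; only the root element matters.
conf = '<cluster name="pilot" config_version="3"></cluster>'

cluster, fs_name = split_lock_table("pilot:test")
names_match = cluster == cluster_name_from_conf(conf)
print(names_match)
```

With the values from this thread the two names agree, so this cause can
be ruled out.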
> 2. Perhaps there is another gfs file system with the same name "test"
> already mounted?
No, there isn't. I rebooted the node several times, and it does not
start the cluster infrastructure automatically.
> 3. Perhaps it can't find the locking protocol, lock_dlm (I hope)?
> Make sure lock_dlm shows up in lsmod.
It does:
# lsmod | grep lock
lock_nolock 3456 0
lock_dlm 21260 1
gfs2 333228 3 gfs,lock_nolock,lock_dlm
dlm 108564 10 lock_dlm
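
The same check can be scripted. The sketch below parses lsmod-style text
(module name, size, use count, optional comma-separated users) and looks
for lock_dlm. The sample is copied from the output above; the function
name and the dictionary layout are hypothetical choices for this
example.

```python
LSMOD_SAMPLE = """\
lock_nolock 3456 0
lock_dlm 21260 1
gfs2 333228 3 gfs,lock_nolock,lock_dlm
dlm 108564 10 lock_dlm
"""


def loaded_modules(lsmod_text: str) -> dict:
    """Map module name to its size, use count, and list of users."""
    modules = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header, if present.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = {
            "size": int(fields[1]),
            "refcount": int(fields[2]),
            "users": users,
        }
    return modules


mods = loaded_modules(LSMOD_SAMPLE)
print("lock_dlm" in mods, mods["lock_dlm"]["refcount"])
```

A use count above zero for lock_dlm suggests something is already
holding the module, which matches the mount attempt getting as far as
joining the lock space.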
> 4. Perhaps gfs can't find the rest of the cluster infrastructure?
> Check to make sure you did "service cman start"
I did cman_tool join, which started aisexec.
> and have aisexec running on the system having the problem.
Yes, it's still running.
> Also, check /var/log/messages for messages pertaining to cluster
> problems.
It starts with the usual stuff, then:
openais[4504]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
<lots of component loads>
openais[4504]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)
<lots of technical data>
openais[4504]: [TOTEM] entering GATHER state from 15.
openais[4504]: [SERV ] Initialising service handler 'openais extended virtual synchrony service'
<lots of similar lines>
openais[4504]: [CMAN ] CMAN 2.01.00 (built Feb 12 2008 22:08:05) started
openais[4504]: [SYNC ] Not using a virtual synchrony filter.
openais[4504]: [TOTEM] Creating commit token because I am the rep.
<some state transitions>
openais[4504]: [TOTEM] entering OPERATIONAL state.
openais[4504]: [CMAN ] quorum regained, resuming activity
openais[4504]: [CLM ] got nodejoin message <IP of node1>
openais[4504]: [CLM ] got nodejoin message <IP of node3>
openais[4504]: [CPG ] got joinlist message from node 3
ccsd[4500]: Initial status:: Quorate
*Here* comes something possibly interesting, after fence_tool join:
fenced[4543]: fencing deferred to prior member
Though it doesn't look like node3 (which has the filesystem mounted)
would want to fence node1 (which has this message in its syslog). Is
there a command available to find out the current fencing status or
history?
Then comes the usual error message:
kernel: dlm: Using TCP for communications
kernel: dlm: connecting to 3
kernel: dlm: got connection from 3
clvmd: Cluster LVM daemon started - connected to CMAN
kernel: Trying to join cluster "lock_dlm", "pilot:test"
kernel: Joined cluster. Now mounting FS...
kernel: GFS: fsid=pilot:test.4294967295: can't mount journal #4294967295
kernel: GFS: fsid=pilot:test.4294967295: there are only 6 journals (0 - 5)
> It sounds to me like we should have a better error message for
> whatever went wrong. Let's figure that out first and then we can
> go about improving the error messages with a bugzilla if needed.
Sounds like a plan. Good error messages always help a lot.
> We have improved the error messages considerably from earlier.
> I don't know what version of the gfs2-utils you have, but that
> will contain the common mount helper (/sbin/mount.gfs2 is a hard
> link to /sbin/mount.gfs) that does some of this error processing
> when mounts fail. So a newer version of the mount helper may be
> better at pointing out what it doesn't like about your file system.
Maybe, but I'm using cluster-2.01.00, and I have had bad experiences with
CVS versions, such as dependence on a bleeding-edge kernel and device mapper.
>> # gfs_tool jindex /dev/mapper/gfs-test
>> gfs_tool: /dev/mapper/gfs-test is not a GFS file/filesystem
>>
>> Scary. What may be the problem? The other node is using this
>> volume... Can even unmount/remount it. Though in dmesg it says:
>
> I wouldn't call it scary at all. It sounds like gfs_tool may be
> somewhat confused about the mount point. Try using the mount
> point that was used on the mount command, not the /dev/mapper
> mount point and see if that helps.
Well, it helps on the node which has the filesystem mounted. Of
course not on the other. Is gfs_tool supposed to work on mounted
filesystems only? Probably so.
> I've actually been working on making a better version of that code
> too--both kernel and userland--that improves how gfs_tool finds
> mount points. For RHEL5, they're bugzillas 431951 (gfs_tool) and
> 431952 (kernel) respectively. Those changes have not been shipped
> yet, due to code freeze, but patches are in the bugzilla records.
Do you think I should apply them? It doesn't sound like they would
help with this problem.
> As for all the kernel dmesgs you noted, that's perfectly normal.
> When you mount a gfs file system, it runs through all the journals
> regardless, checking if they are clean or need to be replayed,
> so that's all those kernel messages mean. They're not locked
> (well, they are, but only for a couple seconds).
Thanks for the clarification. And what does that deferred fencing
mean?
--
Regards,
Feri.