[Linux-cluster] Clearing a glock
Steven Whitehouse
swhiteho at redhat.com
Tue Jul 27 17:35:28 UTC 2010
Hi,
On Tue, 2010-07-27 at 10:14 -0700, Scooter Morris wrote:
> Hi Steve,
> More information. The offending file was /usr/local/bin/python2.6,
> which we use heavily on all nodes. Our general use is through the #!
> mechanism in .py files. Does this offer any clues as to why we had all
> of those processes waiting on a lock with no holder?
>
> -- scooter
>
Not really. I'd have expected that to be mapped read-only on the nodes,
with no write activity to it at all, so it should scale very well. Did
you set noatime?
I can't think of any other reason why that should have been an issue,
Steve.
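For anyone hitting the same thing: without noatime, every read of the
binary updates its access time, which turns reads into writes and forces
the glock to bounce between nodes. A minimal sketch of an fstab entry
for a GFS2 filesystem with noatime (the device path and mount point here
are examples only, not from this cluster):

```
# /etc/fstab - illustrative entry; device and mount point are hypothetical
/dev/mapper/vg0-gfs2lv  /usr/local  gfs2  noatime,nodiratime  0 0
```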
> On 07/27/2010 06:18 AM, Steven Whitehouse wrote:
> > Hi,
> >
> > On Tue, 2010-07-27 at 05:57 -0700, Scooter Morris wrote:
> >> On 7/27/10 5:15 AM, Steven Whitehouse wrote:
> >>> Hi,
> >>>
> >>> If you translate a5b67f into decimal, then that is the inode number of
> >>> the inode which is causing a problem. It looks to me as if you have too
> >>> many processes trying to access this one inode from multiple nodes.
> >>>
> >>> It's not obvious from the traces that anything is actually stuck, but if
> >>> you take two traces, a few seconds or minutes apart, then it should
> >>> become more obvious whether the cluster is making progress or whether it
> >>> really is stuck.
> >>>
> >>> Steve.
> >>>
> >>>
> >>> --
> >>> Linux-cluster mailing list
> >>> Linux-cluster at redhat.com
> >>> https://www.redhat.com/mailman/listinfo/linux-cluster
> >> Hi Steve,
> >> As always, thanks for the reply. The cluster was, indeed, truly
> >> stuck. I rebooted it last night to clear everything out. I never did
> >> figure out which file was the problem. I did a find -inum, but the find
> >> hung too. By that point the load average was up to 80 and climbing.
> >> Any ideas on how to avoid this? Are there tunable values I need to
> >> increase to allow more processes to access any individual inode?
> >>
> > The LA (load average) includes processes waiting for glocks, since that
> > is an uninterruptible wait, so that's where most of the LA came from.
> >
> > The find is unlikely to work while the cluster is stuck, since if it
> > does find the culprit inode, that inode is by definition already stuck,
> > so the find process would just join the queue. If a find fails to
> > discover the inode once the cluster has been rebooted and is back
> > working again, then it was probably a temporary file of some kind.
> >
> > There are no such tunables: the limit on access to an inode is simply
> > how fast the hardware can sync and invalidate that inode and pass its
> > glock on to another node in a given time period. It is a limitation of
> > the hardware and of the architecture of the filesystem.
> >
> > There are a few things which can probably be improved in due course, but
> > in the main the best way to avoid problems with congestion on inodes is
> > just to be careful about the access pattern across nodes.
> >
> > That said, if it really was completely stuck, that is a real bug and not
> > the result of the access pattern, since the code is designed such that
> > progress should always be made, even if it's painfully slow.
> >
> > Steve.
> >
> >
>
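A quick sketch of the glock-to-inode lookup Steve describes earlier in
the thread: convert the hex glock number to a decimal inode number, then
search the filesystem by inode. The mount point below is an example only,
and the find should be run on a node where the filesystem is responsive
(on a stuck cluster the find itself will just queue on the same glock):

```shell
# Convert the hex glock number (a5b67f) to a decimal inode number
inum=$(printf '%d' 0xa5b67f)
echo "$inum"    # 10860159

# Search for the file with that inode number; /gfs2/mount is hypothetical.
# -xdev keeps find from crossing into other filesystems.
[ -d /gfs2/mount ] && find /gfs2/mount -xdev -inum "$inum" || true
```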