[Cluster-devel] [GFS2 PATCH] GFS2: Rework rgrp glock congestion functions for intra-node (v2)

Bob Peterson rpeterso at redhat.com
Thu Nov 2 17:06:47 UTC 2017


Addressing Steve's other questions:

----- Original Message -----
| > We can predict whether an rgrp glock is the victim of intra-node
| > congestion based on a set of rules. Rules enforced by this patch:
| >
| > 1. If the current process has a multi-block reservation in the
| >     rgrp, it needs to use it regardless of the congestion. The
| >     congestion algorithm should have prevented the reservation
| >     in the first place.
| > 2. If some process has the rgrp gl_lockref spin_lock locked,
| >     they are preparing to use it for a reservation. So we take
| >     this as a clear sign of impending contention.
| There can be all kinds of reasons why this lock has been taken. It
| should have very low contention on it, so that I don't think this is
| useful overall.

I'll try some experiments without this check.

| > 3. If the rgrp currently has a glock holder, we know we need to
| >     wait before we can lock it, regardless of whether the holder
| >     represents an actual holder or a waiter.
| That is probably wrong. We ought to be able to do multiple allocations
| in an rgrp if the node is holding the rgrp lock, without having to
| constantly queue and requeue locks. In other words, if we hold the rgrp
| in excl mode, then a useful extension would be to allow multiple users
| to perform allocations in parallel. So the exclusive rgrp glock becomes
| locally sharable - that is the real issue here I think.

That sounds like a really good enhancement, but it's beyond the scope
of this patch.
 
| > 4. If the rgrp currently has a multi-block reservation, and we
| >     already know it's not ours, then intranode contention is
| >     likely.
| What do you mean by "not ours"? In general each node only sees the
| reservations which are taken out on the node itself.

What I mean is: is the reservation associated with the current inode?
If the current inode for which we're allocating blocks already has a
reservation associated with it (gfs2_rs_active(rs), where rs is
ip->i_res), then we're obliged to use it. If not, we go out searching
for a new reservation in the rgrp. If the rgrp already has a
reservation queued from a different inode (so its rd_rstree is not
empty), then another process on this node has queued a reservation to
this rgrp on behalf of a different file.
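
To make that concrete, here is roughly how I think of the check. This
is only an illustrative sketch: rgrp_reserved_by_others() is a made-up
name, and the field details are from memory of fs/gfs2, so treat them
as approximate rather than the literal patch code.

/* Illustrative sketch only -- not the literal patch code. */
static bool rgrp_reserved_by_others(struct gfs2_inode *ip,
                                    struct gfs2_rgrpd *rgd)
{
        /* Rule 1: if this inode already has an active reservation,
         * we're obliged to use it, congested or not. */
        if (gfs2_rs_active(&ip->i_res))
                return false;

        /* Rule 4: any reservations still queued to this rgrp (non-empty
         * rd_rstree) must belong to other inodes on this node, so
         * intra-node contention is likely. */
        return !RB_EMPTY_ROOT(&rgd->rd_rstree);
}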

| > 5. If none of these conditions are true, we check to see if we
| >     can acquire (lock / enqueue) the glock relatively quickly.
| >     If this is lock_dlm protocol, we check if the rgrp is one
| >     of our preferred rgrps. If so, we treat it as fast. As it
| >     was before this patch, for lock_nolock protocol, all rgrps
| >     are considered "preferred."
| The preferred rgrp is the one which is closest to the previous
| allocation for the inode, so that we reduce fragmentation.

That's not what I mean. Not too long ago we introduced the concept of
a "preferred" set of resource groups, such that each node in the
cluster has a unique set of rgrps it prefers to use. As long as a
node's block allocations stay within its "preferred" rgrps, there is
no contention for rgrp glocks. The only fighting happens when a node
goes outside its preferred set of rgrps, which it sometimes needs to
do. This was a performance enhancement, and it made a big difference.
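
For reference, the preference striping works roughly like this.
This is paraphrased from memory of set_rgrp_preferences() in
fs/gfs2/rgrp.c, so the details are approximate:

/* Paraphrased sketch: each node starts at the rgrp matching its
 * journal id and then marks every Nth rgrp (N = number of journals)
 * as preferred, so the nodes' preferred sets don't overlap. */
static void set_rgrp_preferences(struct gfs2_sbd *sdp)
{
        struct gfs2_rgrpd *rgd, *first;
        int i;

        rgd = gfs2_rgrpd_get_first(sdp);
        for (i = 0; i < sdp->sd_lockstruct.ls_jid && rgd; i++)
                rgd = gfs2_rgrpd_get_next(rgd);
        first = rgd;

        do {
                rgd->rd_flags |= GFS2_RDF_PREFERRED;
                for (i = 0; i < sdp->sd_journals; i++) {
                        rgd = gfs2_rgrpd_get_next(rgd);
                        if (!rgd || rgd == first)
                                break;
                }
        } while (rgd && rgd != first);
}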

| > 6. If the rgrp glock is unlocked, it's generally not fast to
| >     acquire. At the least, we need to push it through the glock
| >     state machine and read it from the media. Worst case, we also
| >     need to get dlm's permission to do so. This is ignored for
| >     preferred rgrps, since we want to set up easy access to them
| >     anyway.
| It should be fast to acquire unless another node is using it... if not
| then we need to look to see what is holding things up. Chances are it is
| due to the wait for a journal commit on a remote node, and that needs to
| be looked at first I think.

But if multiple processes on the same node all want to do block
allocations, they need to wait for the appropriate rgrp glock enqueue,
which means their allocations may queue up behind others. This patch
is all about intra-node. The inter-node case is already well taken
care of. Of course, if we implement a way for multiple processes to
share the same rgrp glock in EX, as you suggested above (taking the
EX lock in the name of, say, a single block allocator rather than per
process), then this all becomes moot. But that could be tricky to
implement. It's definitely worth investigating, though.
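
Putting the rules together, the intra-node check boils down to
something like the following. Again, this is a simplified sketch
rather than the literal patch code; rgrp_intranode_congested() and
fast_to_acquire() stand in for the actual helpers, and the glock field
names are from memory:

/* Simplified sketch of the decision flow described above. */
static bool rgrp_intranode_congested(struct gfs2_inode *ip,
                                     struct gfs2_rgrpd *rgd)
{
        struct gfs2_glock *gl = rgd->rd_gl;

        if (gfs2_rs_active(&ip->i_res))    /* rule 1: must use our rsrv */
                return false;
        if (spin_is_locked(&gl->gl_lockref.lock))  /* rule 2 */
                return true;
        if (!list_empty(&gl->gl_holders))  /* rule 3: holders/waiters */
                return true;
        if (!RB_EMPTY_ROOT(&rgd->rd_rstree))  /* rule 4: others' rsrvs */
                return true;
        /* rules 5 and 6: otherwise estimate how quickly the glock can
         * be acquired (preferred rgrps count as fast). */
        return !fast_to_acquire(rgd);
}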
 
Regards,

Bob Peterson
Red Hat File Systems



