[Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

Steven Whitehouse swhiteho at redhat.com
Mon Oct 10 19:00:07 UTC 2011


Hi,

On Mon, 2011-10-10 at 13:01 -0400, David Teigland wrote:
> On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > > ----- Original Message -----
> > > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > > | > Hi,
> > > > | > 
> > > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > > | > Before, they were in a linked list off a hash table.  Now,
> > > > | > they're an rb_tree off the same hash table.  This speeds up
> > > > | > DLM lookups greatly.
> > > > | > 
> > > > | > Today's DLM is faster than older DLMs (e.g. the one in RHEL5)
> > > > | > for many file systems, due to the larger hash table size.  However,
> > > > | > this rb_tree implementation scales much better.  For my
> > > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > > | > show much of an improvement.  But when I scale the file system
> > > > | > to 4000 directories with 4000 files (16 million files), it
> > > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > > | > 42.01 hours to 23.68 hours.
> > > > | 
> > > > | How many hash table buckets were you using in that test?
> > > > | If it was the default (1024), I'd be interested to know how
> > > > | 16k compares.
> > > > 
> > > > Hi,
> > > > 
> > > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > > and 16K hash buckets, the time was virtually the same as
> > > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > > should re-evaluate whether to use the rb_tree implementation
> > > > or just increase the number of hash buckets as needed.
> > > > I guess the question now mainly comes down to scaling and
> > > > memory usage for all those hash buckets.
> > > 
> > > I'm still interested in possibly using an rbtree with fewer hash buckets.
> > > 
> > > At the same time, I think the bigger problem may be why gfs2 is caching so
> > > many locks in the first place, especially for millions of unlinked files
> > > whose locks will never benefit you again.
> > > 
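
(Just to make the structure under discussion concrete, the per-bucket
rb_tree arrangement amounts to something like the sketch below, with the
tree keyed on the resource name. The struct fields and function name are
illustrative only, not taken from Bob's actual patch.)

#include <linux/rbtree.h>
#include <linux/string.h>

struct dlm_rsb {                      /* illustrative subset of the real struct */
	struct rb_node	res_hashnode; /* hypothetical: replaces the list_head */
	int		res_length;
	char		res_name[64];
};

/* Each hash bucket holds an rb_root instead of a list_head, so a lookup
 * within a bucket is O(log n) in the number of rsbs rather than a linear
 * walk of the chain.
 */
static struct dlm_rsb *rsb_search(struct rb_root *root,
				  const char *name, int len)
{
	struct rb_node *node = root->rb_node;

	while (node) {
		struct dlm_rsb *r = rb_entry(node, struct dlm_rsb,
					     res_hashnode);
		int rc = memcmp(name, r->res_name, len); /* simplified compare */

		if (rc < 0)
			node = node->rb_left;
		else if (rc > 0)
			node = node->rb_right;
		else
			return r;
	}
	return NULL;
}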
> > 
> > I doubt that it is caching locks for unlinked files for any great period
> > of time if there is any memory pressure. It is usually memory pressure
> > which instigates the dropping of glocks relating to inodes. If the
> > glock is unlocked then simply dropping the ref count on the glock will
> > cause the deallocation (dlm lock drop) to occur.
> >
> > The reason why this tends to be seen at unlink time is just that
> > unlinking small files is a good way to iterate over a lot of inodes in a
> > relatively short time period, and thus locking can start to dominate the
> > timings if it isn't very low latency. I suspect there are lots of other
> > workloads which would see this (find perhaps?) as well. The faster the
> > array, the more noticeable the effect is likely to be.
> > 
> > Also, we do get some benefit if glocks are cached after inodes have been
> > unlinked. If an inode is then recreated in the same disk block, we'll
> > have all the glocks ready for it, without having to wait for them to be
> > set up from scratch.
> > 
> > I am slightly concerned that we ought to be testing these kinds of
> > workloads with much greater numbers of inodes. Considering the memory
> > sizes of modern servers, having 1m or even 10m files in cache is really
> > not a lot these days.
> 
> The fact remains that caching "as much as possible" tends to be harmful,
> and some careful limiting would be a good investment.
> 
There is a limit. The point is that the limit is dynamic and depends on
memory pressure. When the VM wants to free up memory, it asks us to
reduce the number of objects we have cached. So the larger the amount of
RAM, the more inodes may potentially be cached.
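
To be clear about the mechanism: the filesystem registers a shrinker, and
the VM calls it under memory pressure asking for a number of objects to
be dropped. Very roughly it looks like the sketch below (helper names are
made up and the interface is quoted from memory; this is not the actual
glock shrinker code):

#include <linux/mm.h>

/* When nr_to_scan is zero the VM is only asking how many objects we
 * could free; otherwise we walk the LRU and drop up to that many unused
 * entries.
 */
static int example_shrink_cache(struct shrinker *shrink,
				struct shrink_control *sc)
{
	if (sc->nr_to_scan)
		scan_lru_and_free(sc->nr_to_scan, sc->gfp_mask); /* hypothetical */

	return count_unused_objects();                           /* hypothetical */
}

static struct shrinker example_shrinker = {
	.shrink	= example_shrink_cache,
	.seeks	= DEFAULT_SEEKS,
};

/* register_shrinker(&example_shrinker) at init time and
 * unregister_shrinker() on the way out.
 */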

I don't agree that caching as much as possible (given the constraints
just mentioned) is bad. The more we cache, the more disk I/O we avoid,
and the better the performance will be.

> > There is a second consideration in selecting the data structure for the
> > dlm lock/resource etc. tables, which is the locking. Using a simple hash
> > table makes it much easier to use RCU-based locking, which is a great
> > advantage if there are a lot of lookup operations compared with the
> > number of insert/remove operations. That becomes a lot more difficult
> > (and requires a much greater lookup to insert/remove ratio) when trees
> > are involved.
> > 
> > We have run into the same kind of issues with GFS2's glock hash table.
> > We first reduced the size of the hash buckets to a single pointer so
> > that we could have a lot more hash heads for the given amount of memory.
> > Currently we have 2^16 hash table entries.
> > 
> > Also, we are now using RCU in order to reduce the locking overhead as
> > much as possible. We still have a spinlock for the lru list, but that is
> > not taken that much by comparison, and there is no locking in the lookup
> > path now except to disable preemption through the small critical section.
> > 
> > I'm not yet convinced that this is enough, even so. I would like to see
> > a tree-based system for the glocks at some future point, but the
> > complexity of tree-based RCU is currently putting me off, despite the
> > obvious benefit of an O(log N) lookup (where N is the number of items
> > in the tree) compared with walking, on average, half of a hash chain.
> > We'd need to add a seqlock on the read side, which would take away some
> > of the benefit of using RCU in the first place.
> > 
> > However, where we have scored real performance gains is simply in
> > avoiding looking up the glocks at all. That is why we cache them in the
> > inodes, and it is also why the void pointer we pass to the dlm is a
> > pointer to the glock, so there are then no hash table lookups when we
> > receive dlm replies. I think there is scope to explore this option in
> > the dlm in a few places too.
> > 
> > Our most recent expansion of this concept in gfs2 was the addition of a
> > variable in the inode to cache the last resource group used. This
> > reduces the number of lookups of the resource group data structure
> > dramatically during block allocations and deallocations.
> > 
> > Anyway, I just wanted to point out some of the tradeoffs of the possible
> > choices available in making gfs2/dlm scale to larger numbers of inodes.
> > I would very much like to have an RCU-based rbtree (or similar) which
> > doesn't require the seqlock if anybody has suggestions on how to
> > achieve that,
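
(To expand on the point above about the void pointer we pass to the dlm:
the completion callback gets the glock back directly, so no table lookup
is needed at that point. Schematically, not the exact lock_dlm code:)

/* The dlm hands back the pointer we gave it when the lock was requested,
 * so we go straight to the glock without any hash table lookup.
 */
static void example_dlm_ast(void *arg)
{
	struct gfs2_glock *gl = arg;

	finish_glock_operation(gl);	/* hypothetical follow-on processing */
}

/* On the request side the glock pointer is passed as the ast argument,
 * e.g. dlm_lock(..., example_dlm_ast, gl, ...).
 */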
> 
> While you don't want to be blatantly inefficient, I suspect investing time
> into things like RCU runs afoul of Amdahl's Law, i.e. you could reduce
> the time of the data structure operations to zero and not affect real-world
> performance.
> 
> When deciding where to focus improvements, I think cache limiting could
> produce a big payoff, and data structure optimizations a small payoff.
> 

We already have a limit (as per the comments above) and I'm not sure
whether you are proposing any further constraints here. The fact remains,
though, that we ought to be able to make maximum use of whatever RAM is
available on the server, and it makes no sense to artificially limit the
amount of RAM used by glocks/dlm locks (i.e. the number of objects which
are cached) beyond that point.

As a consequence we need to ensure that locking operations do not slow
down too much as the number of locks increases. I had thought that we
were both agreed on that point, at least, even if not on what to do
about it.

The point of RCU is to avoid the cache pollution associated with
reader/writer locks, and that can be a big win depending on the access
pattern to the data structure. The payoff therefore depends a great deal
on that pattern, but it was certainly worth doing for the glocks.
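
For the glocks, the read side now amounts to something along these lines
(a simplified sketch from memory rather than the exact code; the real
table uses bit-locked hlist heads so that each bucket is only a single
pointer):

/* Lockless lookup under RCU: no reader/writer lock, just an RCU
 * read-side critical section (which in most configurations only disables
 * preemption) plus a reference count check so we never return a glock
 * that is already on its way out.  Helper and field names are made up.
 */
static struct gfs2_glock *example_glock_lookup(unsigned int hash, u64 number)
{
	struct gfs2_glock *gl;

	rcu_read_lock();
	gl = search_bucket(hash, number);	/* hypothetical chain walk */
	if (gl && !atomic_inc_not_zero(&gl->gl_ref))
		gl = NULL;			/* being freed; treat as a miss */
	rcu_read_unlock();

	return gl;
}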

One side effect of this is that nothing can ever block a glock dump now.
That can potentially be useful for debugging, although it was not the
main reason for doing it.

I would like to use RCU for resource groups too - they are "read only"
from mount time to umount time (unless we grow the filesystem) but the
tree structure we have for resource groups rather puts me off.
Most of the performance gain we've got so far has been down to the
caching of the resource groups in the inode, and there may be little
left to gain from RCU. That said, we'll be testing it soon I hope, and
we'll see if it is worth doing or not,
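
(The rgrp caching mentioned above boils down to something like this; the
field and helper names are made up for illustration:)

/* Allocations and deallocations for an inode mostly stay within one
 * resource group, so check the rgrp cached in the inode before falling
 * back to the full search.
 */
static struct gfs2_rgrpd *example_rgrp_for_block(struct gfs2_inode *ip,
						 u64 block)
{
	struct gfs2_rgrpd *rgd = ip->i_rgd;	/* cached by the last alloc/dealloc */

	if (rgd && rgrp_contains_block(rgd, block))
		return rgd;

	return full_rgrp_search(ip, block);	/* walk the rgrp tree */
}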

Steve.




