
Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree


On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > ----- Original Message -----
> > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > | > Hi,
> > | > 
> > | > This upstream patch changes the way DLM keeps track of RSBs.
> > | > Before, they were in a linked list off a hash table.  Now,
> > | > they're an rb_tree off the same hash table.  This speeds up
> > | > DLM lookups greatly.
> > | > 
> > | > Today's DLM is faster than older DLMs for many file systems,
> > | > (e.g. in RHEL5) due to the larger hash table size.  However,
> > | > this rb_tree implementation scales much better.  For my
> > | > 1000-directories-with-1000-files test, the patch doesn't
> > | > show much of an improvement.  But when I scale the file system
> > | > to 4000 directories with 4000 files (16 million files), it
> > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > | > 42.01 hours to 23.68 hours.
> > | 
> > | How many hash table buckets were you using in that test?
> > | If it was the default (1024), I'd be interested to know how
> > | 16k compares.
> > 
> > Hi,
> > 
> > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > and 16K hash buckets, the time was virtually the same as
> > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > should re-evaluate whether we should use the rb_tree
> > implementation or just increase the hash buckets as needed.
> > I guess the question is now mainly related to scaling and
> > memory usage for all those hash tables at this point.
> I'm still interested in possibly using an rbtree with fewer hash buckets.
> At the same time, I think the bigger problem may be why gfs2 is caching so
> many locks in the first place, especially for millions of unlinked files
> whose locks will never benefit you again.

I doubt it is caching locks for unlinked files for any great period of
time if there is any memory pressure. It is usually memory pressure that
triggers the dropping of glocks relating to inodes. If the glock is
unlocked, then simply dropping its ref count will cause the deallocation
(and the dlm lock drop) to occur.

The reason this tends to be seen at unlink time is just that unlinking
small files is a good way to iterate over a lot of inodes in a
relatively short time, so locking can start to dominate the timings if
it isn't very low latency. I suspect there are lots of other workloads
which would show the same effect (find, perhaps?). The faster the
array, the more noticeable the effect is likely to be.

Also, we do get some benefit if glocks are cached after inodes have been
unlinked. If an inode is then recreated in the same disk block, we'll
have all the glocks ready for it, without having to wait for them to be
set up from scratch.

I am slightly concerned that we ought to be testing these kinds of
workloads with much greater numbers of inodes. Considering the memory
sizes of modern servers, having 1m or even 10m files in cache is really
not a lot these days.

There is a second consideration in selecting the data structure for the
dlm lock/resource etc. tables, which is the locking. Using a simple hash
table makes it much easier to use RCU based locking, which is a great
advantage if there are many more lookup operations than insert/remove
operations. That becomes a lot more difficult (and requires a much
greater lookup to insert/remove ratio to pay off) when trees are
involved.

We have run into the same kind of issues with GFS2's glock hash table.
We first reduced each hash bucket head to a single pointer, so that we
could fit a lot more hash heads into a given amount of memory.
Currently we have 2^16 hash table entries.

Also, we are now using RCU in order to reduce the locking overhead as
much as possible. We still have a spinlock for the lru list, but by
comparison it is not taken very often, and there is now no locking in
the lookup path except to disable preemption across the small critical
section.

Even so, I'm not yet convinced that this is enough. I would like to see
a tree-based system for the glocks at some future point, but the
complexity of tree-based RCU is currently putting me off, despite the
obvious benefit of an O(log N) lookup (where N is the number of items
in the tree) compared with the average N/2 traversal of a hash chain.
We'd need to add a seqlock on the read side, which would take away some
of the benefit of using RCU in the first place.

However, where we have scored real performance gains is simply by
avoiding looking up the glocks at all. That is why we cache them in the
inodes, and also why the void pointer we pass to the dlm is a pointer
to the glock: there are then no hash table lookups when we receive dlm
replies. I think there is scope to explore this option in a few places
in the dlm too.

Our most recent expansion of this concept in gfs2 was the addition of a
variable in the inode to cache the last resource group used. This
reduces the number of lookups of the resource group data structure
dramatically during block allocations and deallocations.

Anyway, I just wanted to point out some of the tradeoffs in the
possible choices for making gfs2/dlm scale to larger numbers of inodes.
I would very much like to have an RCU based rbtree (or similar) which
doesn't require the seqlock, if anybody has suggestions on how to
achieve that.

