[Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

Wed Oct 10 08:22:52 UTC 2018

On Tuesday, 9 October 2018 16:08:10 BST Tim Smith wrote:
> On Tuesday, 9 October 2018 14:00:34 BST Andreas Gruenbacher wrote:
> > On Tue, 9 Oct 2018 at 14:46, Tim Smith <tim.smith at citrix.com> wrote:
> > > On Tuesday, 9 October 2018 13:34:47 BST you wrote:
> > > > There must be another reason for the missed wake-up. I'll have to
> > > > study the code some more.
> > > 
> > > I think it needs to go into gfs2_glock_dealloc(), in such a way as to
> > > avoid
> > > that problem. Currently working out a patch to do that.
> > 
> > That doesn't sound right: find_insert_glock is waiting for the glock
> > to be removed from the rhashtable. In gfs2_glock_free, we remove the
> > glock from the rhashtable and then we do the wake-up. Delaying the
> > wakeup further isn't going to help (but it might hide the problem).
> 
> The only way we can get the problem we're seeing is if we get an effective
> order of
> 
> T0: wake_up_glock()
> T1: prepare_to_wait()
> T1: schedule()
> 
> so clearly there's a way for that to happen. Any other order and either
> schedule() doesn't sleep or it gets woken.
> 
> The only way I can see at the moment to ensure that wake_up_glock() *cannot*
> get called until after prepare_to_wait() is to delay it until the read_side
> critical sections are done, and the first place that's got that property is
> the start of gfs2_glock_dealloc(), unless we want to add synchronise_rcu()
> to gfs2_glock_free() and I'm guessing there's a reason it's using
> call_rcu() instead.
> 
> I'll keep thinking about it.

OK, we have a result that this definitely *isn't* the right answer (still 
getting the wakeup happening). This is lends more weight to the idea that 
there are multiple waiters, so we'll try your patch.

-- 
Tim Smith <tim.smith at citrix.com>