[dm-devel] [PATCH]dm-mpath: fix for race condition between multipath_dtr and pg_init_done.

Merla, ShivaKrishna ShivaKrishna.Merla at netapp.com
Wed Oct 30 13:32:20 UTC 2013



> -----Original Message-----
> From: Mike Snitzer [mailto:snitzer at redhat.com]
> Sent: Tuesday, October 29, 2013 7:57 PM
> To: Merla, ShivaKrishna
> Cc: Hannes Reinecke; dm-devel at redhat.com; agk at redhat.com; Mikulas
> Patocka
> Subject: Re: [PATCH]dm-mpath: fix for race condition between
> multipath_dtr and pg_init_done.
> 
> On Thu, Oct 17 2013 at  6:03pm -0400,
> Merla, ShivaKrishna <ShivaKrishna.Merla at netapp.com> wrote:
> 
> > > From: Mike Snitzer [mailto:snitzer at redhat.com]
> > > Sent: Thursday, October 17, 2013 4:10 PM
> > > To: Hannes Reinecke
> > > Cc: Merla, ShivaKrishna; dm-devel at redhat.com; agk at redhat.com;
> Mikulas
> > > Patocka
> > > Subject: Re: [PATCH]dm-mpath: fix for race condition between
> > > multipath_dtr and pg_init_done.
> > >
> > > On Thu, Oct 17 2013 at  5:47pm -0400,
> > > Hannes Reinecke <hare at suse.de> wrote:
> > >
> > > > On 10/17/2013 08:53 PM, Mike Snitzer wrote:
> > > > >Thanks for reporting this.  Much appreciated.  More comments below.
> > > > >
> > > > >On Thu, Oct 17 2013 at  1:31pm -0400,
> > > > >Merla, ShivaKrishna <ShivaKrishna.Merla at netapp.com> wrote:
> > > > >
> > > > >>Whenever multipath_dtr is happening, we should prevent queueing
> any
> > > further path
> > > > >>activation work. There was a kernel panic where after pg_init_done()
> > > decrements
> > > > >>pg_init_in_progress to 0, wait_for_pg_init_completion call assumes
> > > there are no
> > > > >>more pending path management commands. But if pg_init_required
> is
> > > set by
> > > > >>pg_init_done call due to retriable mode_select errors , then
> > > process_queued_ios()
> > > > >>will again queue the path activation work. If free_multipath call has
> been
> > > > >>completed by the time activate_path work is called, kernel panic was
> > > seen on
> > > > >>accessing multipath members.
> > > > >
> > > > >Your locking looks suspect to me, see comment inlined below
> > > multipath_dtr
> > > > >
> > > > >But shouldn't we just train multipath_wait_for_pg_init_completion()
> to
> > > > >look at m->pg_init_required?  Have it wait for both pg_init_required
> and
> > > > >pg_init_in_progress to be zero?  We'd also have to audit that
> > > > >pg_init_required cannot be set while pg_init_in_progress.
> > > > >
> > > > Hmm.
> > > >
> > > > We _could_ try to resolve it by pushing I/O back onto the request
> queue
> > > > (cf my earlier post 'requeue I/O during pg_init').
> > > >
> > > > I was hoping to excite some comments with that, but seems to be my
> > > > fate nowadays to send out patches with no reply.
> > >
> > > patchwork caught it:
> > > https://patchwork.kernel.org/patch/2969111/
> > >
> > > I've just been distracted with other stuff the past week; but I'll be
> > > looking closer at this issue (and your earlier patch) shortly and we'll
> > > get a fix queued for 3.13.
> > >
> > > > Anyway, maybe this will be giving it some more attention.
> > > > It definitely would avoid this problem, by virtue of not having to
> > > > queue I/O internally during pg_init, so we could easily tear down
> > > > the queue.
> > >
> > > Sounds good.
> >
> > Thanks for your comments.  I agree we should lock while setting
> dtr_in_progress, I think I overlooked it as its handled in process_queued_ios
> as well.
> > We looked into handling this in wait_for_pg_init_completion() but checking
> for pg_init_required here will not help as well ( until we prevent setting
> pg_init_required while pg_init_in_progress is set ).
> > Here due to SCSI_DH_RETRY on mode_select, pg_init_done will set the
> pg_init_required as activation needs to be retried under normal
> > circumstances. But it completely differs when multipath target  is being
> destroyed.  Apparently I didn't see any pending_ios in our test
> > while this is happening. Just path activations are held up since controller
> was returning 5/91/36 CC's. With this condition either one of
> pg_init_required or pg_init_in_progress flags are set all the time.
> > Hannes patch will take care of preventing queueing of IO's when
> pg_init_in_progress is set, but currently running activation commands will
> not return until controller returns SUCCESS on mode_select.
> 
> So... do you have an updated patch with proper locking that takes
> Hannes' patch into consideration?
> 
> I'll be reviewing hannes patch closer (for v3.13) tomorrow so if you'd
> like your issue resolved in the near-term I'd appreciate us getting some
> closer on proposed solutions.
> 
> Thanks,
> Mike
I have re-submitted my patch with locking in multipath_dtr. I revisited Hannes patch and I think 
we still have this issue when multipath is destroyed and path activation work is queued. This is with
the scenario of scsi_dh_rdac returning SCSI_DH_RETRY on mode_select due to 5/91/36 CC's.




More information about the dm-devel mailing list