[dm-devel] more multipath deadlocks -- this time involving memory

christophe varoqui christophe.varoqui at free.fr
Mon Mar 28 22:48:55 UTC 2005


A new snapshot should have popped up on the ftp (0.4.4-pre4).
It contains an async logger thread for the daemon.

Edward, can you test whether it solves the issues with the direct
syslog calls you spotted?

open-iscsi guys, I don't know if this logger thing is suited to your
needs, but feel free to rip it: it's contained in

multipathd/log.[ch]: logging primitives and staging structs
multipathd/log_pthread.[ch]: a wrapper around log.[ch] which takes care
of the asynchronous details

It should be easy to add a log_ipc.[ch] wrapper, if needed.
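
For those who only want the idea, the core of it boils down to
something like this (a minimal sketch with illustrative names, not the
actual log.[ch]/log_pthread.[ch] code): the enqueue side never waits
on syslogd, and only the logger thread ever talks to it.

    #include <pthread.h>
    #include <string.h>
    #include <syslog.h>

    #define QSZ   64
    #define MSGSZ 256

    static char q[QSZ][MSGSZ];
    static int head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    /* producer side: never blocks on syslogd; if the ring is full
     * the message is dropped, which beats deadlocking the checker */
    void log_enqueue(const char *msg)
    {
            pthread_mutex_lock(&lock);
            if (count < QSZ) {
                    strncpy(q[tail], msg, MSGSZ - 1);
                    q[tail][MSGSZ - 1] = '\0';
                    tail = (tail + 1) % QSZ;
                    count++;
                    pthread_cond_signal(&nonempty);
            }
            pthread_mutex_unlock(&lock);
    }

    /* consumer thread, started once with pthread_create(): the only
     * thread that can ever block in the unix socket to syslogd */
    void *log_thread(void *arg)
    {
            char msg[MSGSZ];

            for (;;) {
                    pthread_mutex_lock(&lock);
                    while (count == 0)
                            pthread_cond_wait(&nonempty, &lock);
                    memcpy(msg, q[head], MSGSZ);
                    head = (head + 1) % QSZ;
                    count--;
                    pthread_mutex_unlock(&lock);
                    syslog(LOG_NOTICE, "%s", msg);
            }
            return arg;
    }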

Regards,
cvaroqui

On Thu, 2005-03-24 at 00:10 +0100, christophe varoqui wrote:
> Just to let you know I'm not ignoring your comments and analysis.
> 
> I opened the 0.4.4-pre* festival, and hope we can fix these nasties
> before the end of this cycle.
> 
> I started the branch with an id cache in multipath. I'm not sure about
> the design, so comments are welcome.
> 
> As agreed with Lars, I'll then continue moving bits from multipath/ to
> libmultipath/ until the daemon can be switched to
> libmultipath:multipath() instead of exec(/sbin/multipath).
> 
> That, plus a logging rework that is under discussion with the
> open-iscsi guys, should address most of your concerns.
> 
> A mempool would have to wait for another release, if at all desirable.
> 
> Regards,
> cvaroqui
> 
> On Mon, 2005-03-21 at 21:34 -0500, goggin, edward wrote:
> > Looks like some troublesome deadlock issues involving multipath, memory,
> > and all-paths-down use cases.  While one might typically expect such a
> > use case to result in errors, deadlock is not to be expected.
> > Furthermore, during a non-destructive ucode upgrade (NDU) of an EMC
> > CLARiion storage system, it is expected that for a short period of time
> > all paths to the storage system in question will appear to the host to
> > be failed.  Any multipathing solution is expected to ride through this
> > NDU scenario without a problem.
> > 
> > While I see three separate instances of the problem being plausible, I
> > have only seen the first problem instance described below.  The second
> > and third scenarios require high levels of memory contention which I
> > have not spent significant time creating.
> > 
> > The first problem scenario involves a deadlock between multipathd and
> > syslogd.  The second scenario involves the potential for multipathd,
> > multipath, or any of the executables invoked by multipath to deadlock
> > doing synchronous page reclamation while allocating memory pages for
> > user or kernel heap memory, in a system with a high degree of memory
> > contention and several multipath mapped devices in an all-paths-down
> > failure state.  The third scenario is extremely similar to the second,
> > but involves the need to allocate pages not for heap memory but to swap
> > in working-set pages of multipathd, multipath, or any of the
> > executables invoked by multipath.
> > 
> > First, it seems like __every__ time I try an NDU of the EMC CLARiion
> > ucode, one of the two (checkerloop or waiterloop) multipathd
> > sub-threads gets blocked in unix_wait_for_peer waiting to send a syslog
> > message through a UNIX domain socket to syslogd.  Unfortunately,
> > syslogd is itself blocked in blk_congestion_wait, waiting for the
> > number of dirty pages in the page cache to drop below a pre-defined
> > threshold while trying to write log info to its /var/log/messages log
> > file.  And draining those dirty pages depends on the multipathd
> > checkerloop thread periodically checking path connectivity and invoking
> > multipath in order to reconfigure multipath maps and/or re-enable some
> > now-valid paths.  Since the multipathd waiterloop event thread will
> > deadlock on the multipathd allpaths mutex currently owned by the
> > checkerloop thread, starting i/o on a failed path will not free up the
> > log jam.  Assuming enough free memory is available to do so, manually
> > running multipath often resolves the problem.  Yet this is hardly a
> > workaround to be recommended to an enterprise customer.
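
For illustration, one way a logger can avoid this particular block is
to talk to /dev/log itself with MSG_DONTWAIT, so a wedged syslogd costs
one dropped message instead of a thread stuck in unix_wait_for_peer.
A minimal sketch, not what multipathd does today; a real record would
also carry the usual "<priority>" prefix:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* illustrative only: non-blocking datagram send to /dev/log */
    int syslog_nonblock(const char *msg)
    {
            struct sockaddr_un sa = { .sun_family = AF_UNIX };
            int fd, r;

            strncpy(sa.sun_path, "/dev/log", sizeof(sa.sun_path) - 1);
            fd = socket(AF_UNIX, SOCK_DGRAM, 0);
            if (fd < 0)
                    return -1;
            r = sendto(fd, msg, strlen(msg), MSG_DONTWAIT,
                       (struct sockaddr *)&sa, sizeof(sa));
            close(fd);
            return r < 0 ? -1 : 0;  /* EAGAIN: dropped, not deadlocked */
    }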
> > 
> > I have only been able to avoid this deadly embrace by killing syslogd
> > before starting the test.  Without syslogd running, I made it through
> > the test 3 consecutive times; with syslogd running, it seems I cannot
> > get through the test at all.  I think simply changing syslogd to do
> > direct i/o instead of page-cache-buffered i/o to its log file(s) would
> > avoid this problem.  I am running the 2.6.11-rc3-upd2 kernel and the
> > 0.4.3-pre9 multipath tools, by the way.
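
On the direct i/o idea: O_DIRECT writes bypass the page cache entirely,
so they never add to the dirty-page count that blk_congestion_wait
throttles on.  The catch is the alignment rules -- buffer, length and
file offset must all be block-aligned -- so a sketch (hypothetical,
padding each record to a 512-byte block) looks like:

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLKSZ 512       /* logical block size: O_DIRECT alignment unit */

    /* append one log record, bypassing the page cache; the caller
     * tracks the (block-aligned) file offset in *off */
    int log_direct(int fd, off_t *off, const char *msg)
    {
            size_t len = strlen(msg);
            void *buf;
            int ok;

            if (len > BLKSZ - 1)
                    len = BLKSZ - 1;
            if (posix_memalign(&buf, BLKSZ, BLKSZ))
                    return -1;
            memset(buf, '\n', BLKSZ);       /* newline-pad the record */
            memcpy(buf, msg, len);

            ok = pwrite(fd, buf, BLKSZ, *off) == BLKSZ;
            if (ok)
                    *off += BLKSZ;
            free(buf);
            return ok ? 0 : -1;
    }

    /* fd would come from open(path, O_WRONLY|O_CREAT|O_DIRECT, 0640) */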
> > 
> > The 2nd scenario involves blockable user or kernel memory allocation
> > requests requiring page write-out of dirty pages on multipath mapped
> > devices in the synchronous page reclaim algorithm of __alloc_pages.
> > It seems to me that while mlockall can pin all current and future pages
> > of a process's working set, it does not prevent the process from doing
> > synchronous page reclamation as part of a blockable page allocation
> > request.  If many of the mapped devices are in queue congestion mode
> > due to failed paths on a storage system which is queuing failed bios
> > (as the EMC CLARiion must), multipathd, multipath, or any executable
> > invoked by multipath could block trying to page out dirty pages to
> > these mapped devices while allocating memory, before being able to
> > inform the kernel-resident multipath components of the existence of
> > valid paths for these mapped devices.
> > 
> > The 3rd scenario involves the need to mlockall all of the executables
> > invoked by multipathd (multipath) and the executables invoked by those
> > executables (scsi_id, /bin/false, ...).  Otherwise, any of these
> > executables can block during page reclamation while trying to allocate
> > free pages.  Also, does the effect of mlockall survive in the parent
> > beyond a fork/clone call, or does it need to be renewed afterwards?
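
To that last question: the parent's locks do survive a fork(), but the
child does not inherit them, and an execve() drops them as well -- so
every helper that multipath runs would have to call mlockall() itself.
A minimal sketch (illustrative only):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
            pid_t pid;

            /* parent: these locks persist in the parent across fork() */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
                    perror("mlockall (parent)");

            pid = fork();
            if (pid == 0) {
                    /* child: the parent's locks are gone here; renew
                     * them before anything that could fault or allocate,
                     * and again after any execve(), which unlocks all */
                    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
                            perror("mlockall (child)");
                    _exit(0);
            }
            return 0;
    }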
> > 
> > Overall, it seems like the code path to test and restore a target path
> > of a multipath mapped device should not require any blockable memory
> > allocations.  This would of course rule out fork/clone/exec.  A
> > possible alternative to the current design is to pre-allocate or
> > reserve the memory these tasks require -- possibly enough memory for
> > testing and restoring a single path to a single LU at a time.  While
> > this design would be tuned specifically to the job, I think it would
> > not need to be kernel resident.
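
The reserve idea could be as simple as carving the path-restore scratch
space out of a locked arena allocated once at daemon startup, so that
code path never re-enters __alloc_pages under memory pressure.  A rough
sketch, assuming one path to one LU is restored at a time (sizes and
names are illustrative):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    #define POOL_SIZE (64 * 1024)   /* hypothetical per-restore reserve */

    static char pool[POOL_SIZE];
    static size_t pool_used;

    /* at startup: fault every page in, then pin them */
    int pool_init(void)
    {
            memset(pool, 0, POOL_SIZE);
            return mlock(pool, POOL_SIZE);
    }

    /* simple bump allocator over the locked reserve */
    void *pool_alloc(size_t len)
    {
            void *p;

            if (pool_used + len > POOL_SIZE)
                    return NULL;            /* reserve exhausted */
            p = pool + pool_used;
            pool_used += len;
            return p;
    }

    /* between path checks: hand the whole reserve back */
    void pool_reset(void)
    {
            pool_used = 0;
    }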
> > 
-- 
christophe varoqui <christophe.varoqui at free.fr>