[dm-devel] [PATCH 05/10] block: remove per-queue plugging

Tue Apr 12 23:35:36 UTC 2011

On Tue, Apr 12, 2011 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:40, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch at infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>>    function calls.
> >>>>>  - Why is having a plug in blk_flush_plug marked unlikely?  Note that
> >>>>>    unlikely is the static branch prediction hint to mark the case
> >>>>>    extremely unlikely and is even used for hot/cold partitioning.  But
> >>>>>    when we call it we usually check beforehand if we actually have
> >>>>>    plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and out-of-line is for the scheduler() hook. It should be
> >>>> an unlikely event to schedule with a plug held, normally the plug should
> >>>> have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>>         Depth    Size   Location    (40 entries)
> >>>         -----    ----   --------
> >>>   0)     4256      16   mempool_alloc_slab+0x15/0x20
> >>>   1)     4240     144   mempool_alloc+0x63/0x160
> >>>   2)     4096      16   scsi_sg_alloc+0x4c/0x60
> >>>   3)     4080     112   __sg_alloc_table+0x66/0x140
> >>>   4)     3968      32   scsi_init_sgtable+0x33/0x90
> >>>   5)     3936      48   scsi_init_io+0x31/0xc0
> >>>   6)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >>>   7)     3856     112   sd_prep_fn+0x150/0xa90
> >>>   8)     3744      48   blk_peek_request+0x6a/0x1f0
> >>>   9)     3696      96   scsi_request_fn+0x60/0x510
> >>>  10)     3600      32   __blk_run_queue+0x57/0x100
> >>>  11)     3568      80   flush_plug_list+0x133/0x1d0
> >>>  12)     3488      32   __blk_flush_plug+0x24/0x50
> >>>  13)     3456      32   io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> > 
> > FYI, the next step in the allocation chain adds >900 bytes to that
> > stack:
> > 
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (47 entries)
> >         -----    ----   --------
> >   0)     5176      40   zone_statistics+0xad/0xc0
> >   1)     5136     288   get_page_from_freelist+0x2cf/0x840
> >   2)     4848     304   __alloc_pages_nodemask+0x121/0x930
> >   3)     4544      48   kmem_getpages+0x62/0x160
> >   4)     4496      96   cache_grow+0x308/0x330
> >   5)     4400      80   cache_alloc_refill+0x21c/0x260
> >   6)     4320      64   kmem_cache_alloc+0x1b7/0x1e0
> >   7)     4256      16   mempool_alloc_slab+0x15/0x20
> >   8)     4240     144   mempool_alloc+0x63/0x160
> >   9)     4096      16   scsi_sg_alloc+0x4c/0x60
> >  10)     4080     112   __sg_alloc_table+0x66/0x140
> >  11)     3968      32   scsi_init_sgtable+0x33/0x90
> >  12)     3936      48   scsi_init_io+0x31/0xc0
> >  13)     3888      32   scsi_setup_fs_cmnd+0x79/0xe0
> >  14)     3856     112   sd_prep_fn+0x150/0xa90
> >  15)     3744      48   blk_peek_request+0x6a/0x1f0
> >  16)     3696      96   scsi_request_fn+0x60/0x510
> >  17)     3600      32   __blk_run_queue+0x57/0x100
> >  18)     3568      80   flush_plug_list+0x133/0x1d0
> >  19)     3488      32   __blk_flush_plug+0x24/0x50
> >  20)     3456      32   io_schedule+0x79/0x80
> > 
> > That's close to 1800 bytes now, and that's not entering the reclaim
> > path. If I get one deeper than that, I'll be sure to post it. :)
> 
> Do you have traces from 2.6.38, or are you just doing them now?

I do stack checks like this all the time. I generally don't keep
them around, just pay attention to the path and depth. ext3 is used
for / on my test VMs, and has never shown up as the worst case stack
usage when running xfstests. As of the block plugging code, this
trace is the top stack user for the first ~130 tests, and often for
the entire test run on XFS....
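
For anyone who wants to check the arithmetic on traces like the ones
quoted above, here is a minimal sketch in Python. It is not part of any
kernel tooling: the helper names parse_stack_trace and bytes_above are
made up for illustration, and it assumes the three-column layout shown
in this thread (cumulative depth, frame size, symbol). The sample is a
three-line excerpt of the second trace, not the full output.

```python
import re

# One frame line looks like: "  0)     5176      40   zone_statistics+0xad/0xc0"
FRAME_RE = re.compile(r"^\s*(\d+)\)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")

# Excerpt of the second trace quoted in this thread (top frame and the
# two frames at the bottom of the plug flush path).
SAMPLE = """\
        Depth    Size   Location    (47 entries)
        -----    ----   --------
  0)     5176      40   zone_statistics+0xad/0xc0
 19)     3488      32   __blk_flush_plug+0x24/0x50
 20)     3456      32   io_schedule+0x79/0x80
"""


def parse_stack_trace(text):
    """Return (depth, size, symbol) tuples in the order they appear.

    Leading mail quote markers ("> > ", "> >>> ") are stripped first, so
    traces can be pasted straight out of a reply. Header and separator
    lines do not match FRAME_RE and are skipped.
    """
    frames = []
    for line in text.splitlines():
        line = re.sub(r"^(?:>\s?)+", "", line)
        match = FRAME_RE.match(line)
        if match:
            symbol = match.group(4).split("+", 1)[0]  # drop "+0xoff/0xlen"
            frames.append((int(match.group(2)), int(match.group(3)), symbol))
    return frames


def bytes_above(frames, symbol):
    """Stack bytes consumed by everything called from `symbol`.

    The Depth column is cumulative, so this is the depth of the first
    listed (deepest) frame minus the depth recorded at `symbol`.
    """
    if not frames:
        raise ValueError("no frames parsed")
    top_depth = frames[0][0]
    for depth, _size, name in frames:
        if name == symbol:
            return top_depth - depth
    raise ValueError("symbol not found in trace: " + symbol)


frames = parse_stack_trace(SAMPLE)
print(bytes_above(frames, "io_schedule"))  # 5176 - 3456 = 1720
```

Run against the first trace (top frame at depth 4256), the same call
gives 800, so the extra allocation frames in the second trace account
for the remaining 920 bytes.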

> The path you quote above should not go into reclaim, it's a GFP_ATOMIC
> allocation.

Right. I'm still trying to produce a trace that shows more stack
usage in the block layer. It's random chance as to what pops up most
of the time. However, some of the stacks that are showing up in
2.6.39 are quite different from any I've ever seen before...

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com