[dm-devel] Re: IO scheduler based IO Controller V2

Thu May 7 01:48:24 UTC 2009

From: Andrea Righi <righi.andrea at gmail.com>
Subject: Re: IO scheduler based IO Controller V2
Date: Thu, 7 May 2009 00:35:13 +0200

> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > > Hi Andrea and others,
> > > > 
> > > > I have always suspected that any kind of 2nd level controller will
> > > > have no idea about the underlying IO scheduler queues/semantics. So while
> > > > it can implement a particular cgroup policy (max bw like io-throttle or
> > > > proportional bw like dm-ioband), there is a high chance that it will
> > > > break the IO scheduler's semantics in one way or another.
> > > > 
> > > > I had already sent out the results for dm-ioband in a separate thread.
> > > > 
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > > > 
> > > > Here are some basic results with io-throttle. Andrea, please let me know
> > > > if you think this is a procedural problem. I am playing with the
> > > > io-throttle patches for the first time.
> > > > 
> > > > I took V16 of your patches and am trying them out with 2.6.30-rc4 and the
> > > > CFQ scheduler.
> > > > 
> > > > I have got one SATA drive with one partition on it.
> > > > 
> > > > I am trying to create one cgroup, assign an 8MB/s limit to it, launch
> > > > one RT prio 0 task and one BE prio 7 task, and see how this 8MB/s is
> > > > divided between these tasks. The test script and results follow.
> > > > 
> > > > Following is my test script.
> > > > 
> > > > *******************************************************************
> > > > #!/bin/bash
> > > > 
> > > > mount /dev/sdb1 /mnt/sdb
> > > > 
> > > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > > > 
> > > > # Set bw limit of 8 MB/s on sdb
> > > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > \
> > > >     /cgroup/iot/test1/blockio.bandwidth-max
> > > > 
> > > > sync
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > > 
> > > > echo $$ > /cgroup/iot/test1/tasks
> > > > 
> > > > # Launch a BE prio 7 reader.
> > > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > > pid1=$!
> > > > echo $pid1
> > > > 
> > > > # Launch an RT reader  
> > > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > > pid2=$!
> > > > echo $pid2
> > > > 
> > > > wait $pid2
> > > > echo "RT task finished"
> > > > **********************************************************************
> > > > 
> > > > Test1
> > > > =====
> > > > Test two readers (one RT class and one BE class) and see how BW is
> > > > allocated within the cgroup
> > > > 
> > > > With io-throttle patches
> > > > ------------------------
> > > > - Two readers, first BE prio 7, second RT prio 0
> > > > 
> > > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > > RT task finished
> > > > 
> > > > Note: There is no difference in the performance of the RT and BE tasks.
> > > > It looks like they got throttled equally.
> > > 
> > > OK, this is consistent with the current io-throttle implementation. IO
> > > requests are throttled without any notion of the ioprio model.
> > > 
> > > We could try to distribute the throttle using a function of each task's
> > > ioprio, but the obvious drawback is that it totally breaks the logic
> > > used by the underlying layers.
> > > 
> > > BTW, I'm wondering, is it a very critical issue? Why not move the RT
> > > task to a different cgroup with unlimited BW? Or one with limited BW
> > > but with other tasks running at the same IO priority...
> > 
> > So one hypothetical use case could be the following. Somebody
> > has a hosted server and customers are going to get their
> > applications running in a particular cgroup with a limit on max bw.
> > 
> > 			root
> > 		  /      |      \
> > 	     cust1      cust2   cust3
> > 	   (20 MB/s)  (40MB/s)  (30MB/s)
> > 
> > Now all three customers will run their own applications/virtual machines
> > in their respective groups with upper limits. Will we tell them that
> > all their tasks will be treated as the same class and same prio level?
> > 
> > Assume cust1 is running a hypothetical application which creates multiple
> > threads and assigns these threads different priorities based on its needs
> > at run time. How would we handle this thing?
> > 
> > You can't collect all the RT tasks from all customers and move these to a
> > single cgroup. Or ask customers to separate out their tasks based on
> > priority level and give them multiple groups of different priority.
> 
> Clear.
> 
> Unfortunately, I think that with absolute BW limits, at a certain point, if
> we hit the limit, we need to block the IO request. That is the same
> whether we block when we dispatch or when we submit the request. And the
> risk is that we break the logic of the IO priorities and fall into the
> classic priority inversion problem.
> 
> The difference is that working at the CFQ level probably gives better
> control, so we can handle these cases appropriately and avoid the
> priority inversion problems.
> 
> Thanks,
> -Andrea

If RT tasks in cust1 issue IOs intensively, are the IOs issued from BE
tasks running in cust2 and cust3 suppressed, so that cust1 can use the
whole bandwidth?
I think that CFQ's class and priority should be preserved within the
bandwidth given to each cgroup.
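
A minimal sketch of that idea follows. It assumes a hypothetical helper,
split_cgroup_bw, which is not part of io-throttle, dm-ioband or CFQ. Each
cgroup's cap is divided only among that cgroup's own tasks, RT tasks are
served ahead of BE tasks, and tasks in the served class are weighted by
(8 - prio). Both the strict RT-first rule and the weight formula are
assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical sketch, not code from any of the patch sets discussed here.
#
# split_cgroup_bw LIMIT_BYTES CLASS:PRIO...
#   CLASS is "rt" or "be"; PRIO is 0 (highest) to 7 (lowest).
#   Prints one line per task: "CLASS:PRIO SHARE_IN_BYTES_PER_SEC".
split_cgroup_bw() {
    local limit=$1
    shift
    local task class prio active="be" total=0

    # Assumption: if any RT task is present, only the RT class is served.
    for task in "$@"; do
        if [[ ${task%%:*} == rt ]]; then
            active="rt"
        fi
    done

    # Sum the assumed weights (8 - prio) of the tasks in the served class.
    for task in "$@"; do
        class=${task%%:*}
        prio=${task##*:}
        if [[ $class == "$active" ]]; then
            total=$(( total + 8 - prio ))
        fi
    done

    # Each served task gets limit * weight / total (integer division
    # rounds down); tasks outside the served class get 0.
    for task in "$@"; do
        class=${task%%:*}
        prio=${task##*:}
        if [[ $class == "$active" ]]; then
            echo "$task $(( limit * (8 - prio) / total ))"
        else
            echo "$task 0"
        fi
    done
}

# The two readers from the test above, inside one cgroup capped at 8 MB/s.
split_cgroup_bw $((8 * 1024 * 1024)) be:7 rt:0
```

Because the function only ever sees one cgroup's limit, an RT task in
cust1 cannot take bandwidth away from cust2 or cust3 in this model.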

Thanks,
Ryo Tsuruta