[dm-devel] [PATCH] dm-throttle: new device mapper target to throttle reads and writes

Thu Aug 12 16:46:36 UTC 2010

On Thu, Aug 12, 2010 at 11:08:09AM +0200, Heinz Mauelshagen wrote:
> On Tue, 2010-08-10 at 10:44 -0400, Vivek Goyal wrote:
> > On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> > > 
> > > This is a new device mapper "throttle" target which allows for
> > > throttling reads and writes (i.e. enforcing throughput limits) in units
> > > of kilobytes per second.
> > > 
> > 
> > Hi Heinz,
> > 
> > How about extending this to handle cgroups as well? Instead of having a
> > device-wide throttling policy, we would throttle cgroups. That would be
> > much more useful and would serve the use case of throttling virtual
> > machines in cgroups well.
> 
> 
> Hi Vivek,
> 
> This needs a serious design discussion, but I think we could leverage it
> to allow for throttling of cgroups.
> 

We need to parse cgroup information inside the dm target (as CFQ does),
prepare one queue per group, queue the IO there if the group has exceeded
its IO rate, and dispatch it later.
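
A rough userspace model of that per-group queueing, only to make the idea
concrete. The class and method names (GroupThrottle, set_rate, submit,
tick) are made up for illustration and are not dm-throttle or cgroup
interfaces. The sketch assumes a one-second refill interval and that no
single bio is larger than its group's per-second budget.

```python
from collections import deque


class GroupThrottle:
    """Toy model: one FIFO queue per group, byte budget refilled over time."""

    def __init__(self):
        self.rate = {}      # group -> allowed bytes per second (hypothetical rule store)
        self.budget = {}    # group -> bytes that may still be dispatched right now
        self.queues = {}    # group -> deque of held-back bio sizes

    def set_rate(self, group, bytes_per_sec):
        # Rules can be set or changed at run time, unlike a static table.
        self.rate[group] = bytes_per_sec
        self.budget.setdefault(group, bytes_per_sec)
        self.queues.setdefault(group, deque())

    def submit(self, group, size):
        """Return True if dispatched at once, False if queued for later."""
        queue = self.queues[group]
        # Keep FIFO order: never overtake bios that are already waiting.
        if not queue and self.budget[group] >= size:
            self.budget[group] -= size
            return True
        queue.append(size)
        return False

    def tick(self, seconds=1):
        """Refill budgets and dispatch queued bios that now fit."""
        dispatched = []
        for group, queue in self.queues.items():
            cap = self.rate[group]
            self.budget[group] = min(cap, self.budget[group] + cap * seconds)
            while queue and self.budget[group] >= queue[0]:
                size = queue.popleft()
                self.budget[group] -= size
                dispatched.append((group, size))
        return dispatched
```

In the real target, tick() would correspond to whatever worker or timer
dispatches held bios later; that part is left open here.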

We also need to get the per-cgroup rules from the cgroup interface rather
than from static device mapper tables at device creation time.

I can write the cgroup functionality for dm-throttle once the basic
dm-throttle is in.

I am not sure how to keep both modes in the dm-throttle target:

- cgroup mode
- whole-device limitation mode (the one you just created).

Mike Snitzer suggested that we can have both modes and specify the mode
at device creation time.

> > 
> > Yesterday I raised the issue of cgroup IO bandwidth throttling at the
> > Linux Storage and Filesystem session. I thought that a device mapper
> > target would be the easiest thing to do because I can make use of lots
> > of existing infrastructure.
> > 
> > Christoph did not like it because of configuration concerns. He preferred
> > something in the block layer/request queue. It was also hinted that there
> > were some ideas floating around for better integration of the device
> > mapper infrastructure with the request queue, and that this should come
> > after that.
> 
> Right, if a block layer change of that kind is pending, we should
> wait for it to settle.

I don't have details, but to me it sounds as if it is just a concept at
this point. Waiting for that to happen might take too long, and we want
the max bw control feature as soon as possible. IMHO, converting
dm-throttle later to use that new infrastructure will be a much better
option.

> 
> > But the problem is I am not sure how long it is going to take before
> > this new infrastructure becomes a reality and it will not be practical
> > to wait for that.
> 
> Did any reliable plans come out of the discussion or will there be any
> in the near future?

I am not aware of any. Alasdair will know more about it.

> 
> > 
> > One possibility is to put a hook in the __make_request function, first
> > take out all the bios and subject them to bandwidth limitation, and
> > then pass them to the lower layers. But that will mean redoing lots of
> > common infrastructure which has already been done. For example:
> > 
> > - What happens to queue congestion semantics?
> > 
> > 	- Request queue already has it based on requests and device mapper
> > 	  seems to have its own congestion functions.
> 
> Yes, dm does.

I was looking into the dm code and found dm_any_congested(). It looks
like dm just asks the underlying devices whether any of them is
congested.
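
As a toy model of that reading (not the kernel code): the stacked device
reports the union of the congestion bits its underlying devices report.
The function name and the bit values below are assumptions made for this
sketch.

```python
READ_CONGESTED = 1   # hypothetical bit values, for illustration only
WRITE_CONGESTED = 2


def any_congested(underlying_states, wanted_bits):
    """Return the wanted congestion bits set on at least one underlying device."""
    result = 0
    for state in underlying_states:
        result |= state & wanted_bits
    return result
```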

Thinking more about it, congestion semantics seem to have been defined
for a thread which does not want to sleep because of request descriptor
allocation. In the case of bandwidth control, we will not be allocating
any request descriptors. Bios will be handed to us. No cloning operation
is required, so no bio allocations are required. I might have to do some
allocation of internal structures (group, queue, etc.) when a new
request comes in, though.

So because I will not be putting any artificial restriction on the number
of bios queued for throttling (unlike request descriptors), I probably
don't require any congestion semantics. The only time a thread might
be put to sleep is if mempool_alloc() puts it to sleep because some
memory reclaim is taking place. That's how dm seems to be handling it, and
if that is acceptable there then it should be acceptable for a bandwidth
controller on the request queue?
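
The contrast can be sketched in userspace as follows. Both classes are
illustrative stand-ins, not kernel structures: RequestPool models a
bounded set of request descriptors, ThrottleQueue models an unbounded
list of held bios.

```python
from collections import deque


class RequestPool:
    """Bounded pool standing in for request descriptors."""

    def __init__(self, nr_requests):
        self.free = nr_requests

    def congested(self):
        # Callers that must not sleep check this before submitting.
        return self.free == 0

    def try_get(self):
        if self.free == 0:
            return False    # a real submitter would sleep or back off here
        self.free -= 1
        return True


class ThrottleQueue:
    """Unbounded bio queue: there is no limit to hit, so no congestion state."""

    def __init__(self):
        self.held = deque()

    def submit(self, bio):
        # Always succeeds in this model; in the kernel only a memory
        # allocation for internal structures could make the caller wait.
        self.held.append(bio)
        return True
```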

> 
> > 
> > 	- If I go for taking the bios out on the request queue and holding
> > 	  them back, then I am not sure how to define congestion semantics.
> > 	  To keep congestion semantics simple, it would make sense to
> > 	  create a new request queue (with the help of a dm target), and
> > 	  use that.
> 
> Yes, that's an obvious approach to stay with the same congestion
> semantics.

See above. If I am not putting an artificial limit on the number of bios
that can be submitted on the request queue, then I don't require any
additional congestion semantics. The only time a thread will be put to
sleep is if we are not able to allocate some objects like the group and
the per-group queue. Otherwise, a thread will submit the bio and go back
and do something else or wait for the IO to finish.

> 
> > 
> > - I have yet to think it through, but I think I will be doing other common
> >   operations like holding back requests in internal queues, dispatching
> >   these later with the help of a kernel thread, allowing some to dispatch
> >   immediately as they come in, putting processes to sleep and waking
> >   them later if we are already holding too many bios, etc.
> > 
> > To me it sounds like doing this is a lot simpler with the help of a device
> > mapper target. The not-so-nice part, though, is the need to configure
> > another device mapper target on every block device we want to control.
> 
> Yes, we'd need identity mappings in the stack to be prepared.
> 
> Or we need some __generic_make_request() hack a la bcache to hijack the
> request function on the fly.

I will look at bcache, but yes, it would be a hook in
__generic_make_request() if bandwidth control has to be done in the
request queue/block layer and not as a device mapper target.
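
For illustration only, the general shape of such a hook: interpose on the
submission path, hold back some bios, and pass the rest to the original
function. This is a userspace sketch and makes no claim about how bcache
actually does it; Queue, install_throttle_hook and should_hold are
hypothetical names.

```python
class Queue:
    """Stand-in for a request queue with a replaceable submission function."""

    def __init__(self, name, make_request_fn):
        self.name = name
        self.make_request_fn = make_request_fn


def install_throttle_hook(queue, should_hold, held):
    """Wrap the queue's submission function; return the original one."""
    original = queue.make_request_fn

    def hooked(bio):
        if should_hold(bio):
            held.append(bio)    # keep it for a later dispatch
            return
        original(bio)           # within limits: pass straight through

    queue.make_request_fn = hooked
    # The caller keeps the original so held bios can be dispatched later
    # without going through the hook again.
    return original
```

No identity mapping has to be configured per device in this model, which
is the attraction; the cost is that the holding and dispatching logic has
to live outside device mapper.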

Vivek