[Date Prev][Date Next] [Thread Prev][Thread Next]
[dm-devel] Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
- From: Andrea Righi <righi andrea gmail com>
- To: Fernando Luis Vázquez Cao <fernando oss ntt co jp>
- Cc: xen-devel lists xensource com, uchida ap jp nec com, containers lists linux-foundation org, linux-kernel vger kernel org, Dave Hansen <dave linux vnet ibm com>, yoshikawa takuya oss ntt co jp, dm-devel redhat com, agk sourceware org, virtualization lists linux-foundation org, ngupta google com
- Subject: [dm-devel] Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
- Date: Thu, 07 Aug 2008 07:46:55 -0000
Fernando Luis Vázquez Cao wrote:
This RFC ended up being a bit longer than I had originally intended, but
hopefully it will serve as the start of a fruitful discussion.
Thanks for posting this detailed RFC! A few comments below.
As you pointed out, it seems that there is not much consensus building
going on, but that does not mean there is a lack of interest. To get the
ball rolling it is probably a good idea to clarify the state of things
and try to establish what we are trying to accomplish.
*** State of things in the mainstream kernel<BR>
The kernel has had somewhat adavanced I/O control capabilities for quite
some time now: CFQ. But the current CFQ has some problems:
- I/O priority can be set by PID, PGRP, or UID, but...
- ...all the processes that fall within the same class/priority are
scheduled together and arbitrary grouping are not possible.
- Buffered I/O is not handled properly.
- CFQ's IO priority is an attribute of a process that affects all
devices it sends I/O requests to. In other words, with the current
implementation it is not possible to assign per-device IO priorities to
1. Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
2. Being able to perform I/O bandwidth control independently on each
3. I/O bandwidth shaping.
4. Scheduler-independent I/O bandwidth control.
5. Usable with stacking devices (md, dm and other devices of that
6. I/O tracking (handle buffered and asynchronous I/O properly).
The same above also for IO operations/sec (bandwidth intended not only
in terms of bytes/sec), plus:
7. Optimal bandwidth usage: allow to exceed the IO limits to take
advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth for a block device is not fully used and then
"throttle" again when IO from unlimited cgroups comes into place)
8. "fair throttling": avoid to throttle always the same task within a
cgroup, but try to distribute the throttling among all the tasks
belonging to the throttle cgroup
The list of goals above is not exhaustive and it is also likely to
contain some not-so-nice-to-have features so your feedback would be
1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
We obviously need this because our final goal is to be able to control
the IO generated by a Linux container. The good news is that we already
have the cgroups infrastructure so, regarding this problem, we would
just have to transform our I/O bandwidth controller into a cgroup
This seems to be the easiest part, but the current cgroups
infrastructure has some limitations when it comes to dealing with block
devices: impossibility of creating/removing certain control structures
dynamically and hardcoding of subsystems (i.e. resource controllers).
This makes it difficult to handle block devices that can be hotplugged
and go away at any time (this applies not only to usb storage but also
to some SATA and SCSI devices). To cope with this situation properly we
would need hotplug support in cgroups, but, as suggested before and
discussed in the past (see (0) below), there are some limitations.
Even in the non-hotplug case it would be nice if we could treat each
block I/O device as an independent resource, which means we could do
things like allocating I/O bandwidth on a per-device basis. As long as
performance is not compromised too much, adding some kind of basic
hotplug support to cgroups is probably worth it.
What about using major,minor numbers to identify each device and account
IO statistics? If a device is unplugged we could reset IO statistics
and/or remove IO limitations for that device from userspace (i.e. by a
deamon), but pluggin/unplugging the device would not be blocked/affected
in any case. Or am I oversimplifying the problem?
3. & 4. & 5. - I/O bandwidth shaping & General design aspects
The implementation of an I/O scheduling algorithm is to a certain extent
influenced by what we are trying to achieve in terms of I/O bandwidth
shaping, but, as discussed below, the required accuracy can determine
the layer where the I/O controller has to reside. Off the top of my
head, there are three basic operations we may want perform:
- I/O nice prioritization: ionice-like approach.
- Proportional bandwidth scheduling: each process/group of processes
has a weight that determines the share of bandwidth they receive.
- I/O limiting: set an upper limit to the bandwidth a group of tasks
Use a deadline-based IO scheduling could be an interesting path to be
explored as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
If we are pursuing a I/O prioritization model à la CFQ the temptation is
to implement it at the elevator layer or extend any of the existing I/O
There have been several proposals that extend either the CFQ scheduler
(see (1), (2) below) or the AS scheduler (see (3) below). The problem
with these controllers is that they are scheduler dependent, which means
that they become unusable when we change the scheduler or when we want
to control stacking devices which define their own make_request_fn
function (md and dm come to mind). It could be argued that the physical
devices controlled by a dm or md driver are likely to be fed by
traditional I/O schedulers such as CFQ, but these I/O schedulers would
be running independently from each other, each one controlling its own
device ignoring the fact that they part of a stacking device. This lack
of information at the elevator layer makes it pretty difficult to obtain
accurate results when using stacking devices. It seems that unless we
can make the elevator layer aware of the topology of stacking devices
(possibly by extending the elevator API?) evelator-based approaches do
not constitute a generic solution. Here onwards, for discussion
purposes, I will refer to this type of I/O bandwidth controllers as
elevator-based I/O controllers.
A simple way of solving the problems discussed in the previous paragraph
is to perform I/O control before the I/O actually enters the block layer
either at the pagecache level (when pages are dirtied) or at the entry
point to the generic block layer (generic_make_request()). Andrea's I/O
throttling patches stick to the former variant (see (4) below) and
Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the later
approach. The rationale is that by hooking into the source of I/O
requests we can perform I/O control in a topology-agnostic and
elevator-agnostic way. I will refer to this new type of I/O bandwidth
controller as block layer I/O controller.
By residing just above the generic block layer the implementation of a
block layer I/O controller becomes relatively easy, but by not taking
into account the characteristics of the underlying devices we might risk
underutilizing them. For this reason, in some cases it would probably
make sense to complement a generic I/O controller with elevator-based
I/O controller, so that the maximum throughput can be squeezed from the
(1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
(2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
(3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
(4) Andrea Righi's i/o bandwidth controller (I/O throttling):http://thread.gmane.org/gmane.linux.kernel.containers/5975
(5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
6.- I/O tracking
This is arguably the most important part, since to perform I/O control
we need to be able to determine where the I/O is coming from.
Reads are trivial because they are served in the context of the task
that generated the I/O. But most writes are performed by pdflush,
kswapd, and friends so performing I/O control just in the synchronous
I/O path would lead to large inaccuracy. To get this right we would need
to track ownership all the way up to the pagecache page. In other words,
it is necessary to track who is dirtying pages so that when they are
written to disk the right task is charged for that I/O.
Fortunately, such tracking of pages is one of the things the existing
memory resource controller is doing to control memory usage. This is a
clever observation which has a useful implication: if the rather
imbricated tracking and accounting parts of the memory resource
controller were split the I/O controller could leverage the existing
infrastructure to track buffered and asynchronous I/O. This is exactly
what the bio-cgroup (see (6) below) patches set out to do.
It is also possible to do without I/O tracking. For that we would need
to hook into the synchronous I/O path and every place in the kernel
where pages are dirtied (see (4) above for details). However controlling
the rate at which a cgroup can generate dirty pages seems to be a task
that belongs in the memory controller not the I/O controller. As Dave
and Paul suggested its probably better to delegate this to the memory
controller. In fact, it seems that Yamamoto-san is cooking some patches
that implement just that: dirty balancing for cgroups (see (7) for
Another argument in favor of I/O tracking is that not only block layer
I/O controllers would benefit from it, but also the existing I/O
schedulers and the elevator-based I/O controllers proposed by
Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
are working on this and hopefully will be sending patches soon).
(6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
(7) Yamamoto-san dirty balancing patches: http://lwn.net/Articles/289237/
*** How to move on
As discussed before, it probably makes sense to have both a block layer
I/O controller and a elevator-based one, and they could certainly
cohabitate. As discussed before, all of them need I/O tracking
capabilities so I would like to suggest the plan below to get things
- Improve the I/O tracking patches (see (6) above) until they are in
- Fix CFQ and AS to use the new I/O tracking functionality to show its
benefits. If the performance impact is acceptable this should suffice to
convince the respective maintainer and get the I/O tracking patches
- Implement a block layer resource controller. dm-ioband is a working
solution and feature rich but its dependency on the dm infrastructure is
likely to find opposition (the dm layer does not handle barriers
properly and the maximum size of I/O requests can be limited in some
cases). In such a case, we could either try to build a standalone
resource controller based on dm-ioband (which would probably hook into
generic_make_request) or try to come up with something new.
- If the I/O tracking patches make it into the kernel we could move on
and try to get the Cgroup extensions to CFQ and AS mentioned before (see
(1), (2), and (3) above for details) merged.
- Delegate the task of controlling the rate at which a task can
generate dirty pages to the memory controller.
This RFC is somewhat vague but my feeling is that we build some
consensus on the goals and basic design aspects before delving into
I would appreciate your comments and feedback.
Very nice RFC.
[Date Prev][Date Next] [Thread Prev][Thread Next]