[dm-devel] [PATCH 01/18] io-controller: Documentation
- From: Vivek Goyal <vgoyal redhat com>
- To: nauman google com, dpshah google com, lizf cn fujitsu com, mikew google com, fchecconi gmail com, paolo valente unimore it, jens axboe oracle com, ryov valinux co jp, fernando oss ntt co jp, s-uchida ap jp nec com, taka valinux co jp, guijianfeng cn fujitsu com, jmoyer redhat com, dhaval linux vnet ibm com, balbir linux vnet ibm com, linux-kernel vger kernel org, containers lists linux-foundation org, righi andrea gmail com, agk redhat com, dm-devel redhat com, snitzer redhat com, m-ikeda ds jp nec com
- Cc: akpm linux-foundation org, vgoyal redhat com
- Subject: [dm-devel] [PATCH 01/18] io-controller: Documentation
- Date: Tue, 05 May 2009 19:59:01 -0000
o Documentation for io-controller.
Signed-off-by: Vivek Goyal <vgoyal redhat com>
Documentation/block/00-INDEX | 2 +
Documentation/block/io-controller.txt | 264 +++++++++++++++++++++++++++++++++
2 files changed, 266 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/io-controller.txt
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
@@ -10,6 +10,8 @@ capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
- Deadline IO scheduler tunables
+ - IO controller for providing hierarchical IO scheduling
- Block io priorities (in CFQ scheduler)
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
@@ -0,0 +1,264 @@
+ IO Controller
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+These patches modify the elevator layer and individual IO schedulers to do
+IO control, hence this IO controller works only on block devices which use
+one of the standard IO schedulers; it can not be used with any xyz logical
+block device.
+The assumption/thought behind modifying the IO schedulers is that resource
+control is needed only on leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of these. Some part of sdb is in lv0 and some part
+is in lv1.
+                lv0      lv1
+              /     \  /     \
+            sda      sdb      sdc
+Also consider the following cgroup hierarchy:
+                   root
+                  /    \
+                 A      B
+                / \    / \
+              T1   T2 T3  T4
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+So if tasks T1 and T2 are doing IO on lv0 only and T3 and T4 are doing IO on
+lv1 only, there will not be any contention for resources between groups A
+and B if IO is going to sda or sdc. But if the actual IO gets translated to
+disk sdb, then the IO scheduler associated with sdb will distribute disk
+bandwidth to groups A and B in proportion to their weights.
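To make the proportional split on sdb concrete, here is a minimal sketch of the arithmetic. The helper name expected_share and the weights 1000 and 500 are assumptions chosen for illustration; nothing like this helper exists in the patchset.

```shell
# Hypothetical helper: print the percentage of disk time a group should
# get on a contended disk. The first argument is the group's own weight,
# the remaining arguments are the weights of the competing groups.
expected_share() {
    mine=$1
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    # Integer arithmetic, so the result is truncated.
    echo $((mine * 100 / total))
}

expected_share 1000 500    # group A with weight 1000 against B with 500
expected_share 500 1000    # group B with weight 500 against A with 1000
```

With these example weights, group A would be expected to get roughly two thirds of the disk time on sdb and group B roughly one third, and only while both groups actually have IO pending on that disk.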
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. However, it is flat; with
+cgroups, it needs to be made hierarchical to achieve good hierarchical
+control of IO.
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code, which has been moved to a common layer (the elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+            ---------------------------------
+            | Elevator Layer + Fair Queuing |
+            ---------------------------------
+               |        |         |     |
+             NOOP   DEADLINE     AS    CFQ
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
+B-WF2Q+ algorithm for fair queuing.
+- Not sure if the weighted round robin logic of CFQ can be easily extended
+ for hierarchical mode. One problem is that we can not keep dividing the
+ time slice of a parent group among its children. The deeper we go in the
+ hierarchy, the smaller the time slice will get.
+ One of the ways to implement hierarchical support could be to keep track
+ of virtual time and service provided to a queue/group and select a
+ queue/group for service based on any of the various available algorithms.
+ BFQ already had support for hierarchical scheduling, so taking those
+ patches was easier.
+- BFQ was designed to provide tighter bounds/delay w.r.t. service provided
+ to a queue. Delay/Jitter with BFQ is O(1).
+ Note: BFQ originally used the amount of IO done (number of sectors) as the
+ notion of service provided. IOW, it tried to provide fairness in terms
+ of actual IO done and not in terms of the actual time disk access was
+ given to a queue.
+ This patchset modifies BFQ to provide fairness in the time domain because
+ that's what CFQ does. The idea was to try not to deviate too much from
+ the CFQ behavior initially.
+ Providing fairness in the time domain makes accounting tricky because,
+ due to command queueing, at any one time there might be multiple requests
+ from different queues in flight and there is no easy way to find out how
+ much disk time was actually consumed by the requests of a particular
+ queue. More about this in comments in source code.
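The idea of tracking virtual time and service per queue can be sketched with a small simulation. This is a deliberately simplified two-queue model and NOT B-WF2Q+ or the patchset's code: the function name simulate, the fixed 10 ms slice and the scale factor of 1000 are all assumptions made for illustration. Each dispatch charges the chosen queue a slice of disk time and advances its virtual time by slice/weight, and the queue with the smallest virtual time is served next.

```shell
# Simplified virtual-time fair queuing between two always-busy queues.
# Arguments: weight of queue 1, weight of queue 2, number of dispatches.
# Prints the disk time (ms) each queue received as "<q1> <q2>".
simulate() {
    w1=$1
    w2=$2
    rounds=$3
    slice=10        # assumed disk time (ms) charged per dispatch
    v1=0            # virtual time of queue 1
    v2=0            # virtual time of queue 2
    s1=0            # real service (ms) given to queue 1
    s2=0            # real service (ms) given to queue 2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Serve the queue with the smaller virtual time (ties go to queue 1).
        if [ "$v1" -le "$v2" ]; then
            s1=$((s1 + slice))
            # Higher weight means virtual time advances more slowly.
            v1=$((v1 + slice * 1000 / w1))
        else
            s2=$((s2 + slice))
            v2=$((v2 + slice * 1000 / w2))
        fi
        i=$((i + 1))
    done
    echo "$s1 $s2"
}

simulate 1000 500 300
```

With weights 1000 and 500 the two queues end up with disk time in a 2:1 ratio. The sketch ignores everything that makes the real problem hard, such as queues going idle, hierarchy, and several requests being in flight at once.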
+We have taken the BFQ code as the starting point for providing fairness among
+groups because it already contained lots of the features we required to
+implement hierarchical IO scheduling. With this patch set, I am not trying to
+ensure O(1) delay, as my goal is to provide fairness among groups. Most
+likely that will mean that latencies are not worse than what cfq currently
+provides (if not improved). Once fairness is ensured, one can look further
+into ensuring O(1) latencies.
+From a data structure point of view, one can think of a tree per device, on
+which io groups and io queues hang and are scheduled using the B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored
+and dispatched from (like cfqq).
+These io queues are primarily created and managed by the end io schedulers,
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
+io_context in a cgroup (apart from async queues).
+A request is mapped to an io group by the elevator layer, and which io queue
+it is mapped to within the group depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence io group) of the
+request. Down the
+line we need to make use of bio-cgroup patches to map delayed writes to
+Going back to old behavior
+In the new scheme of things, essentially we are creating hierarchical fair
+queuing logic in the elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical scheduler and the old non-hierarchical scheduler operating.
+Also noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical manner.
+If hierarchical scheduling is disabled, noop, deadline and AS should retain
+their existing behavior.
+CFQ is the only exception where one can not disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+Various user visible config options
+ - Enables hierarchical fair queuing in noop. Not selecting this option
+ leads to old behavior of noop.
+ - Enables hierarchical fair queuing in deadline. Not selecting this
+ option leads to old behavior of deadline.
+ - Enables hierarchical fair queuing in AS. Not selecting this option
+ leads to old behavior of AS.
+ - Enables hierarchical fair queuing in CFQ. Not selecting this option
+ still does fair queuing among various queues but it is flat and not
+ hierarchical.
+ - This option enables the blkio-cgroup controller for IO tracking
+ purposes. That means that with this controller one can attribute a
+ write to the original cgroup and not assume that it belongs to the
+ submitting thread.
+ - Currently CFQ attributes the writes to the submitting thread and
+ caches the async queue pointer in the io context of the process.
+ If this option is set, it tells cfq and the elevator fair queuing
+ logic to make use of the IO tracking patches for async writes and
+ attribute writes to the original cgroup and not to the thread
+ submitting the write.
+ - Throws extra debug messages in blktrace output, helpful in
+ debugging a hierarchical setup.
+Config options selected automatically
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+ - Enables/Disables the fair queuing logic at elevator layer.
+ - Enables/Disables hierarchical queuing and associated cgroup bits.
+- Lots of code cleanups, testing, bug fixing, optimizations,
+ benchmarking etc...
+- Debug and fix some of the areas where higher weight cgroup async writes
+ are stuck behind lower weight cgroup async writes.
+- Anticipatory code will need more work. It is not working properly currently
+ and needs more thought.
+- Once things start working, planning to look into the core algorithm. It
+ looks complicated and maintains lots of data structures. Need to spend some
+ time to see if it can be simplified.
+- Currently a cgroup setting is global, that is, it is applicable to all
+ the block devices in the system. Probably it will make more sense to
+ make it a per cgroup per device setting so that a cgroup can have
+ different weights on different devices etc.
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+- Enable IO tracking for async writes.
+ (This will automatically select CGROUP_BLKIO)
+- Compile and boot into the kernel, then mount the IO controller and the
+ blkio io tracking controller.
+ mount -t cgroup -o io,blkio none /cgroup
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/io.ioprio
+ echo 500 > /cgroup/test2/io.ioprio
+- Create two files of the same size (say 512MB each) on the same disk
+ (zerofile1, zerofile2) and launch two dd threads in different cgroups to
+ read those files. Make sure the right io scheduler is being used for the
+ block device where the files are present (the one you compiled in
+ hierarchical mode).
+ echo 1 > /proc/sys/vm/drop_caches
+ dd if=/mnt/lv0/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+ dd if=/mnt/lv0/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+- At macro level, the first dd should finish first. To get more precise data,
+ keep looking (with the help of a script) at the io.disk_time and
+ io.disk_sectors files of both the test1 and test2 groups. These tell how
+ much disk time (in milliseconds) each group got and how many sectors each
+ group dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally io.disk_time of the cgroups should be in proportion to the weights.
+ (It is hard to achieve though :-)).
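A monitoring script for the last step could look like the sketch below. It rests on one loud assumption: that each io.disk_time file contains a single integer (milliseconds). The real file format may differ, for example one line per device, in which case the parsing needs adjusting. The function names disk_time_ratio and watch_groups are made up for this example.

```shell
# Print "<test1 ms> <test2 ms> <ratio x100>" for the two test groups.
# ASSUMPTION: io.disk_time holds a single integer; adjust if it does not.
# Argument: mount point of the cgroup hierarchy (e.g. /cgroup).
disk_time_ratio() {
    root=$1
    t1=$(cat "$root/test1/io.disk_time")
    t2=$(cat "$root/test2/io.disk_time")
    if [ "$t2" -eq 0 ]; then
        # Avoid dividing by zero before test2 has received any disk time.
        echo "$t1 $t2 n/a"
    else
        echo "$t1 $t2 $((t1 * 100 / t2))"
    fi
}

# Sample once per second until interrupted (defaults to /cgroup).
watch_groups() {
    while sleep 1; do
        disk_time_ratio "${1:-/cgroup}"
    done
}
```

Run `watch_groups /cgroup` while both dd threads are active. With weights 1000 and 500, the last column should ideally hover around 200, meaning test1 got about twice the disk time of test2.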