[dm-devel] [PATCH 01/28] io-controller: Documentation
- From: Vivek Goyal <vgoyal redhat com>
- To: linux-kernel vger kernel org, jens axboe oracle com
- Cc: dhaval linux vnet ibm com, peterz infradead org, dm-devel redhat com, dpshah google com, agk redhat com, balbir linux vnet ibm com, paolo valente unimore it, jmarchan redhat com, guijianfeng cn fujitsu com, fernando oss ntt co jp, mikew google com, jmoyer redhat com, nauman google com, mingo elte hu, vgoyal redhat com, m-ikeda ds jp nec com, riel redhat com, lizf cn fujitsu com, fchecconi gmail com, s-uchida ap jp nec com, containers lists linux-foundation org, akpm linux-foundation org, righi andrea gmail com, torvalds linux-foundation org
- Subject: [dm-devel] [PATCH 01/28] io-controller: Documentation
- Date: Thu, 24 Sep 2009 15:25:05 -0400
o Documentation for io-controller.
Signed-off-by: Vivek Goyal <vgoyal redhat com>
Acked-by: Rik van Riel <riel redhat com>
---
Documentation/block/00-INDEX | 2 +
Documentation/block/io-controller.txt | 464 +++++++++++++++++++++++++++++++++
2 files changed, 466 insertions(+), 0 deletions(-)
create mode 100644 Documentation/block/io-controller.txt
diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
+io-controller.txt
+ - IO controller for providing hierarchical IO scheduling
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..f2bfce6
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,464 @@
+ IO Controller
+ =============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and the individual IO schedulers to
+do IO control. Hence this IO controller works only on block devices which use
+one of the standard IO schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying the IO schedulers is that resource
+control is primarily needed on leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of these. Some part of sdb is in lv0 and some part
+is in lv1.
+
+               lv0       lv1
+              /   \     /   \
+           sda      sdb      sdc
+
+Also consider the following cgroup hierarchy:
+
+                    root
+                   /    \
+                  A      B
+                 / \    / \
+                T1 T2  T3 T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 only and T3 and T4 are doing IO on
+lv1 only, there will not be any contention for resources between groups A and
+B if IO is going to sda or sdc. But if the actual IO gets translated to disk
+sdb, then the IO scheduler associated with sdb will distribute disk bandwidth
+to groups A and B in proportion to their weights.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. It is flat, however, and
+with cgroups it needs to be made hierarchical to achieve good hierarchical
+control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+                --------------------------------
+                | Elevator Layer + Fair Queuing |
+                --------------------------------
+                 |          |          |       |
+                NOOP     DEADLINE      AS     CFQ
+
+Design
+======
+This patchset takes inspiration from the CFS cpu scheduler, CFQ and BFQ to
+come up with the core of hierarchical scheduling. Like CFQ, we give time
+slices to every queue based on its priority. Like CFS, the disk time given to
+a queue is converted to virtual disk time (vdisktime) based on the queue's
+weight, and based on this vdisktime we decide which queue is dispatched next.
+And like BFQ, we maintain a cache of recently served queues and derive the new
+vdisktime of a queue from the cache if the queue was recently served.
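
The weight-to-vdisktime conversion described above can be pictured with a
small sketch. This is not the patchset's code: the scaling formula (real disk
time multiplied by a reference weight and divided by the queue's weight), the
BASE_WEIGHT value and the function names are all assumptions made for
illustration, in the spirit of how CFS scales runtime.

```shell
#!/bin/sh
# Hypothetical reference weight (assumption): a queue at this weight is
# charged virtual time equal to the real disk time it used.
BASE_WEIGHT=1000

# vdisktime_charge <disk_time_ms> <weight>
# Scale real disk time by the inverse of the weight (CFS-style assumption).
vdisktime_charge() {
    echo $(( $1 * BASE_WEIGHT / $2 ))
}

# pick_next <name:vdisktime>...
# Select the queue with the smallest vdisktime; ties go to the first one.
pick_next() {
    best_name=""
    best_vt=""
    for entry in "$@"; do
        name=${entry%%:*}
        vt=${entry##*:}
        if [ -z "$best_vt" ] || [ "$vt" -lt "$best_vt" ]; then
            best_name=$name
            best_vt=$vt
        fi
    done
    echo "$best_name"
}

# Two queues that each consumed a 100 ms slice.
a=$(vdisktime_charge 100 1000)   # queue A, weight 1000
b=$(vdisktime_charge 100 500)    # queue B, weight 500
echo "A=$a B=$b next=$(pick_next "A:$a" "B:$b")"
```

Under these assumptions the lower weight queue accumulates vdisktime twice as
fast for the same real disk time, so the higher weight queue is picked again
sooner.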
+
+From a data structure point of view, one can think of a tree per device, on
+which io groups and io queues hang and are scheduled. An io_queue is the end
+queue where requests are actually stored and dispatched from (like cfqq).
+
+These io queues are primarily created and managed by the end io schedulers,
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, and CFQ keeps one io queue per
+io_context in a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue
+it is mapped to within the group depends on the ioscheduler. Noop, deadline
+and AS don't maintain separate queues per task, hence there is only one
+io_queue per group. So once we have found the right group, we have also found
+the right queue. CFQ maintains multiple io queues within a group based on task
+context and maps the request to the right queue in the group.
+
+Sync requests are mapped to the right group and queue based on the "current"
+task. Async requests can be mapped using either the "current" task or the
+owner of the page (the blkio cgroup subsystem provides this bio/page tracking
+mechanism). This choice is controlled by the config option
+"CONFIG_TRACK_ASYNC_CONTEXT".
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical scheduler and the old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical manner.
+If hierarchical scheduling is disabled, noop, deadline and AS should retain
+their existing behavior.
+
+CFQ is the only exception, where one can not disable fair queuing as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode. So CFQ has to use fair queuing logic from the common layer, but it can
+choose to enable only flat support and not enable hierarchical (group
+scheduling) support.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+ - Enables hierarchical fair queuing in noop. Not selecting this option
+ leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+ - Enables hierarchical fair queuing in deadline. Not selecting this
+ option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+ - Enables hierarchical fair queuing in AS. Not selecting this option
+ leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+ - Enables hierarchical fair queuing in CFQ. Not selecting this option
+ still does fair queuing among various queues but it is flat and not
+ hierarchical.
+
+CGROUP_BLKIO
+ - This option enables blkio-cgroup controller for IO tracking
+ purposes. That means, by this controller one can attribute a write
+ to the original cgroup and not assume that it belongs to submitting
+ thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+ - Currently CFQ attributes the writes to the submitting thread and
+ caches the async queue pointer in the io context of the process.
+ If this option is set, it tells cfq and elevator fair queuing logic
+ that for async writes make use of IO tracking patches and attribute
+ writes to original cgroup and not to write submitting thread.
+
+ This should be primarily useful when lots of asynchronous writes
+ are being submitted by pdflush threads and we need to assign the
+ writes to right group.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+ - Throws extra debug messages in blktrace output, helpful in
+ debugging a hierarchical setup.
+
+ - Also allows for export of extra debug statistics like group queue
+ and dequeue statistics on device through cgroup interface.
+
+CONFIG_DEBUG_ELV_FAIR_QUEUING
+ - Enables some vdisktime related debugging messages.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+ - Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+ - Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+You can do a very simple test by running two dd threads in two different
+cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+ CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+ CONFIG_TRACK_ASYNC_CONTEXT=y
+
+ (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+ controller.
+
+ mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+ mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+ echo 1000 > /cgroup/test1/io.weight
+ echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+ echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two files of the same size (say 512MB each) on the same disk
+ (zerofile1, zerofile2) and launch two dd threads in different cgroups to read
+ those files. Make sure the right io scheduler is being used for the block
+ device where the files are present (the one you compiled in hierarchical
+ mode).
+
+ sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ dd if=/mnt/sdb/zerofile1 of=/dev/null &
+ echo $! > /cgroup/test1/tasks
+ cat /cgroup/test1/tasks
+
+ dd if=/mnt/sdb/zerofile2 of=/dev/null &
+ echo $! > /cgroup/test2/tasks
+ cat /cgroup/test2/tasks
+
+- At macro level, the first dd should finish first. To get more precise data,
+ keep looking (with the help of a script) at the io.disk_time and
+ io.disk_sectors files of both the test1 and test2 groups. These tell how
+ much disk time (in milliseconds) each group got and how many sectors each
+ group dispatched to the disk. We provide fairness in terms of disk time, so
+ ideally io.disk_time of the cgroups should be in proportion to the weights.
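
The monitoring script mentioned in the last step could look roughly like the
sketch below. It assumes that each line of io.disk_time and io.disk_sectors
has the form "major:minor value"; the function name show_io_stats is
hypothetical and not part of the patchset.

```shell
#!/bin/sh
# show_io_stats <cgroup_dir>...
# Print one line per device and counter: "<group> <file> <dev> <value>".
# Assumes each line of the stat files looks like "major:minor value".
show_io_stats() {
    for grp in "$@"; do
        for f in io.disk_time io.disk_sectors; do
            # Skip counters that are missing or unreadable.
            [ -r "$grp/$f" ] || continue
            while read -r dev value; do
                echo "$(basename "$grp") $f $dev $value"
            done < "$grp/$f"
        done
    done
}

# Example polling loop, left commented out so that sourcing this file has
# no side effects. /cgroup is the mount point used in the steps above.
#   while sleep 2; do show_io_stats /cgroup/test1 /cgroup/test2; done
```

Comparing the io.disk_time values of test1 and test2 over successive samples
shows whether the split is approaching the 1000:500 weight ratio.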
+
+What Works and What Does not
+============================
+Service differentiation at application level can be noticed only if completely
+parallel IO paths are created from application to IO scheduler and there
+are no serializations introduced by any intermediate layer. For example,
+in some cases file system and page cache layer introduce serialization and
+we don't see service difference between higher weight and lower weight
+process groups.
+
+For example, when I start an O_SYNC write out on an ext3 file system (file
+is being created newly), I see lots of activity from kjournald. I have not
+gone into details yet, but my understanding is that there are a lot more
+journal commits and kjournald effectively introduces serialization between the
+two processes. So even if you put these two processes in two different cgroups
+with different weights, the higher weight process will not see more IO done.
+
+It does work very well when we bypass the filesystem layer and IO is raw. For
+example, in the virtual machine setup described under "Some High Level Test
+setups", the host sees raw synchronous writes coming from two guest machines
+and the filesystem layer at the host is not introducing any kind of
+serialization, hence we can see the service difference.
+
+It also works very well for reads, even on the same file system, as for reads
+file system journalling activity does not kick in and we can create parallel
+IO paths from the application all the way down to the IO scheduler and get
+more IO done on the IO path with higher weight.
+
+Details of new ioscheduler tunables
+===================================
+
+group_idle
+-----------
+
+"group_idle" specifies the duration one should wait for a new request before
+a group is expired. This is very similar to the "slice_idle" parameter of cfq.
+The difference is that slice_idle specifies the queue idling period and
+group_idle specifies the group idling period. Another difference is that cfq
+idling is dynamically updated based on traffic pattern, whereas group idling
+is currently static.
+
+Group idling takes place when a group is empty at the time it is being
+expired. If an empty group is expired and it gets a request shortly afterwards
+(say after 1 ms), it loses its fair share, as upon expiry it will be deleted
+from the service tree, a new queue will be selected to run and min_vdisktime
+will be updated on the service tree.
+
+There are both advantages and disadvantages of enabling group_idle. If
+enabled, it ensures that a group gets its fair share of disk time (as long
+as the group gets a new request within the group_idle period). So even if a
+single sequential reader is running in a group, it will get the disk time
+depending on the group weight. IOW, enabling it provides very strong isolation
+between groups.
+
+The flip side is that it makes the group a heavier entity with slow switching
+between groups. There are many cases where CFQ disables the idling on the
+queue and hence queue gets expired as soon as requests are over in the queue
+and CFQ moves to new queue. This way it achieves faster switching and in many
+cases better throughput (most of the time seeky processes will not have idling
+enabled and will get very limited access to disk).
+
+If group idling is disabled, a group will get fairness only if it is
+continuously backlogged. So this weakens the fairness guarantees and isolation
+between the groups but can help achieve faster switching between queues/groups
+and better throughput.
+
+So one should set "group_idle" depending on one's use case and based on need.
+
+For the time being it is enabled by default.
+
+"fairness"
+----------
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1.
+
+If fairness is set to 1, then the IO controller waits for requests from the
+previous queue to finish before requests from the new queue are dispatched.
+This helps in doing better accounting of disk time consumed by a queue. If
+this is not done, then on queuing hardware there can be requests from multiple
+queues in flight and we will not have any idea which queue consumed how much
+disk time.
+
+So if "fairness" is set, it can help achieve better time accounting. But the
+flip side is that it can slow down switching between queues and also lower the
+throughput.
+
+Again, this parameter should be set/reset based on the need. For the time
+being it is disabled by default.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+ - Specifies class of the cgroup (RT, BE, IDLE). This is default io
+ class of the group on all the devices until and unless overridden by
+ per device rule. (See io.policy).
+
+ 1 = RT; 2 = BE, 3 = IDLE
+
+- io.weight
+ - Specifies per cgroup weight. This is default weight of the group
+ on all the devices until and unless overridden by per device rule.
+ (See io.policy).
+
+ Currently allowed range of weights is from 100 to 1000.
+
+- io.disk_time
+ - disk time allocated to cgroup per device in milliseconds. First
+ two fields specify the major and minor number of the device and
+ third field specifies the disk time allocated to group in
+ milliseconds.
+
+- io.disk_sectors
+ - number of sectors transferred to/from disk by the group. First
+ two fields specify the major and minor number of the device and
+ third field specifies the number of sectors transferred by the
+ group to/from the device.
+
+- io.disk_queue
+ - Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+ gives statistics about how many times a group was queued
+ on the service tree of the device. First two fields specify the major
+ and minor number of the device and third field specifies the number
+ of times a group was queued on a particular device.
+
+- io.disk_dequeue
+ - Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+ gives statistics about how many times a group was de-queued
+ or removed from the service tree of the device. This basically gives
+ an idea of whether we can generate enough IO to create continuously
+ backlogged groups. First two fields specify the major and minor
+ number of the device and third field specifies the number
+ of times a group was de-queued on a particular device.
+
+- io.policy
+ - One can specify per cgroup per device rules using this interface.
+ These rules override the default value of group weight and class as
+ specified by io.weight and io.ioprio_class.
+
+ Following is the format.
+
+ # echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+ weight=0 means removing a policy.
+
+ Examples:
+
+ Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+ # echo 8:16 300 2 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:16 300 2
+
+ Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+ # echo 8:0 500 1 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:0 500 1
+ 8:16 300 2
+
+ Remove the policy for /dev/sda in this cgroup
+ # echo 8:0 0 1 > io.policy
+ # cat io.policy
+ dev weight class
+ 8:16 300 2
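
A rule can be sanity checked before it is written to io.policy. The sketch
below is only an illustration: the helper name policy_line is hypothetical,
and it assumes that the 100 to 1000 range documented for io.weight also
applies to per device weights, with 0 reserved for removing a rule.

```shell
#!/bin/sh
# policy_line <major:minor> <weight> <ioprio_class>
# Print a validated io.policy rule, or return 1 on bad input.
policy_line() {
    dev=$1 weight=$2 class=$3
    # Device must look like "<digits>:<digits>".
    case $dev in
        *[!0-9:]*|*:*:*|:*|*:|"") return 1 ;;
        *:*) ;;
        *) return 1 ;;
    esac
    # Weight must be a non-negative integer.
    case $weight in ''|*[!0-9]*) return 1 ;; esac
    # Class: 1 = RT, 2 = BE, 3 = IDLE.
    case $class in 1|2|3) ;; *) return 1 ;; esac
    # Weight 0 removes the rule; otherwise assume the io.weight range applies.
    if [ "$weight" -ne 0 ] && { [ "$weight" -lt 100 ] || [ "$weight" -gt 1000 ]; }; then
        return 1
    fi
    echo "$dev $weight $class"
}

# Possible use from inside a cgroup directory (not executed here):
#   policy_line 8:16 300 2 > io.policy
```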
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where the io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If
+these request descriptors are exhausted, processes will be put to sleep and
+woken up once request descriptors are available.
+
+With io controller and cgroup stuff, one can not afford to allocate requests
+from single pool as one group might allocate lots of requests and then tasks
+from other groups might be put to sleep and this other group might be a
+higher weight group. Hence to make sure that a group always can get the
+request descriptors it is entitled to, one needs to make request descriptor
+limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control the total number of request
+descriptors on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This
+makes sure that apart from the root group one can create 3 more groups without
+running into any issues. If one decides to create more cgroups, nr_requests
+and nr_group_requests should be adjusted accordingly.
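
The sizing rule above is simple arithmetic and can be scripted. In this
sketch the function name required_nr_requests is hypothetical, and the group
count is taken to include the root group, as in the default case described
above.

```shell
#!/bin/sh
# required_nr_requests <number_of_cgroups> <nr_group_requests>
# number_of_cgroups includes the root group.
required_nr_requests() {
    echo $(( $1 * $2 ))
}

# Root group plus 3 more groups, 128 descriptors per group.
required_nr_requests 4 128
```

With the defaults this reproduces nr_requests=512. For the root group plus 8
more groups the same call with 9 and 128 gives 1152, which would then be
written to /sys/block/<disk>/queue/nr_requests.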
+
+Some High Level Test setups
+===========================
+One of the use cases of IO controller is to provide some kind of IO isolation
+between multiple virtual machines on the same host. Following is one
+example setup which worked for me.
+
+
+                    KVM                 KVM
+                   Guest1              Guest2
+                 ---------           ---------
+                |  -----  |         |  -----  |
+                | | vdb | |         | | vdb | |
+                |  -----  |         |  -----  |
+                 ---------           ---------
+
+                 -----------------------------
+                |            Host             |
+                |        -------------        |
+                |       | sdb1 | sdb2 |       |
+                |        -------------        |
+                 -----------------------------
+
+On the host machine, I had a spare SATA disk. I created two partitions, sdb1
+and sdb2, and gave these partitions as additional storage to the kvm guests:
+sdb1 to KVM guest1 and sdb2 to KVM guest2. This storage appeared as /dev/vdb
+in both the guests. I formatted /dev/vdb, created an ext3 file system and
+started a 1G file writeout in both the guests. Before the writeout I had
+created two cgroups of weight 1000 and 500 and put the virtual machines in two
+different groups.
+
+Following is the write I started in both the guests.
+
+dd if=/dev/zero of=/mnt/vdc/zerofile1 bs=4K count=262144 conv=fdatasync
+
+Following are the results on host with "deadline" scheduler.
+
+group1 time=8:16 17254 group1 sectors=8:16 2104288
+group2 time=8:16 8498 group2 sectors=8:16 1007040
+
+Virtual machine with cgroup weight 1000 got almost double the time of virtual
+machine with weight 500.
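
To check how close the measured split is to the configured weights, the
ratios can be computed from the numbers reported above. The helper name ratio
is hypothetical; the inputs are the weights and the disk time and sector
counts from the results.

```shell
#!/bin/sh
# ratio <a> <b>: print a/b rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "expected (weights): $(ratio 1000 500)"
echo "disk time:          $(ratio 17254 8498)"
echo "sectors:            $(ratio 2104288 1007040)"
```

Both measured ratios come out slightly above the configured 2:1 weight ratio.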
--
1.6.0.6