[dm-devel] Performance numbers with IO throttling patches (Was: Re: IO scheduler based IO controller V10)

Sat Oct 10 19:53:16 UTC 2009

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:

[..]
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
> 
> That's a bit of a toy.
> 
> Do we have testing results for more enterprisey hardware?  Big storage
> arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)
> 
> 

Hi All,

A couple of days back I posted some performance numbers for the "IO scheduler
controller" and "dm-ioband" here.

http://lkml.org/lkml/2009/10/8/9

Now I have run similar tests with Andrea Righi's IO throttling approach
of max bandwidth control. This is an exercise to understand the pros/cons
of each approach and see how we can take things forward.

Environment
===========
Software
--------
- 2.6.31 kernel
- IO scheduler controller V10 on top of 2.6.31
- IO throttling patch on top of 2.6.31. Patch is available here.

http://www.develer.com/~arighi/linux/patches/io-throttle/old/cgroup-io-throttle-2.6.31.patch

Hardware
--------
A storage array of 5 striped disks of 500GB each.

Used fio jobs for 30 seconds in various configurations. Most of the IO is
direct IO to eliminate the effects of caches.
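
The exact fio job files are not included in this mail. As a rough sketch
only, a 30 second direct IO job of the kind described above could be
launched with a command line like the one built below. The test directory,
file size and job name are hypothetical placeholders; the fio options
themselves (rw, bs, numjobs, direct, runtime, time_based, directory, size,
group_reporting) are standard ones.

```python
import shlex


def fio_command(name, rw, bs, numjobs, direct=True, runtime=30,
                directory="/mnt/test", size="1G"):
    """Build an fio command line for one reader/writer configuration.

    directory and size are placeholders, not values from the original tests.
    """
    args = [
        "fio",
        f"--name={name}",
        f"--rw={rw}",                      # read, randread, write or randwrite
        f"--bs={bs}",                      # block size, e.g. 4k or 64k
        f"--numjobs={numjobs}",            # number of parallel readers/writers
        f"--direct={1 if direct else 0}",  # 1 bypasses the page cache
        f"--runtime={runtime}",            # seconds
        "--time_based",                    # run for the full runtime
        f"--directory={directory}",
        f"--size={size}",
        "--group_reporting",               # aggregate stats over all jobs
    ]
    return " ".join(shlex.quote(a) for a in args)


# 8 direct random readers with bs=4K, as in one of the workloads listed later
print(fio_command("randread", "randread", "4k", 8))
```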

I have run three sets for each test. I am blindly reporting the results of
set2 from each test; otherwise it is too much data to report.

Had a LUN of 2500GB capacity. Used a 200G partition with ext3 file system for
my testing. For IO scheduler controller testing, created two cgroups of
weight 100 each so that effectively the disk is divided half/half between
the two groups.
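
To spell out the half/half arithmetic: with a proportional weight
controller, each group's share is its weight divided by the sum of the
weights of the contending groups. A minimal sketch (the group names are
just labels for illustration):

```python
def weight_shares(weights):
    """Return each group's fraction of disk time for the given weights."""
    total = sum(weights.values())
    return {group: weight / total for group, weight in weights.items()}


shares = weight_shares({"group1": 100, "group2": 100})
print(shares)  # {'group1': 0.5, 'group2': 0.5}
```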

For IO throttling patches I also created two cgroups. The tricky part is
that it is a max bw controller and not a proportional weight controller,
so dividing the disk capacity half/half between two cgroups is tricky. The
reason is that I just don't know the BW capacity of the underlying
storage. Throughput varies so much with the type of workload. For example, on
my array, this is how throughput looks with different workloads.

8 sequential buffered readers 			115 MB/s
8 direct sequential readers bs=64K		64 MB/s
8 direct sequential readers bs=4K		14 MB/s

8 buffered random readers bs=64K		3 MB/s
8 direct random readers bs=64K			15 MB/s
8 direct random readers bs=4K			1.5 MB/s

So throughput seems to be varying from 1.5 MB/s to 115 MB/s depending
on workload. What should be the BW limits per cgroup to divide disk BW
in half/half between two groups?

So I took a conservative estimate: I divided the max bandwidth by 2,
treated the array capacity as 60MB/s and assigned each cgroup 30MB/s. In
some cases I have assigned even 10MB/s or 5MB/s to each cgroup to see the
effects of throttling. I am using "Leaky bucket" policy for all the tests.
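
The arithmetic behind these numbers can be written out as below. The
throughput figures are the ones from the workload list above; reading
"60MB/s" as half of the 115 MB/s peak rounded up is my interpretation of
the estimate, not something stated explicitly.

```python
# Measured throughput in MB/s, copied from the workload list above
measured_mb_s = {
    "8 sequential buffered readers": 115,
    "8 direct sequential readers bs=64K": 64,
    "8 direct sequential readers bs=4K": 14,
    "8 buffered random readers bs=64K": 3,
    "8 direct random readers bs=64K": 15,
    "8 direct random readers bs=4K": 1.5,
}

peak = max(measured_mb_s.values())
slowest = min(measured_mb_s.values())
half_peak = peak / 2                        # conservative estimate: 57.5

# Assumption: 57.5 is rounded to 60 MB/s as the nominal array capacity
assumed_capacity = 60
per_group_limit = assumed_capacity / 2      # two cgroups, half each

print(f"peak={peak} MB/s, slowest={slowest} MB/s, ratio={peak / slowest:.1f}x")
print(f"half of peak={half_peak} MB/s, per-group limit={per_group_limit} MB/s")
```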

As theme of two controllers is different, at some places it might sound
like apples vs oranges comparison. But still it does help...

Multiple Random Reader vs Sequential Reader
===============================================
Generally random readers bring down the throughput of others in the
system. Ran a test to see the impact of an increasing number of random
readers on a single sequential reader in a different group.

Vanilla CFQ
-----------------------------------
[Multiple Random Reader]                      [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   23KB/s    23KB/s    22KB/s    691 msec    1   13519KB/s 468K usec   
2   152KB/s   152KB/s   297KB/s   244K usec   1   12380KB/s 31675 usec  
4   174KB/s   156KB/s   638KB/s   249K usec   1   10860KB/s 36715 usec  
8   49KB/s    11KB/s    310KB/s   1856 msec   1   1292KB/s  990K usec   
16  63KB/s    48KB/s    877KB/s   762K usec   1   3905KB/s  506K usec   
32  35KB/s    27KB/s    951KB/s   2655 msec   1   1109KB/s  1910K usec  
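
The Max-latency columns in these tables mix three notations ("691 msec",
"468K usec", "31675 usec"). For anyone post-processing the numbers, a small
helper like the following converts them to milliseconds. It assumes that
"K" means a factor of 1000, which the tables do not state explicitly.

```python
import re

# Matches strings such as "691 msec", "468K usec" and "31675 usec"
_LATENCY = re.compile(r"^\s*(\d+)(K?)\s*(usec|msec)\s*$")


def latency_ms(text):
    """Convert a Max-latency cell from the tables to milliseconds."""
    match = _LATENCY.match(text)
    if match is None:
        raise ValueError(f"unrecognised latency value: {text!r}")
    value = int(match.group(1))
    if match.group(2):
        value *= 1000          # assumption: K = 1000
    if match.group(3) == "usec":
        return value / 1000.0  # microseconds to milliseconds
    return float(value)


for cell in ("691 msec", "468K usec", "31675 usec"):
    print(cell, "->", latency_ms(cell), "ms")
```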

IO scheduler controller + CFQ
-----------------------------------
[Multiple Random Reader]                      [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   228KB/s   228KB/s   223KB/s   132K usec   1   5551KB/s  129K usec   
2   97KB/s    97KB/s    190KB/s   154K usec   1   5718KB/s  122K usec   
4   115KB/s   110KB/s   445KB/s   208K usec   1   5909KB/s  116K usec   
8   23KB/s    12KB/s    158KB/s   2820 msec   1   5445KB/s  168K usec   
16  11KB/s    3KB/s     145KB/s   5963 msec   1   5418KB/s  164K usec   
32  6KB/s     2KB/s     139KB/s   12762 msec  1   5398KB/s  175K usec   

Notes:
- Sequential reader in group2 seems to be well isolated from random readers
  in group1. Throughput and latency of the sequential reader are stable and
  don't drop as the number of random readers increases in the system.
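
One simple way to put a number on "stable" is the ratio of the best to the
worst sequential reader throughput across the six runs. The sketch below
uses the Agg-bandw values of the sequential reader from the two tables
above; the max/min ratio is just an illustrative metric I picked, not one
used in the tests themselves.

```python
# Sequential reader Agg-bandw in KB/s for nr = 1, 2, 4, 8, 16, 32 random readers
seq_reader_kb_s = {
    "Vanilla CFQ": [13519, 12380, 10860, 1292, 3905, 1109],
    "IO scheduler controller + CFQ": [5551, 5718, 5909, 5445, 5418, 5398],
}


def max_min_ratio(values):
    """Ratio of best to worst throughput; 1.0 means perfectly stable."""
    return max(values) / min(values)


for setup, values in seq_reader_kb_s.items():
    print(f"{setup}: max/min = {max_min_ratio(values):.2f}")
```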

io-throttle + CFQ
------------------
BW limit group1=10 MB/s                       BW limit group2=10 MB/s   
[Multiple Random Reader]                      [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   37KB/s    37KB/s    36KB/s    218K usec   1   8006KB/s  20529 usec  
2   185KB/s   183KB/s   360KB/s   228K usec   1   7475KB/s  33665 usec  
4   188KB/s   171KB/s   699KB/s   262K usec   1   6800KB/s  46224 usec  
8   84KB/s    51KB/s    573KB/s   1800K usec  1   2835KB/s  885K usec   
16  21KB/s    9KB/s     294KB/s   3590 msec   1   437KB/s   1855K usec  
32  34KB/s    27KB/s    980KB/s   2861K usec  1   1145KB/s  1952K usec  

Notes:
- I have set up limits of 10MB/s in both the cgroups. Now the random reader
  group will never achieve that kind of speed, so it will not be throttled
  and then it goes on to impact the throughput and latency of other groups
  in the system.

- Now the key question is how conservative one should be in setting up
  the max BW limit. On this box if a customer has bought a 10MB/s cgroup and
  he is running some random readers, it will kill the throughput of other
  groups in the system and their latencies will shoot up. No isolation in
  this case.

- So in general, max BW provides isolation from high speed groups but it
  does not provide isolation from random reader groups which are moving
  slow.
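
To illustrate why a slow group never gets throttled, here is a toy leaky
bucket simulation. It is not Andrea's io-throttle code; the bucket capacity
(one second worth of the limit) and the 64K IO size are assumptions made
only for this sketch. A group offering roughly 980KB/s against a 10MB/s
limit never overflows the bucket, while a group offering 14MB/s does.

```python
MB = 1024 * 1024


class LeakyBucket:
    """Bytes drain at `rate` per second; IO that overflows must wait."""

    def __init__(self, rate_bytes_s, capacity_bytes):
        self.rate = rate_bytes_s
        self.capacity = capacity_bytes
        self.level = 0.0
        self.last = 0.0

    def submit(self, now, nbytes):
        """Account for nbytes at time `now`; return seconds to sleep."""
        drained = (now - self.last) * self.rate
        self.level = max(0.0, self.level - drained)
        self.last = now
        self.level += nbytes
        overflow = self.level - self.capacity
        return max(0.0, overflow / self.rate)


def total_delay(offered_bytes_s, limit_bytes_s, seconds=30, io_size=64 * 1024):
    """Total sleep imposed on the IOs a group would issue in `seconds`."""
    bucket = LeakyBucket(limit_bytes_s, capacity_bytes=limit_bytes_s)
    interval = io_size / offered_bytes_s   # time between IOs when unthrottled
    now, delay = 0.0, 0.0
    for _ in range(int(seconds / interval)):
        wait = bucket.submit(now, io_size)
        delay += wait
        now += interval + wait
    return delay


print("random readers (~980KB/s):", total_delay(980 * 1024, 10 * MB), "s of delay")
print("fast group (14MB/s):", round(total_delay(14 * MB, 10 * MB), 2), "s of delay")
```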

Multiple Sequential Reader vs Random Reader
===============================================
Now running a reverse test where in one group I am running an increasing
number of sequential readers and in the other group I am running one random
reader, to see the impact of the sequential readers on the random reader.

Vanilla CFQ
-----------------------------------
[Multiple Sequential Reader]                  [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   13978KB/s 13978KB/s 13650KB/s 27614 usec  1   22KB/s    227 msec    
2   6225KB/s  6166KB/s  12101KB/s 568K usec   1   10KB/s    457 msec    
4   4052KB/s  2462KB/s  13107KB/s 322K usec   1   6KB/s     841 msec    
8   1899KB/s  557KB/s   12960KB/s 829K usec   1   13KB/s    1628 msec   
16  1007KB/s  279KB/s   13833KB/s 1629K usec  1   10KB/s    3236 msec   
32  506KB/s   98KB/s    13704KB/s 3389K usec  1   6KB/s     3238 msec   

IO scheduler controller + CFQ
-----------------------------------
[Multiple Sequential Reader]                  [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   5721KB/s  5721KB/s  5587KB/s  126K usec   1   223KB/s   126K usec   
2   3216KB/s  1442KB/s  4549KB/s  349K usec   1   224KB/s   176K usec   
4   1895KB/s  640KB/s   5121KB/s  775K usec   1   222KB/s   189K usec   
8   957KB/s   285KB/s   6368KB/s  1680K usec  1   223KB/s   142K usec   
16  458KB/s   132KB/s   6455KB/s  3343K usec  1   219KB/s   165K usec   
32  248KB/s   55KB/s    6001KB/s  6957K usec  1   220KB/s   504K usec   

Notes:
- Random reader is well isolated from increasing number of sequential
  readers in other group. BW and latencies are stable.

io-throttle + CFQ
-----------------------------------
BW limit group1=10 MB/s                       BW limit group2=10 MB/s   
[Multiple Sequential Reader]                  [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   8200KB/s  8200KB/s  8007KB/s  20275 usec  1   37KB/s    217K usec   
2   3926KB/s  3919KB/s  7661KB/s  122K usec   1   16KB/s    441 msec    
4   2271KB/s  1497KB/s  7672KB/s  611K usec   1   9KB/s     927 msec    
8   1113KB/s  513KB/s   7507KB/s  849K usec   1   21KB/s    1020 msec   
16  661KB/s   236KB/s   7959KB/s  1679K usec  1   13KB/s    2926 msec   
32  292KB/s   109KB/s   7864KB/s  3446K usec  1   8KB/s     3439 msec   

BW limit group1=5 MB/s                        BW limit group2=5 MB/s    
[Multiple Sequential Reader]                  [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   4686KB/s  4686KB/s  4576KB/s  21095 usec  1   57KB/s    219K usec   
2   2298KB/s  2179KB/s  4372KB/s  132K usec   1   37KB/s    431K usec   
4   1245KB/s  1019KB/s  4449KB/s  324K usec   1   26KB/s    835 msec    
8   584KB/s   403KB/s   4109KB/s  833K usec   1   30KB/s    1625K usec  
16  346KB/s   252KB/s   4605KB/s  1641K usec  1   129KB/s   3236K usec  
32  175KB/s   56KB/s    4269KB/s  3236K usec  1   8KB/s     3235 msec   

Notes:

- The above result is surprising to me. I have run it twice. In the first run, I
  set the per cgroup limit to 10MB/s and in the second run I set it to 5MB/s. In
  both cases, as the number of sequential readers increases in the other group,
  the random reader's throughput decreases and latencies increase. This is
  happening despite the fact that sequential readers are being throttled
  to make sure they do not impact the workload in the other group. Wondering why
  random readers are not seeing consistent throughput and latencies.

- Andrea, can you please also run similar tests to see if you see the same
  results or not. This is to rule out any testing methodology errors or
  scripting bugs. :-). I have also collected snapshots of some cgroup
  files like bandwidth-max, throttlecnt, and stats. Let me know if you want
  those to see what is happening here.

Multiple Sequential Reader vs Sequential Reader
===============================================
- This time random readers are out of the picture and I am trying to
  see the effect of an increasing number of sequential readers on another
  sequential reader running in a different group.

Vanilla CFQ
-----------------------------------
[Multiple Sequential Reader]                  [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   6325KB/s  6325KB/s  6176KB/s  114K usec   1   6902KB/s  120K usec   
2   4588KB/s  3102KB/s  7510KB/s  571K usec   1   4564KB/s  680K usec   
4   3242KB/s  1158KB/s  9469KB/s  495K usec   1   3198KB/s  410K usec   
8   1775KB/s  459KB/s   12011KB/s 1178K usec  1   1366KB/s  818K usec   
16  943KB/s   296KB/s   13285KB/s 1923K usec  1   728KB/s   1816K usec  
32  511KB/s   148KB/s   13555KB/s 3286K usec  1   391KB/s   3212K usec  

IO scheduler controller + CFQ
-----------------------------------
[Multiple Sequential Reader]                  [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   6781KB/s  6781KB/s  6622KB/s  109K usec   1   6691KB/s  115K usec   
2   3758KB/s  1876KB/s  5502KB/s  693K usec   1   6373KB/s  419K usec   
4   2100KB/s  671KB/s   5751KB/s  987K usec   1   6330KB/s  569K usec   
8   1023KB/s  355KB/s   6969KB/s  1569K usec  1   6086KB/s  120K usec   
16  520KB/s   130KB/s   7094KB/s  3140K usec  1   5984KB/s  119K usec   
32  245KB/s   86KB/s    6621KB/s  6571K usec  1   5850KB/s  113K usec   

Notes:
- BW and latencies of sequential reader in group 2 are fairly stable as
  number of readers increase in first group.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s                       BW limit group2=30 MB/s   
[Multiple Sequential Reader]                  [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   6343KB/s  6343KB/s  6195KB/s  116K usec   1   6993KB/s  109K usec   
2   4583KB/s  3046KB/s  7451KB/s  583K usec   1   4516KB/s  433K usec   
4   2945KB/s  1324KB/s  9552KB/s  602K usec   1   3001KB/s  583K usec   
8   1804KB/s  473KB/s   12257KB/s 861K usec   1   1386KB/s  815K usec   
16  942KB/s   265KB/s   13560KB/s 1659K usec  1   718KB/s   1658K usec  
32  462KB/s   143KB/s   13757KB/s 3482K usec  1   409KB/s   3480K usec  

Notes:
- BW decreases and latencies increase in group2 as the number of readers
  increases in the first group. This should be due to the fact that no
  throttling will happen as none of the groups is hitting the limit of
  30MB/s. To me this is the tricky part. How is a service provider supposed
  to set the limits of groups? If groups are not hitting max limits, they will
  still impact the BW and latencies in other groups.

BW limit group1=10 MB/s                       BW limit group2=10 MB/s   
[Multiple Sequential Reader]                  [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   4128KB/s  4128KB/s  4032KB/s  215K usec   1   4076KB/s  170K usec   
2   2880KB/s  1886KB/s  4655KB/s  291K usec   1   2891KB/s  212K usec   
4   1912KB/s  888KB/s   5872KB/s  417K usec   1   1881KB/s  411K usec   
8   1032KB/s  432KB/s   7312KB/s  841K usec   1   853KB/s   816K usec   
16  540KB/s   259KB/s   7844KB/s  1728K usec  1   503KB/s   1609K usec  
32  291KB/s   111KB/s   7920KB/s  3417K usec  1   249KB/s   3205K usec  

Notes:
- Same test with 10MB/s as group limit. This is again a surprising result.
  Max BW in first group is being throttled but still throughput is
  dropping significantly in second group and latencies are on the rise.

- The limit of the first group is 10MB/s but it is achieving a max BW of
  around 8MB/s only. What happened to the remaining 2MB/s?

- Andrea, again, please do run this test. The throughput drop in second
  group stumps me and forces me to think if I am doing something wrong.  
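
For reference, the gap between the limit and what group1 actually achieved
can be computed from the Agg-bandw column of the 10MB/s table above. The
sketch assumes 1 MB = 1024 KB; the mail does not say which convention the
limit uses, so treat the exact figures as approximate.

```python
LIMIT_KB_S = 10 * 1024   # assumption: 10MB/s limit expressed as 10240 KB/s

# group1 Agg-bandw in KB/s for nr = 1, 2, 4, 8, 16, 32 sequential readers
group1_agg_kb_s = {1: 4032, 2: 4655, 4: 5872, 8: 7312, 16: 7844, 32: 7920}


def shortfall_kb_s(limit_kb_s, achieved_kb_s):
    """Bandwidth the group was allowed to use but did not, in KB/s."""
    return limit_kb_s - achieved_kb_s


for nr, achieved in group1_agg_kb_s.items():
    used = achieved / LIMIT_KB_S
    print(f"nr={nr:2d}: {achieved} KB/s, {used:.0%} of limit, "
          f"{shortfall_kb_s(LIMIT_KB_S, achieved)} KB/s unused")
```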

BW limit group1=5 MB/s                        BW limit group2=5 MB/s    
[Multiple Sequential Reader]                  [Sequential Reader]       
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   2434KB/s  2434KB/s  2377KB/s  110K usec   1   2415KB/s  120K usec   
2   1639KB/s  1186KB/s  2759KB/s  222K usec   1   1709KB/s  220K usec   
4   1114KB/s  648KB/s   3314KB/s  420K usec   1   1163KB/s  414K usec   
8   567KB/s   366KB/s   4060KB/s  901K usec   1   527KB/s   816K usec   
16  329KB/s   179KB/s   4324KB/s  1613K usec  1   311KB/s   1613K usec  
32  178KB/s   70KB/s    4320KB/s  3235K usec  1   163KB/s   3209K usec  

Notes:
- Setting the limit to 5MB/s per group also does not seem to be helping
  the second group.

Multiple Random Writer vs Random Reader
===============================================
This time I am running multiple random writers in the first group to see the
impact on throughput and latency of a random reader in a different group.

Vanilla CFQ
-----------------------------------
[Multiple Random Writer]                      [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   64018KB/s 64018KB/s 62517KB/s 353K usec   1   190KB/s   96 msec     
2   35298KB/s 35257KB/s 68899KB/s 208K usec   1   76KB/s    2416 msec   
4   16387KB/s 14662KB/s 60630KB/s 3746K usec  1   106KB/s   2308K usec  
8   5106KB/s  3492KB/s  33335KB/s 2995K usec  1   193KB/s   2292K usec  
16  3676KB/s  3002KB/s  51807KB/s 2283K usec  1   72KB/s    2298K usec  
32  2169KB/s  1480KB/s  56882KB/s 1990K usec  1   35KB/s    1093 msec   

IO scheduler controller + CFQ
-----------------------------------
[Multiple Random Writer]                      [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   20369KB/s 20369KB/s 19892KB/s 877K usec   1   255KB/s   137K usec   
2   14347KB/s 14288KB/s 27964KB/s 1010K usec  1   228KB/s   117K usec   
4   6996KB/s  6701KB/s  26775KB/s 1362K usec  1   221KB/s   180K usec   
8   2849KB/s  2770KB/s  22007KB/s 2660K usec  1   250KB/s   485K usec   
16  1463KB/s  1365KB/s  22384KB/s 2606K usec  1   254KB/s   115K usec   
32  799KB/s   681KB/s   22404KB/s 2879K usec  1   266KB/s   107K usec   

Notes:
- BW and latencies of random reader in second group are fairly stable.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s                       BW limit group2=30 MB/s   
[Multiple Random Writer]                      [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   21920KB/s 21920KB/s 21406KB/s 1017K usec  1   353KB/s   432K usec   
2   14291KB/s 9626KB/s  23357KB/s 1832K usec  1   362KB/s   177K usec   
4   7130KB/s  5135KB/s  24736KB/s 1336K usec  1   348KB/s   425K usec   
8   3165KB/s  2949KB/s  23792KB/s 2133K usec  1   336KB/s   146K usec   
16  1653KB/s  1406KB/s  23694KB/s 2198K usec  1   337KB/s   115K usec   
32  793KB/s   717KB/s   23198KB/s 2195K usec  1   330KB/s   192K usec   

BW limit group1=10 MB/s                       BW limit group2=10 MB/s   
[Multiple Random Writer]                      [Random Reader]           
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   7903KB/s  7903KB/s  7718KB/s  1037K usec  1   474KB/s   103K usec   
2   4496KB/s  4428KB/s  8715KB/s  1091K usec  1   450KB/s   553K usec   
4   2153KB/s  1827KB/s  7914KB/s  2042K usec  1   458KB/s   108K usec   
8   1129KB/s  1087KB/s  8688KB/s  1280K usec  1   432KB/s   98215 usec  
16  606KB/s   527KB/s   8668KB/s  2303K usec  1   426KB/s   90609 usec  
32  312KB/s   259KB/s   8599KB/s  2557K usec  1   441KB/s   95283 usec  

Notes:
- IO throttling seems to be working really well here. Random writers are
  contained in the first group and this gives stable BW and latencies
  to random reader in second group.

Multiple Buffered Writer vs Buffered Writer
===========================================
This time I run multiple buffered writers in group1 and a single
buffered writer in the other group to see if we can provide fairness and
isolation.

Vanilla CFQ
------------
[Multiple Buffered Writer]                    [Buffered Writer]         
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   68997KB/s 68997KB/s 67380KB/s 645K usec   1   67122KB/s 567K usec   
2   47509KB/s 46218KB/s 91510KB/s 865K usec   1   45118KB/s 865K usec   
4   28002KB/s 26906KB/s 105MB/s   1649K usec  1   26879KB/s 1643K usec  
8   15985KB/s 14849KB/s 117MB/s   943K usec   1   15653KB/s 766K usec   
16  11567KB/s 6881KB/s  128MB/s   1174K usec  1   7333KB/s  947K usec   
32  5877KB/s  3649KB/s  130MB/s   1205K usec  1   5142KB/s  988K usec   

IO scheduler controller + CFQ
-----------------------------------
[Multiple Buffered Writer]                    [Buffered Writer]         
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   68580KB/s 68580KB/s 66972KB/s 2901K usec  1   67194KB/s 2901K usec  
2   47419KB/s 45700KB/s 90936KB/s 3149K usec  1   44628KB/s 2377K usec  
4   27825KB/s 27274KB/s 105MB/s   1177K usec  1   27584KB/s 1177K usec  
8   15382KB/s 14288KB/s 114MB/s   1539K usec  1   14794KB/s 783K usec   
16  9161KB/s  7592KB/s  124MB/s   3177K usec  1   7713KB/s  886K usec   
32  4928KB/s  3961KB/s  126MB/s   1152K usec  1   6465KB/s  4510K usec  

Notes:
- It does not work. The buffered writer in the second group is being
  overwhelmed by the writers in group1.

- This is a limitation of the IO scheduler based controller currently, as the
  page cache at the higher layer evens out the traffic and does not throw more
  traffic from the higher weight group.

- This is something that needs more work at higher layers, like dirty limits
  per cgroup in the memory controller and a method to write out buffered
  pages belonging to a particular memory cgroup. This is still being
  brainstormed.

io-throttle + CFQ
-----------------------------------
BW limit group1=30 MB/s                       BW limit group2=30 MB/s   
[Multiple Buffered Writer]                    [Buffered Writer]         
nr  Max-bandw Min-bandw Agg-bandw Max-latency nr  Agg-bandw Max-latency 
1   33863KB/s 33863KB/s 33070KB/s 3046K usec  1   25165KB/s 13248K usec 
2   13457KB/s 12906KB/s 25745KB/s 9286K usec  1   29958KB/s 3736K usec  
4   7414KB/s  6543KB/s  27145KB/s 10557K usec 1   30968KB/s 8356K usec  
8   3562KB/s  2640KB/s  24430KB/s 12012K usec 1   30801KB/s 7037K usec  
16  3962KB/s  881KB/s   26632KB/s 12650K usec 1   31150KB/s 7173K usec  
32  3275KB/s  406KB/s   27295KB/s 14609K usec 1   26328KB/s 8069K usec  

Notes:
- This seems to work well here. io-throttle is throttling the writers
  before they write too much data into the page cache. One side effect of
  this seems to be that a process will no longer be allowed to write at
  memory speed into the page cache and will be limited to the disk IO speed
  limits set for the cgroup.

  Andrea is thinking of removing throttling in balance_dirty_pages() to allow
  writing at disk speed till we hit dirty_limits. But removing it leads
  to a different issue where too many dirty pages from a single cgroup can
  be present in the page cache, and if that cgroup is a slow moving
  one, then pages are flushed to disk at a slower speed, delaying other
  higher rate cgroups. (All discussed in private mails with Andrea.)

ioprio class and iopriority with-in cgroups issues with IO-throttle
===================================================================

Currently the throttling logic is designed in such a way that it makes the
throttling uniform for every process in the group. So we will lose the
differentiation between different classes of processes or the differentiation
between different priorities of processes within the group.

I have run tests for these in the past and reported the results here.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

Thanks
Vivek