[dm-devel] [Multipath] Round-robin performance limit

John A. Sullivan III jsullivan at opensourcedevel.com
Mon May 2 22:27:18 UTC 2011


On Mon, 2011-05-02 at 09:36 -0400, Adam Chasen wrote:
> Lowering rr_min_io provides marginal improvement. I see 6MB/s
> improvement at an rr_min_io of 3 vs 100. I played around with it
> before all the way down to 1. People seem to settle on 3. Still, I am
> not seeing the bandwidth I would expect from 4 aggregated links.
> 
> Some additional information. If I attempt to pull from my two
> multipath devices simultaneously (different LUNs, but same iSCSI
> connections) then I can pull additional data (50MB/s vs 27-30MB/s
> from each link).
> 
> Adam
> 
> This is a response to a direct email I sent to someone who had a
> similar issue on this list a while back:
> Date: Sat, 30 Apr 2011 00:13:20 +0200
> From: Bart Coninckx <bart.coninckx at telenet.be>
> Hi Adam,
> 
> I believe setting rr_min_io to 3 instead of 100 improved things
> significantly.
> What is still an unexplained issue, though, is dd-ing to the multipath
> device (very slow) while reading from it is very fast. Doing the same
> piped over SSH to the original devices on the iSCSI server was OK, so it
> seems like either an iSCSI or still a multipath issue.
> 
> But I definitely remember that lowering rr_min_io helped quite a bit.
> I think the paths are switched faster this way, resulting in more speed.
> 
> Good luck,
> 
> b.
> 
> 
> On Mon, May 2, 2011 at 3:25 AM, Pasi Kärkkäinen <pasik at iki.fi> wrote:
> > On Thu, Apr 28, 2011 at 11:55:55AM -0400, Adam Chasen wrote:
> >>
> >> [root at zed ~]# multipath -ll
> >> 3600c0ff000111346d473554d01000000 dm-3 DotHill,DH3000
> >> size=1.1T features='0' hwhandler='0' wp=rw
> >> `-+- policy='round-robin 0' prio=1 status=active
> >>   |- 88:0:0:0 sdd 8:48  active ready  running
> >>   |- 86:0:0:0 sdc 8:32  active ready  running
> >>   |- 89:0:0:0 sdg 8:96  active ready  running
> >>   `- 87:0:0:0 sdf 8:80  active ready  running
> >> 3600c0ff00011148af973554d01000000 dm-2 DotHill,DH3000
> >> size=1.1T features='0' hwhandler='0' wp=rw
> >> `-+- policy='round-robin 0' prio=1 status=active
> >>   |- 89:0:0:1 sdk 8:160 active ready  running
> >>   |- 88:0:0:1 sdi 8:128 active ready  running
> >>   |- 86:0:0:1 sdh 8:112 active ready  running
> >>   `- 87:0:0:1 sdl 8:176 active ready  running
> >>
> >> /etc/multipath.conf
> >> defaults {
> >>         path_grouping_policy    multibus
> >>         rr_min_io 100
> >> }
> >
> > Did you try a lower value for rr_min_io ?
> >
> > -- Pasi
> >
> >>
> >> multipath-tools v0.4.9 (05/33, 2016)
> >> 2.6.35.11-2-fl.smp.gcc4.4.x86_64
<snip>
I'm quite curious to see what you ultimately find on this, as we have a
similar setup (four paths to an iSCSI SAN) and have struggled quite a
bit.  We had settled on using multipath for failover and software RAID0
across the four devices for load balancing. That seemed to provide more
even scaling under various IO patterns, until we realized we could not
take a transactionally consistent snapshot of the SAN because we would
not know which RAID transaction had been committed at the time of the
snapshot. Thus, we are planning to implement multibus.
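
For reference, the interim setup was roughly along these lines.  This is
only a sketch; the device names are placeholders, and each /dev/mapper
device was a failover-mode multipath device over its iSCSI paths:

mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/mapper/mpatha /dev/mapper/mpathb \
      /dev/mapper/mpathc /dev/mapper/mpathd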

What scheduler are you using? We found that the default cfq scheduler in
our kernel versions (2.6.28 and 2.6.29) did not scale at all with the
number of parallel iSCSI sessions; deadline or noop scaled almost
linearly.  We then realized that our SAN (Nexenta running ZFS) was doing
its own optimization of writes to the physical media (which is what we
assumed the scheduler was for), so we had no need for the overhead of
any scheduler and set ours to noop everywhere except the local disks.
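
If you want to try the same thing, something along these lines should do
for a quick test, using the sd names from your multipath -ll output (a
udev rule would make it persistent across reboots):

for d in sdc sdd sdf sdg sdh sdi sdk sdl; do
        echo noop > /sys/block/$d/queue/scheduler
done
cat /sys/block/sdc/queue/scheduler    # selected scheduler shows in brackets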

I'm also very curious about your findings on rr_min_io.  I cannot find
my benchmarks, but we tested various settings heavily.  I do not recall
whether we saw more even scaling with 10 or 100.  I do remember being
surprised that performance with it set to 1 was poor.  I would have
thought that, in a bonded environment, switching paths on every iSCSI
command would give optimal performance.  Can anyone explain why it does
not?
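
For reference, the setting lives in /etc/multipath.conf as in Adam's
excerpt above; what we plan to start from is something like this (the
value of 3 is just a starting point to benchmark, not a recommendation):

defaults {
        path_grouping_policy    multibus
        rr_min_io               3
}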

We speculated that either managing the constant switching at rr_min_io
of 1 added too much overhead, or it is simply the nature of iSCSI.  Does
each iSCSI command need to be acknowledged before the next one can be
sent? If so, does multibus not increase the throughput of any individual
iSCSI stream, but only the aggregate as we multiplex iSCSI streams?
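
A quick way to test that would be something like the commands below (the
device names are placeholders for your two multipath devices;
iflag=direct keeps the page cache out of the measurement).  If the
aggregate climbs as streams are added while a single stream stays
pinned, that points at per-command latency rather than link bandwidth:

dd if=/dev/mapper/mpatha of=/dev/null bs=1M count=4096 iflag=direct &
dd if=/dev/mapper/mpathb of=/dev/null bs=1M count=4096 iflag=direct &
wait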

If that is the case, it would exacerbate the already significant problem
of Linux, iSCSI, and latency.  We have found that for any Linux disk IO
that goes through the file system, iSCSI performance is quite poor
because it is latency bound due to the maximum 4KB page size.  I'm only
parroting what others have told me, so correct me if I am wrong.  Since
iSCSI can only commit 4KB at a time in Linux (unless one bypasses the
file system with raw devices, dd, or direct writes in something like
Oracle), since each write needs to be acknowledged before the next is
sent, and since sending 4KB down a high speed pipe like 10Gbps or even
1Gbps comes nowhere near saturating the link, Linux iSCSI IO is latency
bound, and no amount of additional bandwidth or number of bonded
channels will increase the throughput of an individual iSCSI stream.
Only minimizing latency will.
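
As a back-of-the-envelope illustration (the 0.5 ms round-trip figure is
purely an assumption for the sake of the arithmetic):

4KB per outstanding IO / 0.5 ms round trip  ~=    8 MB/s per stream
1Gbps link                                  ~=  110 MB/s usable
10Gbps link                                 ~= 1100 MB/s usable

Halving the latency doubles the per-stream rate; adding paths does not.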

I hope some of that helps, and I look forward to hearing how your
multibus optimization goes.  Thanks - John
