[dm-devel] [Multipath] Round-robin performance limit

John A. Sullivan III jsullivan at opensourcedevel.com
Tue Dec 27 19:18:45 UTC 2011


On Tue, 2011-12-27 at 13:36 +0200, Pasi Kärkkäinen wrote:
> On Thu, Dec 22, 2011 at 07:54:46PM -0500, John A. Sullivan III wrote:
> > On Wed, 2011-10-05 at 15:54 -0400, Adam Chasen wrote:
> > > John,
> > > I am limited in a similar fashion. I would much prefer to use multibus
> > > multipath, but was unable to achieve bandwidth which would exceed a
> > > single link even though it was spread over the 4 available links. Were
> > > you able to gain even a similar performance of the RAID0 setup with
> > > the multibus multipath?
> > > 
> > > Thanks,
> > > Adam
> > <snip>
> > We just ran a quick benchmark before optimizing.  Using multibus rather
> > than RAID0 with four GbE NICs, and testing with a simple cat /dev/zero >
> > zeros, we hit 3.664 Gbps!
> > 
> > This is still on CentOS 5.4 so we are not able to play with
> > rr_min_io_rq.  We have not yet activated jumbo frames. We are also
> > thinking of using SFQ as a qdisc instead of the default pfifo_fast.  So,
> > we think we can make it go even faster.
> > 
> > We are delighted to be achieving this with multibus rather than RAID0 as
> > it means we can take transactionally consistent snapshots on the SAN.
> > 
> > Many thanks to whoever pointed out that tag queueing should solve the
> > 4KB block size latency problem.  The problem turned out not to be
> > latency as we were told but simply an under-resourced SAN.  We brought
> > in new Nexenta SANs with much more RAM and they are flying - John
> > 
> 
> Hey,
> 
> Can you please post your multipath configuration?
> Just for reference for future people googling for this :)
> 
> -- Pasi
> 
<snip>
Sure, although I would be a bit careful.  There are a few things we need to
tweak in it, and the lead engineer on the product and I just haven't had the
time to go over it.  It is also based upon CentOS 5.4, so we do not have
rr_min_io_rq (more on that after the config).  We are a moderately secure
environment, so I have scrubbed a bit of the data:

multipath.conf
blacklist {
#        devnode "*"
        # sdb
        wwid SATA_ST3250310NS_9XX0LYYY
        # sda
        wwid SATA_ST3250310NS_9XX0LZZZ
        # The above does not seem to be working, so we also blacklist by device node
        devnode "^sd[ab]$"
        # This is usually a bad idea as the device names can change
        # However, since we add our iSCSI devices long after boot, I think we are safe
}
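
# (Side note: the id that multipath compares these wwid lines against is
#  whatever the getuid_callout below returns, so running, e.g.,
#  /sbin/scsi_id -g -u -s /block/sda
#  by hand and checking that the blacklist entries match its output exactly
#  is the quickest way to debug why they are being ignored.)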

defaults {
        udev_dir                /dev
        polling_interval        5
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/bin/bash /sbin/mpath_prio_ssi %n" # This needs to be cleaned up
        prio_callout            /bin/true
        path_checker            directio
        rr_min_io               100
        max_fds                 8192
        rr_weight               uniform
        failback                immediate
        no_path_retry           fail
#       user_friendly_names     yes
}

multipaths {

        multipath {
                wwid                    aaaaaaaaaa53f0d0000004e81f27d0001
                alias                   isda
        }

        multipath {
                wwid                    aaaaaaaaaa53f0d0000004e81f2910002
                alias                   isdb
        }

        multipath {
                wwid                    aaaaaaaaaa53f0d0000004e81f2ab0003
                alias                   isdc
        }

        multipath {
                wwid                    aaaaaaaaaa53f0d0000004e81f2c10004
                alias                   isdd
        }

}

devices {
       device {
               vendor                  "NEXENTA"
               product                 "COMSTAR"
               getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
               features                "0"
               hardware_handler        "0"
       }
}
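
One thing we would change on a newer distribution (e.g. CentOS 6, where
dm-multipath is request-based): rr_min_io is ignored there and the knob is
rr_min_io_rq instead.  As a rough, untested sketch, the defaults section
would gain something like:

defaults {
        rr_min_io_rq            1       # replaces rr_min_io on request-based dm-multipath
}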

Other miscellaneous settings:
# Some optimizations for the SAN network
ip link set eth0 txqlen 2000
ip link set eth1 txqlen 2000
ip link set eth2 txqlen 2000
ip link set eth3 txqlen 2000

The more we read about and test for bufferbloat
(http://www.bufferbloat.net/projects/bloat), the more we are inclined to
dramatically reduce these buffers instead: it is quite possible for one new
iSCSI conversation to become backlogged behind another, and I suspect that
could also wreak havoc on command reordering when we are round-robining
across the interfaces.
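
If we do go that route, it would just be the same commands with a much
smaller value, e.g. (untested - the right number would have to come out of
benchmarking):

ip link set eth0 txqlen 256
ip link set eth1 txqlen 256
ip link set eth2 txqlen 256
ip link set eth3 txqlen 256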

We are also thinking of changing the queuing discipline from the default
pfifo_fast.  Since it is all the same traffic, there is no need to band it
by TOS bits the way pfifo_fast does, so a plain fifo qdisc might be a hair
faster.  On the other hand, we might want to go with SFQ so that one heavy
iSCSI conversation cannot starve the others or keep them from ramping up
quickly out of TCP slow start.
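
Swapping the root qdisc is a one-liner per interface if we decide to try it
(sfq shown here; substituting pfifo would give the plain fifo instead):

tc qdisc replace dev eth0 root sfq perturb 10
tc qdisc show dev eth0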

multipath -F                    # flush any stale maps
multipath                       # build the maps from multipath.conf
sleep 2
service multipathd start
sleep 2
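
# A quick sanity check once the maps are up: with multibus, each map should
# show all four paths in a single path group, and the aliases above should
# appear under /dev/mapper/
multipath -ll isda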

blockdev --setra 1024 /dev/mapper/isda
blockdev --setra 1024 /dev/mapper/isdb
blockdev --setra 1024 /dev/mapper/isdc
blockdev --setra 1024 /dev/mapper/isdd
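
For reference, --setra is in 512-byte sectors, so 1024 here is a 512 KB
read-ahead; it can be confirmed with:

blockdev --getra /dev/mapper/isda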

mount -o defaults,noatime /dev/mapper/id02sdd /backups   # note the noatime

From sysctl.conf:
# Controls tcp maximum receive window size
#net.core.rmem_max = 409600
#net.core.rmem_max = 8738000
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 8192 873800 16777216

# Controls tcp maximum send window size
#net.core.wmem_max = 409600
#net.core.wmem_max = 6553600
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 655360 16777216

# Favour low latency over throughput in the TCP stack
# (this does not itself disable Nagle or delayed acks; those are per-socket options)
net.ipv4.tcp_low_latency=1

net.core.netdev_max_backlog = 2000
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Controls the maximum total size of a message queue, in bytes
kernel.msgmnb = 65536

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296

# Controls when we call for more entropy
# Since these systems have no mouse or keyboard, and Linux no longer uses
# network I/O as an entropy source, we are regularly running low on entropy
kernel.random.write_wakeup_threshold = 1024
# Not really needed for iSCSI - just an interesting setting we use in
# conjunction with haveged to address the lack of entropy on headless systems
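
A quick way to check whether the pool really is running low (not a sysctl,
just a read of the kernel's counter):

cat /proc/sys/kernel/random/entropy_avail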

We have not yet re-enabled jumbo frames, as they actually reduced throughput
in the past, but that may have been related to the lack of resources in the
original unit.
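
If we do revisit jumbo frames, it is just an MTU change on the initiator
NICs, but the switch ports and the SAN interfaces all have to be raised to
match or throughput gets worse rather than better.  Something like the
following, where <SAN portal IP> is just a placeholder:

ip link set eth0 mtu 9000
ping -M do -s 8972 <SAN portal IP>   # 8972 + 28 header bytes = 9000; -M do forbids fragmentation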

Hope this helps.  We are not experts, so if someone sees something we can
tweak, please point it out - John




