[dm-devel] Shell Scripts or Arbitrary Priority Callouts?

John A. Sullivan III jsullivan at opensourcedevel.com
Tue Mar 24 17:30:10 UTC 2009


Thanks very much again; as before, I'll reply in the text - John

On Tue, 2009-03-24 at 18:36 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> > I greatly appreciate the help.  I'll answer in the thread below as well
> > as consolidating answers to the questions posed in your other email.
> > 
> > On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > > > 
> > > > > The core-iscsi developer seems to be actively developing at least the 
> > > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > > core-iscsi, so maybe there's a newer version somewhere? 
> > > > > 
> > > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > > seemed to be better until we got into very large numbers of sessions.  We
> > > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > > sequential reads (and some sequential writes).
> > > > > > 
> > > > > 
> > > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > > traffic? 
> > > > > 
> > > 
> > > Dunno if you noticed this.. :) 
> > We are actually quite enthusiastic about the environment and the
> > project.  We hope to have many of these hosting about 400 VServer guests
> > running virtual desktops from the X2Go project.  It's not my project but
> > I don't mind plugging them as I think it is a great technology.
> > 
> > We are using jumbo frames.  The ProCurve 2810 switches explicitly state
> > to NOT use flow control and jumbo frames simultaneously.  We tried it
> > anyway but with poor results.
> 
> Ok. 
> 
> iirc the 2810 does not have very big buffers per port, so you might be better
> off using flow control instead of jumbos.. then again, I'm not sure how good
> HP's flow control implementation is? 
> 
> The whole point of flow control is to prevent packet loss/drops.. this happens
> by sending pause frames before the port buffers get full. If the port buffers
> fill up, the switch doesn't have any option other than to drop the
> packets.. and this causes tcp retransmits -> causes delay, and tcp slows down
> to prevent further packet drops.
> 
> flow control "pause frames" cause less delay than tcp-retransmits. 
> 
> Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
Thankfully this is an area of some expertise for me (unlike disk I/O -
obviously ;)  ).  We have been pretty thorough about checking the
network path.  We've not seen any upper-layer retransmissions or buffer
overflows.
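(For concreteness, the sort of checks we ran look like this - eth2 is a
placeholder for each of our iSCSI-facing ports:

  netstat -s | grep -i retrans           # TCP retransmit counters
  ethtool -a eth2                        # pause frame / flow control state
  ethtool -S eth2 | grep -iE 'err|drop'  # NIC error and drop counters
)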
> 
> > > 
> > > 
> > > > > > 
> > > > > 
> > > > > When you used dm RAID0 you didn't have any multipath configuration, right? 
> > > > Correct although we also did test successfully with multipath in
> > > > failover mode and RAID0.
> > > > > 
> > > 
> > > OK.
> > > 
> > > > > What kind of stripe size and other settings did you have for RAID0?
> > > > Chunk size was 8KB with four disks.  
> > > > > 
> > > 
> > > Did you try with much bigger sizes.. 128 kB ?
> > We tried slightly larger sizes - 16KB and 32KB I believe and observed
> > performance degradation.  In fact, in some scenarios 4KB chunk sizes
> > gave us better performance than 8KB.
> 
> Ok. 
> 
> > > 
> > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > directly on top of the iscsi /dev/sd? device.
> > > > Miserable - same roughly 12 MB/s.
> > > 
> > > OK, here's your problem. Was this btw reads or writes? Did you tune
> > > readahead settings? 
> > 12MBps is sequential reading but sequential writing is not much
> > different.  We did tweak readahead to 1024. We did not want to go much
> > larger in order to maintain balance with the various data patterns -
> > some of which are random and some of which may not read linearly.
> 
> I did some benchmarking earlier between two servers; one running an ietd
> target with 'nullio' and the other running the open-iscsi initiator. Both used a single gigabit NIC. 
> 
> I remember getting very close to full gigabit speed at least with bigger
> block sizes. I can't remember how much I got with 4 kB blocks. 
> 
> Those tests were made with dd.
Yes, if we use 64KB blocks, we can saturate a Gig link.  With larger
sizes, we can push over 3 Gbps over the four gig links in the test
environment.
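(Circling back to the readahead point above: we set it per device with
something like the following - blockdev works in 512-byte sectors, while
/sys/block/<dev>/queue/read_ahead_kb takes kilobytes; sdah is the iSCSI
disk from the session listing further down:

  blockdev --setra 1024 /dev/sdah
  blockdev --getra /dev/sdah     # confirm the value
)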
> 
> nullio target is a good way to benchmark your network and initiator and
> verify everything is correct. 
> 
> Also it's good to first test with, for example, FTP and iperf to verify the
> network is working properly between the target and the initiator and all the
> other basic settings are correct.
We did flood ping the network and had all interfaces operating at near
capacity.  The network itself looks very healthy.
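(We mostly used flood ping for this, but an iperf run along these lines is
the equivalent check - the address is the target interface we ping further
down, assuming the target end can run iperf:

  iperf -s -w 256k                            # on the target
  iperf -c 172.30.13.158 -t 30 -P 4 -w 256k   # on the initiator
)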
> 
> Btw have you configured the tcp stacks of the servers? Bigger default tcp
> window size, bigger maximum tcp window size etc.. 
Yep, tweaked transmit queue length, receive and transmit windows, net
device backlogs, buffer space, disabled nagle, and even played with the
dirty page watermarks.
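(For illustration, the knobs we mean are mostly sysctls; the values below
are examples rather than what we settled on:

  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  net.core.netdev_max_backlog = 30000

plus a longer transmit queue per NIC, e.g. ifconfig eth2 txqueuelen 10000,
with eth2 again a placeholder.)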
> 
> > > 
> > > Can you paste the iSCSI session settings negotiated with the target? 
> > Pardon my ignorance :( but, other than packet traces, how do I show the
> > final negotiated settings?
> 
> Try:
> 
> iscsiadm -i -m session
> iscsiadm -m session -P3
> 
Here's what it says.  Pretty much as expected.  We are using COMSTAR on
the target and took some traces to see what COMSTAR was expecting. We
set the open-iscsi parameters to match:

Current Portal: 172.x.x.174:3260,2
        Persistent Portal: 172.x.x.174:3260,2
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
                Iface IPaddress: 172.x.x.162
                Iface HWaddress: default
                Iface Netdev: default
                SID: 32
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 131072
                MaxXmitDataSegmentLength: 8192
                FirstBurstLength: 65536
                MaxBurstLength: 524288
                ImmediateData: Yes
                InitialR2T: Yes
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 39 State: running
                scsi39 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdah         State: running
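
(For reference, those parameters are driven from /etc/iscsi/iscsid.conf on
the initiator; the lines we touched to match COMSTAR look roughly like
this - the values are simply the ones negotiated above:

  node.session.iscsi.InitialR2T = Yes
  node.session.iscsi.ImmediateData = Yes
  node.session.iscsi.FirstBurstLength = 65536
  node.session.iscsi.MaxBurstLength = 524288
  node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072

New values only take effect after the session is logged out and back in.)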

> 
> > > 
> > > > > 
> > > > > Sounds like there's some other problem if individual throughput is bad? Or did
> > > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > > disktest threads it's good.. that would make more sense :) 
> > > > Yes, the latter.  Single thread (I assume mimicking a single disk
> > > > operation, e.g., copying a large file) is miserable - much slower than
> > > > local disk despite the availability of huge bandwidth.  We start
> > > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > > the hundreds.
> > > > 
> > > > I am guessing the single thread performance problem is an open-iscsi
> > > > issue but I was hoping multipath would help us work around it by
> > > > utilizing multiple sessions per disk operation.  I suppose that is where
> > > > we run into the command ordering problem unless there is something else
> > > > afoot.  Thanks - John
> > > 
> > > You should be able to get many times the throughput you get now.. just with
> > > a single path/session.
> > > 
> > > What kind of latency do you have from the initiator to the target/storage? 
> > > 
> > > Try with for example 4 kB ping:
> > > ping -s 4096 <ip_of_the_iscsi_target>
> > We have about 400 micro seconds - that seems a bit high :(
> > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > 
> 
> Yeah.. that's a bit high. 
Actually, with more testing, we're seeing it stretch up to over 700
micro-seconds.  I'll attach a raft of data I collected at the end of
this email.
> 
> > > 
> > > 1000ms divided by the roundtrip you get from ping should give you maximum
> > > possible IOPS using a single path.. 
> > > 
> > 1000 / 0.4 = 2500
> > > 4 kB * IOPS == max bandwidth you can achieve.
> > 2500 * 4KB = 10 MBps
> > Hmm . . . seems like what we are getting.  Is that an abnormally high
> > latency? We have tried playing with interrupt coalescing on the
> > initiator side but without significant effect.  Thanks for putting
> > together the formula for me.  Not only does it help me understand but it
> > means I can work on addressing the latency issue without setting up and
> > running disk tests.
> > 
> 
> I think Ross suggested in some other thread the following settings for e1000
> NICs:
> 
> "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> and RxRingBufferSize=4096 (verify those option names with a modinfo)
> and add those to modprobe.conf."
We tried adjusting the ring buffers but to no avail; modinfo does not
seem to display the current settings.  We also tried setting the
InterruptThrottleRate to 1, again to no avail.  As I'll mention later, I
suspect the issue might be the OpenSolaris-based target.
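(One note for anyone following along: modinfo e1000 reports the descriptor
options as RxDescriptors and TxDescriptors rather than the ring-buffer
names quoted above, so the modprobe.conf line we experimented with was
along these lines - one value per port:

  options e1000 InterruptThrottleRate=1,1,1,1 RxDescriptors=4096,4096,4096,4096 TxDescriptors=4096,4096,4096,4096
)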
> 
> > I would love to use larger block sizes as you suggest in your other
> > email but, on AMD64, I believe we are stuck with 4KB.  I've not seen any
> > way to change it and would gladly do so if someone knows how.
> > 
> 
> Are we talking about filesystem block sizes? That shouldn't be a problem if
> your application uses larger blocksizes for read/write operations.. 
> 
Yes, file system block size.  When we try rough, end user style tests,
e.g., large file copies, we seem to get the performance indicated by 4KB
blocks, i.e., lousy!
> Try for example with:
> dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
Large block sizes can make the system truly fly so we suspect you are
absolutely correct about latency being the issue.  We did do our testing
with raw interfaces by the way.
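(A quick way to see the block size effect against the raw device while
bypassing the page cache - sdah is the iSCSI disk from the session listing
above:

  dd if=/dev/sdah of=/dev/null bs=4k count=100000 iflag=direct   # small blocks: latency-bound
  dd if=/dev/sdah of=/dev/null bs=1M count=4000 iflag=direct     # large blocks: bandwidth-bound
)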
> 
> and optionally add "oflag=direct" (or iflag=direct) if you want to make sure 
> caches do not mess up the results. 
> 
> > CFQ was indeed a problem.  It would not scale with increasing the number
> > of threads.  noop, deadline, and anticipatory all fared much better.  We
> > are currently using noop for the iSCSI targets.  Thanks again - John
> 
> Yep. And no problems.. hopefully I'm able to help and guide you in the
> right direction :)  
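On the scheduler point just above: for reference, we switch it per device
through sysfs, e.g.:

  cat /sys/block/sdah/queue/scheduler          # current scheduler shown in brackets
  echo noop > /sys/block/sdah/queue/scheduler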
<snip>
I did a little digging and calculating and here is what I came up with
and sent to Nexenta.  Please tell me if I am on the right track.

I am using jumbo frames and should be able to get two 4KB blocks
per frame.  Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC -
oops, we need to add iSCSI - what size is the iSCSI header?) + 12
(interframe gap) = 8282 bytes.  Transmission latency should be 8282 *
8 / 1,000,000,000 = 66.3 micro-seconds.  Switch latency is 5.7
micro-seconds, so let's say network latency is 72 - call it 75
micro-seconds.  The only additional latency should come from the
network stacks on the target and initiator.
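(The same arithmetic as a one-liner, in case I have slipped a digit:

  awk 'BEGIN { wire = 8282 * 8 / 1000; printf "wire %.1f us + switch 5.7 us = %.1f us one way\n", wire, wire + 5.7 }'
)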

Current round trip latency between the initiator (Linux) and target
(Nexenta) is around 400 micro-seconds and fluctuates significantly:

Hmm . .  this is worse than the last test:
PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
8200 bytes from 172.30.13.158: icmp_seq=1 ttl=255 time=1.36 ms
8200 bytes from 172.30.13.158: icmp_seq=2 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=3 ttl=255 time=0.622 ms
8200 bytes from 172.30.13.158: icmp_seq=4 ttl=255 time=0.603 ms
8200 bytes from 172.30.13.158: icmp_seq=5 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=6 ttl=255 time=0.564 ms
8200 bytes from 172.30.13.158: icmp_seq=7 ttl=255 time=0.553 ms
8200 bytes from 172.30.13.158: icmp_seq=8 ttl=255 time=0.525 ms
8200 bytes from 172.30.13.158: icmp_seq=9 ttl=255 time=0.508 ms
8200 bytes from 172.30.13.158: icmp_seq=10 ttl=255 time=0.490 ms
8200 bytes from 172.30.13.158: icmp_seq=11 ttl=255 time=0.472 ms
8200 bytes from 172.30.13.158: icmp_seq=12 ttl=255 time=0.454 ms
8200 bytes from 172.30.13.158: icmp_seq=13 ttl=255 time=0.436 ms
8200 bytes from 172.30.13.158: icmp_seq=14 ttl=255 time=0.674 ms
8200 bytes from 172.30.13.158: icmp_seq=15 ttl=255 time=0.399 ms
8200 bytes from 172.30.13.158: icmp_seq=16 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=17 ttl=255 time=0.620 ms
8200 bytes from 172.30.13.158: icmp_seq=18 ttl=255 time=0.601 ms
8200 bytes from 172.30.13.158: icmp_seq=19 ttl=255 time=0.583 ms
8200 bytes from 172.30.13.158: icmp_seq=20 ttl=255 time=0.563 ms
8200 bytes from 172.30.13.158: icmp_seq=21 ttl=255 time=0.546 ms
8200 bytes from 172.30.13.158: icmp_seq=22 ttl=255 time=0.518 ms
8200 bytes from 172.30.13.158: icmp_seq=23 ttl=255 time=0.501 ms
8200 bytes from 172.30.13.158: icmp_seq=24 ttl=255 time=0.481 ms
8200 bytes from 172.30.13.158: icmp_seq=25 ttl=255 time=0.463 ms
8200 bytes from 172.30.13.158: icmp_seq=26 ttl=255 time=0.443 ms
8200 bytes from 172.30.13.158: icmp_seq=27 ttl=255 time=0.682 ms
8200 bytes from 172.30.13.158: icmp_seq=28 ttl=255 time=0.404 ms
8200 bytes from 172.30.13.158: icmp_seq=29 ttl=255 time=0.644 ms
8200 bytes from 172.30.13.158: icmp_seq=30 ttl=255 time=0.624 ms
8200 bytes from 172.30.13.158: icmp_seq=31 ttl=255 time=0.605 ms
8200 bytes from 172.30.13.158: icmp_seq=32 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=33 ttl=255 time=0.566 ms
^C
--- 172.30.13.158 ping statistics ---
33 packets transmitted, 33 received, 0% packet loss, time 32000ms
rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms

There is nothing going on in the network.  So we are seeing 574
micro-seconds total with only 150 micro-seconds attributed to
transmission.  And we see a wide variation in latency.

I then tested the latency between local interfaces on the initiator and on
the target.  Here is what I get for internal latency on the Linux initiator:
PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
of data.
8200 bytes from 172.30.13.18: icmp_seq=1 ttl=64 time=0.033 ms
8200 bytes from 172.30.13.18: icmp_seq=2 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=3 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=4 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=5 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=6 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=7 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=8 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=9 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=10 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=11 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=12 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=13 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=14 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=15 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=16 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=17 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=18 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=19 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=20 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=21 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=22 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=23 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=24 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=25 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=26 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=27 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=28 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=29 ttl=64 time=0.018 ms
^C
--- 172.30.13.18 ping statistics ---
29 packets transmitted, 29 received, 0% packet loss, time 27999ms
rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms

A very consistent 18 micro-seconds.

Here is what I get from the Z200:
root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
PING 172.30.13.190: 4096 data bytes
4104 bytes from 172.30.13.190: icmp_seq=0. time=0.104 ms
4104 bytes from 172.30.13.190: icmp_seq=1. time=0.081 ms
4104 bytes from 172.30.13.190: icmp_seq=2. time=0.067 ms
4104 bytes from 172.30.13.190: icmp_seq=3. time=0.083 ms
4104 bytes from 172.30.13.190: icmp_seq=4. time=0.097 ms
4104 bytes from 172.30.13.190: icmp_seq=5. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=6. time=0.048 ms
4104 bytes from 172.30.13.190: icmp_seq=7. time=0.050 ms
4104 bytes from 172.30.13.190: icmp_seq=8. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=9. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=10. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=11. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=12. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=13. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=14. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=15. time=0.047 ms
4104 bytes from 172.30.13.190: icmp_seq=16. time=0.072 ms
4104 bytes from 172.30.13.190: icmp_seq=17. time=0.080 ms
4104 bytes from 172.30.13.190: icmp_seq=18. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=19. time=0.066 ms
4104 bytes from 172.30.13.190: icmp_seq=20. time=0.086 ms
4104 bytes from 172.30.13.190: icmp_seq=21. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=22. time=0.079 ms
4104 bytes from 172.30.13.190: icmp_seq=23. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=24. time=0.069 ms
4104 bytes from 172.30.13.190: icmp_seq=25. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=26. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=27. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=28. time=0.073 ms
4104 bytes from 172.30.13.190: icmp_seq=29. time=0.071 ms
4104 bytes from 172.30.13.190: icmp_seq=30. time=0.071 ms
^C
----172.30.13.190 PING Statistics----
31 packets transmitted, 31 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 0.042/0.066/0.104/0.019

Notice the latency is several times longer, with much wider variation.
How do we tune the OpenSolaris network stack to reduce its latency? I'd
really like to improve the individual user experience.  I can tell users
it's like commuting to work on the train instead of the car during rush
hour - faster when there's lots of traffic but slower when there is not -
but they will judge the product by their individual experiences more
than by their collective experience.  Thus, I really want to improve
individual disk operation throughput.

Latency seems to be our key.  If the initiator and target stacks each
added only 20 micro-seconds on top of the roughly 150 micro-seconds of
transmission time, the round trip would be roughly 200 micro-seconds.
That would almost triple the throughput from what we are currently
seeing.
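(Spelling it out with the IOPS formula above, assuming 4KB per round trip
as before - 0.574 ms is today's average, 0.2 ms the hoped-for figure:

  awk 'BEGIN { split("0.574 0.2", r); for (i = 1; i <= 2; i++) { iops = 1000 / r[i]; printf "RTT %.3f ms -> %.0f IOPS -> %.1f MB/s\n", r[i], iops, iops * 4 / 1024 } }'

which works out to roughly 6.8 MB/s now versus about 19.5 MB/s at 200
micro-seconds.)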

Unfortunately, I'm a bit ignorant of tweaking networks on OpenSolaris.
I can certainly learn, but am I headed in the right direction, or is
this line of investigation misguided? Thanks - John

-- 
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan at opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society




