1 Network Working Group B. Constantine 2 Internet-Draft JDSU 3 Intended status: Informational G. Forget 4 Expires: July 2, 2011 Bell Canada (Ext. Consultant) 5 Rudiger Geib 6 Deutsche Telekom 7 Reinhard Schrage 8 Schrage Consulting 10 January 2, 2011 12 Framework for TCP Throughput Testing 13 draft-ietf-ippm-tcp-throughput-tm-10.txt 15 Abstract 17 This framework describes a methodology for measuring end-to-end TCP 18 throughput performance in a managed IP network. The intention is to 19 provide a practical methodology to validate TCP layer performance. 20 The goal is to provide a better indication of the user experience. 21 In this framework, various TCP and IP parameters are identified and 22 should be tested as part of a managed IP network. 24 Requirements Language 26 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 27 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 28 document are to be interpreted as described in RFC 2119 [RFC2119]. 30 Status of this Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at http://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time. It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on July 2, 2011. 47 Copyright Notice 49 Copyright (c) 2011 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document.
Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 65 1.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 66 1.2 Test Set-up . . . . . . . . . . . . . . . . . . . . . . . 5 67 2. Scope and Goals of this methodology. . . . . . . . . . . . . . 5 68 2.1 TCP Equilibrium. . . . . . . . . . . . . . . . . . . . . . 6 69 3. TCP Throughput Testing Methodology . . . . . . . . . . . . . . 7 70 3.1 Determine Network Path MTU . . . . . . . . . . . . . . . . 9 71 3.2. Baseline Round Trip Time and Bandwidth . . . . . . . . . . 10 72 3.2.1 Techniques to Measure Round Trip Time . . . . . . . . 11 73 3.2.2 Techniques to Measure end-to-end Bandwidth. . . . . . 12 74 3.3. TCP Throughput Tests . . . . . . . . . . . . . . . . . . . 12 75 3.3.1 Calculate Ideal maximum TCP RWIN Size. . . . . . . . . 12 76 3.3.2 Metrics for TCP Throughput Tests . . . . . . . . . . . 15 77 3.3.3 Conducting the TCP Throughput Tests. . . . . . . . . . 19 78 3.3.4 Single vs. Multiple TCP Connection Testing . . . . . . 19 79 3.3.5 Interpretation of the TCP Throughput Results . . . . . 20 80 3.3.6 High Performance Network Options . . . . . . . . . . . 20 81 3.4. Traffic Management Tests . . . . . . . . . . . . . . . . . 22 82 3.4.1 Traffic Shaping Tests. . . . . . . . . . . . . . . . . 23 83 3.4.1.1 Interpretation of Traffic Shaping Test Results. . . 23 84 3.4.2 RED Tests. . . . . . . . . . . . . . . . . . . . . . . 24 85 3.4.2.1 Interpretation of RED Results . . . . . . . . . . . 25 86 4. Security Considerations . . . . . . . . . . . . . . . . . . . 25 87 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 88 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 26 89 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26 90 7.1 Normative References . . . . . . . . . . . . . . . . . . . 26 91 7.2 Informative References . . . . . . . . . . . . . . . . . . 26 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27 95 1. Introduction 97 Network providers are coming to the realization that Layer 2/3 98 testing is not enough to adequately ensure end-user's satisfaction. 99 An SLA (Service Level Agreement) is provided to business customers 100 and is generally based upon Layer 2/3 criteria such as access rate, 101 latency, packet loss and delay variations. On the other hand, 102 measuring TCP throughput provides meaningful results with respect to 103 user experience. Thus, the network provider community desires to 104 measure IP network throughput performance at the TCP layer. 106 Additionally, business enterprise customers seek to conduct 107 repeatable TCP throughput tests between locations. Since these 108 enterprises rely on the networks of the providers, a common test 109 methodology with predefined metrics will benefit both parties. 111 Note that the primary focus of this methodology is managed business 112 class IP networks; i.e. those Ethernet terminated services for which 113 businesses are provided an SLA from the network provider. End-users 114 with "best effort" access between locations can use this methodology, 115 but this framework and its metrics are intended to be used in a 116 predictable managed IP service environment. 
118 The intent of this document is to define a methodology for 119 testing sustained TCP layer performance. In this document, the 120 maximum achievable TCP Throughput is that amount of data per unit 121 time that TCP transports when trying to reach Equilibrium, i.e. 122 after the initial slow start and congestion avoidance phases. 124 TCP is connection oriented and, at the transmitting side of the 125 connection, it uses a congestion window (TCP CWND) to determine how 126 many packets it can send at one time. The network path bandwidth 127 delay product (BDP) determines the ideal TCP CWND. With the help of 128 slow start and congestion avoidance mechanisms, TCP probes the IP 129 network path. Up to the bandwidth limit, a larger TCP CWND permits 130 a higher throughput, and within local host limits, the TCP "Slow Start" 131 and "Congestion Avoidance" algorithms together determine the TCP 132 CWND size. This TCP CWND will vary during the session, but the 133 maximum TCP CWND size is limited by the buffer space allocated by 134 the kernel for each socket. 136 At the receiving end of the connection, TCP uses a receive window 137 (TCP RWIN) to inform the transmitting end of how many Bytes it is 138 capable of receiving between acknowledgements (TCP ACK). This TCP 139 RWIN will also vary during the session, and the maximum TCP RWIN Size 140 is likewise limited by the buffer space allocated by the kernel for 141 each socket. 143 At both ends of the TCP connection and for each socket, there are 144 default buffer sizes that programs can change using system 145 library calls made just before opening the socket. There are also 146 kernel enforced maximum buffer sizes. These buffer sizes can be 147 adjusted at both ends (transmitting and receiving). In order to 148 obtain the maximum throughput, it is critical to use optimal TCP 149 Send and Receive Socket Buffer sizes. 151 Note that some TCP/IP stack implementations use Receive Window 152 Auto-Tuning, and their buffers cannot be adjusted manually unless this feature is disabled. 154 There are many variables to consider when conducting a TCP throughput 155 test, but this methodology focuses on: 156 - RTT and Bottleneck BW 157 - Ideal Send Socket Buffer (Ideal maximum TCP CWND) 158 - Ideal Receive Socket Buffer (Ideal maximum TCP RWIN) 159 - Path MTU and Maximum Segment Size (MSS) 160 - Single Connection and Multiple Connections testing 162 This methodology proposes TCP testing that should be performed in 163 addition to traditional Layer 2/3 type tests. Layer 2/3 tests are 164 required to verify the integrity of the network before conducting TCP 165 tests. Examples include iperf (UDP mode) or manual packet layer test 166 techniques where packet throughput, loss, and delay measurements are 167 conducted. When available, standardized testing similar to RFC 2544 168 [RFC2544] but adapted for use in operational networks may be used. 169 Note: RFC 2544 was never meant to be used outside a lab environment. 171 The following two sections provide a general overview of the test 172 methodology. 174 1.1 Terminology 176 Common terminologies used in the test methodology are: 178 - TCP Throughput Test Device (TCP TTD), refers to a compliant TCP 179 host that generates traffic and measures metrics as defined in 180 this methodology, e.g. a dedicated communications test instrument. 181 - Customer Provided Equipment (CPE), refers to customer owned 182 equipment (routers, switches, computers, etc.) 183 - Customer Edge (CE), refers to the provider owned demarcation device.
184 - Provider Edge (PE), refers to provider's distribution equipment. 185 - Bottleneck Bandwidth (BB), lowest bandwidth along the complete 186 path. Bottleneck Bandwidth and Bandwidth are used synonymously 187 in this document. Most of the time the Bottleneck Bandwidth is 188 in the access portion of the wide area network (CE - PE). 189 - Provider (P), refers to provider core network equipment. 190 - Network Under Test (NUT), refers to the tested IP network path. 191 - Round-Trip Time (RTT), refers to Layer 4 back and forth delay. 193 Figure 1.1 Devices, Links and Paths 195 +----+ +----+ +----+ +----+ +---+ +---+ +----+ +----+ +----+ +----+ 196 | TCP|-| CPE|-| CE |--| PE |-| P |--| P |-| PE |--| CE |-| CPE|-| TCP| 197 | TTD| | | | |BB| | | | | | | |BB| | | | | TTD| 198 +----+ +----+ +----+ +----+ +---+ +---+ +----+ +----+ +----+ +----+ 199 <------------------------ NUT -------------------------> 200 R >-----------------------------------------------------------| 201 T | 202 T <-----------------------------------------------------------| 204 Note that the NUT may consist of a variety of devices including but 205 not limited to, load balancers, proxy servers or WAN acceleration 206 devices. The detailed topology of the NUT should be well understood 207 when conducting the TCP throughput tests, although this methodology 208 makes no attempt to characterize specific network architectures. 210 1.2 Test Set-up 212 This methodology is intended for operational and managed IP networks. 213 A multitude of network architectures and topologies can be tested. 214 The above set-up diagram is very general and it only illustrates the 215 segmentation within end-user and network provider domains. 217 2. Scope and Goals of this Methodology 219 Before defining the goals, it is important to clearly define the 220 areas that are out-of-scope. 222 - This methodology is not intended to predict the TCP throughput 223 during the transient stages of a TCP connection, such as the initial 224 slow start. 226 - This methodology is not intended to definitively benchmark TCP 227 implementations of one OS to another, although some users may find 228 some value in conducting qualitative experiments. 230 - This methodology is not intended to provide detailed diagnosis 231 of problems within end-points or within the network itself as 232 related to non-optimal TCP performance, although a results 233 interpretation section for each test step may provide insight in 234 regards with potential issues. 236 - This methodology does not propose to operate permanently with high 237 measurement loads. TCP performance and optimization within 238 operational networks may be captured and evaluated by using data 239 from the "TCP Extended Statistics MIB" [RFC4898]. 241 - This methodology is not intended to measure TCP throughput as part 242 of an SLA, or to compare the TCP performance between service 243 providers or to compare between implementations of this methodology 244 in dedicated communications test instruments. 246 In contrast to the above exclusions, a primary goal is to define a 247 method to conduct a practical, end-to-end assessment of sustained 248 TCP performance within a managed business class IP network. Another 249 key goal is to establish a set of "best practices" that a non-TCP 250 expert should apply when validating the ability of a managed network 251 to carry end-user TCP applications. 
253 Specific goals are to: 255 - Provide a practical test approach that specifies tunable parameters 256 such as MSS (Maximum Segment Size) and Socket Buffer sizes and how 257 these affect the outcome of TCP performance over an IP network. 258 See section 3.3.3. 260 - Provide specific test conditions like link speed, RTT, MSS, Socket 261 Buffer sizes and maximum achievable TCP throughput when trying to 262 reach TCP Equilibrium. For guideline purposes, provide examples of 263 test conditions and their maximum achievable TCP throughput. 264 Section 2.1 provides specific details concerning the definition of 265 TCP Equilibrium within this methodology while section 3 provides 266 specific test conditions with examples. 268 - Define three (3) basic metrics to compare the performance of TCP 269 connections under various network conditions. See section 3.3.2. 271 - In test situations where the recommended procedure does not yield 272 the maximum achievable TCP throughput results, this methodology 273 provides some possible areas within the end host or the network that 274 should be considered for investigation. Again, this 275 methodology is not intended to provide a detailed diagnosis of these 276 issues. See section 3.3.5. 278 2.1 TCP Equilibrium 280 TCP connections have three (3) fundamental congestion window phases: 282 1 - The Slow Start phase, which occurs at the beginning of a TCP 283 transmission or after a retransmission time out. 285 2 - The Congestion Avoidance phase, during which TCP ramps up to 286 establish the maximum attainable throughput on an end-to-end network 287 path. Retransmissions are a natural by-product of the TCP congestion 288 avoidance algorithm as it seeks to achieve maximum throughput. 290 3 - The Loss Recovery phase, which could include Fast Retransmit 291 (Tahoe) or Fast Recovery (Reno & New Reno). When packet loss occurs, 292 the Congestion Avoidance phase transitions either to Fast Retransmit 293 or Fast Recovery depending upon the TCP implementation. If a Time-Out 294 occurs, TCP transitions back to the Slow Start phase. 296 The following diagram depicts these 3 phases. 298 Figure 2.1 TCP CWND Phases 300 /\ | Trying to reach TCP Equilibrium > > > > > > > > > 301 /\ | 302 /\ |High ssthresh TCP CWND 303 /\ |Loss Event * halving 3-Loss Recovery 304 /\ | * \ upon loss Adjusted 305 /\ | * \ / \ Time-Out ssthresh 306 /\ | * \ / \ +--------+ * 307 TCP | * \/ \ / Multiple| * 308 Through- | * 2-Congestion\ / Loss | * 309 put | * Avoidance \/ Event | * 310 | * Half | * 311 | * TCP CWND | * 1-Slow Start 312 | * 1-Slow Start Min TCP CWND after T-O 313 +----------------------------------------------------------- 314 Time > > > > > > > > > > > > > > > 316 Note: ssthresh = Slow Start threshold. 318 A well tuned and managed IP network with appropriate TCP adjustments 319 in its IP hosts and applications should perform very close to TCP 320 Equilibrium and to the BB (Bottleneck Bandwidth). 322 This TCP methodology provides guidelines to measure the maximum 323 achievable TCP throughput or maximum TCP sustained rate obtained 324 after TCP CWND has stabilized to an optimal value. All maximum 325 achievable TCP throughputs specified in section 3 are with respect to 326 this condition. 328 It is important to clarify the interaction between the sender's Send 329 Socket Buffer and the receiver's advertised TCP RWIN Size. TCP test 330 programs such as iperf, ttcp, etc.
allow the sender to control 331 the quantity of TCP Bytes transmitted and unacknowledged (in-flight), 332 commonly referred to as the Send Socket Buffer. This is done 333 independently of the TCP RWIN Size advertised by the 334 receiver. Implications to the capabilities of the Throughput Test 335 Device (TTD) are covered at the end of section 3. 337 3. TCP Throughput Testing Methodology 339 As stated earlier in section 1, it is considered best practice to 340 verify the integrity of the network by conducting Layer 2/3 tests such 341 as [RFC2544] or other network stress test methods. However, it 342 is important to mention here that RFC 2544 was never meant to be used 343 outside a lab environment. 345 If the network is not performing properly in terms of packet loss, 346 jitter, etc., then the TCP layer testing will not be meaningful. A 347 dysfunctional network will not achieve optimal TCP throughput relative 348 to the available bandwidth. 350 TCP Throughput testing may require cooperation between the end-user 351 customer and the network provider. In a Layer 2/3 VPN architecture, 352 the testing should be conducted either on the CPE or on the CE device 353 and not on the PE (Provider Edge) router. 355 The following represents the sequential order of steps for this 356 testing methodology: 358 1. Identify the Path MTU. Packetization Layer Path MTU Discovery 359 or PLPMTUD, [RFC4821], MUST be conducted to verify the network path 360 MTU. Conducting PLPMTUD establishes the upper limit for the MSS to 361 be used in subsequent steps. 363 2. Baseline Round Trip Time and Bandwidth. This step establishes the 364 inherent, non-congested Round Trip Time (RTT) and the bottleneck 365 bandwidth of the end-to-end network path. These measurements are 366 used to provide estimates of the ideal maximum TCP RWIN and Send 367 Socket Buffer Sizes that SHOULD be used in subsequent test steps. 368 These measurements reference [RFC2681] and [RFC4898] to measure RTD 369 and the associated RTT. 371 3. TCP Connection Throughput Tests. With baseline measurements 372 of Round Trip Time and bottleneck bandwidth, single and multiple TCP 373 connection throughput tests SHOULD be conducted to baseline network 374 performance expectations. 376 4. Traffic Management Tests. Various traffic management and queuing 377 techniques can be tested in this step, using multiple TCP 378 connections. Multiple connection testing should verify that the 379 network is configured properly for traffic shaping versus policing, 380 various queuing implementations and Random Early Discard (RED). 382 Important to note are some of the key characteristics and 383 considerations for the TCP test instrument. The test host may be a 384 standard computer or a dedicated communications test instrument. 385 In either case, it must be capable of emulating both a client and 386 a server. 388 The following criteria should be considered when selecting whether 389 the TCP test host can be a standard computer or has to be a dedicated 390 communications test instrument: 392 - TCP implementation used by the test host, OS version, e.g. a Linux OS 393 kernel using TCP New Reno, TCP options supported, etc. These will 394 obviously be more important when using dedicated communications test 395 instruments where the TCP implementation may be customized or tuned 396 to run on higher performance hardware. When a compliant TCP TTD is 397 used, the TCP implementation MUST be identified in the test results.
398 The compliant TCP TTD should be usable for complete end-to-end 399 testing through network security elements and should also be usable 400 for testing network sections. 402 - More importantly, the TCP test host MUST be capable of generating 403 and receiving stateful TCP test traffic at the full link speed of the 404 network under test. Stateful TCP test traffic means that the test 405 host MUST fully implement a TCP/IP stack; this is generally a comment 406 aimed at dedicated communications test equipment which sometimes 407 "blasts" packets with TCP headers. As a general rule of thumb, testing 408 TCP throughput at rates greater than 100 Mbit/sec MAY require high 409 performance server hardware or dedicated hardware based test tools. 411 - A compliant TCP Throughput Test Device MUST allow adjusting both 412 Send and Receive Socket Buffer sizes. The Send Socket Buffer MUST be 413 large enough to accommodate the maximum TCP CWND Size. The Receive 414 Socket Buffer MUST be large enough to accommodate the maximum TCP 415 RWIN Size. 417 - Measuring RTT and retransmissions per connection will generally 418 require a dedicated communications test instrument. In the absence of 419 dedicated hardware based test tools, these measurements may need to 420 be conducted with packet capture tools, i.e. conduct TCP throughput 421 tests and analyze RTT and retransmission results in packet captures. 422 Another option may be to use the "TCP Extended Statistics MIB" per 423 [RFC4898]. 425 - The [RFC4821] PLPMTUD test SHOULD be conducted with a dedicated 426 tester which exposes the ability to run the PLPMTUD algorithm 427 independently of the OS stack. 429 3.1. Determine Network Path MTU 431 TCP implementations should use Path MTU Discovery techniques (PMTUD). 432 PMTUD relies on ICMP 'need to frag' messages to learn the path MTU. 433 When a device has a packet to send which has the Don't Fragment (DF) 434 bit in the IP header set and the packet is larger than the Maximum 435 Transmission Unit (MTU) of the next hop, the packet is dropped and 436 the device sends an ICMP 'need to frag' message back to the host that 437 originated the packet. The ICMP 'need to frag' message includes 438 the next hop MTU which PMTUD uses to tune the TCP Maximum Segment 439 Size (MSS). Unfortunately, because many network managers completely 440 disable ICMP, this technique does not always prove reliable. 442 Packetization Layer Path MTU Discovery or PLPMTUD [RFC4821] MUST then 443 be conducted to verify the network path MTU. PLPMTUD can be used 444 with or without ICMP. The following sections provide a summary of the 445 PLPMTUD approach and an example using TCP. [RFC4821] specifies a 446 search_high and a search_low parameter for the MTU. As specified in 447 [RFC4821], 1024 Bytes is a safe value for search_low in modern 448 networks. 450 It is important to determine the link overheads along the IP path, 451 and then to select a TCP MSS corresponding to the Layer 3 MTU. 452 For example, if the MTU is 1024 Bytes and the TCP/IP headers are 40 453 Bytes, then the MSS would be set to 984 Bytes. 455 An example scenario is a network where the actual path MTU is 1240 456 Bytes. The TCP client probe MUST be capable of setting the MSS for 457 the probe packets and could start at MSS = 984 (which corresponds 458 to an MTU size of 1024 Bytes). 460 The TCP client probe would open a TCP connection and advertise the 461 MSS as 984. Note that the client probe MUST generate these packets 462 with the DF bit set.
The TCP client probe then sends test traffic 463 per a small default Send Socket Buffer size of ~8KBytes. It should 464 be kept small to minimize the possibility of congesting the network, 465 which may induce packet loss. The duration of the test should also 466 be short (10-30 seconds), again to minimize congestive effects 467 during the test. 469 In the example of a 1240 Bytes path MTU, probing with an MSS equal to 470 984 would yield a successful probe and the test client packets would 471 be successfully transferred to the test server. 473 Also note that the test client MUST verify that the MSS advertised 474 is indeed negotiated. Network devices with built-in Layer 4 475 capabilities can intercede during the connection establishment and 476 reduce the advertised MSS to avoid fragmentation. This is certainly 477 a desirable feature from a network perspective, but it can yield 478 erroneous test results if the client test probe does not confirm the 479 negotiated MSS. 481 The next test probe would use the search_high value and this would 482 be set to MSS = 1460 to correspond to a 1500 Bytes MTU. In this 483 example, the test client will retransmit based upon time-outs, since 484 no ACKs will be received from the test server. This test probe is 485 marked as a conclusive failure if none of the test packets are 486 ACK'ed. If any of the test packets are ACK'ed, congestive network 487 may be the cause and the test probe is not conclusive. Re-testing 488 at other times of the day is recommended to further isolate. 490 The test is repeated until the desired granularity of the MTU is 491 discovered. The method can yield precise results at the expense of 492 probing time. One approach may be to reduce the probe size to 493 half between the unsuccessful search_high and successful search_low 494 value and raise it by half also when seeking the upper limit. 496 3.2. Baseline Round Trip Time and Bandwidth 498 Before stateful TCP testing can begin, it is important to determine 499 the baseline Round Trip Time (non-congested inherent delay) and 500 bottleneck bandwidth of the end-to-end network to be tested. These 501 measurements are used to provide estimates of the ideal maximum TCP 502 RWIN and Send Socket Buffer Sizes that SHOULD be used in 503 subsequent test steps. 505 3.2.1 Techniques to Measure Round Trip Time 507 Following the definitions used in section 1.1, Round Trip Time (RTT) 508 is the elapsed time between the clocking in of the first bit of a 509 payload sent packet to the receipt of the last bit of the 510 corresponding Acknowledgment. Round Trip Delay (RTD) is used 511 synonymously to twice the Link Latency. RTT measurements SHOULD use 512 techniques defined in [RFC2681] or statistics available from MIBs 513 defined in [RFC4898]. 515 The RTT SHOULD be baselined during "off-peak" hours to obtain a 516 reliable figure for inherent network latency versus additional delay 517 caused by network buffering. When sampling values of RTT over a test 518 interval, the minimum value measured SHOULD be used as the baseline 519 RTT since this will most closely estimate the inherent network 520 latency. This inherent RTT is also used to determine the Buffer 521 Delay Percentage metric which is defined in Section 3.3.2 523 The following list is not meant to be exhaustive, although it 524 summarizes some of the most common ways to determine round trip time. 525 The desired resolution of the measurement (i.e. 
msec versus usec) may 526 dictate whether the RTT measurement can be achieved with ICMP pings 527 or by a dedicated communications test instrument with precision 528 timers. 530 The objective in this section is to list several techniques 531 in order of decreasing accuracy. 533 - Use test equipment on each end of the network, "looping" the 534 far-end tester so that a packet stream can be measured back and forth 535 from end-to-end. This RTT measurement may be compatible with delay 536 measurement protocols specified in [RFC5357]. 538 - Conduct packet captures of TCP test sessions using "iperf" or FTP, 539 or other TCP test applications. By running multiple experiments, 540 packet captures can then be analyzed to estimate RTT. It is 541 important to note that results based upon the SYN -> SYN-ACK at the 542 beginning of TCP sessions should be avoided since Firewalls might 543 slow down 3 way handshakes. 545 - ICMP pings may also be adequate to provide round trip time 546 estimates, provided that the packet size is factored into the 547 estimates (i.e. pings with different packet sizes might be required). 548 Some limitations with ICMP Ping may include msec resolution and 549 whether the network elements are responding to pings or not. Also, 550 ICMP is often rate-limited and segregated into different buffer 551 queues and is not as reliable and accurate as in-band measurements. 553 3.2.2 Techniques to Measure end-to-end Bandwidth 555 Before any TCP Throughput test can be done, bandwidth measurement 556 tests MUST be run with stateless IP streams (i.e. not stateful TCP) 557 in order to determine the available bandwidths. These measurements 558 SHOULD be conducted in both directions of the network, especially for 559 access networks, which may be asymmetrical. These tests should 560 obviously be performed at various intervals throughout a business day 561 or even across a week. Ideally, the bandwidth tests should produce 562 logged outputs of the achieved bandwidths across the tests durations. 564 There are many well established techniques available to provide 565 estimated measures of bandwidth over a network. It is a common 566 practice for network providers to conduct Layer2/3 bandwidth capacity 567 tests using [RFC2544], although it is understood that RFC 2544 was 568 never meant to be used outside a lab environment. Ideally, these 569 bandwidth measurements SHOULD use network capacity techniques as 570 defined in [RFC5136]. 572 The bandwidth results should be at least 90% of the business customer 573 SLA or to the IP-type-P Available Path Capacity defined in RFC5136. 575 3.3. TCP Throughput Tests 577 This methodology specifically defines TCP throughput techniques to 578 verify sustained TCP performance in a managed business IP network, as 579 defined in section 2.1. This section and others will define the 580 method to conduct these sustained TCP throughput tests and guidelines 581 for the predicted results. 583 With baseline measurements of round trip time and bandwidth 584 from section 3.2, a series of single and multiple TCP connection 585 throughput tests SHOULD be conducted to baseline network performance 586 against expectations. The number of trials and the type of testing 587 (single versus multiple connections) will vary according to the 588 intention of the test. One example would be a single connection test 589 in which the throughput achieved by large Send and Receive Socket 590 Buffers sizes (i.e. 256KB) is to be measured. 
It would be advisable 591 to test performance at various times of the business day. 593 It is RECOMMENDED to run the tests in each direction independently 594 first, then run both directions simultaneously. In each case, 595 TCP Transfer Time, TCP Efficiency, and Buffer Delay Percentage MUST 596 be measured in each direction. These metrics are defined in section 3.3.2. 598 3.3.1 Calculate Ideal maximum TCP RWIN Size 600 The ideal maximum TCP RWIN Size can be calculated from the 601 bandwidth delay product (BDP), which is: 603 BDP (bits) = RTT (sec) x Bandwidth (bps) 605 Note that the RTT is being used as the "Delay" variable in the 606 BDP calculations. 608 Then, by dividing the BDP by 8, we obtain the "ideal" maximum TCP 609 RWIN Size in Bytes. For optimal results, the Send Socket 610 Buffer size must be adjusted to the same value at the opposite end 611 of the network path. 613 Ideal maximum TCP RWIN = BDP / 8 615 An example would be a T3 link with 25 msec RTT. The BDP would equal 616 1,105,250 bits and the ideal maximum TCP RWIN would be ~138 KBytes. 618 Note that separate calculations are required on asymmetrical paths. 619 An asymmetrical path example would be a 90 msec RTT ADSL line with 620 5 Mbps downstream and 640 Kbps upstream. The downstream BDP would equal 621 ~450,000 bits while the upstream one would be only ~57,600 bits. 623 The following table provides some representative network Link Speeds, 624 RTT, BDP, and associated Ideal maximum TCP RWIN Sizes. 626 Table 3.3.1: Link Speed, RTT, calculated BDP & max TCP RWIN 628 Link Ideal max 629 Speed* RTT BDP TCP RWIN 630 (Mbps) (ms) (bits) (KBytes) 631 --------------------------------------------------------------------- 632 1.536 20 30,720 3.84 633 1.536 50 76,800 9.60 634 1.536 100 153,600 19.20 635 44.210 10 442,100 55.26 636 44.210 15 663,150 82.89 637 44.210 25 1,105,250 138.16 638 100 1 100,000 12.50 639 100 2 200,000 25.00 640 100 5 500,000 62.50 641 1,000 0.1 100,000 12.50 642 1,000 0.5 500,000 62.50 643 1,000 1 1,000,000 125.00 644 10,000 0.05 500,000 62.50 645 10,000 0.3 3,000,000 375.00 647 * Note that link speed is the bottleneck bandwidth (BB) for the NUT 649 The following serial link speeds are used: 650 - T1 = 1.536 Mbits/sec (for a B8ZS line encoding facility) 651 - T3 = 44.21 Mbits/sec (for a C-Bit Framing facility) 653 The above table illustrates the ideal maximum TCP RWIN. 654 If a smaller TCP RWIN Size is used, then the TCP Throughput 655 is not optimal. To calculate the TCP Throughput, the following 656 formula is used: TCP Throughput = max TCP RWIN X 8 / RTT 657 An example would be a 100 Mbps IP path with a 5 ms RTT and a maximum 658 TCP RWIN Size of 16 KB; then: 660 TCP Throughput = 16 KBytes X 8 bits / 5 ms. 661 TCP Throughput = 128,000 bits / 0.005 sec. 662 TCP Throughput = 25.6 Mbps. 664 Another example for a T3 using the same calculation formula is 665 illustrated below: 666 TCP Throughput = max TCP RWIN X 8 / RTT. 667 TCP Throughput = 16 KBytes X 8 bits / 10 ms. 668 TCP Throughput = 128,000 bits / 0.01 sec. 669 TCP Throughput = 12.8 Mbps. 671 When the maximum TCP RWIN Size exceeds the BDP (e.g. a T3 link with a 672 64 KByte max TCP RWIN on a 10 ms RTT path), the maximum 673 frames per second limit of 3664 is reached and the formula is: 675 TCP Throughput = Max FPS X MSS X 8. 676 TCP Throughput = 3664 FPS X 1460 Bytes X 8 bits. 677 TCP Throughput = 42.8 Mbps. 679 The following diagram compares achievable TCP throughputs on a T3 680 with Send Socket Buffer & max TCP RWIN Sizes of 16KB vs. 64KB.
682 Figure 3.3.1a TCP Throughputs on a T3 at different RTTs 684 45| 685 | _______42.8M 686 40| |64KB | 687 TCP | | | 688 Throughput 35| | | 689 in Mbps | | | +-----+34.1M 690 30| | | |64KB | 691 | | | | | 692 25| | | | | 693 | | | | | 694 20| | | | | _______20.5M 695 | | | | | |64KB | 696 15| | | | | | | 697 |12.8M+-----| | | | | | 698 10| |16KB | | | | | | 699 | | | |8.5M+-----| | | | 700 5| | | | |16KB | |5.1M+-----| | 701 |_____|_____|_____|____|_____|_____|____|16KB |_____|_____ 702 10 15 25 703 RTT in milliseconds 705 The following diagram shows the achievable TCP throughput on a 25ms 706 T3 when Send Socket Buffer & maximum TCP RWIN Sizes are increased. 708 Figure 3.3.1b TCP Throughputs on a T3 with different TCP RWIN 710 45| 711 | 712 40| +-----+40.9M 713 TCP | | | 714 Throughput 35| | | 715 in Mbps | | | 716 30| | | 717 | | | 718 25| | | 719 | | | 720 20| +-----+20.5M | | 721 | | | | | 722 15| | | | | 723 | | | | | 724 10| +-----+10.2M | | | | 725 | | | | | | | 726 5| +-----+5.1M | | | | | | 727 |_____|_____|______|_____|______|_____|_______|_____|_____ 728 16 32 64 128* 729 maximum TCP RWIN Size in KBytes 731 * Note that 128KB requires [RFC1323] TCP Window scaling option. 733 3.3.2 Metrics for TCP Throughput Tests 735 This framework focuses on a TCP throughput methodology and also 736 provides several basic metrics to compare results of various 737 throughput tests. It is recognized that the complexity and 738 unpredictability of TCP makes it impossible to develop a complete 739 set of metrics that accounts for the myriad of variables (i.e. RTT 740 variation, loss conditions, TCP implementation, etc.). However, 741 these basic metrics will facilitate TCP throughput comparisons 742 under varying network conditions and between network traffic 743 management techniques. 745 The first metric is the TCP Transfer Time, which is simply the 746 measured time it takes to transfer a block of data across 747 simultaneous TCP connections. This concept is useful when 748 benchmarking traffic management techniques and where multiple 749 TCP connections are required. 751 TCP Transfer time may also be used to provide a normalized ratio of 752 the actual TCP Transfer Time versus the Ideal Transfer Time. This 753 ratio is called the TCP Transfer Index and is defined as: 755 Actual TCP Transfer Time 756 ------------------------- 757 Ideal TCP Transfer Time 759 The Ideal TCP Transfer time is derived from the network path 760 bottleneck bandwidth and various Layer 1/2/3/4 overheads associated 761 with the network path. Additionally, both the maximum TCP RWIN and 762 the Send Socket Buffer Sizes must be tuned to equal the bandwidth 763 delay product (BDP) as described in section 3.3.1. 765 The following table illustrates the Ideal TCP Transfer time of a 766 single TCP connection when its maximum TCP RWIN and Send Socket 767 Buffer Sizes are equal to the BDP. 769 Table 3.3.2: Link Speed, RTT, BDP, TCP Throughput, and 770 Ideal TCP Transfer time for a 100 MB File 772 Link Maximum Ideal TCP 773 Speed BDP Achievable TCP Transfer time 774 (Mbps) RTT (ms) (KBytes) Throughput(Mbps) (seconds) 775 -------------------------------------------------------------------- 776 1.536 50 9.6 1.4 571 777 44.21 25 138.2 42.8 18 778 100 2 25.0 94.9 9 779 1,000 1 125.0 949.2 1 780 10,000 0.05 62.5 9,492 0.1 782 Transfer times are rounded for simplicity. 
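The calculations of section 3.3.1 can be scripted directly. The following Python sketch is non-normative and provided for illustration only; the function names are our own, and KBytes are treated as 1000 Bytes to match the tables above.

   # Non-normative sketch of the section 3.3.1 calculations.
   def bdp_bits(bw_mbps, rtt_ms):
       # BDP (bits) = RTT (sec) x Bandwidth (bps)
       return (rtt_ms / 1000.0) * (bw_mbps * 1000000)

   def ideal_max_rwin_kbytes(bw_mbps, rtt_ms):
       # Ideal maximum TCP RWIN = BDP / 8 (expressed here in KBytes of 1000 Bytes)
       return bdp_bits(bw_mbps, rtt_ms) / 8 / 1000.0

   def rwin_limited_throughput_mbps(rwin_kbytes, rtt_ms):
       # TCP Throughput = max TCP RWIN x 8 / RTT
       return (rwin_kbytes * 1000 * 8) / (rtt_ms / 1000.0) / 1000000

   # T3 (44.21 Mbps) at 25 ms RTT -> BDP = 1,105,250 bits, ideal RWIN ~138 KBytes
   print(bdp_bits(44.21, 25), ideal_max_rwin_kbytes(44.21, 25))
   # 16 KByte RWIN on a 5 ms RTT path -> 25.6 Mbps, as in section 3.3.1
   print(rwin_limited_throughput_mbps(16, 5))

Note that such a RWIN-limited result only applies while the maximum TCP RWIN is smaller than the BDP; otherwise the frames-per-second limit described below governs the result.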
784 For a 100 MB file (100 MB x 8 = 800 Mbits), the Ideal TCP Transfer Time 785 is derived as follows: 787 800 Mbits 788 Ideal TCP Transfer Time = ----------------------------------- 789 Maximum Achievable TCP Throughput 791 The maximum achievable layer 2 throughput on T1 and T3 Interfaces 792 is based on the maximum frames per second (FPS) permitted by the 793 actual layer 1 speed when the MTU is 1500 Bytes. 795 The maximum FPS for a T1 is 127 and the calculation formula is: 796 FPS = T1 Link Speed / ((MTU + PPP + Flags + CRC16) X 8) 797 FPS = 1.536M / ((1500 Bytes + 4 Bytes + 2 Bytes + 2 Bytes) X 8) 798 FPS = 1.536M / (1508 Bytes X 8) 799 FPS = 1.536 Mbps / 12064 bits 800 FPS = 127 802 The maximum FPS for a T3 is 3664 and the calculation formula is: 803 FPS = T3 Link Speed / ((MTU + PPP + Flags + CRC16) X 8) 804 FPS = 44.21M / ((1500 Bytes + 4 Bytes + 2 Bytes + 2 Bytes) X 8) 805 FPS = 44.21M / (1508 Bytes X 8) 806 FPS = 44.21 Mbps / 12064 bits 807 FPS = 3664 808 The 1508 equates to: 810 MTU + PPP + Flags + CRC16 812 Where MTU is 1500 Bytes, PPP is 4 Bytes, Flags are 2 Bytes and CRC16 813 is 2 Bytes. 815 Then, to obtain the Maximum Achievable TCP Throughput (layer 4), we 816 simply use: MSS in Bytes X 8 bits X max FPS. 817 For a T3, the maximum TCP Throughput = 1460 Bytes X 8 bits X 3664 FPS 818 Maximum TCP Throughput = 11680 bits X 3664 FPS 819 Maximum TCP Throughput = 42.8 Mbps. 821 The maximum achievable layer 2 throughput on Ethernet Interfaces is 822 based on the maximum frames per second permitted by the IEEE 802.3 823 standard when the MTU is 1500 Bytes. 825 The maximum FPS for 100M Ethernet is 8127 and the calculation is: 826 FPS = 100 Mbps / (1538 Bytes X 8 bits) 828 The maximum FPS for GigE is 81274 and the calculation formula is: 829 FPS = 1 Gbps / (1538 Bytes X 8 bits) 831 The maximum FPS for 10GigE is 812743 and the calculation formula is: 832 FPS = 10 Gbps / (1538 Bytes X 8 bits) 834 The 1538 equates to: 836 MTU + Eth + CRC32 + IFG + Preamble + SFD 838 Where MTU is 1500 Bytes, Ethernet is 14 Bytes, CRC32 is 4 Bytes, 839 IFG is 12 Bytes, Preamble is 7 Bytes and SFD is 1 Byte. 841 Note that better results could be obtained with jumbo frames on 842 GigE and 10 GigE. 844 Then, to obtain the Maximum Achievable TCP Throughput (layer 4), we 845 simply use: MSS in Bytes X 8 bits X max FPS. 846 For 100M Ethernet, the maximum TCP Throughput = 1460 Bytes X 8 bits X 8127 FPS 847 Maximum TCP Throughput = 11680 bits X 8127 FPS 848 Maximum TCP Throughput = 94.9 Mbps. 850 To illustrate the TCP Transfer Index, an example would be the 851 bulk transfer of 100 MB over 5 simultaneous TCP connections (each 852 connection uploading 100 MB). In this example, the Ethernet service 853 provides a Committed Access Rate (CAR) of 500 Mbit/s. Each 854 connection may achieve different throughputs during a test and the 855 overall throughput rate is not always easy to determine (especially 856 as the number of connections increases). 858 The ideal TCP Transfer Time would be ~8 seconds, but in this example, 859 the actual TCP Transfer Time was 12 seconds. The TCP Transfer Index 860 would then be 12/8 = 1.5, which indicates that the transfer across 861 all connections took 1.5 times longer than the ideal.
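As an illustration of the above derivations, the following non-normative Python sketch reproduces the maximum FPS, Maximum Achievable TCP Throughput, Ideal TCP Transfer Time and TCP Transfer Index figures quoted in this section. The function names and the 10^6 Byte MB convention are assumptions of the sketch, not requirements of this methodology.

   # Non-normative sketch of the section 3.3.2 calculations.
   def max_fps(link_mbps, l2_overhead_bytes, mtu=1500):
       # Maximum frames per second for the given layer 1/2 overhead.
       return int(link_mbps * 1000000 // ((mtu + l2_overhead_bytes) * 8))

   def max_tcp_throughput_mbps(fps, mss=1460):
       # Maximum Achievable TCP Throughput = MSS x 8 bits x max FPS
       return fps * mss * 8 / 1000000

   def ideal_transfer_time_sec(payload_mbytes, throughput_mbps):
       # Ideal TCP Transfer Time = payload in Mbits / Maximum Achievable TCP Throughput
       return payload_mbytes * 8 / throughput_mbps

   t3_fps  = max_fps(44.21, 8)    # PPP + Flags + CRC16 = 8 Bytes  -> 3664 FPS
   eth_fps = max_fps(100, 38)     # Eth + CRC32 + IFG + Preamble + SFD = 38 Bytes -> 8127 FPS
   print(max_tcp_throughput_mbps(t3_fps), max_tcp_throughput_mbps(eth_fps))  # ~42.8 and ~94.9 Mbps
   # 5 x 100 MB over a 500 Mbit/s CAR: ideal ~8 seconds (mirroring the
   # example's approximation); a measured 12 seconds gives an index of 1.5.
   ideal = ideal_transfer_time_sec(5 * 100, 500)
   print(ideal, 12 / ideal)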
863 The second metric is TCP Efficiency, which is the percentage of Bytes 864 that were not retransmitted and is defined as: 866 Transmitted Bytes - Retransmitted Bytes 867 --------------------------------------- x 100 868 Transmitted Bytes 870 Transmitted Bytes are the total number of TCP payload Bytes to be 871 transmitted which includes the original and retransmitted Bytes. This 872 metric provides a comparative measure between various QoS mechanisms 873 like traffic management or congestion avoidance. Various TCP 874 implementations like Reno, Vegas, etc. could also be compared. 876 As an example, if 100,000 Bytes were sent and 2,000 had to be 877 retransmitted, the TCP Efficiency should be calculated as: 879 102,000 - 2,000 880 ---------------- x 100 = 98.03% 881 102,000 883 Note that the retransmitted Bytes may have occurred more than once, 884 and these multiple retransmissions are added to the Retransmitted 885 Bytes count (and the Transmitted Bytes count). 887 The third metric is the Buffer Delay Percentage, which represents the 888 increase in RTT during a TCP throughput test with respect to 889 inherent or baseline network RTT. The baseline RTT is the round-trip 890 time inherent to the network path under non-congested conditions. 891 (See 3.2.1 for details concerning the baseline RTT measurements). 893 The Buffer Delay Percentage is defined as: 895 Average RTT during Transfer - Baseline RTT 896 ------------------------------------------ x 100 897 Baseline RTT 899 As an example, the baseline RTT for the network path is 25 msec. 900 During the course of a TCP transfer, the average RTT across the 901 entire transfer increased to 32 msec. In this example, the Buffer 902 Delay Percentage would be calculated as: 904 32 - 25 905 ------- x 100 = 28% 906 25 908 Note that the TCP Transfer Time, TCP Efficiency, and Buffer Delay 909 Percentage MUST be measured during each throughput test. Poor TCP 910 Transfer Time Indexes (TCP Transfer Time greater than Ideal TCP 911 Transfer Times) may be diagnosed by correlating with sub-optimal TCP 912 Efficiency and/or Buffer Delay Percentage metrics. 914 3.3.3 Conducting the TCP Throughput Tests 916 Several TCP tools are currently used in the network world and one of 917 the most common is "iperf". With this tool, hosts are installed at 918 each end of the network path; one acts as client and the other as 919 a server. The Send Socket Buffer and the maximum TCP RWIN Sizes 920 of both client and server can be manually set. The achieved 921 throughput can then be measured, either uni-directionally or 922 bi-directionally. For higher BDP situations in lossy networks 923 (long fat networks or satellite links, etc.), TCP options such as 924 Selective Acknowledgment SHOULD be considered and become part of 925 the window size / throughput characterization. 927 Host hardware performance must be well understood before conducting 928 the tests described in the following sections. A dedicated 929 communications test instrument will generally be required, especially 930 for line rates of GigE and 10 GigE. A compliant TCP TTD SHOULD 931 provide a warning message when the expected test throughput will 932 exceed 10% of the network bandwidth capacity. If the throughput test 933 is expected to exceed 10% of the provider bandwidth, then the test 934 should be coordinated with the network provider. This does not 935 include the customer premise bandwidth, the 10% refers directly to 936 the provider's bandwidth (Provider Edge to Provider router). 
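The TCP Efficiency and Buffer Delay Percentage definitions of section 3.3.2 translate directly into code. The following Python sketch is non-normative and provided for illustration only (the function names are our own); it reproduces the 98.03% and 28% examples given above.

   # Non-normative sketch of the TCP Efficiency and Buffer Delay Percentage metrics.
   def tcp_efficiency_pct(transmitted_bytes, retransmitted_bytes):
       # Transmitted Bytes include both the original and the retransmitted Bytes.
       return (transmitted_bytes - retransmitted_bytes) / transmitted_bytes * 100

   def buffer_delay_pct(avg_rtt_ms, baseline_rtt_ms):
       # Increase in RTT during the transfer relative to the baseline (inherent) RTT.
       return (avg_rtt_ms - baseline_rtt_ms) / baseline_rtt_ms * 100

   # 100,000 original Bytes plus 2,000 retransmitted Bytes -> 98.039..., quoted
   # as 98.03% in the example above.
   print(tcp_efficiency_pct(102000, 2000))
   # Average RTT of 32 ms against a 25 ms baseline -> 28.0%
   print(buffer_delay_pct(32, 25))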
938 The TCP throughput test should be run over a long enough duration 939 to properly exercise network buffers (greater than 30 seconds) and 940 also characterize performance at different time periods of the day. 942 3.3.4 Single vs. Multiple TCP Connection Testing 944 The decision whether to conduct single or multiple TCP connection 945 tests depends upon the size of the BDP in relation to the maximum 946 TCP RWIN configured in the end-user environment. For example, if 947 the BDP for a long fat network turns out to be 2MB, then it is 948 probably more realistic to test this network path with multiple 949 connections. Assuming typical host computer maximum TCP RWIN Sizes 950 of 64 KB, using 32 TCP connections would realistically test this 951 path. 953 The following table is provided to illustrate the relationship 954 between the maximum TCP RWIN and the number of TCP connections 955 required to utilize the available capacity of a given BDP. For this 956 example, the network bandwidth is 500 Mbps and the RTT is 5 ms, then 957 the BDP equates to 312.5 KBytes. 959 Table 3.3.4 Number of TCP connections versus maximum TCP RWIN 961 Maximum Number of TCP Connections 962 TCP RWIN to fill available bandwidth 963 ------------------------------------- 964 16KB 20 965 32KB 10 966 64KB 5 967 128KB 3 969 The TCP Transfer Time metric is useful for conducting multiple 970 connection tests. Each connection should be configured to transfer 971 payloads of the same size (i.e. 100 MB), and the TCP Transfer time 972 should provide a simple metric to verify the actual versus expected 973 results. 975 Note that the TCP transfer time is the time for all connections to 976 complete the transfer of the configured payload size. From the 977 previous table, the 64KB window is considered. Each of the 5 978 TCP connections would be configured to transfer 100MB, and each one 979 should obtain a maximum of 100 Mb/sec. So for this example, the 980 100MB payload should be transferred across the connections in 981 approximately 8 seconds (which would be the ideal TCP transfer time 982 under these conditions). 984 Additionally, the TCP Efficiency metric MUST be computed for each 985 connection tested as defined in section 3.3.2. 987 3.3.5 Interpretation of the TCP Throughput Results 989 At the end of this step, the user will document the theoretical BDP 990 and a set of Window size experiments with measured TCP throughput for 991 each TCP window size. For cases where the sustained TCP throughput 992 does not equal the ideal value, some possible causes are: 994 - Network congestion causing packet loss which MAY be inferred from 995 a poor TCP Efficiency % (higher TCP Efficiency % = less packet 996 loss) 997 - Network congestion causing an increase in RTT which MAY be inferred 998 from the Buffer Delay Percentage (i.e., 0% = no increase in RTT 999 over baseline) 1000 - Intermediate network devices which actively regenerate the TCP 1001 connection and can alter TCP RWIN Size, MSS, etc. 1002 - Rate limiting (policing). More details on traffic management 1003 tests follows in section 3.4 1005 3.3.6 High Performance Network Options 1007 For cases where the network outperforms the client/server IP hosts 1008 some possible causes are: 1010 - Maximum TCP Buffer space. All operating systems have a global 1011 mechanism to limit the quantity of system memory to be used by TCP 1012 connections. 
On some systems, each connection is subject to a memory 1013 limit that is applied to the total memory used for input data, output 1014 data and controls. On other systems, there are separate limits for 1015 input and output buffer spaces per connection. Client/server IP 1016 hosts might be configured with Maximum Buffer Space limits that are 1017 far too small for high performance networks. 1019 - Socket Buffer Sizes. Most operating systems support separate per 1020 connection send and receive buffer limits that can be adjusted as 1021 long as they stay within the maximum memory limits. These socket 1022 buffers must be large enough to hold a full BDP of TCP segments plus 1023 some overhead. There are several methods that can be used to adjust 1024 socket buffer sizes, but TCP Auto-Tuning automatically adjusts these 1025 as needed to optimally balance TCP performance and memory usage. 1026 It is important to note that Auto-Tuning is enabled by default in 1027 Linux since kernel release 2.6.6 and in UNIX since FreeBSD 7.0. 1028 It is also enabled by default in Windows since Vista and in Mac OS X 1029 since version 10.5 (Leopard). Over buffering can cause some 1030 applications to behave poorly, typically causing sluggish interactive 1031 response and risking running the system out of memory. Large default 1032 socket buffers have to be considered carefully on multi-user systems. 1034 - TCP Window Scale Option, RFC1323. This option enables TCP to 1035 support large BDP paths. It provides a scale factor which is 1036 required for TCP to support window sizes larger than 64KB. Most 1037 systems automatically request WSCALE under some conditions, such as 1038 when the receive socket buffer is larger than 64KB or when the other 1039 end of the TCP connection requests it first. WSCALE can only be 1040 negotiated during the 3-way handshake. If either end fails to 1041 request WSCALE or requests an insufficient value, it cannot be 1042 renegotiated. Different systems use different algorithms to select 1043 WSCALE, but they are all constrained by the maximum permitted buffer 1044 size, the current receiver buffer size for this connection, or a 1045 global system setting. Note that under these constraints, a client 1046 application wishing to send data at high rates may need to set its 1047 own receive buffer to something larger than 64 KBytes before it 1048 opens the connection to ensure that the server properly negotiates 1049 WSCALE. A system administrator might have to explicitly enable 1050 RFC1323 extensions. Otherwise, the client/server IP host would not 1051 support TCP window sizes (BDP) larger than 64KB. Most of the time, 1052 performance gains will be obtained by enabling this option in Long 1053 Fat Networks (i.e. networks with a large BDP; see Figure 3.3.1b). 1055 - TCP Timestamps Option, RFC1323. This feature provides better 1056 measurements of the Round Trip Time and protects TCP from data 1057 corruption that might occur if packets are delivered so late that the 1058 sequence numbers wrap before they are delivered. Wrapped sequence 1059 numbers do not pose a serious risk below 100 Mbps, but the risk 1060 increases at higher data rates. Most of the time, performance gains 1061 will be obtained by enabling this option in Gigabit bandwidth 1062 networks. 1064 - TCP Selective Acknowledgments Option (SACK), RFC2018. This allows 1065 a TCP receiver to inform the sender about exactly which data segment 1066 is missing and needs to be retransmitted.
Without SACK, TCP has to 1067 estimate which data segment is missing, which works just fine if all 1068 losses are isolated (i.e. only one loss in any given round trip). 1069 Without SACK, TCP takes a very long time to recover after multiple 1070 and consecutive losses. SACK is now supported by most operating 1071 systems, but it may have to be explicitly enabled by the system 1072 administrator. In most situations, enabling TCP SACK will improve 1073 throughput performance, but it is important to note that it might 1074 need to be disabled in network architectures where TCP randomization 1075 is done by network security appliances. 1077 - Path MTU. The client/server IP host system must use the largest 1078 possible MTU for the path. This may require enabling Path MTU 1079 Discovery (RFC1191 & RFC4821). Since RFC1191 is flawed, it is 1080 sometimes not enabled by default and may need to be explicitly 1081 enabled by the system administrator. RFC4821 describes a new, more 1082 robust algorithm for MTU discovery and ICMP black hole recovery. 1084 - TOE (TCP Offload Engine). Some recent Network Interface Cards (NIC) 1085 are equipped with drivers that can do part or all of the TCP/IP 1086 protocol processing. TOE implementations require additional work 1087 (i.e. hardware-specific socket manipulation) to set up and tear down 1088 connections. For connection intensive protocols such as HTTP, TOE 1089 might need to be disabled to increase performance. Because TOE NIC 1090 configuration parameters are vendor specific and not necessarily 1091 RFC-compliant, they are poorly integrated with UNIX and Linux. 1092 Occasionally, TOE might need to be disabled in a server because its 1093 NIC does not have enough memory resources to buffer thousands of 1094 connections. 1096 Note that both ends of a TCP connection must be properly tuned. 1098 3.4. Traffic Management Tests 1100 In most cases, the network connection between two geographic 1101 locations (branch offices, etc.) has a lower bandwidth than the network 1102 connections of the host computers. An example would be LAN connectivity of GigE 1103 and WAN connectivity of 100 Mbps. The WAN connectivity may be 1104 physically 100 Mbps or logically 100 Mbps (over a GigE WAN 1105 connection). In the latter case, rate limiting is used to provide the 1106 WAN bandwidth per the SLA. 1108 Traffic management techniques are employed to provide various forms 1109 of QoS; the more common ones include: 1111 - Traffic Shaping 1112 - Priority queuing 1113 - Random Early Discard (RED) 1114 Configuring the end-to-end network with these various traffic 1115 management mechanisms is a complex undertaking. For traffic shaping 1116 and RED techniques, the end goal is to provide better performance for 1117 bursty traffic such as TCP (RED is specifically intended for TCP). 1119 This section of the methodology provides guidelines to test traffic 1120 shaping and RED implementations. As in section 3.3, host hardware 1121 performance must be well understood before conducting the traffic 1122 shaping and RED tests. A dedicated communications test instrument will 1123 generally be REQUIRED for line rates of GigE and 10 GigE. If the 1124 throughput test is expected to exceed 10% of the provider bandwidth, 1125 then the test should be coordinated with the network provider. This 1126 does not include the customer premises bandwidth; the 10% refers to 1127 the provider's bandwidth (Provider Edge to Provider router).
Note 1128 that GigE and 10 GigE interfaces might benefit from hold-queue 1129 adjustments in order to prevent the saw-tooth TCP traffic pattern. 1131 3.4.1 Traffic Shaping Tests 1133 For services where the available bandwidth is rate limited, two (2) 1134 techniques can be used: traffic policing or traffic shaping. 1136 Simply stated, traffic policing marks and/or drops packets which 1137 exceed the SLA bandwidth (in most cases, excess traffic is dropped). 1138 Traffic shaping employs the use of queues to smooth the bursty 1139 traffic and then sends it out within the SLA bandwidth limit (without 1140 dropping packets unless the traffic shaping queue is exhausted). 1142 Traffic shaping is generally configured for TCP data services and 1143 can provide improved TCP performance since the retransmissions are 1144 reduced, which in turn optimizes TCP throughput for the available 1145 bandwidth. Throughout this section, the rate-limited bandwidth shall 1146 be referred to as the "bottleneck bandwidth". 1148 Proper traffic shaping is more easily diagnosed 1149 when conducting a multiple TCP connection test. Proper shaping will 1150 provide a fair distribution of the available bottleneck bandwidth, 1151 while traffic policing will not. 1153 The traffic shaping tests are built upon the concepts of multiple 1154 connection testing as defined in section 3.3.4. Calculating the BDP 1155 for the bottleneck bandwidth is first required before selecting the 1156 number of connections, the Send Socket Buffer and maximum TCP RWIN 1157 Sizes per connection. 1159 Similar to the example in section 3.3, a typical test scenario might 1160 be: a GigE LAN with a 500 Mbps bottleneck bandwidth (rate limited 1161 logical interface) and 5 msec RTT. This would require five (5) TCP 1162 connections of 64 KB Send Socket Buffer and maximum TCP RWIN Sizes 1163 to evenly fill the bottleneck bandwidth (~100 Mbps per connection). 1165 The traffic shaping test should be run over a long enough duration to 1166 properly exercise network buffers (greater than 30 seconds) and also 1167 characterize performance during different time periods of the day. 1168 The throughput of each connection MUST be logged during the entire 1169 test, along with the TCP Transfer Time, TCP Efficiency, and 1170 Buffer Delay Percentage. 1172 3.4.1.1 Interpretation of Traffic Shaping Test Results 1174 By plotting the throughput achieved by each TCP connection, we should 1175 see fair sharing of the bandwidth when traffic shaping is properly 1176 configured for the bottleneck interface. For the previous example of 1177 5 connections sharing 500 Mbps, each connection would consume 1178 ~100 Mbps with smooth variations. 1180 When traffic shaping is not configured properly or if traffic 1181 policing is present on the bottleneck interface, the bandwidth 1182 sharing may not be fair. The resulting throughput plot may reveal 1183 "spiky" throughput consumption of the competing TCP connections (due 1184 to the high rate of TCP retransmissions). 1186 3.4.2 RED Tests 1188 Random Early Discard techniques are specifically targeted to provide 1189 congestion avoidance for TCP traffic. Before the network element 1190 queue "fills" and enters the tail drop state, RED drops packets at 1191 configurable queue depth thresholds. This action causes TCP 1192 connections to back off, which helps to prevent tail drop, which in 1193 turn helps to prevent global TCP synchronization.
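Both the traffic shaping tests above and the RED tests in this section start by sizing the number of connections against the BDP of the bottleneck. The following Python sketch is non-normative and provided for illustration only; it reproduces the connection counts of Table 3.3.4 and the traffic shaping example of section 3.4.1 (the function names and the use of math.ceil are assumptions of the sketch).

   # Non-normative sketch: connections required to fill the bottleneck BDP.
   import math

   def bdp_kbytes(bw_mbps, rtt_ms):
       # BDP of the bottleneck, expressed in KBytes (1 KByte = 1000 Bytes here).
       return bw_mbps * 1000000 * (rtt_ms / 1000.0) / 8 / 1000.0

   def connections_needed(bw_mbps, rtt_ms, max_rwin_kbytes):
       # Number of TCP connections needed to utilize the available capacity.
       return math.ceil(bdp_kbytes(bw_mbps, rtt_ms) / max_rwin_kbytes)

   # 500 Mbps bottleneck and 5 ms RTT -> BDP of 312.5 KBytes (Table 3.3.4)
   for rwin_kb in (16, 32, 64, 128):
       print(rwin_kb, connections_needed(500, 5, rwin_kb))   # 20, 10, 5, 3
   # With 64 KB windows, 5 connections at ~100 Mbps each fill the shaped
   # 500 Mbps bottleneck; for RED testing, roughly double that number (10)
   # is used to burst beyond the bottleneck, as described below.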
Again, rate-limited interfaces may benefit greatly from RED-based
techniques. Without RED, TCP may not be able to achieve the full
bottleneck bandwidth. With RED enabled, TCP congestion avoidance
throttles the connections on the higher speed interface (i.e. LAN)
and can help achieve the full bottleneck bandwidth. The burstiness
of TCP traffic is a key factor in the overall effectiveness of RED
techniques; steady state bulk transfer flows will generally not
benefit from RED. With bulk transfer flows, network device queues
gracefully throttle the effective throughput rates due to increased
delays.

Proper RED configuration is more easily diagnosed when conducting a
multiple TCP connection test. Multiple TCP connections provide the
bursty sources that emulate the real-world conditions for which RED
was intended.

The RED tests also build upon the concepts of multiple connection
testing as defined in section 3.3.4. Calculating the BDP for the
bottleneck bandwidth is first required before selecting the number
of connections, the Send Socket Buffer size and the maximum TCP RWIN
Size per connection.

For RED testing, the desired effect is to cause the TCP connections
to burst beyond the bottleneck bandwidth so that queue drops will
occur. Using the same example from section 3.4.1 (traffic shaping),
the 500 Mbps bottleneck bandwidth requires 5 TCP connections (with a
window size of 64 KB) to fill the capacity. Some experimentation is
required, but it is recommended to start with double the number of
connections in order to stress the network element buffers/queues
(10 connections for this example).

The TCP TTD must be configured to generate these connections as
shorter (bursty) flows versus bulk transfer type flows. These TCP
bursts should stress queue sizes in the 512 KB range. Again,
experimentation will be required; the proper number of TCP
connections and the Send Socket Buffer and maximum TCP RWIN Sizes
will be dictated by the size of the network element queue.

3.4.2.1 Interpretation of RED Results

The default queuing technique for most network devices is FIFO
based. Without RED, the FIFO based queue may cause excessive loss to
all of the TCP connections and, in the worst case, global TCP
synchronization.

By plotting the aggregate throughput achieved on the bottleneck
interface, proper RED operation may be determined if the bottleneck
bandwidth is fully utilized. For the previous example of 10
connections (window = 64 KB) sharing 500 Mbps, each connection
should consume ~50 Mbps. If RED is not properly enabled on the
interface, then the TCP connections will retransmit at a higher rate
and the net effect is that the bottleneck bandwidth is not fully
utilized.

Another means to study non-RED versus RED implementations is to use
the TCP Transfer Time metric for all of the connections. In this
example, a 100 MB payload transfer on each of the 10 connections
should ideally take 16 seconds (1,000 MB in aggregate across the
500 Mbps bottleneck bandwidth) with RED enabled. With RED not
enabled, the throughput across the bottleneck bandwidth may be
greatly reduced (generally by 10-20%) and the actual TCP Transfer
Time may be proportionally longer than the Ideal TCP Transfer Time.

Additionally, non-RED implementations may exhibit a lower TCP
Efficiency.
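The 16-second figure quoted above can be reproduced with a short
calculation. The sketch below is illustrative only and ignores
L2/L3/L4 protocol overhead, which is why it matches the approximate
ideal value used in this example rather than a precise maximum
achievable throughput.

   # Ideal TCP Transfer Time for the 10-connection RED example.
   payload_bytes_per_conn = 100 * 10**6     # 100 MB per connection
   connections = 10
   bottleneck_bps = 500 * 10**6             # 500 Mbps bottleneck

   total_bits = payload_bytes_per_conn * 8 * connections
   ideal_transfer_time = total_bits / float(bottleneck_bps)

   # A measured (actual) TCP Transfer Time can then be compared
   # against this ideal value; a ratio well above 1 indicates
   # degradation such as the non-RED behavior described above.
   print("Ideal TCP Transfer Time: %.1f s" % ideal_transfer_time)  # 16.0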
4. Security Considerations

The security considerations that apply to any active measurement of
live networks are relevant here as well. See [RFC4656] and
[RFC5357].

5. IANA Considerations

This document does not REQUIRE an IANA registration for ports
dedicated to the TCP testing described in this document.

6. Acknowledgments

Thanks to Lars Eggert, Al Morton, Matt Mathis, Matt Zekauskas,
Yaakov Stein, and Loki Jorgenson for many good comments and for
pointing us to great sources of information pertaining to past works
in the TCP capacity area.

7. References

7.1 Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
          Zekauskas, "A One-way Active Measurement Protocol (OWAMP)",
          RFC 4656, September 2006.

[RFC2544] Bradner, S. and J. McQuaid, "Benchmarking Methodology for
          Network Interconnect Devices", RFC 2544, June 1999.

[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
          Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
          RFC 5357, October 2008.

[RFC1191] Mogul, J. and S. Deering, "Path MTU Discovery", RFC 1191,
          November 1990.

[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
          Discovery", RFC 4821, June 2007.

[draft-ietf-ippm-btc-cap]
          Allman, M., "A Bulk Transfer Capacity Methodology for
          Cooperating Hosts", draft-ietf-ippm-btc-cap-00.txt (work in
          progress), August 2001.

[RFC2681] Almes, G., Kalidindi, S., and M. Zekauskas, "A Round-trip
          Delay Metric for IPPM", RFC 2681, September 1999.

[RFC4898] Mathis, M., Heffner, J., and R. Raghunarayan, "TCP Extended
          Statistics MIB", RFC 4898, May 2007.

[RFC5136] Chimento, P. and J. Ishac, "Defining Network Capacity",
          RFC 5136, February 2008.

[RFC1323] Jacobson, V., Braden, R., and D. Borman, "TCP Extensions
          for High Performance", RFC 1323, May 1992.

7.2 Informative References

Authors' Addresses

Barry Constantine
JDSU, Test and Measurement Division
One Milestone Center Court
Germantown, MD 20876-7100
USA

Phone: +1 240 404 2227
barry.constantine@jdsu.com

Gilles Forget
Independent Consultant to Bell Canada.
308, rue de Monaco, St-Eustache
Qc. CANADA, Postal Code: J7P-4T5

Phone: (514) 895-8212
gilles.forget@sympatico.ca

Rudiger Geib
Heinrich-Hertz-Strasse 3-7
Darmstadt, Germany, 64295

Phone: +49 6151 6282747
Ruediger.Geib@telekom.de

Reinhard Schrage
Schrage Consulting

Phone: +49 (0) 5137 909540
reinhard@schrageconsult.com