Network Working Group                                     B. Constantine
Internet-Draft                                                      JDSU
Intended status: Informational                                 G. Forget
Expires: July 31, 2011                     Bell Canada (Ext. Consultant)
                                                            Rudiger Geib
                                                        Deutsche Telekom
                                                        Reinhard Schrage
                                                      Schrage Consulting

                                                        January 31, 2011

                  Framework for TCP Throughput Testing
                  draft-ietf-ippm-tcp-throughput-tm-11.txt

Abstract

This framework describes a practical methodology for measuring end-
to-end TCP throughput in a managed IP network. The goal is to provide
a better indication of the user experience. In this framework, TCP
and IP parameters are specified and should be configured as
recommended.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on July 31, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Code Components extracted from this document must 57 include Simplified BSD License text as described in Section 4.e of 58 the Trust Legal Provisions and are provided without warranty as 59 described in the Simplified BSD License. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 64 1.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 65 1.2 Test Set-up . . . . . . . . . . . . . . . . . . . . . . . 5 66 2. Scope and Goals of this methodology. . . . . . . . . . . . . . 5 67 2.1 TCP Equilibrium. . . . . . . . . . . . . . . . . . . . . . 6 68 3. TCP Throughput Testing Methodology . . . . . . . . . . . . . . 7 69 3.1 Determine Network Path MTU . . . . . . . . . . . . . . . . 9 70 3.2. Baseline Round Trip Time and Bandwidth . . . . . . . . . . 10 71 3.2.1 Techniques to Measure Round Trip Time . . . . . . . . 11 72 3.2.2 Techniques to Measure end-to-end Bandwidth. . . . . . 12 73 3.3. TCP Throughput Tests . . . . . . . . . . . . . . . . . . . 12 74 3.3.1 Calculate minimum required TCP RWND Size. . . . . . . 12 75 3.3.2 Metrics for TCP Throughput Tests . . . . . . . . . . . 15 76 3.3.3 Conducting the TCP Throughput Tests. . . . . . . . . . 19 77 3.3.4 Single vs. Multiple TCP Connection Testing . . . . . . 19 78 3.3.5 Interpretation of the TCP Throughput Results . . . . . 20 79 3.3.6 High Performance Network Options . . . . . . . . . . . 20 80 3.4. Traffic Management Tests . . . . . . . . . . . . . . . . . 22 81 3.4.1 Traffic Shaping Tests. . . . . . . . . . . . . . . . . 23 82 3.4.1.1 Interpretation of Traffic Shaping Test Results. . . 23 83 3.4.2 AQM Tests. . . . . . . . . . . . . . . . . . . . . . . 24 84 3.4.2.1 Interpretation of AQM Results . . . . . . . . . . . 25 85 4. Security Considerations . . . . . . . . . . . . . . . . . . . 26 86 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26 87 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 26 88 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26 89 7.1 Normative References . . . . . . . . . . . . . . . . . . . 26 90 7.2 Informative References . . . . . . . . . . . . . . . . . . 27 92 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27 94 1. Introduction 96 The SLA (Service Level Agreement) provided to business class 97 customers is generally based upon Layer 2/3 criteria such as : 98 Guaranteed bandwidth, maximum network latency, maximum packet loss 99 percentage and maximum delay variation (i.e. maximum jitter). 100 Network providers are coming to the realization that Layer 2/3 101 testing is not enough to adequately ensure end-user's satisfaction. 102 In addition to Layer 2/3 performance, measuring TCP throughput 103 provides more meaningful results with respect to user experience. 105 Additionally, business class customers seek to conduct repeatable TCP 106 throughput tests between locations. Since these organizations rely on 107 the networks of the providers, a common test methodology with 108 predefined metrics would benefit both parties. 110 Note that the primary focus of this methodology is managed business 111 class IP networks; i.e. those Ethernet terminated services for which 112 organizations are provided an SLA from the network provider. Because 113 of the SLA, the expectation is that the TCP Throughput should achieve 114 the guaranteed bandwidth. End-users with "best effort" access could 115 use this methodology, but this framework and its metrics are intended 116 to be used in a predictable managed IP network. 
No end-to-end 117 performance can be guaranteed when only the access portion is being 118 provisioned to a specific bandwidth capacity. 120 The intent behind this document is to define a methodology for 121 testing sustained TCP layer performance. In this document, the 122 achievable TCP Throughput is that amount of data per unit time that 123 TCP transports when in the TCP Equilibrium state. (See section 2.1 124 for TCP Equilibrium definition). Throughout this document, maximum 125 achievable throughput refers to the theoretical achievable throughput 126 when TCP is in the Equilibrium state. 128 TCP is connection oriented and at the transmitting side it uses a 129 congestion window, (TCP CWND). At the receiving end, TCP uses a 130 receive window, (TCP RWND) to inform the transmitting end on how 131 many Bytes it is capable to accept at a given time. 133 Derived from Round Trip Time (RTT) and network path bandwidth, the 134 bandwidth delay product (BDP) determines the Send and Received Socket 135 buffers sizes required to achieve the maximum TCP throughput. Then, 136 with the help of slow start and congestion avoidance algorithms, a 137 TCP CWND is calculated based on the IP network path loss rate. 138 Finally, the minimum value between the calculated TCP CWND and the 139 TCP RWND advertised by the opposite end will determine how many Bytes 140 can actually be sent by the transmitting side at a given time. 142 Both TCP Window sizes (RWND and CWND) may vary during any given TCP 143 session, although up to bandwidth limits, larger RWND and larger CWND 144 will achieve higher throughputs by permitting more in-flight Bytes. 146 At both ends of the TCP connection and for each socket, there are 147 default buffer sizes. There are also kernel enforced maximum buffer 148 sizes. These buffer sizes can be adjusted at both ends (transmitting 149 and receiving). Some TCP/IP stack implementations use Receive Window 150 Auto-Tuning, although in order to obtain the maximum throughput it is 151 critical to use large enough TCP Send and Receive Socket Buffer 152 sizes. In fact, they should be equal to or greater than BDP. 154 Many variables are involved in TCP throughput performance, but this 155 methodology focuses on: 156 - BB (Bottleneck Bandwidth) 157 - RTT (Round Trip Time) 158 - Send and Receive Socket Buffers 159 - Minimum TCP RWND 160 - Path MTU (Maximum Transmission Unit) 161 - Path MSS (Maximum Segment Size) 163 This methodology proposes TCP testing that should be performed in 164 addition to traditional Layer 2/3 type tests. In fact, Layer 2/3 165 tests are required to verify the integrity of the network before 166 conducting TCP tests. Examples include iperf (UDP mode) and manual 167 packet layer test techniques where packet throughput, loss, and delay 168 measurements are conducted. When available, standardized testing 169 similar to [RFC2544] but adapted for use in operational networks may 170 be used. 172 Note: RFC 2544 was never meant to be used outside a lab environment. 174 Sections 2 and 3 of this document provide a general overview of the 175 proposed methodology. 177 1.1 Terminology 179 The common definitions used in this methodology are: 181 - TCP Throughput Test Device (TCP TTD), refers to compliant TCP 182 host that generates traffic and measures metrics as defined in 183 this methodology. i.e. a dedicated communications test instrument. 184 - Customer Provided Equipment (CPE), refers to customer owned 185 equipment (routers, switches, computers, etc.) 
186 - Customer Edge (CE), refers to provider owned demarcation device. 187 - Provider Edge (PE), refers to provider's distribution equipment. 188 - Bottleneck Bandwidth (BB), lowest bandwidth along the complete 189 path. Bottleneck Bandwidth and Bandwidth are used synonymously 190 in this document. Most of the time the Bottleneck Bandwidth is 191 in the access portion of the wide area network (CE - PE). 192 - Provider (P), refers to provider core network equipment. 193 - Network Under Test (NUT), refers to the tested IP network path. 194 - Round Trip Time (RTT), refers to Layer 4 back and forth delay. 196 Figure 1.1 Devices, Links and Paths 198 +----+ +----+ +----+ +----+ +---+ +---+ +----+ +----+ +----+ +----+ 199 | TCP|-| CPE|-| CE |--| PE |-| P |--| P |-| PE |--| CE |-| CPE|-| TCP| 200 | TTD| | | | |BB| | | | | | | |BB| | | | | TTD| 201 +----+ +----+ +----+ +----+ +---+ +---+ +----+ +----+ +----+ +----+ 202 <------------------------ NUT -------------------------> 203 R >-----------------------------------------------------------| 204 T | 205 T <-----------------------------------------------------------| 207 Note that the NUT may be built with of a variety of devices including 208 but not limited to, load balancers, proxy servers or WAN acceleration 209 appliances. The detailed topology of the NUT should be well known 210 when conducting the TCP throughput tests, although this methodology 211 makes no attempt to characterize specific network architectures. 213 1.2 Test Set-up 215 This methodology is intended for operational and managed IP networks. 216 A multitude of network architectures and topologies can be tested. 217 The above diagram is very general and is only there to illustrate 218 typical segmentation within end-user and network provider domains. 220 2. Scope and Goals of this Methodology 222 Before defining the goals, it is important to clearly define the 223 areas that are out-of-scope. 225 - This methodology is not intended to predict the TCP throughput 226 during the transient stages of a TCP connection, such as during the 227 initial slow start phase. 229 - This methodology is not intended to definitively benchmark TCP 230 implementations of one OS to another, although some users may find 231 value in conducting qualitative experiments. 233 - This methodology is not intended to provide detailed diagnosis 234 of problems within end-points or within the network itself as 235 related to non-optimal TCP performance, although a results 236 interpretation section for each test step may provide insights to 237 potential issues. 239 - This methodology does not propose to operate permanently with high 240 measurement loads. TCP performance and optimization within 241 operational networks may be captured and evaluated by using data 242 from the "TCP Extended Statistics MIB" [RFC4898]. 244 - This methodology is not intended to measure TCP throughput as part 245 of an SLA, or to compare the TCP performance between service 246 providers or to compare between implementations of this methodology 247 in dedicated communications test instruments. 249 In contrast to the above exclusions, the primary goal is to define a 250 method to conduct a practical end-to-end assessment of sustained 251 TCP performance within a managed business class IP network. Another 252 key goal is to establish a set of "best practices" that a non-TCP 253 expert should apply when validating the ability of a managed IP 254 network to carry end-user TCP applications. 
256 Specific goals are to : 258 - Provide a practical test approach that specifies tunable parameters 259 (such as MSS (Maximum Segment Size) and Socket Buffer sizes) and how 260 these affect the outcome of TCP performances over an IP network. 261 See section 3.3.3. 263 - Provide specific test conditions like link speed, RTT, MSS, Socket 264 Buffer sizes and achievable TCP throughput when TCP is in the 265 Equilibrium state. For guideline purposes, provide examples of 266 test conditions and their maximum achievable TCP throughput. 267 Section 2.1 provides specific details concerning the definition of 268 TCP Equilibrium within this methodology while section 3 provides 269 specific test conditions with examples. 271 - Define three (3) basic metrics to compare the performance of TCP 272 connections under various network conditions. See section 3.3.2. 274 - In test situations where the recommended procedure does not yield 275 the maximum achievable TCP throughput, this methodology provides 276 some possible areas within the end host or the network that should 277 be considered for investigation. Although again, this methodology 278 is not intended to provide detailed diagnosis on these issues. 279 See section 3.3.5. 281 2.1 TCP Equilibrium 283 TCP connections have three (3) fundamental congestion window phases: 285 1 - The Slow Start phase, which occurs at the beginning of a TCP 286 transmission or after a retransmission time out. 288 2 - The Congestion Avoidance phase, during which TCP ramps up to 289 establish the maximum achievable throughput. It is important to note 290 that retransmissions are a natural by-product of the TCP congestion 291 avoidance algorithm as it seeks to achieve maximum throughput. 293 3 - The Loss Recovery phase, which could include Fast Retransmit 294 (Tahoe) or Fast Recovery (Reno & New Reno). When packet loss occurs, 295 Congestion Avoidance phase transitions either to Fast Retransmission 296 or Fast Recovery depending upon the TCP implementation. If a Time-Out 297 occurs, TCP transitions back to the Slow Start phase. 299 The following diagram depicts these 3 phases. 301 Figure 2.1 TCP CWND Phases 303 /\ | TCP 304 /\ | Equilibrium 305 /\ |High ssthresh TCP CWND 306 /\ |Loss Event * halving 3-Loss Recovery 307 /\ | * \ upon loss Adjusted 308 /\ | * \ / \ Time-Out ssthresh 309 /\ | * \ / \ +--------+ * 310 /\ | * \/ \ / Multiple| * 311 /\ | * 2-Congestion\ / Loss | * 312 /\ | * Avoidance \/ Event | * 313 TCP | * Half | * 314 Through- | * TCP CWND | * 1-Slow Start 315 put | * 1-Slow Start Min TCP CWND after T-O 316 +----------------------------------------------------------- 317 Time > > > > > > > > > > > > > > > > > > > > > > > > > > > 319 Note : ssthresh = Slow Start threshold. 321 A well tuned and managed IP network with appropriate TCP adjustments 322 in the IP hosts and applications should perform very close to the 323 BB (Bottleneck Bandwidth) when TCP is in the Equilibrium state. 325 This TCP methodology provides guidelines to measure the maximum 326 achievable TCP throughput when TCP is in the Equilibrium state. 327 All maximum achievable TCP throughputs specified in section 3 are 328 with respect to this condition. 330 It is important to clarify the interaction between the sender's Send 331 Socket Buffer and the receiver's advertised TCP RWND Size. TCP test 332 programs such as iperf, ttcp, etc. allows the sender to control the 333 quantity of TCP Bytes transmitted and unacknowledged (in-flight), 334 commonly referred to as the Send Socket Buffer. 
This is done 335 independently of the TCP RWND Size advertised by the receiver. 336 Implications to the capabilities of the Throughput Test Device (TTD) 337 are covered at the end of section 3. 339 3. TCP Throughput Testing Methodology 341 As stated earlier in section 1, it is considered best practice to 342 verify the integrity of the network by conducting Layer 2/3 tests 343 such as [RFC2544] or other methods of network stress tests. 344 Although, it is important to mention here that RFC 2544 was never 345 meant to be used outside a lab environment. 347 If the network is not performing properly in terms of packet loss, 348 jitter, etc. then the TCP layer testing will not be meaningful. A 349 dysfunctional network will not achieve optimal TCP throughputs in 350 regards with the available bandwidth. 352 TCP Throughput testing may require cooperation between the end-user 353 customer and the network provider. As an example, in an MPLS (Multi- 354 Protocol Label Switching) network architecture, the testing should be 355 conducted either on the CPE or on the CE device and not on the PE 356 (Provider Edge) router. 358 The following represents the sequential order of steps for this 359 testing methodology: 361 1. Identify the Path MTU. Packetization Layer Path MTU Discovery 362 or PLPMTUD, [RFC4821], MUST be conducted to verify the network path 363 MTU. Conducting PLPMTUD establishes the upper limit for the MSS to 364 be used in subsequent steps. 366 2. Baseline Round Trip Time and Bandwidth. This step establishes the 367 inherent, non-congested Round Trip Time (RTT) and the Bottleneck 368 Bandwidth of the end-to-end network path. These measurements are 369 used to provide estimates of the TCP RWND and Send Socket Buffer 370 Sizes that SHOULD be used during subsequent test steps. These 371 measurements refers to [RFC2681] and [RFC4898] in order to measure 372 RTD and associated RTT. 374 3. TCP Connection Throughput Tests. With baseline measurements 375 of Round Trip Time and Bottleneck Bandwidth, single and multiple TCP 376 connection throughput tests SHOULD be conducted to baseline network 377 performances. 379 4. Traffic Management Tests. Various traffic management and queuing 380 techniques can be tested in this step, using multiple TCP 381 connections. Multiple connections testing should verify that the 382 network is configured properly for traffic shaping versus policing 383 and that Active Queue Management implementations are used. 385 Important to note are some of the key characteristics and 386 considerations for the TCP test instrument. The test host may be a 387 standard computer or a dedicated communications test instrument. 388 In both cases, it must be capable of emulating both a client and a 389 server. 391 The following criteria should be considered when selecting whether 392 the TCP test host can be a standard computer or has to be a dedicated 393 communications test instrument: 395 - TCP implementation used by the test host, OS version, i.e. LINUX OS 396 kernel using TCP New Reno, TCP options supported, etc. These will 397 obviously be more important when using dedicated communications test 398 instruments where the TCP implementation may be customized or tuned 399 to run in higher performance hardware. When a compliant TCP TTD is 400 used, the TCP implementation MUST be identified in the test results. 401 The compliant TCP TTD should be usable for complete end-to-end 402 testing through network security elements and should also be usable 403 for testing network sections. 
405 - More important, the TCP test host MUST be capable to generate 406 and receive stateful TCP test traffic at the full link speed of the 407 network under test. Stateful TCP test traffic means that the test 408 host MUST fully implement a TCP/IP stack; this is generally a comment 409 aimed at dedicated communications test equipments which sometimes 410 "blast" packets with TCP headers. As a general rule of thumb, testing 411 TCP throughput at rates greater than 100 Mbit/sec MAY require high 412 performance server hardware or dedicated hardware based test tools. 414 - A compliant TCP Throughput Test Device MUST allow adjusting both 415 Send and Receive Socket Buffer sizes. The Socket Buffers MUST be 416 large enough to fill the BDP. 418 - Measuring RTT and retransmissions per connection will generally 419 require a dedicated communications test instrument. In the absence of 420 dedicated hardware based test tools, these measurements may need to 421 be conducted with packet capture tools, i.e. conduct TCP throughput 422 tests and analyze RTT and retransmissions in packet captures. 423 Another option may be to use "TCP Extended Statistics MIB" per 424 [RFC4898]. 426 - The RFC4821 PLPMTUD test SHOULD be conducted with a dedicated 427 tester which exposes the ability to run the PLPMTUD algorithm 428 independently from the OS stack. 430 3.1. Determine Network Path MTU 432 TCP implementations should use Path MTU Discovery techniques (PMTUD). 433 PMTUD relies on ICMP 'need to frag' messages to learn the path MTU. 434 When a device has a packet to send which has the Don't Fragment (DF) 435 bit in the IP header set and the packet is larger than the Maximum 436 Transmission Unit (MTU) of the next hop, the packet is dropped and 437 the device sends an ICMP 'need to frag' message back to the host that 438 originated the packet. The ICMP 'need to frag' message includes 439 the next hop MTU which PMTUD uses to tune the TCP Maximum Segment 440 Size (MSS). Unfortunately, because many network managers completely 441 disable ICMP, this technique does not always prove reliable. 443 Packetization Layer Path MTU Discovery or PLPMTUD [RFC4821] MUST then 444 be conducted to verify the network path MTU. PLPMTUD can be used 445 with or without ICMP. The following sections provide a summary of the 446 PLPMTUD approach and an example using TCP. [RFC4821] specifies a 447 search_high and a search_low parameter for the MTU. As specified in 448 [RFC4821], 1024 Bytes is a safe value for search_low in modern 449 networks. 451 It is important to determine the links overhead along the IP path, 452 and then to select a TCP MSS size corresponding to the Layer 3 MTU. 453 For example, if the MTU is 1024 Bytes and the TCP/IP headers are 40 454 Bytes, (20 for IP + 20 for TCP) then the MSS would be 984 Bytes. 456 An example scenario is a network where the actual path MTU is 1240 457 Bytes. The TCP client probe MUST be capable of setting the MSS for 458 the probe packets and could start at MSS = 984 (which corresponds 459 to an MTU size of 1024 Bytes). 461 The TCP client probe would open a TCP connection and advertise the 462 MSS as 984. Note that the client probe MUST generate these packets 463 with the DF bit set. The TCP client probe then sends test traffic 464 per a small default Send Socket Buffer size of ~8KBytes. It should 465 be kept small to minimize the possibility of congesting the network, 466 which may induce packet loss. 
The duration of the test should also be short (10-30 seconds), again
to minimize congestive effects during the test.

In the example of a 1240 Bytes path MTU, probing with an MSS equal to
984 would yield a successful probe and the test client packets would
be successfully transferred to the test server.

Also note that the test client MUST verify that the advertised MSS is
indeed negotiated. Network devices with built-in Layer 4 capabilities
can intercede during the connection establishment and reduce the
advertised MSS to avoid fragmentation. This is certainly a desirable
feature from a network perspective, but it can yield erroneous test
results if the client test probe does not confirm the negotiated MSS.

The next test probe would use the search_high value and it would be
set to an MSS of 1460 in order to produce a 1500 Bytes MTU. In this
example, the test client will retransmit based upon time-outs, since
no ACKs will be received from the test server. This test probe is
marked as a conclusive failure if none of the test packets are
ACK'ed. If any of the test packets are ACK'ed, network congestion may
be the cause and the test probe is not conclusive. Re-testing at
another time is recommended to further isolate the cause.

The test is repeated until the desired granularity of the MTU is
discovered. The method can yield precise results at the expense of
probing time. One approach may be to set the next probe size halfway
between the unsuccessful search_high and the successful search_low
values, and to keep halving the remaining interval when seeking the
upper limit.

3.2. Baseline Round Trip Time and Bandwidth

Before stateful TCP testing can begin, it is important to determine
the baseline Round Trip Time (i.e. non-congested inherent delay) and
Bottleneck Bandwidth of the end-to-end network to be tested. These
measurements are used to calculate the BDP and to provide estimates
of the TCP RWND and Send Socket Buffer Sizes that SHOULD be used in
subsequent test steps.

3.2.1 Techniques to Measure Round Trip Time

Following the definitions used in section 1.1, Round Trip Time (RTT)
is the elapsed time between the clocking in of the first bit of a
payload sent packet and the receipt of the last bit of the
corresponding Acknowledgment. Round Trip Delay (RTD) is used
synonymously with twice the Link Latency. RTT measurements SHOULD use
techniques defined in [RFC2681] or statistics available from MIBs
defined in [RFC4898].

The RTT SHOULD be baselined during off-peak hours in order to obtain
a reliable figure of the inherent network latency. Otherwise,
additional delay caused by network buffering can occur. Also, when
sampling RTT values over a given test interval, the minimum measured
value SHOULD be used as the baseline RTT. This will most closely
estimate the real inherent RTT. This value is also used to determine
the Buffer Delay Percentage metric defined in Section 3.3.2.

The following list is not meant to be exhaustive, although it
summarizes some of the most common ways to determine Round Trip Time.
The desired measurement precision (i.e. msec versus usec) may dictate
whether the RTT measurement can be achieved with ICMP pings or by a
dedicated communications test instrument with precision timers. The
objective in this section is to list several techniques in order of
decreasing accuracy.
533 - Use test equipment on each end of the network, "looping" the 534 far-end tester so that a packet stream can be measured back and forth 535 from end-to-end. This RTT measurement may be compatible with delay 536 measurement protocols specified in [RFC5357]. 538 - Conduct packet captures of TCP test sessions using "iperf" or FTP, 539 or other TCP test applications. By running multiple experiments, 540 packet captures can then be analyzed to estimate RTT. It is 541 important to note that results based upon the SYN -> SYN-ACK at the 542 beginning of TCP sessions should be avoided since Firewalls might 543 slow down 3 way handshakes. Also, at the senders side, Ostermann's 544 LINUX TCPTRACE utility with -l -r arguments can be used to extract 545 the RTT results directly from the packet captures. 547 - ICMP pings may also be adequate to provide Round Trip Time 548 estimates, provided that the packet size is factored into the 549 estimates (i.e. pings with different packet sizes might be required). 550 Some limitations with ICMP Ping may include msec resolution and 551 whether the network elements are responding to pings or not. Also, 552 ICMP is often rate-limited or segregated into different buffer 553 queues. ICMP might not work if QoS (Quality of Service) 554 reclassification is done at any hop. ICMP is not as reliable and 555 accurate as in-band measurements. 557 3.2.2 Techniques to Measure end-to-end Bandwidth 559 Before any TCP Throughput test can be conducted, bandwidth 560 measurement tests MUST be run with stateless IP streams (i.e. not 561 stateful TCP) in order to determine the available path bandwidth. 562 These measurements SHOULD be conducted in both directions, 563 especially in asymmetrical access networks (e.g. ADSL access). 564 These tests should obviously be performed at various intervals 565 throughout a business day or even across a week. Ideally, the 566 bandwidth tests should produce logged outputs of the achieved 567 bandwidths across the complete test duration. 569 There are many well established techniques available to provide 570 estimated measures of bandwidth over a network. It is a common 571 practice for network providers to conduct Layer 2/3 bandwidth 572 capacity tests using [RFC2544], although it is understood that 573 [RFC2544] was never meant to be used outside a lab environment. 574 Ideally, these bandwidth measurements SHOULD use network capacity 575 techniques as defined in [RFC5136]. 577 3.3. TCP Throughput Tests 579 This methodology specifically defines TCP throughput techniques to 580 verify maximum achievable TCP performance in a managed business 581 class IP network, as defined in section 2.1. This document defines 582 a method to conduct these maximum achievable TCP throughput tests 583 as well as guidelines on the predicted results. 585 With baseline measurements of Round Trip Time and bandwidth from 586 section 3.2, a series of single and multiple TCP connection 587 throughput tests SHOULD be conducted in order to measure network 588 performance against expectations. The number of trials and the type 589 of testing (i.e. single versus multiple connections) will vary 590 according to the intention of the test. One example would be a 591 single connection test in which the throughput achieved by large 592 Send and Receive Socket Buffer sizes (i.e. 256KB) is to be measured. 593 It would be advisable to test at various times of the business day. 
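As an informal illustration of such a single-connection test, the
following Python sketch opens one TCP connection with explicitly
configured Send and Receive Socket Buffer sizes (256 KBytes here) and
measures the achieved throughput of a bulk transfer. This is only a
sketch under stated assumptions, not a compliant TCP TTD: the
addresses, port, payload and buffer values are illustrative
placeholders, and a real test would also record the metrics defined
in section 3.3.2.

   import socket
   import time

   BUFFER = 256 * 1024          # Send/Receive Socket Buffer size (example)
   BLOCK = 64 * 1024            # application write/read size (example)
   PAYLOAD = 100 * 1024 * 1024  # 100 MBytes of test data (example)

   def receiver(bind_addr=("0.0.0.0", 5001)):
       # The Receive Socket Buffer is set before listen() so that the
       # advertised TCP RWND (and window scaling) reflects it.
       srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
       srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER)
       srv.bind(bind_addr)
       srv.listen(1)
       conn, _ = srv.accept()
       received = 0
       while True:
           data = conn.recv(BLOCK)
           if not data:
               break
           received += len(data)
       conn.close()
       return received

   def sender(server_addr):
       # e.g. sender(("192.0.2.10", 5001)) on the opposite end of the NUT
       s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
       s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER)
       s.connect(server_addr)
       block = b"\x00" * BLOCK
       sent = 0
       start = time.time()
       while sent < PAYLOAD:
           s.sendall(block)
           sent += len(block)
       s.close()
       elapsed = time.time() - start
       print("Achieved TCP Throughput: %.1f Mbps" % (sent * 8 / elapsed / 1e6))

In practice a tool such as iperf or a dedicated communications test
instrument would be used instead; the sketch only illustrates that
both Socket Buffer sizes are tuned before the connection is
established and that the transfer runs long enough to reach the TCP
Equilibrium state.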
595 It is RECOMMENDED to run the tests in each direction independently 596 first, then run both directions simultaneously. In each case, the 597 TCP Transfer Time, TCP Efficiency, and Buffer Delay Percentage 598 metrics MUST be measured in each direction. These metrics are 599 defined in 3.3.2. 601 3.3.1 Calculate minimum required TCP RWND Size 603 The minimum required TCP RWND Size can be calculated from the 604 bandwidth delay product (BDP), which is: 606 BDP (bits) = RTT (sec) x Bandwidth (bps) 608 Note that the RTT is being used as the "Delay" variable in the 609 BDP calculations. 611 Then, by dividing the BDP by 8, we obtain the minimum required TCP 612 RWND Size in Bytes. For optimal results, the Send Socket Buffer size 613 must be adjusted to the same value at the opposite end of the network 614 path. 616 Minimum required TCP RWND = BDP / 8 618 An example would be a T3 link with 25 msec RTT. The BDP would equal 619 ~1,105,000 bits and the minimum required TCP RWND would be ~138 620 KBytes. 622 Note that separate calculations are required on asymmetrical paths. 623 An asymmetrical path example would be a 90 msec RTT ADSL line with 624 5Mbps downstream and 640Kbps upstream. The downstream BDP would equal 625 ~450,000 bits while the upstream one would be only ~57,600 bits. 627 The following table provides some representative network Link Speeds, 628 RTT, BDP, and their associated minimum required TCP RWND Sizes. 630 Table 3.3.1: Link Speed, RTT, calculated BDP & minimum TCP RWND 632 Link Minimum required 633 Speed* RTT BDP TCP RWND 634 (Mbps) (ms) (bits) (KBytes) 635 --------------------------------------------------------------------- 636 1.536 20 30,720 3.84 637 1.536 50 76,800 9.60 638 1.536 100 153,600 19.20 639 44.210 10 442,100 55.26 640 44.210 15 663,150 82.89 641 44.210 25 1,105,250 138.16 642 100 1 100,000 12.50 643 100 2 200,000 25.00 644 100 5 500,000 62.50 645 1,000 0.1 100,000 12.50 646 1,000 0.5 500,000 62.50 647 1,000 1 1,000,000 125.00 648 10,000 0.05 500,000 62.50 649 10,000 0.3 3,000,000 375.00 651 * Note that link speed is the Bottleneck Bandwidth (BB) for the NUT 653 The following serial link speeds are used: 654 - T1 = 1.536 Mbits/sec (for a B8ZS line encoding facility) 655 - T3 = 44.21 Mbits/sec (for a C-Bit Framing facility) 657 The above table illustrates the minimum required TCP RWND. 658 If a smaller TCP RWND Size is used, then the TCP Throughput 659 can not be optimal. To calculate the TCP Throughput, the following 660 formula is used: TCP Throughput = TCP RWND X 8 / RTT 661 An example could be a 100 Mbps IP path with 5 ms RTT and a TCP RWND 662 of 16KB, then: 664 TCP Throughput = 16 KBytes X 8 bits / 5 ms. 665 TCP Throughput = 128,000 bits / 0.005 sec. 666 TCP Throughput = 25.6 Mbps. 668 Another example for a T3 using the same calculation formula is 669 illustrated on the next page: 671 TCP Throughput = 16 KBytes X 8 bits / 10 ms. 672 TCP Throughput = 128,000 bits / 0.01 sec. 673 TCP Throughput = 12.8 Mbps. 675 When the TCP RWND Size exceeds the BDP (T3 link and 64 KBytes TCP 676 RWND on a 10 ms RTT path), the maximum frames per second limit of 677 3664 is reached and then the formula is: 679 TCP Throughput = Max FPS X MSS X 8. 680 TCP Throughput = 3664 FPS X 1460 Bytes X 8 bits. 681 TCP Throughput = 42.8 Mbps 683 The following diagram compares achievable TCP throughputs on a T3 684 with Send Socket Buffer & TCP RWND Sizes of 16KB vs. 64KB. 
686 Figure 3.3.1a TCP Throughputs on a T3 at different RTTs 688 45| 689 | _______42.8M 690 40| |64KB | 691 TCP | | | 692 Throughput 35| | | 693 in Mbps | | | +-----+34.1M 694 30| | | |64KB | 695 | | | | | 696 25| | | | | 697 | | | | | 698 20| | | | | _______20.5M 699 | | | | | |64KB | 700 15| | | | | | | 701 |12.8M+-----| | | | | | 702 10| |16KB | | | | | | 703 | | | |8.5M+-----| | | | 704 5| | | | |16KB | |5.1M+-----| | 705 |_____|_____|_____|____|_____|_____|____|16KB |_____|_____ 706 10 15 25 707 RTT in milliseconds 709 The following diagram shows the achievable TCP throughput on a 25ms 710 T3 when Send Socket Buffer & TCP RWND Sizes are increased. 712 Figure 3.3.1b TCP Throughputs on a T3 with different TCP RWND 714 45| 715 | 716 40| +-----+40.9M 717 TCP | | | 718 Throughput 35| | | 719 in Mbps | | | 720 30| | | 721 | | | 722 25| | | 723 | | | 724 20| +-----+20.5M | | 725 | | | | | 726 15| | | | | 727 | | | | | 728 10| +-----+10.2M | | | | 729 | | | | | | | 730 5| +-----+5.1M | | | | | | 731 |_____|_____|______|_____|______|_____|_______|_____|_____ 732 16 32 64 128* 733 TCP RWND Size in KBytes 735 * Note that 128KB requires [RFC1323] TCP Window scaling option. 737 3.3.2 Metrics for TCP Throughput Tests 739 This framework focuses on a TCP throughput methodology and also 740 provides several basic metrics to compare results between various 741 throughput tests. It is recognized that the complexity and 742 unpredictability of TCP makes it impossible to develop a complete 743 set of metrics that accounts for the myriad of variables (i.e. RTT 744 variation, loss conditions, TCP implementation, etc.). However, 745 these basic metrics will facilitate TCP throughput comparisons 746 under varying network conditions and between network traffic 747 management techniques. 749 The first metric is the TCP Transfer Time, which is simply the 750 measured time required to transfer a block of data across 751 simultaneous TCP connections. This concept is useful when 752 benchmarking traffic management techniques and when multiple 753 TCP connections are required. 755 TCP Transfer time may also be used to provide a normalized ratio of 756 the actual TCP Transfer Time versus the Ideal Transfer Time. This 757 ratio is called the TCP Transfer Index and is defined as: 759 Actual TCP Transfer Time 760 ------------------------- 761 Ideal TCP Transfer Time 763 The Ideal TCP Transfer time is derived from the network path 764 Bottleneck Bandwidth and Layer 1/2/3/4 overheads associated with the 765 network path. Additionally, both the TCP RWND and the Send Socket 766 Buffer Sizes must be tuned to equal or exceed the bandwidth delay 767 product (BDP) as described in section 3.3.1. 769 The following table illustrates the Ideal TCP Transfer time of a 770 single TCP connection when its TCP RWND and Send Socket Buffer Sizes 771 equals or exceeds the BDP. 773 Table 3.3.2: Link Speed, RTT, BDP, TCP Throughput, and 774 Ideal TCP Transfer time for a 100 MB File 776 Link Maximum Ideal TCP 777 Speed BDP Achievable TCP Transfer time 778 (Mbps) RTT (ms) (KBytes) Throughput(Mbps) (seconds) 779 -------------------------------------------------------------------- 780 1.536 50 9.6 1.4 571 781 44.21 25 138.2 42.8 18 782 100 2 25.0 94.9 9 783 1,000 1 125.0 949.2 1 784 10,000 0.05 62.5 9,492 0.1 786 Transfer times are rounded for simplicity. 
788 For a 100MB file(100 x 8 = 800 Mbits), the Ideal TCP Transfer Time 789 is derived as follows: 791 800 Mbits 792 Ideal TCP Transfer Time = ----------------------------------- 793 Maximum Achievable TCP Throughput 795 The maximum achievable layer 2 throughput on T1 and T3 Interfaces 796 is based on the maximum frames per second (FPS) permitted by the 797 actual layer 1 speed with an MTU of 1500 Bytes. 799 The maximum FPS for a T1 is 127 and the calculation formula is: 800 FPS = T1 Link Speed / ((MTU + PPP + Flags + CRC16) X 8) 801 FPS = (1.536M /((1500 Bytes + 4 Bytes + 2 Bytes + 2 Bytes) X 8 ))) 802 FPS = (1.536M / (1508 Bytes X 8)) 803 FPS = 1.536 Mbps / 12064 bits 804 FPS = 127 806 The maximum FPS for a T3 is 3664 and the calculation formula is: 807 FPS = T3 Link Speed / ((MTU + PPP + Flags + CRC16) X 8) 808 FPS = (44.21M /((1500 Bytes + 4 Bytes + 2 Bytes + 2 Bytes) X 8 ))) 809 FPS = (44.21M / (1508 Bytes X 8)) 810 FPS = 44.21 Mbps / 12064 bits 811 FPS = 3664 812 The 1508 equates to: 814 MTU + PPP + Flags + CRC16 816 Where the MTU is 1500 Bytes, PPP is 4 Bytes, the 2 Flags are 1 Byte 817 each and the CRC16 is 2 Bytes. 819 Then, to obtain the Maximum Achievable TCP Throughput (layer 4), we 820 simply use: MSS in Bytes X 8 bits X max FPS. 821 For a T3, the maximum TCP Throughput = 1460 Bytes X 8 bits X 3664 FPS 822 Maximum TCP Throughput = 11680 bits X 3664 FPS 823 Maximum TCP Throughput = 42.8 Mbps. 825 The maximum achievable layer 2 throughput on Ethernet Interfaces is 826 based on the maximum frames per second permitted by the IEEE802.3 827 standard when the MTU is 1500 Bytes. 829 The maximum FPS for 100M Ethernet is 8127 and the calculation is: 830 FPS = (100Mbps /(1538 Bytes X 8 bits)) 832 The maximum FPS for GigE is 81274 and the calculation formula is: 833 FPS = (1Gbps /(1538 Bytes X 8 bits)) 835 The maximum FPS for 10GigE is 812743 and the calculation formula is: 836 FPS = (10Gbps /(1538 Bytes X 8 bits)) 838 The 1538 equates to: 840 MTU + Eth + CRC32 + IFG + Preamble + SFD 841 (IFG = Inter-Frame Gap and SFD = Start of Frame Delimiter) 842 Where MTU is 1500 Bytes, Ethernet is 14 Bytes, CRC32 is 4 Bytes, 843 IFG is 12 Bytes, Preamble is 7 Bytes and SFD is 1 Byte. 845 Note that better results could be obtained with jumbo frames on 846 GigE and 10 GigE. 848 Then, to obtain the Maximum Achievable TCP Throughput (layer 4), we 849 simply use: MSS in Bytes X 8 bits X max FPS. 850 For a 100M, the maximum TCP Throughput = 1460 B X 8 bits X 8127 FPS 851 Maximum TCP Throughput = 11680 bits X 8127 FPS 852 Maximum TCP Throughput = 94.9 Mbps. 854 To illustrate the TCP Transfer Time Index, an example would be the 855 bulk transfer of 100 MB over 5 simultaneous TCP connections (each 856 connection transferring 100 MB). In this example, the Ethernet 857 service provides a Committed Access Rate (CAR) of 500 Mbit/s. Each 858 connection may achieve different throughputs during a test and the 859 overall throughput rate is not always easy to determine (especially 860 as the number of connections increases). 862 The ideal TCP Transfer Time would be ~8 seconds, but in this example, 863 the actual TCP Transfer Time was 12 seconds. The TCP Transfer Index 864 would then be 12/8 = 1.5, which indicates that the transfer across 865 all connections took 1.5 times longer than the ideal. 
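The arithmetic above can be collected into a few helper functions.
The following Python sketch simply restates the formulas of this
section; the function names and the overhead constants are
illustrative conveniences, not normative values.

   MTU = 1500                  # Bytes
   MSS = MTU - 40              # 1460 Bytes (20 IP + 20 TCP, no options)
   ETH_OVERHEAD = 38           # Eth(14) + CRC32(4) + IFG(12) + Preamble(7) + SFD(1)
   PPP_OVERHEAD = 8            # PPP(4) + 2 Flags(2) + CRC16(2), per the T1/T3 examples

   def max_fps(link_speed_bps, l2_overhead):
       # Maximum frames per second permitted by the layer 1/2 rate.
       return link_speed_bps / ((MTU + l2_overhead) * 8.0)

   def max_tcp_throughput_bps(link_speed_bps, l2_overhead):
       # Maximum Achievable TCP Throughput = MSS x 8 bits x max FPS.
       return MSS * 8 * max_fps(link_speed_bps, l2_overhead)

   def ideal_transfer_time_s(file_bytes, link_speed_bps, l2_overhead):
       # Ideal TCP Transfer Time for a file of the given size.
       return file_bytes * 8.0 / max_tcp_throughput_bps(link_speed_bps, l2_overhead)

   def tcp_transfer_time_index(actual_s, ideal_s):
       # Ratio of actual to Ideal TCP Transfer Time.
       return actual_s / ideal_s

   # T3: ~3664 FPS and ~42.8 Mbps, as derived above.
   print(round(max_fps(44.21e6, PPP_OVERHEAD)),
         round(max_tcp_throughput_bps(44.21e6, PPP_OVERHEAD) / 1e6, 1))
   # 100M Ethernet: ~8127 FPS and ~94.9 Mbps.
   print(round(max_fps(100e6, ETH_OVERHEAD)),
         round(max_tcp_throughput_bps(100e6, ETH_OVERHEAD) / 1e6, 1))
   # 100 MB over 5 connections on the 500 Mbit/s CAR example:
   # ideal ~8 seconds, actual 12 seconds.
   print(tcp_transfer_time_index(12, 8))    # 1.5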
867 The second metric is TCP Efficiency, which is the percentage of Bytes 868 that were not retransmitted and is defined as: 870 Transmitted Bytes - Retransmitted Bytes 871 --------------------------------------- x 100 872 Transmitted Bytes 874 Transmitted Bytes are the total number of TCP Bytes to be transmitted 875 including the original and the retransmitted Bytes. This metric 876 provides comparative results between various traffic management and 877 congestion avoidance mechanisms. Performance between different TCP 878 implementations could also be compared. (e.g. Reno, Vegas, etc). 880 As an example, if 100,000 Bytes were sent and 2,000 had to be 881 retransmitted, the TCP Efficiency should be calculated as: 883 102,000 - 2,000 884 ---------------- x 100 = 98.03% 885 102,000 887 Note that the Retransmitted Bytes may have occurred more than once, 888 if so, then these multiple retransmissions are added to the 889 Retransmitted Bytes and to the Transmitted Bytes counts. 891 The third metric is the Buffer Delay Percentage, which represents the 892 increase in RTT during a TCP throughput test versus the inherent or 893 baseline RTT. The baseline RTT is the Round Trip Time inherent to 894 the network path under non-congested conditions. 895 (See 3.2.1 for details concerning the baseline RTT measurements). 897 The Buffer Delay Percentage is defined as: 899 Average RTT during Transfer - Baseline RTT 900 ------------------------------------------ x 100 901 Baseline RTT 903 As an example, consider a network path with a baseline RTT of 25 904 msec. During the course of a TCP transfer, the average RTT across 905 the entire transfer increases to 32 msec. Then, the Buffer Delay 906 Percentage would be calculated as: 908 32 - 25 909 ------- x 100 = 28% 910 25 912 Note that the TCP Transfer Time, TCP Efficiency, and Buffer Delay 913 Percentage MUST be measured during each throughput test. Poor TCP 914 Transfer Time Indexes (TCP Transfer Time greater than Ideal TCP 915 Transfer Times) may be diagnosed by correlating with sub-optimal TCP 916 Efficiency and/or Buffer Delay Percentage metrics. 918 3.3.3 Conducting the TCP Throughput Tests 920 Several TCP tools are currently used in the network world and one of 921 the most common is "iperf". With this tool, hosts are installed at 922 each end of the network path; one acts as client and the other as 923 a server. The Send Socket Buffer and the TCP RWND Sizes of both 924 client and server can be manually set. The achieved throughput can 925 then be measured, either uni-directionally or bi-directionally. For 926 higher BDP situations in lossy networks (long fat networks or 927 satellite links, etc.), TCP options such as Selective Acknowledgment 928 SHOULD be considered and become part of the window size / throughput 929 characterization. 931 Host hardware performance must be well understood before conducting 932 the tests described in the following sections. A dedicated 933 communications test instrument will generally be required, especially 934 for line rates of GigE and 10 GigE. A compliant TCP TTD SHOULD 935 provide a warning message when the expected test throughput will 936 exceed 10% of the network bandwidth capacity. If the throughput test 937 is expected to exceed 10% of the provider bandwidth, then the test 938 should be coordinated with the network provider. This does not 939 include the customer premise bandwidth, the 10% refers directly to 940 the provider's bandwidth (Provider Edge to Provider router). 
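When conducting the tests, the three metrics defined in section 3.3.2
MUST be measured in each direction. For reference, the short Python
sketch below restates how they are computed from quantities that most
test tools or packet captures expose; the variable names are
illustrative only.

   def tcp_transfer_time_index(actual_transfer_s, ideal_transfer_s):
       # First metric: ratio of actual to Ideal TCP Transfer Time.
       return actual_transfer_s / ideal_transfer_s

   def tcp_efficiency_pct(transmitted_bytes, retransmitted_bytes):
       # Second metric: percentage of Bytes not retransmitted. Note that
       # transmitted_bytes includes original plus all retransmitted Bytes.
       return (transmitted_bytes - retransmitted_bytes) * 100.0 / transmitted_bytes

   def buffer_delay_pct(avg_rtt_during_transfer, baseline_rtt):
       # Third metric: increase of the average RTT during the transfer
       # relative to the baseline (inherent, non-congested) RTT.
       return (avg_rtt_during_transfer - baseline_rtt) * 100.0 / baseline_rtt

   # Examples from section 3.3.2:
   print(tcp_efficiency_pct(102000, 2000))   # about 98%
   print(buffer_delay_pct(32, 25))           # 28.0%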
942 The TCP throughput test should be run over a long enough duration 943 to properly exercise network buffers (i.e. greater than 30 seconds) 944 and should also characterize performance at different times of day. 946 3.3.4 Single vs. Multiple TCP Connection Testing 948 The decision whether to conduct single or multiple TCP connection 949 tests depends upon the size of the BDP in relation to the TCP RWND 950 configured in the end-user environment. For example, if the BDP for 951 a long fat network turns out to be 2MB, then it is probably more 952 realistic to test this network path with multiple connections. 953 Assuming typical host computer TCP RWND Sizes of 64 KB (i.e. Windows 954 XP), using 32 TCP connections would emulate a typical small office 955 scenario. 957 The following table is provided to illustrate the relationship 958 between the TCP RWND and the number of TCP connections required to 959 fill the available capacity of a given BDP. For this example, the 960 network bandwidth is 500 Mbps and the RTT is 5 ms, then the BDP 961 equates to 312.5 KBytes. 963 Table 3.3.4 Number of TCP connections versus TCP RWND 965 Number of TCP Connections 966 TCP RWND to fill available bandwidth 967 ------------------------------------- 968 16KB 20 969 32KB 10 970 64KB 5 971 128KB 3 973 The TCP Transfer Time metric is useful for conducting multiple 974 connection tests. Each connection should be configured to transfer 975 payloads of the same size (i.e. 100 MB), and the TCP Transfer time 976 provides a simple metric to verify the actual versus expected 977 results. 979 Note that the TCP transfer time is the time for all connections to 980 complete the transfer of the configured payload size. From the 981 previous table, the 64KB window is considered. Each of the 5 982 TCP connections would be configured to transfer 100MB, and each one 983 should obtain a maximum of 100 Mb/sec. So for this example, the 984 100MB payload should be transferred across the connections in 985 approximately 8 seconds (which would be the ideal TCP transfer time 986 under these conditions). 988 Additionally, the TCP Efficiency metric MUST be computed for each 989 connection as defined in section 3.3.2. 991 3.3.5 Interpretation of the TCP Throughput Results 993 At the end of this step, the user will document the theoretical BDP 994 and a set of Window size experiments with measured TCP throughput for 995 each TCP window size. For cases where the sustained TCP throughput 996 does not equal the ideal value, some possible causes are: 998 - Network congestion causing packet loss which MAY be inferred from 999 a poor TCP Efficiency % (higher TCP Efficiency % = less packet 1000 loss) 1001 - Network congestion causing an increase in RTT which MAY be inferred 1002 from the Buffer Delay Percentage (i.e., 0% = no increase in RTT 1003 over baseline) 1004 - Intermediate network devices which actively regenerate the TCP 1005 connection and can alter TCP RWND Size, MSS, etc. 1006 - Rate limiting (policing). More details on traffic management 1007 tests follows in section 3.4 1009 3.3.6 High Performance Network Options 1011 For cases where the network outperforms the client/server IP hosts 1012 some possible causes are: 1014 - Maximum TCP Buffer space. All operating systems have a global 1015 mechanism to limit the quantity of system memory to be used by TCP 1016 connections. On some systems, each connection is subject to a memory 1017 limit that is applied to the total memory used for input data, output 1018 data and controls. 
On other systems, there are separate limits for 1019 input and output buffer spaces per connection. Client/server IP 1020 hosts might be configured with Maximum Buffer Space limits that are 1021 far too small for high performance networks. 1023 - Socket Buffer Sizes. Most operating systems support separate per 1024 connection send and receive buffer limits that can be adjusted as 1025 long as they stay within the maximum memory limits. These socket 1026 buffers must be large enough to hold a full BDP of TCP Bytes plus 1027 some overhead. There are several methods that can be used to adjust 1028 socket buffer sizes, but TCP Auto-Tuning automatically adjusts these 1029 as needed to optimally balance TCP performance and memory usage. 1030 It is important to note that Auto-Tuning is enabled by default in 1031 LINUX since the kernel release 2.6.6 and in UNIX since FreeBSD 7.0. 1032 It is also enabled by default in Windows since Vista and in MAC since 1033 OS X version 10.5 (leopard). Over buffering can cause some 1034 applications to behave poorly, typically causing sluggish interactive 1035 response and risk running the system out of memory. Large default 1036 socket buffers have to be considered carefully on multi-user systems. 1038 - TCP Window Scale Option, RFC1323. This option enables TCP to 1039 support large BDP paths. It provides a scale factor which is 1040 required for TCP to support window sizes larger than 64KB. Most 1041 systems automatically request WSCALE under some conditions, such as 1042 when the receive socket buffer is larger than 64KB or when the other 1043 end of the TCP connection requests it first. WSCALE can only be 1044 negotiated during the 3 way handshake. If either end fails to 1045 request WSCALE or requests an insufficient value, it cannot be 1046 renegotiated. Different systems use different algorithms to select 1047 WSCALE, but it is very important to have large enough buffer 1048 sizes. Note that under these constraints, a client application 1049 wishing to send data at high rates may need to set its own receive 1050 buffer to something larger than 64K Bytes before it opens the 1051 connection to ensure that the server properly negotiates WSCALE. 1052 A system administrator might have to explicitly enable RFC1323 1053 extensions. Otherwise, the client/server IP host would not support 1054 TCP window sizes (BDP) larger than 64KB. Most of the time, 1055 performance gains will be obtained by enabling this option in Long 1056 Fat Networks. (i.e., networks with large BDP, see Figure 3.3.1b). 1058 - TCP Timestamps Option, RFC1323. This feature provides better 1059 measurements of the Round Trip Time and protects TCP from data 1060 corruption that might occur if packets are delivered so late that the 1061 sequence numbers wrap before they are delivered. Wrapped sequence 1062 numbers do not pose a serious risk below 100 Mbps, but the risk 1063 increases at higher data rates. Most of the time, performance gains 1064 will be obtained by enabling this option in Gigabit bandwidth 1065 networks. 1067 - TCP Selective Acknowledgments Option (SACK), RFC2018. This allows 1068 a TCP receiver to inform the sender about exactly which data segment 1069 is missing and needs to be retransmitted. Without SACK, TCP has to 1070 estimate which data segment is missing, which works just fine if all 1071 losses are isolated (i.e. only one loss in any given round trip). 1072 Without SACK, TCP takes a very long time to recover after multiple 1073 and consecutive losses. 
SACK is now supported by most operating 1074 systems, but it may have to be explicitly enabled by the system 1075 administrator. In networks with unknown load and error patterns, TCP 1076 SACK will improve throughput performances. On the other hand, 1077 security appliances vendors might have implemented TCP randomization 1078 without considering TCP SACK and under such circumstances, SACK might 1079 need to be disabled in the client/server IP hosts until the vendor 1080 corrects the issue. Also, poorly implemented SACK algorithms might 1081 cause extreme CPU loads and might need to be disabled. 1083 - Path MTU. The client/server IP host system must use the largest 1084 possible MTU for the path. This may require enabling Path MTU 1085 Discovery (RFC1191 & RFC4821). Since RFC1191 is flawed it is 1086 sometimes not enabled by default and may need to be explicitly 1087 enabled by the system administrator. RFC4821 describes a new, more 1088 robust algorithm for MTU discovery and ICMP black hole recovery. 1090 - TOE (TCP Offload Engine). Some recent Network Interface Cards (NIC) 1091 are equipped with drivers that can do part or all of the TCP/IP 1092 protocol processing. TOE implementations require additional work 1093 (i.e. hardware-specific socket manipulation) to set up and tear down 1094 connections. Because TOE NICs configuration parameters are vendor 1095 specific and not necessarily RFC-compliant, they are poorly 1096 integrated with UNIX & LINUX. Occasionally, TOE might need to be 1097 disabled in a server because its NIC does not have enough memory 1098 resources to buffer thousands of connections. 1100 Note that both ends of a TCP connection must be properly tuned. 1102 3.4. Traffic Management Tests 1104 In most cases, the network connection between two geographic 1105 locations (branch offices, etc.) is lower than the network connection 1106 to host computers. An example would be LAN connectivity of GigE 1107 and WAN connectivity of 100 Mbps. The WAN connectivity may be 1108 physically 100 Mbps or logically 100 Mbps (over a GigE WAN 1109 connection). In the later case, rate limiting is used to provide the 1110 WAN bandwidth per the SLA. 1112 Traffic management techniques might be employed and the most common 1113 are: 1115 - Traffic Policing and/or Shaping 1116 - Priority queuing 1117 - Active Queue Management (AQM) 1118 Configuring the end-to-end network with these various traffic 1119 management mechanisms is a complex under-taking. For traffic shaping 1120 and AQM techniques, the end goal is to provide better performance to 1121 bursty traffic. 1123 This section of the methodology provides guidelines to test traffic 1124 shaping and AQM implementations. As in section 3.3, host hardware 1125 performance must be well understood before conducting the traffic 1126 shaping and AQM tests. Dedicated communications test instrument will 1127 generally be REQUIRED for line rates of GigE and 10 GigE. If the 1128 throughput test is expected to exceed 10% of the provider bandwidth, 1129 then the test should be coordinated with the network provider. This 1130 does not include the customer premises bandwidth, the 10% refers to 1131 the provider's bandwidth (Provider Edge to Provider router). Note 1132 that GigE and 10 GigE interfaces might benefit from hold-queue 1133 adjustments in order to prevent the saw-tooth TCP traffic pattern. 
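Before conducting the traffic management tests below (or the
throughput tests of section 3.3), it may be worthwhile to verify the
host options discussed in section 3.3.6. The following Python sketch
is one possible way to do so; it is Linux specific, the /proc paths
assume a reasonably recent Linux kernel, and other operating systems
expose the equivalent settings differently.

   # Linux-specific sketch: each of these files contains "1" when the
   # corresponding TCP option is enabled on the host.
   OPTIONS = {
       "RFC 1323 Window Scaling":    "/proc/sys/net/ipv4/tcp_window_scaling",
       "RFC 1323 Timestamps":        "/proc/sys/net/ipv4/tcp_timestamps",
       "RFC 2018 SACK":              "/proc/sys/net/ipv4/tcp_sack",
       "Receive buffer Auto-Tuning": "/proc/sys/net/ipv4/tcp_moderate_rcvbuf",
   }

   def check_host_tcp_options():
       for name, path in OPTIONS.items():
           try:
               enabled = open(path).read().strip() == "1"
           except OSError:
               enabled = None          # not a Linux host, or option unavailable
           print("%-28s %s" % (name, enabled))

   def show_socket_buffer_limits():
       # min / default / max (Bytes) for the Receive and Send Socket Buffers;
       # the max values must be at least equal to the BDP of the path under test.
       for path in ("/proc/sys/net/ipv4/tcp_rmem", "/proc/sys/net/ipv4/tcp_wmem"):
           print(path, open(path).read().strip())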
3.4.1 Traffic Shaping Tests

For services where the available bandwidth is rate limited, two (2)
techniques can be used: traffic policing or traffic shaping.

Simply stated, traffic policing marks and/or drops packets which
exceed the SLA bandwidth (in most cases, excess traffic is dropped).
Traffic shaping employs the use of queues to smooth the bursty
traffic and then send it out within the SLA bandwidth limit (without
dropping packets unless the traffic shaping queue is exhausted).

Traffic shaping is generally configured for TCP data services and
can provide improved TCP performance since retransmissions are
reduced, which in turn optimizes TCP throughput for the available
bandwidth. Throughout this section, the rate-limited bandwidth shall
be referred to as the "Bottleneck Bandwidth".

Proper traffic shaping is more easily diagnosed when conducting a
multiple TCP connections test. Proper shaping will provide a fair
distribution of the available Bottleneck Bandwidth, while traffic
policing will not.

The traffic shaping tests are built upon the concepts of multiple
connections testing as defined in section 3.3.3. Calculating the BDP
for the Bottleneck Bandwidth is first required before selecting the
number of connections, the Send Socket Buffer and TCP RWND Sizes per
connection.

Similar to the example in section 3.3, a typical test scenario might
be: GigE LAN with a 500 Mbps Bottleneck Bandwidth (rate limited
logical interface) and 5 msec RTT. This would require five (5) TCP
connections of 64 KB Send Socket Buffer and TCP RWND Sizes to evenly
fill the Bottleneck Bandwidth (~100 Mbps per connection).

The traffic shaping test should be run over a long enough duration to
properly exercise network buffers (i.e. greater than 30 seconds) and
should also characterize performance at different times of day. The
throughput of each connection MUST be logged during the entire test,
along with the TCP Transfer Time, TCP Efficiency, and Buffer Delay
Percentage.

3.4.1.1 Interpretation of Traffic Shaping Test Results

By plotting the throughput achieved by each TCP connection, we should
see fair sharing of the bandwidth when traffic shaping is properly
configured. For the previous example of 5 connections sharing 500
Mbps, each connection would consume ~100 Mbps with smooth variations.

If traffic shaping is not configured properly or if traffic policing
is present on the bottleneck interface, the bandwidth sharing may not
be fair. The resulting throughput plot may reveal "spiky" throughput
consumption by the competing TCP connections (due to the high rate of
TCP retransmissions).

3.4.2 AQM Tests

Active Queue Management techniques are specifically targeted to
provide congestion avoidance to TCP traffic. As an example, before
the network element queue "fills" and enters the tail drop state, an
AQM implementation like RED (Random Early Discard) drops packets at
pre-configurable queue depth thresholds. This action causes TCP
connections to back off, which helps prevent tail drops and in turn
helps avoid global TCP synchronization.
1199 RED is just an example and other AQM implementations like WRED 1200 (Weighted Random Early Discard) or REM (Random Exponential Marking) 1201 or AREM (Adaptive Random Exponential Marking), just to name a few, 1202 could be used. 1204 Again, rate limited interfaces may benefit greatly from AQM based 1205 techniques. With a default FIFO queue, bloated buffering is an 1206 increasingly common occurrence and has dire effects on TCP 1207 connections; the main effects are delayed congestion 1208 feedback (poor TCP control loop response) and enormous queuing 1209 delays for all other traffic flows. 1211 In a FIFO based queue, the TCP traffic may not be able to achieve 1212 the full throughput available on the Bottleneck Bandwidth link, 1213 while with an AQM implementation, TCP congestion avoidance would 1214 throttle the connections on the higher speed interface (i.e. LAN) 1215 and could help achieve the full throughput (up to the Bottleneck 1216 Bandwidth). The bursty nature of TCP traffic is a key factor in the 1217 overall effectiveness of AQM techniques; steady state bulk transfer 1218 flows will generally not benefit from AQM because with bulk transfer 1219 flows, network device queues gracefully throttle the effective 1220 throughput rates due to increased delays. 1222 Proper AQM configuration is more easily detected by conducting a 1223 multiple TCP connections test. Multiple 1224 TCP connections provide the bursty sources that emulate the 1225 real-world conditions for which AQM implementations are intended. 1227 AQM testing also builds upon the concepts of multiple connections 1228 testing as defined in section 3.3.3. Calculating the BDP for the 1229 Bottleneck Bandwidth is first required before selecting the number 1230 of connections, the Send Socket Buffer size and the TCP RWND Size 1231 per connection. 1233 For AQM testing, the desired effect is to cause the TCP connections 1234 to burst beyond the Bottleneck Bandwidth so that queue drops will 1235 occur. Using the same example from section 3.4.1 (traffic shaping), 1236 the 500 Mbps Bottleneck Bandwidth requires 5 TCP connections (with a 1237 window size of 64 KB) to fill the capacity. Some experimentation is 1238 required, but it is recommended to start with double the number of 1239 connections in order to stress the network element buffers/queues 1240 (10 connections for this example). 1242 The TCP TTD must be configured to generate these connections as 1243 shorter (bursty) flows versus bulk transfer type flows. These TCP 1244 bursts should stress queue sizes in the 512 KB range. Again, 1245 experimentation will be required; the proper number of TCP 1246 connections, the Send Socket Buffer and TCP RWND Sizes will be 1247 dictated by the size of the network element queue. 1249 3.4.2.1 Interpretation of AQM Results 1251 The default queuing technique for most network devices is FIFO based. 1252 Under heavy traffic conditions, FIFO based queue management may cause 1253 enormous queuing delays plus delayed congestion feedback to all TCP 1254 applications. This can cause excessive loss on all of the TCP 1255 connections and, in the worst cases, global TCP synchronization. 1257 An AQM implementation can be detected by plotting individual and 1258 aggregate throughput results achieved by multiple TCP connections on 1259 the bottleneck interface. Proper AQM operation may be determined if 1260 the TCP throughput is fully utilized (up to the Bottleneck Bandwidth) 1261 and fairly shared between TCP connections. For the previous example 1262 of 10 connections (window = 64 KB) sharing 500 Mbps, each connection 1263 should consume ~50 Mbps. If AQM is not properly enabled on the 1264 interface, the TCP connections will retransmit at higher rates, 1265 and the net effect is that the Bottleneck Bandwidth is not fully 1266 utilized.
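As a rough illustration of this interpretation step, the Python sketch below compares the aggregate throughput of the connections against the Bottleneck Bandwidth and computes Jain's fairness index over the per-connection results. The 0.95 utilization and fairness thresholds are arbitrary values chosen for this sketch, not thresholds defined by this methodology.

   # Illustrative sketch: does a multiple-connection result look like
   # proper AQM behavior, i.e. the Bottleneck Bandwidth is well utilized
   # and fairly shared?  The 0.95 thresholds are arbitrary examples.

   def jain_fairness(throughputs):
       # Jain's fairness index: 1.0 means perfectly equal sharing.
       n = len(throughputs)
       return sum(throughputs) ** 2 / (n * sum(t * t for t in throughputs))

   def looks_like_aqm(throughputs_mbps, bottleneck_mbps):
       utilization = sum(throughputs_mbps) / bottleneck_mbps
       return utilization >= 0.95 and jain_fairness(throughputs_mbps) >= 0.95

   # Example: 10 connections sharing a 500 Mbps Bottleneck Bandwidth.
   print(looks_like_aqm([51, 49, 50, 48, 52, 50, 49, 51, 50, 50], 500))  # True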
1268 Another means to study non-AQM versus AQM implementations is to use 1269 the Buffer Delay Percentage metric for all of the connections. The 1270 Buffer Delay Percentage should be significantly lower in AQM 1271 implementations versus default FIFO queuing. 1273 Additionally, non-AQM implementations may exhibit a lower TCP 1274 Efficiency. 1276 4. Security Considerations 1278 The security considerations that apply to any active measurement of 1279 live networks are relevant here as well. See [RFC4656] and 1280 [RFC5357]. 1282 5. IANA Considerations 1284 This document does not REQUIRE an IANA registration for ports 1285 dedicated to the TCP testing described in this document. 1287 6. Acknowledgments 1289 Thanks to Lars Eggert, Al Morton, Matt Mathis, Matt Zekauskas, 1290 Yaakov Stein, and Loki Jorgenson for many good comments and for 1291 pointing us to great sources of information pertaining to past works 1292 in the TCP capacity area. 1294 7. References 1296 7.1 Normative References 1298 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1299 Requirement Levels", BCP 14, RFC 2119, March 1997. 1301 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1302 Zekauskas, "A One-way Active Measurement Protocol 1303 (OWAMP)", RFC 4656, September 2006. 1305 [RFC2544] Bradner, S. and J. McQuaid, "Benchmarking Methodology for 1306 Network Interconnect Devices", RFC 2544, March 1999. 1308 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1309 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1310 RFC 5357, October 2008. 1312 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1313 Discovery", RFC 4821, March 2007. 1315 Allman, M., "A Bulk Transfer Capacity Methodology for 1316 Cooperating Hosts", draft-ietf-ippm-btc-cap-00.txt (work 1317 in progress), August 2001. 1319 [RFC2681] Almes, G., Kalidindi, S., and M. Zekauskas, "A Round-trip 1320 Delay Metric for IPPM", RFC 2681, September 1999. 1322 [RFC4898] Mathis, M., Heffner, J., and R. Raghunarayan, "TCP 1323 Extended Statistics MIB", RFC 4898, May 2007. 1325 [RFC5136] Chimento, P. and J. Ishac, "Defining Network Capacity", 1326 RFC 5136, February 2008. 1328 [RFC1323] Jacobson, V., Braden, R., and D. Borman, "TCP Extensions 1329 for High Performance", RFC 1323, May 1992. 1331 7.2. Informative References 1333 Authors' Addresses 1335 Barry Constantine 1336 JDSU, Test and Measurement Division 1337 One Milestone Center Court 1338 Germantown, MD 20876-7100 1339 USA 1341 Phone: +1 240 404 2227 1342 barry.constantine@jdsu.com 1344 Gilles Forget 1345 Independent Consultant to Bell Canada. 1346 308, rue de Monaco, St-Eustache 1347 Qc. CANADA, Postal Code : J7P-4T5 1349 Phone: (514) 895-8212 1350 gilles.forget@sympatico.ca 1352 Rudiger Geib 1353 Heinrich-Hertz-Strasse 3-7 1354 Darmstadt, Germany, 64295 1356 Phone: +49 6151 6282747 1357 Ruediger.Geib@telekom.de 1359 Reinhard Schrage 1360 Schrage Consulting 1362 Phone: +49 (0) 5137 909540 1363 reinhard@schrageconsult.com