TCP Maintenance Working Group                                    W. Wang
Internet-Draft                                               N. Cardwell
Intended status: Experimental                                   Y. Cheng
Expires: December 10, 2017                                    E. Dumazet
                                                             Google, Inc
                                                            June 8, 2017


                         TCP Low Latency Option
                   draft-wang-tcpm-low-latency-opt-00

Abstract

   This document specifies the TCP Low Latency option, which TCP
   connections can use during the connection establishment handshake to
   communicate extra parameters that can improve performance in
   low-latency environments.  With the first such parameter, a TCP data
   receiver can advertise a hint about the Maximum ACK Delay (MAD) it
   will schedule for its own delayed ACK mechanism.  This enables the
   TCP data sender to achieve lower latencies during loss recovery by
   using the Maximum ACK Delay advertised by the remote receiver to help
   compute retransmission timeouts that are potentially much lower than
   would otherwise be feasible.  The Low Latency option is extensible,
   and later versions of this draft will introduce other mechanisms,
   including TCP timestamps with a finer granularity than those
   supported by RFC 7323.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 10, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1.  Introduction

   TCP receivers typically implement a delayed ACK algorithm, as
   specified in [RFC1122], Section 4.2.3.2.  As summarized in [RFC5681],
   Section 4.2, "an ACK SHOULD be generated for at least every second
   full-sized segment, and MUST be generated within 500 ms of the
   arrival of the first unacknowledged packet."  In practice, many
   widely deployed implementations delay ACKs by up to roughly 200ms.
   This is probably a historical artifact inherited from the 200ms "fast
   timeout" mechanism in the BSD TCP implementation from the late 1980s
   [WS95].

   To avoid spurious timeouts caused by these delayed ACKs, widely
   deployed TCP sender implementations constrain retransmission timeout
   (RTO) values to be at least 200ms.

   Unfortunately, this 200ms value is at least 2000x the typical RTT of
   today's commodity datacenter networks (which is typically below 100
   microseconds).  Senders that constrain RTOs to be at least 200ms
   therefore pay a latency penalty far larger than the RTT in such
   environments.

   The TCP Low Latency option enables a TCP data receiver to advertise a
   hint about the Maximum ACK Delay (MAD) it will schedule for its own
   delayed ACK mechanism.  The receiver specifies the MAD value in the
   Low Latency option because the feasible value can differ widely
   between receivers, depending on CPU speed, CPU and network workloads,
   and OS-specific constraints on minimum supported timer granularity.

   This Low Latency option enables the TCP data sender to achieve lower
   latencies during loss recovery by using the Maximum ACK Delay
   advertised by the remote receiver to help compute retransmission
   timeouts that are potentially much lower than would otherwise be
   feasible.

2.  Terminology

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, "MAD" refers to the Maximum ACK Delay used by the
   data receiver to delay TCP acknowledgments, and "minRTO" refers to
   the minimum retransmission timeout.

3.  Detailed Protocol

3.1.  TCP Low Latency Option

   The Low Latency option is only valid in SYN or SYN/ACK packets during
   the three-way handshake.  It MUST be ignored in other cases.

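
   As a rough illustration of the validity rule above, a receive-side
   check might look like the following C sketch.  The function name and
   the flag macro are hypothetical and appear only for illustration;
   the SYN bit position (0x02) is the standard TCP header value.

```c
#include <stdbool.h>
#include <stdint.h>

/* SYN bit in the TCP header flags octet (standard TCP value). */
#define TCP_FLAG_SYN 0x02

/*
 * Hypothetical helper: returns true when a Low Latency option carried
 * in a segment with these flags may be processed.  SYN and SYN/ACK
 * segments both have the SYN bit set; in any other segment the option
 * is ignored.
 */
bool low_latency_opt_allowed(uint8_t tcp_flags)
{
    return (tcp_flags & TCP_FLAG_SYN) != 0;
}
```

   With this sketch, flags 0x02 (SYN) and 0x12 (SYN/ACK) are accepted,
   while 0x10 (a plain ACK) is not.
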
   The format of the TCP Low Latency option is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |    Length     |M u|        MAD        |       |
   |               |               |A n|       Value       |  Res  |
   |               |               |D i|     (10 bits)     |       |
   |               |               |  t|                   |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                       ... Reserved ...                        ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Kind:       1 byte:    value = IANA-assigned option number
   Length:     1 byte:    value = 4 (or longer in later versions)
   MAD unit:   2 bits:    indicates time unit for MAD value:
                             0: reserved
                             1: milliseconds
                             2: microseconds
                             3: nanoseconds
   MAD value:  10 bits:   indicates MAD value set on the host:
                             1 ... 1023: MAD value in the given units
                             0: no MAD value is specified
   Reserved:   N>=4 bits: value = 0

   In order to support future extensions, the option is variable-length.
   Bits beyond those defined so far in IETF standards should be
   considered "reserved".  TCP implementations MUST (a) set to zero any
   reserved bits they add for padding, and (b) ignore any reserved bits
   (whether they are set or not).

3.2.  Overview

   The communication, starting from the TCP connection handshake, looks
   like the following:

         TCP A (Active)                         TCP B (Passive)
         ==============                         ===============
         CLOSED                                 LISTEN
   #1    SYN-SENT       -----           ------> SYN-RCVD
                                                (Adjust RTO accordingly)
   #2    ESTABLISHED    <----           -----   SYN-RCVD
         (Adjust RTO accordingly)
   #3    ESTABLISHED    ----------------------> ESTABLISHED
   #4    Send()         --------------------> -
                                              |
                                              | Delay Ack < 5ms
                                              |
                        <-------------------- -
   #5    Recv()

   #6    Send()         ----------------------->
                      |
           RTO >= 5ms |
                      |
                        ------------>
                        <------------------------
   #7    Recv()

3.3.  Configuring maximum ACK delay

   An implementation that supports the maximum ACK delay parameter MUST
   provide a user API to configure the maximum ACK delay for a specific
   connection or for all TCP connections.

   o  If the user does not specify a MAD value, then the implementation
      SHOULD NOT specify a MAD value in the Low Latency option.

   o  If the user specifies a MAD value outside the range of ACK delay
      values supported by the implementation, then the implementation
      SHOULD allow the request to succeed, but SHOULD silently constrain
      the MAD value to be within the valid range (between the minimum
      and maximum ACK delay for the implementation).  This is intended
      to allow applications to portably request a MAD value without
      needing special logic to search for a valid value.

   o  If a specified connection is not in the CLOSED or LISTEN state,
      the API SHOULD return an error and ignore the request to specify a
      MAD value.

   o  Otherwise the implementation SHOULD use the user-specified value
      both as the maximum timeout for the delayed ACK and as the MAD
      value in the Low Latency option of the specified TCP connections.

   The exact design and implementation of such an API is intentionally
   left to the implementation.  We discuss some examples in the
   appendix.

3.4.  Announcing the maximum ACK delay

   o  The maximum ACK delay is announced to the remote TCP endpoint by
      including a Low Latency option with a non-zero MAD value in the
      SYN or SYN/ACK packet.  A "MAD value" field of 0 in the Low
      Latency option indicates that the sender is not specifying a MAD
      value.

   o  If specified, then the MAD value in the Low Latency option MUST be
      set, as close as possible, to the implementation's actual delayed
      ACK timeout for the connection.  Note that the actual maximum
      delayed ACK timeout of the connection may be larger than the
      user-specified value because of implementation constraints (e.g.
      timer granularity limitations).

   o  If the user has specified a MAD value for an active connection,
      then the active open side SHOULD include a Low Latency option with
      a MAD value in the SYN packet.

   o  If the user has specified a MAD value for a passive connection,
      and the passive side has received at least one SYN packet with a
      Low Latency option with a valid MAD value, then the passive open
      side SHOULD return its MAD value in the Low Latency option.

3.5.  Adjusting TCP retransmission timeouts

   If the MAD value advertised in a received Low Latency option is 0, or
   greater than the default maximum ACK delay of 200ms, then the option
   SHOULD be ignored and no further action is needed.

   Otherwise the (data) sender MAY use the maximum ACK delay advertised
   by the receiver to adjust the sender's RTO calculation.
   Specifically, if the sender implements an RTO calculation based on
   [RFC6298], it MAY replace the 1-second lower bound specified in rule
   (2.4) of Section 2 with the value of the maximum ACK delay advertised
   in the Low Latency option, so that the calculation becomes:

      RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay)

   instead of

      RTO <- max(SRTT + max(G, K*RTTVAR), 1 second)  /* [RFC6298] */

   Here we use the notation of [RFC6298], including SRTT (smoothed
   round-trip time), RTTVAR (round-trip time variation), and G (clock
   granularity).

   If the sender also implements [draft-ietf-tcpm-rack], then it SHOULD
   replace the maximum delayed ACK parameter (WCDelAckT) with the
   max_ACK_delay specified in the Low Latency option.

   Using the MAD value in the RTO calculation helps senders reduce the
   RTO significantly while still avoiding spurious retransmissions due
   to delayed ACKs.  With this new algorithm, the RTO can be drastically
   shortened in most environments where the receiver advertises a MAD.
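
   The checks and the modified RTO calculation in this section can be
   sketched in C as follows.  This is a minimal illustration under
   stated assumptions, not a definitive implementation: the names
   mad_to_ns() and rto_with_mad_ns() are hypothetical, all times are
   kept in nanoseconds, K is 4 as in [RFC6298], and the reserved unit
   code 0 is treated as "no usable MAD" (an assumption; this document
   does not say how to handle it).

```c
#include <stdbool.h>
#include <stdint.h>

/* MAD unit codes from the option format; code 0 is reserved. */
enum mad_unit {
    MAD_UNIT_RESERVED = 0,
    MAD_UNIT_MS = 1,
    MAD_UNIT_US = 2,
    MAD_UNIT_NS = 3
};

/* Default maximum ACK delay of 200 ms, expressed in nanoseconds. */
#define DEFAULT_MAX_ACK_DELAY_NS 200000000ULL

/*
 * Hypothetical helper: convert a received (unit, value) pair to
 * nanoseconds.  Returns false when the option should be ignored:
 * value 0 (no MAD specified), a value that does not fit in 10 bits,
 * the reserved unit (assumption), or a MAD above the 200 ms default.
 */
bool mad_to_ns(unsigned unit, unsigned value, uint64_t *mad_ns)
{
    uint64_t scale;

    if (value == 0 || value > 1023)
        return false;

    switch (unit) {
    case MAD_UNIT_MS: scale = 1000000ULL; break;
    case MAD_UNIT_US: scale = 1000ULL;    break;
    case MAD_UNIT_NS: scale = 1ULL;       break;
    default:          return false;       /* reserved unit code */
    }

    *mad_ns = (uint64_t)value * scale;
    return *mad_ns <= DEFAULT_MAX_ACK_DELAY_NS;
}

static inline uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/*
 * RTO <- SRTT + max(G, K*RTTVAR) + max(G, max_ACK_delay), with K = 4.
 * All arguments and the result are in nanoseconds.
 */
uint64_t rto_with_mad_ns(uint64_t srtt, uint64_t rttvar,
                         uint64_t g, uint64_t max_ack_delay)
{
    return srtt + max_u64(g, 4 * rttvar) + max_u64(g, max_ack_delay);
}
```

   For example, with SRTT = 100 microseconds, RTTVAR = 20 microseconds,
   G = 1 microsecond, and an advertised MAD of 5 milliseconds, the
   sketch yields an RTO of 100 + 80 + 5000 = 5180 microseconds.
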
   In particular, in data center environments the RTO can often be
   reduced from one second or more to single-digit milliseconds.  Using
   the MAD to reduce the RTO can improve performance and thus mitigate
   TCP incast issues.  More details are provided in the following
   Related work section.

4.  Related work

   Several research papers have shown that reducing the minimum
   retransmission timeout (minRTO) significantly improves the
   performance of TCP in the datacenter by mitigating the effect of TCP
   timeouts, which in turn can mitigate TCP incast issues.

   o  In "Attaining the Promise and Avoiding the Pitfalls of TCP in the
      Datacenter" [JS15], the authors show that reducing minRTO from
      200ms to 5ms greatly reduced the impact of TCP incast issues.

   o  In "Understanding TCP incast throughput collapse in datacenter
      networks" [CG09], the authors show significant improvement in
      goodput when reducing minRTO.

   o  In "Measurement and Analysis of TCP Throughput Collapse in
      Cluster-based Storage Systems" [PK07], the authors show that
      reducing minRTO from 200 milliseconds to 200 microseconds improved
      goodput by an order of magnitude in some data center scenarios
      they evaluated.

   o  In "Safe and Effective Fine-grained TCP Retransmissions for
      Datacenter Communication" [VP09], the authors point out that the
      imbalance between the TCP minRTO and datacenter latencies can
      result in poor performance for applications sensitive to
      millisecond-scale delays in query response times.  In simulations
      of datacenter scenarios they show that goodput drops when
      increasing minRTO above 1ms.  Moreover, in some data center
      scenarios the default minRTO of 200ms results in nearly 2 orders
      of magnitude lower throughput compared to a minRTO of 1ms.

   o  In Google data centers a TCP option mechanism equivalent to the
      Low Latency option's MAD parameter has been used since 2005, and
      the TCP minRTO has been set to 5ms by default since 2013 [CC16].

5.  Middlebox Considerations

   The new Low Latency option might expose some middlebox issues:

   o  Middleboxes could drop SYNs that carry a Low Latency option if
      they treat it as an unknown option.  However, this happens fairly
      rarely according to "Is it still possible to extend TCP?" [HN11],
      Table 3.

   o  In case middleboxes alter the content of the Low Latency option,
      the receiver SHOULD do a sanity check on the MAD value included in
      the Low Latency option to verify that it is less than or equal to
      the default maximum ACK delay of 200ms.  As explained earlier, it
      is not practical for users to set a MAD value greater than the
      default.  So it is safe to treat a MAD value greater than the
      default as the result of a bad user configuration or a
      malfunctioning middlebox, and to ignore the Low Latency option
      completely in such cases.

6.  Security Considerations

   TBD

7.  IANA Considerations

   As no official option number has been issued for the new Low Latency
   option by IANA yet, experimental option 254 per [RFC6994] with magic
   number 0xF990 (16 bits) is used for now.

   The option format with experimental ID is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |    Length     |    RFC 6994 Experiment ID     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M u|        MAD        |       |
   |A n|       Value       |  Res  |   ...
   |D i|     (10 bits)     |       |
   |  t|                   |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Kind:           1 byte:    value = 254
   Length:         1 byte:    value = 6 (or longer in later versions)
   Experiment ID:  2 bytes:   value = 0xF990
   MAD unit:       2 bits:    indicates time unit for MAD value:
                                 0: reserved
                                 1: milliseconds
                                 2: microseconds
                                 3: nanoseconds
   MAD value:      10 bits:   indicates MAD value set on the host:
                                 1 ... 1023: MAD value in the given units
                                 0: no MAD value is specified
   Reserved:       N>=4 bits: value = 0

   We will migrate to using the official option number for the Low
   Latency option after IANA has assigned one.

8.  Appendix

8.1.  Example user API in Linux to configure maximum ACK delay

8.1.1.  Per-route MAD configuration API

   A new configuration option called "mad" will be added to the "ip"
   command line tool in the iproute2 package.  Users can use this to
   configure a per-route MAD value like the following:

      ip route add 10.1.2.0/24 dev eth0 scope link src 10.1.2.123 mad 5ms

   This configures all connections destined to 10.1.2.0/24 to have a MAD
   value of 5ms.  When implementing this new MAD option field, the "ip"
   command line tool will verify that the provided MAD parameter is less
   than or equal to the default MAD value of 200ms.  If the MAD is
   invalid, then the "ip" tool will reject the command and report an
   error to the user.

   Newly created TCP sockets have the default 200ms MAD value.  When a
   TCP connection is opened, the implementation SHOULD consult the IP
   routing table to check whether a MAD value is configured for the
   route.  If so, the implementation copies the route's MAD value to the
   connection's MAD value.

   This per-route configuration will mostly be used by network
   administrators when configuring routes on the host.

8.1.2.  MAD Socket option API

   Socket options provide per-connection configuration parameters.
   To allow per-connection configuration of the MAD value in the Low
   Latency option, a new TCP socket option called TCP_MAD will be added
   to the TCP implementation.  This will allow applications to request a
   MAD value at a finer granularity than the per-route configuration,
   depending on the application's requirements.

   The API will look like the following example:

      int mad_val = 5 * 1000 * 1000;  /* MAD in nanoseconds: 5ms */
      int err = setsockopt(fd, SOL_TCP, TCP_MAD,
                           &mad_val, sizeof(mad_val));

   The socket option implementation will sanitize the MAD value provided
   by the user.  Per the specification above, in the "Configuring
   maximum ACK delay" section, if the user specifies a MAD value outside
   the range of ACK delay values supported by the implementation, then
   the implementation will allow the request to succeed, but will
   silently constrain the MAD value to be within the valid range
   (between the minimum and maximum ACK delay for the implementation).
   This is intended to allow applications to portably request a MAD
   value without needing special logic to search for a valid value.

   Once the implementation has sanitized the provided MAD value, it will
   record the value in the socket as the socket's own MAD value.

   Note: the MAD value set by the socket option SHOULD always override
   the per-route MAD value if there is one.

9.  References

9.1.  Normative References

   [draft-ietf-tcpm-rack]
              Cheng, Y., Cardwell, N., and N. Dukkipati, "RACK: a time-
              based fast loss detection algorithm for TCP", draft-ietf-
              tcpm-rack-02 (work in progress), March 2017.

   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              June 2011.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, August 2013.

9.2.  Informative References

   [CC16]     Cardwell, N., Cheng, Y., and E. Dumazet, "TCP Options for
              Low Latency: Maximum ACK Delay and Microsecond
              Timestamps", IETF 97, November 2016.

   [CG09]     Chen, Y., Griffith, R., Liu, J., and R. Katz,
              "Understanding TCP incast throughput collapse in
              datacenter networks", WREN 09, August 2009.

   [HN11]     Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it Still Possible to
              Extend TCP?", IMC 11, November 2011.

   [JS15]     Judd, G., "Attaining the Promise and Avoiding the Pitfalls
              of TCP in the Datacenter", NSDI 15, May 2015.

   [PK07]     Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D.,
              Ganger, G., Gibson, G., and S. Seshan, "Measurement and
              Analysis of TCP Throughput Collapse in Cluster-based
              Storage Systems", September 2007.

   [VP09]     Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E.,
              Andersen, D., Ganger, G., Gibson, G., and B. Mueller,
              "Safe and Effective Fine-grained TCP Retransmissions for
              Datacenter Communication", SIGCOMM 09, August 2009.

   [WS95]     Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
              The Implementation", 1995.

Authors' Addresses

   Wei Wang
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 94043
   USA

   Email: weiwan@google.com


   Neal Cardwell
   Google, Inc
   76 Ninth Avenue
   New York, NY 10011
   USA

   Email: ncardwell@google.com


   Yuchung Cheng
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 94043
   USA

   Email: ycheng@google.com


   Eric Dumazet
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California 94043

   Email: edumazet@google.com