idnits 2.17.1

draft-ietf-tcpm-2140bis-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 313 has weird spacing: '...sthresh old_...'

  == Line 315 has weird spacing: '...endcwnd old_...'

  == Line 344 has weird spacing: '... MSSopt curr_...'

  == Line 354 has weird spacing: '...sthresh curr...'

  == Line 356 has weird spacing: '...endcwnd curr...'

  == (5 more instances...)

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008. The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs. If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 8, 2021) is 1173 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tcpm-generalized-ecn-06

  -- Obsolete informational reference (is this intentional?): RFC 1644
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1379
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 2001
     (Obsoleted by RFC 2581)

  -- Obsolete informational reference (is this intentional?): RFC 2140
     (Obsoleted by RFC 9040)

  -- Obsolete informational reference (is this intentional?): RFC 2414
     (Obsoleted by RFC 3390)

  -- Obsolete informational reference (is this intentional?): RFC 4960
     (Obsoleted by RFC 9260)

  -- Obsolete informational reference (is this intentional?): RFC 7231
     (Obsoleted by RFC 9110)

  -- Obsolete informational reference (is this intentional?): RFC 7540
     (Obsoleted by RFC 9113)

     Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    TCPM WG                                                     J. Touch
2    Internet Draft                                           Independent
3    Intended status: Informational                              M. Welzl
4    Obsoletes: 2140                                             S. Islam
5    Expires: August 2021                              University of Oslo
6                                                        February 8, 2021

8                       TCP Control Block Interdependence
9                        draft-ietf-tcpm-2140bis-09.txt

11   Status of this Memo

13      This Internet-Draft is submitted in full conformance with the
14      provisions of BCP 78 and BCP 79.

16      This document may contain material from IETF Documents or IETF
17      Contributions published or made publicly available before November
18      10, 2008. The person(s) controlling the copyright in some of this
19      material may not have granted the IETF Trust the right to allow
20      modifications of such material outside the IETF Standards Process.
21      Without obtaining an adequate license from the person(s) controlling
22      the copyright in such materials, this document may not be modified
23      outside the IETF Standards Process, and derivative works of it may
24      not be created outside the IETF Standards Process, except to format
25      it for publication as an RFC or to translate it into languages other
26      than English.

28      Internet-Drafts are working documents of the Internet Engineering
29      Task Force (IETF), its areas, and its working groups. Note that
30      other groups may also distribute working documents as
31      Internet-Drafts.

33      Internet-Drafts are draft documents valid for a maximum of six
34      months and may be updated, replaced, or obsoleted by other documents
35      at any time. It is inappropriate to use Internet-Drafts as
36      reference material or to cite them other than as "work in progress."

38      The list of current Internet-Drafts can be accessed at
39      http://www.ietf.org/ietf/1id-abstracts.txt

41      The list of Internet-Draft Shadow Directories can be accessed at
42      http://www.ietf.org/shadow.html

44      This Internet-Draft will expire on August 8, 2021.

46   Copyright Notice

48      Copyright (c) 2021 IETF Trust and the persons identified as the
49      document authors. All rights reserved.

51      This document is subject to BCP 78 and the IETF Trust's Legal
52      Provisions Relating to IETF Documents
53      (https://trustee.ietf.org/license-info) in effect on the date of
54      publication of this document. Please review these documents
55      carefully, as they describe your rights and restrictions with
56      respect to this document. Code Components extracted from this
57      document must include Simplified BSD License text as described in
58      Section 4.e of the Trust Legal Provisions and are provided
59      without warranty as described in the Simplified BSD License.

61   Abstract

63      This memo provides guidance to TCP implementers that is intended to
64      help improve convergence to steady-state operation without affecting
65      interoperability.
        It updates and replaces RFC 2140's description of
66      interdependent TCP control blocks and the ways that part of TCP
67      state can be shared among similar concurrent or consecutive
68      connections. TCP state includes a combination of parameters, such as
69      connection state, current round-trip time estimates, congestion
70      control information, and process information. Most of this state is
71      maintained on a per-connection basis in the TCP Control Block (TCB),
72      but implementations can (and do) share certain TCB information
73      across connections to the same host. Such sharing is intended to
74      improve overall transient transport performance, while maintaining
75      backward-compatibility with existing implementations. The sharing
76      described herein is limited to only the TCB initialization and so
77      has no effect on the long-term behavior of TCP after a connection
78      has been established.

80   Table of Contents

82      1. Introduction
83      2. Conventions Used in This Document
84      3. Terminology
85      4. The TCP Control Block (TCB)
86      5. TCB Interdependence
87      6. Temporal Sharing
88         6.1. Initialization of the new TCB
89         6.2. Updates to the new TCB
90         6.3. Discussion
91      7. Ensemble Sharing
92         7.1. Initialization of a new TCB
93         7.2. Updates to the new TCB
94         7.3. Discussion
95      8. Compatibility Issues
96         8.1. Traversing the same network path
97         8.2. State dependence
98         8.3. Problems with IP sharing
99      9. Implications
100        9.1. Layering
101        9.2. Other possibilities
102     10. Implementation Observations
103     11. Updates to RFC 2140
104     12. Security Considerations
105     13. IANA Considerations
106     14. References
107        14.1. Normative References
108        14.2. Informative References
109     15. Acknowledgments
110     16. Change log
111     Appendix A : TCB Sharing History
112     Appendix B : TCP Option Sharing and Caching
113     Appendix C : Automating the Initial Window in TCP over Long
114     Timescales
115        C.1. Introduction
116        C.2. Design Considerations
117        C.3. Proposed IW Algorithm
118        C.4. Discussion
119        C.5. Observations

121  1. Introduction

123     TCP is a connection-oriented reliable transport protocol layered
124     over IP [RFC793]. Each TCP connection maintains state, usually in a
125     data structure called the TCP Control Block (TCB).
        The TCB contains
126     information about the connection state, its associated local
127     process, and feedback parameters about the connection's transmission
128     properties. As originally specified and usually implemented, most
129     TCB information is maintained on a per-connection basis. Some
130     implementations can (and now do) share certain TCB information
131     across connections to the same host [RFC2140]. Such sharing is
132     intended to lead to better overall transient performance, especially
133     for numerous short-lived and simultaneous connections, as often used
134     in the World-Wide Web [Be94][Br02]. This sharing of state is
135     intended to help TCP connections converge to steady-state behavior
136     more quickly without affecting TCP interoperability.

138     This document updates RFC 2140's discussion of TCB state sharing and
139     provides a complete replacement for that document. This state
140     sharing affects only TCB initialization [RFC2140] and thus has no
141     effect on the long-term behavior of TCP after a connection has been
142     established nor on interoperability. Path information shared across
143     SYN destination port numbers assumes that TCP segments having the
144     same host-pair experience the same path properties, irrespective of
145     TCP port numbers. The observations about TCB sharing in this
146     document apply similarly to any protocol with congestion state,
147     including SCTP [RFC4960] and DCCP [RFC4340], as well as for
148     individual subflows in Multipath TCP [RFC8684].

150  2. Conventions Used in This Document

152     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
153     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
154     "OPTIONAL" in this document are to be interpreted as described in
155     BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
156     capitals, as shown here.

158     The core of this document describes behavior that is already
159     permitted by TCP standards.
        As a result, it provides informative
160     guidance but does not use normative language, except when quoting
161     other documents. Normative language is used in Appendix C as
162     examples of requirements for future consideration.

164  3. Terminology

166     The following terminology is used frequently in this document. Items
167     preceded with a "+" may be part of the state maintained as TCP
168     connection state in the associated connection's TCB and are the
169     focus of sharing as described in this document.

171     +cwnd - the TCP congestion window size [RFC5681]

173     Host - a source or sink of TCP segments associated with a single IP
174     address

176     Host-pair - a pair of hosts and their corresponding IP addresses

178     +MMS_R - the maximum message size that can be received, the largest
179     received transport payload of an IP datagram [RFC1122]

181     +MMS_S - the maximum message size that can be sent, the largest
182     transmitted transport payload of an IP datagram [RFC1122]

184     Path - an Internet path between the IP addresses of two hosts

185     PCB - protocol control block, the data associated with a protocol as
186     maintained by an endpoint; a TCP PCB is called a TCB

188     PLPMTUD - packetization-layer path MTU discovery, a mechanism that
189     uses transport packets to discover the PMTU [RFC4821]

191     +PMTU - the largest IP datagram that can traverse a path
192     [RFC1191][RFC8201]

194     PMTUD - path-layer MTU discovery, a mechanism that relies on ICMP
195     error messages to discover the PMTU [RFC1191][RFC8201]

197     +RTT - the round-trip time of a TCP packet exchange [RFC793]

199     +RTTvar - the variance of the round-trip times of a TCP packet
200     exchange [RFC6298]

202     +RWIN - the TCP receive window size [RFC793]

204     +sendcwnd - the TCP send-side congestion window (cwnd) size
205     [RFC5681]

207     +sendMSS - the TCP maximum segment size, a value transmitted in a
208     TCP option that represents the largest TCP user data payload that
209     can be received [RFC793]

211     +ssthresh - the TCP slow-start threshold [RFC5681]

213     TCB - TCP Control Block, the data associated with a TCP connection
214     as maintained by an endpoint

216     TCP-AO - the TCP Authentication Option [RFC5925]

218     TFO - TCP Fast Open option [RFC7413]

220     +TFO_cookie - the TCP Fast Open cookie, state that is used as part
221     of the TFO mechanism, when TFO is supported [RFC7413]

223     +TFO_failure - an indication of when TFO option negotiation failed,
224     when TFO is supported

226     +TFOinfo - information cached when a TFO connection is established,
227     which includes the TFO_cookie [RFC7413]

229  4. The TCP Control Block (TCB)

231     A TCB describes the data associated with each connection, i.e., with
232     each association of a pair of applications across the network. The
233     TCB contains at least the following information [RFC793]:

235        Local process state
236            pointers to send and receive buffers
237            pointers to retransmission queue and current segment
238            pointers to Internet Protocol (IP) PCB
239        Per-connection shared state
240            macro-state
241                connection state
242                timers
243                flags
244                local and remote host numbers and ports
245                TCP option state
246            micro-state
247                send and receive window state (size*, current number)
248                cong. window size (snd_cwnd)*
249                cong. window size threshold (ssthresh)*
250                max window size seen*
251                sendMSS#
252                MMS_S#
253                MMS_R#
254                PMTU#
255                round-trip time and variance#

257     The per-connection information is shown as split into macro-state
258     and micro-state, terminology borrowed from [Co91]. Macro-state
259     describes the protocol for establishing the initial shared state
260     about the connection; we include the endpoint numbers and components
261     (timers, flags) required upon commencement that are later used to
262     help maintain that state. Micro-state describes the protocol after a
263     connection has been established, to maintain the reliability and
264     congestion control of the data transferred in the connection.
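
As a non-normative illustration, the split between macro-state and
micro-state described above could be modeled as shown here. This is a
minimal sketch in Python under stated assumptions: the class and field
names (MacroState, MicroState, TCB, host_pair) and the units in the
comments are hypothetical and are not taken from [RFC793] or from any
TCP implementation. Local process state (buffer and queue pointers),
timers, and option state are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class MacroState:
    """State used to establish the connection (hypothetical field names)."""
    connection_state: str = "CLOSED"
    local_addr: str = ""
    local_port: int = 0
    remote_addr: str = ""
    remote_port: int = 0
    flags: int = 0


@dataclass
class MicroState:
    """State maintained after establishment; None means "not yet known"."""
    snd_cwnd: Optional[int] = None    # congestion window (assumed: bytes)
    ssthresh: Optional[int] = None    # slow-start threshold (assumed: bytes)
    send_mss: Optional[int] = None    # sendMSS (assumed: bytes)
    pmtu: Optional[int] = None        # path MTU (assumed: bytes)
    rtt: Optional[float] = None       # round-trip time (assumed: seconds)
    rttvar: Optional[float] = None    # round-trip time variance


@dataclass
class TCB:
    """Per-connection control block: macro-state plus micro-state."""
    macro: MacroState = field(default_factory=MacroState)
    micro: MicroState = field(default_factory=MicroState)

    def host_pair(self) -> Tuple[str, str]:
        """Return the (local, remote) address pair, ignoring ports."""
        return (self.macro.local_addr, self.macro.remote_addr)
```

Two connections between the same addresses but with different ports
yield the same host_pair() value, which matches the definition of a
host-pair in Section 3.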

266     We further distinguish two other classes of shared micro-state that
267     are associated more with host-pairs than with application pairs. One
268     class is clearly host-pair dependent (#, e.g., MSS, MMS, PMTU, RTT),
269     and the other is host-pair dependent in its aggregate (*, e.g.,
270     congestion window information, current window sizes, etc.).

272     Finally, we exclude RWIN from further discussion because its value
273     should depend on the send window size, so it is already addressed by
274     send window sharing and is not independently affected by sharing.

276  5. TCB Interdependence

278     There are two cases of TCB interdependence. Temporal sharing occurs
279     when the TCB of an earlier (now CLOSED) connection to a host is used
280     to initialize some parameters of a new connection to that same host,
281     i.e., in sequence. Ensemble sharing occurs when a currently active
282     connection to a host is used to initialize another (concurrent)
283     connection to that host.

285  6. Temporal Sharing

287     The TCB data cache is accessed in two ways: it is read to initialize
288     new TCBs and written when more current per-host state is available.

290  6.1. Initialization of the new TCB

292     TCBs for new connections can be initialized using context from past
293     connections as follows:

295                  TEMPORAL SHARING - TCB Initialization

297              Cached TCB           New TCB
298              --------------------------------------
299              old_MMS_S            old_MMS_S or not cached

301              old_MMS_R            old_MMS_R or not cached

303              old_sendMSS          old_sendMSS

305              old_PMTU             old_PMTU+

307              old_RTT              old_RTT

309              old_RTTvar           old_RTTvar

311              old_option           (option specific)

313              old_ssthresh         old_ssthresh

315              old_sendcwnd         old_sendcwnd

317              +Note that PMTU feedback is cached at the IP layer [RFC1191].

319     The table below gives an overview of option-specific information
320     that can be shared. Additional information on some specific TCP
321     options and sharing is provided in Appendix B.

323              TEMPORAL SHARING - Option Info Initialization

325              Cached               New
326              ------------------------------------
327              old_TFO_cookie       old_TFO_cookie

329              old_TFO_failure      old_TFO_failure

331  6.2. Updates to the new TCB

333     During the connection, the associated TCB can be updated based on
334     particular events, as shown below:

336                    TEMPORAL SHARING - Cache Updates

338      Cached TCB     Current TCB     when?    New Cached TCB
339      ----------------------------------------------------------
340      old_MMS_S      curr_MMS_S      OPEN     curr_MMS_S

342      old_MMS_R      curr_MMS_R      OPEN     curr_MMS_R

344      old_sendMSS    curr_sendMSS    MSSopt   curr_sendMSS

346      old_PMTU       curr_PMTU       PMTUD+   curr_PMTU

348      old_RTT        curr_RTT        CLOSE    merge(curr,old)

350      old_RTTvar     curr_RTTvar     CLOSE    merge(curr,old)

352      old_option     curr_option     ESTAB    (depends on option)

354      old_ssthresh   curr_ssthresh   CLOSE    merge(curr,old)

356      old_sendcwnd   curr_sendcwnd   CLOSE    merge(curr,old)

358      +Note that PMTU feedback is cached at the IP layer [RFC1191].

360     The table below gives an overview of option-specific information
361     that can be similarly shared. The TFO cookie is maintained until the
362     client explicitly requests it be updated as a separate event.

364                 TEMPORAL SHARING - Option Info Updates

366      Cached            Current           when?   New Cached
367      ---------------------------------------------------------
368      old_TFO_cookie    old_TFO_cookie    ESTAB   old_TFO_cookie

370      old_TFO_failure   old_TFO_failure   ESTAB   old_TFO_failure

372  6.3. Discussion

374     There is no particular benefit to caching MMS_S and MMS_R as these
375     are reported by the local IP stack. Caching sendMSS and PMTU is
376     trivial; reported values are cached (PMTU at the IP layer), and the
377     most recent values are used.
        The cache is updated when the MSS
378     option is received in a SYN or after PMTUD (i.e., when an ICMPv4
379     Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is
380     received [RFC8201] or the equivalent is inferred, e.g., as from
381     PLPMTUD [RFC4821]), respectively, so the cache always has the most
382     recent values from any connection. For sendMSS, the cache is
383     consulted only at connection establishment and not otherwise
384     updated, which means that MSS options do not affect current
385     connections. The default sendMSS is never saved; only reported MSS
386     values update the cache, so an explicit override is required to
387     reduce the sendMSS. Cached sendMSS affects only data sent in the SYN
388     segment, i.e., during client connection initiation or during
389     simultaneous open; all other segment MSS are based on the value
390     updated as included in the SYN.

392     RTT values are updated by formulae that merge the old and new
393     values. Dynamic RTT estimation requires a sequence of RTT
394     measurements. As a result, the cached RTT (and its variance) is an
395     average of its previous value with the contents of the currently
396     active TCB for that host, when a TCB is closed. RTT values are
397     updated only when a connection is closed. The method for merging old
398     and current values needs to attempt to reduce the transient effects
399     of the new connections.

401     The updates for RTT, RTTvar and ssthresh rely on existing
402     information, i.e., old values. Should no such values exist, the
403     current values are cached instead.

405     TCP options are copied or merged depending on the details of each
406     option, where "merge" is some function that combines the values of
407     "curr" and "old". E.g., TFO state is updated when a connection is
408     established and read before establishing a new connection.

410     Sections 8 and 9 discuss compatibility issues and implications of
411     sharing the specific information listed above.
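
The temporal-sharing cache behavior discussed above can be sketched as
follows. This is a non-normative illustration under stated assumptions:
the names CacheEntry, TemporalCache, on_mss_option, on_close, and
initial_values are hypothetical, and merge() uses an equal-weight
average by default, which is only one possible choice because this
document does not prescribe a specific merging formula.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

HostPair = Tuple[str, str]  # (local IP address, remote IP address)


@dataclass
class CacheEntry:
    """Cached per-host-pair values; None means nothing is cached yet."""
    send_mss: Optional[int] = None
    rtt: Optional[float] = None
    rttvar: Optional[float] = None
    ssthresh: Optional[float] = None
    sendcwnd: Optional[float] = None


def merge(curr: float, old: Optional[float], weight: float = 0.5) -> float:
    """Combine a closing connection's value with the cached one.

    Assumption: a weighted average, with equal weights by default. If no
    old value exists, the current value is cached instead.
    """
    if old is None:
        return curr
    return weight * curr + (1.0 - weight) * old


class TemporalCache:
    """Hypothetical per-host-pair cache for temporal sharing."""

    def __init__(self) -> None:
        self._entries: Dict[HostPair, CacheEntry] = {}

    def on_mss_option(self, pair: HostPair, mss: int) -> None:
        """A reported MSS option replaces the cached sendMSS."""
        self._entries.setdefault(pair, CacheEntry()).send_mss = mss

    def on_close(self, pair: HostPair, rtt: float, rttvar: float,
                 ssthresh: float, sendcwnd: float) -> None:
        """Merge the closing TCB's values into the cache (CLOSE event)."""
        entry = self._entries.setdefault(pair, CacheEntry())
        entry.rtt = merge(rtt, entry.rtt)
        entry.rttvar = merge(rttvar, entry.rttvar)
        entry.ssthresh = merge(ssthresh, entry.ssthresh)
        entry.sendcwnd = merge(sendcwnd, entry.sendcwnd)

    def initial_values(self, pair: HostPair) -> CacheEntry:
        """Return a copy of the cached values for initializing a new TCB."""
        return replace(self._entries.get(pair, CacheEntry()))
```

In this sketch, the first connection to close seeds the cache with its
own values, and later closes are averaged in, so a single unusual
connection shifts the cached values only partway.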
        Section 10 gives an
412     overview of known implementations.

414     Most cached TCB values are updated when a connection closes. The
415     exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122],
416     PMTU, which is updated after Path MTU Discovery and also reported by
417     IP [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the
418     MSS option is received in the TCP SYN header.

420     Sharing sendMSS information affects only data in the SYN of the next
421     connection, because sendMSS information is typically included in
422     most TCP SYN segments. Caching PMTU can accelerate the efficiency of
423     PMTUD but can also result in black-holing until corrected if in
424     error. Caching MMS_R and MMS_S may be of little direct value as they
425     are reported by the local IP stack anyway.

427     The way in which other TCP option state can be shared depends on the
428     details of that option. E.g., TFO state includes the TCP Fast Open
429     Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open
430     response. RFC 7413 states, "The client MUST cache negative responses
431     from the server in order to avoid potential connection failures.
432     Negative responses include the server not acknowledging the data in
433     the SYN, ICMP error messages, and (most importantly) no response
434     (SYN-ACK) from the server at all, i.e., connection timeout."
435     [RFC7413]. TFOinfo is cached when a connection is established.

437     Other TCP option state might not be as readily cached. E.g., TCP-AO
438     [RFC5925] success or failure between a host pair for a single SYN
439     destination port might be usefully cached. TCP-AO success or failure
440     to other SYN destination ports on that host pair is never useful to
441     cache because TCP-AO security parameters can vary per service.

443  7. Ensemble Sharing

445     Sharing cached TCB data across concurrent connections requires
446     attention to the aggregate nature of some of the shared state.
        For
447     example, although MSS and RTT values can be shared by copying, it
448     may not be appropriate to simply copy congestion window or ssthresh
449     information; instead, the new values can be a function (f) of the
450     cumulative values and the number of connections (N).

452  7.1. Initialization of a new TCB

454     TCBs for new connections can be initialized using context from
455     concurrent connections as follows:

457                  ENSEMBLE SHARING - TCB Initialization

459              Cached TCB            New TCB
460              ------------------------------------------
461              old_MMS_S             old_MMS_S

463              old_MMS_R             old_MMS_R

465              old_sendMSS           old_sendMSS

467              old_PMTU              old_PMTU+

469              old_RTT               old_RTT

471              old_RTTvar            old_RTTvar

473              sum(old_ssthresh)     f(sum(old_ssthresh), N)

475              sum(old_sendcwnd)     f(sum(old_sendcwnd), N)

477              old_option            (option specific)

479              +Note that PMTU feedback is cached at the IP layer [RFC1191].

481     The table below gives an overview of option-specific information
482     that can be similarly shared. Again, the TFO_cookie is updated upon
483     explicit client request, which is a separate event.

485              ENSEMBLE SHARING - Option Info Initialization

487              Cached               New
488              ------------------------------------
489              old_TFO_cookie       old_TFO_cookie

491              old_TFO_failure      old_TFO_failure

493  7.2. Updates to the new TCB

495     During the connection, the associated TCB can be updated based on
496     changes to concurrent connections, as shown below:

498                    ENSEMBLE SHARING - Cache Updates

500      Cached TCB     Current TCB     when?     New Cached TCB
501      ---------------------------------------------------------------
502      old_MMS_S      curr_MMS_S      OPEN      curr_MMS_S

504      old_MMS_R      curr_MMS_R      OPEN      curr_MMS_R

506      old_sendMSS    curr_sendMSS    MSSopt    curr_sendMSS

508      old_PMTU       curr_PMTU       PMTUD+ /  curr_PMTU
509                                     PLPMTUD

511      old_RTT        curr_RTT        update    rtt_update(old,curr)

513      old_RTTvar     curr_RTTvar     update    rtt_update(old,curr)

515      old_ssthresh   curr_ssthresh   update    adjust sum as appropriate

517      old_sendcwnd   curr_sendcwnd   update    adjust sum as appropriate

519      old_option     curr_option     (depends) (option specific)

521      +Note that PMTU feedback is cached at the IP layer [RFC1191].

523     The table below gives an overview of option-specific information
524     that can be similarly shared.

526                 ENSEMBLE SHARING - Option Info Updates

528      Cached            Current           when?   New Cached
529      ----------------------------------------------------------
530      old_TFO_cookie    old_TFO_cookie    ESTAB   old_TFO_cookie

532      old_TFO_failure   old_TFO_failure   ESTAB   old_TFO_failure

534  7.3. Discussion

536     For ensemble sharing, TCB information should be cached as early as
537     possible, sometimes before a connection is closed. Otherwise,
538     opening multiple concurrent connections may not result in TCB data
539     sharing if no connection closes before others open. The amount of
540     work involved in updating the aggregate average should be minimized,
541     but the resulting value should be equivalent to having all values
542     measured within a single connection. The function "rtt_update" in
543     the ensemble sharing table indicates this operation, which occurs
544     whenever the RTT would have been updated in the individual TCP
545     connection. As a result, the cache contains the shared RTT
546     variables, which no longer need to reside in the TCB.

548     Congestion window size and ssthresh aggregation are more complicated
549     in the concurrent case.
        When there is an ensemble of connections, we
550     need to decide how that ensemble would have shared these variables,
551     in order to derive initial values for new TCBs.

553     Sections 8 and 9 discuss compatibility issues and implications of
554     sharing the specific information listed above.

556     Any assumption of TCB information sharing can be incorrect because
557     identical endpoint address pairs may not share network paths. In
558     current implementations, new congestion windows are set at an
559     initial value of 4-10 segments [RFC3390][RFC6928], so that the sum
560     of the current windows is increased for any new connection. This can
561     have detrimental consequences where several connections share a
562     highly congested link.

564     There are several ways to initialize the congestion window in a new
565     TCB among an ensemble of current connections to a host. Current TCP
566     implementations initialize it to four segments as standard [RFC3390]
567     and 10 segments experimentally [RFC6928]. These approaches assume
568     that new connections should behave as conservatively as possible.
569     The algorithm described in [Ba12] adjusts the initial cwnd depending
570     on the cwnd values of ongoing connections. It is also possible to
571     use sharing mechanisms over long timescales to adapt TCP's initial
572     window automatically, as described further in Appendix C.

574  8. Compatibility Issues

576     Here, we discuss various types of problems that may arise with TCB
577     information sharing.

579     For the congestion and current window information, the initial
580     values computed by TCB interdependence may not be consistent with
581     the long-term aggregate behavior of a set of concurrent connections
582     between the same endpoints.
        Under conventional TCP congestion
583     control, if a single existing connection has converged to a
584     congestion window of 40 segments and two newly joining concurrent
585     connections assume initial windows of 10 segments [RFC6928], the
586     existing connection's window does not decrease to accommodate this
587     additional load, and the connections can mutually interfere. One
588     example of this is seen on low-bandwidth, high-delay links, where
589     concurrent connections supporting Web traffic can collide because
590     their initial windows were too large, even when set at one segment.

592     The authors of [Hu12] recommend caching ssthresh for temporal
593     sharing only when flows are long. Some studies suggest that sharing
594     ssthresh between short flows can deteriorate the performance of
595     individual connections [Hu12][Du16], although this may benefit
596     aggregate network performance.

598  8.1. Traversing the same network path

600     TCP is sometimes used in situations where packets of the same
601     host-pair do not always take the same path. Multipath routing that
602     relies on examining transport headers, such as ECMP and LAG
603     [RFC7424], may not result in repeatable path selection when TCP
604     segments are encapsulated, encrypted, or altered - for example, in
605     some Virtual Private Network (VPN) tunnels that rely on proprietary
606     encapsulation. Similarly, such approaches cannot operate
607     deterministically when the TCP header is encrypted, e.g., when using
608     IPsec ESP (although TCB interdependence among the entire set sharing
609     the same endpoint IP addresses should work without problems when the
610     TCP header is encrypted). Measures to increase the probability that
611     connections use the same path could be applied: e.g., the
612     connections could be given the same IPv6 flow label.
        TCB
613     interdependence can also be extended to sets of host IP address
614     pairs that share the same network path conditions, such as when a
615     group of addresses is on the same LAN (see Section 9).

617     Traversing the same path is not important for host-specific
618     information such as RWIN and TCP option state (e.g., TFOinfo). When
619     TCB information is shared across different SYN destination ports,
620     path-related information can be incorrect; however, the impact of
621     this error is potentially diminished if (as discussed here) TCB
622     sharing affects only the transient event of a connection start or if
623     TCB information is shared only within connections to the same SYN
624     destination port. In the case of temporal sharing, TCB information
625     could also become invalid over time. Because this is similar to the
626     case when a connection becomes idle, mechanisms that address idle
627     TCP connections (e.g., [RFC7661]) could also be applied to TCB cache
628     management, especially when TCP Fast Open is used [RFC7413].

630  8.2. State dependence

632     There may be additional considerations to the way in which TCB
633     interdependence rebalances congestion feedback among the current
634     connections, e.g., it may be appropriate to consider the impact of a
635     connection being in Fast Recovery [RFC5681] or some other similar
636     unusual feedback state, e.g., as inhibiting or affecting the
637     calculations described herein.

639  8.3. Problems with IP sharing

641     It can be wrong to share TCB information between TCP connections on
642     the same host as identified by the IP address if an IP address is
643     assigned to a new host (e.g., IP address spinning, as is used by
644     ISPs to inhibit running servers). It can be wrong if Network Address
645     (and Port) Translation (NA(P)T) [RFC2663] or any other IP sharing
646     mechanism is used. Such mechanisms are less likely to be used with
647     IPv6.
        Other methods to identify a host could also be considered to
648     make correct TCB sharing more likely. Moreover, some TCB information
649     is about dominant path properties rather than the specific host. IP
650     addresses may differ, yet the relevant part of the path may be the
651     same.

653  9. Implications

655     There are several implications to incorporating TCB interdependence
656     in TCP implementations. First, it may reduce the need for
657     application-layer multiplexing for performance enhancement
658     [RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection
659     reestablishment costs by serializing or multiplexing a set of
660     per-host connections across a single TCP connection. This avoids
661     TCP's per-connection OPEN handshake and also avoids recomputing the
662     MSS, RTT, and congestion window values. By avoiding the so-called
663     "slow-start restart," performance can be optimized [Hu01]. TCB
664     interdependence can provide the "slow-start restart avoidance" of
665     multiplexing, without requiring a multiplexing mechanism at the
666     application layer.

668     Like the initial version of this document [RFC2140], this update's
669     approach to TCB interdependence focuses on sharing a set of TCBs by
670     updating the TCB state to reduce the impact of transients when
671     connections begin or end. Other mechanisms have since been proposed
672     to continuously share information between all ongoing communication
673     (including connectionless protocols), updating the congestion state
674     during any congestion-related event (e.g., timeout, loss
675     confirmation, etc.) [RFC3124]. By dealing exclusively with
676     transients, TCB interdependence is more likely to exhibit the same
677     behavior as unmodified, independent TCP connections.

679  9.1. Layering

681     TCB interdependence pushes some of the TCP implementation from the
682     traditional transport layer (in the ISO model) to the network
683     layer.
This acknowledges that some state is in fact per-host-pair or 684 can be per-path as indicated solely by that host-pair. Transport 685 protocols typically manage per-application-pair associations (per 686 stream), and network protocols manage per-host-pair and path 687 associations (routing). Round-trip time, MSS, and congestion 688 information could be more appropriately handled in a network-layer 689 fashion, aggregated among concurrent connections, and shared across 690 connection instances [RFC3124]. 692 An earlier version of RTT sharing suggested implementing RTT state 693 at the IP layer, rather than at the TCP layer. Our observations 694 describe sharing state among TCP connections, which avoids some of 695 the difficulties in an IP-layer solution. One such problem of an IP 696 layer solution is determining the correspondence between packet 697 exchanges using IP header information alone, where such 698 correspondence is needed to compute RTT. Because TCB sharing 699 computes RTTs inside the TCP layer using TCP header information, it 700 can be implemented more directly and simply than at the IP layer. 701 This is a case where information should be computed at the transport 702 layer but could be shared at the network layer. 704 9.2. Other possibilities 706 Per-host-pair associations are not the limit of these techniques. It 707 is possible that TCBs could be similarly shared between hosts on a 708 subnet or within a cluster, because the predominant path can be 709 subnet-subnet, rather than host-host. Additionally, TCB 710 interdependence can be applied to any protocol with congestion 711 state, including SCTP [RFC4960] and DCCP [RFC4340], as well as for 712 individual subflows in Multipath TCP [RFC8684]. 714 There may be other information that can be shared between concurrent 715 connections. For example, knowing that another connection has just 716 tried to expand its window size and failed, a connection may not 717 attempt to do the same for some period. 
The idea is that existing 718 TCP implementations infer the behavior of all competing connections, 719 including those within the same host or subnet. One possible 720 optimization is to make that implicit feedback explicit, via 721 extended information associated with the endpoint IP address and its 722 TCP implementation, rather than per-connection state in the TCB. 724 10. Implementation Observations 726 The observation that some TCB state is host-pair specific rather 727 than application-pair dependent is not new and is a common 728 engineering decision in layered protocol implementations. Although 729 now deprecated, T/TCP [RFC1644] was the first to propose using 730 caches in order to maintain TCB states (see Appendix A). 732 The table below describes the current implementation status for TCB 733 temporal sharing in Windows as of December 2020, Linux kernel 734 version 5.10.3, and FreeBSD 12. Ensemble sharing is not yet 735 implemented. 737 KNOWN IMPLEMENTATION STATUS 739 TCB data Status 740 ------------------------------------------------------------ 741 old_MMS_S Not shared 743 old_MMS_R Not shared 745 old_sendMSS Cached and shared in Apple, Linux (MSS) 747 old_PMTU Cached and shared in Apple, FreeBSD, Windows (PMTU) 749 old_RTT Cached and shared in Apple, FreeBSD, Linux, Windows 751 old_RTTvar Cached and shared in Apple, FreeBSD, Windows 753 old_TFOinfo Cached and shared in Apple, Linux, Windows 755 old_sendcwnd Not shared 757 old_ssthresh Cached and shared in Apple, FreeBSD*, Linux* 759 TFO_failure Cached and shared in Apple 761 In the table above, "Apple" refers to all Apple OSes, i.e., 762 desktop/laptop macOS, phone iOS, video player tvOS, pad ipadOS, and 763 watch watchOS, which all share the same Internet protocol stack. 765 *Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and 766 previous value if a previous value exists; in Linux, the calculation 767 depends on state and is max(curr_cwnd/2, old_ssthresh) in most 768 cases. 770 11. 
Updates to RFC 2140 772 This document updates the description of TCB sharing in RFC 2140 and 773 its associated impact on existing and new connection state, 774 providing a complete replacement for that document [RFC2140]. It 775 clarifies the previous description and terminology and extends the 776 mechanism to its impact on new protocols and mechanisms, including 777 multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication 778 Option. 780 The detailed impact on TCB state addresses TCB parameters in greater 781 detail: it addresses RSS in both the send and receive directions, treats MSS 782 and send-MSS separately, adds path MTU and ssthresh, and addresses 783 the impact on TCP option state. 785 New sections have been added to address compatibility issues and 786 implementation observations. The relation of this work to T/TCP has 787 been moved to Appendix A on history, partly to reflect the 788 deprecation of that protocol. 790 Appendix C has been added to discuss the potential to use temporal 791 sharing over long timescales to adapt TCP's initial window 792 automatically, avoiding the need to periodically revise a single 793 global constant value. 795 Finally, this document updates and significantly expands the 796 referenced literature. 798 12. Security Considerations 800 The implementation methods presented here do not have additional 801 ramifications for explicit attacks. They may be susceptible to 802 denial-of-service attacks if not otherwise secured. 804 TCB sharing may be susceptible to denial-of-service attacks, 805 wherever the TCB is shared, between connections in a single host, or 806 between hosts if TCB sharing is implemented within a subnet (see 807 Implications section). Some shared TCB parameters are used only to 808 create new TCBs, others are shared among the TCBs of ongoing 809 connections. New connections can join the ongoing set, e.g., to 810 optimize send window size among a set of connections to the same 811 host.
PMTU is defined as shared at the IP layer, and is already 812 susceptible in this way. 814 Options in client SYNs can be easier to forge than complete, two-way 815 connections. As a result, their values may not be safely 816 incorporated in shared values until after the three-way handshake 817 completes. 819 Attacks on parameters used only for initialization affect only the 820 transient performance of a TCP connection. For short connections, 821 the performance ramification can approach that of a denial-of- 822 service attack. E.g., if an application changes its TCB to have a 823 false and small window size, subsequent connections will experience 824 performance degradation until their windows grow appropriately. 826 TCB sharing reuses and mixes information from past and current 827 connections. Although reusing information could create a potential 828 for fingerprinting to identify hosts, the mixing reduces that 829 potential. There has been no evidence of fingerprinting based on 830 this technique and it is currently considered safe in that regard. 832 13. IANA Considerations 834 There are no IANA implications or requests in this document. 836 This section should be removed upon final publication as an RFC. 838 14. References 840 14.1. Normative References 842 [RFC793] Postel, Jon, "Transmission Control Protocol," Network 843 Working Group RFC-793/STD-7, ISI, Sept. 1981. 845 [RFC1122] Braden, R. (ed), "Requirements for Internet Hosts -- 846 Communication Layers", RFC-1122, Oct. 1989. 848 [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191, 849 Nov. 1990. 851 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 852 Requirement Levels", BCP 14, RFC 2119, March 1997. 854 [RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU 855 Discovery," RFC 4821, Mar. 2007. 857 [RFC5681] Allman, M., Paxson, V., Blanton, E., "TCP Congestion 858 Control," RFC 5681 (Standards Track), Sep. 2009.
860 [RFC6298] Paxson, V., Allman, M., Chu, J., Sargent, M., "Computing 861 TCP's Retransmission Timer," RFC 6298, June 2011. 863 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast 864 Open", RFC 7413, Dec. 2014. 866 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 867 2119 Key Words", RFC 8174, May 2017. 869 [RFC8201] McCann, J., Deering, S., Mogul, J., Hinden, R. (Ed.), 870 "Path MTU Discovery for IP version 6," RFC 8201, Jul. 871 2017. 873 14.2. Informative References 875 [Al10] Allman, M., "Initial Congestion Window Specification", 876 (work in progress), draft-allman-tcpm-bump-initcwnd-00, 877 Nov. 2010. 879 [Ba12] Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A 880 Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala 881 Lumpur, Malaysia, May 23-27 2016. 883 [Ba20] Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit 884 Congestion Notification (ECN) to TCP Control Packets", 885 draft-ietf-tcpm-generalized-ecn-06, Oct. 2020. 887 [Be94] Berners-Lee, T., et al., "The World-Wide Web," 888 Communications of the ACM, V37, Aug. 1994, pp. 76-82. 890 [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for 891 Sun OS 4.1.3,", Release 1.0, USC/ISI, September 14, 1994. 893 [Br02] Brownlee, N. and K. Claffy, "Understanding Internet 894 Traffic Streams: Dragonflies and Tortoises", IEEE 895 Communications Magazine p110-117, 2002. 897 [Co91] Comer, D., Stevens, D., Internetworking with TCP/IP, V2, 898 Prentice-Hall, NJ, 1991. 900 [Du16] Dukkipati, N., Cheng, Y., and Vahdat, A., "Research 901 Impacting the Practice of Congestion Control." ACM SIGCOMM 902 CCR (editorial), on-line post, July 2016. 904 [FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/ 906 [Hu01] Hughes, A., Touch, J., Heidemann, J., "Issues in Slow- 907 Start Restart After Idle", draft-hughes-restart-00 908 (expired), Dec. 2001.
910 [Hu12] Hurtig, P., Brunstrom, A., "Enhanced metric caching for 911 short TCP flows," 2012 IEEE International Conference on 912 Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213. 914 [Ja88] Jacobson, V., M. Karels, "Congestion Avoidance and 915 Control", Proc. Sigcomm 1988. 917 [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions 918 Functional Specification," RFC-1644, July 1994. 920 [RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379, 921 September 1992. 923 [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast 924 Retransmit, and Fast Recovery Algorithms", RFC2001 925 (Standards Track), Jan. 1997. 927 [RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140, 928 April 1997. 930 [RFC2414] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's 931 Initial Window", RFC 2414 (Experimental), Sept. 1998. 933 [RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address 934 Translator (NAT) Terminology and Considerations", RFC- 935 2663, August 1999. 937 [RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's 938 Initial Window," RFC 3390, Oct. 2002. 940 [RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager," 941 RFC 3124, June 2001. 943 [RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion 944 Control Protocol (DCCP)," RFC 4340, Mar. 2006. 946 [RFC4960] Stewart, R., (Ed.), "Stream Control Transmission 947 Protocol," RFC4960, Sept. 2007. 949 [RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication 950 Option," RFC 5925, June 2010. 952 [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing 953 TCP's Initial Window," RFC 6928, Apr. 2013. 955 [RFC7231] Fielding, R., J. Reschke, Eds., "HTTP/1.1 Semantics and 956 Content," RFC-7231, June 2014. 958 [RFC7323] Borman, D., B. Braden, V. Jacobson, R. Scheffenegger 959 (Ed.), "TCP Extensions for High Performance," RFC 7323, 960 Sept. 2014.
962 [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish, 963 B., "Mechanisms for Optimizing Link Aggregation Group 964 (LAG) and Equal-Cost Multipath (ECMP) Component Link 965 Utilization in Networks", RFC 7424, Jan. 2015. 967 [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer 968 Protocol Version 2 (HTTP/2)", RFC 7540, May 2015. 970 [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP 971 to Support Rate-Limited Traffic", RFC 7661, Oct. 2015. 973 [RFC8684] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., 974 Paasch, C., "TCP Extensions for Multipath Operation with 975 Multiple Addresses," RFC 8684, Mar. 2020. 977 15. Acknowledgments 979 The authors would like to thank Praveen Balasubramanian for 980 information regarding TCB sharing in Windows, Christoph Paasch for 981 information regarding TCB sharing in Apple OSes, and Yuchung Cheng, 982 Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on 983 earlier versions of the draft, as well as members of the TCPM WG. 984 Earlier revisions of this work received funding from a collaborative 985 research project between the University of Oslo and Huawei 986 Technologies Co., Ltd. and were partly supported by USC/ISI's Postel 987 Center. 989 This document was prepared using 2-Word-v2.0.template.dot. 991 16. Change log 993 This section should be removed upon final publication as an RFC.
995 ietf-09: 997 - Correction of typographic errors 999 ietf-08: 1001 - Address TSV AD comments, add Apple OS implementation status 1003 ietf-07: 1005 - Update per id-nits and normative language for consistency 1007 ietf-06: 1009 - Address WGLC comments 1011 ietf-05: 1013 - Correction of typographic errors, expansion of terminology 1015 ietf-04: 1017 - Fix internal cross-reference errors that appeared in ietf-02 1018 - Updated tables to re-center; clarified text 1020 ietf-03: 1022 - Correction of typographic errors, minor rewording in appendices 1024 ietf-02: 1026 - Minor reorganization and correction of typographic errors 1027 - Added text to address fingerprinting in Security section 1028 - Now retains Appendix B and body option tables upon publication 1030 ietf-01: 1032 - Added Appendix C to address long-timescale temporal adaptation 1034 ietf-00: 1036 - Re-issued as draft-ietf-tcpm-2140bis due to WG adoption. 1037 - Cleaned orphan references to T/TCP, removed incomplete refs 1038 - Moved references to informative section and updated Sec 2 1039 - Updated to clarify no impact to interoperability 1040 - Updated appendix B to avoid 2119 language 1042 06: 1044 - Changed to update 2140, cite it normatively, and summarize the 1045 updates in a separate section 1047 05: 1049 - Fixed some TBDs. 1051 04: 1053 - Removed BCP-style recommendations and fixed some TBDs. 1055 03: 1057 - Updated Touch's affiliation and address information 1059 02: 1061 - Stated that our OS implementation overview table only covers 1062 temporal sharing. 1064 - Correctly reflected sharing of old_RTT in Linux in the 1065 implementation overview table. 1067 - Marked entries that are considered safe to share with an 1068 asterisk (suggestion was to split the table) 1070 - Discussed correct host identification: NATs may make IP 1071 addresses the wrong input, could e.g., use HTTP cookie. 
1073 - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and 1074 MTU 1076 - Added information about option sharing, listed options in 1077 Appendix B 1079 Authors' Addresses 1081 Joe Touch 1082 Manhattan Beach, CA 90266 1083 USA 1085 Phone: +1 (310) 560-0334 1086 Email: touch@strayalpha.com 1088 Michael Welzl 1089 University of Oslo 1090 PO Box 1080 Blindern 1091 Oslo N-0316 1092 Norway 1094 Phone: +47 22 85 24 20 1095 Email: michawe@ifi.uio.no 1096 Safiqul Islam 1097 University of Oslo 1098 PO Box 1080 Blindern 1099 Oslo N-0316 1100 Norway 1102 Phone: +47 22 84 08 37 1103 Email: safiquli@ifi.uio.no 1105 Appendix A: TCB Sharing History 1107 T/TCP proposed using caches to maintain TCB information across 1108 instances (temporal sharing), e.g., smoothed RTT, RTT variance, 1109 congestion avoidance threshold, and MSS [RFC1644]. These values were 1110 in addition to connection counts used by T/TCP to accelerate data 1111 delivery prior to the full three-way handshake during an OPEN. The 1112 goal was to aggregate TCB components where they reflect one 1113 association - that of the host-pair, rather than artificially 1114 separating those components by connection. 1116 At least one T/TCP implementation saved the MSS and aggregated the 1117 RTT parameters across multiple connections but omitted caching the 1118 congestion window information [Br94], as originally specified in 1119 [RFC1379]. Some T/TCP implementations immediately updated MSS when 1120 the TCP MSS header option was received [Br94], although this was not 1121 addressed specifically in the concepts or functional specification 1122 [RFC1379][RFC1644]. In later T/TCP implementations, RTT values were 1123 updated only after a CLOSE, which does not benefit concurrent 1124 sessions. 1126 Temporal sharing of cached TCB data was originally implemented in 1127 the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same 1128 [FreeBSD]. 
As mentioned before, only the MSS and RTT parameters were 1129 cached, as originally specified in [RFC1379]. Later discussion of 1130 T/TCP suggested including congestion control parameters in this 1131 cache; for example, [RFC1644] (Section 3.1) hints at initializing 1132 the congestion window to the old window size. 1134 Appendix B: TCP Option Sharing and Caching 1136 In addition to the options that can be cached and shared, this memo 1137 also lists known options for which state is unsafe to be kept. This 1138 list is not intended to be authoritative or exhaustive. 1140 Obsolete (unsafe to keep state): 1142 ECHO 1144 ECHO REPLY 1146 PO Conn permitted 1148 PO service profile 1150 CC 1152 CC.NEW 1154 CC.ECHO 1156 Alt CS req 1158 Alt CS data 1160 No state to keep: 1162 EOL 1164 NOP 1166 WS 1168 SACK 1170 TS 1172 MD5 1174 TCP-AO 1176 EXP1 1178 EXP2 1180 Unsafe to keep state: 1182 Skeeter (DH exchange, known to be vulnerable) 1184 Bubba (DH exchange, known to be vulnerable) 1186 Trailer CS 1188 SCPS capabilities 1190 S-NACK 1192 Records boundaries 1194 Corruption experienced 1196 SNAP 1198 TCP Compression 1200 Quickstart response 1202 UTO 1204 MPTCP negotiation success (see below for negotiation failure) 1206 TFO negotiation success (see below for negotiation failure) 1208 Safe but optional to keep state: 1210 MPTCP negotiation failure (to avoid negotiation retries) 1212 MSS 1214 TFO negotiation failure (to avoid negotiation retries) 1216 Safe and necessary to keep state: 1218 TFO cookie (if TFO succeeded in the past) 1220 Appendix C: Automating the Initial Window in TCP over Long Timescales 1222 C.1. Introduction 1224 Temporal sharing, as described earlier in this document, builds on 1225 the assumption that multiple consecutive connections between the 1226 same host pair are somewhat likely to be exposed to similar 1227 environment characteristics. 
The stored information can therefore 1228 become invalid over time, and suitable precautions should be taken 1229 (this is discussed further in Section 8.1). However, there are also 1230 cases where it can make sense to use much longer-term measurements 1231 of TCP connections to gradually influence TCP parameters. This 1232 appendix describes an example of such a case. 1234 TCP's congestion control algorithm uses an initial window value 1235 (IW), both as a starting point for new connections and as an upper 1236 limit for restarting after an idle period [RFC5681][RFC7661]. This 1237 value has evolved over time, originally one maximum segment size 1238 (MSS), and increased to the lesser of four MSS or 4,380 bytes 1239 [RFC3390][RFC5681]. For a typical Internet connection with a maximum 1240 transmission unit (MTU) of 1500 bytes, this permits three segments 1241 of 1,460 bytes each. 1243 The IW value was originally implied in the original TCP congestion 1244 control description and documented as a standard in 1997 1245 [RFC2001][Ja88]. The value was updated in 1998 experimentally and 1246 moved to the standards track in 2002 [RFC2414][RFC3390]. In 2013, it 1247 was experimentally increased to 10 MSS [RFC6928]. 1249 This appendix discusses how TCP can objectively measure when an IW 1250 is too large, and that such feedback should be used over long 1251 timescales to adjust the IW automatically. The result should be 1252 safer to deploy and might avoid the need to repeatedly revisit IW 1253 over time. 1255 Note that this mechanism attempts to make the IW more adaptive over 1256 time. It can increase the IW beyond that which is currently 1257 recommended for widescale deployment, and so its use should be 1258 carefully monitored. 1260 C.2.
Design Considerations 1262 TCP's IW value has existed statically for over two decades, so any 1263 solution to adjusting the IW dynamically should have similarly 1264 stable, non-invasive effects on the performance and complexity of 1265 TCP. In order to be fair, the IW should be similar for most machines 1266 on the public Internet. Finally, a desirable goal is to develop a 1267 self-correcting algorithm, so that IW values that cause network 1268 problems can be avoided. To that end, we propose the following 1269 design goals: 1271 o Have little to no impact on TCP in the absence of loss, i.e., 1272 it should not increase the complexity of default packet 1273 processing in the normal case. 1275 o Adapt to network feedback over long timescales, avoiding values 1276 that persistently cause network problems. 1278 o Decrease the IW in the presence of sustained loss of IW segments, 1279 as determined over a number of different connections. 1281 o Increase the IW in the absence of sustained loss of IW segments, 1282 as determined over a number of different connections. 1284 o Operate conservatively, i.e., tend towards leaving the IW the 1285 same in the absence of sufficient information, and give greater 1286 consideration to IW segment loss than IW segment success. 1288 We expect that, without other context, a good IW algorithm will 1289 converge to a single value, but this is not required. An endpoint 1290 with additional context or information, or deployed in a constrained 1291 environment, can always use a different value. Specifically, 1292 information from previous connections, or sets of connections with a 1293 similar path, can already be used as context for such decisions (as 1294 noted in the core of this document). 1296 However, if a given IW value persistently causes packet loss during 1297 the initial burst of packets, it is clearly inappropriate and could 1298 be inducing unnecessary loss in other competing connections.
This 1299 might happen for sites behind very slow boxes with small buffers, 1300 which may or may not be the first hop. 1302 C.3. Proposed IW Algorithm 1304 Below is a simple description of the proposed IW algorithm. It 1305 relies on the following parameters: 1307 o MinIW = 3 MSS or 4,380 bytes (as per [RFC3390]) 1309 o MaxIW = 10 MSS (as per [RFC6928]) 1311 o MulDecr = 0.5 1313 o AddIncr = 2 MSS 1314 o Threshold = 0.05 1316 We assume that the minimum IW (MinIW) should be as currently 1317 specified as standard [RFC3390]. The maximum IW can be set to a 1318 fixed value (we suggest using the experimental and now somewhat de- 1319 facto standard in [RFC6928]) or set based on a schedule if trusted 1320 time references are available [Al10]; here we prefer a fixed value. 1321 We also propose to use an AIMD algorithm, with increases and 1322 decreases as noted. 1324 Although these parameters are somewhat arbitrary, their initial 1325 values are not important except that the algorithm is AIMD and the 1326 MaxIW should not exceed that recommended for other systems on the 1327 Internet (here we selected the current de-facto standard rather than 1328 the actual standard). Current proposals, including default current 1329 operation, are degenerate cases of the algorithm below for given 1330 parameters - notably MulDecr = 1.0 and AddIncr = 0 MSS, thus 1331 disabling the automatic part of the algorithm. 1333 The proposed algorithm is as follows: 1335 1. On boot: 1337 IW = MaxIW; # assume this is in bytes, and an even number of MSS 1339 2. Upon starting a new connection: 1341 CWND = IW; 1342 conncount++; 1343 IWnotchecked = 1; # true 1345 3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN 1346 (as similarly addressed in Sec 5 of ECN++ for TCP [Ba20]), treat 1347 as if the IW is too large: 1349 if (IWnotchecked && (synackecn == 1)) { 1350 losscount++; 1351 IWnotchecked = 0; # never check again 1352 } 1354 4.
During a connection, if retransmission occurs, check the seqno of 1355 the outgoing packet (in bytes) to see if the resent segment fixes 1356 an IW loss: 1358 if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) { 1359 losscount++; 1360 IWnotchecked = 0; # never do this entire "if" again 1361 } else { 1362 IWnotchecked = 0; # you're beyond the IW so stop checking 1363 } 1365 5. Once every 1000 connections, as a separate process (i.e., not as 1366 part of processing a given connection): 1368 if (conncount > 1000) { 1369 if (losscount/conncount > Threshold) { 1370 # the number of connections with errors is too high 1371 IW = IW * MulDecr; 1372 } else { 1373 IW = IW + AddIncr; 1374 } 1375 } 1377 As presented, this algorithm can yield a false positive or a false negative when the 1378 sequence number wraps around, e.g., the code might increment 1379 losscount in step 4 when no loss occurred or fail to increment 1380 losscount when a loss did occur. This can be avoided using either 1381 PAWS [RFC7323] context or internal extended sequence number 1382 representations (as in TCP-AO [RFC5925]). Alternatively, such 1383 errors can be tolerated because they are expected to be 1384 infrequent and thus will not significantly impact the algorithm. 1386 A number of additional constraints need to be imposed if this 1387 mechanism is implemented to ensure that it defaults to values that 1388 comply with current Internet standards, is conservative in how it 1389 extends those values, and returns to those values in the absence of 1390 positive feedback (i.e., success). To that end, we recommend the 1391 following list of example constraints: 1393 >> The automatic IW algorithm MUST initialize MaxIW to a value no 1394 larger than the currently recommended Internet default, in the 1395 absence of other context information.
1397 Thus, if there are too few connections to make a decision or if 1398 there is otherwise insufficient information to increase the IW, then 1399 the MaxIW defaults to the current recommended value. 1401 >> An implementation MAY allow the MaxIW to grow beyond the 1402 currently recommended Internet default, but not more than 2 segments 1403 per calendar year. 1405 Thus, if an endpoint has a persistent history of successfully 1406 transmitting IW segments without loss, then it is allowed to probe 1407 the Internet to determine if larger IW values have similar success. 1408 This probing is limited and requires a trusted time source, 1409 otherwise the MaxIW remains constant. 1411 >> An implementation MUST adjust the IW based on loss statistics at 1412 least once every 1000 connections. 1414 An endpoint needs to be sufficiently reactive to IW loss. 1416 >> An implementation MUST decrease the IW by at least one MSS when 1417 indicated during an evaluation interval. 1419 An endpoint that detects loss needs to decrease its IW by at least 1420 one MSS, otherwise it is not participating in an automatic reactive 1421 algorithm. 1423 >> An implementation MUST increase by no more than 2 MSS per 1424 evaluation interval. 1426 An endpoint that does not experience IW loss needs to probe the 1427 network incrementally. 1429 >> An implementation SHOULD use an IW that is an integer multiple of 1430 2 MSS. 1432 The IW should remain a multiple of 2 MSS segments, to enable 1433 efficient ACK compression without incurring unnecessary timeouts. 1435 >> An implementation MUST decrease the IW if more than 95% of 1436 connections have IW losses. 1438 Again, this is to ensure an implementation is sufficiently reactive. 1440 >> An implementation MAY group IW values and statistics within 1441 subsets of connections. Such grouping MAY use any information about 1442 connections to form groups except loss statistics. 
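The periodic evaluation in step 5 of Section C.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name update_iw, the MSS of 1460 bytes, the clamping of the result to the range [MinIW, MaxIW], and the resetting of both counters after each evaluation are choices made for this example and are not specified in the algorithm text. Rounding the IW to a multiple of 2 MSS is omitted for brevity.

```python
MSS = 1460            # assumed segment size in bytes (hypothetical value)
MIN_IW = 3 * MSS      # MinIW from Section C.3
MAX_IW = 10 * MSS     # MaxIW from Section C.3
MUL_DECR = 0.5        # MulDecr
ADD_INCR = 2 * MSS    # AddIncr
THRESHOLD = 0.05      # Threshold: tolerated fraction of connections with IW loss
INTERVAL = 1000       # connections per evaluation interval


def update_iw(iw, conncount, losscount):
    """Return (iw, conncount, losscount) after one evaluation attempt.

    Hypothetical helper: nothing changes until more than INTERVAL
    connections have been counted, mirroring "conncount > 1000".
    """
    if conncount <= INTERVAL:
        return iw, conncount, losscount
    if losscount / conncount > THRESHOLD:
        iw = int(iw * MUL_DECR)        # multiplicative decrease
    else:
        iw = iw + ADD_INCR             # additive increase
    # Assumption: keep the IW within [MinIW, MaxIW].
    iw = max(MIN_IW, min(MAX_IW, iw))
    # Assumption: start a fresh evaluation interval.
    return iw, 0, 0
```

With these values, an interval in which roughly 10% of connections saw IW loss halves a 10 MSS window to 5 MSS, and a following loss-free interval raises it to 7 MSS.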
1444 There are some TCP connections which might not be counted at all, 1445 such as those to/from loopback addresses, or those within the same 1446 subnet as that of a local interface (for which congestion control is 1447 sometimes disabled anyway). This may also include connections that 1448 terminate before the IW is full, i.e., as a separate check at the 1449 time of the connection closing. 1451 The period over which the IW is updated is intended to be a long 1452 timescale, e.g., a month or so, or 1,000 connections, whichever is 1453 longer. An implementation might check the IW once a month, and 1454 simply not update the IW or clear the connection counts in months 1455 where the number of connections is too small. 1457 C.4. Discussion 1459 There are numerous parameters to the above algorithm that are 1460 compliant with the given requirements; this is intended to allow 1461 variation in configuration and implementation while ensuring that 1462 all such algorithms are reactive and safe. 1464 This algorithm continues to assume segments because that is the 1465 basis of most TCP implementations. It might be useful to consider 1466 revising the specifications to allow byte-based congestion control given 1467 sufficient experience. 1469 The algorithm checks for IW losses only during the first IW after a 1470 connection start; it does not check for IW losses elsewhere that the IW 1471 is used, e.g., during slow-start restarts. 1473 >> An implementation MAY detect IW losses during slow-start restarts 1474 in addition to losses during the first IW of a connection. In this 1475 case, the implementation MUST count each restart as a "connection" 1476 for the purposes of connection counts and periodic rechecking of the 1477 IW value. 1479 False positives can occur during some kinds of segment reordering, 1480 e.g., that might trigger spurious retransmissions even without a 1481 true segment loss.
These are not expected to be sufficiently common 1482 to dominate the algorithm and its conclusions. 1484 This mechanism does require additional per-connection state, which 1485 is currently common in some implementations, and is useful for other 1486 reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism 1487 also benefits from persistent state kept across reboots, as would be 1488 other state sharing mechanisms (e.g., TCP Control Block Sharing 1489 [RFC2140]). The mechanism is inspired by RFC 2140's use of 1490 information across connections. 1492 The receive window (RWIN) is not involved in this calculation. The 1493 size of RWIN is determined by receiver resources and provides space 1494 to accommodate segment reordering. It is not involved with 1495 congestion control, which is the focus of this document and its 1496 management of the IW. 1498 C.5. Observations 1500 The IW may not converge to a single, global value. It also may not 1501 converge at all, but rather may oscillate by a few MSS as it 1502 repeatedly probes the Internet for larger IWs and fails. Both 1503 properties are consistent with TCP behavior during each individual 1504 connection. 1506 This mechanism assumes that losses during the IW are due to IW size. 1507 Persistent errors that drop packets for other reasons (e.g., OS 1508 bugs) can cause false positives. Again, this is consistent with 1509 TCP's basic assumption that loss is caused by congestion and 1510 requires backoff. This algorithm treats the IW of new connections as 1511 a long-timescale backoff system.
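The oscillation noted in Section C.5 can be illustrated with a toy model. The sketch below is only an illustration under loudly stated assumptions: the function name simulate, the MSS of 1460 bytes, the clamping to [MinIW, MaxIW], and above all the loss model (every evaluation interval is treated as lossy exactly when the IW exceeds a hypothetical fixed path capacity) are invented for this example and are not part of the proposed algorithm.

```python
MSS = 1460            # assumed segment size in bytes (hypothetical value)
MIN_IW = 3 * MSS      # MinIW from Section C.3
MAX_IW = 10 * MSS     # MaxIW from Section C.3
MUL_DECR = 0.5        # MulDecr
ADD_INCR = 2 * MSS    # AddIncr


def simulate(intervals, capacity):
    """Return the IW (in bytes) at boot and after each evaluation interval.

    Toy loss model (assumption): an interval exceeds the loss threshold
    exactly when the current IW is larger than `capacity` bytes.
    """
    iw = MAX_IW                        # step 1: IW = MaxIW on boot
    history = [iw]
    for _ in range(intervals):
        if iw > capacity:
            iw = int(iw * MUL_DECR)    # too many IW losses: back off
        else:
            iw = iw + ADD_INCR         # no sustained loss: probe upward
        iw = max(MIN_IW, min(MAX_IW, iw))  # assumption: clamp to bounds
        history.append(iw)
    return history


if __name__ == "__main__":
    # Hypothetical path that tolerates an IW of up to 8 MSS.
    print([round(v / MSS, 2) for v in simulate(9, 8 * MSS)])
```

In this model the IW never settles: it climbs in 2 MSS steps until it exceeds the assumed capacity, is halved, and climbs again, which is the same sawtooth pattern that AIMD produces within a single connection.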