TCPM WG                                                         J. Touch
Internet Draft                                               Independent
Intended status: Informational                                  M. Welzl
Obsoletes: 2140                                                 S. Islam
Expires: October 2021                                 University of Oslo
                                                          April 12, 2021

                   TCP Control Block Interdependence
                     draft-ietf-tcpm-2140bis-11.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on October 12, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided
   without warranty as described in the Simplified BSD License.

Abstract

   This memo provides guidance to TCP implementers that is intended to
   help improve connection convergence to steady-state operation
   without affecting interoperability. It updates and replaces RFC
   2140's description of sharing TCP state, as typically represented in
   TCP Control Blocks, among similar concurrent or consecutive
   connections.

Table of Contents

   1. Introduction
   2. Conventions Used in This Document
   3. Terminology
   4. The TCP Control Block (TCB)
   5. TCB Interdependence
   6. Temporal Sharing
      6.1. Initialization of a new TCB
      6.2. Updates to the TCB cache
      6.3. Discussion
   7. Ensemble Sharing
      7.1. Initialization of a new TCB
      7.2. Updates to the TCB cache
      7.3. Discussion
   8. Issues with TCB information sharing
      8.1. Traversing the same network path
      8.2. State dependence
      8.3. Problems with sharing based on IP address
   9. Implications
      9.1. Layering
      9.2. Other possibilities
   10. Implementation Observations
   11. Changes Compared to RFC 2140
   12. Security Considerations
   13. IANA Considerations
   14. References
      14.1. Normative References
      14.2. Informative References
   15. Acknowledgments
   16. Change log
   Appendix A: TCB Sharing History
   Appendix B: TCP Option Sharing and Caching
   Appendix C: Automating the Initial Window in TCP over Long
               Timescales
      C.1. Introduction
      C.2. Design Considerations
      C.3. Proposed IW Algorithm
      C.4. Discussion
      C.5. Observations

1. Introduction

   TCP is a connection-oriented reliable transport protocol layered
   over IP [RFC793]. Each TCP connection maintains state, usually in a
   data structure called the TCP Control Block (TCB). The TCB contains
   information about the connection state, its associated local
   process, and feedback parameters about the connection's transmission
   properties. As originally specified and usually implemented, most
   TCB information is maintained on a per-connection basis. Some
   implementations share certain TCB information across connections to
   the same host [RFC2140].
   Such sharing is intended to lead to better
   overall transient performance, especially for numerous short-lived
   and simultaneous connections, as can be used in the World-Wide Web
   and other applications [Be94][Br02]. This sharing of state is
   intended to help TCP connections converge to long-term behavior
   (assuming stable application load, i.e., so-called "steady-state")
   more quickly without affecting TCP interoperability.

   This document updates RFC 2140's discussion of TCB state sharing and
   provides a complete replacement for that document. This state
   sharing affects only TCB initialization [RFC2140] and thus has no
   effect on the long-term behavior of TCP after a connection has been
   established nor on interoperability. Path information shared across
   SYN destination port numbers assumes that TCP segments having the
   same host-pair experience the same path properties, i.e., that
   traffic is not routed differently based on port numbers or other
   connection parameters (also addressed further in Section 8.1). The
   observations about TCB sharing in this document apply similarly to
   any protocol with congestion state, including SCTP [RFC4960] and
   DCCP [RFC4340], as well as for individual subflows in Multipath TCP
   [RFC8684].

2. Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The core of this document describes behavior that is already
   permitted by TCP standards. As a result, it provides informative
   guidance but does not use normative language, except when quoting
   other documents. Normative language is used in Appendix C as
   examples of requirements for future consideration.

3. Terminology

   The following terminology is used frequently in this document. Items
   preceded with a "+" may be part of the state maintained as TCP
   connection state in the associated connection's TCB and are the
   focus of sharing as described in this document. Note that terms are
   used as originally introduced where possible; in some cases,
   direction is indicated with a suffix (_S for send, _R for receive)
   and in other cases spelled out (sendcwnd).

   +cwnd - TCP congestion window size [RFC5681]

   host - a source or sink of TCP segments associated with a single IP
   address

   host-pair - a pair of hosts and their corresponding IP addresses

   +MMS_R - maximum message size that can be received, the largest
   received transport payload of an IP datagram [RFC1122]

   +MMS_S - maximum message size that can be sent, the largest
   transmitted transport payload of an IP datagram [RFC1122]

   path - an Internet path between the IP addresses of two hosts

   PCB - protocol control block, the data associated with a protocol as
   maintained by an endpoint; a TCP PCB is called a TCB

   PLPMTUD - packetization-layer path MTU discovery, a mechanism that
   uses transport packets to discover the PMTU [RFC4821]

   +PMTU - largest IP datagram that can traverse a path
   [RFC1191][RFC8201]

   PMTUD - path-layer MTU discovery, a mechanism that relies on ICMP
   error messages to discover the PMTU [RFC1191][RFC8201]

   +RTT - round-trip time of a TCP packet exchange [RFC793]

   +RTTVAR - variation of round-trip times of a TCP packet exchange
   [RFC6298]

   +rwnd - TCP receive window size [RFC5681]

   +sendcwnd - TCP send-side congestion window (cwnd) size [RFC5681]

   +sendMSS - TCP maximum segment size, a value transmitted in a TCP
   option that represents the largest TCP user data payload that can be
   received [RFC6691]

   +ssthresh - TCP slow-start threshold [RFC5681]

   TCB - TCP Control Block, the data associated with a TCP connection
   as maintained by an endpoint

   TCP-AO - TCP Authentication Option [RFC5925]

   TFO - TCP Fast Open option [RFC7413]

   +TFO_cookie - TCP Fast Open cookie, state that is used as part of
   the TFO mechanism, when TFO is supported [RFC7413]

   +TFO_failure - an indication of when TFO option negotiation failed,
   when TFO is supported

   +TFOinfo - information cached when a TFO connection is established,
   which includes the TFO_cookie [RFC7413]

4. The TCP Control Block (TCB)

   A TCB describes the data associated with each connection, i.e., with
   each association of a pair of applications across the network. The
   TCB contains at least the following information [RFC793]:

      Local process state
         pointers to send and receive buffers
         pointers to retransmission queue and current segment
         pointers to Internet Protocol (IP) PCB

      Per-connection shared state
         macro-state
            connection state
            timers
            flags
            local and remote host numbers and ports
            TCP option state
         micro-state
            send and receive window state (size*, current number)
            congestion window size (sendcwnd)*
            congestion window size threshold (ssthresh)*
            max window size seen*
            sendMSS#
            MMS_S#
            MMS_R#
            PMTU#
            round-trip time and its variation#

   The per-connection information is shown as split into macro-state
   and micro-state, terminology borrowed from [Co91]. Macro-state
   describes the protocol for establishing the initial shared state
   about the connection; we include the endpoint numbers and components
   (timers, flags) required upon commencement that are later used to
   help maintain that state. Micro-state describes the protocol after a
   connection has been established, to maintain the reliability and
   congestion control of the data transferred in the connection.

   We distinguish two other classes of shared micro-state that are
   associated more with host-pairs than with application pairs.
   One class is clearly host-pair dependent (shown above as "#", e.g.,
   sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are
   defined by the endpoint or endpoint pair (sendMSS, MMS_R, MMS_S,
   RTT) or are already cached and shared on that basis (PMTU
   [RFC1191][RFC4821]). The other is host-pair dependent in its
   aggregate (shown above as "*", e.g., congestion window information,
   current window sizes, etc.) because these parameters depend on the
   total capacity between the two endpoints.

   Not all of the TCB state is necessarily shareable. In particular,
   some TCP options are negotiated only upon request by the application
   layer, so their use may not be correlated across connections. Other
   options negotiate connection-specific parameters, which are
   similarly not shareable. These are discussed further in Appendix B.

   Finally, we exclude rwnd from further discussion because its value
   should depend on the send window size, so it is already addressed by
   send window sharing and is not independently affected by sharing.

5. TCB Interdependence

   There are two cases of TCB interdependence. Temporal sharing occurs
   when the TCB of an earlier (now CLOSED) connection to a host is used
   to initialize some parameters of a new connection to that same host,
   i.e., in sequence. Ensemble sharing occurs when a currently active
   connection to a host is used to initialize another (concurrent)
   connection to that host.

6. Temporal Sharing

   The TCB data cache is accessed in two ways: it is read to initialize
   new TCBs and written when more current per-host state is available.

6.1. Initialization of a new TCB

   TCBs for new connections can be initialized using cached context
   from past connections as follows:

               TEMPORAL SHARING - TCB Initialization

               Cached TCB           New TCB
               --------------------------------------
               old_MMS_S            old_MMS_S or not cached*

               old_MMS_R            old_MMS_R or not cached*

               old_sendMSS          old_sendMSS

               old_PMTU             old_PMTU+

               old_RTT              old_RTT

               old_RTTVAR           old_RTTVAR

               old_option           (option specific)

               old_ssthresh         old_ssthresh

               old_sendcwnd         old_sendcwnd

   +Note that PMTU is cached at the IP layer [RFC1191][RFC4821].
   *Note that some values are not cached when they are computed locally
   (MMS_R) or indicated in the connection itself (MMS_S in the SYN).

   The table below gives an overview of option-specific information
   that can be shared. Additional information on some specific TCP
   options and sharing is provided in Appendix B.

            TEMPORAL SHARING - Option Info Initialization

               Cached               New
               ------------------------------------
               old_TFO_cookie       old_TFO_cookie

               old_TFO_failure      old_TFO_failure

6.2. Updates to the TCB cache

   During a connection, the TCB cache can be updated based on events of
   current connections and their TCBs as they progress over time, as
   shown below:

                 TEMPORAL SHARING - Cache Updates

      Cached TCB    Current TCB    when?     New Cached TCB
      ----------------------------------------------------------
      old_MMS_S     curr_MMS_S     OPEN      curr_MMS_S

      old_MMS_R     curr_MMS_R     OPEN      curr_MMS_R

      old_sendMSS   curr_sendMSS   MSSopt    curr_sendMSS

      old_PMTU      curr_PMTU      PMTUD+ /  curr_PMTU
                                   PLPMTUD+

      old_RTT       curr_RTT       CLOSE     merge(curr,old)

      old_RTTVAR    curr_RTTVAR    CLOSE     merge(curr,old)

      old_option    curr_option    ESTAB     (depends on option)

      old_ssthresh  curr_ssthresh  CLOSE     merge(curr,old)

      old_sendcwnd  curr_sendcwnd  CLOSE     merge(curr,old)

   +Note that PMTU is cached at the IP layer [RFC1191][RFC4821].
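
   The initialization and cache-update tables above can be sketched in
   code. The following is a minimal illustration only, not a definitive
   implementation: the class and field names, the choice of the remote
   IP address as the cache key, and the exponential-decay weight ALPHA
   are all assumptions made for this example. Only the values that the
   table updates at CLOSE are modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ALPHA = 0.25  # assumed weight of the closing connection's value


@dataclass
class CachedTCB:
    """Subset of TCB state kept per remote host (hypothetical layout)."""
    rtt: float
    rttvar: float
    ssthresh: float
    sendcwnd: float


def merge(curr: float, old: Optional[float], alpha: float = ALPHA) -> float:
    """Exponential decay; with no old value, the current value is cached."""
    if old is None:
        return curr
    return (1.0 - alpha) * old + alpha * curr


class TemporalCache:
    """TCB cache keyed by remote IP address (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedTCB] = {}

    def init_new_tcb(self, host: str) -> Optional[CachedTCB]:
        """Copy cached values into a new TCB; None means use defaults."""
        old = self._entries.get(host)
        if old is None:
            return None
        return CachedTCB(old.rtt, old.rttvar, old.ssthresh, old.sendcwnd)

    def on_close(self, host: str, curr: CachedTCB) -> None:
        """At CLOSE, merge the closing connection's values into the cache."""
        old = self._entries.get(host)
        self._entries[host] = CachedTCB(
            rtt=merge(curr.rtt, old.rtt if old else None),
            rttvar=merge(curr.rttvar, old.rttvar if old else None),
            ssthresh=merge(curr.ssthresh, old.ssthresh if old else None),
            sendcwnd=merge(curr.sendcwnd, old.sendcwnd if old else None),
        )
```

   A stack built along these lines would call init_new_tcb() when a
   connection is opened and on_close() when it closes. A smaller ALPHA
   makes the cache slower to follow any single connection, which is one
   way to damp transient effects.
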

   Merge() is the function that combines the current and previous (old)
   values and may vary for each parameter of the TCB cache. The
   particular function is not specified in this document; examples
   include windowed averages (mean of the past N values, for some N)
   and exponential decay (merged = (1-alpha)*old + alpha*curr, where
   alpha is in the range [0..1]).

   The table below gives an overview of option-specific information
   that can be similarly shared. The TFO cookie is maintained until the
   client explicitly requests it be updated as a separate event.

               TEMPORAL SHARING - Option Info Updates

      Cached           Current          when?   New Cached
      ---------------------------------------------------------
      old_TFO_cookie   old_TFO_cookie   ESTAB   old_TFO_cookie

      old_TFO_failure  old_TFO_failure  ESTAB   old_TFO_failure

6.3. Discussion

   As noted, there is no particular benefit to caching MMS_S and MMS_R
   as these are reported by the local IP stack. Caching sendMSS and
   PMTU is trivial; reported values are cached (PMTU at the IP layer),
   and the most recent values are used. The cache is updated when the
   MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4
   Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is
   received [RFC8201] or the equivalent is inferred, e.g., as from
   PLPMTUD [RFC4821]), respectively, so the cache always has the most
   recent values from any connection. For sendMSS, the cache is
   consulted only at connection establishment and not otherwise
   updated, which means that MSS options do not affect current
   connections. The default sendMSS is never saved; only reported MSS
   values update the cache, so an explicit override is required to
   reduce the sendMSS. Cached sendMSS affects only data sent in the SYN
   segment, i.e., during client connection initiation or during
   simultaneous open; all other segment MSS are based on the value
   updated as included in the SYN.

   RTT values are updated by formulae that merge the old and new
   values, as noted in Section 6.2. Dynamic RTT estimation requires a
   sequence of RTT measurements. As a result, the cached RTT (and its
   variation) is an average of its previous value with the contents of
   the currently active TCB for that host, when a TCB is closed. RTT
   values are updated only when a connection is closed. The method for
   merging old and current values needs to attempt to reduce the
   transient effects of the new connections.

   The updates for RTT, RTTVAR and ssthresh rely on existing
   information, i.e., old values. Should no such values exist, the
   current values are cached instead.

   TCP options are copied or merged depending on the details of each
   option. E.g., TFO state is updated when a connection is established
   and read before establishing a new connection.

   Sections 8 and 9 discuss compatibility issues and implications of
   sharing the specific information listed above. Section 10 gives an
   overview of known implementations.

   Most cached TCB values are updated when a connection closes. The
   exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122],
   PMTU which is updated after Path MTU Discovery and also reported by
   IP [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the
   MSS option is received in the TCP SYN header.

   Sharing sendMSS information affects only data in the SYN of the next
   connection, because sendMSS information is typically included in
   most TCP SYN segments. Caching PMTU can accelerate the efficiency of
   PMTUD but can also result in black-holing until corrected if in
   error.
   Caching MMS_R and MMS_S may be of little direct value as they
   are reported by the local IP stack anyway.

   The way in which other TCP option state can be shared depends on the
   details of that option. E.g., TFO state includes the TCP Fast Open
   Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open
   response. RFC 7413 states, "The client MUST cache negative responses
   from the server in order to avoid potential connection failures.
   Negative responses include the server not acknowledging the data in
   the SYN, ICMP error messages, and (most importantly) no response
   (SYN-ACK) from the server at all, i.e., connection timeout."
   [RFC7413]. TFOinfo is cached when a connection is established.

   Other TCP option state might not be as readily cached. E.g., TCP-AO
   [RFC5925] success or failure between a host pair for a single SYN
   destination port might be usefully cached. TCP-AO success or failure
   to other SYN destination ports on that host pair is never useful to
   cache because TCP-AO security parameters can vary per service.

7. Ensemble Sharing

   Sharing cached TCB data across concurrent connections requires
   attention to the aggregate nature of some of the shared state. For
   example, although MSS and RTT values can be shared by copying, it
   may not be appropriate to simply copy congestion window or ssthresh
   information; instead, the new values can be a function (f) of the
   cumulative values and the number of connections (N).

7.1. Initialization of a new TCB

   TCBs for new connections can be initialized using cached context
   from concurrent connections as follows:

               ENSEMBLE SHARING - TCB Initialization

               Cached TCB           New TCB
               ------------------------------------------
               old_MMS_S            old_MMS_S

               old_MMS_R            old_MMS_R

               old_sendMSS          old_sendMSS

               old_PMTU             old_PMTU+

               old_RTT              old_RTT

               old_RTTVAR           old_RTTVAR

               sum(old_ssthresh)    f(sum(old_ssthresh), N)

               sum(old_sendcwnd)    f(sum(old_sendcwnd), N)

               old_option           (option specific)

   +Note that PMTU is cached at the IP layer [RFC1191][RFC4821].

   In the table, the cached sum() is a total across all active
   connections because these parameters act in aggregate; similarly f()
   is a function that updates that sum based on the new connection's
   values, represented as "N".

   The table below gives an overview of option-specific information
   that can be similarly shared. Again, the TFO_cookie is updated upon
   explicit client request, which is a separate event.

            ENSEMBLE SHARING - Option Info Initialization

               Cached               New
               ------------------------------------
               old_TFO_cookie       old_TFO_cookie

               old_TFO_failure      old_TFO_failure

7.2. Updates to the TCB cache

   During a connection, the TCB cache can be updated based on changes
   to concurrent connections and their TCBs, as shown below:

                 ENSEMBLE SHARING - Cache Updates

      Cached TCB    Current TCB    when?      New Cached TCB
      ---------------------------------------------------------------
      old_MMS_S     curr_MMS_S     OPEN       curr_MMS_S

      old_MMS_R     curr_MMS_R     OPEN       curr_MMS_R

      old_sendMSS   curr_sendMSS   MSSopt     curr_sendMSS

      old_PMTU      curr_PMTU      PMTUD+ /   curr_PMTU
                                   PLPMTUD+

      old_RTT       curr_RTT       update     rtt_update(old, curr)

      old_RTTVAR    curr_RTTVAR    update     rtt_update(old, curr)

      old_ssthresh  curr_ssthresh  update     adjust sum as appropriate

      old_sendcwnd  curr_sendcwnd  update     adjust sum as appropriate

      old_option    curr_option    (depends)  (option specific)

   +Note that the PMTU is cached at the IP layer [RFC1191][RFC4821].

   In the table, rtt_update() is the function used to combine old and
   current values, e.g., as a windowed average or exponentially decayed
   average.

   The table below gives an overview of option-specific information
   that can be similarly shared.

               ENSEMBLE SHARING - Option Info Updates

      Cached           Current          when?   New Cached
      ----------------------------------------------------------
      old_TFO_cookie   old_TFO_cookie   ESTAB   old_TFO_cookie

      old_TFO_failure  old_TFO_failure  ESTAB   old_TFO_failure

7.3. Discussion

   For ensemble sharing, TCB information should be cached as early as
   possible, sometimes before a connection is closed. Otherwise,
   opening multiple concurrent connections may not result in TCB data
   sharing if no connection closes before others open. The amount of
   work involved in updating the aggregate average should be minimized,
   but the resulting value should be equivalent to having all values
   measured within a single connection. The function "rtt_update" in
   the ensemble sharing table indicates this operation, which occurs
   whenever the RTT would have been updated in the individual TCP
   connection. As a result, the cache contains the shared RTT
   variables, which no longer need to reside in the TCB.
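
   One way to picture rtt_update() is as a single estimator held in the
   cache and fed by every connection to the same host. The sketch below
   is illustrative only. It assumes the smoothing rules and gains of
   RFC 6298 (alpha = 1/8, beta = 1/4), which this document does not
   mandate for rtt_update(), and the class and field names are
   hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8  # SRTT gain, as in RFC 6298 (assumed here)
BETA = 1 / 4   # RTTVAR gain, as in RFC 6298 (assumed here)


@dataclass
class SharedRTT:
    """RTT variables kept in the cache instead of in each TCB."""
    srtt: Optional[float] = None
    rttvar: Optional[float] = None

    def rtt_update(self, sample: float) -> None:
        """Fold one RTT sample from any connection into the shared state."""
        if self.srtt is None or self.rttvar is None:
            # First measurement for this host: seed both variables.
            self.srtt = sample
            self.rttvar = sample / 2
            return
        # RTTVAR is updated first, using the SRTT from before this sample.
        self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - sample)
        self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample
```

   Because every connection calls rtt_update() on the same object at
   the moment it would have updated its own estimate, the shared state
   evolves as if all samples had been measured within one connection.
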

   Congestion window size and ssthresh aggregation are more complicated
   in the concurrent case. When there is an ensemble of connections, we
   need to decide how that ensemble would have shared these variables,
   in order to derive initial values for new TCBs.

   Sections 8 and 9 discuss compatibility issues and implications of
   sharing the specific information listed above.

   There are several ways to initialize the congestion window in a new
   TCB among an ensemble of current connections to a host. Current TCP
   implementations initialize it to four segments as standard [RFC3390]
   and 10 segments experimentally [RFC6928]. These approaches assume
   that new connections should behave as conservatively as possible.
   The algorithm described in [Ba12] adjusts the initial cwnd depending
   on the cwnd values of ongoing connections. It is also possible to
   use sharing mechanisms over long timescales to adapt TCP's initial
   window automatically, as described further in Appendix C.

8. Issues with TCB information sharing

   Here, we discuss various types of problems that may arise with TCB
   information sharing.

   For the congestion and current window information, the initial
   values computed by TCB interdependence may not be consistent with
   the long-term aggregate behavior of a set of concurrent connections
   between the same endpoints. Under conventional TCP congestion
   control, if the congestion window of a single existing connection
   has converged to 40 segments, two newly joining concurrent
   connections assume initial windows of 10 segments [RFC6928], and the
   existing connection's window does not decrease to accommodate this
   additional load. As a consequence, the three connections can
   mutually interfere. One example of this is seen on low-bandwidth,
   high-delay links, where concurrent connections supporting Web
   traffic can collide because their initial windows were too large,
   even when set at one segment.
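
   The arithmetic behind the 40-segment example can be made concrete.
   The sketch below is only an illustration. The function names are
   hypothetical, and equal_share() is just one possible choice for the
   function f() of Section 7.1, which this document leaves unspecified.
   It also assumes that the existing connection's window is rebalanced
   along with the new ones.

```python
from typing import List


def aggregate_after_join(existing: List[int], joiners: int,
                         initial_window: int) -> int:
    """Total window (segments) when each joiner starts independently."""
    return sum(existing) + joiners * initial_window


def equal_share(existing: List[int], joiners: int) -> List[int]:
    """Hypothetical f(): split the current aggregate evenly across
    the existing and the newly joining connections."""
    total = sum(existing)
    count = len(existing) + joiners
    share = max(1, total // count)  # never below one segment
    return [share] * count


# One connection at 40 segments, two joiners with IW = 10 [RFC6928].
independent_total = aggregate_after_join([40], joiners=2, initial_window=10)
shared_windows = equal_share([40], joiners=2)
```

   With independent initial windows the aggregate jumps from 40 to 60
   segments, a 50% increase in offered load. With the hypothetical
   equal split, each of the three connections gets 13 segments and the
   aggregate stays within the 40 segments the path had converged to.
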

   The authors of [Hu12] recommend caching ssthresh for temporal
   sharing only when flows are long. Some studies suggest that sharing
   ssthresh between short flows can deteriorate the performance of
   individual connections [Hu12][Du16], although this may benefit
   aggregate network performance.

8.1. Traversing the same network path

   TCP is sometimes used in situations where packets of the same
   host-pair do not always take the same path, such as when
   connection-specific parameters are used for routing (e.g., for load
   balancing). Multipath routing that relies on examining transport
   headers, such as ECMP and LAG [RFC7424], may not result in
   repeatable path selection when TCP segments are encapsulated,
   encrypted, or altered - for example, in some Virtual Private Network
   (VPN) tunnels that rely on proprietary encapsulation. Similarly,
   such approaches cannot operate deterministically when the TCP header
   is encrypted, e.g., when using IPsec ESP (although TCB
   interdependence among the entire set sharing the same endpoint IP
   addresses should work without problems when the TCP header is
   encrypted). Measures to increase the probability that connections
   use the same path could be applied: e.g., the connections could be
   given the same IPv6 flow label [RFC6437]. TCB interdependence can
   also be extended to sets of host IP address pairs that share the
   same network path conditions, such as when a group of addresses is
   on the same LAN (see Section 9).

   Traversing the same path is not important for host-specific
   information (e.g., rwnd), TCP option state (e.g., TFOinfo), or
   for information that is already cached per-host (e.g., path MTU).
   When TCB information is shared across different SYN destination
   ports, path-related information can be incorrect; however, the
   impact of this error is potentially diminished if (as discussed
   here) TCB sharing affects only the transient event of a connection
   start or if TCB information is shared only within connections to the
   same SYN destination port.

   In the case of temporal sharing, TCB information could also become
   invalid over time, i.e., indicating that although the path remains
   the same, path properties have changed. Because this is similar to
   the case when a connection becomes idle, mechanisms that address
   idle TCP connections (e.g., [RFC7661]) could also be applied to TCB
   cache management, especially when TCP Fast Open is used [RFC7413].

8.2. State dependence

   There may be additional considerations to the way in which TCB
   interdependence rebalances congestion feedback among the current
   connections. For example, it may be appropriate to consider the
   impact of a connection being in Fast Recovery [RFC5681] or some
   other similar unusual feedback state, which could inhibit or affect
   the calculations described herein.

8.3. Problems with sharing based on IP address

   It can be wrong to share TCB information between TCP connections on
   the same host as identified by the IP address if an IP address is
   assigned to a new host (e.g., IP address spinning, as is used by
   ISPs to inhibit running servers). It can also be wrong if Network
   Address (and Port) Translation (NA(P)T) [RFC2663] or any other IP
   sharing mechanism is used. Such mechanisms are less likely to be
   used with IPv6. Other methods to identify a host could also be
   considered to make correct TCB sharing more likely. Moreover, some
   TCB information is about dominant path properties rather than the
   specific host. IP addresses may differ, yet the relevant part of the
   path may be the same.

9. Implications

   There are several implications to incorporating TCB interdependence
   in TCP implementations. First, it may reduce the need for
   application-layer multiplexing for performance enhancement
   [RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection
   reestablishment costs by serializing or multiplexing a set of per-
   host connections across a single TCP connection. This avoids TCP's
   per-connection OPEN handshake and also avoids recomputing the MSS,
   RTT, and congestion window values. By avoiding the so-called "slow-
   start restart", performance can be optimized [Hu01]. TCB
   interdependence can provide the "slow-start restart avoidance" of
   multiplexing, without requiring a multiplexing mechanism at the
   application layer.

   Like the initial version of this document [RFC2140], this update's
   approach to TCB interdependence focuses on sharing a set of TCBs by
   updating the TCB state to reduce the impact of transients when
   connections begin, end, or otherwise significantly change state.
   Other mechanisms have since been proposed to continuously share
   information between all ongoing communication (including
   connectionless protocols), updating the congestion state during any
   congestion-related event (e.g., timeout, loss confirmation, etc.)
   [RFC3124]. By dealing exclusively with transients, the approach in
   this document is more likely to exhibit the same "steady-state"
   behavior as unmodified, independent TCP connections.

9.1. Layering

   TCB interdependence pushes some of the TCP implementation from the
   traditional transport layer (in the ISO model) to the network
   layer. This acknowledges that some state is in fact per-host-pair or
   can be per-path as indicated solely by that host-pair. Transport
   protocols typically manage per-application-pair associations (per
   stream), and network protocols manage per-host-pair and path
   associations (routing).
   Round-trip time, MSS, and congestion
   information could be more appropriately handled at the network
   layer, aggregated among concurrent connections, and shared across
   connection instances [RFC3124].

   An earlier version of RTT sharing suggested implementing RTT state
   at the IP layer, rather than at the TCP layer. Our observations
   describe sharing state among TCP connections, which avoids some of
   the difficulties in an IP-layer solution. One such problem of an IP
   layer solution is determining the correspondence between packet
   exchanges using IP header information alone, where such
   correspondence is needed to compute RTT. Because TCB sharing
   computes RTTs inside the TCP layer using TCP header information, it
   can be implemented more directly and simply than at the IP layer.
   This is a case where information should be computed at the transport
   layer but could be shared at the network layer.

9.2. Other possibilities

   Per-host-pair associations are not the limit of these techniques. It
   is possible that TCBs could be similarly shared between hosts on a
   subnet or within a cluster, because the predominant path can be
   subnet-subnet, rather than host-host. Additionally, TCB
   interdependence can be applied to any protocol with congestion
   state, including SCTP [RFC4960] and DCCP [RFC4340], as well as for
   individual subflows in Multipath TCP [RFC8684].

   There may be other information that can be shared between concurrent
   connections. For example, knowing that another connection has just
   tried to expand its window size and failed, a connection may not
   attempt to do the same for some period. The idea is that existing
   TCP implementations infer the behavior of all competing connections,
   including those within the same host or subnet.
   One possible
   optimization is to make that implicit feedback explicit, via
   extended information associated with the endpoint IP address and its
   TCP implementation, rather than per-connection state in the TCB.

   This document focuses on sharing TCB information at connection
   initialization. Subsequent to RFC 2140, there have been numerous
   approaches that attempt to coordinate ongoing state across
   concurrent connections, both within TCP and other congestion-
   reactive protocols, which are summarized in [Is18]. These approaches
   are more complex to implement and their comparison to steady-state
   TCP equivalence can be more difficult to establish, sometimes
   intentionally (i.e., they sometimes intend to provide a different
   kind of "fairness" than emerges from TCP operation).

10. Implementation Observations

   The observation that some TCB state is host-pair specific rather
   than application-pair dependent is not new and is a common
   engineering decision in layered protocol implementations. Although
   now deprecated, T/TCP [RFC1644] was the first to propose using
   caches in order to maintain TCB states (see Appendix A).

   The table below describes the current implementation status for TCB
   temporal sharing in Windows as of December 2020, Apple variants
   (macOS, iOS, iPadOS, tvOS, watchOS) as of January 2021, Linux kernel
   version 5.10.3, and FreeBSD 12. Ensemble sharing is not yet
   implemented.

                     KNOWN IMPLEMENTATION STATUS

   TCB data        Status
   ------------------------------------------------------------
   old_MMS_S       Not shared

   old_MMS_R       Not shared

   old_sendMSS     Cached and shared in Apple, Linux (MSS)

   old_PMTU        Cached and shared in Apple, FreeBSD, Windows (PMTU)

   old_RTT         Cached and shared in Apple, FreeBSD, Linux, Windows

   old_RTTVAR      Cached and shared in Apple, FreeBSD, Windows

   old_TFOinfo     Cached and shared in Apple, Linux, Windows

   old_sendcwnd    Not shared

   old_ssthresh    Cached and shared in Apple, FreeBSD*, Linux*

   TFO failure     Cached and shared in Apple

   In the table above, "Apple" refers to all Apple OSes, i.e.,
   desktop/laptop macOS, phone iOS, pad iPadOS, video player tvOS, and
   watch watchOS, which all share the same Internet protocol stack.

   *Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and
   previous value if a previous value exists; in Linux, the calculation
   depends on state and is max(curr_cwnd/2, old_ssthresh) in most
   cases.

11. Changes Compared to RFC 2140

   This document updates the description of TCB sharing in RFC 2140 and
   its associated impact on existing and new connection state,
   providing a complete replacement for that document [RFC2140]. It
   clarifies the previous description and terminology and extends the
   mechanism to its impact on new protocols and mechanisms, including
   multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication
   Option.

   The detailed impact on TCB state addresses TCB parameters in greater
   detail, addressing MSS in both the send and receive direction, MSS
   and sendMSS separately, adding path MTU and ssthresh, and addressing
   the impact on TCP option state.

   New sections have been added to address compatibility issues and
   implementation observations. The relation of this work to T/TCP has
   been moved to Appendix A on history, partly to reflect the
   deprecation of that protocol.

   Appendix C has been added to discuss the potential to use temporal
   sharing over long timescales to adapt TCP's initial window
   automatically, avoiding the need to periodically revise a single
   global constant value.

   Finally, this document updates and significantly expands the
   referenced literature.

12. Security Considerations

   The implementation methods presented here do not have additional
   ramifications for direct (connection-aborting or
   information-injecting) attacks on individual connections. Individual
   connections, whether using sharing or not, also may be susceptible
   to denial-of-service attacks that reduce performance or completely
   deny connections and transfers if not otherwise secured.

   TCB sharing may create additional denial-of-service attacks that
   affect the performance of other connections by polluting the cached
   information. This can occur across any set of connections in which
   the TCB is shared: between connections in a single host, or between
   hosts if TCB sharing is implemented within a subnet (see
   Section 9). Some shared TCB parameters are used only to create new
   TCBs; others are shared among the TCBs of ongoing connections. New
   connections can join the ongoing set, e.g., to optimize send window
   size among a set of connections to the same host. PMTU is defined as
   shared at the IP layer, and is already susceptible in this way.

   Options in client SYNs can be easier to forge than complete, two-way
   connections. As a result, their values may not be safely
   incorporated in shared values until after the three-way handshake
   completes.

   Attacks on parameters used only for initialization affect only the
   transient performance of a TCP connection. For short connections,
   the performance ramification can approach that of a denial-of-
   service attack.
   For example, if an application changes its TCB to have a
   false and small window size, subsequent connections will experience
   performance degradation until their window grows appropriately.

   TCB sharing reuses and mixes information from past and current
   connections. Although reusing information could create a potential
   for fingerprinting to identify hosts, the mixing reduces that
   potential. There has been no evidence of fingerprinting based on
   this technique and it is currently considered safe in that regard.
   Further, information about the performance of a TCP connection has
   not been considered as private.

13. IANA Considerations

   There are no IANA implications or requests in this document.

   This section should be removed upon final publication as an RFC.

14. References

14.1. Normative References

   [RFC793]  Postel, J., "Transmission Control Protocol," Network
             Working Group RFC 793, STD 7, ISI, Sept. 1981.

   [RFC1122] Braden, R. (ed), "Requirements for Internet Hosts --
             Communication Layers", RFC 1122, Oct. 1989.

   [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191,
             Nov. 1990.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU
             Discovery," RFC 4821, Mar. 2007.

   [RFC5681] Allman, M., Paxson, V., Blanton, E., "TCP Congestion
             Control," RFC 5681 (Standards Track), Sep. 2009.

   [RFC6298] Paxson, V., Allman, M., Chu, J., Sargent, M., "Computing
             TCP's Retransmission Timer," RFC 6298, June 2011.

   [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast
             Open", RFC 7413, Dec. 2014.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", RFC 8174, May 2017.

   [RFC8201] McCann, J., Deering, S., Mogul, J., Hinden, R. (Ed.),
             "Path MTU Discovery for IP version 6," RFC 8201, Jul.
             2017.

14.2. Informative References

   [Al10]    Allman, M., "Initial Congestion Window Specification",
             (work in progress), draft-allman-tcpm-bump-initcwnd-00,
             Nov. 2010.

   [Ba12]    Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A
             Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala
             Lumpur, Malaysia, May 23-27 2016.

   [Ba20]    Bagnulo, M., Briscoe, B., "ECN++: Adding Explicit
             Congestion Notification (ECN) to TCP Control Packets",
             draft-ietf-tcpm-generalized-ecn-07, Feb. 2021.

   [Be94]    Berners-Lee, T., et al., "The World-Wide Web,"
             Communications of the ACM, V37, Aug. 1994, pp. 76-82.

   [Br94]    Braden, B., "T/TCP -- Transaction TCP: Source Changes for
             Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994.

   [Br02]    Brownlee, N., Claffy, K., "Understanding Internet Traffic
             Streams: Dragonflies and Tortoises", IEEE Communications
             Magazine, pp. 110-117, 2002.

   [Co91]    Comer, D., Stevens, D., Internetworking with TCP/IP, V2,
             Prentice-Hall, NJ, 1991.

   [Du16]    Dukkipati, N., Cheng, Y., Vahdat, A., "Research Impacting
             the Practice of Congestion Control." ACM SIGCOMM CCR
             (editorial), on-line post, July 2016.

   [FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/

   [Hu01]    Hughes, A., Touch, J., Heidemann, J., "Issues in Slow-
             Start Restart After Idle", draft-hughes-restart-00
             (expired), Dec. 2001.

   [Hu12]    Hurtig, P., Brunstrom, A., "Enhanced metric caching for
             short TCP flows," 2012 IEEE International Conference on
             Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213.

   [IANA]    IANA TCP Parameters (options) registry,
             https://www.iana.org/assignments/tcp-parameters

   [Is18]    Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G.,
             Gjessing, S., "ctrlTCP: Reducing Latency through Coupled,
             Heterogeneous Multi-Flow TCP Congestion Control," Proc.
             IEEE INFOCOM Global Internet Symposium (GI) workshop (GI
             2018), Honolulu, HI, April 2018.

   [Ja88]    Jacobson, V., Karels, M., "Congestion Avoidance and
             Control", Proc. Sigcomm 1988.

   [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions
             Functional Specification," RFC 1644, July 1994.

   [RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC 1379,
             September 1992.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
             Retransmit, and Fast Recovery Algorithms", RFC 2001
             (Standards Track), Jan. 1997.

   [RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140,
             April 1997.

   [RFC2414] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
             Initial Window", RFC 2414 (Experimental), Sept. 1998.

   [RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address
             Translator (NAT) Terminology and Considerations",
             RFC 2663, August 1999.

   [RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
             Initial Window," RFC 3390, Oct. 2002.

   [RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager,"
             RFC 3124, June 2001.

   [RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion
             Control Protocol (DCCP)," RFC 4340, Mar. 2006.

   [RFC4960] Stewart, R., (Ed.), "Stream Control Transmission
             Protocol," RFC 4960, Sept. 2007.

   [RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication
             Option," RFC 5925, June 2010.

   [RFC6437] Amante, S., Carpenter, B., Jiang, S., Rajahalme, J., "IPv6
             Flow Label Specification," RFC 6437, Nov. 2011.

   [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS),"
             RFC 6691, July 2012.

   [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing
             TCP's Initial Window," RFC 6928, Apr. 2013.

   [RFC7231] Fielding, R., Reschke, J., Eds., "HTTP/1.1 Semantics and
             Content," RFC 7231, June 2014.

   [RFC7323] Borman, D., Braden, B., Jacobson, V., Scheffenegger, R.,
             (Ed.), "TCP Extensions for High Performance," RFC 7323,
             Sept. 2014.

   [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish,
             B., "Mechanisms for Optimizing Link Aggregation Group
             (LAG) and Equal-Cost Multipath (ECMP) Component Link
             Utilization in Networks", RFC 7424, Jan. 2015.

   [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer
             Protocol Version 2 (HTTP/2)", RFC 7540, May 2015.

   [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP
             to Support Rate-Limited Traffic", RFC 7661, Oct. 2015.

   [RFC8684] Ford, A., Raiciu, C., Handley, M., Bonaventure, O.,
             Paasch, C., "TCP Extensions for Multipath Operation with
             Multiple Addresses," RFC 8684, Mar. 2020.

15. Acknowledgments

   The authors would like to thank Praveen Balasubramanian for
   information regarding TCB sharing in Windows, Christoph Paasch for
   information regarding TCB sharing in Apple OSes, and Yuchung Cheng,
   Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on
   earlier versions of the draft, as well as members of the TCPM WG.
   Earlier revisions of this work received funding from a collaborative
   research project between the University of Oslo and Huawei
   Technologies Co., Ltd. and were partly supported by USC/ISI's Postel
   Center.

   This document was prepared using 2-Word-v2.0.template.dot.

16. Change log

   This section should be removed upon final publication as an RFC.

   ietf-11:

   - Addressed gen-art review and IESG feedback

   ietf-10:

   - Addressed IETF last call feedback

   ietf-09:

   - Correction of typographic errors

   ietf-08:

   - Address TSV AD comments, add Apple OS implementation status

   ietf-07:

   - Update per id-nits and normative language for consistency

   ietf-06:

   - Address WGLC comments

   ietf-05:

   - Correction of typographic errors, expansion of terminology

   ietf-04:

   - Fix internal cross-reference errors that appeared in ietf-02
   - Updated tables to re-center; clarified text

   ietf-03:

   - Correction of typographic errors, minor rewording in appendices

   ietf-02:

   - Minor reorganization and correction of typographic errors
   - Added text to address fingerprinting in Security section
   - Now retains Appendix B and body option tables upon publication

   ietf-01:

   - Added Appendix C to address long-timescale temporal adaptation

   ietf-00:

   - Re-issued as draft-ietf-tcpm-2140bis due to WG adoption.
   - Cleaned orphan references to T/TCP, removed incomplete refs
   - Moved references to informative section and updated Sec 2
   - Updated to clarify no impact to interoperability
   - Updated appendix B to avoid 2119 language

   06:

   - Changed to update 2140, cite it normatively, and summarize the
     updates in a separate section

   05:

   - Fixed some TBDs

   04:

   - Removed BCP-style recommendations and fixed some TBDs

   03:

   - Updated Touch's affiliation and address information

   02:

   - Stated that our OS implementation overview table only covers
     temporal sharing.

   - Correctly reflected sharing of old_RTT in Linux in the
     implementation overview table.

   - Marked entries that are considered safe to share with an
     asterisk (suggestion was to split the table)

   - Discussed correct host identification: NATs may make IP
     addresses the wrong input, could e.g., use HTTP cookie.

   - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and
     MTU

   - Added information about option sharing, listed options in
     Appendix B

Authors' Addresses

   Joe Touch
   Manhattan Beach, CA 90266
   USA

   Phone: +1 (310) 560-0334
   Email: touch@strayalpha.com

   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no

   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no

Appendix A: TCB Sharing History

   T/TCP proposed using caches to maintain TCB information across
   instances (temporal sharing), e.g., smoothed RTT, RTT variation,
   congestion avoidance threshold, and MSS [RFC1644]. These values were
   in addition to connection counts used by T/TCP to accelerate data
   delivery prior to the full three-way handshake during an OPEN. The
   goal was to aggregate TCB components where they reflect one
   association - that of the host-pair, rather than artificially
   separating those components by connection.

   At least one T/TCP implementation saved the MSS and aggregated the
   RTT parameters across multiple connections but omitted caching the
   congestion window information [Br94], as originally specified in
   [RFC1379]. Some T/TCP implementations immediately updated MSS when
   the TCP MSS header option was received [Br94], although this was not
   addressed specifically in the concepts or functional specification
   [RFC1379][RFC1644].
   In later T/TCP implementations, RTT values were
   updated only after a CLOSE, which does not benefit concurrent
   sessions.

   Temporal sharing of cached TCB data was originally implemented in
   the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same
   [FreeBSD]. As mentioned before, only the MSS and RTT parameters were
   cached, as originally specified in [RFC1379]. Later discussion of
   T/TCP suggested including congestion control parameters in this
   cache; for example, [RFC1644] (Section 3.1) hints at initializing
   the congestion window to the old window size.

Appendix B: TCP Option Sharing and Caching

   In addition to the options that can be cached and shared, this memo
   also lists known TCP options [IANA] for which state is unsafe to be
   kept. This list is not intended to be authoritative or exhaustive.

   Obsolete (unsafe to keep state):

      ECHO

      ECHO REPLY

      PO Conn permitted

      PO service profile

      CC

      CC.NEW

      CC.ECHO

      Alt CS req

      Alt CS data

   No state to keep:

      EOL

      NOP

      WS

      SACK

      TS

      MD5

      TCP-AO

      EXP1

      EXP2

   Unsafe to keep state:

      Skeeter (DH exchange, known to be vulnerable)

      Bubba (DH exchange, known to be vulnerable)

      Trailer CS

      SCPS capabilities

      S-NACK

      Records boundaries

      Corruption experienced

      SNAP

      TCP Compression

      Quickstart response

      UTO

      MPTCP negotiation success (see below for negotiation failure)

      TFO negotiation success (see below for negotiation failure)

   Safe but optional to keep state:

      MPTCP negotiation failure (to avoid negotiation retries)

      MSS

      TFO negotiation failure (to avoid negotiation retries)

   Safe and necessary to keep state:

      TFO cookie (if TFO succeeded in the past)

Appendix C: Automating the Initial Window in TCP over Long Timescales

C.1. Introduction

   Temporal sharing, as described earlier in this document, builds on
   the assumption that multiple consecutive connections between the
   same host pair are somewhat likely to be exposed to similar
   environment characteristics. The stored information can become less
   accurate over time and suitable precautions should take this ageing
   into consideration (this is discussed further in Section 8.1).
   However, there are also cases where it can make sense to track these
   values over longer periods, observing properties of TCP connections
   to gradually influence evolving trends in TCP parameters. This
   appendix describes an example of such a case.

   TCP's congestion control algorithm uses an initial window value
   (IW), both as a starting point for new connections and as an upper
   limit for restarting after an idle period [RFC5681][RFC7661]. This
   value has evolved over time, originally one maximum segment size
   (MSS), and increased to the lesser of four MSS or 4,380 bytes
   [RFC3390][RFC5681]. For a typical Internet connection with a maximum
   transmission unit (MTU) of 1500 bytes, this permits three segments
   of 1,460 bytes each.

   The IW value was implied in the original TCP congestion control
   description and documented as a standard in 1997
   [RFC2001][Ja88]. The value was updated in 1998 experimentally and
   moved to the standards track in 2002 [RFC2414][RFC3390]. In 2013, it
   was experimentally increased to 10 MSS [RFC6928].

   This appendix discusses how TCP can objectively measure when an IW
   is too large, and that such feedback should be used over long
   timescales to adjust the IW automatically. The result should be
   safer to deploy and might avoid the need to repeatedly revisit IW
   over time.

   Note that this mechanism attempts to make the IW more adaptive over
   time.
   It can increase the IW beyond that which is currently
   recommended for widescale deployment, and so its use should be
   carefully monitored.

C.2. Design Considerations

   TCP's IW value has existed statically for over two decades, so any
   solution to adjusting the IW dynamically should have similarly
   stable, non-invasive effects on the performance and complexity of
   TCP. In order to be fair, the IW should be similar for most machines
   on the public Internet. Finally, a desirable goal is to develop a
   self-correcting algorithm, so that IW values that cause network
   problems can be avoided. To that end, we propose the following
   design goals:

   o  Impart little to no impact to TCP in the absence of loss, i.e.,
      it should not increase the complexity of default packet
      processing in the normal case.

   o  Adapt to network feedback over long timescales, avoiding values
      that persistently cause network problems.

   o  Decrease the IW in the presence of sustained loss of IW segments,
      as determined over a number of different connections.

   o  Increase the IW in the absence of sustained loss of IW segments,
      as determined over a number of different connections.

   o  Operate conservatively, i.e., tend towards leaving the IW the
      same in the absence of sufficient information, and give greater
      consideration to IW segment loss than IW segment success.

   We expect that, without other context, a good IW algorithm will
   converge to a single value, but this is not required. An endpoint
   with additional context or information, or deployed in a constrained
   environment, can always use a different value. In particular,
   information from previous connections, or sets of connections with a
   similar path, can already be used as context for such decisions (as
   noted in the core of this document).

   However, if a given IW value persistently causes packet loss during
   the initial burst of packets, it is clearly inappropriate and could
   be inducing unnecessary loss in other competing connections. This
   might happen for sites behind very slow boxes with small buffers,
   which may or may not be the first hop.

C.3. Proposed IW Algorithm

   Below is a simple description of the proposed IW algorithm. It
   relies on the following parameters:

   o  MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])

   o  MaxIW = 10 MSS (as per [RFC6928])

   o  MulDecr = 0.5

   o  AddIncr = 2 MSS

   o  Threshold = 0.05

   We assume that the minimum IW (MinIW) should be as currently
   specified as standard [RFC3390]. The maximum IW can be set to a
   fixed value (we suggest using the experimental and now somewhat de-
   facto standard in [RFC6928]) or set based on a schedule if trusted
   time references are available [Al10]; here we prefer a fixed value.
   We also propose to use an AIMD algorithm, with increases and
   decreases as noted.

   Although these parameters are somewhat arbitrary, their initial
   values are not important except that the algorithm is AIMD and the
   MaxIW should not exceed that recommended for other systems on the
   Internet (here we selected the current de-facto standard rather than
   the actual standard). Current proposals, including default current
   operation, are degenerate cases of the algorithm below for given
   parameters - notably MulDecr = 1.0 and AddIncr = 0 MSS, thus
   disabling the automatic part of the algorithm.

   The proposed algorithm is as follows:

   1. On boot:

      IW = MaxIW; # assume this is in bytes, and indicates an integer
                  # multiple of 2 MSS (an even number to support
                  # ACK compression)

   2. Upon starting a new connection:

      CWND = IW;
      conncount++;
      IWnotchecked = 1; # true

   3. During a connection's SYN-ACK processing, if SYN-ACK includes ECN
      (as similarly addressed in Section 5 of ECN++ for TCP [Ba20]),
      treat as if the IW is too large:

      if (IWnotchecked && (synackecn == 1)) {
         losscount++;
         IWnotchecked = 0; # never check again
      }

   4. During a connection, if retransmission occurs, check the seqno of
      the outgoing packet (in bytes) to see if the resent segment fixes
      an IW loss:

      if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) {
         losscount++;
         IWnotchecked = 0; # never do this entire "if" again
      } else {
         IWnotchecked = 0; # you're beyond the IW so stop checking
      }

   5. Once every 1000 connections, as a separate process (i.e., not as
      part of processing a given connection):

      if (conncount > 1000) {
         if (losscount/conncount > Threshold) {
            # the number of connections with errors is too high
            IW = IW * MulDecr;
         } else {
            IW = IW + AddIncr;
         }
      }

   As presented, this algorithm can yield a false positive when the
   sequence number wraps around, e.g., the code might increment
   losscount in step 4 when no loss occurred or fail to increment
   losscount when a loss did occur. This can be avoided using either
   PAWS [RFC7323] context or internal extended sequence number
   representations (as in TCP-AO [RFC5925]). Alternately, false
   positives can be tolerated because they are expected to be
   infrequent and thus will not significantly impact the algorithm.

   A number of additional constraints need to be imposed if this
   mechanism is implemented to ensure that it defaults to values that
   comply with current Internet standards, is conservative in how it
   extends those values, and returns to those values in the absence of
   positive feedback (i.e., success).
   To that end, we recommend the
   following list of example constraints:

   >> The automatic IW algorithm MUST initialize MaxIW to a value no
   larger than the currently recommended Internet default, in the
   absence of other context information.

   Thus, if there are too few connections to make a decision or if
   there is otherwise insufficient information to increase the IW, then
   the MaxIW defaults to the current recommended value.

   >> An implementation MAY allow the MaxIW to grow beyond the
   currently recommended Internet default, but not more than 2 segments
   per calendar year.

   Thus, if an endpoint has a persistent history of successfully
   transmitting IW segments without loss, then it is allowed to probe
   the Internet to determine if larger IW values have similar success.
   This probing is limited and requires a trusted time source,
   otherwise the MaxIW remains constant.

   >> An implementation MUST adjust the IW based on loss statistics at
   least once every 1000 connections.

   An endpoint needs to be sufficiently reactive to IW loss.

   >> An implementation MUST decrease the IW by at least one MSS when
   indicated during an evaluation interval.

   An endpoint that detects loss needs to decrease its IW by at least
   one MSS, otherwise it is not participating in an automatic reactive
   algorithm.

   >> An implementation MUST increase by no more than 2 MSS per
   evaluation interval.

   An endpoint that does not experience IW loss needs to probe the
   network incrementally.

   >> An implementation SHOULD use an IW that is an integer multiple of
   2 MSS.

   The IW should remain a multiple of 2 MSS segments, to enable
   efficient ACK compression without incurring unnecessary timeouts.

   >> An implementation MUST decrease the IW if more than 95% of
   connections have IW losses.

   Again, this is to ensure an implementation is sufficiently reactive.

   >> An implementation MAY group IW values and statistics within
   subsets of connections. Such grouping MAY use any information about
   connections to form groups except loss statistics.

   There are some TCP connections which might not be counted at all,
   such as those to/from loopback addresses, or those within the same
   subnet as that of a local interface (for which congestion control is
   sometimes disabled anyway). This may also include connections that
   terminate before the IW is full, i.e., as a separate check at the
   time of the connection closing.

   The period over which the IW is updated is intended to be a long
   timescale, e.g., a month or so, or 1,000 connections, whichever is
   longer. An implementation might check the IW once a month, and
   simply not update the IW or clear the connection counts in months
   where the number of connections is too small.

C.4. Discussion

   There are numerous parameters to the above algorithm that are
   compliant with the given requirements; this is intended to allow
   variation in configuration and implementation while ensuring that
   all such algorithms are reactive and safe.

   This algorithm continues to assume segments because that is the
   basis of most TCP implementations. It might be useful to consider
   revising the specifications to allow byte-based congestion given
   sufficient experience.

   The algorithm checks for IW losses only during the first IW after a
   connection start; it does not check for IW losses at other points
   where the IW is used, e.g., during slow-start restarts.

   >> An implementation MAY detect IW losses during slow-start restarts
   in addition to losses during the first IW of a connection. In this
   case, the implementation MUST count each restart as a "connection"
   for the purposes of connection counts and periodic rechecking of the
   IW value.

   False positives can occur during some kinds of segment reordering,
   e.g., that might trigger spurious retransmissions even without a
   true segment loss. These are not expected to be sufficiently common
   to dominate the algorithm and its conclusions.

   This mechanism does require additional per-connection state, which
   is currently common in some implementations, and is useful for other
   reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism
   also benefits from persistent state kept across reboots, as would
   other state-sharing mechanisms (e.g., TCP Control Block Sharing per
   the main body of this document).

   The receive window (rwnd) is not involved in this calculation. The
   size of rwnd is determined by receiver resources and provides space
   to accommodate segment reordering. It is not involved with
   congestion control, which is the focus of this document and its
   management of the IW.

C.5. Observations

   The IW may not converge to a single, global value. It also may not
   converge at all, but rather may oscillate by a few MSS as it
   repeatedly probes the Internet for larger IWs and fails. Both
   properties are consistent with TCP behavior during each individual
   connection.

   This mechanism assumes that losses during the IW are due to IW size.
   Persistent errors that drop packets for other reasons (e.g., OS
   bugs) can cause false positives. Again, this is consistent with
   TCP's basic assumption that loss is caused by congestion and
   requires backoff. This algorithm treats the IW of new connections as
   a long-timescale backoff system.
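
   As an illustration only, the per-connection checks and the periodic
   AIMD update of Section C.3 can be sketched in Python. This is a
   minimal sketch under stated assumptions, not a reference
   implementation: the names IWState, Conn, start_connection,
   on_synack_ecn, on_retransmit, and evaluate are hypothetical; an MSS
   of 1,460 bytes is assumed; and the clamping of the result to
   [MinIW, MaxIW] and the reset of the counters after each evaluation
   are assumptions that the text above does not specify. The
   modulo-2^32 subtraction only covers an ISN close to the wrap point;
   it does not replace the PAWS or extended sequence number approaches
   mentioned in Section C.3. Rounding to a multiple of 2 MSS is not
   shown.

```python
from dataclasses import dataclass

MSS = 1460              # assumed segment size in bytes (not from the text)
MIN_IW = 3 * MSS        # MinIW
MAX_IW = 10 * MSS       # MaxIW
MUL_DECR = 0.5          # MulDecr
ADD_INCR = 2 * MSS      # AddIncr
THRESHOLD = 0.05        # Threshold
INTERVAL = 1000         # connections per evaluation interval
SEQ_MOD = 1 << 32       # TCP sequence numbers are 32 bits wide


@dataclass
class IWState:
    """Host-wide state: current IW (bytes) and the two counters."""
    iw: int = MAX_IW    # step 1: on boot, IW = MaxIW
    conncount: int = 0
    losscount: int = 0


@dataclass
class Conn:
    """Per-connection state needed for the IW loss check."""
    isn: int
    iw: int                       # IW in effect when the connection began
    iw_not_checked: bool = True


def start_connection(state: IWState, isn: int) -> Conn:
    """Step 2: count the connection and remember the IW it used."""
    state.conncount += 1
    return Conn(isn=isn % SEQ_MOD, iw=state.iw)


def on_synack_ecn(state: IWState, conn: Conn) -> None:
    """Step 3: an ECN-marked SYN-ACK is treated as an IW loss."""
    if conn.iw_not_checked:
        state.losscount += 1
        conn.iw_not_checked = False


def on_retransmit(state: IWState, conn: Conn, seqno: int) -> None:
    """Step 4: count a loss if the first retransmission is inside the IW."""
    if not conn.iw_not_checked:
        return
    offset = (seqno - conn.isn) % SEQ_MOD   # tolerant of an ISN near 2^32
    if offset < conn.iw:
        state.losscount += 1
    conn.iw_not_checked = False             # checked once, never again


def evaluate(state: IWState) -> int:
    """Step 5: AIMD update, run outside of per-connection processing."""
    if state.conncount < INTERVAL:
        return state.iw                     # too little data: keep the IW
    if state.losscount / state.conncount > THRESHOLD:
        new_iw = int(state.iw * MUL_DECR)   # multiplicative decrease
    else:
        new_iw = state.iw + ADD_INCR        # additive increase
    state.iw = max(MIN_IW, min(MAX_IW, new_iw))   # assumed clamping
    state.conncount = 0                     # assumed reset per interval
    state.losscount = 0
    return state.iw
```

   In this sketch, a caller would invoke start_connection() when a
   connection opens, on_synack_ecn() or on_retransmit() as those events
   occur, and evaluate() from a periodic task. Keeping the update out of
   the per-packet path follows the design goal in Section C.2 of adding
   little or no cost in the absence of loss.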