idnits 2.17.1 draft-ietf-tcpm-2140bis-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 252 has weird spacing: '...sthresh old_...' == Line 254 has weird spacing: '...nd_cwnd old_...' == Line 281 has weird spacing: '... MSSopt curr_...' == Line 291 has weird spacing: '...sthresh curr...' == Line 293 has weird spacing: '...nd_cwnd curr...' == (5 more instances...) == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (April 29, 2020) is 1451 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. 
Checking references for intended status: Informational ---------------------------------------------------------------------------- ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 1644 (Obsoleted by RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1379 (Obsoleted by RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 2001 (Obsoleted by RFC 2581) -- Obsolete informational reference (is this intentional?): RFC 2140 (Obsoleted by RFC 9040) -- Obsolete informational reference (is this intentional?): RFC 2414 (Obsoleted by RFC 3390) -- Obsolete informational reference (is this intentional?): RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 2861 (Obsoleted by RFC 7661) -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) -- Obsolete informational reference (is this intentional?): RFC 6824 (Obsoleted by RFC 8684) -- Obsolete informational reference (is this intentional?): RFC 7231 (Obsoleted by RFC 9110) -- Obsolete informational reference (is this intentional?): RFC 7540 (Obsoleted by RFC 9113) Summary: 1 error (**), 0 flaws (~~), 8 warnings (==), 13 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 TCPM WG J. Touch 2 Internet Draft Independent 3 Intended status: Informational M. Welzl 4 Obsoletes: 2140 S. Islam 5 Expires: October 2020 University of Oslo 6 April 29, 2020 8 TCP Control Block Interdependence 9 draft-ietf-tcpm-2140bis-04.txt 11 Status of this Memo 13 This Internet-Draft is submitted in full conformance with the 14 provisions of BCP 78 and BCP 79. 16 This document may contain material from IETF Documents or IETF 17 Contributions published or made publicly available before November 18 10, 2008. 
The person(s) controlling the copyright in some of this 19 material may not have granted the IETF Trust the right to allow 20 modifications of such material outside the IETF Standards Process. 21 Without obtaining an adequate license from the person(s) controlling 22 the copyright in such materials, this document may not be modified 23 outside the IETF Standards Process, and derivative works of it may 24 not be created outside the IETF Standards Process, except to format 25 it for publication as an RFC or to translate it into languages other 26 than English. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF), its areas, and its working groups. Note that 30 other groups may also distribute working documents as Internet- 31 Drafts. 33 Internet-Drafts are draft documents valid for a maximum of six 34 months and may be updated, replaced, or obsoleted by other documents 35 at any time. It is inappropriate to use Internet-Drafts as 36 reference material or to cite them other than as "work in progress." 38 The list of current Internet-Drafts can be accessed at 39 http://www.ietf.org/ietf/1id-abstracts.txt 41 The list of Internet-Draft Shadow Directories can be accessed at 42 http://www.ietf.org/shadow.html 44 This Internet-Draft will expire on October 29, 2020. 46 Copyright Notice 48 Copyright (c) 2020 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents 53 (https://trustee.ietf.org/license-info) in effect on the date of 54 publication of this document. Please review these documents 55 carefully, as they describe your rights and restrictions with 56 respect to this document. 
Code Components extracted from this
57      document must include Simplified BSD License text as described in
58      Section 4.e of the Trust Legal Provisions and are provided
59      without warranty as described in the Simplified BSD License.

61   Abstract

63      This memo provides guidance to TCP implementers that is intended to
64      help improve convergence to steady-state operation without affecting
65      interoperability. It updates and replaces RFC 2140's description of
66      interdependent TCP control blocks and the ways that part of TCP
67      state can be shared among similar concurrent or consecutive
68      connections. TCP state includes a combination of parameters, such as
69      connection state, current round-trip time estimates, congestion
70      control information, and process information. Most of this state is
71      maintained on a per-connection basis in the TCP Control Block (TCB),
72      but implementations can (and do) share certain TCB information
73      across connections to the same host. Such sharing is intended to
74      improve overall transient transport performance, while maintaining
75      backward-compatibility with existing implementations. The sharing
76      described herein is limited to only the TCB initialization and so
77      has no effect on the long-term behavior of TCP after a connection
78      has been established.

80   Table of Contents

82      1. Introduction
83      2. Conventions Used in This Document
84      3. Terminology
85      4. The TCP Control Block (TCB)
86      5. TCB Interdependence
87      6. Temporal Sharing
88         6.1. Initialization of the new TCB
89         6.2. Updates to the new TCB
90         6.3. Discussion
91      7. Ensemble Sharing
92         7.1. Initialization of a new TCB
93         7.2. Updates to the new TCB
94         7.3. Discussion
95      8. Compatibility Issues
96         8.1. Traversing the same network path
97         8.2. State dependence
98         8.3. Problems with IP sharing
99      9. Implications
100        9.1. Layering
101        9.2. Other possibilities
102     10. Implementation Observations
103     11. Updates to RFC 2140
104     12. Security Considerations
105     13. IANA Considerations
106     14. References
107        14.1. Normative References
108        14.2. Informative References
109     15. Acknowledgments
110     16. Change log
111     Appendix A : TCB Sharing History
112     Appendix B : TCP Option Sharing and Caching
113     Appendix C : Automating the Initial Window in TCP over Long
114        Timescales
115        C.1. Introduction
116        C.2. Design Considerations
117        C.3. Proposed IW Algorithm
118        C.4. Discussion
119        C.5. Observations

121  1. Introduction

123     TCP is a connection-oriented reliable transport protocol layered
124     over IP [RFC793]. Each TCP connection maintains state, usually in a
125     data structure called the TCP Control Block (TCB). The TCB contains
126     information about the connection state, its associated local
127     process, and feedback parameters about the connection's transmission
128     properties. As originally specified and usually implemented, most
129     TCB information is maintained on a per-connection basis. Some
130     implementations can (and now do) share certain TCB information
131     across connections to the same host [RFC2140]. Such sharing is
132     intended to lead to better overall transient performance, especially
133     for numerous short-lived and simultaneous connections, as often used
134     in the World-Wide Web [Be94][Br02]. This sharing of state is
135     intended to help TCP connections converge to steady-state behavior
136     more quickly without affecting TCP interoperability.

138     This document updates RFC 2140's discussion of TCB state sharing and
139     provides a complete replacement for that document. This state
140     sharing affects only TCB initialization [RFC2140] and thus has no
141     effect on the long-term behavior of TCP after a connection has been
142     established nor on interoperability. Path information shared across
143     SYN destination port numbers assumes that TCP segments having the
144     same host-pair experience the same path properties, irrespective of
145     TCP port numbers. The observations about TCB sharing in this
146     document apply similarly to any protocol with congestion state,
147     including SCTP [RFC4960] and DCCP [RFC4340], as well as for
148     individual subflows in Multipath TCP [RFC6824].

150  2.
Conventions Used in This Document

152     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
153     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
154     "OPTIONAL" in this document are to be interpreted as described in
155     BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
156     capitals, as shown here.

158     However, this document is intended to describe behavior that is
159     already permitted by TCP standards. As a result, it provides
160     informative guidance but does not use such normative language,
161     except when quoting other documents.

163  3. Terminology

165     Host - a source or sink of TCP segments associated with a single IP
166     address

168     Host-pair - a pair of hosts and their corresponding IP addresses

170     Path - an Internet path between the IP addresses of two hosts

172  4. The TCP Control Block (TCB)

174     A TCB describes the data associated with each connection, i.e., with
175     each association of a pair of applications across the network. The
176     TCB contains at least the following information [RFC793]:

178        Local process state
179           pointers to send and receive buffers
180           pointers to retransmission queue and current segment
181           pointers to Internet Protocol (IP) PCB
182        Per-connection shared state
183           macro-state
184              connection state
185              timers
186              flags
187              local and remote host numbers and ports
188              TCP option state
189           micro-state
190              send and receive window state (size*, current number)
191              cong. window size (snd_cwnd)*
192              cong. window size threshold (ssthresh)*
193              max window size seen*
194              sendMSS#
195              MMS_S#
196              MMS_R#
197              PMTU#
198              round-trip time and variance#

200     The per-connection information is shown as split into macro-state
201     and micro-state, terminology borrowed from [Co91]. Macro-state
202     describes the protocol for establishing the initial shared state
203     about the connection; we include the endpoint numbers and components
204     (timers, flags) required upon commencement that are later used to
205     help maintain that state. Micro-state describes the protocol after a
206     connection has been established, to maintain the reliability and
207     congestion control of the data transferred in the connection.

209     We further distinguish two other classes of shared micro-state that
210     are associated more with host-pairs than with application pairs. One
211     class is clearly host-pair dependent (#, e.g., MSS, MMS, PMTU, RTT),
212     and the other is host-pair dependent in its aggregate (*, e.g.,
213     congestion window information, current window sizes, etc.).

215  5. TCB Interdependence

217     There are two cases of TCB interdependence. Temporal sharing occurs
218     when the TCB of an earlier (now CLOSED) connection to a host is used
219     to initialize some parameters of a new connection to that same host,
220     i.e., in sequence. Ensemble sharing occurs when a currently active
221     connection to a host is used to initialize another (concurrent)
222     connection to that host.

224  6. Temporal Sharing

226     The TCB data cache is accessed in two ways: it is read to initialize
227     new TCBs and written when more current per-host state is available.

229  6.1. Initialization of the new TCB

231     TCBs for new connections can be initialized using context from past
232     connections as follows:

234              TEMPORAL SHARING - TCB Initialization

236              Cached TCB           New TCB
237              --------------------------------------
238              old_MMS_S            old_MMS_S or not cached

240              old_MMS_R            old_MMS_R or not cached

242              old_sendMSS          old_sendMSS

244              old_PMTU             old_PMTU

246              old_RTT              old_RTT

248              old_RTTvar           old_RTTvar

250              old_option           (option specific)

252              old_ssthresh         old_ssthresh

254              old_snd_cwnd         old_snd_cwnd

256     The table below gives an overview of option-specific information
257     that can be shared.
Additional information on some specific TCP
258     options and sharing is provided in Appendix B.

260              TEMPORAL SHARING - Option Info Initialization

262              Cached               New
263              ------------------------------------
264              old_TFO_Cookie       old_TFO_Cookie

266              old_TFO_Failure      old_TFO_Failure

268  6.2. Updates to the new TCB

270     During the connection, the associated TCB can be updated based on
271     particular events, as shown below:

273              TEMPORAL SHARING - Cache Updates

275        Cached TCB     Current TCB     when?   New Cached TCB
276        ----------------------------------------------------------
277        old_MMS_S      curr_MMS_S      OPEN    curr_MMS_S

279        old_MMS_R      curr_MMS_R      OPEN    curr_MMS_R

281        old_sendMSS    curr_sendMSS    MSSopt  curr_sendMSS

283        old_PMTU       curr_PMTU       PMTUD   curr_PMTU

285        old_RTT        curr_RTT        CLOSE   merge(curr,old)

287        old_RTTvar     curr_RTTvar     CLOSE   merge(curr,old)

289        old_option     curr_option     ESTAB   (depends on option)

291        old_ssthresh   curr_ssthresh   CLOSE   merge(curr,old)

293        old_snd_cwnd   curr_snd_cwnd   CLOSE   merge(curr,old)

295     The table below gives an overview of option-specific information
296     that can be similarly shared.

298              TEMPORAL SHARING - Option Info Updates

300        Cached            Current           when?   New Cached
301        ---------------------------------------------------------
302        old_TFO_Cookie    old_TFO_Cookie    ESTAB   old_TFO_Cookie

304        old_TFO_Failure   old_TFO_Failure   ESTAB   old_TFO_Failure

306  6.3. Discussion

308     There is no particular benefit to caching MMS_S and MMS_R as these
309     are reported by the local IP stack. Caching sendMSS and PMTU is
310     trivial; reported values are cached, and the most recent values are
311     used. The cache is updated when the MSS option is received in a SYN
312     or after PMTUD (i.e., when an ICMPv4 Fragmentation Needed [RFC1191]
313     or ICMPv6 Packet Too Big message is received [RFC8201] or the
314     equivalent is inferred, e.g., as from PLPMTUD [RFC4821]),
315     respectively, so the cache always has the most recent values from
316     any connection.
For sendMSS, the cache is consulted only at 317 connection establishment and not otherwise updated, which means that 318 MSS options do not affect current connections. The default sendMSS 319 is never saved; only reported MSS values update the cache, so an 320 explicit override is required to reduce the sendMSS. 322 RTT values are updated by formulae that merge the old and new 323 values. Dynamic RTT estimation requires a sequence of RTT 324 measurements. As a result, the cached RTT (and its variance) is an 325 average of its previous value with the contents of the currently 326 active TCB for that host, when a TCB is closed. RTT values are 327 updated only when a connection is closed. The method for merging old 328 and current values needs to attempt to reduce the transient effects 329 of the new connections. 331 The updates for RTT, RTTvar and ssthresh rely on existing 332 information, i.e., old values. Should no such values exist, the 333 current values are cached instead. 335 TCP options are copied or merged depending on the details of each 336 option, where "merge" is some function that combines the values of 337 "curr" and "old". E.g., TFO state is updated when a connection is 338 established and read before establishing a new connection. 340 Sections 8 and 9 discuss compatibility issues and implications of 341 sharing the specific information listed above. Section 10 gives an 342 overview of known implementations. 344 Most cached TCB values are updated when a connection closes. The 345 exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122], 346 PMTU which is updated after Path MTU Discovery 347 [RFC1191][RFC4821][RFC8201], and sendMSS, which is updated if the 348 MSS option is received in the TCP SYN header. 350 Sharing sendMSS information affects only data in the SYN of the next 351 connection, because sendMSS information is typically included in 352 most TCP SYN segments. 
Caching PMTU can accelerate the efficiency of 353 PMTUD, but can also result in black-holing until corrected if in 354 error. Caching MMS_R and MMS_S may be of little direct value as they 355 are reported by the local IP stack anyway. 357 The way in which other TCP option state can be shared depends on the 358 details of that option. E.g., TFO state includes the TCP Fast Open 359 Cookie [RFC7413] or, in case TFO fails, a negative TCP Fast Open 360 response. RFC 7413 states, "The client MUST cache negative responses 361 from the server in order to avoid potential connection failures. 362 Negative responses include the server not acknowledging the data in 363 the SYN, ICMP error messages, and (most importantly) no response 364 (SYN-ACK) from the server at all, i.e., connection timeout." [RFC 365 7413]. TFOinfo is cached when a connection is established. 367 Other TCP option state might not be as readily cached. E.g., TCP-AO 368 [RFC5925] success or failure between a host pair for a single SYN 369 destination port might be usefully cached. TCP-AO success or failure 370 to other SYN destination ports on that host pair is never useful to 371 cache because TCP-AO security parameters can vary per service. 373 7. Ensemble Sharing 375 Sharing cached TCB data across concurrent connections requires 376 attention to the aggregate nature of some of the shared state. For 377 example, although MSS and RTT values can be shared by copying, it 378 may not be appropriate to simply copy congestion window or ssthresh 379 information; instead, the new values can be a function (f) of the 380 cumulative values and the number of connections (N). 382 7.1. 
Initialization of a new TCB

384     TCBs for new connections can be initialized using context from
385     concurrent connections as follows:

387              ENSEMBLE SHARING - TCB Initialization

389              Cached TCB           New TCB
390              ------------------------------------------
391              old_MMS_S            old_MMS_S

393              old_MMS_R            old_MMS_R

395              old_sendMSS          old_sendMSS

397              old_PMTU             old_PMTU

399              old_RTT              old_RTT

401              old_RTTvar           old_RTTvar

403              sum(old_ssthresh)    f(sum(old_ssthresh), N)

405              sum(old_snd_cwnd)    f(sum(old_snd_cwnd), N)

407              old_option           (option specific)

409     The table below gives an overview of option-specific information
410     that can be similarly shared.

412              ENSEMBLE SHARING - Option Info Initialization

414              Cached               New
415              ------------------------------------
416              old_TFO_Cookie       old_TFO_Cookie

418              old_TFO_Failure      old_TFO_Failure

420  7.2. Updates to the new TCB

422     During the connection, the associated TCB can be updated based on
423     changes to concurrent connections, as shown below:

425              ENSEMBLE SHARING - Cache Updates

427        Cached TCB     Current TCB     when?      New Cached TCB
428        ---------------------------------------------------------------
429        old_MMS_S      curr_MMS_S      OPEN       curr_MMS_S

431        old_MMS_R      curr_MMS_R      OPEN       curr_MMS_R

433        old_sendMSS    curr_sendMSS    MSSopt     curr_sendMSS

435        old_PMTU       curr_PMTU       PMTUD      curr_PMTU / PLPMTUD

437        old_RTT        curr_RTT        update     rtt_update(old,curr)

439        old_RTTvar     curr_RTTvar     update     rtt_update(old,curr)

441        old_ssthresh   curr_ssthresh   update     adjust sum as appropriate

443        old_snd_cwnd   curr_snd_cwnd   update     adjust sum as appropriate

445        old_option     curr_option     (depends)  (option specific)

447     The table below gives an overview of option-specific information
448     that can be similarly shared.

450              ENSEMBLE SHARING - Option Info Updates

452        Cached            Current           when?   New Cached
453        ----------------------------------------------------------
454        old_TFO_Cookie    old_TFO_Cookie    ESTAB   old_TFO_Cookie

456        old_TFO_Failure   old_TFO_Failure   ESTAB   old_TFO_Failure

458  7.3.
Discussion

460     For ensemble sharing, TCB information should be cached as early as
461     possible, sometimes before a connection is closed. Otherwise,
462     opening multiple concurrent connections may not result in TCB data
463     sharing if no connection closes before others open. The amount of
464     work involved in updating the aggregate average should be minimized,
465     but the resulting value should be equivalent to having all values
466     measured within a single connection. The function "rtt_update" in
467     the ensemble sharing table indicates this operation, which occurs
468     whenever the RTT would have been updated in the individual TCP
469     connection. As a result, the cache contains the shared RTT
470     variables, which no longer need to reside in the TCB.

472     Congestion window size and ssthresh aggregation are more complicated
473     in the concurrent case. When there is an ensemble of connections, we
474     need to decide how that ensemble would have shared these variables,
475     in order to derive initial values for new TCBs.

477     Sections 8 and 9 discuss compatibility issues and implications of
478     sharing the specific information listed above.

480     Any assumption of TCB information sharing can be incorrect because
481     identical endpoint address pairs may not share network paths. In
482     current implementations, new congestion windows are set at an
483     initial value of 4-10 segments [RFC3390][RFC6928], so that the sum
484     of the current windows is increased for any new connection. This can
485     have detrimental consequences where several connections share a
486     highly congested link.

488     There are several ways to initialize the congestion window in a new
489     TCB among an ensemble of current connections to a host. Current TCP
490     implementations initialize it to four segments as standard [RFC3390]
491     and 10 segments experimentally [RFC6928]. These approaches assume
492     that new connections should behave as conservatively as possible.
493     The algorithm described in [Ba12] adjusts the initial cwnd depending
494     on the cwnd values of ongoing connections. There have also been
495     suggestions to use the kind of sharing mechanisms described in this
496     document over long timescales to adapt TCP's initial window
497     automatically, as described further in Appendix C [To12].

499  8. Compatibility Issues

501     Here, we discuss various types of problems that may arise with TCB
502     information sharing.

504     For the congestion and current window information, the initial
505     values computed by TCB interdependence may not be consistent with
506     the long-term aggregate behavior of a set of concurrent connections
507     between the same endpoints. Under conventional TCP congestion
508     control, if a single existing connection has converged to a
509     congestion window of 40 segments, two newly joining concurrent
510     connections assume initial windows of 10 segments [RFC6928], and the
511     existing connection's window does not decrease to accommodate this
512     additional load, so the connections can mutually interfere. One
513     example of this is seen on low-bandwidth, high-delay links, where
514     concurrent connections supporting Web traffic can collide because
515     their initial windows were too large, even when set at one segment.

517     The authors of [Hu12] recommend caching ssthresh for temporal
518     sharing only when flows are long. Some studies suggest that sharing
519     ssthresh between short flows can deteriorate the performance of
520     individual connections [Hu12, Du16], although this may benefit
521     aggregate network performance.

523  8.1. Traversing the same network path

525     TCP is sometimes used in situations where packets of the same host-
526     pair do not always take the same path.
Multipath routing that relies 527 on examining transport headers, such as ECMP and LAG [RFC7424], may 528 not result in repeatable path selection when TCP segments are 529 encapsulated, encrypted, or altered - for example, in some Virtual 530 Private Network (VPN) tunnels that rely on proprietary 531 encapsulation. Similarly, such approaches cannot operate 532 deterministically when the TCP header is encrypted, e.g., when using 533 IPsec ESP (although TCB interdependence among the entire set sharing 534 the same endpoint IP addresses should work without problems when the 535 TCP header is encrypted). Measures to increase the probability that 536 connections use the same path could be applied: e.g., the 537 connections could be given the same IPv6 flow label. TCB 538 interdependence can also be extended to sets of host IP address 539 pairs that share the same network path conditions, such as when a 540 group of addresses is on the same LAN (see Section 9). 542 Traversing the same path is not important for host-specific 543 information such as RWIN and TCP option state, such as TFOinfo. When 544 TCB information is shared across different SYN destination ports, 545 path-related information can be incorrect; however, the impact of 546 this error is potentially diminished if (as discussed here) TCB 547 sharing affects only the transient event of a connection start or if 548 TCB information is shared only within connections to the same SYN 549 destination port. In case of Temporal Sharing, TCB information could 550 also become invalid over time. Because this is similar to the case 551 when a connection becomes idle, mechanisms that address idle TCP 552 connections (e.g., [RFC7661]) could also be applied to TCB cache 553 management, especially when TCP Fast Open is used [RFC7413]. 555 8.2. 
State dependence

557     There may be additional considerations to the way in which TCB
558     interdependence rebalances congestion feedback among the current
559     connections, e.g., it may be appropriate to consider the impact of a
560     connection being in Fast Recovery [RFC5681] or some other similar
561     unusual feedback state, e.g., as inhibiting or affecting the
562     calculations described herein.

564  8.3. Problems with IP sharing

566     It can be wrong to share TCB information between TCP connections on
567     the same host as identified by the IP address if an IP address is
568     assigned to a new host (e.g., IP address spinning, as is used by
569     ISPs to inhibit running servers). It can also be wrong if Network
570     Address (and Port) Translation (NA(P)T) [RFC2663] or any other IP
571     sharing mechanism is used. Such mechanisms are less likely to be
572     used with IPv6. Other methods to identify a host could also be
573     considered to make correct TCB sharing more likely. Moreover, some
574     TCB information is about dominant path properties rather than the
575     specific host. IP addresses may differ, yet the relevant part of the
576     path may be the same.

578  9. Implications

580     There are several implications to incorporating TCB interdependence
581     in TCP implementations. First, it may reduce the need for
582     application-layer multiplexing for performance enhancement
583     [RFC7231]. Protocols like HTTP/2 [RFC7540] avoid connection
584     reestablishment costs by serializing or multiplexing a set of per-
585     host connections across a single TCP connection. This avoids TCP's
586     per-connection OPEN handshake and also avoids recomputing the MSS,
587     RTT, and congestion window values. By avoiding the so-called "slow-
588     start restart," performance can be optimized [Hu01]. TCB
589     interdependence can provide the "slow-start restart avoidance" of
590     multiplexing, without requiring a multiplexing mechanism at the
591     application layer.
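
The way temporal sharing lets a new connection skip recomputing RTT and congestion window values can be illustrated with a minimal sketch, assuming a per-host cache keyed by the remote IP address. The names CachedTCB, merge, on_close, and init_new_tcb are hypothetical and do not come from this document or from any TCP stack. The equal-weight mean used for merge(curr,old) is only one possible choice, and the default values in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedTCB:
    """Hypothetical per-host cache entry holding a subset of TCB state."""
    rtt_ms: Optional[float] = None      # smoothed round-trip time
    rttvar_ms: Optional[float] = None   # round-trip time variance
    ssthresh: Optional[float] = None    # slow-start threshold, in segments
    snd_cwnd: Optional[float] = None    # congestion window, in segments


def merge(curr: Optional[float], old: Optional[float]) -> Optional[float]:
    """Assumed merge(curr,old): equal-weight mean of the two values.

    If only one value exists, that value is used unchanged, so the first
    connection to a host simply seeds the cache.
    """
    if curr is None:
        return old
    if old is None:
        return curr
    return (curr + old) / 2.0


def pick(cached: Optional[float], default: Optional[float]) -> Optional[float]:
    """Prefer a cached value; otherwise use the stack default."""
    return cached if cached is not None else default


def on_close(cache: Dict[str, CachedTCB], host: str, curr: CachedTCB) -> None:
    """At CLOSE, fold the connection's values into the entry for host."""
    old = cache.get(host, CachedTCB())
    cache[host] = CachedTCB(
        rtt_ms=merge(curr.rtt_ms, old.rtt_ms),
        rttvar_ms=merge(curr.rttvar_ms, old.rttvar_ms),
        ssthresh=merge(curr.ssthresh, old.ssthresh),
        snd_cwnd=merge(curr.snd_cwnd, old.snd_cwnd),
    )


def init_new_tcb(cache: Dict[str, CachedTCB], host: str,
                 defaults: CachedTCB) -> CachedTCB:
    """At OPEN, build a new TCB from the cache, falling back to defaults."""
    old = cache.get(host, CachedTCB())
    return CachedTCB(
        rtt_ms=pick(old.rtt_ms, defaults.rtt_ms),
        rttvar_ms=pick(old.rttvar_ms, defaults.rttvar_ms),
        ssthresh=pick(old.ssthresh, defaults.ssthresh),
        snd_cwnd=pick(old.snd_cwnd, defaults.snd_cwnd),
    )
```

Under this assumed merge, if two connections to the same host close with RTTs of 100 ms and 300 ms, the cached RTT becomes 200 ms, and a third connection to that host starts from 200 ms rather than from the stack default. A connection to a host with no cache entry starts from the defaults unchanged.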
593 Like the initial version of this document [RFC2140], this update's 594 approach to TCB interdependence focuses on sharing a set of TCBs by 595 updating the TCB state to reduce the impact of transients when 596 connections begin or end. Other mechanisms have since been proposed 597 to continuously share information between all ongoing communication 598 (including connectionless protocols), updating the congestion state 599 during any congestion-related event (e.g., timeout, loss 600 confirmation, etc.) [RFC3124]. By dealing exclusively with 601 transients, TCB interdependence is more likely to exhibit the same 602 behavior as unmodified, independent TCP connections. 604 9.1. Layering 606 TCB interdependence pushes some of the TCP implementation from the 607 traditional transport layer (in the ISO model), to the network 608 layer. This acknowledges that some state is in fact per-host-pair or 609 can be per-path as indicated solely by that host-pair. Transport 610 protocols typically manage per-application-pair associations (per 611 stream), and network protocols manage per-host-pair and path 612 associations (routing). Round-trip time, MSS, and congestion 613 information could be more appropriately handled in a network-layer 614 fashion, aggregated among concurrent connections, and shared across 615 connection instances [RFC3124]. 617 An earlier version of RTT sharing suggested implementing RTT state 618 at the IP layer, rather than at the TCP layer. Our observations 619 describe sharing state among TCP connections, which avoids some of 620 the difficulties in an IP-layer solution. One such problem of an IP 621 layer solution is determining the correspondence between packet 622 exchanges using IP header information alone, where such 623 correspondence is needed to compute RTT. Because TCB sharing 624 computes RTTs inside the TCP layer using TCP header information, it 625 can be implemented more directly and simply than at the IP layer. 
626 This is a case where information should be computed at the transport 627 layer but could be shared at the network layer. 629 9.2. Other possibilities 631 Per-host-pair associations are not the limit of these techniques. It 632 is possible that TCBs could be similarly shared between hosts on a 633 subnet or within a cluster, because the predominant path can be 634 subnet-subnet, rather than host-host. Additionally, TCB 635 interdependence can be applied to any protocol with congestion 636 state, including SCTP [RFC4960] and DCCP [RFC4340], as well as for 637 individual subflows in Multipath TCP [RFC6824]. 639 There may be other information that can be shared between concurrent 640 connections. For example, knowing that another connection has just 641 tried to expand its window size and failed, a connection may not 642 attempt to do the same for some period. The idea is that existing 643 TCP implementations infer the behavior of all competing connections, 644 including those within the same host or subnet. One possible 645 optimization is to make that implicit feedback explicit, via 646 extended information associated with the endpoint IP address and its 647 TCP implementation, rather than per-connection state in the TCB. 649 10. Implementation Observations 651 The observation that some TCB state is host-pair specific rather 652 than application-pair dependent is not new and is a common 653 engineering decision in layered protocol implementations. Although 654 now deprecated, T/TCP [RFC1644] was the first to propose using 655 caches in order to maintain TCB states (see Appendix A). 657 The table below describes the current implementation status for some 658 TCB temporal sharing in Linux kernel version 4.6, FreeBSD 10 and 659 Windows as of October 2016. Ensemble sharing is not yet implemented. 

             CURRENT IMPLEMENTATION STATUS (as of 2016)

   TCB data        Status
   ------------------------------------------------------------
   old_MMS_S       Not shared

   old_MMS_R       Not shared

   old_sendMSS     Cached and shared in Linux (MSS)

   old_PMTU        Cached and shared in FreeBSD and Windows (PMTU)

   old_RTT         Cached and shared in FreeBSD and Linux

   old_RTTvar      Cached and shared in FreeBSD

   old_TFOinfo     Cached and shared in Linux and Windows

   old_snd_cwnd    Not shared

   old_ssthresh    Cached and shared in FreeBSD and Linux*

   *Note: In FreeBSD, the new ssthresh is the mean of curr_ssthresh and
   the previous value, if a previous value exists; in Linux, the
   calculation depends on state and is max(curr_cwnd/2, old_ssthresh)
   in most cases.

11. Updates to RFC 2140

   This document updates the description of TCB sharing in RFC 2140 and
   its associated impact on existing and new connection state,
   providing a complete replacement for that document [RFC2140]. It
   clarifies the previous description and terminology and extends the
   mechanism to its impact on new protocols and mechanisms, including
   multipath TCP, fast open, PLPMTUD, NAT, and the TCP Authentication
   Option.

   The detailed impact on TCB state covers TCB parameters in greater
   detail: it addresses RSS in both the send and receive directions,
   treats MSS and send-MSS separately, adds path MTU and ssthresh, and
   addresses the impact on TCP option state.

   New sections have been added to address compatibility issues and
   implementation observations. The relation of this work to T/TCP has
   been moved to Appendix A on history, partly to reflect the
   deprecation of that protocol.

   Appendix C has been added to discuss the potential to use temporal
   sharing over long timescales to adapt TCP's initial window
   automatically, largely imported from [To12].

   Finally, this document updates and significantly expands the
   referenced literature.

12. Security Considerations

   The implementation methods presented here do not have additional
   ramifications for explicit attacks. They may be susceptible to
   denial-of-service attacks if not otherwise secured.

   TCB sharing may be susceptible to denial-of-service attacks
   wherever the TCB is shared: between connections in a single host, or
   between hosts if TCB sharing is implemented within a subnet (see
   Section 9.2). Some shared TCB parameters are used only to create new
   TCBs; others are shared among the TCBs of ongoing connections. New
   connections can join the ongoing set, e.g., to optimize send window
   size among a set of connections to the same host.

   Attacks on parameters used only for initialization affect only the
   transient performance of a TCP connection. For short connections,
   the performance ramification can approach that of a
   denial-of-service attack. For example, if an application changes its
   TCB to have a false and small window size, subsequent connections
   would experience performance degradation until their window grew
   appropriately.

   TCB sharing reuses and mixes information from past and current
   connections. Although reusing information could create a potential
   for fingerprinting to identify hosts, the mixing reduces that
   potential. There has been no evidence of fingerprinting based on
   this technique and it is currently considered safe in that regard.

13. IANA Considerations

   There are no IANA implications or requests in this document.

   This section should be removed upon final publication as an RFC.

14. References

14.1. Normative References

   [RFC793]  Postel, Jon, "Transmission Control Protocol," Network
             Working Group RFC-793/STD-7, ISI, Sept. 1981.

   [RFC1122] Braden, R. (ed), "Requirements for Internet Hosts --
             Communication Layers", RFC-1122, Oct. 1989.

   [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery," RFC 1191,
             Nov. 1990.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4821] Mathis, M., Heffner, J., "Packetization Layer Path MTU
             Discovery," RFC 4821, Mar. 2007.

   [RFC5681] Allman, M., Paxson, V., Blanton, E., "TCP Congestion
             Control," RFC 5681 (Standards Track), Sep. 2009.

   [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., Jain, A., "TCP Fast
             Open", RFC 7413, Dec. 2014.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", RFC 8174, May 2017.

   [RFC8201] McCann, J., Deering, S., Mogul, J., Hinden, R. (Ed.),
             "Path MTU Discovery for IP version 6," RFC 8201, Jul.
             2017.

14.2. Informative References

   [Al10]    Allman, M., "Initial Congestion Window Specification",
             (work in progress), draft-allman-tcpm-bump-initcwnd-00,
             Nov. 2010.

   [Ba12]    Barik, R., Welzl, M., Ferlin, S., Alay, O., "LISA: A
             Linked Slow-Start Algorithm for MPTCP", IEEE ICC, Kuala
             Lumpur, Malaysia, May 23-27 2016.

   [Be94]    Berners-Lee, T., et al., "The World-Wide Web,"
             Communications of the ACM, V37, Aug. 1994, pp. 76-82.

   [Br94]    Braden, B., "T/TCP -- Transaction TCP: Source Changes for
             Sun OS 4.1.3", Release 1.0, USC/ISI, September 14, 1994.

   [Br02]    Brownlee, N. and K. Claffy, "Understanding Internet
             Traffic Streams: Dragonflies and Tortoises", IEEE
             Communications Magazine p110-117, 2002.

   [Co91]    Comer, D., Stevens, D., Internetworking with TCP/IP, V2,
             Prentice-Hall, NJ, 1991.

   [Du16]    Dukkipati, N., Cheng, Y., and Vahdat, A., "Research
             Impacting the Practice of Congestion Control." ACM SIGCOMM
             CCR (editorial), on-line post, July 2016.

   [FreeBSD] FreeBSD source code, Release 2.10, http://www.freebsd.org/

   [Hu01]    Hughes, A., Touch, J., Heidemann, J., "Issues in Slow-
             Start Restart After Idle", draft-hughes-restart-00
             (expired), Dec. 2001.

   [Hu12]    Hurtig, P., Brunstrom, A., "Enhanced metric caching for
             short TCP flows," 2012 IEEE International Conference on
             Communications (ICC), Ottawa, ON, 2012, pp. 1209-1213.

   [Ja88]    Jacobson, V., M. Karels, "Congestion Avoidance and
             Control", Proc. Sigcomm 1988.

   [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions
             Functional Specification," RFC-1644, July 1994.

   [RFC1379] Braden, R., "Transaction TCP -- Concepts," RFC-1379,
             September 1992.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
             Retransmit, and Fast Recovery Algorithms", RFC2001
             (Standards Track), Jan. 1997.

   [RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140,
             April 1997.

   [RFC2414] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
             Initial Window", RFC 2414 (Experimental), Sept. 1998.

   [RFC2581] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
             Control," RFC2581 (Standards Track), Apr. 1999.

   [RFC2663] Srisuresh, P., Holdrege, M., "IP Network Address
             Translator (NAT) Terminology and Considerations", RFC-
             2663, August 1999.

   [RFC2861] Handley, M., Padhye, J., Floyd, S., "TCP Congestion Window
             Validation", RFC2861 (Experimental), June 2000.

   [RFC3390] Allman, M., Floyd, S., Partridge, C., "Increasing TCP's
             Initial Window," RFC 3390, Oct. 2002.

   [RFC3124] Balakrishnan, H., Seshan, S., "The Congestion Manager,"
             RFC 3124, June 2001.

   [RFC4340] Kohler, E., Handley, M., Floyd, S., "Datagram Congestion
             Control Protocol (DCCP)," RFC 4340, Mar. 2006.

   [RFC4960] Stewart, R., (Ed.), "Stream Control Transmission
             Protocol," RFC4960, Sept. 2007.

   [RFC5925] Touch, J., Mankin, A., Bonica, R., "The TCP Authentication
             Option," RFC 5925, June 2010.

   [RFC6824] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., "TCP
             Extensions for Multipath Operation with Multiple
             Addresses," RFC 6824, Jan. 2013.

   [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., Mathis, M., "Increasing
             TCP's Initial Window," RFC 6928, Apr. 2013.

   [RFC7231] Fielding, R., J. Reschke, Eds., "HTTP/1.1 Semantics and
             Content," RFC-7231, June 2014.

   [RFC7323] Borman, D., B. Braden, V. Jacobson, R. Scheffenegger
             (Ed.), "TCP Extensions for High Performance," RFC 7323,
             Sept. 2014.

   [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., Khasnabish,
             B., "Mechanisms for Optimizing Link Aggregation Group
             (LAG) and Equal-Cost Multipath (ECMP) Component Link
             Utilization in Networks", RFC 7424, Jan. 2015.

   [RFC7540] Belshe, M., Peon, R., Thomson, M., "Hypertext Transfer
             Protocol Version 2 (HTTP/2)", RFC 7540, May 2015.

   [RFC7661] Fairhurst, G., Sathiaseelan, A., Secchi, R., "Updating TCP
             to Support Rate-Limited Traffic", RFC 7661, Oct. 2015.

   [To12]    Touch, J., "Automating the Initial Window in TCP," draft-
             touch-tcpm-automatic-iw-03 (expired), July 2012.

15. Acknowledgments

   The authors would like to thank Praveen Balasubramanian for
   information regarding TCB sharing in Windows, and Yuchung Cheng,
   Lars Eggert, Ilpo Jarvinen and Michael Scharf for comments on
   earlier versions of the draft. Earlier revisions of this work
   received funding from a collaborative research project between the
   University of Oslo and Huawei Technologies Co., Ltd. and were partly
   supported by USC/ISI's Postel Center.

   This document was prepared using 2-Word-v2.0.template.dot.

16. Change log

   This section should be removed upon final publication as an RFC.

   ietf-04:

   - Fix internal cross-reference errors that appeared in ietf-02
   - Updated tables to re-center; clarified text

   ietf-03:

   - Correction of typographic errors, minor rewording in appendices

   ietf-02:

   - Minor reorganization and correction of typographic errors
   - Added text to address fingerprinting in Security section
   - Now retains Appendix B and body option tables upon publication

   ietf-01:

   - Added Appendix C to address long-timescale temporal adaptation

   ietf-00:

   - Re-issued as draft-ietf-tcpm-2140bis due to WG adoption.
   - Cleaned orphan references to T/TCP, removed incomplete refs
   - Moved references to informative section and updated Sec 2
   - Updated to clarify no impact to interoperability
   - Updated appendix B to avoid 2119 language

   06:

   - Changed to update 2140, cite it normatively, and summarize the
     updates in a separate section

   05:

   - Fixed some TBDs.

   04:

   - Removed BCP-style recommendations and fixed some TBDs.

   03:

   - Updated Touch's affiliation and address information

   02:

   - Stated that our OS implementation overview table only covers
     temporal sharing.

   - Correctly reflected sharing of old_RTT in Linux in the
     implementation overview table.

   - Marked entries that are considered safe to share with an
     asterisk (suggestion was to split the table)

   - Discussed correct host identification: NATs may make IP
     addresses the wrong input, could e.g. use HTTP cookie.

   - Included MMS_S and MMS_R from RFC1122; fixed the use of MSS and
     MTU
   - Added information about option sharing, listed options in
     Appendix B

Authors' Addresses

   Joe Touch
   Manhattan Beach, CA 90266
   USA

   Phone: +1 (310) 560-0334
   Email: touch@strayalpha.com

   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no

   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no

Appendix A: TCB Sharing History

   T/TCP proposed using caches to maintain TCB information across
   instances (temporal sharing), e.g., smoothed RTT, RTT variance,
   congestion avoidance threshold, and MSS [RFC1644]. These values were
   in addition to connection counts used by T/TCP to accelerate data
   delivery prior to the full three-way handshake during an OPEN. The
   goal was to aggregate TCB components where they reflect one
   association - that of the host-pair, rather than artificially
   separating those components by connection.

   At least one T/TCP implementation saved the MSS and aggregated the
   RTT parameters across multiple connections but omitted caching the
   congestion window information [Br94], as originally specified in
   [RFC1379]. Some T/TCP implementations immediately updated MSS when
   the TCP MSS header option was received [Br94], although this was not
   addressed specifically in the concepts or functional specification
   [RFC1379][RFC1644]. In later T/TCP implementations, RTT values were
   updated only after a CLOSE, which does not benefit concurrent
   sessions.

   Temporal sharing of cached TCB data was originally implemented in
   the SunOS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same
   [FreeBSD].
   As mentioned before, only the MSS and RTT parameters were
   cached, as originally specified in [RFC1379]. Later discussion of
   T/TCP suggested including congestion control parameters in this
   cache; for example, [RFC1644] (Section 3.1) hints at initializing
   the congestion window to the old window size.

Appendix B: TCP Option Sharing and Caching

   In addition to the options that can be cached and shared, this memo
   also lists known options for which state is unsafe to be kept. This
   list is not intended to be authoritative or exhaustive.

   Obsolete (unsafe to keep state):

      ECHO

      ECHO REPLY

      PO Conn permitted

      PO service profile

      CC

      CC.NEW

      CC.ECHO

      Alt CS req

      Alt CS data

   No state to keep:

      EOL

      NOP

      WS

      SACK

      TS

      MD5

      TCP-AO

      EXP1

      EXP2

   Unsafe to keep state:

      Skeeter (DH exchange, known to be vulnerable)

      Bubba (DH exchange, known to be vulnerable)

      Trailer CS

      SCPS capabilities

      S-NACK

      Record boundaries

      Corruption experienced

      SNAP

      TCP Compression

      Quickstart response

      UTO

      MPTCP negotiation success (see below for negotiation failure)

      TFO negotiation success (see below for negotiation failure)

   Safe but optional to keep state:

      MPTCP negotiation failure (to avoid negotiation retries)

      MSS

      TFO negotiation failure (to avoid negotiation retries)

   Safe and necessary to keep state:

      TFO cookie (if TFO succeeded in the past)

Appendix C: Automating the Initial Window in TCP over Long Timescales

   Note: this section is imported from [To12], updated only to refer to
   itself as an appendix.

C.1. Introduction

   TCP's congestion control algorithm uses an initial window value
   (IW), both as a starting point for new connections and after one RTO
   or more [RFC2581][RFC2861].
   This value has evolved over time,
   originally one maximum segment size (MSS), and increased to the
   lesser of four MSS or 4,380 bytes [RFC3390][RFC5681]. For typical
   Internet connections with a maximum transmission unit (MTU) of
   1500 bytes, this permits three segments of 1,460 bytes each.

   The IW value was originally implied in the original TCP congestion
   control description, and documented as a standard in 1997
   [RFC2001][Ja88]. The value was last updated in 1998 experimentally,
   and moved to the standards track in 2002 [RFC2414][RFC3390]. There
   have been recent proposals to update the IW based on further
   increases in host and router capabilities and network capacity, some
   focusing on specific values (e.g., IW=10), and others prescribing a
   schedule for increases over time (e.g., IW=6 for 2011, increasing by
   1-2 MSS per year).

   This appendix discusses how TCP can objectively measure when an IW
   is too large, and argues that such feedback should be used over long
   timescales to adjust the IW automatically. The result should be
   safer to deploy and might avoid the need to repeatedly revisit IW
   size over time.

   Note that this mechanism attempts to make the IW more adaptive over
   time. It can increase the IW beyond that which is currently
   recommended for widescale deployment, and so its use should be
   carefully monitored.

C.2. Design Considerations

   TCP's IW value has existed statically for over two decades, so any
   solution to adjusting the IW dynamically should have similarly
   stable, non-invasive effects on the performance and complexity of
   TCP. In order to be fair, the IW should be similar for most machines
   on the public Internet. Finally, a desirable goal is to develop a
   self-correcting algorithm, so that IW values that cause network
   problems can be avoided. To that end, we propose the following list
   of design goals:

   o  Impart little to no impact to TCP in the absence of loss, i.e.,
      it should not increase the complexity of default packet
      processing in the normal case.

   o  Adapt to network feedback over long timescales, avoiding values
      that persistently cause network problems.

   o  Decrease the IW in the presence of sustained loss of IW segments,
      as determined over a number of different connections.

   o  Increase the IW in the absence of sustained loss of IW segments,
      as determined over a number of different connections.

   o  Operate conservatively, i.e., tend towards leaving the IW the
      same in the absence of sufficient information, and give greater
      consideration to IW segment loss than IW segment success.

   We expect that, without other context, a good IW algorithm will
   converge to a single value, but this is not required. An endpoint
   with additional context or information, or deployed in a constrained
   environment, can always use a different value. Specifically,
   information from previous connections, or sets of connections with a
   similar path, can already be used as context for such decisions (as
   noted in the core of this document).

   However, if a given IW value persistently causes packet loss during
   the initial burst of packets, it is clearly inappropriate and could
   be inducing unnecessary loss in other competing connections. This
   might happen for sites behind very slow boxes with small buffers,
   which may or may not be the first hop.

C.3. Proposed IW Algorithm

   Below is a simple description of the proposed IW algorithm. It
   relies on the following parameters:

   o  MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])

   o  MaxIW = 10 MSS

   o  MulDecr = 0.5

   o  AddIncr = 2 MSS

   o  Threshold = 0.05

   We assume that the minimum IW (MinIW) should be as currently
   specified [RFC3390].
   The maximum IW can be set to a fixed value
   [RFC6928], or set based on a schedule if trusted time references are
   available [Al10]; here we prefer a fixed value. We also propose to
   use an AIMD algorithm, with increases and decreases as noted.

   Although these parameters are somewhat arbitrary, their initial
   values are not important except that the algorithm is AIMD and the
   MaxIW should not exceed that recommended for other systems on the
   Internet. Current proposals, including default current operation,
   are degenerate cases of the algorithm below for given parameters -
   notably MulDecr = 1.0 and AddIncr = 0 MSS, thus disabling the
   automatic part of the algorithm.

   The proposed algorithm is as follows:

   1. On boot:

      IW = MaxIW; # assume this is in bytes, and an even number of MSS

   2. Upon starting a new connection:

      CWND = IW;
      conncount++;
      IWnotchecked = 1; # true

   3. During a connection's SYN-ACK processing, if SYN-ACK includes
      ECN, treat as if the IW is too large:

      if (IWnotchecked && (synackecn == 1)) {
         losscount++;
         IWnotchecked = 0; # never check again
      }

   4. During a connection, if retransmission occurs, check the seqno of
      the outgoing packet (in bytes) to see if the resent segment fixes
      an IW loss:

      if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) {
         losscount++;
         IWnotchecked = 0; # never do this entire "if" again
      } else {
         IWnotchecked = 0; # you're beyond the IW so stop checking
      }

   5. Once every 1000 connections, as a separate process (i.e., not as
      part of processing a given connection):

      if (conncount > 1000) {
         if (losscount/conncount > Threshold) {
            # the number of connections with errors is too high
            IW = IW * MulDecr;
         } else {
            IW = IW + AddIncr;
         }
      }

   We recognize that this algorithm can yield a false positive when the
   sequence number wraps around.
   This can be avoided using either PAWS
   [RFC7323] context or 64-bit internal sequence numbers (as in TCP-AO
   [RFC5925]). Alternatively, false positives can be allowed since they
   are expected to be infrequent and thus will not affect the overall
   statistics of the algorithm.

   The following additional constraints are imposed:

   >> The automatic IW algorithm MUST initialize to MaxIW, in the
   absence of other context information.

   If there are too few connections to make a decision or if there is
   otherwise insufficient information to increase the IW, then the
   MaxIW defaults to the current recommended value.

   >> An implementation may allow the MaxIW to grow beyond the
   currently recommended Internet default, but not more than 2 segments
   per calendar year.

   If an endpoint has a persistent history of successfully transmitting
   IW segments without loss, then it is allowed to probe the Internet
   to determine if larger IW values have similar success. This probing
   is limited and requires a trusted time source; otherwise the MaxIW
   remains constant.

   >> An implementation MUST adjust the IW based on loss statistics at
   least once every 1000 connections.

   An endpoint needs to be sufficiently reactive to IW loss.

   >> An implementation MUST decrease the IW by at least one MSS when
   indicated during an evaluation interval.

   An endpoint that detects loss needs to decrease its IW by at least
   one MSS; otherwise it is not participating in an automatic reactive
   algorithm.

   >> An implementation MUST increase by no more than 2 MSS per
   evaluation interval.

   An endpoint that does not experience IW loss needs to probe the
   network incrementally.

   >> An implementation SHOULD use an IW that is an integer multiple of
   2 MSS.

   The IW should remain a multiple of 2 MSS segments, to enable
   efficient ACK compression without incurring unnecessary timeouts.

   >> An implementation MUST decrease the IW if more than 95% of
   connections have IW losses.

   Again, this is to ensure an implementation is sufficiently reactive.

   >> An implementation MAY group IW values and statistics within
   subsets of connections. Such grouping MAY use any information about
   connections to form groups except loss statistics.

   There are some TCP connections which might not be counted at all,
   such as those to/from loopback addresses, or those within the same
   subnet as that of a local interface (for which congestion control is
   sometimes disabled anyway). This may also include connections that
   terminate before the IW is full, i.e., as a separate check at the
   time of the connection closing.

   The period over which the IW is updated is intended to be a long
   timescale, e.g., a month or so, or 1,000 connections, whichever is
   longer. An implementation might check the IW once a month, and
   simply not update the IW or clear the connection counts in months
   where the number of connections is too small.

C.4. Discussion

   There are numerous parameters to the above algorithm that are
   compliant with the given requirements; this is intended to allow
   variation in configuration and implementation while ensuring that
   all such algorithms are reactive and safe.

   This algorithm continues to assume segments because that is the
   basis of most TCP implementations. It might be useful to consider
   revising the specifications to allow byte-based congestion given
   sufficient experience.

   The algorithm checks for IW losses only during the first IW after a
   connection start; it does not check for IW losses elsewhere that the
   IW is used, e.g., during slow-start restarts.

   >> An implementation MAY detect IW losses during slow-start restarts
   in addition to losses during the first IW of a connection.
   In this
   case, the implementation MUST count each restart as a "connection"
   for the purposes of connection counts and periodic rechecking of the
   IW value.

   False positives can occur during some kinds of segment reordering,
   e.g., that might trigger spurious retransmissions even without a
   true segment loss. These are not expected to be sufficiently common
   to dominate the algorithm and its conclusions.

   This mechanism does require additional per-connection state, which
   is currently common in some implementations and is useful for other
   reasons (e.g., the ISN is used in TCP-AO [RFC5925]). The mechanism
   also benefits from persistent state kept across reboots, as would
   other state sharing mechanisms (e.g., TCP Control Block Sharing
   [RFC2140]). The mechanism is inspired by RFC 2140's use of
   information across connections.

   The receive window (RWIN) is not involved in this calculation. The
   size of RWIN is determined by receiver resources, and provides space
   to accommodate segment reordering. It is not involved with
   congestion control, which is the focus of this document and its
   management of the IW.

C.5. Observations

   The IW may not converge to a single, global value. It also may not
   converge at all, but rather may oscillate by a few MSS as it
   repeatedly probes the Internet for larger IWs and fails. Both
   properties are consistent with TCP behavior during each individual
   connection.

   This mechanism assumes that losses during the IW are due to IW size.
   Persistent errors that drop packets for other reasons (e.g., OS
   bugs) can cause false positives. Again, this is consistent with
   TCP's basic assumption that loss is caused by congestion and
   requires backoff. This algorithm treats the IW of new connections as
   a long-timescale backoff system.
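
   As an illustration only, the five steps of the algorithm in C.3 can
   be sketched in Python as shown below. This is a minimal sketch under
   stated assumptions, not a reference implementation: the class and
   method names (IwState, Connection, start_connection, on_synack,
   on_retransmit, evaluate) are hypothetical, the MSS is assumed to be
   1,460 bytes, the result is clamped to the range MinIW..MaxIW, the
   counters are reset after each evaluation, and the offset from the
   ISN is computed modulo 2^32. None of those last three choices is
   stated in the algorithm text.

```python
MSS = 1460            # assumed segment size in bytes (1500-byte MTU)
MIN_IW = 3 * MSS      # MinIW
MAX_IW = 10 * MSS     # MaxIW
MUL_DECR = 0.5        # MulDecr
ADD_INCR = 2 * MSS    # AddIncr
THRESHOLD = 0.05      # Threshold
INTERVAL = 1000       # connections per evaluation interval
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits wide


class IwState:
    """Host-wide IW plus the counters used by steps 1, 2, and 5."""

    def __init__(self):
        self.iw = MAX_IW          # step 1: on boot, IW = MaxIW (bytes)
        self.conncount = 0
        self.losscount = 0

    def start_connection(self, isn):
        """Step 2: count the connection; it starts with CWND = IW."""
        self.conncount += 1
        return Connection(self, isn)

    def evaluate(self):
        """Step 5: run separately from per-connection processing."""
        if self.conncount <= INTERVAL:
            return                # too few connections to decide
        if self.losscount / self.conncount > THRESHOLD:
            self.iw = int(self.iw * MUL_DECR)   # multiplicative decrease
        else:
            self.iw = self.iw + ADD_INCR        # additive increase
        # Assumption: clamp to [MinIW, MaxIW] and start a new interval.
        self.iw = max(MIN_IW, min(MAX_IW, self.iw))
        self.conncount = 0
        self.losscount = 0


class Connection:
    """Per-connection state: the ISN and the IWnotchecked flag."""

    def __init__(self, state, isn):
        self.state = state
        self.isn = isn
        self.iw = state.iw        # IW in effect when the connection began
        self.cwnd = self.iw
        self.iw_not_checked = True

    def on_synack(self, ecn):
        """Step 3: ECN on the SYN-ACK is treated as an IW loss."""
        if self.iw_not_checked and ecn:
            self.state.losscount += 1
            self.iw_not_checked = False

    def on_retransmit(self, seqno):
        """Step 4: count a retransmission that falls inside the IW."""
        offset = (seqno - self.isn) % SEQ_SPACE
        if self.iw_not_checked and offset < self.iw:
            self.state.losscount += 1
        self.iw_not_checked = False   # checked once, never again
```

   In this sketch, a host would call start_connection() for each new
   connection, feed SYN-ACK and retransmission events to the returned
   object, and call evaluate() from a periodic task. With the
   parameters above, an interval in which more than 5% of connections
   saw an IW loss halves the IW, and a clean interval adds 2 MSS, up to
   MaxIW.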