Internet Engineering Task Force                             W. Eddy, Ed.
Internet-Draft                                               MTI Systems
Obsoletes: 793, 879, 2873, 6093, 6429, 6528,                7 March 2022
           6691 (if approved)
Updates: 5961, 1011, 1122 (if approved)
Intended status: Standards Track
Expires: 8 September 2022

           Transmission Control Protocol (TCP) Specification
                      draft-ietf-tcpm-rfc793bis-28

Abstract

   This document specifies the Transmission Control Protocol (TCP).  TCP
   is an important transport layer protocol in the Internet protocol
   stack, and has continuously evolved over decades of use and growth of
   the Internet.  Over this time, a number of changes have been made to
   TCP as it was specified in RFC 793, though these have only been
   documented in a piecemeal fashion.  This document collects and brings
   those changes together with the protocol specification from RFC 793.
   This document obsoletes RFC 793, as well as RFCs 879, 2873, 6093,
   6429, 6528, and 6691 that updated parts of RFC 793.  It updates RFCs
   1011 and 1122, and should be considered as a replacement for the
   portions of those documents dealing with TCP requirements.  It also
   updates RFC 5961 by adding a small clarification in reset handling
   while in the SYN-RECEIVED state.  The TCP header control bits from
   RFC 793 have also been updated based on RFC 3168.

   RFC EDITOR NOTE: If approved for publication as an RFC, this should
   be marked additionally as "STD: 7" and replace RFC 793 in that role.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 September 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Purpose and Scope
   2.  Introduction
     2.1.  Requirements Language
     2.2.  Key TCP Concepts
   3.  Functional Specification
     3.1.  Header Format
     3.2.  Specific Option Definitions
       3.2.1.  Other Common Options
       3.2.2.  Experimental TCP Options
     3.3.  TCP Terminology Overview
       3.3.1.  Key Connection State Variables
       3.3.2.  State Machine Overview
     3.4.  Sequence Numbers
       3.4.1.  Initial Sequence Number Selection
       3.4.2.  Knowing When to Keep Quiet
       3.4.3.  The TCP Quiet Time Concept
     3.5.  Establishing a connection
       3.5.1.  Half-Open Connections and Other Anomalies
       3.5.2.  Reset Generation
       3.5.3.  Reset Processing
     3.6.  Closing a Connection
       3.6.1.  Half-Closed Connections
     3.7.  Segmentation
       3.7.1.  Maximum Segment Size Option
       3.7.2.  Path MTU Discovery
       3.7.3.  Interfaces with Variable MTU Values
       3.7.4.  Nagle Algorithm
       3.7.5.  IPv6 Jumbograms
     3.8.  Data Communication
       3.8.1.  Retransmission Timeout
       3.8.2.  TCP Congestion Control
       3.8.3.  TCP Connection Failures
       3.8.4.  TCP Keep-Alives
       3.8.5.  The Communication of Urgent Information
       3.8.6.  Managing the Window
     3.9.  Interfaces
       3.9.1.  User/TCP Interface
       3.9.2.  TCP/Lower-Level Interface
     3.10. Event Processing
       3.10.1.  OPEN Call
       3.10.2.  SEND Call
       3.10.3.  RECEIVE Call
       3.10.4.  CLOSE Call
       3.10.5.  ABORT Call
       3.10.6.  STATUS Call
       3.10.7.  SEGMENT ARRIVES
       3.10.8.  Timeouts
   4.  Glossary
   5.  Changes from RFC 793
   6.  IANA Considerations
   7.  Security and Privacy Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Other Implementation Notes
     A.1.  IP Security Compartment and Precedence
       A.1.1.  Precedence
       A.1.2.  MLS Systems
     A.2.  Sequence Number Validation
     A.3.  Nagle Modification
     A.4.  Low Watermark Settings
   Appendix B.  TCP Requirement Summary
   Author's Address

1.  Purpose and Scope

   In 1981, RFC 793 [16] was released, documenting the Transmission
   Control Protocol (TCP), and replacing earlier specifications for TCP
   that had been published in the past.

   Since then, TCP has been widely implemented, and has been used as a
   transport protocol for numerous applications on the Internet.

   For several decades, RFC 793 plus a number of other documents have
   combined to serve as the core specification for TCP [50].  Over time,
   a number of errata have been filed against RFC 793.
   There have also been deficiencies found and resolved in security,
   performance, and many other aspects.  The number of enhancements has
   grown over time across many separate documents.  These were never
   accumulated together into a comprehensive update to the base
   specification.

   The purpose of this document is to bring together all of the IETF
   Standards Track changes and other clarifications that have been made
   to the base TCP functional specification and unify them into an
   updated version of RFC 793.

   Some companion documents are referenced for important algorithms that
   are used by TCP (e.g., for congestion control), but have not been
   completely included in this document.  This is a conscious choice, as
   this base specification can be used with multiple additional
   algorithms that are developed and incorporated separately.  This
   document focuses on the common basis all TCP implementations must
   support in order to interoperate.  Since some additional TCP features
   have become quite complicated themselves (e.g., advanced loss
   recovery and congestion control), future companion documents may
   attempt to similarly bring these together.

   In addition to the protocol specification that describes the TCP
   segment format, generation, and processing rules that are to be
   implemented in code, RFC 793 and other updates also contain
   informative and descriptive text for readers to understand aspects of
   the protocol design and operation.  This document does not attempt to
   alter or update this informative text, and is focused only on
   updating the normative protocol specification.  This document
   preserves references to the documentation containing the important
   explanations and rationale, where appropriate.

   This document is intended to be useful both in checking existing TCP
   implementations for conformance and in writing new implementations.

2.  Introduction

   RFC 793 contains a discussion of the TCP design goals and provides
   examples of its operation, including examples of connection
   establishment, connection termination, and packet retransmission to
   repair losses.

   This document describes the basic functionality expected in modern
   TCP implementations, and replaces the protocol specification in RFC
   793.  It does not replicate or attempt to update the introduction and
   philosophy content in Sections 1 and 2 of RFC 793.  Other documents
   are referenced to provide explanation of the theory of operation,
   rationale, and detailed discussion of design decisions.  This
   document only focuses on the normative behavior of the protocol.

   The "TCP Roadmap" [50] provides a more extensive guide to the RFCs
   that define TCP and describe various important algorithms.  The TCP
   Roadmap contains sections on strongly encouraged enhancements that
   improve performance and other aspects of TCP beyond the basic
   operation specified in this document.  As one example, implementing
   congestion control (e.g., [8]) is a TCP requirement, but is a complex
   topic on its own, and not described in detail in this document, as
   there are many options and possibilities that do not impact basic
   interoperability.  Similarly, most TCP implementations today include
   the high-performance extensions in [48], but these are not strictly
   required or discussed in this document.  Multipath considerations for
   TCP are also specified separately in [59].

   A list of changes from RFC 793 is contained in Section 5.

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [3][12] when, and only when, they appear in all capitals, as shown
   here.
   Each use of RFC 2119 keywords in the document is individually labeled
   and referenced in Appendix B, which summarizes implementation
   requirements.

   Sentences using "MUST" are labeled as "MUST-X" with X being a numeric
   identifier enabling the requirement to be located easily when
   referenced from Appendix B.

   Similarly, sentences using "SHOULD" are labeled with "SHLD-X", "MAY"
   with "MAY-X", and "RECOMMENDED" with "REC-X".

   For the purposes of this labeling, "SHOULD NOT" and "MUST NOT" are
   labeled the same as "SHOULD" and "MUST" instances.

2.2.  Key TCP Concepts

   TCP provides a reliable, in-order, byte-stream service to
   applications.

   The application byte-stream is conveyed over the network via TCP
   segments, with each TCP segment sent as an Internet Protocol (IP)
   datagram.

   TCP reliability consists of detecting packet losses (via sequence
   numbers) and errors (via per-segment checksums), as well as
   correction via retransmission.

   TCP supports unicast delivery of data.  Anycast applications exist
   that successfully use TCP without modifications, though there is some
   risk of instability due to changes of lower-layer forwarding behavior
   [47].

   TCP is connection-oriented, though it does not inherently include a
   liveness detection capability.

   Data flow is supported bidirectionally over TCP connections, though
   applications are free to send data only unidirectionally, if they so
   choose.

   TCP uses port numbers to identify application services and to
   multiplex distinct flows between hosts.

   A more detailed description of TCP features compared to other
   transport protocols can be found in Section 3.1 of [53].  Further
   description of the motivations for developing TCP and its role in the
   Internet protocol stack can be found in Section 2 of [16] and earlier
   versions of the TCP specification.

3.  Functional Specification

3.1.  Header Format

   TCP segments are sent as internet datagrams.  The Internet Protocol
   (IP) header carries several information fields, including the source
   and destination host addresses [1] [13].  A TCP header follows the IP
   headers, supplying information specific to the TCP protocol.  This
   division allows for the existence of host level protocols other than
   TCP.  In early development of the Internet suite of protocols, the IP
   header fields had been a part of TCP.

   This document describes the TCP protocol.  The TCP protocol uses TCP
   Headers.

   A TCP Header, followed by any user data in the segment, is formatted
   as follows, using the style from [67]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Acknowledgment Number                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |       |C|E|U|A|P|R|S|F|                               |
   | Offset| Rsrvd |W|C|R|C|S|S|Y|I|            Window             |
   |       |       |R|E|G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           [Options]                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               :
   :                             Data                              :
   :                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          Note that one tick mark represents one bit position.

                       Figure 1: TCP Header Format

   where:

   Source Port:  16 bits.
      The source port number.

   Destination Port:  16 bits.
      The destination port number.

   Sequence Number:  32 bits.
      The sequence number of the first data octet in this segment (except
      when the SYN flag is set).
      If SYN is set the sequence number is the initial sequence number
      (ISN) and the first data octet is ISN+1.

   Acknowledgment Number:  32 bits.
      If the ACK control bit is set, this field contains the value of the
      next sequence number the sender of the segment is expecting to
      receive.  Once a connection is established, this is always sent.

   Data Offset (DOffset):  4 bits.
      The number of 32 bit words in the TCP Header.  This indicates where
      the data begins.  The TCP header (even one including options) is an
      integer multiple of 32 bits long.

   Reserved (Rsrvd):  4 bits.
      A set of control bits reserved for future use.  Must be zero in
      generated segments and must be ignored in received segments, if
      corresponding future features are unimplemented by the sending or
      receiving host.

      The control bits are also known as "flags".  Assignment is managed
      by IANA from the "TCP Header Flags" registry [63].  The currently
      assigned control bits are CWR, ECE, URG, ACK, PSH, RST, SYN, and
      FIN.

   CWR:  1 bit.
      Congestion Window Reduced (see [6]).

   ECE:  1 bit.
      ECN-Echo (see [6]).

   URG:  1 bit.
      Urgent Pointer field is significant.

   ACK:  1 bit.
      Acknowledgment field is significant.

   PSH:  1 bit.
      Push Function (see the Send Call description in Section 3.9.1).

   RST:  1 bit.
      Reset the connection.

   SYN:  1 bit.
      Synchronize sequence numbers.

   FIN:  1 bit.
      No more data from sender.

   Window:  16 bits.
      The number of data octets beginning with the one indicated in the
      acknowledgment field that the sender of this segment is willing to
      accept.  The value is shifted when the Window Scaling extension is
      used [48].

      The window size MUST be treated as an unsigned number, or else
      large window sizes will appear like negative windows and TCP will
      not work (MUST-1).
      It is RECOMMENDED that implementations will reserve 32-bit fields
      for the send and receive window sizes in the connection record and
      do all window computations with 32 bits (REC-1).

   Checksum:  16 bits.
      The checksum field is the 16 bit ones' complement of the ones'
      complement sum of all 16 bit words in the header and text.  The
      checksum computation needs to ensure the 16-bit alignment of the
      data being summed.  If a segment contains an odd number of header
      and text octets, alignment can be achieved by padding the last
      octet with zeros on its right to form a 16 bit word for checksum
      purposes.  The pad is not transmitted as part of the segment.
      While computing the checksum, the checksum field itself is replaced
      with zeros.

      The checksum also covers a pseudo header (Figure 2) conceptually
      prefixed to the TCP header.  The pseudo header is 96 bits for IPv4
      and 320 bits for IPv6.  Including the pseudo header in the checksum
      gives the TCP connection protection against misrouted segments.
      This information is carried in IP headers and is transferred across
      the TCP/Network interface in the arguments or results of calls by
      the TCP implementation on the IP layer.

                  +--------+--------+--------+--------+
                  |           Source Address          |
                  +--------+--------+--------+--------+
                  |        Destination Address        |
                  +--------+--------+--------+--------+
                  |  zero  |  PTCL  |    TCP Length   |
                  +--------+--------+--------+--------+

                       Figure 2: IPv4 Pseudo Header

      Pseudo header components for IPv4:

         Source Address:  the IPv4 source address in network byte order

         Destination Address:  the IPv4 destination address in network
            byte order

         zero:  bits set to zero

         PTCL:  the protocol number from the IP header

         TCP Length:  the TCP header length plus the data length in
            octets (this is not an explicitly transmitted quantity, but is
            computed), and it does not count the 12 octets of the pseudo
            header.
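      As an illustration only, the IPv4 checksum procedure described
      above can be sketched in Python as follows.  The function names
      are assumptions of this sketch and are not defined by this
      document.  The caller is assumed to pass a complete segment
      (header plus data) whose checksum field has already been set to
      zero.

```python
import ipaddress
import struct

TCP_PROTOCOL_NUMBER = 6  # value carried in the PTCL field for TCP


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum; an odd tail is padded with a zero octet."""
    if len(data) % 2:
        data += b"\x00"  # pad on the right; the pad is never transmitted
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def tcp_checksum_ipv4(src: str, dst: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo header followed by the TCP segment."""
    pseudo_header = (
        ipaddress.IPv4Address(src).packed      # Source Address
        + ipaddress.IPv4Address(dst).packed    # Destination Address
        + struct.pack("!BBH", 0, TCP_PROTOCOL_NUMBER, len(segment))
    )                                          # zero, PTCL, TCP Length
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF
```

      In this sketch, a receiver can run the same function over a
      received segment with its checksum field left in place; a result
      of zero indicates that the checksum verifies.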
      For IPv6, the pseudo header is defined in Section 8.1 of RFC 8200
      [13], and contains the IPv6 Source Address and Destination
      Address, an Upper Layer Packet Length (a 32-bit value otherwise
      equivalent to TCP Length in the IPv4 pseudo header), three bytes
      of zero-padding, and a Next Header value (differing from the IPv6
      header value in the case of extension headers present in between
      IPv6 and TCP).

      The TCP checksum is never optional.  The sender MUST generate it
      (MUST-2) and the receiver MUST check it (MUST-3).

   Urgent Pointer:  16 bits.
      This field communicates the current value of the urgent pointer as
      a positive offset from the sequence number in this segment.  The
      urgent pointer points to the sequence number of the octet following
      the urgent data.  This field is only to be interpreted in segments
      with the URG control bit set.

   Options:  [TCP Option]; size(Options) == (DOffset-5)*32; present
      only when DOffset > 5.  Note that this size expression also
      includes any padding trailing the actual options present.

      Options may occupy space at the end of the TCP header and are a
      multiple of 8 bits in length.  All options are included in the
      checksum.  An option may begin on any octet boundary.  There are
      two cases for the format of an option:

         Case 1:  A single octet of option-kind.

         Case 2:  An octet of option-kind (Kind), an octet of
            option-length, and the actual option-data octets.

      The option-length counts the two octets of option-kind and
      option-length as well as the option-data octets.

      Note that the list of options may be shorter than the data offset
      field might imply.  The content of the header beyond the
      End-of-Option option MUST be header padding of zeros (MUST-69).

      The list of all currently defined options is managed by IANA [62],
      and each option is defined in other RFCs, as indicated there.
      That set includes experimental options that can be extended to
      support multiple concurrent usages [46].

      A given TCP implementation can support any currently defined
      options, but the following options MUST be supported (MUST-4 - note
      Maximum Segment Size option support is also part of MUST-19 in
      Section 3.7.2):

         Kind     Length   Meaning
         ----     ------   -------
          0         -      End of option list.
          1         -      No-Operation.
          2         4      Maximum Segment Size.

      These options are specified in detail in Section 3.2.

      A TCP implementation MUST be able to receive a TCP option in any
      segment (MUST-5).

      A TCP implementation MUST (MUST-6) ignore without error any TCP
      option it does not implement, assuming that the option has a length
      field.  All TCP options except End of option list and No-Operation
      MUST have length fields, including all future options (MUST-68).
      TCP implementations MUST be prepared to handle an illegal option
      length (e.g., zero); a suggested procedure is to reset the
      connection and log the error cause (MUST-7).

      Note: There is ongoing work to extend the space available for TCP
      options, such as [66].

   Data:  variable length.
      User data carried by the TCP segment.

3.2.  Specific Option Definitions

   A TCP Option, in the mandatory option set, is one of: an End of
   Option List Option, a No-Operation Option, or a Maximum Segment Size
   Option.

   An End of Option List Option is formatted as follows:

       0
       0 1 2 3 4 5 6 7
      +-+-+-+-+-+-+-+-+
      |       0       |
      +-+-+-+-+-+-+-+-+

   where:

   Kind:  1 byte; Kind == 0.
      This option code indicates the end of the option list.  This might
      not coincide with the end of the TCP header according to the Data
      Offset field.  This is used at the end of all options, not the end
      of each option, and need only be used if the end of the options
      would not otherwise coincide with the end of the TCP header.
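   The option format rules given in Section 3.1 (single-octet options,
   Kind/Length/data options, options this implementation does not
   act on, and illegal lengths) can be sketched in Python as
   follows.  This is a minimal illustration, not a normative procedure:
   the function names are assumptions of this sketch, and raising
   ValueError merely stands in for whatever error handling an
   implementation chooses for an illegal option length.

```python
from typing import List, Optional, Tuple

END_OF_OPTION_LIST = 0
NO_OPERATION = 1
MAXIMUM_SEGMENT_SIZE = 2


def parse_tcp_options(options: bytes) -> List[Tuple[int, bytes]]:
    """Return (kind, option-data) pairs found in the options area."""
    parsed = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == END_OF_OPTION_LIST:
            break  # the rest of the header is padding
        if kind == NO_OPERATION:
            i += 1  # single-octet option, no length field
            continue
        if i + 1 >= len(options):
            raise ValueError("option truncated before its length octet")
        length = options[i + 1]  # counts kind, length, and data octets
        if length < 2 or i + length > len(options):
            raise ValueError(f"illegal length {length} for option kind {kind}")
        parsed.append((kind, options[i + 2:i + length]))
        i += length  # the length field also lets unknown kinds be skipped
    return parsed


def mss_from_options(options: bytes) -> Optional[int]:
    """Return the advertised MSS, or None if the option is absent."""
    for kind, data in parse_tcp_options(options):
        if kind == MAXIMUM_SEGMENT_SIZE and len(data) == 2:
            return int.from_bytes(data, "big")
    return None  # other kinds are ignored without error
```

   Because every option other than End of Option List and No-Operation
   carries a length field, the loop can step over option kinds it does
   not recognize without knowing their internal layout.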
   A No-Operation Option is formatted as follows:

       0
       0 1 2 3 4 5 6 7
      +-+-+-+-+-+-+-+-+
      |       1       |
      +-+-+-+-+-+-+-+-+

   where:

   Kind:  1 byte; Kind == 1.
      This option code can be used between options, for example, to align
      the beginning of a subsequent option on a word boundary.  There is
      no guarantee that senders will use this option, so receivers MUST
      be prepared to process options even if they do not begin on a word
      boundary (MUST-64).

   A Maximum Segment Size Option is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       2       |     Length    |   Maximum Segment Size (MSS)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   where:

   Kind:  1 byte; Kind == 2.
      If this option is present, then it communicates the maximum receive
      segment size at the TCP endpoint that sends this segment.  This
      value is limited by the IP reassembly limit.  This field may be
      sent in the initial connection request (i.e., in segments with the
      SYN control bit set) and MUST NOT be sent in other segments
      (MUST-65).  If this option is not used, any segment size is
      allowed.  A more complete description of this option is provided in
      Section 3.7.1.

   Length:  1 byte; Length == 4.
      Length of the option in bytes.

   Maximum Segment Size (MSS):  2 bytes.
      The maximum receive segment size at the TCP endpoint that sends
      this segment.

3.2.1.  Other Common Options

   Additional RFCs define some other commonly used options that are
   recommended for high-performance implementations, but are not
   necessary for basic TCP interoperability.  These are the TCP
   Selective Acknowledgement (SACK) option [23][27], TCP Timestamp (TS)
   option [48], and TCP Window Scaling (WS) option [48].

3.2.2.  Experimental TCP Options

   Experimental TCP option values are defined in [31], and [46]
   describes the current recommended usage for these experimental
   values.

3.3.  TCP Terminology Overview

   This section includes an overview of key terms needed to understand
   the detailed protocol operation in the rest of the document.  There
   is a glossary of terms in Section 4.

3.3.1.  Key Connection State Variables

   Before discussing the operation of the TCP implementation in detail,
   we need to introduce some terminology.  The maintenance of a TCP
   connection requires maintaining state for several variables.  We
   conceive of these variables being stored in a connection record
   called a Transmission Control Block or TCB.  Among the variables
   stored in the TCB are the local and remote IP addresses and port
   numbers, the IP security level and compartment of the connection (see
   Appendix A.1), pointers to the user's send and receive buffers, and
   pointers to the retransmit queue and to the current segment.  In
   addition, several variables relating to the send and receive sequence
   numbers are stored in the TCB.

   Send Sequence Variables:

      SND.UNA - send unacknowledged
      SND.NXT - send next
      SND.WND - send window
      SND.UP  - send urgent pointer
      SND.WL1 - segment sequence number used for last window update
      SND.WL2 - segment acknowledgment number used for last window
                update
      ISS     - initial send sequence number

   Receive Sequence Variables:

      RCV.NXT - receive next
      RCV.WND - receive window
      RCV.UP  - receive urgent pointer
      IRS     - initial receive sequence number

   The following diagrams may help to relate some of these variables to
   the sequence space.
633 1 2 3 4 634 ----------|----------|----------|---------- 635 SND.UNA SND.NXT SND.UNA 636 +SND.WND 638 1 - old sequence numbers that have been acknowledged 639 2 - sequence numbers of unacknowledged data 640 3 - sequence numbers allowed for new data transmission 641 4 - future sequence numbers that are not yet allowed 643 Figure 3: Send Sequence Space 645 The send window is the portion of the sequence space labeled 3 in 646 Figure 3. 648 1 2 3 649 ----------|----------|---------- 650 RCV.NXT RCV.NXT 651 +RCV.WND 653 1 - old sequence numbers that have been acknowledged 654 2 - sequence numbers allowed for new reception 655 3 - future sequence numbers that are not yet allowed 657 Figure 4: Receive Sequence Space 659 The receive window is the portion of the sequence space labeled 2 in 660 Figure 4. 662 There are also some variables used frequently in the discussion that 663 take their values from the fields of the current segment. 665 Current Segment Variables: 667 SEG.SEQ - segment sequence number 668 SEG.ACK - segment acknowledgment number 669 SEG.LEN - segment length 670 SEG.WND - segment window 671 SEG.UP - segment urgent pointer 673 3.3.2. State Machine Overview 675 A connection progresses through a series of states during its 676 lifetime. The states are: LISTEN, SYN-SENT, SYN-RECEIVED, 677 ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, 678 TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional 679 because it represents the state when there is no TCB, and therefore, 680 no connection. Briefly the meanings of the states are: 682 LISTEN - represents waiting for a connection request from any 683 remote TCP peer and port. 685 SYN-SENT - represents waiting for a matching connection request 686 after having sent a connection request. 688 SYN-RECEIVED - represents waiting for a confirming connection 689 request acknowledgment after having both received and sent a 690 connection request. 
692 ESTABLISHED - represents an open connection, data received can be 693 delivered to the user. The normal state for the data transfer 694 phase of the connection. 696 FIN-WAIT-1 - represents waiting for a connection termination 697 request from the remote TCP peer, or an acknowledgment of the 698 connection termination request previously sent. 700 FIN-WAIT-2 - represents waiting for a connection termination 701 request from the remote TCP peer. 703 CLOSE-WAIT - represents waiting for a connection termination 704 request from the local user. 706 CLOSING - represents waiting for a connection termination request 707 acknowledgment from the remote TCP peer. 709 LAST-ACK - represents waiting for an acknowledgment of the 710 connection termination request previously sent to the remote TCP 711 peer (this termination request sent to the remote TCP peer already 712 included an acknowledgment of the termination request sent from 713 the remote TCP peer). 715 TIME-WAIT - represents waiting for enough time to pass to be sure 716 the remote TCP peer received the acknowledgment of its connection 717 termination request, and to avoid new connections being impacted 718 by delayed segments from previous connections. 720 CLOSED - represents no connection state at all. 722 A TCP connection progresses from one state to another in response to 723 events. The events are the user calls, OPEN, SEND, RECEIVE, CLOSE, 724 ABORT, and STATUS; the incoming segments, particularly those 725 containing the SYN, ACK, RST and FIN flags; and timeouts. 727 The OPEN call specifies whether connection establishment is to be 728 actively pursued, or to be passively waited for. 730 A passive OPEN request means that the process wants to accept 731 incoming connection requests, in contrast to an active OPEN 732 attempting to initiate a connection. 
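
   As an illustration only, the state names and the OPEN call described
   above can be sketched in Python.  The names State and open_call are
   hypothetical and are not defined by this document.  Only the OPEN
   call issued from CLOSED is modeled; every other event and transition
   is deliberately left out of the sketch.

```python
from enum import Enum


class State(Enum):
    """The connection states named in this section."""
    CLOSED = "CLOSED"              # fictional state: no TCB exists
    LISTEN = "LISTEN"
    SYN_SENT = "SYN-SENT"
    SYN_RECEIVED = "SYN-RECEIVED"
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN-WAIT-1"
    FIN_WAIT_2 = "FIN-WAIT-2"
    CLOSE_WAIT = "CLOSE-WAIT"
    CLOSING = "CLOSING"
    LAST_ACK = "LAST-ACK"
    TIME_WAIT = "TIME-WAIT"


def open_call(state: State, active: bool) -> State:
    """Return the state entered when the user issues OPEN.

    Hypothetical helper: a passive OPEN waits for incoming connection
    requests (LISTEN); an active OPEN initiates a connection (SYN-SENT).
    """
    if state is not State.CLOSED:
        raise ValueError("this sketch only models OPEN from CLOSED")
    return State.SYN_SENT if active else State.LISTEN
```

   For example, open_call(State.CLOSED, active=False) yields
   State.LISTEN, matching a passive OPEN.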

   The state diagram in Figure 5 illustrates only state changes,
   together with the causing events and resulting actions, but addresses
   neither error conditions nor actions that are not connected with
   state changes.  In a later section, more detail is offered with
   respect to the reaction of the TCP implementation to events.  Some
   state names are abbreviated or hyphenated differently in the diagram
   from how they appear elsewhere in the document.

   NOTA BENE:  This diagram is only a summary and must not be taken as
      the total specification.  Many details are not included.

                            +---------+ ---------\      active OPEN
                            |  CLOSED |            \    -----------
                            +---------+<---------\   \   create TCB
                              |     ^              \   \  snd SYN
                 passive OPEN |     |   CLOSE        \   \
                 ------------ |     | ----------       \   \
                  create TCB  |     | delete TCB         \   \
                              V     |                      \   \
          rcv RST (note 1)  +---------+            CLOSE    |    \
       -------------------->|  LISTEN |          ---------- |     |
      /                     +---------+          delete TCB |     |
     /           rcv SYN      |     |     SEND              |     |
    /           -----------   |     |    -------            |     V
+--------+      snd SYN,ACK  /       \   snd SYN          +--------+
|        |<-----------------           ------------------>|        |
|  SYN   |                    rcv SYN                     |  SYN   |
|  RCVD  |<-----------------------------------------------|  SENT  |
|        |                  snd SYN,ACK                   |        |
|        |------------------           -------------------|        |
+--------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +--------+
   |         --------------   |     |   -----------
   |                x         |     |     snd ACK
   |                          V     V
   |  CLOSE                 +---------+
   | -------                |  ESTAB  |
   | snd FIN                +---------+
   |                 CLOSE    |     |    rcv FIN
   V                -------   |     |    -------
+---------+         snd FIN  /       \   snd ACK         +---------+
|  FIN    |<----------------          ------------------>|  CLOSE  |
| WAIT-1  |------------------                            |   WAIT  |
+---------+          rcv FIN  \                          +---------+
  | rcv ACK of FIN   -------   |                          CLOSE  |
  | --------------   snd ACK   |                         ------- |
  V        x                   V                         snd FIN V
+---------+               +---------+                    +---------+
|FINWAIT-2|               | CLOSING |                    | LAST-ACK|
+---------+               +---------+                    +---------+
  |              rcv ACK of FIN |                 rcv ACK of FIN |
  |  rcv FIN     -------------- |    Timeout=2MSL -------------- |
  |  -------            x       V    ------------        x       V
   \ snd ACK              +---------+delete TCB          +---------+
     -------------------->|TIME-WAIT|------------------->| CLOSED  |
                          +---------+                    +---------+

                Figure 5: TCP Connection State Diagram

   The following notes apply to Figure 5:

   Note 1:  The transition from SYN-RECEIVED to LISTEN on receiving a
      RST is conditional on having reached SYN-RECEIVED after a passive
      open.

   Note 2:  The figure omits a transition from FIN-WAIT-1 to TIME-WAIT
      if a FIN is received and the local FIN is also acknowledged.

   Note 3:  A RST can be sent from any state with a corresponding
      transition to TIME-WAIT (see [71] for rationale).  These
      transitions are not explicitly shown, otherwise the diagram would
      become very difficult to read.  Similarly, receipt of a RST from
      any state results in a transition to LISTEN or CLOSED, though this
      is also omitted from the diagram for legibility.

3.4.  Sequence Numbers

   A fundamental notion in the design is that every octet of data sent
   over a TCP connection has a sequence number.  Since every octet is
   sequenced, each of them can be acknowledged.  The acknowledgment
   mechanism employed is cumulative so that an acknowledgment of
   sequence number X indicates that all octets up to but not including X
   have been received.  This mechanism allows for straightforward
   duplicate detection in the presence of retransmission.  Octets
   within a segment are numbered so that the first data octet
   immediately following the header is the lowest numbered, and the
   following octets are numbered consecutively.

   It is essential to remember that the actual sequence number space is
   finite, though large.  This space ranges from 0 to 2**32 - 1.  Since
   the space is finite, all arithmetic dealing with sequence numbers
   must be performed modulo 2**32.  This unsigned arithmetic preserves
   the relationship of sequence numbers as they cycle from 2**32 - 1 to
   0 again.  There are some subtleties to computer modulo arithmetic, so
   great care should be taken in programming the comparison of such
   values.  The symbol "=<" means "less than or equal" (modulo 2**32).

   The typical kinds of sequence number comparisons that the TCP
   implementation must perform include:

   (a)  Determining that an acknowledgment refers to some sequence
        number sent but not yet acknowledged.

   (b)  Determining that all sequence numbers occupied by a segment
        have been acknowledged (e.g., to remove the segment from a
        retransmission queue).

   (c)  Determining that an incoming segment contains sequence numbers
        that are expected (i.e., that the segment "overlaps" the receive
        window).

   In response to sending data the TCP endpoint will receive
   acknowledgments.  The following comparisons are needed to process the
   acknowledgments.

      SND.UNA = oldest unacknowledged sequence number

      SND.NXT = next sequence number to be sent

      SEG.ACK = acknowledgment from the receiving TCP peer (next
      sequence number expected by the receiving TCP peer)

      SEG.SEQ = first sequence number of a segment

      SEG.LEN = the number of octets occupied by the data in the segment
      (counting SYN and FIN)

      SEG.SEQ+SEG.LEN-1 = last sequence number of a segment

   A new acknowledgment (called an "acceptable ack") is one for which
   the inequality below holds:

      SND.UNA < SEG.ACK =< SND.NXT

   A segment on the retransmission queue is fully acknowledged if the
   sum of its sequence number and length is less than or equal to the
   acknowledgment value in the incoming segment.
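
   The acknowledgment comparisons above can be sketched in Python as
   follows.  This is a minimal illustration, not part of the
   specification.  The helper names are hypothetical, and the rule that
   one sequence number precedes another when the forward distance
   between them is less than 2**31 is an assumption of this sketch (a
   convention many implementations use); the text itself only requires
   that comparisons be performed modulo 2**32.

```python
MOD = 1 << 32        # size of the sequence number space
HALF = 1 << 31       # assumed limit on how far apart two numbers may be


def seq_lt(a: int, b: int) -> bool:
    """True if a strictly precedes b, modulo 2**32 (assumed half-space rule)."""
    return 0 < ((b - a) % MOD) < HALF


def seq_leq(a: int, b: int) -> bool:
    """The "=<" relation: a precedes or equals b, modulo 2**32."""
    return (a - b) % MOD == 0 or seq_lt(a, b)


def is_acceptable_ack(snd_una: int, seg_ack: int, snd_nxt: int) -> bool:
    """SND.UNA < SEG.ACK =< SND.NXT"""
    return seq_lt(snd_una, seg_ack) and seq_leq(seg_ack, snd_nxt)


def is_fully_acked(seg_seq: int, seg_len: int, seg_ack: int) -> bool:
    """A queued segment is fully acknowledged if SEG.SEQ+SEG.LEN =< SEG.ACK."""
    return seq_leq((seg_seq + seg_len) % MOD, seg_ack)
```

   Because the differences are reduced modulo 2**32 before they are
   compared, the same functions keep working when SND.UNA sits just
   below 2**32 - 1 and SND.NXT has already wrapped past 0.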

   When data is received the following comparisons are needed:

      RCV.NXT = next sequence number expected on an incoming segment,
      and is the left or lower edge of the receive window

      RCV.NXT+RCV.WND-1 = last sequence number expected on an incoming
      segment, and is the right or upper edge of the receive window

      SEG.SEQ = first sequence number occupied by the incoming segment

      SEG.SEQ+SEG.LEN-1 = last sequence number occupied by the incoming
      segment

   A segment is judged to occupy a portion of valid receive sequence
   space if

      RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

   or

      RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

   The first part of this test checks to see if the beginning of the
   segment falls in the window, the second part of the test checks to
   see if the end of the segment falls in the window; if the segment
   passes either part of the test it contains data in the window.

   Actually, it is a little more complicated than this.  Due to zero
   windows and zero length segments, we have four cases for the
   acceptability of an incoming segment:

      Segment Receive  Test
      Length  Window
      ------- -------  -------------------------------------------

         0       0     SEG.SEQ = RCV.NXT

         0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

        >0       0     not acceptable

        >0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
                    or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

   Note that when the receive window is zero no segments should be
   acceptable except ACK segments.  Thus, it is possible for a TCP
   implementation to maintain a zero receive window while transmitting
   data and receiving ACKs.  A TCP receiver MUST process the RST and URG
   fields of all incoming segments, even when the receive window is zero
   (MUST-66).

   We have taken advantage of the numbering scheme to protect certain
   control information as well.  This is achieved by implicitly
   including some control flags in the sequence space so they can be
   retransmitted and acknowledged without confusion (i.e., one and only
   one copy of the control will be acted upon).  Control information is
   not physically carried in the segment data space.  Consequently, we
   must adopt rules for implicitly assigning sequence numbers to
   control.  The SYN and FIN are the only controls requiring this
   protection, and these controls are used only at connection opening
   and closing.  For sequence number purposes, the SYN is considered to
   occur before the first actual data octet of the segment in which it
   occurs, while the FIN is considered to occur after the last actual
   data octet in a segment in which it occurs.  The segment length
   (SEG.LEN) includes both data and sequence space-occupying controls.
   When a SYN is present then SEG.SEQ is the sequence number of the SYN.

3.4.1.  Initial Sequence Number Selection

   A connection is defined by a pair of sockets.  Connections can be
   reused.  New instances of a connection will be referred to as
   incarnations of the connection.  The problem that arises from this is
   -- "how does the TCP implementation identify duplicate segments from
   previous incarnations of the connection?"  This problem becomes
   apparent if the connection is being opened and closed in quick
   succession, or if the connection breaks with loss of memory and is
   then reestablished.  To support this, the TIME-WAIT state limits the
   rate of connection reuse, while the initial sequence number selection
   described below further protects against ambiguity about what
   incarnation of a connection an incoming packet corresponds to.

   To avoid confusion we must prevent segments from one incarnation of a
   connection from being used while the same sequence numbers may still
   be present in the network from an earlier incarnation.  We want to
   assure this, even if a TCP endpoint loses all knowledge of the
   sequence numbers it has been using.  When new connections are
   created, an initial sequence number (ISN) generator is employed that
   selects a new 32-bit ISN.  There are security issues that result if
   an off-path attacker is able to predict or guess ISN values [43].

   TCP Initial Sequence Numbers are generated from a number sequence
   that monotonically increases until it wraps, known loosely as a
   "clock".  This clock is a 32-bit counter that typically increments at
   least once every roughly 4 microseconds, although it is neither
   assumed to be realtime nor precise, and need not persist across
   reboots.  The clock component is intended to ensure that within a
   Maximum Segment Lifetime (MSL), generated ISNs will be unique, since
   it cycles approximately every 4.55 hours, which is much longer than
   the MSL.

   A TCP implementation MUST use the above type of "clock" for
   clock-driven selection of initial sequence numbers (MUST-8), and
   SHOULD generate its Initial Sequence Numbers with the expression:

      ISN = M + F(localip, localport, remoteip, remoteport, secretkey)

   where M is the 4 microsecond timer, and F() is a pseudorandom
   function (PRF) of the connection's identifying parameters ("localip,
   localport, remoteip, remoteport") and a secret key ("secretkey")
   (SHLD-1).  F() MUST NOT be computable from the outside (MUST-9), or
   an attacker could still guess at sequence numbers from the ISN used
   for some other connection.  The PRF could be implemented as a
   cryptographic hash of the concatenation of the TCP connection
   parameters and some secret data.  For discussion of the selection of
   a specific hash algorithm and management of the secret key data,
   please see Section 3 of [43].

   For each connection there is a send sequence number and a receive
   sequence number.  The initial send sequence number (ISS) is chosen by
   the data sending TCP peer, and the initial receive sequence number
   (IRS) is learned during the connection establishing procedure.

   For a connection to be established or initialized, the two TCP peers
   must synchronize on each other's initial sequence numbers.  This is
   done in an exchange of connection establishing segments carrying a
   control bit called "SYN" (for synchronize) and the initial sequence
   numbers.  As a shorthand, segments carrying the SYN bit are also
   called "SYNs".  Hence, the solution requires a suitable mechanism for
   picking an initial sequence number and a slightly involved handshake
   to exchange the ISNs.

   The synchronization requires each side to send its own initial
   sequence number and to receive a confirmation of it in acknowledgment
   from the remote TCP peer.  Each side must also receive the remote
   peer's initial sequence number and send a confirming acknowledgment.

       1) A --> B  SYN my sequence number is X
       2) A <-- B  ACK your sequence number is X
       3) A <-- B  SYN my sequence number is Y
       4) A --> B  ACK your sequence number is Y

   Because steps 2 and 3 can be combined in a single message this is
   called the three-way (or three message) handshake (3WHS).

   A 3WHS is necessary because sequence numbers are not tied to a global
   clock in the network, and TCP implementations may have different
   mechanisms for picking the ISNs.  The receiver of the first SYN has
   no way of knowing whether the segment was an old one or not, unless
   it remembers the last sequence number used on the connection (which
   is not always possible), and so it must ask the sender to verify this
   SYN.  The three-way handshake and the advantages of a clock-driven
   scheme for ISN selection are discussed in [70].

3.4.2.  Knowing When to Keep Quiet

   A theoretical problem exists where data could be corrupted due to
   confusion between old segments in the network and new ones after a
   host reboots, if the same port numbers and sequence space are reused.
   The "Quiet Time" concept discussed below addresses this and the
   discussion of it is included for situations where it might be
   relevant, although it is not felt to be necessary in most current
   implementations.  The problem was more relevant earlier in the
   history of TCP.  In practical use on the Internet today, the
   error-prone conditions are sufficiently unlikely that it is felt safe
   to ignore.  Reasons why it is now negligible include: (a) ISS and
   ephemeral port randomization have reduced the likelihood of reuse of
   port numbers and sequence numbers after reboots, (b) the effective
   MSL of the Internet has declined as links have become faster, and (c)
   reboots often take longer than an MSL anyway.

   To be sure that a TCP implementation does not create a segment
   carrying a sequence number that may be duplicated by an old segment
   remaining in the network, the TCP endpoint must keep quiet for an MSL
   before assigning any sequence numbers upon starting up or recovering
   from a situation where memory of sequence numbers in use was lost.
   For this specification the MSL is taken to be 2 minutes.  This is an
   engineering choice, and may be changed if experience indicates it is
   desirable to do so.  Note that if a TCP endpoint is reinitialized in
   some sense, yet retains its memory of sequence numbers in use, then
   it need not wait at all; it must only be sure to use sequence numbers
   larger than those recently used.

3.4.3.  The TCP Quiet Time Concept

   Hosts that for any reason lose knowledge of the last sequence numbers
   transmitted on each active (i.e., not closed) connection shall delay
   emitting any TCP segments for at least the agreed MSL in the internet
   system that the host is a part of.  In the paragraphs below, an
   explanation for this specification is given.  TCP implementors may
   violate the "quiet time" restriction, but only at the risk of causing
   some old data to be accepted as new or new data rejected as old
   duplicated data by some receivers in the internet system.

   TCP endpoints consume sequence number space each time a segment is
   formed and entered into the network output queue at a source host.
   The duplicate detection and sequencing algorithm in the TCP protocol
   relies on the unique binding of segment data to sequence space to the
   extent that sequence numbers will not cycle through all 2**32 values
   before the segment data bound to those sequence numbers has been
   delivered and acknowledged by the receiver and all duplicate copies
   of the segments have "drained" from the internet.  Without such an
   assumption, two distinct TCP segments could conceivably be assigned
   the same or overlapping sequence numbers, causing confusion at the
   receiver as to which data is new and which is old.  Remember that
   each segment is bound to as many consecutive sequence numbers as
   there are octets of data and SYN or FIN flags in the segment.

   Under normal conditions, TCP implementations keep track of the next
   sequence number to emit and the oldest awaiting acknowledgment so as
   to avoid mistakenly reusing a sequence number before its first use
   has been acknowledged.  This alone does not guarantee that old
   duplicate data is drained from the net, so the sequence space has
   been made large to reduce the probability that a wandering duplicate
   will cause trouble upon arrival.  At 2 megabits/sec. it takes 4.5
   hours to use up 2**32 octets of sequence space.  Since the maximum
   segment lifetime in the net is not likely to exceed a few tens of
   seconds, this is deemed ample protection for foreseeable nets, even
   if data rates escalate to 10s of megabits/sec.  At 100 megabits/sec,
   the cycle time is 5.4 minutes, which may be a little short, but still
   within reason.  Much higher data rates are possible today, with
   implications described in the final paragraph of this subsection.

   The basic duplicate detection and sequencing algorithm in TCP can be
   defeated, however, if a source TCP endpoint does not have any memory
   of the sequence numbers it last used on a given connection.  For
   example, if the TCP implementation were to start all connections with
   sequence number 0, then upon the host rebooting, a TCP peer might
   re-form an earlier connection (possibly after half-open connection
   resolution) and emit packets with sequence numbers identical to or
   overlapping with packets still in the network, which were emitted on
   an earlier incarnation of the same connection.  In the absence of
   knowledge about the sequence numbers used on a particular connection,
   the TCP specification recommends that the source delay for MSL
   seconds before emitting segments on the connection, to allow time for
   segments from the earlier connection incarnation to drain from the
   system.

   Even hosts that can remember the time of day and use it to select
   initial sequence number values are not immune from this problem
   (i.e., even if time of day is used to select an initial sequence
   number for each new connection incarnation).

   Suppose, for example, that a connection is opened starting with
   sequence number S.  Suppose that this connection is not used much and
   that eventually the initial sequence number function (ISN(t)) takes
   on a value equal to the sequence number, say S1, of the last segment
   sent by this TCP endpoint on a particular connection.  Now suppose,
   at this instant, the host reboots and establishes a new incarnation
   of the connection.  The initial sequence number chosen is S1 = ISN(t)
   -- last used sequence number on old incarnation of connection!  If
   the recovery occurs quickly enough, any old duplicates in the net
   bearing sequence numbers in the neighborhood of S1 may arrive and be
   treated as new packets by the receiver of the new incarnation of the
   connection.

   The problem is that the recovering host may not know how long it
   was down while rebooting, nor does it know whether there are still
   old duplicates in the system from earlier connection incarnations.

   One way to deal with this problem is to deliberately delay emitting
   segments for one MSL after recovery from a reboot - this is the
   "quiet time" specification.  Hosts that prefer to avoid waiting and
   are willing to risk possible confusion of old and new packets at a
   given destination may choose not to wait for the "quiet time".
   Implementors may provide TCP users with the ability to select on a
   connection by connection basis whether to wait after a reboot, or may
   informally implement the "quiet time" for all connections.
   Obviously, even where a user selects to "wait," this is not necessary
   after the host has been "up" for at least MSL seconds.
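
   The waiting rule described above can be sketched in Python as
   follows.  This is an illustration under stated assumptions rather
   than a required implementation: the function names are hypothetical,
   the host uptime is assumed to be available as a number of seconds,
   and the 120-second default reflects the 2-minute MSL that this
   specification assumes.

```python
MSL_SECONDS = 120.0  # "the MSL is taken to be 2 minutes"


def quiet_time_remaining(uptime_seconds: float,
                         msl: float = MSL_SECONDS) -> float:
    """Seconds a rebooted host should still stay quiet (0.0 if none)."""
    return max(0.0, msl - uptime_seconds)


def may_emit_segments(uptime_seconds: float,
                      retained_sequence_memory: bool,
                      msl: float = MSL_SECONDS) -> bool:
    """Decide whether segments may be emitted without risking overlap.

    A host that kept its memory of sequence numbers in use need not
    wait at all; otherwise it waits until it has been up for one MSL.
    """
    if retained_sequence_memory:
        return True
    return quiet_time_remaining(uptime_seconds, msl) == 0.0
```

   A host that has been up for 30 seconds after losing its sequence
   number memory would, under these assumptions, wait a further 90
   seconds before emitting segments.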
1140 To summarize: every segment emitted occupies one or more sequence 1141 numbers in the sequence space, the numbers occupied by a segment are 1142 "busy" or "in use" until MSL seconds have passed, upon rebooting a 1143 block of space-time is occupied by the octets and SYN or FIN flags of 1144 any potentially still in-flight segments, and if a new connection is 1145 started too soon and uses any of the sequence numbers in the space- 1146 time footprint of those potentially still in-flight segments of the 1147 previous connection incarnation, there is a potential sequence number 1148 overlap area that could cause confusion at the receiver. 1150 High performance cases will have shorter cycle times than those in 1151 the megabits per second that the base TCP design described above 1152 considers. At 1 Gbps, the cycle time is 34 seconds, only 3 seconds 1153 at 10 Gbps, and around a third of a second at 100 Gbps. In these 1154 higher performance cases, TCP Timestamp options and Protection 1155 Against Wrapped Sequences (PAWS) [48] provide the needed capability 1156 to detect and discard old duplicates. 1158 3.5. Establishing a connection 1160 The "three-way handshake" is the procedure used to establish a 1161 connection. This procedure normally is initiated by one TCP peer and 1162 responded to by another TCP peer. The procedure also works if two 1163 TCP peers simultaneously initiate the procedure. When simultaneous 1164 open occurs, each TCP peer receives a "SYN" segment that carries no 1165 acknowledgment after it has sent a "SYN". Of course, the arrival of 1166 an old duplicate "SYN" segment can potentially make it appear, to the 1167 recipient, that a simultaneous connection initiation is in progress. 1168 Proper use of "reset" segments can disambiguate these cases. 1170 Several examples of connection initiation follow. 
Although these 1171 examples do not show connection synchronization using data-carrying 1172 segments, this is perfectly legitimate, so long as the receiving TCP 1173 endpoint doesn't deliver the data to the user until it is clear the 1174 data is valid (e.g., the data is buffered at the receiver until the 1175 connection reaches the ESTABLISHED state, given that the three-way 1176 handshake reduces the possibility of false connections). It is a 1177 trade-off between memory and messages to provide information for this 1178 checking. 1180 The simplest 3WHS is shown in Figure 6. The figures should be 1181 interpreted in the following way. Each line is numbered for 1182 reference purposes. Right arrows (-->) indicate departure of a TCP 1183 segment from TCP peer A to TCP peer B, or arrival of a segment at B 1184 from A. Left arrows (<--), indicate the reverse. Ellipsis (...) 1185 indicates a segment that is still in the network (delayed). Comments 1186 appear in parentheses. TCP connection states represent the state 1187 AFTER the departure or arrival of the segment (whose contents are 1188 shown in the center of each line). Segment contents are shown in 1189 abbreviated form, with sequence number, control flags, and ACK field. 1190 Other fields such as window, addresses, lengths, and text have been 1191 left out in the interest of clarity. 1193 TCP Peer A TCP Peer B 1195 1. CLOSED LISTEN 1197 2. SYN-SENT --> --> SYN-RECEIVED 1199 3. ESTABLISHED <-- <-- SYN-RECEIVED 1201 4. ESTABLISHED --> --> ESTABLISHED 1203 5. ESTABLISHED --> --> ESTABLISHED 1205 Figure 6: Basic 3-Way Handshake for Connection Synchronization 1207 In line 2 of Figure 6, TCP Peer A begins by sending a SYN segment 1208 indicating that it will use sequence numbers starting with sequence 1209 number 100. In line 3, TCP Peer B sends a SYN and acknowledges the 1210 SYN it received from TCP Peer A. 
Note that the acknowledgment field 1211 indicates TCP Peer B is now expecting to hear sequence 101, 1212 acknowledging the SYN that occupied sequence 100. 1214 At line 4, TCP Peer A responds with an empty segment containing an 1215 ACK for TCP Peer B's SYN; and in line 5, TCP Peer A sends some data. 1216 Note that the sequence number of the segment in line 5 is the same as 1217 in line 4 because the ACK does not occupy sequence number space (if 1218 it did, we would wind up ACKing ACKs!). 1220 Simultaneous initiation is only slightly more complex, as is shown in 1221 Figure 7. Each TCP peer's connection state cycles from CLOSED to 1222 SYN-SENT to SYN-RECEIVED to ESTABLISHED. 1224 TCP Peer A TCP Peer B 1226 1. CLOSED CLOSED 1228 2. SYN-SENT --> ... 1230 3. SYN-RECEIVED <-- <-- SYN-SENT 1232 4. ... --> SYN-RECEIVED 1234 5. SYN-RECEIVED --> ... 1236 6. ESTABLISHED <-- <-- SYN-RECEIVED 1238 7. ... --> ESTABLISHED 1240 Figure 7: Simultaneous Connection Synchronization 1242 A TCP implementation MUST support simultaneous open attempts (MUST- 1243 10). 1245 Note that a TCP implementation MUST keep track of whether a 1246 connection has reached SYN-RECEIVED state as the result of a passive 1247 OPEN or an active OPEN (MUST-11). 1249 The principal reason for the three-way handshake is to prevent old 1250 duplicate connection initiations from causing confusion. To deal 1251 with this, a special control message, reset, is specified. If the 1252 receiving TCP peer is in a non-synchronized state (i.e., SYN-SENT, 1253 SYN-RECEIVED), it returns to LISTEN on receiving an acceptable reset. 1254 If the TCP peer is in one of the synchronized states (ESTABLISHED, 1255 FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), it 1256 aborts the connection and informs its user. We discuss this latter 1257 case under "half-open" connections below. 1259 TCP Peer A TCP Peer B 1261 1. CLOSED LISTEN 1263 2. SYN-SENT --> ... 1265 3. (duplicate) ... --> SYN-RECEIVED 1267 4. 
SYN-SENT <-- <-- SYN-RECEIVED 1269 5. SYN-SENT --> --> LISTEN 1271 6. ... --> SYN-RECEIVED 1273 7. ESTABLISHED <-- <-- SYN-RECEIVED 1275 8. ESTABLISHED --> --> ESTABLISHED 1277 Figure 8: Recovery from Old Duplicate SYN 1279 As a simple example of recovery from old duplicates, consider 1280 Figure 8. At line 3, an old duplicate SYN arrives at TCP Peer B. 1281 TCP Peer B cannot tell that this is an old duplicate, so it responds 1282 normally (line 4). TCP Peer A detects that the ACK field is 1283 incorrect and returns a RST (reset) with its SEQ field selected to 1284 make the segment believable. TCP Peer B, on receiving the RST, 1285 returns to the LISTEN state. When the original SYN finally arrives 1286 at line 6, the synchronization proceeds normally. If the SYN at line 1287 6 had arrived before the RST, a more complex exchange might have 1288 occurred with RST's sent in both directions. 1290 3.5.1. Half-Open Connections and Other Anomalies 1292 An established connection is said to be "half-open" if one of the TCP 1293 peers has closed or aborted the connection at its end without the 1294 knowledge of the other, or if the two ends of the connection have 1295 become desynchronized owing to a failure or reboot that resulted in 1296 loss of memory. Such connections will automatically become reset if 1297 an attempt is made to send data in either direction. However, half- 1298 open connections are expected to be unusual. 1300 If at site A the connection no longer exists, then an attempt by the 1301 user at site B to send any data on it will result in the site B TCP 1302 endpoint receiving a reset control message. Such a message indicates 1303 to the site B TCP endpoint that something is wrong, and it is 1304 expected to abort the connection. 1306 Assume that two user processes A and B are communicating with one 1307 another when a failure or reboot occurs causing loss of memory to A's 1308 TCP implementation. 
Depending on the operating system supporting A's 1309 TCP implementation, it is likely that some error recovery mechanism 1310 exists. When the TCP endpoint is up again, A is likely to start 1311 again from the beginning or from a recovery point. As a result, A 1312 will probably try to OPEN the connection again or try to SEND on the 1313 connection it believes open. In the latter case, it receives the 1314 error message "connection not open" from the local (A's) TCP 1315 implementation. In an attempt to establish the connection, A's TCP 1316 implementation will send a segment containing SYN. This scenario 1317 leads to the example shown in Figure 9. After TCP Peer A reboots, 1318 the user attempts to re-open the connection. TCP Peer B, in the 1319 meantime, thinks the connection is open. 1321 TCP Peer A TCP Peer B 1323 1. (REBOOT) (send 300,receive 100) 1325 2. CLOSED ESTABLISHED 1327 3. SYN-SENT --> --> (??) 1329 4. (!!) <-- <-- ESTABLISHED 1331 5. SYN-SENT --> --> (Abort!!) 1333 6. SYN-SENT CLOSED 1335 7. SYN-SENT --> --> 1337 Figure 9: Half-Open Connection Discovery 1339 When the SYN arrives at line 3, TCP Peer B, being in a synchronized 1340 state, and the incoming segment outside the window, responds with an 1341 acknowledgment indicating what sequence it next expects to hear (ACK 1342 100). TCP Peer A sees that this segment does not acknowledge 1343 anything it sent and, being unsynchronized, sends a reset (RST) 1344 because it has detected a half-open connection. TCP Peer B aborts at 1345 line 5. TCP Peer A will continue to try to establish the connection; 1346 the problem is now reduced to the basic 3-way handshake of Figure 6. 1348 An interesting alternative case occurs when TCP Peer A reboots and 1349 TCP Peer B tries to send data on what it thinks is a synchronized 1350 connection. This is illustrated in Figure 10. 
In this case, the 1351 data arriving at TCP Peer A from TCP Peer B (line 2) is unacceptable 1352 because no such connection exists, so TCP Peer A sends a RST. The 1353 RST is acceptable, so TCP Peer B processes it and aborts the 1354 connection. 1356 TCP Peer A TCP Peer B 1358 1. (REBOOT) (send 300,receive 100) 1360 2. (??) <-- <-- ESTABLISHED 1362 3. --> --> (ABORT!!) 1364 Figure 10: Active Side Causes Half-Open Connection Discovery 1366 In Figure 11, two TCP Peers A and B with passive connections waiting 1367 for SYN are depicted. An old duplicate arriving at TCP Peer B (line 1368 2) stirs B into action. A SYN-ACK is returned (line 3) and causes 1369 TCP Peer A to generate a RST (the ACK in line 3 is not acceptable). TCP 1370 Peer B accepts the reset and returns to its passive LISTEN state. 1372 TCP Peer A TCP Peer B 1374 1. LISTEN LISTEN 1376 2. ... --> SYN-RECEIVED 1378 3. (??) <-- <-- SYN-RECEIVED 1380 4. --> --> (return to LISTEN!) 1382 5. LISTEN LISTEN 1384 Figure 11: Old Duplicate SYN Initiates a Reset on two Passive Sockets 1386 A variety of other cases are possible, all of which are accounted for 1387 by the following rules for RST generation and processing. 1389 3.5.2. Reset Generation 1391 A TCP user or application can issue a reset on a connection at any 1392 time, though reset events are also generated by the protocol itself 1393 when various error conditions occur, as described below. The side of 1394 a connection issuing a reset should enter the TIME-WAIT state, as 1395 this generally helps to reduce the load on busy servers for reasons 1396 described in [71]. 1398 As a general rule, reset (RST) is sent whenever a segment arrives 1399 that apparently is not intended for the current connection. A reset 1400 must not be sent if it is not clear that this is the case. 1402 There are three groups of states: 1404 1. If the connection does not exist (CLOSED), then a reset is sent 1405 in response to any incoming segment except another reset.
A SYN 1406 segment that does not match an existing connection is rejected by 1407 this means. 1409 If the incoming segment has the ACK bit set, the reset takes its 1410 sequence number from the ACK field of the segment; otherwise, the 1411 reset has sequence number zero and the ACK field is set to the sum 1412 of the sequence number and segment length of the incoming segment. 1413 The connection remains in the CLOSED state. 1415 2. If the connection is in any non-synchronized state (LISTEN, 1416 SYN-SENT, SYN-RECEIVED), and the incoming segment acknowledges 1417 something not yet sent (the segment carries an unacceptable ACK), 1418 or if an incoming segment has a security level or compartment 1419 (Appendix A.1) that does not exactly match the level and compartment 1420 requested for the connection, a reset is sent. 1422 If the incoming segment has an ACK field, the reset takes its 1423 sequence number from the ACK field of the segment; otherwise, the 1424 reset has sequence number zero and the ACK field is set to the sum 1425 of the sequence number and segment length of the incoming segment. 1426 The connection remains in the same state. 1428 3. If the connection is in a synchronized state (ESTABLISHED, 1429 FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), 1430 any unacceptable segment (out-of-window sequence number or 1431 unacceptable acknowledgment number) must be responded to with an 1432 empty acknowledgment segment (without any user data) containing 1433 the current send-sequence number and an acknowledgment indicating 1434 the next sequence number expected to be received, and the 1435 connection remains in the same state. 1437 If an incoming segment has a security level or compartment that 1438 does not exactly match the level and compartment requested for the 1439 connection, a reset is sent and the connection goes to the CLOSED 1440 state. The reset takes its sequence number from the ACK field of 1441 the incoming segment. 1443 3.5.3.
Reset Processing 1445 In all states except SYN-SENT, all reset (RST) segments are validated 1446 by checking their SEQ-fields. A reset is valid if its sequence 1447 number is in the window. In the SYN-SENT state (a RST received in 1448 response to an initial SYN), the RST is acceptable if the ACK field 1449 acknowledges the SYN. 1451 The receiver of a RST first validates it, then changes state. If the 1452 receiver was in the LISTEN state, it ignores it. If the receiver was 1453 in SYN-RECEIVED state and had previously been in the LISTEN state, 1454 then the receiver returns to the LISTEN state, otherwise the receiver 1455 aborts the connection and goes to the CLOSED state. If the receiver 1456 was in any other state, it aborts the connection and advises the user 1457 and goes to the CLOSED state. 1459 TCP implementations SHOULD allow a received RST segment to include 1460 data (SHLD-2). It has been suggested that a RST segment could 1461 contain diagnostic data that explains the cause of the RST. No 1462 standard has yet been established for such data. 1464 3.6. Closing a Connection 1466 CLOSE is an operation meaning "I have no more data to send." The 1467 notion of closing a full-duplex connection is subject to ambiguous 1468 interpretation, of course, since it may not be obvious how to treat 1469 the receiving side of the connection. We have chosen to treat CLOSE 1470 in a simplex fashion. The user who CLOSEs may continue to RECEIVE 1471 until the TCP receiver is told that the remote peer has CLOSED also. 1472 Thus, a program could initiate several SENDs followed by a CLOSE, and 1473 then continue to RECEIVE until signaled that a RECEIVE failed because 1474 the remote peer has CLOSED. The TCP implementation will signal a 1475 user, even if no RECEIVEs are outstanding, that the remote peer has 1476 closed, so the user can terminate their side gracefully. 
A TCP 1477 implementation will reliably deliver all buffers SENT before the 1478 connection was CLOSED so a user who expects no data in return need 1479 only wait to hear the connection was CLOSED successfully to know that 1480 all their data was received at the destination TCP endpoint. Users 1481 must keep reading connections they close for sending until the TCP 1482 implementation indicates there is no more data. 1484 There are essentially three cases: 1486 1) The user initiates by telling the TCP implementation to CLOSE 1487 the connection (TCP Peer A in Figure 12). 1489 2) The remote TCP endpoint initiates by sending a FIN control 1490 signal (TCP Peer B in Figure 12). 1492 3) Both users CLOSE simultaneously (Figure 13). 1494 Case 1: Local user initiates the close 1495 In this case, a FIN segment can be constructed and placed on the 1496 outgoing segment queue. No further SENDs from the user will be 1497 accepted by the TCP implementation, and it enters the FIN-WAIT-1 1498 state. RECEIVEs are allowed in this state. All segments 1499 preceding and including FIN will be retransmitted until 1500 acknowledged. When the other TCP peer has both acknowledged the 1501 FIN and sent a FIN of its own, the first TCP peer can ACK this 1502 FIN. Note that a TCP endpoint receiving a FIN will ACK but not 1503 send its own FIN until its user has CLOSED the connection also. 1505 Case 2: TCP endpoint receives a FIN from the network 1506 If an unsolicited FIN arrives from the network, the receiving TCP 1507 endpoint can ACK it and tell the user that the connection is 1508 closing. The user will respond with a CLOSE, upon which the TCP 1509 endpoint can send a FIN to the other TCP peer after sending any 1510 remaining data. The TCP endpoint then waits until its own FIN is 1511 acknowledged whereupon it deletes the connection. If an ACK is 1512 not forthcoming, after the user timeout the connection is aborted 1513 and the user is told. 
1515 Case 3: Both users close simultaneously 1516 A simultaneous CLOSE by users at both ends of a connection causes 1517 FIN segments to be exchanged (Figure 13). When all segments 1518 preceding the FINs have been processed and acknowledged, each TCP 1519 peer can ACK the FIN it has received. Both will, upon receiving 1520 these ACKs, delete the connection. 1522 TCP Peer A TCP Peer B 1524 1. ESTABLISHED ESTABLISHED 1526 2. (Close) 1527 FIN-WAIT-1 --> --> CLOSE-WAIT 1529 3. FIN-WAIT-2 <-- <-- CLOSE-WAIT 1531 4. (Close) 1532 TIME-WAIT <-- <-- LAST-ACK 1534 5. TIME-WAIT --> --> CLOSED 1536 6. (2 MSL) 1537 CLOSED 1539 Figure 12: Normal Close Sequence 1541 TCP Peer A TCP Peer B 1543 1. ESTABLISHED ESTABLISHED 1545 2. (Close) (Close) 1546 FIN-WAIT-1 --> ... FIN-WAIT-1 1547 <-- <-- 1548 ... --> 1550 3. CLOSING --> ... CLOSING 1551 <-- <-- 1552 ... --> 1554 4. TIME-WAIT TIME-WAIT 1555 (2 MSL) (2 MSL) 1556 CLOSED CLOSED 1558 Figure 13: Simultaneous Close Sequence 1560 A TCP connection may terminate in two ways: (1) the normal TCP close 1561 sequence using a FIN handshake (Figure 12), and (2) an "abort" in 1562 which one or more RST segments are sent and the connection state is 1563 immediately discarded. If the local TCP connection is closed by the 1564 remote side due to a FIN or RST received from the remote side, then 1565 the local application MUST be informed whether it closed normally or 1566 was aborted (MUST-12). 1568 3.6.1. Half-Closed Connections 1570 The normal TCP close sequence delivers buffered data reliably in both 1571 directions. Since the two directions of a TCP connection are closed 1572 independently, it is possible for a connection to be "half closed," 1573 i.e., closed in only one direction, and a host is permitted to 1574 continue sending data in the open direction on a half-closed 1575 connection. 
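As an illustration of the half-closed behavior described above, the following is a minimal sketch using the Python sockets API over the loopback interface. It assumes that shutdown(SHUT_WR) maps to the CLOSE call of this document (that is, it sends a FIN while leaving the receive direction open), which is how common sockets implementations behave. The function name half_close_demo and the payloads are hypothetical and chosen only for this example.

```python
import socket


def half_close_demo():
    """Show that a peer can keep receiving after closing its send direction.

    Hypothetical helper for illustration only; uses a loopback TCP connection.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(1)
    a = socket.create_connection(listener.getsockname(), timeout=5)
    b, _ = listener.accept()
    b.settimeout(5)
    try:
        a.sendall(b"request")
        # Assumption: SHUT_WR sends a FIN; the A-to-B direction is now closed,
        # but A may still receive (the connection is "half closed").
        a.shutdown(socket.SHUT_WR)

        received = b""
        while chunk := b.recv(1024):  # recv returns b"" once A's FIN arrives
            received += chunk

        # B is still permitted to send data in the open direction.
        b.sendall(b"reply")
        b.close()  # B closes too, completing the normal close sequence

        reply = b""
        while chunk := a.recv(1024):
            reply += chunk
        return received, reply
    finally:
        a.close()
        b.close()
        listener.close()
```

Calling half_close_demo() is expected to return the bytes that B read before seeing end-of-stream and the bytes that A read after it had already closed its sending side.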
1577 A host MAY implement a "half-duplex" TCP close sequence, so that an 1578 application that has called CLOSE cannot continue to read data from 1579 the connection (MAY-1). If such a host issues a CLOSE call while 1580 received data is still pending in the TCP connection, or if new data 1581 is received after CLOSE is called, its TCP implementation SHOULD send 1582 a RST to show that data was lost (SHLD-3). See [24] section 2.17 for 1583 discussion. 1585 When a connection is closed actively, it MUST linger in the TIME-WAIT 1586 state for a time 2xMSL (Maximum Segment Lifetime) (MUST-13). 1587 However, it MAY accept a new SYN from the remote TCP endpoint to 1588 reopen the connection directly from TIME-WAIT state (MAY-2), if it: 1590 (1) assigns its initial sequence number for the new connection to 1591 be larger than the largest sequence number it used on the previous 1592 connection incarnation, and 1594 (2) returns to TIME-WAIT state if the SYN turns out to be an old 1595 duplicate. 1597 When the TCP Timestamp options are available, an improved algorithm 1598 is described in [41] in order to support higher connection 1599 establishment rates. This algorithm for reducing TIME-WAIT is a Best 1600 Current Practice that SHOULD be implemented, since timestamp options 1601 are commonly used, and using them to reduce TIME-WAIT provides 1602 benefits for busy Internet servers (SHLD-4). 1604 3.7. Segmentation 1606 The term "segmentation" refers to the activity TCP performs when 1607 ingesting a stream of bytes from a sending application and 1608 packetizing that stream of bytes into TCP segments. Individual TCP 1609 segments often do not correspond one-for-one to individual send (or 1610 socket write) calls from the application. 
Applications may perform 1611 writes at the granularity of messages in the upper layer protocol, 1612 but TCP guarantees no boundary coherence between the TCP segments 1613 sent and received versus user application data read or write buffer 1614 boundaries. In some specific protocols, such as Remote Direct Memory 1615 Access (RDMA) using Direct Data Placement (DDP) and Marker PDU 1616 Aligned Framing (MPA) [35], there are performance optimizations 1617 possible when the relation between TCP segments and application data 1618 units can be controlled, and MPA includes a specific mechanism for 1619 detecting and verifying this relationship between TCP segments and 1620 application message data structures, but this is specific to 1621 applications like RDMA. In general, multiple goals influence the 1622 sizing of TCP segments created by a TCP implementation. 1624 Goals driving the sending of larger segments include: 1626 * Reducing the number of packets in flight within the network. 1628 * Increasing processing efficiency and potential performance by 1629 enabling a smaller number of interrupts and inter-layer 1630 interactions. 1632 * Limiting the overhead of TCP headers. 1634 Note that the performance benefits of sending larger segments may 1635 decrease as the size increases, and there may be boundaries where 1636 advantages are reversed. For instance, on some implementation 1637 architectures, 1025 bytes within a segment could lead to worse 1638 performance than 1024 bytes, due purely to data alignment on copy 1639 operations. 1641 Goals driving the sending of smaller segments include: 1643 * Avoiding sending a TCP segment that would result in an IP datagram 1644 larger than the smallest MTU along an IP network path, because 1645 this results in either packet loss or packet fragmentation. 1646 Making matters worse, some firewalls or middleboxes may drop 1647 fragmented packets or ICMP messages related to fragmentation. 
1649 * Preventing delays to the application data stream, especially when 1650 TCP is waiting on the application to generate more data, or when 1651 the application is waiting on an event or input from its peer in 1652 order to generate more data. 1654 * Enabling "fate sharing" between TCP segments and lower-layer data 1655 units (e.g., below IP, for links with cell or frame sizes smaller 1656 than the IP MTU). 1658 Towards meeting these competing sets of goals, TCP includes several 1659 mechanisms, including the Maximum Segment Size option, Path MTU 1660 Discovery, the Nagle algorithm, and support for IPv6 Jumbograms, as 1661 discussed in the following subsections. 1663 3.7.1. Maximum Segment Size Option 1665 TCP endpoints MUST implement both sending and receiving the MSS 1666 option (MUST-14). 1668 TCP implementations SHOULD send an MSS option in every SYN segment 1669 when their receive MSS differs from the default 536 for IPv4 or 1220 1670 for IPv6 (SHLD-5), and MAY send it always (MAY-3). 1672 If an MSS option is not received at connection setup, TCP 1673 implementations MUST assume a default send MSS of 536 (576 - 40) for 1674 IPv4 or 1220 (1280 - 60) for IPv6 (MUST-15). 1676 The maximum size of a segment that a TCP endpoint actually sends, the 1677 "effective send MSS," MUST be the smaller (MUST-16) of the send MSS 1678 (which reflects the available reassembly buffer size at the remote 1679 host, the EMTU_R [20]) and the largest transmission size permitted by 1680 the IP layer (EMTU_S [20]): 1682 Eff.snd.MSS = 1684 min(SendMSS+20, MMS_S) - TCPhdrsize - IPoptionsize 1686 where: 1688 * SendMSS is the MSS value received from the remote host, or the 1689 default 536 for IPv4 or 1220 for IPv6, if no MSS option is 1690 received. 1692 * MMS_S is the maximum size for a transport-layer message that TCP 1693 may send. 1695 * TCPhdrsize is the size of the fixed TCP header and any options.
1696 This is 20 in the (rare) case that no options are present, but may 1697 be larger if TCP options are to be sent. Note that some options 1698 might not be included on all segments, but that for each segment 1699 sent, the sender should adjust the data length accordingly, within 1700 the Eff.snd.MSS. 1702 * IPoptionsize is the size of any IPv4 options or IPv6 extension 1703 headers associated with a TCP connection. Note that some options 1704 or extension headers might not be included on all packets, but 1705 that for each segment sent, the sender should adjust the data 1706 length accordingly, within the Eff.snd.MSS. 1708 The MSS value to be sent in an MSS option should be equal to the 1709 effective MTU minus the fixed IP and TCP headers. By ignoring both 1710 IP and TCP options when calculating the value for the MSS option, if 1711 there are any IP or TCP options to be sent in a packet, then the 1712 sender must decrease the size of the TCP data accordingly. RFC 6691 1713 [44] discusses this in greater detail. 1715 The MSS value to be sent in an MSS option must be less than or equal 1716 to: 1718 MMS_R - 20 1720 where MMS_R is the maximum size for a transport-layer message that 1721 can be received (and reassembled at the IP layer) (MUST-67). TCP 1722 obtains MMS_R and MMS_S from the IP layer; see the generic call 1723 GET_MAXSIZES in Section 3.4 of RFC 1122. These are defined in terms 1724 of their IP MTU equivalents, EMTU_R and EMTU_S [20]. 1726 When TCP is used in a situation where either the IP or TCP headers 1727 are not fixed, the sender must reduce the amount of TCP data in any 1728 given packet by the number of octets used by the IP and TCP options. 1729 This has been a point of confusion historically, as explained in RFC 1730 6691, Section 3.1. 1732 3.7.2. Path MTU Discovery 1734 A TCP implementation may be aware of the MTU on directly connected 1735 links, but will rarely have insight about MTUs across an entire 1736 network path. 
For IPv4, RFC 1122 recommends an IP-layer default 1737 effective MTU of less than or equal to 576 for destinations not 1738 directly connected, and for IPv6 this would be 1280. Using these 1739 fixed values limits TCP connection performance and efficiency. 1740 Instead, implementation of Path MTU Discovery (PMTUD) and 1741 Packetization Layer Path MTU Discovery (PLPMTUD) is strongly 1742 recommended in order for TCP to improve segmentation decisions. Both 1743 PMTUD and PLPMTUD help TCP choose segment sizes that avoid both on- 1744 path (for IPv4) and source fragmentation (IPv4 and IPv6). 1746 PMTUD for IPv4 [2] or IPv6 [14] is implemented in conjunction between 1747 TCP, IP, and ICMP protocols. It relies both on avoiding source 1748 fragmentation and setting the IPv4 DF (don't fragment) flag, the 1749 latter to inhibit on-path fragmentation. It relies on ICMP errors 1750 from routers along the path, whenever a segment is too large to 1751 traverse a link. Several adjustments to a TCP implementation with 1752 PMTUD are described in RFC 2923 in order to deal with problems 1753 experienced in practice [28]. PLPMTUD [32] is a Standards Track 1754 improvement to PMTUD that relaxes the requirement for ICMP support 1755 across a path, and improves performance in cases where ICMP is not 1756 consistently conveyed, but still tries to avoid source fragmentation. 1757 The mechanisms in all four of these RFCs are recommended to be 1758 included in TCP implementations. 1760 The TCP MSS option specifies an upper bound for the size of packets 1761 that can be received (see [44]). Hence, setting the value in the MSS 1762 option too small can impact the ability for PMTUD or PLPMTUD to find 1763 a larger path MTU. RFC 1191 discusses this implication of many older 1764 TCP implementations setting the TCP MSS to 536 (corresponding to the 1765 IPv4 576 byte default MTU) for non-local destinations, rather than 1766 deriving it from the MTUs of connected interfaces as recommended. 
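To make the interaction between the MSS option and the MTU concrete, the Eff.snd.MSS formula from Section 3.7.1 can be sketched as a small function. This is only an illustrative sketch: the function and parameter names are hypothetical, and the example values assume IPv4 with MMS_S taken as the effective MTU minus a 20-octet IP header.

```python
def effective_send_mss(send_mss, mms_s, tcp_hdr_size=20, ip_option_size=0):
    """Eff.snd.MSS = min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize.

    send_mss:       MSS received from the peer (or the default 536 / 1220).
    mms_s:          largest transport-layer message the IP layer will send.
    tcp_hdr_size:   fixed TCP header (20) plus any TCP options on the segment.
    ip_option_size: IPv4 options or IPv6 extension headers on the packet.
    """
    return min(send_mss + 20, mms_s) - tcp_hdr_size - ip_option_size


# Assumed example: 1500-octet MTU, so MMS_S = 1500 - 20 = 1480.
no_options = effective_send_mss(1460, 1480)                         # 1460
with_12_octets = effective_send_mss(1460, 1480, tcp_hdr_size=32)    # 1448
# No MSS option received: the default send MSS of 536 caps the segment size
# even though the local interface could carry more.
default_mss = effective_send_mss(536, 1480)                         # 536
```

The last case shows why an MSS of 536 limits what PMTUD or PLPMTUD can discover: the sender never attempts a segment larger than the peer's advertised (or default) MSS, regardless of the path MTU.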
1768 3.7.3. Interfaces with Variable MTU Values 1770 The effective MTU can sometimes vary, as when used with variable 1771 compression, e.g., RObust Header Compression (ROHC) [38]. It is 1772 tempting for a TCP implementation to advertise the largest possible 1773 MSS, to support the most efficient use of compressed payloads. 1774 Unfortunately, some compression schemes occasionally need to transmit 1775 full headers (and thus smaller payloads) to resynchronize state at 1776 their endpoint compressors/decompressors. If the largest MTU is used 1777 to calculate the value to advertise in the MSS option, TCP 1778 retransmission may interfere with compressor resynchronization. 1780 As a result, when the effective MTU of an interface varies packet-to- 1781 packet, TCP implementations SHOULD use the smallest effective MTU of 1782 the interface to calculate the value to advertise in the MSS option 1783 (SHLD-6). 1785 3.7.4. Nagle Algorithm 1787 The "Nagle algorithm" was described in RFC 896 [18] and was 1788 recommended in RFC 1122 [20] for mitigation of an early problem of 1789 too many small packets being generated. It has been implemented in 1790 most current TCP code bases, sometimes with minor variations (see 1791 Appendix A.3). 1793 If there is unacknowledged data (i.e., SND.NXT > SND.UNA), then the 1794 sending TCP endpoint buffers all user data (regardless of the PSH 1795 bit), until the outstanding data has been acknowledged or until the 1796 TCP endpoint can send a full-sized segment (Eff.snd.MSS bytes). 1798 A TCP implementation SHOULD implement the Nagle Algorithm to coalesce 1799 short segments (SHLD-7). However, there MUST be a way for an 1800 application to disable the Nagle algorithm on an individual 1801 connection (MUST-17). In all cases, sending data is also subject to 1802 the limitation imposed by the Slow Start algorithm [8]. 
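The Nagle rule stated above can be sketched as a single send decision. This is a simplified illustration, not a complete sender: the function name and parameters are hypothetical, and the sketch deliberately ignores the receive window, congestion control, and the variations discussed in Appendix A.3.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def nagle_allows_send(snd_nxt, snd_una, pending_bytes, eff_snd_mss,
                      nagle_enabled=True):
    """Return True if buffered user data may be sent now under the Nagle rule.

    Hypothetical helper: considers only the Nagle condition itself.
    """
    if pending_bytes <= 0:
        return False  # nothing buffered, nothing to send
    if not nagle_enabled:
        return True   # the application disabled Nagle on this connection
    if pending_bytes >= eff_snd_mss:
        return True   # a full-sized segment can always go out
    # Otherwise send a small segment only when no data is unacknowledged,
    # i.e., SND.NXT == SND.UNA (compared modulo 2**32).
    unacked = (snd_nxt - snd_una) % SEQ_MODULUS
    return unacked == 0
```

For example, with 10 octets buffered and an Eff.snd.MSS of 1460, the function returns True on an idle connection (SND.NXT equal to SND.UNA) and False while earlier data is still unacknowledged. In sockets-style APIs, the per-connection switch required by MUST-17 is commonly exposed as the TCP_NODELAY option.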
1804 Since there can be problematic interactions between the Nagle 1805 Algorithm and delayed acknowledgements, some implementations use 1806 minor variations of the Nagle algorithm, such as the one described in 1807 Appendix A.3. 1809 3.7.5. IPv6 Jumbograms 1811 In order to support TCP over IPv6 Jumbograms, implementations need to 1812 be able to send TCP segments larger than the 64KB limit that the MSS 1813 option can convey. RFC 2675 [25] defines that an MSS value of 65,535 1814 bytes is to be treated as infinity, and Path MTU Discovery [14] is 1815 used to determine the actual MSS. 1817 The Jumbo Payload option need not be implemented or understood by 1818 IPv6 nodes that do not support attachment to links with a MTU greater 1819 than 65,575 [25], and the present IPv6 Node Requirements does not 1820 include support for Jumbograms [55]. 1822 3.8. Data Communication 1824 Once the connection is established data is communicated by the 1825 exchange of segments. Because segments may be lost due to errors 1826 (checksum test failure), or network congestion, TCP uses 1827 retransmission to ensure delivery of every segment. Duplicate 1828 segments may arrive due to network or TCP retransmission. As 1829 discussed in the section on sequence numbers, the TCP implementation 1830 performs certain tests on the sequence and acknowledgment numbers in 1831 the segments to verify their acceptability. 1833 The sender of data keeps track of the next sequence number to use in 1834 the variable SND.NXT. The receiver of data keeps track of the next 1835 sequence number to expect in the variable RCV.NXT. The sender of 1836 data keeps track of the oldest unacknowledged sequence number in the 1837 variable SND.UNA. If the data flow is momentarily idle and all data 1838 sent has been acknowledged then the three variables will be equal. 1840 When the sender creates a segment and transmits it the sender 1841 advances SND.NXT. 
When the receiver accepts a segment, it advances 1842 RCV.NXT and sends an acknowledgment. When the data sender receives 1843 an acknowledgment, it advances SND.UNA. The extent to which the 1844 values of these variables differ is a measure of the delay in the 1845 communication. The amount by which the variables are advanced is the 1846 length of the data and SYN or FIN flags in the segment. Note that 1847 once in the ESTABLISHED state, all segments must carry current 1848 acknowledgment information. 1850 The CLOSE user call implies a push function (see Section 3.9.1), as 1851 does the FIN control flag in an incoming segment. 1853 3.8.1. Retransmission Timeout 1855 Because of the variability of the networks that compose an 1856 internetwork system and the wide range of uses of TCP connections, the 1857 retransmission timeout (RTO) must be dynamically determined. 1859 The RTO MUST be computed according to the algorithm in [10], 1860 including Karn's algorithm for taking RTT samples (MUST-18). 1862 RFC 793 contains an early example procedure for computing the RTO, 1863 based on work mentioned in IEN 177 [72]. This was then replaced by 1864 the algorithm described in RFC 1122, and subsequently updated in RFC 1865 2988, and then again in RFC 6298. 1867 RFC 1122 allows that if a retransmitted packet is identical to the 1868 original packet (which implies not only that the data boundaries have 1869 not changed, but also that none of the headers have changed), then 1870 the same IPv4 Identification field MAY be used (see Section 3.2.1.5 1871 of RFC 1122) (MAY-4). The same IP identification field may be reused 1872 anyway, since it is only meaningful when a datagram is fragmented 1873 [45]. TCP implementations should not rely on or typically interact 1874 with this IPv4 header field in any way. It is not a reasonable way 1875 either to indicate duplicate sent segments or to identify duplicate 1876 received segments. 1878 3.8.2.
TCP Congestion Control 1880 RFC 2914 [5] explains the importance of congestion control for the 1881 Internet. 1883 RFC 1122 required implementation of Van Jacobson's congestion control 1884 algorithms slow start and congestion avoidance together with 1885 exponential back-off for successive RTO values for the same segment. 1886 RFC 2581 provided IETF Standards Track description of slow start and 1887 congestion avoidance, along with fast retransmit and fast recovery. 1888 RFC 5681 is the current description of these algorithms and is the 1889 current Standards Track specification providing guidelines for TCP 1890 congestion control. RFC 6298 describes exponential back-off of RTO 1891 values, including keeping the backed-off value until a subsequent 1892 segment with new data has been sent and acknowledged without 1893 retransmission. 1895 A TCP endpoint MUST implement the basic congestion control algorithms 1896 slow start, congestion avoidance, and exponential back-off of RTO to 1897 avoid creating congestion collapse conditions (MUST-19). RFC 5681 1898 and RFC 6298 describe the basic algorithms on the IETF Standards 1899 Track that are broadly applicable. Multiple other suitable 1900 algorithms exist and have been widely used. Many TCP implementations 1901 support a set of alternative algorithms that can be configured for 1902 use on the endpoint. An endpoint MAY implement such alternative 1903 algorithms provided that the algorithms are conformant with the TCP 1904 specifications from the IETF Standards Track as described in RFC 1905 2914, RFC 5033 [7], and RFC 8961 [15] (MAY-18). 1907 Explicit Congestion Notification (ECN) was defined in RFC 3168 and is 1908 an IETF Standards Track enhancement that has many benefits [52]. 1910 A TCP endpoint SHOULD implement ECN as described in RFC 3168 (SHLD- 1911 8). 1913 3.8.3. 
TCP Connection Failures 1915 Excessive retransmission of the same segment by a TCP endpoint 1916 indicates some failure of the remote host or the Internet path. This 1917 failure may be of short or long duration. The following procedure 1918 MUST be used to handle excessive retransmissions of data segments 1919 (MUST-20): 1921 (a) There are two thresholds R1 and R2 measuring the amount of 1922 retransmission that has occurred for the same segment. R1 and R2 1923 might be measured in time units or as a count of retransmissions 1924 (with the current RTO and corresponding backoffs as a conversion 1925 factor, if needed). 1927 (b) When the number of transmissions of the same segment reaches 1928 or exceeds threshold R1, pass negative advice (see Section 3.3.1.4 1929 of [20]) to the IP layer, to trigger dead-gateway diagnosis. 1931 (c) When the number of transmissions of the same segment reaches a 1932 threshold R2 greater than R1, close the connection. 1934 (d) An application MUST (MUST-21) be able to set the value for R2 1935 for a particular connection. For example, an interactive 1936 application might set R2 to "infinity," giving the user control 1937 over when to disconnect. 1939 (e) TCP implementations SHOULD inform the application of the 1940 delivery problem (unless such information has been disabled by the 1941 application; see Asynchronous Reports section), when R1 is reached 1942 and before R2 (SHLD-9). This will allow a remote login 1943 application program to inform the user, for example. 1945 The value of R1 SHOULD correspond to at least 3 retransmissions, at 1946 the current RTO (SHLD-10). The value of R2 SHOULD correspond to at 1947 least 100 seconds (SHLD-11). 1949 An attempt to open a TCP connection could fail with excessive 1950 retransmissions of the SYN segment or by receipt of a RST segment or 1951 an ICMP Port Unreachable. 
SYN retransmissions MUST be handled in the 1952 general way just described for data retransmissions, including 1953 notification of the application layer. 1955 However, the values of R1 and R2 may be different for SYN and data 1956 segments. In particular, R2 for a SYN segment MUST be set large 1957 enough to provide retransmission of the segment for at least 3 1958 minutes (MUST-23). The application can close the connection (i.e., 1959 give up on the open attempt) sooner, of course. 1961 3.8.4. TCP Keep-Alives 1963 A TCP connection is said to be "idle" if for some long amount of time 1964 there have been no incoming segments received and there is no new or 1965 unacknowledged data to be sent. 1967 Implementors MAY include "keep-alives" in their TCP implementations 1968 (MAY-5), although this practice is not universally accepted. Some 1969 TCP implementations, however, have included a keep-alive mechanism. 1970 To confirm that an idle connection is still active, these 1971 implementations send a probe segment designed to elicit a response 1972 from the TCP peer. Such a segment generally contains SEG.SEQ = 1973 SND.NXT-1 and may or may not contain one garbage octet of data. If 1974 keep-alives are included, the application MUST be able to turn them 1975 on or off for each TCP connection (MUST-24), and they MUST default to 1976 off (MUST-25). 1978 Keep-alive packets MUST only be sent when no sent data is 1979 outstanding, and no data or acknowledgement packets have been 1980 received for the connection within an interval (MUST-26). This 1981 interval MUST be configurable (MUST-27) and MUST default to no less 1982 than two hours (MUST-28). 1984 It is extremely important to remember that ACK segments that contain 1985 no data are not reliably transmitted by TCP. Consequently, if a 1986 keep-alive mechanism is implemented it MUST NOT interpret failure to 1987 respond to any specific probe as a dead connection (MUST-29). 
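From the application's side, the per-connection keep-alive control required above is typically reached through socket options. The following is a minimal sketch using the Python sockets API; the helper name enable_keepalive is hypothetical, SO_KEEPALIVE is the widely available on/off switch, and TCP_KEEPIDLE is a platform-specific option (present on Linux, for example) that the sketch uses only when the platform exposes it.

```python
import socket

TWO_HOURS = 2 * 60 * 60  # the minimum default idle interval (MUST-28)


def enable_keepalive(sock, idle_seconds=TWO_HOURS):
    """Turn keep-alives on for one connection (they default to off).

    Hypothetical helper for illustration; idle_seconds is the idle time
    before the first probe, applied only where TCP_KEEPIDLE exists.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Platform-specific: not every operating system defines this option.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Because a single unanswered probe must not be treated as a dead connection (MUST-29), applications that need prompt failure detection usually combine this with their own application-level timeouts rather than shortening the idle interval aggressively.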
1989 An implementation SHOULD send a keep-alive segment with no data 1990 (SHLD-12); however, it MAY be configurable to send a keep-alive 1991 segment containing one garbage octet (MAY-6), for compatibility with 1992 erroneous TCP implementations. 1994 3.8.5. The Communication of Urgent Information 1996 As a result of implementation differences and middlebox interactions, 1997 new applications SHOULD NOT employ the TCP urgent mechanism (SHLD- 1998 13). However, TCP implementations MUST still include support for the 1999 urgent mechanism (MUST-30). Information on how some TCP 2000 implementations interpret the urgent pointer can be found in RFC 6093 2001 [40]. 2003 The objective of the TCP urgent mechanism is to allow the sending 2004 user to stimulate the receiving user to accept some urgent data and 2005 to permit the receiving TCP endpoint to indicate to the receiving 2006 user when all the currently known urgent data has been received by 2007 the user. 2009 This mechanism permits a point in the data stream to be designated as 2010 the end of urgent information. Whenever this point is in advance of 2011 the receive sequence number (RCV.NXT) at the receiving TCP endpoint, 2012 that TCP must tell the user to go into "urgent mode"; when the 2013 receive sequence number catches up to the urgent pointer, the TCP 2014 implementation must tell user to go into "normal mode". If the 2015 urgent pointer is updated while the user is in "urgent mode", the 2016 update will be invisible to the user. 2018 The method employs an urgent field that is carried in all segments 2019 transmitted. The URG control flag indicates that the urgent field is 2020 meaningful and must be added to the segment sequence number to yield 2021 the urgent pointer. The absence of this flag indicates that there is 2022 no urgent data outstanding. 2024 To send an urgent indication the user must also send at least one 2025 data octet. 
If the sending user also indicates a push, timely 2026 delivery of the urgent information to the destination process is 2027 enhanced. Note that because changes in the urgent pointer correspond 2028 to data being written by a sending application, the urgent pointer 2029 can not "recede" in the sequence space, but a TCP receiver should be 2030 robust to invalid urgent pointer values. 2032 A TCP implementation MUST support a sequence of urgent data of any 2033 length (MUST-31). [20] 2035 The urgent pointer MUST point to the sequence number of the octet 2036 following the urgent data (MUST-62). 2038 A TCP implementation MUST (MUST-32) inform the application layer 2039 asynchronously whenever it receives an Urgent pointer and there was 2040 previously no pending urgent data, or whenever the Urgent pointer 2041 advances in the data stream. The TCP implementation MUST (MUST-33) 2042 provide a way for the application to learn how much urgent data 2043 remains to be read from the connection, or at least to determine 2044 whether more urgent data remains to be read [20]. 2046 3.8.6. Managing the Window 2048 The window sent in each segment indicates the range of sequence 2049 numbers the sender of the window (the data receiver) is currently 2050 prepared to accept. There is an assumption that this is related to 2051 the currently available data buffer space available for this 2052 connection. 2054 The sending TCP endpoint packages the data to be transmitted into 2055 segments that fit the current window, and may repackage segments on 2056 the retransmission queue. Such repackaging is not required, but may 2057 be helpful. 2059 In a connection with a one-way data flow, the window information will 2060 be carried in acknowledgment segments that all have the same sequence 2061 number, so there will be no way to reorder them if they arrive out of 2062 order. 
This is not a serious problem, but it will allow the window 2063 information to be on occasion temporarily based on old reports from 2064 the data receiver. A refinement to avoid this problem is to act on 2065 the window information from segments that carry the highest 2066 acknowledgment number (that is segments with acknowledgment number 2067 equal or greater than the highest previously received). 2069 Indicating a large window encourages transmissions. If more data 2070 arrives than can be accepted, it will be discarded. This will result 2071 in excessive retransmissions, adding unnecessarily to the load on the 2072 network and the TCP endpoints. Indicating a small window may 2073 restrict the transmission of data to the point of introducing a round 2074 trip delay between each new segment transmitted. 2076 The mechanisms provided allow a TCP endpoint to advertise a large 2077 window and to subsequently advertise a much smaller window without 2078 having accepted that much data. This, so-called "shrinking the 2079 window," is strongly discouraged. The robustness principle [20] 2080 dictates that TCP peers will not shrink the window themselves, but 2081 will be prepared for such behavior on the part of other TCP peers. 2083 A TCP receiver SHOULD NOT shrink the window, i.e., move the right 2084 window edge to the left (SHLD-14). However, a sending TCP peer MUST 2085 be robust against window shrinking, which may cause the "usable 2086 window" (see Section 3.8.6.2.1) to become negative (MUST-34). 2088 If this happens, the sender SHOULD NOT send new data (SHLD-15), but 2089 SHOULD retransmit normally the old unacknowledged data between 2090 SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also 2091 retransmit old data beyond SND.UNA+SND.WND (MAY-7), but SHOULD NOT 2092 time out the connection if data beyond the right window edge is not 2093 acknowledged (SHLD-17). 
If the window shrinks to zero, the TCP 2094 implementation MUST probe it in the standard way (described below) 2095 (MUST-35). 2097 3.8.6.1. Zero Window Probing 2099 The sending TCP peer must regularly transmit at least one octet of 2100 new data (if available) or retransmit to the receiving TCP peer even 2101 if the send window is zero, in order to "probe" the window. This 2102 retransmission is essential to guarantee that when either TCP peer 2103 has a zero window the re-opening of the window will be reliably 2104 reported to the other. This is referred to as Zero-Window Probing 2105 (ZWP) in other documents. 2107 Probing of zero (offered) windows MUST be supported (MUST-36). 2109 A TCP implementation MAY keep its offered receive window closed 2110 indefinitely (MAY-8). As long as the receiving TCP peer continues to 2111 send acknowledgments in response to the probe segments, the sending 2112 TCP peer MUST allow the connection to stay open (MUST-37). This 2113 enables TCP to function in scenarios such as the "printer ran out of 2114 paper" situation described in Section 4.2.2.17 of [20]. The behavior 2115 is subject to the implementation's resource management concerns, as 2116 noted in [42]. 2118 When the receiving TCP peer has a zero window and a segment arrives 2119 it must still send an acknowledgment showing its next expected 2120 sequence number and current window (zero). 2122 The transmitting host SHOULD send the first zero-window probe when a 2123 zero window has existed for the retransmission timeout period (SHLD- 2124 29) (Section 3.8.1), and SHOULD increase exponentially the interval 2125 between successive probes (SHLD-30). 2127 3.8.6.2. Silly Window Syndrome Avoidance 2129 The "Silly Window Syndrome" (SWS) is a stable pattern of small 2130 incremental window movements resulting in extremely poor TCP 2131 performance. Algorithms to avoid SWS are described below for both 2132 the sending side and the receiving side. 
RFC 1122 contains more 2133 detailed discussion of the SWS problem. Note that the Nagle 2134 algorithm and the sender SWS avoidance algorithm play complementary 2135 roles in improving performance. The Nagle algorithm discourages 2136 sending tiny segments when the data to be sent increases in small 2137 increments, while the SWS avoidance algorithm discourages small 2138 segments resulting from the right window edge advancing in small 2139 increments. 2141 3.8.6.2.1. Sender's Algorithm - When to Send Data 2143 A TCP implementation MUST include a SWS avoidance algorithm in the 2144 sender (MUST-38). 2146 The Nagle algorithm from Section 3.7.4 additionally describes how to 2147 coalesce short segments. 2149 The sender's SWS avoidance algorithm is more difficult than the 2150 receiver's, because the sender does not know (directly) the 2151 receiver's total buffer space RCV.BUFF. An approach that has been 2152 found to work well is for the sender to calculate Max(SND.WND), the 2153 maximum send window it has seen so far on the connection, and to use 2154 this value as an estimate of RCV.BUFF. Unfortunately, this can only 2155 be an estimate; the receiver may at any time reduce the size of 2156 RCV.BUFF. To avoid a resulting deadlock, it is necessary to have a 2157 timeout to force transmission of data, overriding the SWS avoidance 2158 algorithm. In practice, this timeout should seldom occur. 2160 The "usable window" is: 2162 U = SND.UNA + SND.WND - SND.NXT 2164 i.e., the offered window less the amount of data sent but not 2165 acknowledged. If D is the amount of data queued in the sending TCP 2166 endpoint but not yet sent, then the following set of rules is 2167 recommended. 
2169     Send data:

2171     (1)  if a maximum-sized segment can be sent, i.e., if:

2173          min(D,U) >= Eff.snd.MSS;

2175     (2)  or if the data is pushed and all queued data can be sent now,
2176          i.e., if:

2178          [SND.NXT = SND.UNA and] PUSHED and D <= U

2180          (the bracketed condition is imposed by the Nagle algorithm);

2182     (3)  or if at least a fraction Fs of the maximum window can be sent,
2183          i.e., if:

2185          [SND.NXT = SND.UNA and]

2187          min(D,U) >= Fs * Max(SND.WND);

2189     (4)  or if the override timeout occurs.

2191     Here Fs is a fraction whose recommended value is 1/2. The override
2192     timeout should be in the range 0.1 - 1.0 seconds. It may be
2193     convenient to combine this timer with the timer used to probe zero
2194     windows (Section 3.8.6.1).

2196  3.8.6.2.2. Receiver's Algorithm - When to Send a Window Update

2198     A TCP implementation MUST include a SWS avoidance algorithm in the
2199     receiver (MUST-39).

2201     The receiver's SWS avoidance algorithm determines when the right
2202     window edge may be advanced; this is customarily known as "updating
2203     the window". This algorithm combines with the delayed ACK algorithm
2204     (Section 3.8.6.3) to determine when an ACK segment containing the
2205     current window will really be sent to the receiver.

2207     The solution to receiver SWS is to avoid advancing the right window
2208     edge RCV.NXT+RCV.WND in small increments, even if data is received
2209     from the network in small segments.

2211     Suppose the total receive buffer space is RCV.BUFF. At any given
2212     moment, RCV.USER octets of this total may be tied up with data that
2213     has been received and acknowledged but that the user process has not
2214     yet consumed. When the connection is quiescent, RCV.WND = RCV.BUFF
2215     and RCV.USER = 0.
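Returning briefly to the sender side: rules (1) through (4) of Section 3.8.6.2.1 can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not a reference implementation. The function and parameter names are invented here, sequence numbers are treated as plain integers (no modulo 2**32 wraparound), and the early return when no data is queued is a simplification added for the example.

```python
from fractions import Fraction

FS = Fraction(1, 2)  # recommended value of Fs


def usable_window(snd_una, snd_wnd, snd_nxt):
    """U = SND.UNA + SND.WND - SND.NXT; negative if the peer shrank the window."""
    return snd_una + snd_wnd - snd_nxt


def should_send(d, u, eff_snd_mss, max_snd_wnd, pushed, idle,
                nagle=True, override_timeout=False):
    """Return True if any of rules (1)-(4) permits sending now.

    d    -- octets queued in the sender but not yet sent (D)
    u    -- usable window (U)
    idle -- True when SND.NXT == SND.UNA, i.e. nothing is unacknowledged
    """
    if d <= 0:
        return False                       # assumption: nothing queued, nothing to do
    nagle_ok = idle or not nagle           # the bracketed Nagle condition
    if min(d, u) >= eff_snd_mss:                       # rule (1)
        return True
    if nagle_ok and pushed and d <= u:                 # rule (2)
        return True
    if nagle_ok and min(d, u) >= FS * max_snd_wnd:     # rule (3)
        return True
    return override_timeout                            # rule (4)
```

For example, with `eff_snd_mss=1460` and `max_snd_wnd=8000`, a pushed 100-octet write is held back while earlier data is unacknowledged and released once the connection is idle or the override timer fires.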
2217     Keeping the right window edge fixed as data arrives and is
2218     acknowledged requires that the receiver offer less than its full
2219     buffer space, i.e., the receiver must specify a RCV.WND that keeps
2220     RCV.NXT+RCV.WND constant as RCV.NXT increases. Thus, the total
2221     buffer space RCV.BUFF is generally divided into three parts:

2223         |<------- RCV.BUFF ---------------->|
2224              1             2            3
2225     ----|---------|------------------|------|----
2226                RCV.NXT               ^
2227                                   (Fixed)

2229     1 - RCV.USER =  data received but not yet consumed;
2230     2 - RCV.WND =   space advertised to sender;
2231     3 - Reduction = space available but not yet
2232                     advertised.

2234     The suggested SWS avoidance algorithm for the receiver is to keep
2235     RCV.NXT+RCV.WND fixed until the reduction satisfies:

2237          RCV.BUFF - RCV.USER - RCV.WND  >=

2239                 min( Fr * RCV.BUFF, Eff.snd.MSS )

2241     where Fr is a fraction whose recommended value is 1/2, and
2242     Eff.snd.MSS is the effective send MSS for the connection (see
2243     Section 3.7.1). When the inequality is satisfied, RCV.WND is set to
2244     RCV.BUFF-RCV.USER.

2246     Note that the general effect of this algorithm is to advance RCV.WND
2247     in increments of Eff.snd.MSS (for realistic receive buffers:
2248     Eff.snd.MSS < RCV.BUFF/2). Note also that the receiver must use its
2249     own Eff.snd.MSS, making the assumption that it is the same as the
2250     sender's.

2252  3.8.6.3. Delayed Acknowledgements - When to Send an ACK Segment

2254     A host that is receiving a stream of TCP data segments can increase
2255     efficiency in both the Internet and the hosts by sending fewer than
2256     one ACK (acknowledgment) segment per data segment received; this is
2257     known as a "delayed ACK".

2259     A TCP endpoint SHOULD implement a delayed ACK (SHLD-18), but an ACK
2260     should not be excessively delayed; in particular, the delay MUST be
2261     less than 0.5 seconds (MUST-40).
An ACK SHOULD be generated for at 2262 least every second full-sized segment or 2*RMSS bytes of new data 2263 (where RMSS is the MSS specified by the TCP endpoint receiving the 2264 segments to be acknowledged, or the default value if not specified) 2265 (SHLD-19). Excessive delays on ACKs can disturb the round-trip 2266 timing and packet "clocking" algorithms. More complete discussion of 2267 delayed ACK behavior is in Section 4.2 of RFC 5681 [8], including 2268 recommendations to immediately acknowledge out-of-order segments, 2269 segments above a gap in sequence space, or segments that fill all or 2270 part of a gap, in order to accelerate loss recovery. 2272 Note that there are several current practices that further lead to a 2273 reduced number of ACKs, including generic receive offload (GRO) [73], 2274 ACK compression, and ACK decimation [29]. 2276 3.9. Interfaces 2278 There are of course two interfaces of concern: the user/TCP interface 2279 and the TCP/lower level interface. We have a fairly elaborate model 2280 of the user/TCP interface, but the interface to the lower level 2281 protocol module is left unspecified here, since it will be specified 2282 in detail by the specification of the lower level protocol. For the 2283 case that the lower level is IP we note some of the parameter values 2284 that TCP implementations might use. 2286 3.9.1. User/TCP Interface 2288 The following functional description of user commands to the TCP 2289 implementation is, at best, fictional, since every operating system 2290 will have different facilities. Consequently, we must warn readers 2291 that different TCP implementations may have different user 2292 interfaces. However, all TCP implementations must provide a certain 2293 minimum set of services to guarantee that all TCP implementations can 2294 support the same protocol hierarchy. This section specifies the 2295 functional interfaces required of all TCP implementations. 
2297     Section 3.1 of [54] also identifies primitives provided by TCP, and
2298     could be used as an additional reference for implementers.

2300     The following sections functionally characterize a USER/TCP
2301     interface. The notation used is similar to most procedure or
2302     function calls in high level languages, but this usage is not meant
2303     to rule out trap type service calls.

2305     The user commands described below specify the basic functions the TCP
2306     implementation must perform to support interprocess communication.
2307     Individual implementations must define their own exact format, and
2308     may provide combinations or subsets of the basic functions in single
2309     calls. In particular, some implementations may wish to automatically
2310     OPEN a connection on the first SEND or RECEIVE issued by the user for
2311     a given connection.

2313     In providing interprocess communication facilities, the TCP
2314     implementation must not only accept commands, but must also return
2315     information to the processes it serves. The latter consists of:

2317        (a) general information about a connection (e.g., interrupts,
2318        remote close, binding of unspecified remote socket).

2320        (b) replies to specific user commands indicating success or
2321        various types of failure.

2323  3.9.1.1. Open

2325        Format: OPEN (local port, remote socket, active/passive [,
2326        timeout] [, DiffServ field] [, security/compartment] [, local IP
2327        address] [, options]) -> local connection name

2329        If the active/passive flag is set to passive, then this is a call
2330        to LISTEN for an incoming connection. A passive open may have
2331        either a fully specified remote socket to wait for a particular
2332        connection or an unspecified remote socket to wait for any call.
2333        A fully specified passive call can be made active by the
2334        subsequent execution of a SEND.

2336        A transmission control block (TCB) is created and partially filled
2337        in with data from the OPEN command parameters.
2339 Every passive OPEN call either creates a new connection record in 2340 LISTEN state, or it returns an error; it MUST NOT affect any 2341 previously created connection record (MUST-41). 2343 A TCP implementation that supports multiple concurrent connections 2344 MUST provide an OPEN call that will functionally allow an 2345 application to LISTEN on a port while a connection block with the 2346 same local port is in SYN-SENT or SYN-RECEIVED state (MUST-42). 2348 On an active OPEN command, the TCP endpoint will begin the 2349 procedure to synchronize (i.e., establish) the connection at once. 2351 The timeout, if present, permits the caller to set up a timeout 2352 for all data submitted to TCP. If data is not successfully 2353 delivered to the destination within the timeout period, the TCP 2354 endpoint will abort the connection. The present global default is 2355 five minutes. 2357 The TCP implementation or some component of the operating system 2358 will verify the user's authority to open a connection with the 2359 specified DiffServ field value or security/compartment. The 2360 absence of a DiffServ field value or security/compartment 2361 specification in the OPEN call indicates the default values must 2362 be used. 2364 TCP will accept incoming requests as matching only if the 2365 security/compartment information is exactly the same as that 2366 requested in the OPEN call. 2368 The DiffServ field value indicated by the user only impacts 2369 outgoing packets, may be altered en route through the network, and 2370 has no direct bearing or relation to received packets. 2372 A local connection name will be returned to the user by the TCP 2373 implementation. The local connection name can then be used as a 2374 short-hand term for the connection defined by the <local socket, 2375 remote socket> pair. 2377 The optional "local IP address" parameter MUST be supported to 2378 allow the specification of the local IP address (MUST-43).
This 2379 enables applications that need to select the local IP address used 2380 when multihoming is present. 2382 A passive OPEN call with a specified "local IP address" parameter 2383 will await an incoming connection request to that address. If the 2384 parameter is unspecified, a passive OPEN will await an incoming 2385 connection request to any local IP address, and then bind the 2386 local IP address of the connection to the particular address that 2387 is used. 2389 For an active OPEN call, a specified "local IP address" parameter 2390 will be used for opening the connection. If the parameter is 2391 unspecified, the host will choose an appropriate local IP address 2392 (see RFC 1122 section 3.3.4.2). 2394 If an application on a multihomed host does not specify the local 2395 IP address when actively opening a TCP connection, then the TCP 2396 implementation MUST ask the IP layer to select a local IP address 2397 before sending the (first) SYN (MUST-44). See the function 2398 GET_SRCADDR() in Section 3.4 of RFC 1122. 2400 At all other times, a previous segment has either been sent or 2401 received on this connection, and TCP implementations MUST use the 2402 same local address that was used in those previous segments (MUST- 2403 45). 2405 A TCP implementation MUST reject as an error a local OPEN call for 2406 an invalid remote IP address (e.g., a broadcast or multicast 2407 address) (MUST-46). 2409 3.9.1.2. Send 2411 Format: SEND (local connection name, buffer address, byte count, 2412 PUSH flag (optional), URGENT flag [,timeout]) 2413 This call causes the data contained in the indicated user buffer 2414 to be sent on the indicated connection. If the connection has not 2415 been opened, the SEND is considered an error. Some 2416 implementations may allow users to SEND first; in which case, an 2417 automatic OPEN would be done. For example, this might be one way 2418 for application data to be included in SYN segments. 
If the 2419 calling process is not authorized to use this connection, an error 2420 is returned. 2422 A TCP endpoint MAY implement PUSH flags on SEND calls (MAY-15). 2423 If PUSH flags are not implemented, then the sending TCP peer: (1) 2424 MUST NOT buffer data indefinitely (MUST-60), and (2) MUST set the 2425 PSH bit in the last buffered segment (i.e., when there is no more 2426 queued data to be sent) (MUST-61). The remaining description 2427 below assumes the PUSH flag is supported on SEND calls. 2429 If the PUSH flag is set, the application intends the data to be 2430 transmitted promptly to the receiver, and the PUSH bit will be set 2431 in the last TCP segment created from the buffer. 2433 The PSH bit is not a record marker and is independent of segment 2434 boundaries. The transmitter SHOULD collapse successive bits when 2435 it packetizes data, to send the largest possible segment (SHLD- 2436 27). 2438 If the PUSH flag is not set, the data may be combined with data 2439 from subsequent SENDs for transmission efficiency. When an 2440 application issues a series of SEND calls without setting the PUSH 2441 flag, the TCP implementation MAY aggregate the data internally 2442 without sending it (MAY-16). Note that when the Nagle algorithm 2443 is in use, TCP implementations may buffer the data before sending, 2444 without regard to the PUSH flag (see Section 3.7.4). 2446 An application program is logically required to set the PUSH flag 2447 in a SEND call whenever it needs to force delivery of the data to 2448 avoid a communication deadlock. However, a TCP implementation 2449 SHOULD send a maximum-sized segment whenever possible (SHLD-28), 2450 to improve performance (see Section 3.8.6.2.1). 2452 New applications SHOULD NOT set the URGENT flag [40] due to 2453 implementation differences and middlebox issues (SHLD-13). 2455 If the URGENT flag is set, segments sent to the destination TCP 2456 peer will have the urgent pointer set. 
The receiving TCP peer 2457 will signal the urgent condition to the receiving process if the 2458 urgent pointer indicates that data preceding the urgent pointer 2459 has not been consumed by the receiving process. The purpose of 2460 urgent is to stimulate the receiver to process the urgent data and 2461 to indicate to the receiver when all the currently known urgent 2462 data has been received. The number of times the sending user's 2463 TCP implementation signals urgent will not necessarily be equal to 2464 the number of times the receiving user will be notified of the 2465 presence of urgent data. 2467 If no remote socket was specified in the OPEN, but the connection 2468 is established (e.g., because a LISTENing connection has become 2469 specific due to a remote segment arriving for the local socket), 2470 then the designated buffer is sent to the implied remote socket. 2471 Users who make use of OPEN with an unspecified remote socket can 2472 make use of SEND without ever explicitly knowing the remote socket 2473 address. 2475 However, if a SEND is attempted before the remote socket becomes 2476 specified, an error will be returned. Users can use the STATUS 2477 call to determine the status of the connection. Some TCP 2478 implementations may notify the user when an unspecified socket is 2479 bound. 2481 If a timeout is specified, the current user timeout for this 2482 connection is changed to the new one. 2484 In the simplest implementation, SEND would not return control to 2485 the sending process until either the transmission was complete or 2486 the timeout had been exceeded. However, this simple method is 2487 both subject to deadlocks (for example, both sides of the 2488 connection might try to do SENDs before doing any RECEIVEs) and 2489 offers poor performance, so it is not recommended. 
A more 2490 sophisticated implementation would return immediately to allow the 2491 process to run concurrently with network I/O, and, furthermore, to 2492 allow multiple SENDs to be in progress. Multiple SENDs are served 2493 in first come, first served order, so the TCP endpoint will queue 2494 those it cannot service immediately. 2496 We have implicitly assumed an asynchronous user interface in which 2497 a SEND later elicits some kind of SIGNAL or pseudo-interrupt from 2498 the serving TCP endpoint. An alternative is to return a response 2499 immediately. For instance, SENDs might return immediate local 2500 acknowledgment, even if the segment sent had not been acknowledged 2501 by the distant TCP endpoint. We could optimistically assume 2502 eventual success. If we are wrong, the connection will close 2503 anyway due to the timeout. In implementations of this kind 2504 (synchronous), there will still be some asynchronous signals, but 2505 these will deal with the connection itself, and not with specific 2506 segments or buffers. 2508 In order for the process to distinguish among error or success 2509 indications for different SENDs, it might be appropriate for the 2510 buffer address to be returned along with the coded response to the 2511 SEND request. TCP-to-user signals are discussed below, indicating 2512 the information that should be returned to the calling process. 2514 3.9.1.3. Receive 2516 Format: RECEIVE (local connection name, buffer address, byte 2517 count) -> byte count, urgent flag, push flag (optional) 2519 This command allocates a receiving buffer associated with the 2520 specified connection. If no OPEN precedes this command or the 2521 calling process is not authorized to use this connection, an error 2522 is returned. 2524 In the simplest implementation, control would not return to the 2525 calling program until either the buffer was filled, or some error 2526 occurred, but this scheme is highly subject to deadlocks. 
A more 2527 sophisticated implementation would permit several RECEIVEs to be 2528 outstanding at once. These would be filled as segments arrive. 2529 This strategy permits increased throughput at the cost of a more 2530 elaborate scheme (possibly asynchronous) to notify the calling 2531 program that a PUSH has been seen or a buffer filled. 2533 A TCP receiver MAY pass a received PSH flag to the application 2534 layer via the PUSH flag in the interface (MAY-17), but it is not 2535 required (this was clarified in RFC 1122 section 4.2.2.2). The 2536 remainder of text describing the RECEIVE call below assumes that 2537 passing the PUSH indication is supported. 2539 If enough data arrive to fill the buffer before a PUSH is seen, 2540 the PUSH flag will not be set in the response to the RECEIVE. The 2541 buffer will be filled with as much data as it can hold. If a PUSH 2542 is seen before the buffer is filled the buffer will be returned 2543 partially filled and PUSH indicated. 2545 If there is urgent data the user will have been informed as soon 2546 as it arrived via a TCP-to-user signal. The receiving user should 2547 thus be in "urgent mode". If the URGENT flag is on, additional 2548 urgent data remains. If the URGENT flag is off, this call to 2549 RECEIVE has returned all the urgent data, and the user may now 2550 leave "urgent mode". Note that data following the urgent pointer 2551 (non-urgent data) cannot be delivered to the user in the same 2552 buffer with preceding urgent data unless the boundary is clearly 2553 marked for the user. 2555 To distinguish among several outstanding RECEIVEs and to take care 2556 of the case that a buffer is not completely filled, the return 2557 code is accompanied by both a buffer pointer and a byte count 2558 indicating the actual length of the data received. 2560 Alternative implementations of RECEIVE might have the TCP endpoint 2561 allocate buffer storage, or the TCP endpoint might share a ring 2562 buffer with the user. 
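The interaction between buffer size and PUSH described for RECEIVE can be sketched as follows. This is a simplified illustration, not the specified interface: the function name, the representation of queued data as in-order (data, psh) pairs, and the non-blocking behavior (the call returns whatever is available instead of waiting) are assumptions. Urgent data is ignored. When a pushed segment ends exactly at the end of the buffer, this sketch reports PUSH; the text does not settle that corner case.

```python
from collections import deque


def receive(queue, byte_count):
    """Fill a buffer of byte_count octets from in-order (data, psh) segments.

    Returns (data, push_flag).
    """
    out = bytearray()
    while queue and len(out) < byte_count:
        data, psh = queue.popleft()
        room = byte_count - len(out)
        out += data[:room]
        if len(data) > room:
            # The buffer filled before this segment was used up: keep the
            # remainder (and its PSH marker) for the next RECEIVE.
            queue.appendleft((data[room:], psh))
            return bytes(out), False
        if psh:
            # PUSH seen before the buffer filled: return it partially filled.
            return bytes(out), True
    return bytes(out), False
```

With a queue holding `b"abc"`, then `b"de"` with PSH set, then `b"fgh"`, a RECEIVE for 10 octets returns the first five octets with PUSH indicated and leaves `b"fgh"` for later calls.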
2564 3.9.1.4. Close 2566 Format: CLOSE (local connection name) 2568 This command causes the connection specified to be closed. If the 2569 connection is not open or the calling process is not authorized to 2570 use this connection, an error is returned. Closing connections is 2571 intended to be a graceful operation in the sense that outstanding 2572 SENDs will be transmitted (and retransmitted), as flow control 2573 permits, until all have been serviced. Thus, it should be 2574 acceptable to make several SEND calls, followed by a CLOSE, and 2575 expect all the data to be sent to the destination. It should also 2576 be clear that users should continue to RECEIVE on CLOSING 2577 connections, since the remote peer may be trying to transmit the 2578 last of its data. Thus, CLOSE means "I have no more to send" but 2579 does not mean "I will not receive any more." It may happen (if 2580 the user level protocol is not well-thought-out) that the closing 2581 side is unable to get rid of all its data before timing out. In 2582 this event, CLOSE turns into ABORT, and the closing TCP peer gives 2583 up. 2585 The user may CLOSE the connection at any time on their own 2586 initiative, or in response to various prompts from the TCP 2587 implementation (e.g., remote close executed, transmission timeout 2588 exceeded, destination inaccessible). 2590 Because closing a connection requires communication with the 2591 remote TCP peer, connections may remain in the closing state for a 2592 short time. Attempts to reopen the connection before the TCP peer 2593 replies to the CLOSE command will result in error responses. 2595 Close also implies push function. 2597 3.9.1.5. Status 2599 Format: STATUS (local connection name) -> status data 2600 This is an implementation dependent user command and could be 2601 excluded without adverse effect. Information returned would 2602 typically come from the TCB associated with the connection. 
2604        This command returns a data block containing the following
2605        information:

2607           local socket,

2609           remote socket,

2611           local connection name,

2613           receive window,

2615           send window,

2617           connection state,

2619           number of buffers awaiting acknowledgment,

2621           number of buffers pending receipt,

2623           urgent state,

2625           DiffServ field value,

2627           security/compartment,

2629           and transmission timeout.

2631        Depending on the state of the connection, or on the implementation
2632        itself, some of this information may not be available or
2633        meaningful. If the calling process is not authorized to use this
2634        connection, an error is returned. This prevents unauthorized
2635        processes from gaining information about a connection.

2637  3.9.1.6. Abort

2639        Format: ABORT (local connection name)

2641        This command causes all pending SENDs and RECEIVES to be aborted,
2642        the TCB to be removed, and a special RESET message to be sent to
2643        the remote TCP peer of the connection. Depending on the
2644        implementation, users may receive abort indications for each
2645        outstanding SEND or RECEIVE, or may simply receive an
2646        ABORT-acknowledgment.

2648  3.9.1.7. Flush

2650        Some TCP implementations have included a FLUSH call, which will
2651        empty the TCP send queue of any data that the user has issued SEND
2652        calls for but is still to the right of the current send window.
2653        That is, it flushes as much queued send data as possible without
2654        losing sequence number synchronization. The FLUSH call MAY be
2655        implemented (MAY-14).

2657  3.9.1.8. Asynchronous Reports

2659        There MUST be a mechanism for reporting soft TCP error conditions
2660        to the application (MUST-47).
Generically, we assume this takes 2661 the form of an application-supplied ERROR_REPORT routine that may 2662 be upcalled asynchronously from the transport layer: 2664 - ERROR_REPORT(local connection name, reason, subreason) 2666 The precise encoding of the reason and subreason parameters is not 2667 specified here. However, the conditions that are reported 2668 asynchronously to the application MUST include: 2670 - * ICMP error message arrived (see Section 3.9.2.2 for 2671 description of handling each ICMP message type, since some 2672 message types need to be suppressed from generating reports to 2673 the application) 2675 - * Excessive retransmissions (see Section 3.8.3) 2677 - * Urgent pointer advance (see Section 3.8.5) 2679 However, an application program that does not want to receive such 2680 ERROR_REPORT calls SHOULD be able to effectively disable these 2681 calls (SHLD-20). 2683 3.9.1.9. Set Differentiated Services Field (IPv4 TOS or IPv6 Traffic 2684 Class) 2686 The application layer MUST be able to specify the Differentiated 2687 Services field for segments that are sent on a connection (MUST- 2688 48). The Differentiated Services field includes the 6-bit 2689 Differentiated Services Code Point (DSCP) value. It is not 2690 required, but the application SHOULD be able to change the 2691 Differentiated Services field during the connection lifetime 2692 (SHLD-21). TCP implementations SHOULD pass the current 2693 Differentiated Services field value without change to the IP 2694 layer, when it sends segments on the connection (SHLD-22). 2696 The Differentiated Services field will be specified independently 2697 in each direction on the connection, so that the receiver 2698 application will specify the Differentiated Services field used 2699 for ACK segments. 2701 TCP implementations MAY pass the most recently received 2702 Differentiated Services field up to the application (MAY-9). 2704 3.9.2. 
TCP/Lower-Level Interface 2706 The TCP endpoint calls on a lower level protocol module to actually 2707 send and receive information over a network. The two current 2708 standard Internet Protocol (IP) versions layered below TCP are IPv4 2709 [1] and IPv6 [13]. 2711 If the lower level protocol is IPv4 it provides arguments for a type 2712 of service (used within the Differentiated Services field) and for a 2713 time to live. TCP uses the following settings for these parameters: 2715 DiffServ field: The IP header value for the DiffServ field is 2716 given by the user. This includes the bits of the DiffServ Code 2717 Point (DSCP). 2719 Time to Live (TTL): The TTL value used to send TCP segments MUST 2720 be configurable (MUST-49). 2722 - Note that RFC 793 specified one minute (60 seconds) as a 2723 constant for the TTL, because the assumed maximum segment 2724 lifetime was two minutes. This was intended to explicitly ask 2725 that a segment be destroyed if it cannot be delivered by the 2726 internet system within one minute. RFC 1122 changed this 2727 specification to require that the TTL be configurable. 2729 - Note that the DiffServ field is permitted to change during a 2730 connection (Section 4.2.4.2 of RFC 1122). However, the 2731 application interface might not support this ability, and the 2732 application does not have knowledge about individual TCP 2733 segments, so this can only be done on a coarse granularity, at 2734 best. This limitation is further discussed in RFC 7657 (sec 2735 5.1, 5.3, and 6) [51]. Generally, an application SHOULD NOT 2736 change the DiffServ field value during the course of a 2737 connection (SHLD-23). 2739 Any lower level protocol will have to provide the source address, 2740 destination address, and protocol fields, and some way to determine 2741 the "TCP length", both to provide the functional equivalent service 2742 of IP and to be used in the TCP checksum. 
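The two IPv4 parameters described above (the DiffServ field and the TTL) can be illustrated with a short sketch of how an application might supply them through a Berkeley-style sockets API. This is an assumption made for illustration only: this document does not define a sockets binding, the helper names diffserv_field and configure_segment_parameters are hypothetical, and the IP_TOS and IP_TTL socket options are platform dependent. The sketch also assumes the common layout in which the 6-bit DSCP occupies the upper bits of the 8-bit field and the lower 2 bits carry ECN.

```python
import socket


def diffserv_field(dscp: int, ecn: int = 0) -> int:
    """Build the 8-bit field value from a 6-bit DSCP and 2 ECN bits.

    Assumed layout: DSCP in the upper six bits, ECN in the lower two.
    """
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 0x3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def configure_segment_parameters(sock: socket.socket, dscp: int, ttl: int) -> None:
    """Hypothetical helper: hand the DiffServ field and TTL to the IP layer.

    IP_TOS and IP_TTL are IPv4 socket options; availability varies by platform.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, diffserv_field(dscp))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
```

Keeping the field computation separate from the socket call makes the configurable values easy to check without opening a connection.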
2744 When received options are passed up to TCP from the IP layer, a TCP 2745 implementation MUST ignore options that it does not understand (MUST- 2746 50). 2748 A TCP implementation MAY support the Time Stamp (MAY-10) and Record 2749 Route (MAY-11) options. 2751 3.9.2.1. Source Routing 2753 If the lower level is IP (or other protocol that provides this 2754 feature) and source routing is used, the interface must allow the 2755 route information to be communicated. This is especially important 2756 so that the source and destination addresses used in the TCP checksum 2757 be the originating source and ultimate destination. It is also 2758 important to preserve the return route to answer connection requests. 2760 An application MUST be able to specify a source route when it 2761 actively opens a TCP connection (MUST-51), and this MUST take 2762 precedence over a source route received in a datagram (MUST-52). 2764 When a TCP connection is OPENed passively and a packet arrives with a 2765 completed IP Source Route option (containing a return route), TCP 2766 implementations MUST save the return route and use it for all 2767 segments sent on this connection (MUST-53). If a different source 2768 route arrives in a later segment, the later definition SHOULD 2769 override the earlier one (SHLD-24). 2771 3.9.2.2. ICMP Messages 2773 TCP implementations MUST act on an ICMP error message passed up from 2774 the IP layer, directing it to the connection that created the error 2775 (MUST-54). The necessary demultiplexing information can be found in 2776 the IP header contained within the ICMP message. 2778 This applies to ICMPv6 in addition to IPv4 ICMP. 2780 [36] contains discussion of specific ICMP and ICMPv6 messages 2781 classified as either "soft" or "hard" errors that may bear different 2782 responses. 
Treatment for classes of ICMP messages is described 2783 below: 2785 Source Quench 2786 TCP implementations MUST silently discard any received ICMP Source 2787 Quench messages (MUST-55). See [11] for discussion. 2789 Soft Errors 2790 For IPv4 ICMP these include: Destination Unreachable -- codes 0, 1, 2791 5; Time Exceeded -- codes 0, 1; and Parameter Problem. 2793 For ICMPv6 these include: Destination Unreachable -- codes 0, 3; 2794 Time Exceeded -- codes 0, 1; and Parameter Problem -- codes 0, 1, 2795 2. 2797 Since these Unreachable messages indicate soft error conditions, 2798 TCP implementations MUST NOT abort the connection (MUST-56), and it 2799 SHOULD make the information available to the application (SHLD-25). 2801 Hard Errors 2802 For ICMP these include Destination Unreachable -- codes 2-4. 2804 These are hard error conditions, so TCP implementations SHOULD 2805 abort the connection (SHLD-26). [36] notes that some 2806 implementations do not abort connections when an ICMP hard error is 2807 received for a connection that is in any of the synchronized 2808 states. 2810 Note that [36] section 4 describes widespread implementation behavior 2811 that treats soft errors as hard errors during connection 2812 establishment. 2814 3.9.2.3. Source Address Validation 2816 RFC 1122 requires addresses to be validated in incoming SYN packets: 2818 An incoming SYN with an invalid source address MUST be ignored 2819 either by TCP or by the IP layer (MUST-63) (Section 3.2.1.3 of 2820 [20]). 2822 A TCP implementation MUST silently discard an incoming SYN segment 2823 that is addressed to a broadcast or multicast address (MUST-57). 2825 This prevents connection state and replies from being erroneously 2826 generated, and implementers should note that this guidance is 2827 applicable to all incoming segments, not just SYNs, as specifically 2828 indicated in RFC 1122. 2830 3.10. 
Event Processing 2832 The processing depicted in this section is an example of one possible 2833 implementation. Other implementations may have slightly different 2834 processing sequences, but they should differ from those in this 2835 section only in detail, not in substance. 2837 The activity of the TCP endpoint can be characterized as responding 2838 to events. The events that occur can be cast into three categories: 2839 user calls, arriving segments, and timeouts. This section describes 2840 the processing the TCP endpoint does in response to each of the 2841 events. In many cases the processing required depends on the state 2842 of the connection. 2844 Events that occur: 2846 User Calls 2848 - OPEN 2850 SEND 2852 RECEIVE 2854 CLOSE 2856 ABORT 2858 STATUS 2860 Arriving Segments 2862 - SEGMENT ARRIVES 2864 Timeouts 2866 - USER TIMEOUT 2868 RETRANSMISSION TIMEOUT 2870 TIME-WAIT TIMEOUT 2872 The model of the TCP/user interface is that user commands receive an 2873 immediate return and possibly a delayed response via an event or 2874 pseudo interrupt. In the following descriptions, the term "signal" 2875 means cause a delayed response. 2877 Error responses in this document are identified by character strings. 2878 For example, user commands referencing connections that do not exist 2879 receive "error: connection not open". 2881 Please note in the following that all arithmetic on sequence numbers, 2882 acknowledgment numbers, windows, et cetera, is modulo 2**32 (the size 2883 of the sequence number space). Also note that "=<" means less than 2884 or equal to (modulo 2**32). 2886 A natural way to think about processing incoming segments is to 2887 imagine that they are first tested for proper sequence number (i.e., 2888 that their contents lie in the range of the expected "receive window" 2889 in the sequence number space) and then that they are generally queued 2890 and processed in sequence number order. 
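The modulo 2**32 comparison rule above can be illustrated with a short sketch. It is only one possible formulation, assuming sequence numbers are held as non-negative integers below 2**32 and that the two values compared are always less than 2**31 apart. The helper names seq_add, seq_lt, and seq_leq are hypothetical and do not appear in this document.

```python
MOD = 1 << 32   # size of the sequence number space
HALF = 1 << 31  # assumed maximum distance between two comparable values


def seq_add(a: int, b: int) -> int:
    """Add an offset to a sequence number, wrapping modulo 2**32."""
    return (a + b) % MOD


def seq_lt(a: int, b: int) -> bool:
    """True if a precedes b in sequence space (a "<" b, modulo 2**32)."""
    return a != b and (b - a) % MOD < HALF


def seq_leq(a: int, b: int) -> bool:
    """True if a "=<" b in sequence space (modulo 2**32)."""
    return a == b or seq_lt(a, b)
```

For example, seq_lt(0xFFFFFFF0, 0x10) holds because 0x10 lies 32 positions after 0xFFFFFFF0 once the space wraps, even though it is numerically smaller.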
2892 When a segment overlaps other already received segments we 2893 reconstruct the segment to contain just the new data, and adjust the 2894 header fields to be consistent. 2896 Note that if no state change is mentioned the TCP connection stays in 2897 the same state. 2899 3.10.1. OPEN Call 2901 CLOSED STATE (i.e., TCB does not exist) 2903 - Create a new transmission control block (TCB) to hold 2904 connection state information. Fill in local socket identifier, 2905 remote socket, DiffServ field, security/compartment, and user 2906 timeout information. Note that some parts of the remote socket 2907 may be unspecified in a passive OPEN and are to be filled in by 2908 the parameters of the incoming SYN segment. Verify the 2909 security and DiffServ value requested are allowed for this 2910 user; if not, return "error: DiffServ value not allowed" or 2911 "error: security/compartment not allowed." If passive enter 2912 the LISTEN state and return. If active and the remote socket 2913 is unspecified, return "error: remote socket unspecified"; if 2914 active and the remote socket is specified, issue a SYN segment. 2915 An initial send sequence number (ISS) is selected. A SYN 2916 segment of the form <SEQ=ISS><CTL=SYN> is sent. Set SND.UNA to 2917 ISS, SND.NXT to ISS+1, enter SYN-SENT state, and return. 2919 - If the caller does not have access to the local socket 2920 specified, return "error: connection illegal for this process". 2921 If there is no room to create a new connection, return "error: 2922 insufficient resources". 2924 LISTEN STATE 2925 - If the OPEN call is active and the remote socket is specified, 2926 then change the connection from passive to active, select an 2927 ISS. Send a SYN segment, set SND.UNA to ISS, SND.NXT to ISS+1. 2928 Enter SYN-SENT state. Data associated with SEND may be sent 2929 with SYN segment or queued for transmission after entering 2930 ESTABLISHED state.
The urgent bit if requested in the command 2931 must be sent with the data segments sent as a result of this 2932 command. If there is no room to queue the request, respond 2933 with "error: insufficient resources". If the remote socket was 2934 not specified, then return "error: remote socket unspecified". 2936 SYN-SENT STATE 2938 SYN-RECEIVED STATE 2940 ESTABLISHED STATE 2942 FIN-WAIT-1 STATE 2944 FIN-WAIT-2 STATE 2946 CLOSE-WAIT STATE 2948 CLOSING STATE 2950 LAST-ACK STATE 2952 TIME-WAIT STATE 2954 - Return "error: connection already exists". 2956 3.10.2. SEND Call 2958 CLOSED STATE (i.e., TCB does not exist) 2960 - If the user does not have access to such a connection, then 2961 return "error: connection illegal for this process". 2963 - Otherwise, return "error: connection does not exist". 2965 LISTEN STATE 2967 - If the remote socket is specified, then change the connection 2968 from passive to active, select an ISS. Send a SYN segment, set 2969 SND.UNA to ISS, SND.NXT to ISS+1. Enter SYN-SENT state. Data 2970 associated with SEND may be sent with SYN segment or queued for 2971 transmission after entering ESTABLISHED state. The urgent bit 2972 if requested in the command must be sent with the data segments 2973 sent as a result of this command. If there is no room to queue 2974 the request, respond with "error: insufficient resources". If 2975 the remote socket was not specified, then return "error: remote 2976 socket unspecified". 2978 SYN-SENT STATE 2980 SYN-RECEIVED STATE 2982 - Queue the data for transmission after entering ESTABLISHED 2983 state. If no space to queue, respond with "error: insufficient 2984 resources". 2986 ESTABLISHED STATE 2988 CLOSE-WAIT STATE 2990 - Segmentize the buffer and send it with a piggybacked 2991 acknowledgment (acknowledgment value = RCV.NXT). If there is 2992 insufficient space to remember this buffer, simply return 2993 "error: insufficient resources". 
2995 - If the urgent flag is set, then SND.UP <- SND.NXT and set the 2996 urgent pointer in the outgoing segments. 2998 FIN-WAIT-1 STATE 3000 FIN-WAIT-2 STATE 3002 CLOSING STATE 3004 LAST-ACK STATE 3006 TIME-WAIT STATE 3008 - Return "error: connection closing" and do not service request. 3010 3.10.3. RECEIVE Call 3012 CLOSED STATE (i.e., TCB does not exist) 3014 - If the user does not have access to such a connection, return 3015 "error: connection illegal for this process". 3017 - Otherwise return "error: connection does not exist". 3019 LISTEN STATE 3020 SYN-SENT STATE 3022 SYN-RECEIVED STATE 3024 - Queue for processing after entering ESTABLISHED state. If 3025 there is no room to queue this request, respond with "error: 3026 insufficient resources". 3028 ESTABLISHED STATE 3030 FIN-WAIT-1 STATE 3032 FIN-WAIT-2 STATE 3034 - If insufficient incoming segments are queued to satisfy the 3035 request, queue the request. If there is no queue space to 3036 remember the RECEIVE, respond with "error: insufficient 3037 resources". 3039 - Reassemble queued incoming segments into receive buffer and 3040 return to user. Mark "push seen" (PUSH) if this is the case. 3042 - If RCV.UP is in advance of the data currently being passed to 3043 the user notify the user of the presence of urgent data. 3045 - When the TCP endpoint takes responsibility for delivering data 3046 to the user that fact must be communicated to the sender via an 3047 acknowledgment. The formation of such an acknowledgment is 3048 described below in the discussion of processing an incoming 3049 segment. 3051 CLOSE-WAIT STATE 3053 - Since the remote side has already sent FIN, RECEIVEs must be 3054 satisfied by data already on hand, but not yet delivered to the 3055 user. If no text is awaiting delivery, the RECEIVE will get an 3056 "error: connection closing" response. Otherwise, any remaining 3057 data can be used to satisfy the RECEIVE. 
3059 CLOSING STATE 3061 LAST-ACK STATE 3063 TIME-WAIT STATE 3065 - Return "error: connection closing". 3067 3.10.4. CLOSE Call 3069 CLOSED STATE (i.e., TCB does not exist) 3071 - If the user does not have access to such a connection, return 3072 "error: connection illegal for this process". 3074 - Otherwise, return "error: connection does not exist". 3076 LISTEN STATE 3078 - Any outstanding RECEIVEs are returned with "error: closing" 3079 responses. Delete TCB, enter CLOSED state, and return. 3081 SYN-SENT STATE 3083 - Delete the TCB and return "error: closing" responses to any 3084 queued SENDs, or RECEIVEs. 3086 SYN-RECEIVED STATE 3088 - If no SENDs have been issued and there is no pending data to 3089 send, then form a FIN segment and send it, and enter FIN-WAIT-1 3090 state; otherwise queue for processing after entering 3091 ESTABLISHED state. 3093 ESTABLISHED STATE 3095 - Queue this until all preceding SENDs have been segmentized, 3096 then form a FIN segment and send it. In any case, enter FIN- 3097 WAIT-1 state. 3099 FIN-WAIT-1 STATE 3101 FIN-WAIT-2 STATE 3103 - Strictly speaking, this is an error and should receive an 3104 "error: connection closing" response. An "ok" response would 3105 be acceptable, too, as long as a second FIN is not emitted (the 3106 first FIN may be retransmitted though). 3108 CLOSE-WAIT STATE 3110 - Queue this request until all preceding SENDs have been 3111 segmentized; then send a FIN segment, enter LAST-ACK state. 3113 CLOSING STATE 3114 LAST-ACK STATE 3116 TIME-WAIT STATE 3118 - Respond with "error: connection closing". 3120 3.10.5. ABORT Call 3122 CLOSED STATE (i.e., TCB does not exist) 3124 - If the user should not have access to such a connection, return 3125 "error: connection illegal for this process". 3127 - Otherwise return "error: connection does not exist". 3129 LISTEN STATE 3131 - Any outstanding RECEIVEs should be returned with "error: 3132 connection reset" responses. Delete TCB, enter CLOSED state, 3133 and return. 
3135 SYN-SENT STATE 3137 - All queued SENDs and RECEIVEs should be given "connection 3138 reset" notification, delete the TCB, enter CLOSED state, and 3139 return. 3141 SYN-RECEIVED STATE 3143 ESTABLISHED STATE 3145 FIN-WAIT-1 STATE 3147 FIN-WAIT-2 STATE 3149 CLOSE-WAIT STATE 3151 - Send a reset segment: 3153 o <SEQ=SND.NXT><CTL=RST> 3155 - All queued SENDs and RECEIVEs should be given "connection 3156 reset" notification; all segments queued for transmission 3157 (except for the RST formed above) or retransmission should be 3158 flushed, delete the TCB, enter CLOSED state, and return. 3160 CLOSING STATE LAST-ACK STATE TIME-WAIT STATE 3161 - Respond with "ok" and delete the TCB, enter CLOSED state, and 3162 return. 3164 3.10.6. STATUS Call 3166 CLOSED STATE (i.e., TCB does not exist) 3168 - If the user should not have access to such a connection, return 3169 "error: connection illegal for this process". 3171 - Otherwise return "error: connection does not exist". 3173 LISTEN STATE 3175 - Return "state = LISTEN", and the TCB pointer. 3177 SYN-SENT STATE 3179 - Return "state = SYN-SENT", and the TCB pointer. 3181 SYN-RECEIVED STATE 3183 - Return "state = SYN-RECEIVED", and the TCB pointer. 3185 ESTABLISHED STATE 3187 - Return "state = ESTABLISHED", and the TCB pointer. 3189 FIN-WAIT-1 STATE 3191 - Return "state = FIN-WAIT-1", and the TCB pointer. 3193 FIN-WAIT-2 STATE 3195 - Return "state = FIN-WAIT-2", and the TCB pointer. 3197 CLOSE-WAIT STATE 3199 - Return "state = CLOSE-WAIT", and the TCB pointer. 3201 CLOSING STATE 3203 - Return "state = CLOSING", and the TCB pointer. 3205 LAST-ACK STATE 3207 - Return "state = LAST-ACK", and the TCB pointer. 3209 TIME-WAIT STATE 3211 - Return "state = TIME-WAIT", and the TCB pointer. 3213 3.10.7. SEGMENT ARRIVES 3215 3.10.7.1. CLOSED State 3217 If the state is CLOSED (i.e., TCB does not exist) then 3219 all data in the incoming segment is discarded. An incoming 3220 segment containing a RST is discarded.
An incoming segment not 3221 containing a RST causes a RST to be sent in response. The 3222 acknowledgment and sequence field values are selected to make the 3223 reset sequence acceptable to the TCP endpoint that sent the 3224 offending segment. 3226 If the ACK bit is off, sequence number zero is used, 3228 - <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> 3230 If the ACK bit is on, 3232 - <SEQ=SEG.ACK><CTL=RST> 3234 Return. 3236 3.10.7.2. LISTEN State 3238 If the state is LISTEN then 3240 first check for an RST 3242 - An incoming RST segment could not be valid, since it could not 3243 have been sent in response to anything sent by this incarnation 3244 of the connection. An incoming RST should be ignored. Return. 3246 second check for an ACK 3248 - Any acknowledgment is bad if it arrives on a connection still 3249 in the LISTEN state. An acceptable reset segment should be 3250 formed for any arriving ACK-bearing segment. The RST should be 3251 formatted as follows: 3253 o <SEQ=SEG.ACK><CTL=RST> 3255 - Return. 3257 third check for a SYN 3259 - If the SYN bit is set, check the security. If the security/ 3260 compartment on the incoming segment does not exactly match the 3261 security/compartment in the TCB then send a reset and return. 3263 o <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> 3265 - Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other 3266 control or text should be queued for processing later. ISS 3267 should be selected and a SYN segment sent of the form: 3269 o <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 3271 - SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection 3272 state should be changed to SYN-RECEIVED. Note that any other 3273 incoming control or data (combined with SYN) will be processed 3274 in the SYN-RECEIVED state, but processing of SYN and ACK should 3275 not be repeated. If the listen was not fully specified (i.e., 3276 the remote socket was not fully specified), then the 3277 unspecified fields should be filled in now. 3279 fourth other data or control 3281 - This should not be reached. Drop the segment and return.
Any 3282 other control or data-bearing segment (not containing SYN) must 3283 have an ACK and thus would have been discarded by the ACK 3284 processing in the second step, unless it was first discarded by 3285 RST checking in the first step. 3287 3.10.7.3. SYN-SENT State 3289 If the state is SYN-SENT then 3291 first check the ACK bit 3293 - If the ACK bit is set 3295 o If SEG.ACK =< ISS, or SEG.ACK > SND.NXT, send a reset 3296 (unless the RST bit is set, if so drop the segment and 3297 return) 3299 + <SEQ=SEG.ACK><CTL=RST> 3301 o and discard the segment. Return. 3303 o If SND.UNA < SEG.ACK =< SND.NXT then the ACK is acceptable. 3304 Some deployed TCP code has used the check SEG.ACK == SND.NXT 3305 (using "==" rather than "=<"), but this is not appropriate 3306 when the stack is capable of sending data on the SYN, 3307 because the TCP peer may not accept and acknowledge all of 3308 the data on the SYN. 3310 second check the RST bit 3312 - If the RST bit is set 3314 o A potential blind reset attack is described in RFC 5961 [9]. 3315 The mitigation described in that document has specific 3316 applicability explained therein, and is not a substitute for 3317 cryptographic protection (e.g., IPsec or TCP-AO). A TCP 3318 implementation that supports the RFC 5961 mitigation SHOULD 3319 first check that the sequence number exactly matches RCV.NXT 3320 prior to executing the action in the next paragraph. 3322 o If the ACK was acceptable then signal the user "error: 3323 connection reset", drop the segment, enter CLOSED state, 3324 delete TCB, and return. Otherwise (no ACK), drop the 3325 segment and return. 3327 third check the security 3329 - If the security/compartment in the segment does not exactly 3330 match the security/compartment in the TCB, send a reset 3332 o If there is an ACK 3334 + <SEQ=SEG.ACK><CTL=RST> 3336 o Otherwise 3338 + <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> 3340 - If a reset was sent, discard the segment and return.
3342 fourth check the SYN bit 3344 - This step should be reached only if the ACK is ok, or there is 3345 no ACK, and the segment did not contain a RST. 3347 - If the SYN bit is on and the security/compartment is acceptable 3348 then, RCV.NXT is set to SEG.SEQ+1, IRS is set to SEG.SEQ. 3349 SND.UNA should be advanced to equal SEG.ACK (if there is an 3350 ACK), and any segments on the retransmission queue that are 3351 thereby acknowledged should be removed. 3353 - If SND.UNA > ISS (our SYN has been ACKed), change the 3354 connection state to ESTABLISHED, form an ACK segment 3356 o <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 3358 - and send it. Data or controls that were queued for 3359 transmission MAY be included. Some TCP implementations 3360 suppress sending this segment when the received segment 3361 contains data that will anyway generate an acknowledgement in 3362 the later processing steps, which avoids sending this extra acknowledgement 3363 of the SYN. If there are other controls or 3364 text in the segment then continue processing at the sixth step 3365 under Section 3.10.7.4 where the URG bit is checked, otherwise 3366 return. 3368 - Otherwise enter SYN-RECEIVED, form a SYN,ACK segment 3370 o <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 3372 - and send it. Set the variables: 3374 o SND.WND <- SEG.WND 3376 SND.WL1 <- SEG.SEQ 3378 SND.WL2 <- SEG.ACK 3380 If there are other controls or text in the segment, queue them 3381 for processing after the ESTABLISHED state has been reached, 3382 return. 3384 - Note that it is legal to send and receive application data on 3385 SYN segments (this is the "text in the segment" mentioned 3386 above). There has been significant misinformation and 3387 misunderstanding of this topic historically. Some firewalls 3388 and security devices consider this suspicious. However, the 3389 capability was used in T/TCP [22] and is used in TCP Fast Open 3390 (TFO) [49], so it is important for implementations and network 3391 devices to permit.
3393 fifth, if neither of the SYN or RST bits is set then drop the 3394 segment and return. 3396 3.10.7.4. Other States 3398 Otherwise, 3400 first check sequence number 3402 - SYN-RECEIVED STATE 3404 ESTABLISHED STATE 3406 FIN-WAIT-1 STATE 3408 FIN-WAIT-2 STATE 3410 CLOSE-WAIT STATE 3412 CLOSING STATE 3414 LAST-ACK STATE 3416 TIME-WAIT STATE 3418 o Segments are processed in sequence. Initial tests on 3419 arrival are used to discard old duplicates, but further 3420 processing is done in SEG.SEQ order. If a segment's 3421 contents straddle the boundary between old and new, only the 3422 new parts are processed. 3424 o In general, the processing of received segments MUST be 3425 implemented to aggregate ACK segments whenever possible 3426 (MUST-58). For example, if the TCP endpoint is processing a 3427 series of queued segments, it MUST process them all before 3428 sending any ACK segments (MUST-59). 3430 o There are four cases for the acceptability test for an 3431 incoming segment:
3433 Segment Receive Test
3434 Length  Window
3435 ------- ------- -------------------------------------------
3437 0       0       SEG.SEQ = RCV.NXT
3439 0       >0      RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
3441 >0      0       not acceptable
3443 >0      >0      RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
3444                 or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
3446 o In implementing sequence number validation as described 3447 here, please note Appendix A.2. 3449 o If the RCV.WND is zero, no segments will be acceptable, but 3450 special allowance should be made to accept valid ACKs, URGs 3451 and RSTs. 3453 o If an incoming segment is not acceptable, an acknowledgment 3454 should be sent in reply (unless the RST bit is set, if so 3455 drop the segment and return): 3457 + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 3459 o After sending the acknowledgment, drop the unacceptable 3460 segment and return.
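The four-case acceptability test above can be sketched as a single predicate. This is a minimal illustration, assuming sequence numbers are integers below 2**32 and that RCV.WND is far smaller than the sequence space. The function names in_window and segment_acceptable are hypothetical, and the special allowance for ACKs, URGs, and RSTs with a zero window is deliberately left out.

```python
MOD = 1 << 32  # size of the sequence number space


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """RCV.NXT =< seq < RCV.NXT+RCV.WND, evaluated modulo 2**32."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


def segment_acceptable(seg_seq: int, seg_len: int,
                       rcv_nxt: int, rcv_wnd: int) -> bool:
    """Apply the four cases of the acceptability table."""
    if seg_len == 0:
        if rcv_wnd == 0:
            # Case 1: zero-length segment, zero window.
            return seg_seq == rcv_nxt
        # Case 2: zero-length segment, open window.
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        # Case 3: data cannot be accepted while the window is closed.
        return False
    # Case 4: either the first or the last octet falls inside the window.
    last = (seg_seq + seg_len - 1) % MOD
    return (in_window(seg_seq, rcv_nxt, rcv_wnd)
            or in_window(last, rcv_nxt, rcv_wnd))
```

Expressing the window test as a modular distance from RCV.NXT keeps the predicate correct when the window straddles the wrap from 2**32-1 to 0.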
3462 o Note that for the TIME-WAIT state, there is an improved 3463 algorithm described in [41] for handling incoming SYN 3464 segments, that utilizes timestamps rather than relying on 3465 the sequence number check described here. When the improved 3466 algorithm is implemented, the logic above is not applicable 3467 for incoming SYN segments with timestamp options, received 3468 on a connection in the TIME-WAIT state. 3470 o In the following it is assumed that the segment is the 3471 idealized segment that begins at RCV.NXT and does not exceed 3472 the window. One could tailor actual segments to fit this 3473 assumption by trimming off any portions that lie outside the 3474 window (including SYN and FIN), and only processing further 3475 if the segment then begins at RCV.NXT. Segments with higher 3476 beginning sequence numbers SHOULD be held for later 3477 processing (SHLD-31). 3479 - second check the RST bit, 3480 o RFC 5961 [9] section 3 describes a potential blind reset 3481 attack and optional mitigation approach. This does not 3482 provide a cryptographic protection (e.g. as in IPsec or TCP- 3483 AO), but can be applicable in situations described in RFC 3484 5961. For stacks implementing the RFC 5961 protection, the 3485 three checks below apply, otherwise processing for these 3486 states is indicated further below. 3488 + 1) If the RST bit is set and the sequence number is 3489 outside the current receive window, silently drop the 3490 segment. 3492 + 2) If the RST bit is set and the sequence number exactly 3493 matches the next expected sequence number (RCV.NXT), then 3494 TCP endpoints MUST reset the connection in the manner 3495 prescribed below according to the connection state. 
3497 + 3) If the RST bit is set and the sequence number does not 3498 exactly match the next expected sequence value, yet is 3499 within the current receive window, TCP endpoints MUST 3500 send an acknowledgement (challenge ACK): 3502 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 3504 After sending the challenge ACK, TCP endpoints MUST drop 3505 the unacceptable segment and stop processing the incoming 3506 packet further. Note that RFC 5961 and Errata ID 4772 3507 contain additional considerations for ACK throttling in 3508 an implementation. 3510 o SYN-RECEIVED STATE 3512 + If the RST bit is set 3514 * If this connection was initiated with a passive OPEN 3515 (i.e., came from the LISTEN state), then return this 3516 connection to LISTEN state and return. The user need 3517 not be informed. If this connection was initiated 3518 with an active OPEN (i.e., came from SYN-SENT state) 3519 then the connection was refused; signal the user 3520 "connection refused". In either case, the 3521 retransmission queue should be flushed. And in the 3522 active OPEN case, enter the CLOSED state and delete 3523 the TCB, and return. 3525 o ESTABLISHED 3527 FIN-WAIT-1 3528 FIN-WAIT-2 3530 CLOSE-WAIT 3532 + If the RST bit is set, then any outstanding RECEIVEs and 3533 SENDs should receive "reset" responses. All segment 3534 queues should be flushed. Users should also receive an 3535 unsolicited general "connection reset" signal. Enter the 3536 CLOSED state, delete the TCB, and return. 3538 o CLOSING STATE 3540 LAST-ACK STATE 3542 TIME-WAIT 3544 + If the RST bit is set, then enter the CLOSED state, 3545 delete the TCB, and return. 3547 - third check security 3549 o SYN-RECEIVED 3551 + If the security/compartment in the segment does not 3552 exactly match the security/compartment in the TCB then 3553 send a reset, and return.
3555 o ESTABLISHED 3557 FIN-WAIT-1 3559 FIN-WAIT-2 3561 CLOSE-WAIT 3563 CLOSING 3565 LAST-ACK 3567 TIME-WAIT 3569 + If the security/compartment in the segment does not 3570 exactly match the security/compartment in the TCB then 3571 send a reset; any outstanding RECEIVEs and SENDs should 3572 receive "reset" responses. All segment queues should be 3573 flushed. Users should also receive an unsolicited 3574 general "connection reset" signal. Enter the CLOSED 3575 state, delete the TCB, and return. 3577 o Note this check is placed following the sequence check to 3578 prevent a segment from an old connection between these port 3579 numbers with a different security from causing an abort of 3580 the current connection. 3582 - fourth, check the SYN bit, 3584 o SYN-RECEIVED 3586 + If the connection was initiated with a passive OPEN, then 3587 return this connection to the LISTEN state and return. 3588 Otherwise, handle per the directions for synchronized 3589 states below. 3591 ESTABLISHED STATE 3593 FIN-WAIT-1 STATE 3595 FIN-WAIT-2 STATE 3597 CLOSE-WAIT STATE 3599 CLOSING STATE 3601 LAST-ACK STATE 3603 TIME-WAIT STATE 3605 + If the SYN bit is set in these synchronized states, it 3606 may be a legitimate new connection attempt (e.g., 3607 in the case of TIME-WAIT), an error where the connection 3608 should be reset, or the result of an attack attempt, as 3609 described in RFC 5961 [9]. For the TIME-WAIT state, new 3610 connections can be accepted if the timestamp option is 3611 used and meets expectations (per [41]). For all other 3612 cases, RFC 5961 provides a mitigation with applicability 3613 to some situations, though there are also alternatives 3614 that offer cryptographic protection (see Section 7).
RFC 3615 5961 recommends that in these synchronized states, if the 3616 SYN bit is set, irrespective of the sequence number, TCP 3617 endpoints MUST send a "challenge ACK" to the remote peer: 3619 + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 3621 + After sending the acknowledgement, TCP implementations 3622 MUST drop the unacceptable segment and stop processing 3623 further. Note that RFC 5961 and Errata ID 4772 contain 3624 additional ACK throttling notes for an implementation. 3626 + For implementations that do not follow RFC 5961, the 3627 original RFC 793 behavior follows in this paragraph. If 3628 the SYN is in the window it is an error: send a reset, 3629 any outstanding RECEIVEs and SENDs should receive "reset" 3630 responses, all segment queues should be flushed, the user 3631 should also receive an unsolicited general "connection 3632 reset" signal, enter the CLOSED state, delete the TCB, 3633 and return. 3635 + If the SYN is not in the window this step would not be 3636 reached and an ACK would have been sent in the first step 3637 (sequence number check). 3639 - fifth, check the ACK field, 3641 o if the ACK bit is off, drop the segment and return 3643 o if the ACK bit is on 3645 + RFC 5961 [9] section 5 describes a potential blind data 3646 injection attack, and mitigation that implementations MAY 3647 choose to include (MAY-12). TCP stacks that implement 3648 RFC 5961 MUST add an input check that the ACK value is 3649 acceptable only if it is in the range of ((SND.UNA - 3650 MAX.SND.WND) =< SEG.ACK =< SND.NXT). All incoming 3651 segments whose ACK value doesn't satisfy the above 3652 condition MUST be discarded and an ACK sent back. The 3653 new state variable MAX.SND.WND is defined as the largest 3654 window that the local sender has ever received from its 3655 peer (subject to window scaling) or may be hard-coded to 3656 a maximum permissible window value.
When the ACK value 3657 is acceptable, the processing per-state below applies: 3659 + SYN-RECEIVED STATE 3661 * If SND.UNA < SEG.ACK =< SND.NXT then enter ESTABLISHED 3662 state and continue processing with variables below set 3663 to: 3665 - SND.WND <- SEG.WND 3667 SND.WL1 <- SEG.SEQ 3669 SND.WL2 <- SEG.ACK 3671 * If the segment acknowledgment is not acceptable, form 3672 a reset segment, 3673 - <SEQ=SEG.ACK><CTL=RST> 3675 * and send it. 3677 + ESTABLISHED STATE 3679 * If SND.UNA < SEG.ACK =< SND.NXT, then set SND.UNA <- 3680 SEG.ACK. Any segments on the retransmission queue 3681 that are thereby entirely acknowledged are removed. 3682 Users should receive positive acknowledgments for 3683 buffers that have been SENT and fully acknowledged 3684 (i.e., SEND buffer should be returned with "ok" 3685 response). If the ACK is a duplicate (SEG.ACK =< 3686 SND.UNA), it can be ignored. If the ACK acks 3687 something not yet sent (SEG.ACK > SND.NXT) then send 3688 an ACK, drop the segment, and return. 3690 * If SND.UNA =< SEG.ACK =< SND.NXT, the send window 3691 should be updated. If (SND.WL1 < SEG.SEQ or (SND.WL1 3692 = SEG.SEQ and SND.WL2 =< SEG.ACK)), set SND.WND <- 3693 SEG.WND, set SND.WL1 <- SEG.SEQ, and set SND.WL2 <- 3694 SEG.ACK. 3696 * Note that SND.WND is an offset from SND.UNA, that 3697 SND.WL1 records the sequence number of the last 3698 segment used to update SND.WND, and that SND.WL2 3699 records the acknowledgment number of the last segment 3700 used to update SND.WND. The check here prevents using 3701 old segments to update the window. 3703 + FIN-WAIT-1 STATE 3705 * In addition to the processing for the ESTABLISHED 3706 state, if the FIN segment is now acknowledged then 3707 enter FIN-WAIT-2 and continue processing in that 3708 state. 3710 + FIN-WAIT-2 STATE 3712 * In addition to the processing for the ESTABLISHED 3713 state, if the retransmission queue is empty, the 3714 user's CLOSE can be acknowledged ("ok") but do not 3715 delete the TCB.
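The ESTABLISHED-state ACK handling above (the RFC 5961 range check, advancing SND.UNA, and the SND.WL1/SND.WL2 window-update guard) can be sketched as follows. This is an illustrative sketch only, not part of the specification: the names SendState, process_ack_established, seq_lt, seq_leq, and the returned strings are hypothetical, and the comparisons are assumed to be carried out modulo 2**32 because sequence numbers wrap.

```python
from dataclasses import dataclass

MOD = 1 << 32  # sequence numbers wrap modulo 2**32


def seq_lt(a: int, b: int) -> bool:
    """True if a comes strictly before b in modulo-2**32 sequence space."""
    return a != b and ((b - a) % MOD) < (1 << 31)


def seq_leq(a: int, b: int) -> bool:
    return a == b or seq_lt(a, b)


@dataclass
class SendState:  # hypothetical subset of a TCB
    snd_una: int
    snd_nxt: int
    snd_wnd: int
    snd_wl1: int
    snd_wl2: int
    max_snd_wnd: int  # MAX.SND.WND: largest window ever received from the peer


def process_ack_established(tcb: SendState, seg_seq: int, seg_ack: int,
                            seg_wnd: int) -> str:
    """Apply the ESTABLISHED-state ACK rules to tcb; return a label for the outcome."""
    # RFC 5961 input check: (SND.UNA - MAX.SND.WND) =< SEG.ACK =< SND.NXT
    lower = (tcb.snd_una - tcb.max_snd_wnd) % MOD
    if not (seq_leq(lower, seg_ack) and seq_leq(seg_ack, tcb.snd_nxt)):
        return "discard-and-ack"

    if seq_lt(tcb.snd_una, seg_ack):
        # SND.UNA < SEG.ACK =< SND.NXT: new data is acknowledged.
        tcb.snd_una = seg_ack
    elif seq_lt(seg_ack, tcb.snd_una):
        # Old (duplicate) ACK below SND.UNA: nothing to do.
        return "ignore-duplicate"

    # Here SND.UNA =< SEG.ACK =< SND.NXT, so the window may be updated,
    # but only from a segment that is not older than the last one used.
    if seq_lt(tcb.snd_wl1, seg_seq) or (
        tcb.snd_wl1 == seg_seq and seq_leq(tcb.snd_wl2, seg_ack)
    ):
        tcb.snd_wnd = seg_wnd
        tcb.snd_wl1 = seg_seq
        tcb.snd_wl2 = seg_ack
        tcb.max_snd_wnd = max(tcb.max_snd_wnd, seg_wnd)
    return "accepted"
```

A caller would send an ACK and drop the segment on "discard-and-ack", and would trim the retransmission queue after "accepted"; those steps are left out of the sketch.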
3717 + CLOSE-WAIT STATE 3719 * Do the same processing as for the ESTABLISHED state. 3721 + CLOSING STATE 3723 * In addition to the processing for the ESTABLISHED 3724 state, if the ACK acknowledges our FIN then enter the 3725 TIME-WAIT state, otherwise ignore the segment. 3727 + LAST-ACK STATE 3729 * The only thing that can arrive in this state is an 3730 acknowledgment of our FIN. If our FIN is now 3731 acknowledged, delete the TCB, enter the CLOSED state, 3732 and return. 3734 + TIME-WAIT STATE 3736 * The only thing that can arrive in this state is a 3737 retransmission of the remote FIN. Acknowledge it, and 3738 restart the 2 MSL timeout. 3740 - sixth, check the URG bit, 3742 o ESTABLISHED STATE 3744 FIN-WAIT-1 STATE 3746 FIN-WAIT-2 STATE 3748 + If the URG bit is set, RCV.UP <- max(RCV.UP,SEG.UP), and 3749 signal the user that the remote side has urgent data if 3750 the urgent pointer (RCV.UP) is in advance of the data 3751 consumed. If the user has already been signaled (or is 3752 still in the "urgent mode") for this continuous sequence 3753 of urgent data, do not signal the user again. 3755 o CLOSE-WAIT STATE 3757 CLOSING STATE 3759 LAST-ACK STATE 3761 TIME-WAIT 3763 + This should not occur, since a FIN has been received from 3764 the remote side. Ignore the URG. 3766 - seventh, process the segment text, 3768 o ESTABLISHED STATE 3769 FIN-WAIT-1 STATE 3771 FIN-WAIT-2 STATE 3773 + Once in the ESTABLISHED state, it is possible to deliver 3774 segment data to user RECEIVE buffers. Data from segments 3775 can be moved into buffers until either the buffer is full 3776 or the segment is empty. If the segment empties and 3777 carries a PUSH flag, then the user is informed, when the 3778 buffer is returned, that a PUSH has been received. 3780 + When the TCP endpoint takes responsibility for delivering 3781 the data to the user it must also acknowledge the receipt 3782 of the data. 
3784 + Once the TCP endpoint takes responsibility for the data 3785 it advances RCV.NXT over the data accepted, and adjusts 3786 RCV.WND as appropriate to the current buffer 3787 availability. The total of RCV.NXT and RCV.WND should 3788 not be reduced. 3790 + A TCP implementation MAY send an ACK segment 3791 acknowledging RCV.NXT when a valid segment arrives that 3792 is in the window but not at the left window edge 3793 (MAY-13). 3795 + Please note the window management suggestions in 3796 Section 3.8. 3798 + Send an acknowledgment of the form: 3800 * <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 3802 + This acknowledgment should be piggybacked on a segment 3803 being transmitted if possible without incurring undue 3804 delay. 3806 o CLOSE-WAIT STATE 3808 CLOSING STATE 3810 LAST-ACK STATE 3812 TIME-WAIT STATE 3814 + This should not occur, since a FIN has been received from 3815 the remote side. Ignore the segment text. 3817 - eighth, check the FIN bit, 3819 o Do not process the FIN if the state is CLOSED, LISTEN or 3820 SYN-SENT since the SEG.SEQ cannot be validated; drop the 3821 segment and return. 3823 o If the FIN bit is set, signal the user "connection closing" 3824 and return any pending RECEIVEs with same message, advance 3825 RCV.NXT over the FIN, and send an acknowledgment for the 3826 FIN. Note that FIN implies PUSH for any segment text not 3827 yet delivered to the user. 3829 + SYN-RECEIVED STATE 3831 ESTABLISHED STATE 3833 * Enter the CLOSE-WAIT state. 3835 + FIN-WAIT-1 STATE 3837 * If our FIN has been ACKed (perhaps in this segment), 3838 then enter TIME-WAIT, start the time-wait timer, turn 3839 off the other timers; otherwise enter the CLOSING 3840 state. 3842 + FIN-WAIT-2 STATE 3844 * Enter the TIME-WAIT state. Start the time-wait timer, 3845 turn off the other timers. 3847 + CLOSE-WAIT STATE 3849 * Remain in the CLOSE-WAIT state. 3851 + CLOSING STATE 3853 * Remain in the CLOSING state. 3855 + LAST-ACK STATE 3857 * Remain in the LAST-ACK state.
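The per-state handling of an arriving FIN in this step (including the TIME-WAIT case) can be summarized as a small transition function. This is a sketch for illustration, not normative text: the function name next_state_on_fin, the our_fin_acked parameter, and the action strings are hypothetical.

```python
# States in which SEG.SEQ cannot be validated, so the FIN is not processed.
FIN_NOT_PROCESSED = {"CLOSED", "LISTEN", "SYN-SENT"}

# Actions common to every state that does process the FIN.
COMMON_ACTIONS = [
    'signal user "connection closing"',
    "advance RCV.NXT over the FIN",
    "send ACK for the FIN",
]


def next_state_on_fin(state: str, our_fin_acked: bool = False):
    """Return (next_state, actions) for a segment with the FIN bit set."""
    if state in FIN_NOT_PROCESSED:
        return state, ["drop segment"]

    timers = ["start time-wait timer", "turn off other timers"]
    if state in ("SYN-RECEIVED", "ESTABLISHED"):
        return "CLOSE-WAIT", list(COMMON_ACTIONS)
    if state == "FIN-WAIT-1":
        # TIME-WAIT only if our own FIN has been ACKed (perhaps in this segment).
        if our_fin_acked:
            return "TIME-WAIT", COMMON_ACTIONS + timers
        return "CLOSING", list(COMMON_ACTIONS)
    if state == "FIN-WAIT-2":
        return "TIME-WAIT", COMMON_ACTIONS + timers
    if state in ("CLOSE-WAIT", "CLOSING", "LAST-ACK"):
        return state, list(COMMON_ACTIONS)
    if state == "TIME-WAIT":
        return state, COMMON_ACTIONS + ["restart 2 MSL time-wait timeout"]
    raise ValueError(f"unknown state: {state}")
```

Keeping the transitions in one function makes it easy to compare the sketch against the state list line by line.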
3859 + TIME-WAIT STATE 3861 * Remain in the TIME-WAIT state. Restart the 2 MSL 3862 time-wait timeout. 3864 - and return. 3866 3.10.8. Timeouts 3868 USER TIMEOUT 3870 - For any state if the user timeout expires, flush all queues, 3871 signal the user "error: connection aborted due to user timeout" 3872 in general and for any outstanding calls, delete the TCB, enter 3873 the CLOSED state and return. 3875 RETRANSMISSION TIMEOUT 3877 - For any state if the retransmission timeout expires on a 3878 segment in the retransmission queue, send the segment at the 3879 front of the retransmission queue again, reinitialize the 3880 retransmission timer, and return. 3882 TIME-WAIT TIMEOUT 3884 - If the time-wait timeout expires on a connection delete the 3885 TCB, enter the CLOSED state and return. 3887 4. Glossary 3889 ACK 3890 A control bit (acknowledge) occupying no sequence space, 3891 which indicates that the acknowledgment field of this segment 3892 specifies the next sequence number the sender of this segment 3893 is expecting to receive, hence acknowledging receipt of all 3894 previous sequence numbers. 3896 connection 3897 A logical communication path identified by a pair of sockets. 3899 datagram 3900 A message sent in a packet switched computer communications 3901 network. 3903 Destination Address 3904 The network layer address of the endpoint intended to receive 3905 a segment. 3907 FIN 3908 A control bit (finis) occupying one sequence number, which 3909 indicates that the sender will send no more data or control 3910 occupying sequence space. 3912 flush 3913 To remove all of the contents (data or segments) from a store 3914 (buffer or queue). 3916 fragment 3917 A portion of a logical unit of data, in particular an 3918 internet fragment is a portion of an internet datagram. 3920 header 3921 Control information at the beginning of a message, segment, 3922 fragment, packet or block of data. 3924 host 3925 A computer. 
In particular a source or destination of 3926 messages from the point of view of the communication network. 3928 Identification 3929 An Internet Protocol field. This identifying value assigned 3930 by the sender aids in assembling the fragments of a datagram. 3932 internet address 3933 A network layer address. 3935 internet datagram 3936 A unit of data exchanged between internet hosts, together 3937 with the internet header that allows the datagram to be 3938 routed from source to destination. 3940 internet fragment 3941 A portion of the data of an internet datagram with an 3942 internet header. 3944 IP 3945 Internet Protocol. See [1] and [13]. 3947 IRS 3948 The Initial Receive Sequence number. The first sequence 3949 number used by the sender on a connection. 3951 ISN 3952 The Initial Sequence Number. The first sequence number used 3953 on a connection, (either ISS or IRS). Selected in a way that 3954 is unique within a given period of time and is unpredictable 3955 to attackers. 3957 ISS 3958 The Initial Send Sequence number. The first sequence number 3959 used by the sender on a connection. 3961 left sequence 3962 This is the next sequence number to be acknowledged by the 3963 data receiving TCP endpoint (or the lowest currently 3964 unacknowledged sequence number) and is sometimes referred to 3965 as the left edge of the send window. 3967 module 3968 An implementation, usually in software, of a protocol or 3969 other procedure. 3971 MSL 3972 Maximum Segment Lifetime, the time a TCP segment can exist in 3973 the internetwork system. Arbitrarily defined to be 2 3974 minutes. 3976 octet 3977 An eight bit byte. 3979 Options 3980 An Option field may contain several options, and each option 3981 may be several octets in length. 3983 packet 3984 A package of data with a header that may or may not be 3985 logically complete. More often a physical packaging than a 3986 logical packaging of data. 
3988 port 3989 The portion of a connection identifier used for 3990 demultiplexing connections at an endpoint. 3992 process 3993 A program in execution. A source or destination of data from 3994 the point of view of the TCP endpoint or other host-to-host 3995 protocol. 3997 PUSH 3998 A control bit occupying no sequence space, indicating that 3999 this segment contains data that must be pushed through to the 4000 receiving user. 4002 RCV.NXT 4003 receive next sequence number 4005 RCV.UP 4006 receive urgent pointer 4008 RCV.WND 4009 receive window 4011 receive next sequence number 4012 This is the next sequence number the local TCP endpoint is 4013 expecting to receive. 4015 receive window 4016 This represents the sequence numbers the local (receiving) 4017 TCP endpoint is willing to receive. Thus, the local TCP 4018 endpoint considers that segments overlapping the range 4019 RCV.NXT to RCV.NXT + RCV.WND - 1 carry acceptable data or 4020 control. Segments containing sequence numbers entirely 4021 outside this range are considered duplicates or injection 4022 attacks and discarded. 4024 RST 4025 A control bit (reset), occupying no sequence space, 4026 indicating that the receiver should delete the connection 4027 without further interaction. The receiver can determine, 4028 based on the sequence number and acknowledgment fields of the 4029 incoming segment, whether it should honor the reset command 4030 or ignore it. In no case does receipt of a segment 4031 containing RST give rise to a RST in response. 4033 SEG.ACK 4034 segment acknowledgment 4036 SEG.LEN 4037 segment length 4039 SEG.SEQ 4040 segment sequence 4042 SEG.UP 4043 segment urgent pointer field 4045 SEG.WND 4046 segment window field 4048 segment 4049 A logical unit of data, in particular a TCP segment is the 4050 unit of data transferred between a pair of TCP modules. 4052 segment acknowledgment 4053 The sequence number in the acknowledgment field of the 4054 arriving segment. 
4056 segment length 4057 The amount of sequence number space occupied by a segment, 4058 including any controls that occupy sequence space. 4060 segment sequence 4061 The number in the sequence field of the arriving segment. 4063 send sequence 4064 This is the next sequence number the local (sending) TCP 4065 endpoint will use on the connection. It is initially 4066 selected from an initial sequence number curve (ISN) and is 4067 incremented for each octet of data or sequenced control 4068 transmitted. 4070 send window 4071 This represents the sequence numbers that the remote 4072 (receiving) TCP endpoint is willing to receive. It is the 4073 value of the window field specified in segments from the 4074 remote (data receiving) TCP endpoint. The range of new 4075 sequence numbers that may be emitted by a TCP implementation 4076 lies between SND.NXT and SND.UNA + SND.WND - 1. 4077 (Retransmissions of sequence numbers between SND.UNA and 4078 SND.NXT are expected, of course.) 4080 SND.NXT 4081 send sequence 4083 SND.UNA 4084 left sequence 4086 SND.UP 4087 send urgent pointer 4089 SND.WL1 4090 segment sequence number at last window update 4092 SND.WL2 4093 segment acknowledgment number at last window update 4095 SND.WND 4096 send window 4098 socket (or socket number, or socket address, or socket identifier) 4099 An address that specifically includes a port identifier, that 4100 is, the concatenation of an Internet Address with a TCP port. 4102 Source Address 4103 The network layer address of the sending endpoint. 4105 SYN 4106 A control bit in the incoming segment, occupying one sequence 4107 number, used at the initiation of a connection, to indicate 4108 where the sequence numbering will start. 4110 TCB 4111 Transmission control block, the data structure that records 4112 the state of a connection. 4114 TCP 4115 Transmission Control Protocol: A host-to-host protocol for 4116 reliable communication in internetwork environments. 
4118 TOS 4119 Type of Service, an obsoleted IPv4 field. The same header 4120 bits currently are used for the Differentiated Services field 4121 [4] containing the Differentiated Services Code Point (DSCP) 4122 value and the 2-bit ECN codepoint [6]. 4124 Type of Service 4125 See "TOS". 4127 URG 4128 A control bit (urgent), occupying no sequence space, used to 4129 indicate that the receiving user should be notified to do 4130 urgent processing as long as there is data to be consumed 4131 with sequence numbers less than the value indicated by the 4132 urgent pointer. 4134 urgent pointer 4135 A control field meaningful only when the URG bit is on. This 4136 field communicates the value of the urgent pointer that 4137 indicates the data octet associated with the sending user's 4138 urgent call. 4140 5. Changes from RFC 793 4142 This document obsoletes RFC 793 as well as RFC 6093 and 6528, which 4143 updated 793. In all cases, only the normative protocol specification 4144 and requirements have been incorporated into this document, and some 4145 informational text with background and rationale may not have been 4146 carried in. The informational content of those documents is still 4147 valuable in learning about and understanding TCP, and they are valid 4148 Informational references, even though their normative content has 4149 been incorporated into this document. 4151 The main body of this document was adapted from RFC 793's Section 3, 4152 titled "FUNCTIONAL SPECIFICATION", with an attempt to keep formatting 4153 and layout as close as possible. 4155 The collection of applicable RFC Errata that have been reported and 4156 either accepted or held for an update to RFC 793 were incorporated 4157 (Errata IDs: 573, 574, 700, 701, 1283, 1561, 1562, 1564, 1571, 1572, 4158 2297, 2298, 2748, 2749, 2934, 3213, 3300, 3301, 6222). Some errata 4159 were not applicable due to other changes (Errata IDs: 572, 575, 1565, 4160 1569, 2296, 3305, 3602). 
4162 Changes to the specification of the Urgent Pointer described in RFCs 4163 1011, 1122, and 6093 were incorporated. See RFC 6093 for detailed 4164 discussion of why these changes were necessary. 4166 The discussion of the RTO from RFC 793 was updated to refer to RFC 4167 6298. The RFC 1122 text on the RTO originally replaced the 793 text, 4168 however, RFC 2988 should have updated 1122, and has subsequently been 4169 obsoleted by 6298. 4171 RFC 1011 [19] contains a number of comments about RFC 793, including 4172 some needed changes to the TCP specification. These are expanded in 4173 RFC 1122, which contains a collection of other changes and 4174 clarifications to RFC 793. The normative items impacting the 4175 protocol have been incorporated here, though some historically useful 4176 implementation advice and informative discussion from RFC 1122 is not 4177 included here. The present document updates RFC 1011, since this is 4178 now the TCP specification rather than RFC 793, and the comments noted 4179 in 1011 have been incorporated. 4181 RFC 1122 contains more than just TCP requirements, so this document 4182 can't obsolete RFC 1122 entirely. It is only marked as "updating" 4183 1122, however, it should be understood to effectively obsolete all of 4184 the RFC 1122 material on TCP. 4186 The more secure Initial Sequence Number generation algorithm from RFC 4187 6528 was incorporated. See RFC 6528 for discussion of the attacks 4188 that this mitigates, as well as advice on selecting PRF algorithms 4189 and managing secret key data. 4191 A note based on RFC 6429 was added to explicitly clarify that system 4192 resource management concerns allow connection resources to be 4193 reclaimed. RFC 6429 is obsoleted in the sense that this 4194 clarification has been reflected in this update to the base TCP 4195 specification now. 
4197 The description of congestion control implementation was added, based 4198 on the set of documents that are IETF BCP or Standards Track on the 4199 topic, and the current state of common implementations. 4201 RFC EDITOR'S NOTE: the content below is for detailed change tracking 4202 and planning, and not to be included with the final revision of the 4203 document. 4205 This document started as draft-eddy-rfc793bis-00, that was merely a 4206 proposal and rough plan for updating RFC 793. 4208 The -01 revision of this draft-eddy-rfc793bis incorporates the 4209 content of RFC 793 Section 3 titled "FUNCTIONAL SPECIFICATION". 4210 Other content from RFC 793 has not been incorporated. The -01 4211 revision of this document makes some minor formatting changes to the 4212 RFC 793 content in order to convert the content into XML2RFC format 4213 and account for left-out parts of RFC 793. For instance, figure 4214 numbering differs and some indentation is not exactly the same. 4216 The -02 revision of draft-eddy-rfc793bis incorporates errata that 4217 have been verified: 4219 Errata ID 573: Reported by Bob Braden (note: This errata report 4220 basically is just a reminder that RFC 1122 updates 793. Some of 4221 the associated changes are left pending to a separate revision 4222 that incorporates 1122. Bob's mention of PUSH in 793 section 2.8 4223 was not applicable here because that section was not part of the 4224 "functional specification". Also, the 1122 text on the 4225 retransmission timeout also has been updated by subsequent RFCs, 4226 so the change here deviates from Bob's suggestion to apply the 4227 1122 text.) 
4228 Errata ID 574: Reported by Yin Shuming 4229 Errata ID 700: Reported by Yin Shuming 4230 Errata ID 701: Reported by Yin Shuming 4231 Errata ID 1283: Reported by Pei-chun Cheng 4232 Errata ID 1561: Reported by Constantin Hagemeier 4233 Errata ID 1562: Reported by Constantin Hagemeier 4234 Errata ID 1564: Reported by Constantin Hagemeier 4235 Errata ID 1565: Reported by Constantin Hagemeier 4236 Errata ID 1571: Reported by Constantin Hagemeier 4237 Errata ID 1572: Reported by Constantin Hagemeier 4238 Errata ID 2296: Reported by Vishwas Manral 4239 Errata ID 2297: Reported by Vishwas Manral 4240 Errata ID 2298: Reported by Vishwas Manral 4241 Errata ID 2748: Reported by Mykyta Yevstifeyev 4242 Errata ID 2749: Reported by Mykyta Yevstifeyev 4243 Errata ID 2934: Reported by Constantin Hagemeier 4244 Errata ID 3213: Reported by EugnJun Yi 4245 Errata ID 3300: Reported by Botong Huang 4246 Errata ID 3301: Reported by Botong Huang 4247 Errata ID 3305: Reported by Botong Huang 4248 Note: Some verified errata were not used in this update, as they 4249 relate to sections of RFC 793 elided from this document. These 4250 include Errata ID 572, 575, and 1569. 4251 Note: Errata ID 3602 was not applied in this revision as it is 4252 duplicative of the 1122 corrections. 4254 Not related to RFC 793 content, this revision also makes small tweaks 4255 to the introductory text, fixes indentation of the pseudo header 4256 diagram, and notes that the Security Considerations should also 4257 include privacy, when this section is written. 4259 The -03 revision of draft-eddy-rfc793bis revises all discussion of 4260 the urgent pointer in order to comply with RFC 6093, 1122, and 1011. 4261 Since 1122 held requirements on the urgent pointer, the full list of 4262 requirements was brought into an appendix of this document, so that 4263 it can be updated as-needed. 4265 The -04 revision of draft-eddy-rfc793bis includes the ISN generation 4266 changes from RFC 6528. 
4268 The -05 revision of draft-eddy-rfc793bis incorporates MSS 4269 requirements and definitions from RFC 879 [17], 1122, and 6691, as 4270 well as option-handling requirements from RFC 1122. 4272 The -00 revision of draft-ietf-tcpm-rfc793bis incorporates several 4273 additional clarifications and updates to the section on segmentation, 4274 many of which are based on feedback from Joe Touch improving from the 4275 initial text on this in the previous revision. 4277 The -01 revision incorporates the change to Reserved bits due to ECN, 4278 as well as many other changes that come from RFC 1122. 4280 The -02 revision has small formatting modifications in order to 4281 address xml2rfc warnings about long lines. It was a quick update to 4282 avoid document expiration. TCPM working group discussion in 2015 4283 also indicated that we should not try to add sections on 4284 implementation advice or similar non-normative information. 4286 The -03 revision incorporates more content from RFC 1122: Passive 4287 OPEN Calls, Time-To-Live, Multihoming, IP Options, ICMP messages, 4288 Data Communications, When to Send Data, When to Send a Window Update, 4289 Managing the Window, Probing Zero Windows, When to Send an ACK 4290 Segment. The section on data communications was re-organized into 4291 clearer subsections (previously headings were embedded in the 793 4292 text), and windows management advice from 793 was removed (as 4293 reviewed by TCPM working group) in favor of the 1122 additions on 4294 SWS, ZWP, and related topics. 4296 The -04 revision includes reference to RFC 6429 on the ZWP condition, 4297 RFC1122 material on TCP Connection Failures, TCP Keep-Alives, 4298 Acknowledging Queued Segments, and Remote Address Validation. RTO 4299 computation is referenced from RFC 6298 rather than RFC 1122. 
4301 The -05 revision includes the requirement to implement TCP congestion 4302 control with recommendation to implement ECN, the RFC 6633 update to 4303 1122, which changed the requirement on responding to source quench 4304 ICMP messages, and discussion of ICMP (and ICMPv6) soft and hard 4305 errors per RFC 5461 (ICMPv6 handling for TCP doesn't seem to be 4306 mentioned elsewhere in standards track). 4308 The -06 revision includes an appendix on "Other Implementation Notes" 4309 to capture widely-deployed fundamental features that are not 4310 contained in the RFC series yet. It also added mention of RFC 6994 4311 and the IANA TCP parameters registry as a reference. It includes 4312 references to RFC 5961 in appropriate places. The references to TOS 4313 were changed to DiffServ field, based on reflecting RFC 2474 as well 4314 as the IPv6 presence of traffic class (carrying DiffServ field) 4315 rather than TOS. 4317 The -07 revision includes reference to RFC 6191, updated security 4318 considerations, discussion of additional implementation 4319 considerations, and clarification of data on the SYN. 
4321 The -08 revision includes changes based on: 4323 describing treatment of reserved bits (following TCPM mailing list 4324 thread from July 2014 on "793bis item - reserved bit behavior") 4325 addition of a brief TCP key concepts section to make up for not 4326 including the outdated section 2 of RFC 793 4327 changed "TCP" to "host" to resolve conflict between 1122 wording 4328 on whether TCP or the network layer chooses an address when 4329 multihomed 4330 fixed/updated definition of options in glossary 4331 moved note on aggregating ACKs from 1122 to a more appropriate 4332 location 4333 resolved notes on IP precedence and security/compartment 4334 added implementation note on sequence number validation 4335 added note that PUSH does not apply when Nagle is active 4336 added 1122 content on asynchronous reports to replace 793 section 4337 on TCP to user messages 4339 The -09 revision fixes section numbering problems. 4341 The -10 revision includes additions to the security considerations 4342 based on comments from Joe Touch, and suggested edits on RST/FIN 4343 notification, RFC 2525 reference, and other edits suggested by 4344 Yuchung Cheng, as well as modifications to DiffServ text from Yuchung 4345 Cheng and Gorry Fairhurst. 4347 The -11 revision includes a start at identifying all of the 4348 requirements text and referencing each instance in the common table 4349 at the end of the document. 4351 The -12 revision completes the requirement language indexing started 4352 in -11 and adds necessary description of the PUSH functionality that 4353 was missing. 4355 The -13 revision contains only changes in the inline editor notes. 4357 The -14 revision includes updates with regard to several comments 4358 from the mailing list, including editorial fixes, adding IANA 4359 considerations for the header flags, improving figure title 4360 placement, and breaking up the "Terminology" section into more 4361 appropriately titled subsections.
4363 The -15 revision has many technical and editorial corrections from 4364 Gorry Fairhurst's review, and subsequent discussion on the TCPM list, 4365 as well as some other collected clarifications and improvements from 4366 mailing list discussion. 4368 The -16 revision addresses several discussions that rose from 4369 additional reviews and follow-up on some of Gorry Fairhurst's 4370 comments from revision 14. 4372 The -17 revision includes errata 6222 from Charles Deng, update to 4373 the key words boilerplate, updated description of the header flags 4374 registry changes, and clarification about connections rather than 4375 users in the discussion of OPEN calls. 4377 The -18 revision includes editorial changes to the IANA 4378 considerations, based on comments from Richard Scheffenegger at the 4379 IETF 108 TCPM virtual meeting. 4381 The -19 revision includes editorial changes from Errata 6281 and 6282 4382 reported by Merlin Buge. It also includes WGLC changes noted by 4383 Mohamed Boucadair, Rahul Jadhav, Praveen Balasubramanian, Matt Olson, 4384 Yi Huang, Joe Touch, and Juhamatti Kuusisaari. 4386 The -20 revision includes text on congestion control based on mailing 4387 list and meeting discussion, put together in its final form by Markku 4388 Kojo. It also clarifies that SACK, WS, and TS options are 4389 recommended for high performance, but not needed for basic 4390 interoperability. It also clarifies that the length field is 4391 required for new TCP options. 
4393 The -21 revision includes slight changes to the header diagram for 4394 compatibility with tooling, from Stephen McQuistin, clarification on 4395 the meaning of idle connections from Yuchung Cheng, Neal Cardwell, 4396 Michael Scharf, and Richard Scheffenegger, editorial improvements 4397 from Markku Kojo, notes that some stacks suppress extra 4398 acknowledgments of the SYN when SYN-ACK carries data from Richard 4399 Scheffenegger, and adds MAY-18 numbering based on note from Jonathan 4400 Morton. 4402 The -22 revision includes small clarifications on terminology (might 4403 versus may) and IPv6 extension headers versus IPv4 options, based on 4404 comments from Gorry Fairhurst. 4406 The -23 revision has a fix to indentation from Michael Tuexen and 4407 idnits issues addressed from Michael Scharf. 4409 The -24 revision incorporates changes after Martin Duke's AD review, 4410 including further feedback on those comments from Yuchung Cheng and 4411 Joe Touch. Important changes for review include (1) removal of the 4412 need to check for the PUSH flag when evaluating the SWS override 4413 timer expiration, (2) clarification about receding urgent pointer, 4414 and (3) de-duplicating handling of the RST checking between step 4 4415 and step 1. 4417 The -25 revision incorporates changes based on the GENART review from 4418 Francis Dupont, SECDIR review from Kyle Rose, and OPSDIR review from 4419 Sarah Banks. 4421 The -26 revision incorporates changes stemming from the IESG reviews, 4422 and INTDIR review from Bernie Volz. 4424 The -27 revision fixes a few small editorial incompatibilities that 4425 Stephen McQuistin found related to automated code generation. 4427 The -28 revision addresses some COMMENTs from Ben Kaduk's IESG 4428 review. 4430 Some other suggested changes that will not be incorporated in this 4431 793 update unless TCPM consensus changes with regard to scope are: 4433 1. Tony Sabatini's suggestion for describing DO field 4434 2. 
Per discussion with Joe Touch (TAPS list, 6/20/2015), the 4435 description of the API could be revisited 4436 3. Reducing the R2 value for SYNs has been suggested as a possible 4437 topic for future consideration. 4439 Early in the process of updating RFC 793, Scott Brim mentioned that 4440 this should include a PERPASS/privacy review. This may be something 4441 for the chairs or AD to request during WGLC or IETF LC. 4443 6. IANA Considerations 4445 In the "Transmission Control Protocol (TCP) Header Flags" registry, 4446 IANA is asked to make several changes described in this section. 4448 RFC 3168 originally created this registry, but only populated it with 4449 the new bits defined in RFC 3168, neglecting the other bits that had 4450 previously been described in RFC 793 and other documents. Bit 7 has 4451 since also been updated by RFC 8311. 4453 The "Bit" column is renamed below as the "Bit Offset" column, since 4454 it references each header flag's offset within the 16-bit aligned 4455 view of the TCP header in Figure 1. The bits in offsets 0 through 3 4456 are the TCP segment Data Offset field, and not header flags. 4458 IANA should add a column for "Assignment Notes". 4460 IANA should assign values indicated below.
4462                          TCP Header Flags
4464   Bit      Name                                       Reference        Assignment Notes
4465   Offset
4466   ---      ----                                       ---------        ----------------
4467   4        Reserved for future use                    (this document)
4468   5        Reserved for future use                    (this document)
4469   6        Reserved for future use                    (this document)
4470   7        Reserved for future use                    [RFC8311]        [1]
4471   8        CWR (Congestion Window Reduced)            [RFC3168]
4472   9        ECE (ECN-Echo)                             [RFC3168]
4473   10       Urgent Pointer field is significant (URG)  (this document)
4474   11       Acknowledgment field is significant (ACK)  (this document)
4475   12       Push Function (PSH)                        (this document)
4476   13       Reset the connection (RST)                 (this document)
4477   14       Synchronize sequence numbers (SYN)         (this document)
4478   15       No more data from sender (FIN)             (this document)
4480 FOOTNOTES:
4481 [1] Previously used by Historic [RFC3540] as NS (Nonce Sum).
4483 This TCP Header Flags registry should also be moved to a sub-registry 4484 under the global "Transmission Control Protocol (TCP) Parameters" 4485 registry (https://www.iana.org/assignments/tcp-parameters/tcp- 4486 parameters.xhtml).
4488 The registry's Registration Procedure should remain Standards Action, 4489 but the Reference can be updated to this document, and the Note 4490 removed.
4492 7. Security and Privacy Considerations
4494 The TCP design includes only rudimentary security features that 4495 improve the robustness and reliability of connections and application 4496 data transfer, but there are no built-in cryptographic capabilities 4497 to support any form of confidentiality, authentication, or other 4498 typical security functions. Non-cryptographic enhancements (e.g. 4499 [9]) have been developed to improve robustness of TCP connections to 4500 particular types of attacks, but the applicability and protections of 4501 non-cryptographic enhancements are limited (e.g. see section 1.1 of 4502 [9]). Applications typically utilize lower-layer (e.g. IPsec) and 4503 upper-layer (e.g.
TLS) protocols to provide security and privacy for 4504 TCP connections and application data carried in TCP. Methods based 4505 on TCP options have been developed as well, to support some security 4506 capabilities. 4508 In order to fully provide confidentiality, integrity protection, and 4509 authentication for TCP connections (including their control flags), 4510 IPsec is the only current effective method. For integrity protection 4511 and authentication, the TCP Authentication Option (TCP-AO) [39] is 4512 available, with a proposed extension to also provide confidentiality 4513 for the segment payload. Other methods discussed in this section may 4514 provide confidentiality or integrity protection for the payload, but 4515 for the TCP header only cover either a subset of the fields (e.g. 4516 tcpcrypt [57]) or none at all (e.g. TLS). Other security features 4517 that have been added to TCP (e.g. ISN generation, sequence number 4518 checks, and others) are only capable of partially hindering attacks. 4520 Applications using long-lived TCP flows have been vulnerable to 4521 attacks that exploit the processing of control flags described in 4522 earlier TCP specifications [34]. TCP-MD5 was a commonly implemented 4523 TCP option to support authentication for some of these connections, 4524 but had flaws and is now deprecated. TCP-AO provides a capability to 4525 protect long-lived TCP connections from attacks, and has superior 4526 properties to TCP-MD5. It does not provide any privacy for 4527 application data, nor for the TCP headers. 4529 The "tcpcrypt" [57] Experimental extension to TCP provides the 4530 ability to cryptographically protect connection data. Metadata 4531 aspects of the TCP flow are still visible, but the application stream 4532 is well protected. Within the TCP header, only the urgent pointer 4533 and FIN flag are protected through tcpcrypt. 4535 The TCP Roadmap [50] includes notes about several RFCs related to TCP 4536 security.
Many of the enhancements provided by these RFCs have been 4537 integrated into the present document, including ISN generation, 4538 mitigating blind in-window attacks, and improving handling of soft 4539 errors and ICMP packets. These are all discussed in greater detail 4540 in the referenced RFCs that originally described the changes needed 4541 to earlier TCP specifications. Additionally, see RFC 6093 [40] for 4542 discussion of security considerations related to the urgent pointer 4543 field, which has been deprecated. 4545 Since TCP is often used for bulk transfer flows, some attacks are 4546 possible that abuse the TCP congestion control logic. One example is 4547 the "ACK-division" attack. Updates that have been made to the TCP 4548 congestion control specifications include mechanisms like Appropriate 4549 Byte Counting (ABC) [30] that act as mitigations to these attacks. 4551 Other attacks are focused on exhausting the resources of a TCP 4552 server. Examples include SYN flooding [33] and wasting resources on 4553 non-progressing connections [42]. Operating systems commonly 4554 implement mitigations for these attacks. Some common defenses also 4555 utilize proxies, stateful firewalls, and other technologies outside 4556 the end-host TCP implementation. 4558 The concept of a protocol's "wire image" is described in RFC 8546 4559 [56], which describes how TCP's cleartext headers expose more 4560 metadata to nodes on the path than is strictly required to route the 4561 packets to their destination. On-path adversaries may be able to 4562 leverage this metadata. Lessons learned in this respect from TCP 4563 have been applied in the design of newer transports like QUIC [60]. 4564 Additionally, based partly on experiences with TCP and its 4565 extensions, the IETF has documented considerations that might be 4566 applicable for future TCP extensions and other transports in 4567 RFC 9065 [61], along with IAB recommendations in RFC 4568 8558 [58] and [68].
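As a rough illustration of the cleartext metadata discussed above, the following Python sketch shows what any on-path node could read from the fixed part of a TCP header. The bit positions follow the "Bit Offset" column of the TCP Header Flags registry in Section 6, where offset 0 is the most significant bit of the 16-bit word that carries the Data Offset and the flags. The function and dictionary names are hypothetical and chosen only for this example; this is a sketch, not a reference parser.

```python
import struct

# Bit offsets as listed in the "Bit Offset" column of the TCP Header Flags
# registry (offset 0 is the most significant bit of the 16-bit word).
# Reserved offsets 4-7 are deliberately not decoded in this sketch.
FLAG_BIT_OFFSETS = {
    "CWR": 8, "ECE": 9, "URG": 10, "ACK": 11,
    "PSH": 12, "RST": 13, "SYN": 14, "FIN": 15,
}


def decode_flags_word(word):
    """Split the 16-bit Data Offset/flags word into (data_offset, flags)."""
    data_offset = (word >> 12) & 0xF  # top four bits, counted in 32-bit words
    flags = {
        name
        for name, offset in FLAG_BIT_OFFSETS.items()
        if word & (1 << (15 - offset))
    }
    return data_offset, flags


def observe_segment(header):
    """Return the fields readable from the first 16 octets of a TCP header."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 octets long")
    src_port, dst_port, seq, ack, word, window = struct.unpack(
        "!HHIIHH", header[:16]
    )
    data_offset, flags = decode_flags_word(word)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "data_offset": data_offset,
        "flags": flags,
        "window": window,
    }
```

For example, a SYN-ACK without options carries the word 0x5012 in this position, which the sketch decodes as a Data Offset of 5 with the SYN and ACK flags set. None of this requires any key material, which is the point the wire image discussion makes.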
4570 There are also methods of "fingerprinting" that can be used to infer 4571 the host TCP implementation (operating system) version or platform 4572 information. These collect observations of several aspects such as 4573 the options present in segments, the ordering of options, the 4574 specific behaviors in the case of various conditions, packet timing, 4575 packet sizing, and other aspects of the protocol that are left to be 4576 determined by an implementer, and can use those observations to 4577 identify information about the host and implementation. 4579 8. Acknowledgements 4581 This document is largely a revision of RFC 793, which Jon Postel was 4582 the editor of. Due to his excellent work, it was able to last for 4583 three decades before we felt the need to revise it. 4585 Andre Oppermann was a contributor and helped to edit the first 4586 revision of this document. 4588 We are thankful for the assistance of the IETF TCPM working group 4589 chairs, over the course of work on this document: 4591 Michael Scharf 4593 Yoshifumi Nishida 4595 Pasi Sarolahti 4597 Michael Tuexen 4599 During the discussions of this work on the TCPM mailing list, in 4600 working group meetings, and via area reviews, helpful comments, 4601 critiques, and reviews were received from (listed alphabetically by 4602 last name): Praveen Balasubramanian, David Borman, Mohamed Boucadair, 4603 Bob Briscoe, Neal Cardwell, Yuchung Cheng, Martin Duke, Francis 4604 Dupont, Ted Faber, Gorry Fairhurst, Fernando Gont, Rodney Grimes, Yi 4605 Huang, Rahul Jadhav, Markku Kojo, Mike Kosek, Juhamatti Kuusisaari, 4606 Kevin Lahey, Kevin Mason, Matt Mathis, Stephen McQuistin, Jonathan 4607 Morton, Matt Olson, Tommy Pauly, Tom Petch, Hagen Paul Pfeifer, Kyle 4608 Rose, Anthony Sabatini, Michael Scharf, Greg Skinner, Joe Touch, 4609 Michael Tuexen, Reji Varghese, Bernie Volz, Tim Wicinski, Lloyd Wood, 4610 and Alex Zimmermann. 
4612 Joe Touch provided additional help in clarifying the description of 4613 segment size parameters and PMTUD/PLPMTUD recommendations. Markku 4614 Kojo helped put together the text in the section on TCP Congestion 4615 Control. 4617 This document includes content from errata that were reported by 4618 (listed chronologically): Yin Shuming, Bob Braden, Morris M. Keesan, 4619 Pei-chun Cheng, Constantin Hagemeier, Vishwas Manral, Mykyta 4620 Yevstifeyev, EungJun Yi, Botong Huang, Charles Deng, Merlin Buge. 4622 9. References 4624 9.1. Normative References 4626 [1] Postel, J., "Internet Protocol", STD 5, RFC 791, 4627 DOI 10.17487/RFC0791, September 1981, 4628 . 4630 [2] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 4631 DOI 10.17487/RFC1191, November 1990, 4632 . 4634 [3] Bradner, S., "Key words for use in RFCs to Indicate 4635 Requirement Levels", BCP 14, RFC 2119, 4636 DOI 10.17487/RFC2119, March 1997, 4637 . 4639 [4] Nichols, K., Blake, S., Baker, F., and D. Black, 4640 "Definition of the Differentiated Services Field (DS 4641 Field) in the IPv4 and IPv6 Headers", RFC 2474, 4642 DOI 10.17487/RFC2474, December 1998, 4643 . 4645 [5] Floyd, S., "Congestion Control Principles", BCP 41, 4646 RFC 2914, DOI 10.17487/RFC2914, September 2000, 4647 . 4649 [6] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 4650 of Explicit Congestion Notification (ECN) to IP", 4651 RFC 3168, DOI 10.17487/RFC3168, September 2001, 4652 . 4654 [7] Floyd, S. and M. Allman, "Specifying New Congestion 4655 Control Algorithms", BCP 133, RFC 5033, 4656 DOI 10.17487/RFC5033, August 2007, 4657 . 4659 [8] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 4660 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 4661 . 4663 [9] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's 4664 Robustness to Blind In-Window Attacks", RFC 5961, 4665 DOI 10.17487/RFC5961, August 2010, 4666 . 4668 [10] Paxson, V., Allman, M., Chu, J., and M. 
Sargent, 4669 "Computing TCP's Retransmission Timer", RFC 6298, 4670 DOI 10.17487/RFC6298, June 2011, 4671 . 4673 [11] Gont, F., "Deprecation of ICMP Source Quench Messages", 4674 RFC 6633, DOI 10.17487/RFC6633, May 2012, 4675 . 4677 [12] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 4678 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 4679 May 2017, . 4681 [13] Deering, S. and R. Hinden, "Internet Protocol, Version 6 4682 (IPv6) Specification", STD 86, RFC 8200, 4683 DOI 10.17487/RFC8200, July 2017, 4684 . 4686 [14] McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed., 4687 "Path MTU Discovery for IP version 6", STD 87, RFC 8201, 4688 DOI 10.17487/RFC8201, July 2017, 4689 . 4691 [15] Allman, M., "Requirements for Time-Based Loss Detection", 4692 BCP 233, RFC 8961, DOI 10.17487/RFC8961, November 2020, 4693 . 4695 9.2. Informative References 4697 [16] Postel, J., "Transmission Control Protocol", STD 7, 4698 RFC 793, DOI 10.17487/RFC0793, September 1981, 4699 . 4701 [17] Postel, J., "The TCP Maximum Segment Size and Related 4702 Topics", RFC 879, DOI 10.17487/RFC0879, November 1983, 4703 . 4705 [18] Nagle, J., "Congestion Control in IP/TCP Internetworks", 4706 RFC 896, DOI 10.17487/RFC0896, January 1984, 4707 . 4709 [19] Reynolds, J. and J. Postel, "Official Internet protocols", 4710 RFC 1011, DOI 10.17487/RFC1011, May 1987, 4711 . 4713 [20] Braden, R., Ed., "Requirements for Internet Hosts - 4714 Communication Layers", STD 3, RFC 1122, 4715 DOI 10.17487/RFC1122, October 1989, 4716 . 4718 [21] Almquist, P., "Type of Service in the Internet Protocol 4719 Suite", RFC 1349, DOI 10.17487/RFC1349, July 1992, 4720 . 4722 [22] Braden, R., "T/TCP -- TCP Extensions for Transactions 4723 Functional Specification", RFC 1644, DOI 10.17487/RFC1644, 4724 July 1994, . 4726 [23] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 4727 Selective Acknowledgment Options", RFC 2018, 4728 DOI 10.17487/RFC2018, October 1996, 4729 . 
4731 [24] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, 4732 J., Heavens, I., Lahey, K., Semke, J., and B. Volz, "Known 4733 TCP Implementation Problems", RFC 2525, 4734 DOI 10.17487/RFC2525, March 1999, 4735 . 4737 [25] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 4738 RFC 2675, DOI 10.17487/RFC2675, August 1999, 4739 . 4741 [26] Xiao, X., Hannan, A., Paxson, V., and E. Crabbe, "TCP 4742 Processing of the IPv4 Precedence Field", RFC 2873, 4743 DOI 10.17487/RFC2873, June 2000, 4744 . 4746 [27] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 4747 Extension to the Selective Acknowledgement (SACK) Option 4748 for TCP", RFC 2883, DOI 10.17487/RFC2883, July 2000, 4749 . 4751 [28] Lahey, K., "TCP Problems with Path MTU Discovery", 4752 RFC 2923, DOI 10.17487/RFC2923, September 2000, 4753 . 4755 [29] Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. 4756 Sooriyabandara, "TCP Performance Implications of Network 4757 Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, 4758 December 2002, . 4760 [30] Allman, M., "TCP Congestion Control with Appropriate Byte 4761 Counting (ABC)", RFC 3465, DOI 10.17487/RFC3465, February 4762 2003, . 4764 [31] Fenner, B., "Experimental Values In IPv4, IPv6, ICMPv4, 4765 ICMPv6, UDP, and TCP Headers", RFC 4727, 4766 DOI 10.17487/RFC4727, November 2006, 4767 . 4769 [32] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 4770 Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007, 4771 . 4773 [33] Eddy, W., "TCP SYN Flooding Attacks and Common 4774 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 4775 . 4777 [34] Touch, J., "Defending TCP Against Spoofing Attacks", 4778 RFC 4953, DOI 10.17487/RFC4953, July 2007, 4779 . 4781 [35] Culley, P., Elzur, U., Recio, R., Bailey, S., and J. 4782 Carrier, "Marker PDU Aligned Framing for TCP 4783 Specification", RFC 5044, DOI 10.17487/RFC5044, October 4784 2007, . 
4786 [36] Gont, F., "TCP's Reaction to Soft Errors", RFC 5461, 4787 DOI 10.17487/RFC5461, February 2009, 4788 . 4790 [37] StJohns, M., Atkinson, R., and G. Thomas, "Common 4791 Architecture Label IPv6 Security Option (CALIPSO)", 4792 RFC 5570, DOI 10.17487/RFC5570, July 2009, 4793 . 4795 [38] Sandlund, K., Pelletier, G., and L-E. Jonsson, "The RObust 4796 Header Compression (ROHC) Framework", RFC 5795, 4797 DOI 10.17487/RFC5795, March 2010, 4798 . 4800 [39] Touch, J., Mankin, A., and R. Bonica, "The TCP 4801 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 4802 June 2010, . 4804 [40] Gont, F. and A. Yourtchenko, "On the Implementation of the 4805 TCP Urgent Mechanism", RFC 6093, DOI 10.17487/RFC6093, 4806 January 2011, . 4808 [41] Gont, F., "Reducing the TIME-WAIT State Using TCP 4809 Timestamps", BCP 159, RFC 6191, DOI 10.17487/RFC6191, 4810 April 2011, . 4812 [42] Bashyam, M., Jethanandani, M., and A. Ramaiah, "TCP Sender 4813 Clarification for Persist Condition", RFC 6429, 4814 DOI 10.17487/RFC6429, December 2011, 4815 . 4817 [43] Gont, F. and S. Bellovin, "Defending against Sequence 4818 Number Attacks", RFC 6528, DOI 10.17487/RFC6528, February 4819 2012, . 4821 [44] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 4822 RFC 6691, DOI 10.17487/RFC6691, July 2012, 4823 . 4825 [45] Touch, J., "Updated Specification of the IPv4 ID Field", 4826 RFC 6864, DOI 10.17487/RFC6864, February 2013, 4827 . 4829 [46] Touch, J., "Shared Use of Experimental TCP Options", 4830 RFC 6994, DOI 10.17487/RFC6994, August 2013, 4831 . 4833 [47] McPherson, D., Oran, D., Thaler, D., and E. Osterweil, 4834 "Architectural Considerations of IP Anycast", RFC 7094, 4835 DOI 10.17487/RFC7094, January 2014, 4836 . 4838 [48] Borman, D., Braden, B., Jacobson, V., and R. 4839 Scheffenegger, Ed., "TCP Extensions for High Performance", 4840 RFC 7323, DOI 10.17487/RFC7323, September 2014, 4841 . 4843 [49] Cheng, Y., Chu, J., Radhakrishnan, S., and A. 
Jain, "TCP 4844 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 4845 . 4847 [50] Duke, M., Braden, R., Eddy, W., Blanton, E., and A. 4848 Zimmermann, "A Roadmap for Transmission Control Protocol 4849 (TCP) Specification Documents", RFC 7414, 4850 DOI 10.17487/RFC7414, February 2015, 4851 . 4853 [51] Black, D., Ed. and P. Jones, "Differentiated Services 4854 (Diffserv) and Real-Time Communication", RFC 7657, 4855 DOI 10.17487/RFC7657, November 2015, 4856 . 4858 [52] Fairhurst, G. and M. Welzl, "The Benefits of Using 4859 Explicit Congestion Notification (ECN)", RFC 8087, 4860 DOI 10.17487/RFC8087, March 2017, 4861 . 4863 [53] Fairhurst, G., Ed., Trammell, B., Ed., and M. Kuehlewind, 4864 Ed., "Services Provided by IETF Transport Protocols and 4865 Congestion Control Mechanisms", RFC 8095, 4866 DOI 10.17487/RFC8095, March 2017, 4867 . 4869 [54] Welzl, M., Tuexen, M., and N. Khademi, "On the Usage of 4870 Transport Features Provided by IETF Transport Protocols", 4871 RFC 8303, DOI 10.17487/RFC8303, February 2018, 4872 . 4874 [55] Chown, T., Loughney, J., and T. Winters, "IPv6 Node 4875 Requirements", BCP 220, RFC 8504, DOI 10.17487/RFC8504, 4876 January 2019, . 4878 [56] Trammell, B. and M. Kuehlewind, "The Wire Image of a 4879 Network Protocol", RFC 8546, DOI 10.17487/RFC8546, April 4880 2019, . 4882 [57] Bittau, A., Giffin, D., Handley, M., Mazieres, D., Slack, 4883 Q., and E. Smith, "Cryptographic Protection of TCP Streams 4884 (tcpcrypt)", RFC 8548, DOI 10.17487/RFC8548, May 2019, 4885 . 4887 [58] Hardie, T., Ed., "Transport Protocol Path Signals", 4888 RFC 8558, DOI 10.17487/RFC8558, April 2019, 4889 . 4891 [59] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C. 4892 Paasch, "TCP Extensions for Multipath Operation with 4893 Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March 4894 2020, . 4896 [60] Iyengar, J., Ed. and M. 
Thomson, Ed., "QUIC: A UDP-Based 4897 Multiplexed and Secure Transport", RFC 9000, 4898 DOI 10.17487/RFC9000, May 2021, 4899 . 4901 [61] Fairhurst, G. and C. Perkins, "Considerations around 4902 Transport Header Confidentiality, Network Operations, and 4903 the Evolution of Internet Transport Protocols", RFC 9065, 4904 DOI 10.17487/RFC9065, July 2021, 4905 . 4907 [62] IANA, "Transmission Control Protocol (TCP) Parameters, 4908 https://www.iana.org/assignments/tcp-parameters/tcp- 4909 parameters.xhtml", 2019. 4911 [63] IANA, "Transmission Control Protocol (TCP) Header Flags, 4912 https://www.iana.org/assignments/tcp-header-flags/tcp- 4913 header-flags.xhtml", 2019. 4915 [64] Gont, F., "Processing of IP Security/Compartment and 4916 Precedence Information by TCP", Work in Progress, 4917 Internet-Draft, draft-gont-tcpm-tcp-seccomp-prec-00, 29 4918 March 2012, . 4921 [65] Gont, F. and D. Borman, "On the Validation of TCP Sequence 4922 Numbers", Work in Progress, Internet-Draft, draft-gont- 4923 tcpm-tcp-seq-validation-04, 11 March 2019, 4924 . 4927 [66] Touch, J. and W. Eddy, "TCP Extended Data Offset Option", 4928 Work in Progress, Internet-Draft, draft-ietf-tcpm-tcp-edo- 4929 10, 19 July 2018, . 4932 [67] McQuistin, S., Band, V., Jacob, D., and C. Perkins, 4933 "Describing Protocol Data Units with Augmented Packet 4934 Header Diagrams", Work in Progress, Internet-Draft, draft- 4935 mcquistin-augmented-ascii-diagrams-08, 5 May 2021, 4936 . 4939 [68] Thomson, M. and T. Pauly, "Long-term Viability of Protocol 4940 Extension Mechanisms", Work in Progress, Internet-Draft, 4941 draft-iab-use-it-or-lose-it-02, 23 August 2021, 4942 . 4945 [69] Minshall, G., "A Proposed Modification to Nagle's 4946 Algorithm", Work in Progress, Internet-Draft, draft- 4947 minshall-nagle-01, June 1999, 4948 . 4951 [70] Dalal, Y. and C. Sunshine, "Connection Management in 4952 Transport Protocols", Computer Networks Vol. 2, No. 6, pp. 4953 454-473, December 1978. 
4955 [71] Faber, T., Touch, J., and W. Yui, "The TIME-WAIT state in 4956 TCP and Its Effect on Busy Servers", Proceedings of IEEE 4957 INFOCOM pp. 1573-1583, March 1999. 4959 [72] Postel, J., "Comments on Action Items from the January 4960 Meeting", IEN 177, March 1981, 4961 . 4963 [73] "Segmentation Offloads", Linux Networking Documentation, 4964 . 4967 Appendix A. Other Implementation Notes 4969 This section includes additional notes and references on TCP 4970 implementation decisions that are currently not a part of the RFC 4971 series or included within the TCP standard. These items can be 4972 considered by implementers, but there was not yet a consensus to 4973 include them in the standard. 4975 A.1. IP Security Compartment and Precedence 4977 The IPv4 specification [1] includes a precedence value in the (now 4978 obsoleted) Type of Service (TOS) field. It was modified in 4979 [21], and then obsoleted by the definition of Differentiated Services 4980 (DiffServ) [4]. Setting and conveying TOS between the network layer, 4981 TCP implementation, and applications is obsolete, and replaced by 4982 DiffServ in the current TCP specification. 4984 RFC 793 required checking the IP security compartment and precedence 4985 on incoming TCP segments for consistency within a connection, and 4986 with application requests. Each of these aspects of IP has become 4987 outdated, without specific updates to RFC 793. The issues with 4988 precedence were fixed by [26], which is Standards Track, and so this 4989 present TCP specification includes those changes. However, the state 4990 of IP security options that may be used by MLS systems is not as 4991 apparent in the IETF currently.
4993 Resetting connections when incoming packets do not meet 4994 security compartment or precedence expectations has been recognized 4995 as a possible attack vector [64], and there has been discussion about 4996 amending the TCP specification to prevent connections from being 4997 aborted due to non-matching IP security compartment and DiffServ 4998 codepoint values. 5000 A.1.1. Precedence 5002 In DiffServ, the former precedence values are treated as Class 5003 Selector codepoints, and methods for compatible treatment are 5004 described in the DiffServ architecture. The RFC 793/1122 TCP 5005 specification includes logic intended to have connections use the 5006 highest precedence requested by either endpoint application, and to 5007 keep the precedence consistent throughout a connection. This logic 5008 from the obsolete TOS is not applicable for DiffServ, and should not 5009 be included in TCP implementations, though changes to DiffServ values 5010 within a connection are discouraged. For discussion of this, see RFC 5011 7657 (sec 5.1, 5.3, and 6) [51]. 5013 The obsoleted TOS processing rules in TCP assumed bidirectional (or 5014 symmetric) precedence values used on a connection, but the DiffServ 5015 architecture is asymmetric. Problems with the old TCP logic in this 5016 regard were described in [26], and the solution described is to ignore 5017 IP precedence in TCP. Since RFC 2873 is a Standards Track document 5018 (although not marked as updating RFC 793), current implementations 5019 are expected to be robust to these conditions. Note that the 5020 DiffServ field value used in each direction is a part of the 5021 interface between TCP and the network layer, and values in use can be 5022 indicated both ways between TCP and the application. 5024 A.1.2. MLS Systems 5026 The IP security option (IPSO) and compartment defined in [1] were 5027 refined in RFC 1038, which was later obsoleted by RFC 1108.
The 5028 Commercial IP Security Option (CIPSO) is defined in FIPS-188 5029 (withdrawn by NIST in 2015), and is supported by some vendors and 5030 operating systems. RFC 1108 is now Historic, though RFC 791 itself 5031 has not been updated to remove the IP security option. For IPv6, a 5032 similar option (CALIPSO) has been defined [37]. RFC 793 contains 5033 logic that uses the IP security/compartment information in the 5034 treatment of TCP segments. References to the IP "security/ 5035 compartment" in this document may be relevant for Multi-Level Secure 5036 (MLS) system implementers, but can be ignored for non-MLS 5037 implementations, consistent with running code on the Internet. See 5038 Appendix A.1 for further discussion. Note that RFC 5570 describes 5039 some MLS networking scenarios where IPSO, CIPSO, or CALIPSO may be 5040 used. In these special cases, TCP implementers should see section 5041 7.3.1 of RFC 5570, and follow the guidance in that document. 5043 A.2. Sequence Number Validation 5045 There are cases where the TCP sequence number validation rules can 5046 prevent ACK fields from being processed. This can result in 5047 connection issues, as described in [65], which includes descriptions 5048 of potential problems in conditions of simultaneous open, self- 5049 connects, simultaneous close, and simultaneous window probes. The 5050 document also describes potential changes to the TCP specification to 5051 mitigate the issue by expanding the acceptable sequence numbers. 5053 In Internet usage of TCP, these conditions rarely occur. 5054 Common operating systems include different mitigations, 5055 and the standard has not been updated yet to codify one of them, but 5056 implementers should consider the problems described in [65]. 5058 A.3. Nagle Modification 5060 In common operating systems, both the Nagle algorithm and delayed 5061 acknowledgements are implemented and enabled by default.
TCP is used 5062 by many applications that have a request-response style of 5063 communication, where the combination of the Nagle algorithm and 5064 delayed acknowledgements can result in poor application performance. 5065 A modification to the Nagle algorithm that improves the situation 5066 for these applications is described in [69]. 5068 This modification is implemented in some common operating systems, 5069 and does not impact TCP interoperability. Additionally, many 5070 applications simply disable Nagle, since this is generally supported 5071 by a socket option. The TCP standard has not been updated to include 5072 this Nagle modification, but implementers may find it beneficial to 5073 consider. 5075 A.4. Low Watermark Settings 5077 Some operating system kernel TCP implementations include socket 5078 options that specify the minimum number of bytes that must be in the buffer before 5079 the socket layer passes sent data to TCP (SO_SNDLOWAT) or 5080 received data to the application (SO_RCVLOWAT). 5082 In addition, another socket option (TCP_NOTSENT_LOWAT) can be used to 5083 control the number of unsent bytes in the write queue. This can help 5084 a sending TCP application to avoid creating large amounts of buffered 5085 data (and corresponding latency). As an example, this may be useful 5086 for applications that are multiplexing data from multiple upper-level 5087 streams onto a connection, especially when streams may be a mix of 5088 interactive / real-time and bulk data transfer. 5090 Appendix B. TCP Requirement Summary 5092 This section is adapted from RFC 1122. 5094 Note that there is no requirement related to PLPMTUD in this list, 5095 but PLPMTUD is recommended.
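As an illustration of the socket options mentioned in Appendix A.3 and A.4, the following Python sketch attempts to disable the Nagle algorithm (TCP_NODELAY) and to set the low watermark options on a socket. It assumes a sockets API that exposes these option names; availability varies by operating system and Python build, so the sketch looks each name up at run time and records whether the local stack accepted it. The function name and the default values (16384 unsent bytes, watermarks of 1 byte) are hypothetical choices for this example, not recommendations.

```python
import socket


def configure_low_latency(sock, notsent_lowat=16384, snd_lowat=1, rcv_lowat=1):
    """Try each option on sock; return {option name: accepted (bool)}."""
    # (level, option name, value). Names that this platform or Python build
    # does not expose are reported as not applied rather than raising.
    wanted = [
        (socket.IPPROTO_TCP, "TCP_NODELAY", 1),        # disable Nagle (A.3)
        (socket.SOL_SOCKET, "SO_SNDLOWAT", snd_lowat),  # send low watermark (A.4)
        (socket.SOL_SOCKET, "SO_RCVLOWAT", rcv_lowat),  # receive low watermark (A.4)
        (socket.IPPROTO_TCP, "TCP_NOTSENT_LOWAT", notsent_lowat),  # unsent bytes (A.4)
    ]
    applied = {}
    for level, name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            applied[name] = False
            continue
        try:
            sock.setsockopt(level, option, value)
            applied[name] = True
        except OSError:
            # Some stacks define the option but refuse to change it.
            applied[name] = False
    return applied
```

A caller would typically create a TCP socket, call configure_low_latency(sock) before connecting, and inspect the returned dictionary to decide whether it needs to limit buffering itself, for example when multiplexing interactive and bulk streams onto one connection.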
5097 | | | | |S| | 5098 | | | | |H| |F 5099 | | | | |O|M|o 5100 | | |S| |U|U|o 5101 | | |H| |L|S|t 5102 | |M|O| |D|T|n 5103 | |U|U|M| | |o 5104 | |S|L|A|N|N|t 5105 | |T|D|Y|O|O|t 5106 FEATURE | ReqID | | | |T|T|e 5107 -------------------------------------------------|--------|-|-|-|-|-|-- 5108 | | | | | | | 5109 Push flag | | | | | | | 5110 Aggregate or queue un-pushed data | MAY-16 | | |x| | | 5111 Sender collapse successive PSH flags | SHLD-27| |x| | | | 5112 SEND call can specify PUSH | MAY-15 | | |x| | | 5113 If cannot: sender buffer indefinitely | MUST-60| | | | |x| 5114 If cannot: PSH last segment | MUST-61|x| | | | | 5115 Notify receiving ALP of PSH | MAY-17 | | |x| | |1 5116 Send max size segment when possible | SHLD-28| |x| | | | 5117 | | | | | | | 5118 Window | | | | | | | 5119 Treat as unsigned number | MUST-1 |x| | | | | 5120 Handle as 32-bit number | REC-1 | |x| | | | 5121 Shrink window from right | SHLD-14| | | |x| | 5122 - Send new data when window shrinks | SHLD-15| | | |x| | 5123 - Retransmit old unacked data within window | SHLD-16| |x| | | | 5124 - Time out conn for data past right edge | SHLD-17| | | |x| | 5125 Robust against shrinking window | MUST-34|x| | | | | 5126 Receiver's window closed indefinitely | MAY-8 | | |x| | | 5127 Use standard probing logic | MUST-35|x| | | | | 5128 Sender probe zero window | MUST-36|x| | | | | 5129 First probe after RTO | SHLD-29| |x| | | | 5130 Exponential backoff | SHLD-30| |x| | | | 5131 Allow window stay zero indefinitely | MUST-37|x| | | | | 5132 Retransmit old data beyond SND.UNA+SND.WND | MAY-7 | | |x| | | 5133 Process RST and URG even with zero window | MUST-66|x| | | | | 5134 | | | | | | | 5135 Urgent Data | | | | | | | 5136 Include support for urgent pointer | MUST-30|x| | | | | 5137 Pointer indicates first non-urgent octet | MUST-62|x| | | | | 5138 Arbitrary length urgent data sequence | MUST-31|x| | | | | 5139 Inform ALP asynchronously of urgent data | MUST-32|x| | | | |1 5140 ALP can learn if/how 
much urgent data Q'd | MUST-33|x| | | | |1 5141 ALP employ the urgent mechanism | SHLD-13| | | |x| | 5142 | | | | | | | 5143 TCP Options | | | | | | | 5144 Support the mandatory option set | MUST-4 |x| | | | | 5145 Receive TCP option in any segment | MUST-5 |x| | | | | 5146 Ignore unsupported options | MUST-6 |x| | | | | 5147 Include length for all options except EOL+NOP | MUST-68|x| | | | | 5148 Cope with illegal option length | MUST-7 |x| | | | | 5149 Process options regardless of word alignment | MUST-64|x| | | | | 5150 Implement sending & receiving MSS option | MUST-14|x| | | | | 5151 IPv4 Send MSS option unless 536 | SHLD-5 | |x| | | | 5152 IPv6 Send MSS option unless 1220 | SHLD-5 | |x| | | | 5153 Send MSS option always | MAY-3 | | |x| | | 5154 IPv4 Send-MSS default is 536 | MUST-15|x| | | | | 5155 IPv6 Send-MSS default is 1220 | MUST-15|x| | | | | 5156 Calculate effective send seg size | MUST-16|x| | | | | 5157 MSS accounts for varying MTU | SHLD-6 | |x| | | | 5158 MSS not sent on non-SYN segments | MUST-65| | | | |x| 5159 MSS value based on MMS_R | MUST-67|x| | | | | 5160 Pad with zero | MUST-69|x| | | | | 5161 | | | | | | | 5162 TCP Checksums | | | | | | | 5163 Sender compute checksum | MUST-2 |x| | | | | 5164 Receiver check checksum | MUST-3 |x| | | | | 5165 | | | | | | | 5166 ISN Selection | | | | | | | 5167 Include a clock-driven ISN generator component | MUST-8 |x| | | | | 5168 Secure ISN generator with a PRF component | SHLD-1 | |x| | | | 5169 PRF computable from outside the host | MUST-9 | | | | |x| 5170 | | | | | | | 5171 Opening Connections | | | | | | | 5172 Support simultaneous open attempts | MUST-10|x| | | | | 5173 SYN-RECEIVED remembers last state | MUST-11|x| | | | | 5174 Passive Open call interfere with others | MUST-41| | | | |x| 5175 Function: simultan. LISTENs for same port | MUST-42|x| | | | | 5176 Ask IP for src address for SYN if necc. | MUST-44|x| | | | | 5177 Otherwise, use local addr of conn. 
| MUST-45|x| | | | | 5178 OPEN to broadcast/multicast IP Address | MUST-46| | | | |x| 5179 Silently discard seg to bcast/mcast addr | MUST-57|x| | | | | 5180 | | | | | | | 5181 Closing Connections | | | | | | | 5182 RST can contain data | SHLD-2 | |x| | | | 5183 Inform application of aborted conn | MUST-12|x| | | | | 5184 Half-duplex close connections | MAY-1 | | |x| | | 5185 Send RST to indicate data lost | SHLD-3 | |x| | | | 5186 In TIME-WAIT state for 2MSL seconds | MUST-13|x| | | | | 5187 Accept SYN from TIME-WAIT state | MAY-2 | | |x| | | 5188 Use Timestamps to reduce TIME-WAIT | SHLD-4 | |x| | | | 5189 | | | | | | | 5190 Retransmissions | | | | | | | 5191 Implement exponential backoff, slow start, and | MUST-19|x| | | | | 5192 congestion avoidance | | | | | | | 5193 Retransmit with same IP ident | MAY-4 | | |x| | | 5194 Karn's algorithm | MUST-18|x| | | | | 5195 | | | | | | | 5196 Generating ACKs: | | | | | | | 5197 Aggregate whenever possible | MUST-58|x| | | | | 5198 Queue out-of-order segments | SHLD-31| |x| | | | 5199 Process all Q'd before send ACK | MUST-59|x| | | | | 5200 Send ACK for out-of-order segment | MAY-13 | | |x| | | 5201 Delayed ACKs | SHLD-18| |x| | | | 5202 Delay < 0.5 seconds | MUST-40|x| | | | | 5203 Every 2nd full-sized segment or 2*RMSS ACK'd | SHLD-19| |x| | | | 5204 Receiver SWS-Avoidance Algorithm | MUST-39|x| | | | | 5205 | | | | | | | 5206 Sending data | | | | | | | 5207 Configurable TTL | MUST-49|x| | | | | 5208 Sender SWS-Avoidance Algorithm | MUST-38|x| | | | | 5209 Nagle algorithm | SHLD-7 | |x| | | | 5210 Application can disable Nagle algorithm | MUST-17|x| | | | | 5211 | | | | | | | 5213 Connection Failures: | | | | | | | 5214 Negative advice to IP on R1 retxs | MUST-20|x| | | | | 5215 Close connection on R2 retxs | MUST-20|x| | | | | 5216 ALP can set R2 | MUST-21|x| | | | |1 5217 Inform ALP of R1<=retxs inform ALP | SHLD-25| |x| | | | 5245 Abort on Dest. Unreach (0,1,5) =>nn | MUST-56| | | | |x| 5246 Dest. 
Unreach (2-4) => abort conn | SHLD-26| |x| | | | 5247 Source Quench => silent discard | MUST-55|x| | | | | 5248 Abort on Time Exceeded => | MUST-56| | | | |x| 5249 Abort on Param Problem => | MUST-56| | | | |x| 5250 | | | | | | | 5251 Address Validation | | | | | | | 5252 Reject OPEN call to invalid IP address | MUST-46|x| | | | | 5253 Reject SYN from invalid IP address | MUST-63|x| | | | | 5254 Silently discard SYN to bcast/mcast addr | MUST-57|x| | | | | 5255 | | | | | | | 5256 TCP/ALP Interface Services | | | | | | | 5257 Error Report mechanism | MUST-47|x| | | | | 5258 ALP can disable Error Report Routine | SHLD-20| |x| | | | 5259 ALP can specify DiffServ field for sending | MUST-48|x| | | | | 5260 Passed unchanged to IP | SHLD-22| |x| | | | 5262 ALP can change DiffServ field during connection| SHLD-21| |x| | | | 5263 ALP generally changing DiffServ during conn. | SHLD-23| | | |x| | 5264 Pass received DiffServ field up to ALP | MAY-9 | | |x| | | 5265 FLUSH call | MAY-14 | | |x| | | 5266 Optional local IP addr parm. in OPEN | MUST-43|x| | | | | 5267 | | | | | | | 5268 RFC 5961 Support: | | | | | | | 5269 Implement data injection protection | MAY-12 | | |x| | | 5270 | | | | | | | 5271 Explicit Congestion Notification: | | | | | | | 5272 Support ECN | SHLD-8 | |x| | | | 5273 | | | | | | | 5274 Alternative Congestion Control: | | | | | | | 5275 Implement alternative conformant algorithm(s) | MAY-18 | | |x| | | 5276 -------------------------------------------------|--------|-|-|-|-|-|- 5278 FOOTNOTES: (1) "ALP" means Application-Layer Program. 5280 Author's Address 5282 Wesley M. Eddy (editor) 5283 MTI Systems 5284 United States of America 5285 Email: wes@mti-systems.com