Internet Engineering Task Force                             W. Eddy, Ed.
Internet-Draft                                               MTI Systems
Obsoletes: 793, 879, 6093, 6429, 6528,                    March 28, 2017
           6691 (if approved)
Updates: 1122 (if approved)
Intended status: Standards Track
Expires: September 29, 2017


              Transmission Control Protocol Specification
                      draft-ietf-tcpm-rfc793bis-05

Abstract

   This document specifies the Internet's Transmission Control Protocol
   (TCP). TCP is an important transport layer protocol in the Internet
   stack, and has continuously evolved over decades of use and growth of
   the Internet. Over this time, a number of changes have been made to
   TCP as it was specified in RFC 793, though these have only been
   documented in a piecemeal fashion. This document collects and brings
   those changes together with the protocol specification from RFC 793.
   This document obsoletes RFC 793 and several other RFCs (TODO: list
   all actual RFCs when finished).

   RFC EDITOR NOTE: If approved for publication as an RFC, this should
   be marked additionally as "STD: 7" and replace RFC 793 in that role.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [4].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 29, 2017.
Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Purpose and Scope
   2.  Introduction
   3.  Functional Specification
     3.1.  Header Format
     3.2.  Terminology
     3.3.  Sequence Numbers
     3.4.  Establishing a connection
       3.4.1.  Remote Address Validation
     3.5.  Closing a Connection
       3.5.1.  Half-Closed Connections
     3.6.  Precedence and Security
     3.7.  Segmentation
       3.7.1.  Maximum Segment Size Option
       3.7.2.  Path MTU Discovery
       3.7.3.  Interfaces with Variable MTU Values
       3.7.4.  Nagle Algorithm
       3.7.5.  IPv6 Jumbograms
     3.8.  Data Communication
       3.8.1.  Retransmission Timeout
       3.8.2.  TCP Congestion Control
       3.8.3.  TCP Connection Failures
       3.8.4.  TCP Keep-Alives
       3.8.5.  The Communication of Urgent Information
       3.8.6.  Managing the Window
     3.9.  Interfaces
       3.9.1.  User/TCP Interface
       3.9.2.  TCP/Lower-Level Interface
     3.10. Event Processing
     3.11. Glossary
   4.  Changes from RFC 793
   5.  IANA Considerations
   6.  Security and Privacy Considerations
   7.  Acknowledgements
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  TCP Requirement Summary
   Author's Address

1.  Purpose and Scope

   In 1981, RFC 793 [10] was released, documenting the Transmission
   Control Protocol (TCP) and replacing earlier published specifications
   for TCP.

   Since then, TCP has been implemented many times, and has been used as
   a transport protocol for numerous applications on the Internet.

   For several decades, RFC 793 plus a number of other documents have
   combined to serve as the specification for TCP [23]. Over time, a
   number of errata have been identified in RFC 793, as well as
   deficiencies in security, performance, and other aspects. A number
   of enhancements have been developed and documented separately. These
   were never accumulated together into an update to the base
   specification.

   The purpose of this document is to bring together all of the IETF
   Standards Track changes that have been made to the basic TCP
   functional specification and unify them into an update of the RFC 793
   protocol specification. Some companion documents are referenced for
   important algorithms that TCP uses (e.g., for congestion control),
   but no attempt has been made to include them in this document. This
   is a conscious choice, as this base specification can be used with
   multiple additional algorithms that are developed and incorporated
   separately, but all TCP implementations need to implement this
   specification as a common basis in order to interoperate. As some
   additional TCP features have become quite complicated themselves
   (e.g., advanced loss recovery and congestion control), future
   companion documents may attempt to similarly bring these together.
   In addition to the protocol specification that describes the TCP
   segment format, generation, and processing rules that are to be
   implemented in code, RFC 793 and other updates also contain
   informative and descriptive text for human readers to understand
   aspects of the protocol design and operation. This document does not
   attempt to alter or update this informative text, and is focused only
   on updating the normative protocol specification. We preserve
   references to the documentation containing the important explanations
   and rationale, where appropriate.

   This document is intended to be useful both in checking existing TCP
   implementations for conformance and in writing new implementations.

2.  Introduction

   RFC 793 contains a discussion of the TCP design goals and provides
   examples of its operation, including examples of connection
   establishment, closing connections, and retransmitting packets to
   repair losses.

   This document describes the basic functionality expected in modern
   implementations of TCP, and replaces the protocol specification in
   RFC 793. It does not replicate or attempt to update the examples and
   other discussion in RFC 793. Other documents are referenced to
   provide explanation of the theory of operation, rationale, and
   detailed discussion of design decisions. This document focuses only
   on the normative behavior of the protocol.

   The "TCP Roadmap" [23] provides a more extensive guide to the RFCs
   that define TCP and describe various important algorithms. The TCP
   Roadmap contains sections on strongly encouraged enhancements that
   improve performance and other aspects of TCP beyond the basic
   operation specified in this document. As one example, implementing
   congestion control (e.g., [16]) is a TCP requirement, but it is a
   complex topic on its own and is not described in detail in this
   document, as there are many options and possibilities that do not
   impact basic interoperability. Similarly, most common TCP
   implementations today include the high-performance extensions in
   [22], but these are not strictly required or discussed in this
   document.

   TEMPORARY EDITOR'S NOTE: This is an early revision in the process of
   updating RFC 793. Many planned changes are not yet incorporated.

   ***Please do not use this revision as a basis for any work or
   reference.***

   A list of changes from RFC 793 is contained in Section 4.

   TEMPORARY EDITOR'S NOTE: the current revision of this document does
   not yet collect all of the changes that will be in the final version.
   The set of content changes planned for future revisions is kept in
   Section 4.

3.  Functional Specification

3.1.  Header Format

   TCP segments are sent as internet datagrams. The Internet Protocol
   header carries several information fields, including the source and
   destination host addresses [1]. A TCP header follows the internet
   header, supplying information specific to the TCP protocol. This
   division allows for the existence of host level protocols other than
   TCP. (Editorial TODO - this last sentence makes sense in 793
   context, but may be a candidate to remove here? ... additionally,
   Section 2 of 793 is not included here, but some parts may be useful,
   to quickly define basic concepts of ports, bytestream service, etc.
   at high-level before delving into protocol details?)
                           TCP Header Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |       |C|E|U|A|P|R|S|F|                               |
   | Offset| Rsrvd |W|C|R|C|S|S|Y|I|            Window             |
   |       |       |R|E|G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                           TCP Header Format

          Note that one tick mark represents one bit position.

                               Figure 1

   Source Port: 16 bits

     The source port number.

   Destination Port: 16 bits

     The destination port number.

   Sequence Number: 32 bits

     The sequence number of the first data octet in this segment (except
     when SYN is present). If SYN is present, the sequence number is the
     initial sequence number (ISN) and the first data octet is ISN+1.

   Acknowledgment Number: 32 bits

     If the ACK control bit is set, this field contains the value of the
     next sequence number the sender of the segment is expecting to
     receive. Once a connection is established, this is always sent.

   Data Offset: 4 bits

     The number of 32 bit words in the TCP Header. This indicates where
     the data begins. The TCP header (even one including options) is an
     integral number of 32 bits long.

   Rsrvd - Reserved: 4 bits

     Reserved for future use. Must be zero.
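   As an illustration of Figure 1, the fixed 20-octet part of the header
   can be decoded with Python's struct module. This is a minimal sketch,
   not part of the specification: the class and function names are
   hypothetical, the buffer is assumed to hold the whole header
   (including any options), and a Data Offset below 5 or pointing past
   the end of the buffer is rejected. All fields are unpacked as
   unsigned integers in network byte order, so the Window value is never
   interpreted as negative.

```python
import struct
from dataclasses import dataclass

# Control bits of Figure 1, most significant bit of the flags octet first.
FLAG_NAMES = ("CWR", "ECE", "URG", "ACK", "PSH", "RST", "SYN", "FIN")


@dataclass
class TcpHeader:  # hypothetical container, not defined by this document
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int  # header length in 32-bit words
    reserved: int     # must be zero
    flags: frozenset
    window: int       # unsigned 16-bit value
    checksum: int
    urgent_ptr: int
    options: bytes = b""


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the header fields of Figure 1 (hypothetical helper)."""
    if len(segment) < 20:
        raise ValueError("shorter than the 20-octet fixed header")
    # "!" = network byte order; H, I, B are unsigned 16/32/8-bit fields.
    (src, dst, seq, ack, off_rsvd, flag_bits,
     window, checksum, urg) = struct.unpack("!HHIIBBHHH", segment[:20])
    data_offset = off_rsvd >> 4   # upper 4 bits of octet 12
    header_len = data_offset * 4  # Data Offset counts 32-bit words
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid Data Offset")
    flags = frozenset(name for i, name in enumerate(FLAG_NAMES)
                      if flag_bits & (0x80 >> i))
    return TcpHeader(src, dst, seq, ack, data_offset, off_rsvd & 0x0F,
                     flags, window, checksum, urg, segment[20:header_len])


# Example: a bare SYN (Data Offset 5, no options) advertising a 65535 window.
syn = struct.pack("!HHIIBBHHH", 49152, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
assert hdr.flags == {"SYN"} and hdr.window == 65535
```

   Returning the option octets unparsed keeps header decoding separate
   from option processing, which follows its own rules.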
   Control Bits: 8 bits (from left to right):

     CWR:  Congestion Window Reduced (see [7])
     ECE:  ECN-Echo (see [7])
     URG:  Urgent Pointer field significant
     ACK:  Acknowledgment field significant
     PSH:  Push Function
     RST:  Reset the connection
     SYN:  Synchronize sequence numbers
     FIN:  No more data from sender

   Window: 16 bits

     The number of data octets beginning with the one indicated in the
     acknowledgment field which the sender of this segment is willing to
     accept.

     The window size MUST be treated as an unsigned number, or else
     large window sizes will appear like negative windows and TCP will
     not work. It is RECOMMENDED that implementations reserve 32-bit
     fields for the send and receive window sizes in the connection
     record and do all window computations with 32 bits.

   Checksum: 16 bits

     The checksum field is the 16 bit one's complement of the one's
     complement sum of all 16 bit words in the header and text. If a
     segment contains an odd number of header and text octets to be
     checksummed, the last octet is padded on the right with zeros to
     form a 16 bit word for checksum purposes. The pad is not
     transmitted as part of the segment. While computing the checksum,
     the checksum field itself is replaced with zeros.

     The checksum also covers a 96 bit pseudo header conceptually
     prefixed to the TCP header. This pseudo header contains the Source
     Address, the Destination Address, the Protocol, and TCP length.
     This gives the TCP protection against misrouted segments. This
     information is carried in the Internet Protocol and is transferred
     across the TCP/Network interface in the arguments or results of
     calls by the TCP on the IP.
     (TODO: this is IPv4-specific, need to mention IPv6 pseudo header as
     well)

                     +--------+--------+--------+--------+
                     |           Source Address          |
                     +--------+--------+--------+--------+
                     |         Destination Address       |
                     +--------+--------+--------+--------+
                     |  zero  |  PTCL  |    TCP Length   |
                     +--------+--------+--------+--------+

     The TCP Length is the TCP header length plus the data length in
     octets (this is not an explicitly transmitted quantity, but is
     computed), and it does not count the 12 octets of the pseudo
     header.

     The TCP checksum is never optional. The sender MUST generate it
     and the receiver MUST check it.

   Urgent Pointer: 16 bits

     This field communicates the current value of the urgent pointer as
     a positive offset from the sequence number in this segment. The
     urgent pointer points to the sequence number of the octet following
     the urgent data. This field is only to be interpreted in segments
     with the URG control bit set.

   Options: variable

     Options may occupy space at the end of the TCP header and are a
     multiple of 8 bits in length. All options are included in the
     checksum. An option may begin on any octet boundary. There are
     two cases for the format of an option:

        Case 1: A single octet of option-kind.

        Case 2: An octet of option-kind, an octet of option-length, and
        the actual option-data octets.

     The option-length counts the two octets of option-kind and
     option-length as well as the option-data octets.

     Note that the list of options may be shorter than the data offset
     field might imply. The content of the header beyond the
     End-of-Option option must be header padding (i.e., zero).

     Currently defined options include (kind indicated in octal):

        Kind     Length    Meaning
        ----     ------    -------
         0         -       End of option list.
         1         -       No-Operation.
         2         4       Maximum Segment Size.
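   The checksum rules given for the Checksum field can be sketched for
   the IPv4 pseudo header shown above (IPv6 is not covered). This is an
   illustrative sketch under stated assumptions: the function names are
   hypothetical, the segment passed in is the complete TCP header plus
   text, and its checksum field must be zero when a checksum is being
   generated. Applying the same function to a received segment with its
   checksum in place yields zero when the checksum is correct. The
   addresses in the example come from the IPv4 documentation ranges of
   RFC 5737.

```python
import ipaddress
import struct

IPPROTO_TCP = 6  # the PTCL value placed in the pseudo header


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum of data, with end-around carry."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"  # pad an odd last octet on the right
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def tcp_checksum_ipv4(src: str, dst: str, segment: bytes) -> int:
    """One's complement of the sum over pseudo header and segment."""
    pseudo = (ipaddress.IPv4Address(src).packed
              + ipaddress.IPv4Address(dst).packed
              + struct.pack("!BBH", 0, IPPROTO_TCP, len(segment)))
    return ~ones_complement_sum(pseudo + bytes(segment)) & 0xFFFF


# Example: fill in the checksum of a header-only SYN, then verify it.
seg = bytearray(struct.pack("!HHIIBBHHH", 49152, 80, 1, 0, 5 << 4, 0x02,
                            65535, 0, 0))
csum = tcp_checksum_ipv4("192.0.2.1", "198.51.100.2", bytes(seg))
struct.pack_into("!H", seg, 16, csum)  # the checksum occupies octets 16-17
assert tcp_checksum_ipv4("192.0.2.1", "198.51.100.2", bytes(seg)) == 0
```

   The pad octet is only appended inside the computation, so it is never
   part of the transmitted segment, as the text requires.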
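   The two option formats and the kinds listed above can be walked with a
   simple loop. In this minimal sketch (the function name is
   hypothetical), every option that carries a length field is returned
   as a (kind, option-data) pair, leaving it to the caller to ignore
   kinds it does not implement. An option-length below 2, or one that
   runs past the end of the option space, raises an exception; a TCP
   might instead apply the reset-and-log procedure suggested for illegal
   option lengths.

```python
from typing import List, Tuple


def parse_tcp_options(options: bytes) -> List[Tuple[int, bytes]]:
    """Return (kind, option-data) pairs from a TCP option list."""
    result = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:    # End of option list: the rest is padding
            break
        if kind == 1:    # No-Operation: one octet, no length field
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError(f"option kind {kind} is missing its length")
        length = options[i + 1]
        # option-length includes the kind and length octets themselves,
        # so values below 2 (e.g., zero) are illegal.
        if length < 2 or i + length > len(options):
            raise ValueError(f"illegal length {length} for kind {kind}")
        result.append((kind, bytes(options[i + 2:i + length])))
        i += length
    return result


# Example: two NOPs, an MSS option carrying 1460, then End of option list.
parsed = parse_tcp_options(bytes([1, 1, 2, 4, 0x05, 0xB4, 0, 0]))
assert parsed == [(2, b"\x05\xb4")]
assert int.from_bytes(parsed[0][1], "big") == 1460
```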
     TODO - I think we may need to include designated experimental
     options and RFC 6994 reference here

     A TCP MUST be able to receive a TCP option in any segment. A TCP
     MUST ignore without error any TCP option it does not implement,
     assuming that the option has a length field (all TCP options
     except End of option list and No-Operation have length fields).
     TCP MUST be prepared to handle an illegal option length (e.g.,
     zero) without crashing; a suggested procedure is to reset the
     connection and log the reason.

     Specific Option Definitions

       End of Option List

         +--------+
         |00000000|
         +--------+
          Kind=0

         This option code indicates the end of the option list. This
         might not coincide with the end of the TCP header according to
         the Data Offset field. This is used at the end of all options,
         not the end of each option, and need only be used if the end
         of the options would not otherwise coincide with the end of the
         TCP header.

       No-Operation

         +--------+
         |00000001|
         +--------+
          Kind=1

         This option code may be used between options, for example, to
         align the beginning of a subsequent option on a word boundary.
         There is no guarantee that senders will use this option, so
         receivers must be prepared to process options even if they do
         not begin on a word boundary.

       Maximum Segment Size (MSS)

         +--------+--------+---------+--------+
         |00000010|00000100|   max seg size   |
         +--------+--------+---------+--------+
          Kind=2   Length=4

         Maximum Segment Size Option Data: 16 bits

           If this option is present, then it communicates the maximum
           receive segment size at the TCP which sends this segment.
           This field may be sent in the initial connection request
           (i.e., in segments with the SYN control bit set) and must not
           be sent in other segments. If this option is not used, any
           segment size is allowed. A more complete description of this
           option is in
A more complete description of this option is in 422 Section 3.7.1. 424 Padding: variable 426 The TCP header padding is used to ensure that the TCP header ends 427 and data begins on a 32 bit boundary. The padding is composed of 428 zeros. 430 3.2. Terminology 432 Before we can discuss very much about the operation of the TCP we 433 need to introduce some detailed terminology. The maintenance of a 434 TCP connection requires the remembering of several variables. We 435 conceive of these variables being stored in a connection record 436 called a Transmission Control Block or TCB. Among the variables 437 stored in the TCB are the local and remote socket numbers, the 438 security and precedence of the connection, pointers to the user's 439 send and receive buffers, pointers to the retransmit queue and to the 440 current segment. In addition several variables relating to the send 441 and receive sequence numbers are stored in the TCB. 443 Send Sequence Variables 445 SND.UNA - send unacknowledged 446 SND.NXT - send next 447 SND.WND - send window 448 SND.UP - send urgent pointer 449 SND.WL1 - segment sequence number used for last window update 450 SND.WL2 - segment acknowledgment number used for last window 451 update 452 ISS - initial send sequence number 454 Receive Sequence Variables 456 RCV.NXT - receive next 457 RCV.WND - receive window 458 RCV.UP - receive urgent pointer 459 IRS - initial receive sequence number 461 The following diagrams may help to relate some of these variables to 462 the sequence space. 
   Send Sequence Space

                   1         2          3          4
              ----------|----------|----------|----------
                     SND.UNA    SND.NXT    SND.UNA
                                          +SND.WND

        1 - old sequence numbers which have been acknowledged
        2 - sequence numbers of unacknowledged data
        3 - sequence numbers allowed for new data transmission
        4 - future sequence numbers which are not yet allowed

                          Send Sequence Space

                               Figure 2

   The send window is the portion of the sequence space labeled 3 in
   Figure 2.

   Receive Sequence Space

                       1          2          3
                   ----------|----------|----------
                          RCV.NXT    RCV.NXT
                                    +RCV.WND

        1 - old sequence numbers which have been acknowledged
        2 - sequence numbers allowed for new reception
        3 - future sequence numbers which are not yet allowed

                         Receive Sequence Space

                               Figure 3

   The receive window is the portion of the sequence space labeled 2 in
   Figure 3.

   There are also some variables used frequently in the discussion that
   take their values from the fields of the current segment.

      Current Segment Variables

        SEG.SEQ - segment sequence number
        SEG.ACK - segment acknowledgment number
        SEG.LEN - segment length
        SEG.WND - segment window
        SEG.UP  - segment urgent pointer
        SEG.PRC - segment precedence value

   A connection progresses through a series of states during its
   lifetime. The states are: LISTEN, SYN-SENT, SYN-RECEIVED,
   ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK,
   TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional
   because it represents the state when there is no TCB, and therefore,
   no connection. Briefly, the meanings of the states are:

      LISTEN - represents waiting for a connection request from any
      remote TCP and port.

      SYN-SENT - represents waiting for a matching connection request
      after having sent a connection request.
      SYN-RECEIVED - represents waiting for a confirming connection
      request acknowledgment after having both received and sent a
      connection request.

      ESTABLISHED - represents an open connection; data received can be
      delivered to the user. This is the normal state for the data
      transfer phase of the connection.

      FIN-WAIT-1 - represents waiting for a connection termination
      request from the remote TCP, or an acknowledgment of the
      connection termination request previously sent.

      FIN-WAIT-2 - represents waiting for a connection termination
      request from the remote TCP.

      CLOSE-WAIT - represents waiting for a connection termination
      request from the local user.

      CLOSING - represents waiting for a connection termination request
      acknowledgment from the remote TCP.

      LAST-ACK - represents waiting for an acknowledgment of the
      connection termination request previously sent to the remote TCP
      (this termination request sent to the remote TCP already included
      an acknowledgment of the termination request sent from the remote
      TCP).

      TIME-WAIT - represents waiting for enough time to pass to be sure
      the remote TCP received the acknowledgment of its connection
      termination request.

      CLOSED - represents no connection state at all.

   A TCP connection progresses from one state to another in response to
   events. The events are the user calls OPEN, SEND, RECEIVE, CLOSE,
   ABORT, and STATUS; the incoming segments, particularly those
   containing the SYN, ACK, RST, and FIN flags; and timeouts.

   The state diagram in Figure 4 illustrates only state changes,
   together with the causing events and resulting actions, but addresses
   neither error conditions nor actions that are not connected with
   state changes. In a later section, more detail is offered with
   respect to the reaction of the TCP to events.

   NOTA BENE: this diagram is only a summary and must not be taken as
   the total specification.
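   To make the summary concrete, the transitions drawn in Figure 4 can be
   encoded as a lookup table from (state, event) to (action, next
   state). This sketch is hypothetical and, like the figure, is not the
   total specification: event and action strings are copied from the
   figure's labels, "x" (no action) is written as an empty string, and
   error conditions and data transfer are omitted. The entry for note 2
   uses an invented event name for a FIN that arrives together with the
   acknowledgment of the local FIN.

```python
from typing import Dict, Tuple

# (state, event) -> (action, next state), following the labels in Figure 4.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("CLOSED", "passive OPEN"): ("create TCB", "LISTEN"),
    ("CLOSED", "active OPEN"): ("create TCB; snd SYN", "SYN-SENT"),
    ("LISTEN", "rcv SYN"): ("snd SYN,ACK", "SYN-RECEIVED"),
    ("LISTEN", "SEND"): ("snd SYN", "SYN-SENT"),
    ("LISTEN", "CLOSE"): ("delete TCB", "CLOSED"),
    ("SYN-SENT", "rcv SYN"): ("snd SYN,ACK", "SYN-RECEIVED"),
    ("SYN-SENT", "rcv SYN,ACK"): ("snd ACK", "ESTABLISHED"),
    ("SYN-SENT", "CLOSE"): ("delete TCB", "CLOSED"),
    ("SYN-RECEIVED", "rcv ACK of SYN"): ("", "ESTABLISHED"),
    ("SYN-RECEIVED", "rcv RST"): ("", "LISTEN"),  # note 1: passive OPEN only
    ("SYN-RECEIVED", "CLOSE"): ("snd FIN", "FIN-WAIT-1"),
    ("ESTABLISHED", "CLOSE"): ("snd FIN", "FIN-WAIT-1"),
    ("ESTABLISHED", "rcv FIN"): ("snd ACK", "CLOSE-WAIT"),
    ("FIN-WAIT-1", "rcv ACK of FIN"): ("", "FIN-WAIT-2"),
    ("FIN-WAIT-1", "rcv FIN"): ("snd ACK", "CLOSING"),
    ("FIN-WAIT-1", "rcv FIN + ACK of FIN"): ("snd ACK", "TIME-WAIT"),  # note 2
    ("FIN-WAIT-2", "rcv FIN"): ("snd ACK", "TIME-WAIT"),
    ("CLOSE-WAIT", "CLOSE"): ("snd FIN", "LAST-ACK"),
    ("CLOSING", "rcv ACK of FIN"): ("", "TIME-WAIT"),
    ("LAST-ACK", "rcv ACK of FIN"): ("", "CLOSED"),
    ("TIME-WAIT", "Timeout=2MSL"): ("delete TCB", "CLOSED"),
}


def step(state: str, event: str) -> Tuple[str, str]:
    """Return (action, next state); KeyError if the summary has no entry."""
    return TRANSITIONS[(state, event)]


# Example: an active open followed by a local close.
state = "CLOSED"
for event in ("active OPEN", "rcv SYN,ACK", "CLOSE", "rcv ACK of FIN",
              "rcv FIN", "Timeout=2MSL"):
    action, state = step(state, event)
assert state == "CLOSED"
```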
                               +---------+ ---------\      active OPEN
                               |  CLOSED |            \    -----------
                               +---------+<---------\   \   create TCB
                                 |     ^              \   \  snd SYN
                    passive OPEN |     |   CLOSE        \   \
                    ------------ |     | ----------       \   \
                     create TCB  |     | delete TCB         \   \
                                 V     |                      \   \
             rcv RST (note 1)  +---------+            CLOSE    |    \
          -------------------->|  LISTEN |          ---------- |     |
         /                     +---------+          delete TCB |     |
        /           rcv SYN      |     |     SEND              |     |
       /           -----------   |     |    -------            |     V
   +--------+      snd SYN,ACK  /       \   snd SYN          +--------+
   |        |<-----------------           ------------------>|        |
   |  SYN   |                    rcv SYN                     |  SYN   |
   |  RCVD  |<-----------------------------------------------|  SENT  |
   |        |                  snd SYN,ACK                   |        |
   |        |------------------           -------------------|        |
   +--------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +--------+
      |         --------------   |     |   -----------
      |                x         |     |     snd ACK
      |                          V     V
      |  CLOSE                 +---------+
      | -------                |  ESTAB  |
      | snd FIN                +---------+
      |                 CLOSE    |     |    rcv FIN
      V                -------   |     |    -------
   +---------+         snd FIN  /       \   snd ACK          +---------+
   |  FIN    |<----------------           ------------------>|  CLOSE  |
   | WAIT-1  |------------------                             |   WAIT  |
   +---------+         rcv FIN  \                            +---------+
     | rcv ACK of FIN  -------   |                            CLOSE  |
     | --------------  snd ACK   |                           ------- |
     V       x                   V                           snd FIN V
   +---------+                 +---------+                   +---------+
   |FINWAIT-2|                 | CLOSING |                   | LAST-ACK|
   +---------+                 +---------+                   +---------+
     |            rcv ACK of FIN |                    rcv ACK of FIN |
     |  rcv FIN   -------------- |      Timeout=2MSL  -------------- |
     |  -------          x       V      ------------         x       V
      \ snd ACK                +---------+delete TCB         +---------+
        ---------------------->|TIME WAIT|------------------>| CLOSED  |
                               +---------+                   +---------+

   note 1: The transition from SYN-RCVD to LISTEN on receiving a RST is
   conditional on having reached SYN-RCVD after a passive open.

   note 2: An unshown transition exists from FIN-WAIT-1 to TIME-WAIT if
   a FIN is received and the local FIN is also acknowledged.

                      TCP Connection State Diagram

                               Figure 4

3.3.  Sequence Numbers

   A fundamental notion in the design is that every octet of data sent
   over a TCP connection has a sequence number. Since every octet is
   sequenced, each of them can be acknowledged. The acknowledgment
   mechanism employed is cumulative so that an acknowledgment of
   sequence number X indicates that all octets up to but not including X
   have been received. This mechanism allows for straightforward
   duplicate detection in the presence of retransmission. Octets within
   a segment are numbered so that the first data octet immediately
   following the header is the lowest numbered, and the following octets
   are numbered consecutively.

   It is essential to remember that the actual sequence number space is
   finite, though very large. This space ranges from 0 to 2**32 - 1.
   Since the space is finite, all arithmetic dealing with sequence
   numbers must be performed modulo 2**32. This unsigned arithmetic
   preserves the relationship of sequence numbers as they cycle from
   2**32 - 1 to 0 again. There are some subtleties to computer modulo
   arithmetic, so great care should be taken in programming the
   comparison of such values. The symbol "=<" means "less than or
   equal" (modulo 2**32).

   The typical kinds of sequence number comparisons which the TCP must
   perform include:

      (a) Determining that an acknowledgment refers to some sequence
      number sent but not yet acknowledged.

      (b) Determining that all sequence numbers occupied by a segment
      have been acknowledged (e.g., to remove the segment from a
      retransmission queue).

      (c) Determining that an incoming segment contains sequence numbers
      which are expected (i.e., that the segment "overlaps" the receive
      window).

   In response to sending data, the TCP will receive acknowledgments.
   The following comparisons are needed to process the acknowledgments.
      SND.UNA = oldest unacknowledged sequence number

      SND.NXT = next sequence number to be sent

      SEG.ACK = acknowledgment from the receiving TCP (next sequence
                number expected by the receiving TCP)

      SEG.SEQ = first sequence number of a segment

      SEG.LEN = the number of octets occupied by the data in the segment
                (counting SYN and FIN)

      SEG.SEQ+SEG.LEN-1 = last sequence number of a segment

   A new acknowledgment (called an "acceptable ack") is one for which
   the inequality below holds:

      SND.UNA < SEG.ACK =< SND.NXT

   A segment on the retransmission queue is fully acknowledged if the
   sum of its sequence number and length is less than or equal to the
   acknowledgment value in the incoming segment.

   When data is received, the following comparisons are needed:

      RCV.NXT = next sequence number expected on an incoming segment,
                and is the left or lower edge of the receive window

      RCV.NXT+RCV.WND-1 = last sequence number expected on an incoming
                          segment, and is the right or upper edge of the
                          receive window

      SEG.SEQ = first sequence number occupied by the incoming segment

      SEG.SEQ+SEG.LEN-1 = last sequence number occupied by the incoming
                          segment

   A segment is judged to occupy a portion of valid receive sequence
   space if

      RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

   or

      RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

   The first part of this test checks to see if the beginning of the
   segment falls in the window, the second part of the test checks to
   see if the end of the segment falls in the window; if the segment
   passes either part of the test, it contains data in the window.

   Actually, it is a little more complicated than this.
Due to zero windows and zero-length segments, we have four cases for the acceptability of an incoming segment:

   Segment  Receive  Test
   Length   Window
   -------  -------  -------------------------------------------

      0        0     SEG.SEQ = RCV.NXT

      0       >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

     >0        0     not acceptable

     >0       >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
                     or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

Note that when the receive window is zero, no segments should be acceptable except ACK segments. Thus, it is possible for a TCP to maintain a zero receive window while transmitting data and receiving ACKs. However, even when the receive window is zero, a TCP must process the RST and URG fields of all incoming segments.

We have taken advantage of the numbering scheme to protect certain control information as well. This is achieved by implicitly including some control flags in the sequence space so they can be retransmitted and acknowledged without confusion (i.e., one and only one copy of the control will be acted upon). Control information is not physically carried in the segment data space. Consequently, we must adopt rules for implicitly assigning sequence numbers to control. The SYN and FIN are the only controls requiring this protection, and these controls are used only at connection opening and closing. For sequence number purposes, the SYN is considered to occur before the first actual data octet of the segment in which it occurs, while the FIN is considered to occur after the last actual data octet in a segment in which it occurs. The segment length (SEG.LEN) includes both data and sequence-space-occupying controls. When a SYN is present, SEG.SEQ is the sequence number of the SYN.

Initial Sequence Number Selection

The protocol places no restriction on a particular connection being used over and over again.
A connection is defined by a pair of sockets. New instances of a connection will be referred to as incarnations of the connection. The problem that arises from this is: "How does the TCP identify duplicate segments from previous incarnations of the connection?" This problem becomes apparent if the connection is being opened and closed in quick succession, or if the connection breaks with loss of memory and is then reestablished.

To avoid confusion, we must prevent segments from one incarnation of a connection from being used while the same sequence numbers may still be present in the network from an earlier incarnation. We want to assure this even if a TCP crashes and loses all knowledge of the sequence numbers it has been using. When new connections are created, an initial sequence number (ISN) generator is employed that selects a new 32-bit ISN. There are security issues that result if an off-path attacker is able to predict or guess ISN values.

The recommended ISN generator is based on the combination of a (possibly fictitious) 32-bit clock whose low-order bit is incremented roughly every 4 microseconds, and a pseudorandom function (PRF). The clock component is intended to ensure that, within a Maximum Segment Lifetime (MSL), generated ISNs will be unique, since it cycles approximately every 4.55 hours, which is much longer than the MSL. This recommended algorithm is further described in RFC 1948 and builds on the basic clock-driven algorithm from RFC 793.
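As a rough illustration of this clock-plus-PRF construction, written in the form of the expression given in the next paragraph, the following Python sketch adds a 4-microsecond counter to a keyed hash of the connection's addresses and ports. Several choices here are illustrative assumptions and are not specified by this document: `time.monotonic_ns()` as the clock source, HMAC-SHA-256 truncated to 32 bits as F(), and the hypothetical names `generate_isn` and `SECRET`. The document leaves hash selection and key management to Section 3 of [20].

```python
import hashlib
import hmac
import ipaddress
import os
import struct
import time

MOD = 1 << 32  # sequence numbers live in [0, 2**32)

# Hypothetical secret key; a real stack would manage this per Section 3 of [20].
SECRET = os.urandom(32)


def clock_4us(now_ns=None):
    """M: a 32-bit counter whose low-order bit advances every 4 microseconds."""
    if now_ns is None:
        now_ns = time.monotonic_ns()  # assumption: any monotonic clock will do here
    return (now_ns // 4000) % MOD


def prf(local_ip, local_port, remote_ip, remote_port, secret):
    """F(): keyed hash of the connection identifiers, truncated to 32 bits.

    HMAC-SHA-256 is an illustrative choice, not one mandated by the text.
    """
    msg = (ipaddress.ip_address(local_ip).packed + struct.pack("!H", local_port)
           + ipaddress.ip_address(remote_ip).packed + struct.pack("!H", remote_port))
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def generate_isn(local_ip, local_port, remote_ip, remote_port,
                 secret=SECRET, now_ns=None):
    """ISN = M + F(localip, localport, remoteip, remoteport, secretkey), mod 2**32."""
    m = clock_4us(now_ns)
    f = prf(local_ip, local_port, remote_ip, remote_port, secret)
    return (m + f) % MOD


isn = generate_isn("192.0.2.1", 49152, "198.51.100.7", 80)
```

For a fixed set of connection identifiers and a fixed key, F() is constant. Successive ISNs for that connection therefore still advance with the clock. The secret offset is intended to prevent an observer of ISNs on other connections from deriving this one.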
A TCP MUST use a clock-driven selection of initial sequence numbers, and SHOULD generate its Initial Sequence Numbers with the expression:

   ISN = M + F(localip, localport, remoteip, remoteport, secretkey)

where M is the 4-microsecond timer, and F() is a pseudorandom function (PRF) of the connection's identifying parameters ("localip, localport, remoteip, remoteport") and a secret key ("secretkey"). F() MUST NOT be computable from the outside, or an attacker could still guess at sequence numbers from the ISN used for some other connection. The PRF could be implemented as a cryptographic hash of the concatenation of the TCP connection parameters and some secret data. For discussion of the selection of a specific hash algorithm and management of the secret key data, please see Section 3 of [20].

For each connection there is a send sequence number and a receive sequence number. The initial send sequence number (ISS) is chosen by the data sending TCP, and the initial receive sequence number (IRS) is learned during the connection establishing procedure.

For a connection to be established or initialized, the two TCPs must synchronize on each other's initial sequence numbers. This is done in an exchange of connection establishing segments carrying a control bit called "SYN" (for synchronize) and the initial sequence numbers. As a shorthand, segments carrying the SYN bit are also called "SYNs". Hence, the solution requires a suitable mechanism for picking an initial sequence number and a slightly involved handshake to exchange the ISNs.

The synchronization requires each side to send its own initial sequence number and to receive a confirmation of it in acknowledgment from the other side. Each side must also receive the other side's initial sequence number and send a confirming acknowledgment.
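The sequence-number bookkeeping behind this exchange can be sketched in Python. A SYN occupies one sequence number. Each side therefore acknowledges the peer's ISN plus one, and the first sequence number a side uses after its SYN is its own ISN plus one, all modulo 2**32. The `Segment` class and the `three_way_handshake` function are hypothetical names chosen for illustration. The sketch only computes SEQ/ACK/control fields and does not model states, retransmission, or validation. With ISNs of 100 and 300 it reproduces the sequence numbers 100 and 101 used in the Figure 5 discussion.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

MOD = 1 << 32


@dataclass(frozen=True)
class Segment:
    seq: int
    ack: Optional[int]      # None: the ACK control bit is not set
    ctl: FrozenSet[str]     # control flags such as "SYN" and "ACK"


def three_way_handshake(iss_a: int, iss_b: int) -> list:
    """SEQ/ACK fields of the three segments that synchronize TCP A and TCP B."""
    # Step 1: A sends its SYN carrying its ISN.
    syn = Segment(iss_a, None, frozenset({"SYN"}))
    # Steps 2 and 3 combined: B acknowledges A's SYN (which used one sequence
    # number) and sends its own SYN carrying its ISN.
    syn_ack = Segment(iss_b, (iss_a + 1) % MOD, frozenset({"SYN", "ACK"}))
    # Step 4: A acknowledges B's SYN; A's next sequence number is its ISS + 1.
    ack = Segment((iss_a + 1) % MOD, (iss_b + 1) % MOD, frozenset({"ACK"}))
    return [syn, syn_ack, ack]


for seg in three_way_handshake(100, 300):
    print(seg.seq, seg.ack, sorted(seg.ctl))
# 100 None ['SYN']
# 300 101 ['ACK', 'SYN']
# 101 301 ['ACK']
```

Because the arithmetic is done modulo 2**32, an ISN of 2**32 - 1 is acknowledged with 0. This illustrates the wraparound described under Sequence Numbers.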
   1) A --> B  SYN my sequence number is X
   2) A <-- B  ACK your sequence number is X
   3) A <-- B  SYN my sequence number is Y
   4) A --> B  ACK your sequence number is Y

Because steps 2 and 3 can be combined in a single message, this is called the three-way (or three-message) handshake.

A three-way handshake is necessary because sequence numbers are not tied to a global clock in the network, and TCPs may have different mechanisms for picking the ISNs. The receiver of the first SYN has no way of knowing whether the segment was an old delayed one or not, unless it remembers the last sequence number used on the connection (which is not always possible), and so it must ask the sender to verify this SYN. The three-way handshake and the advantages of a clock-driven scheme are discussed in [3].

Knowing When to Keep Quiet

To be sure that a TCP does not create a segment that carries a sequence number that may be duplicated by an old segment remaining in the network, the TCP must keep quiet for an MSL before assigning any sequence numbers upon starting up or recovering from a crash in which memory of sequence numbers in use was lost. For this specification the MSL is taken to be 2 minutes. This is an engineering choice, and may be changed if experience indicates it is desirable to do so. Note that if a TCP is reinitialized in some sense, yet retains its memory of sequence numbers in use, then it need not wait at all; it must only be sure to use sequence numbers larger than those recently used.

The TCP Quiet Time Concept

This specification provides that hosts that "crash" without retaining any knowledge of the last sequence numbers transmitted on each active (i.e., not closed) connection shall delay emitting any TCP segments for at least the agreed MSL in the internet system of which the host is a part.
In the paragraphs below, an explanation for this specification is given. TCP implementors may violate the "quiet time" restriction, but only at the risk of causing some old data to be accepted as new, or new data to be rejected as an old duplicate, by some receivers in the internet system.

TCPs consume sequence number space each time a segment is formed and entered into the network output queue at a source host. The duplicate detection and sequencing algorithm in the TCP protocol relies on the unique binding of segment data to sequence space, to the extent that sequence numbers will not cycle through all 2**32 values before the segment data bound to those sequence numbers has been delivered and acknowledged by the receiver and all duplicate copies of the segments have "drained" from the internet. Without such an assumption, two distinct TCP segments could conceivably be assigned the same or overlapping sequence numbers, causing confusion at the receiver as to which data is new and which is old. Remember that each segment is bound to as many consecutive sequence numbers as there are octets of data and SYN or FIN flags in the segment.

Under normal conditions, TCPs keep track of the next sequence number to emit and the oldest awaiting acknowledgment so as to avoid mistakenly reusing a sequence number before its first use has been acknowledged. This alone does not guarantee that old duplicate data is drained from the net, so the sequence space has been made very large to reduce the probability that a wandering duplicate will cause trouble upon arrival. At 2 megabits/sec, it takes 4.5 hours to use up 2**32 octets of sequence space. Since the maximum segment lifetime in the net is not likely to exceed a few tens of seconds, this is deemed ample protection for foreseeable nets, even if data rates escalate to tens of megabits/sec.
At 100 megabits/sec, the cycle time is 5.4 minutes, which may be a little short, but still within reason.

The basic duplicate detection and sequencing algorithm in TCP can be defeated, however, if a source TCP does not have any memory of the sequence numbers it last used on a given connection. For example, if the TCP were to start all connections with sequence number 0, then upon crashing and restarting, a TCP might re-form an earlier connection (possibly after half-open connection resolution) and emit packets with sequence numbers identical to or overlapping with packets still in the network that were emitted on an earlier incarnation of the same connection. In the absence of knowledge about the sequence numbers used on a particular connection, the TCP specification recommends that the source delay for MSL seconds before emitting segments on the connection, to allow time for segments from the earlier connection incarnation to drain from the system.

Even hosts that can remember the time of day and use it to select initial sequence number values are not immune from this problem (i.e., even if time of day is used to select an initial sequence number for each new connection incarnation).

Suppose, for example, that a connection is opened starting with sequence number S. Suppose that this connection is not used much and that eventually the initial sequence number function (ISN(t)) takes on a value equal to the sequence number, say S1, of the last segment sent by this TCP on a particular connection. Now suppose, at this instant, the host crashes, recovers, and establishes a new incarnation of the connection. The initial sequence number chosen is S1 = ISN(t), the last used sequence number on the old incarnation of the connection!
If the recovery occurs quickly enough, any old duplicates in the net bearing sequence numbers in the neighborhood of S1 may arrive and be treated as new packets by the receiver of the new incarnation of the connection.

The problem is that the recovering host may not know for how long it was down, nor does it know whether there are still old duplicates in the system from earlier connection incarnations.

One way to deal with this problem is to deliberately delay emitting segments for one MSL after recovery from a crash; this is the "quiet time" specification. Hosts that prefer to avoid waiting and are willing to risk possible confusion of old and new packets at a given destination may choose not to wait for the "quiet time". Implementors may provide TCP users with the ability to select on a connection-by-connection basis whether to wait after a crash, or may informally implement the "quiet time" for all connections. Obviously, even where a user selects to "wait," this is not necessary after the host has been "up" for at least MSL seconds.

To summarize: every segment emitted occupies one or more sequence numbers in the sequence space; the numbers occupied by a segment are "busy" or "in use" until MSL seconds have passed; upon crashing, a block of space-time is occupied by the octets and SYN or FIN flags of the last emitted segment; and if a new connection is started too soon and uses any of the sequence numbers in the space-time footprint of the last segment of the previous connection incarnation, there is a potential sequence number overlap area that could cause confusion at the receiver.

3.4.  Establishing a connection

The "three-way handshake" is the procedure used to establish a connection. This procedure normally is initiated by one TCP and responded to by another TCP. The procedure also works if two TCPs simultaneously initiate the procedure.
When a simultaneous attempt occurs, each TCP receives a "SYN" segment that carries no acknowledgment after it has sent a "SYN". Of course, the arrival of an old duplicate "SYN" segment can potentially make it appear, to the recipient, that a simultaneous connection initiation is in progress. Proper use of "reset" segments can disambiguate these cases.

Several examples of connection initiation follow. Although these examples do not show connection synchronization using data-carrying segments, this is perfectly legitimate, so long as the receiving TCP doesn't deliver the data to the user until it is clear the data is valid (i.e., the data must be buffered at the receiver until the connection reaches the ESTABLISHED state). The three-way handshake reduces the possibility of false connections. It is the implementation of a trade-off between memory and messages to provide information for this checking.

The simplest three-way handshake is shown in Figure 5 below. The figures should be interpreted in the following way. Each line is numbered for reference purposes. Right arrows (-->) indicate departure of a TCP segment from TCP A to TCP B, or arrival of a segment at B from A. Left arrows (<--) indicate the reverse. Ellipsis (...) indicates a segment that is still in the network (delayed). An "XXX" indicates a segment that is lost or rejected. Comments appear in parentheses. TCP states represent the state AFTER the departure or arrival of the segment (whose contents are shown in the center of each line). Segment contents are shown in abbreviated form, with sequence number, control flags, and ACK field. Other fields such as window, addresses, lengths, and text have been left out in the interest of clarity.

      TCP A                                                TCP B

  1.  CLOSED                                               LISTEN

  2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

  3.  ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

  4.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK>       --> ESTABLISHED

  5.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED

          Basic 3-Way Handshake for Connection Synchronization

                                Figure 5

In line 2 of Figure 5, TCP A begins by sending a SYN segment indicating that it will use sequence numbers starting with sequence number 100. In line 3, TCP B sends a SYN and acknowledges the SYN it received from TCP A. Note that the acknowledgment field indicates TCP B is now expecting to hear sequence 101, acknowledging the SYN that occupied sequence 100.

At line 4, TCP A responds with an empty segment containing an ACK for TCP B's SYN; and in line 5, TCP A sends some data. Note that the sequence number of the segment in line 5 is the same as in line 4 because the ACK does not occupy sequence number space (if it did, we would wind up ACKing ACKs!).

Simultaneous initiation is only slightly more complex, as is shown in Figure 6. Each TCP cycles from CLOSED to SYN-SENT to SYN-RECEIVED to ESTABLISHED.

      TCP A                                            TCP B

  1.  CLOSED                                           CLOSED

  2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...

  3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT

  4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED

  5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...

  6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED

  7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED

                Simultaneous Connection Synchronization

                               Figure 6

A TCP MUST support simultaneous open attempts.

Note that a TCP implementation MUST keep track of whether a connection has reached SYN-RECEIVED state as the result of a passive OPEN or an active OPEN.

The principal reason for the three-way handshake is to prevent old duplicate connection initiations from causing confusion. To deal with this, a special control message, reset, has been devised. If the receiving TCP is in a non-synchronized state (i.e., SYN-SENT, SYN-RECEIVED), it returns to LISTEN on receiving an acceptable reset.
If the TCP is in one of the synchronized states (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), it aborts the connection and informs its user. We discuss this latter case under "half-open" connections below.

      TCP A                                                TCP B

  1.  CLOSED                                               LISTEN

  2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               ...

  3.  (duplicate) ... <SEQ=90><CTL=SYN>               --> SYN-RECEIVED

  4.  SYN-SENT    <-- <SEQ=300><ACK=91><CTL=SYN,ACK>  <-- SYN-RECEIVED

  5.  SYN-SENT    --> <SEQ=91><CTL=RST>               --> LISTEN

  6.              ... <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

  7.  SYN-SENT    <-- <SEQ=400><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED

  8.  ESTABLISHED --> <SEQ=101><ACK=401><CTL=ACK>      --> ESTABLISHED

                    Recovery from Old Duplicate SYN

                               Figure 7

As a simple example of recovery from old duplicates, consider Figure 7. At line 3, an old duplicate SYN arrives at TCP B. TCP B cannot tell that this is an old duplicate, so it responds normally (line 4). TCP A detects that the ACK field is incorrect and returns a RST (reset) with its SEQ field selected to make the segment believable. TCP B, on receiving the RST, returns to the LISTEN state. When the original SYN (pun intended) finally arrives at line 6, the synchronization proceeds normally. If the SYN at line 6 had arrived before the RST, a more complex exchange might have occurred with RSTs sent in both directions.

Half-Open Connections and Other Anomalies

An established connection is said to be "half-open" if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, or if the two ends of the connection have become desynchronized owing to a crash that resulted in loss of memory. Such connections will automatically become reset if an attempt is made to send data in either direction. However, half-open connections are expected to be unusual, and the recovery procedure is mildly involved.
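The checks behind old-duplicate recovery and half-open discovery can be sketched in Python. The sketch covers four pieces: a modulo-2**32 "less than"; the acceptable-ACK inequality; the four-case receive acceptability test; and the RST a TCP builds for a segment that does not belong to the connection. The comparison treats the 32-bit difference of two sequence numbers as a signed value. This is a common implementation technique, similar in spirit to the SEQ_LT macros in BSD-derived stacks, but it is an assumption of this sketch and is not prescribed by this document. The function names are hypothetical, segments are reduced to bare integers, and windows are assumed to be smaller than 2**31.

```python
from typing import Optional

MOD = 1 << 32
HALF = 1 << 31


def seq_lt(a: int, b: int) -> bool:
    """a < b modulo 2**32: the 32-bit difference a - b is negative when read as signed."""
    return (a - b) % MOD >= HALF


def seq_le(a: int, b: int) -> bool:
    """a =< b modulo 2**32."""
    return (a - b) % MOD == 0 or seq_lt(a, b)


def acceptable_ack(snd_una: int, seg_ack: int, snd_nxt: int) -> bool:
    """SND.UNA < SEG.ACK =< SND.NXT"""
    return seq_lt(snd_una, seg_ack) and seq_le(seg_ack, snd_nxt)


def segment_acceptable(seg_seq: int, seg_len: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Four-case acceptability test for an incoming segment."""
    def in_window(s: int) -> bool:  # RCV.NXT =< s < RCV.NXT+RCV.WND
        return seq_le(rcv_nxt, s) and seq_lt(s, (rcv_nxt + rcv_wnd) % MOD)

    if seg_len == 0:
        return seg_seq == rcv_nxt if rcv_wnd == 0 else in_window(seg_seq)
    if rcv_wnd == 0:
        return False
    return in_window(seg_seq) or in_window((seg_seq + seg_len - 1) % MOD)


def make_reset(seg_seq: int, seg_len: int, seg_ack: Optional[int]) -> dict:
    """RST for a segment not meant for this connection (seg_ack None: no ACK bit)."""
    if seg_ack is not None:
        # Take SEQ from the ACK field so the receiver finds the RST believable.
        return {"seq": seg_ack, "ack": None, "ctl": {"RST"}}
    return {"seq": 0, "ack": (seg_seq + seg_len) % MOD, "ctl": {"RST", "ACK"}}


# A scenario in the spirit of Figure 7: TCP A has ISS=100 (SND.UNA=100,
# SND.NXT=101) and receives a SYN,ACK with SEQ=300, ACK=91 that answers
# an old duplicate SYN.
if not acceptable_ack(100, 91, 101):
    print(make_reset(300, 1, 91))  # {'seq': 91, 'ack': None, 'ctl': {'RST'}}
```

The same ACK check also handles the case where a TCP in SYN-SENT receives an ACK that acknowledges nothing it sent. It then answers with a RST whose SEQ is taken from that ACK, which is how the half-open connection in the following example is discovered.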
If at site A the connection no longer exists, then an attempt by the user at site B to send any data on it will result in the site B TCP receiving a reset control message. Such a message indicates to the site B TCP that something is wrong, and it is expected to abort the connection.

Assume that two user processes A and B are communicating with one another when a crash occurs causing loss of memory to A's TCP. Depending on the operating system supporting A's TCP, it is likely that some error recovery mechanism exists. When the TCP is up again, A is likely to start again from the beginning or from a recovery point. As a result, A will probably try to OPEN the connection again or try to SEND on the connection it believes open. In the latter case, it receives the error message "connection not open" from the local (A's) TCP. In an attempt to establish the connection, A's TCP will send a segment containing SYN. This scenario leads to the example shown in Figure 8. After TCP A crashes, the user attempts to re-open the connection. TCP B, in the meantime, thinks the connection is open.

      TCP A                                           TCP B

  1.  (CRASH)                               (send 300,receive 100)

  2.  CLOSED                                          ESTABLISHED

  3.  SYN-SENT --> <SEQ=400><CTL=SYN>              --> (??)

  4.  (!!)     <-- <SEQ=300><ACK=100><CTL=ACK>     <-- ESTABLISHED

  5.  SYN-SENT --> <SEQ=100><CTL=RST>              --> (Abort!!)

  6.  SYN-SENT                                        CLOSED

  7.  SYN-SENT --> <SEQ=400><CTL=SYN>              -->

                     Half-Open Connection Discovery

                               Figure 8

When the SYN arrives at line 3, TCP B, being in a synchronized state, and the incoming segment outside the window, responds with an acknowledgment indicating what sequence it next expects to hear (ACK 100). TCP A sees that this segment does not acknowledge anything it sent and, being unsynchronized, sends a reset (RST) because it has detected a half-open connection. TCP B aborts at line 5.
TCP A will continue to try to establish the connection; the problem is now reduced to the basic 3-way handshake of Figure 5.

An interesting alternative case occurs when TCP A crashes and TCP B tries to send data on what it thinks is a synchronized connection. This is illustrated in Figure 9. In this case, the data arriving at TCP A from TCP B (line 2) is unacceptable because no such connection exists, so TCP A sends a RST. The RST is acceptable, so TCP B processes it and aborts the connection.

      TCP A                                              TCP B

  1.  (CRASH)                                   (send 300,receive 100)

  2.  (??)    <-- <SEQ=300><ACK=100><DATA=10><CTL=ACK> <-- ESTABLISHED

  3.          --> <SEQ=100><CTL=RST>                   --> (ABORT!!)

           Active Side Causes Half-Open Connection Discovery

                               Figure 9

In Figure 10, we find the two TCPs A and B with passive connections waiting for SYN. An old duplicate arriving at TCP B (line 2) stirs B into action. A SYN-ACK is returned (line 3) and causes TCP A to generate a RST (the ACK in line 3 is not acceptable). TCP B accepts the reset and returns to its passive LISTEN state.

      TCP A                                         TCP B

  1.  LISTEN                                        LISTEN

  2.       ... <SEQ=Z><CTL=SYN>                -->  SYN-RECEIVED

  3.  (??) <-- <SEQ=X><ACK=Z+1><CTL=SYN,ACK>   <--  SYN-RECEIVED

  4.       --> <SEQ=Z+1><CTL=RST>              -->  (return to LISTEN!)

  5.  LISTEN                                        LISTEN

       Old Duplicate SYN Initiates a Reset on Two Passive Sockets

                               Figure 10

A variety of other cases are possible, all of which are accounted for by the following rules for RST generation and processing.

Reset Generation

As a general rule, reset (RST) must be sent whenever a segment arrives that apparently is not intended for the current connection. A reset must not be sent if it is not clear that this is the case.

There are three groups of states:

   1.  If the connection does not exist (CLOSED), then a reset is sent in response to any incoming segment except another reset. In particular, SYNs addressed to a non-existent connection are rejected by this means.
       If the incoming segment has the ACK bit set, the reset takes its sequence number from the ACK field of the segment; otherwise, the reset has sequence number zero and the ACK field is set to the sum of the sequence number and segment length of the incoming segment. The connection remains in the CLOSED state.

   2.  If the connection is in any non-synchronized state (LISTEN, SYN-SENT, SYN-RECEIVED), and the incoming segment acknowledges something not yet sent (the segment carries an unacceptable ACK), or if an incoming segment has a security level or compartment that does not exactly match the level and compartment requested for the connection, a reset is sent.

       If our SYN has not been acknowledged and the precedence level of the incoming segment is higher than the precedence level requested, then either raise the local precedence level (if allowed by the user and the system) or send a reset. If the precedence level of the incoming segment is lower than the precedence level requested, then continue as if the precedence matched exactly (if the remote TCP cannot raise the precedence level to match ours, this will be detected in the next segment it sends, and the connection will be terminated then). If our SYN has been acknowledged (perhaps in this incoming segment), the precedence level of the incoming segment must match the local precedence level exactly; if it does not, a reset must be sent.

       If the incoming segment has an ACK field, the reset takes its sequence number from the ACK field of the segment; otherwise, the reset has sequence number zero and the ACK field is set to the sum of the sequence number and segment length of the incoming segment. The connection remains in the same state.

   3.  If the connection is in a synchronized state (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), any unacceptable segment (out-of-window sequence number or unacceptable acknowledgment number) must elicit only an empty acknowledgment segment containing the current send-sequence number and an acknowledgment indicating the next sequence number expected to be received, and the connection remains in the same state.

       If an incoming segment has a security level, or compartment, or precedence that does not exactly match the level, and compartment, and precedence requested for the connection, a reset is sent and the connection goes to the CLOSED state. The reset takes its sequence number from the ACK field of the incoming segment.

Reset Processing

In all states except SYN-SENT, all reset (RST) segments are validated by checking their SEQ fields. A reset is valid if its sequence number is in the window. In the SYN-SENT state (a RST received in response to an initial SYN), the RST is acceptable if the ACK field acknowledges the SYN.

The receiver of a RST first validates it, then changes state. If the receiver was in the LISTEN state, it ignores it. If the receiver was in SYN-RECEIVED state and had previously been in the LISTEN state, then the receiver returns to the LISTEN state; otherwise, the receiver aborts the connection and goes to the CLOSED state. If the receiver was in any other state, it aborts the connection, advises the user, and goes to the CLOSED state.

TCP SHOULD allow a received RST segment to include data.

3.4.1.  Remote Address Validation

TODO - figure out if this section would fit better elsewhere, for instance in the more detailed description of the OPEN call later on

A TCP implementation MUST reject as an error a local OPEN call for an invalid remote IP address (e.g., a broadcast or multicast address).

An incoming SYN with an invalid source address must be ignored either by TCP or by the IP layer (see Section 3.2.1.3 of [12]).

A TCP implementation MUST silently discard an incoming SYN segment that is addressed to a broadcast or multicast address.

3.5.  Closing a Connection

CLOSE is an operation meaning "I have no more data to send." The notion of closing a full-duplex connection is subject to ambiguous interpretation, of course, since it may not be obvious how to treat the receiving side of the connection. We have chosen to treat CLOSE in a simplex fashion. The user who CLOSEs may continue to RECEIVE until he is told that the other side has CLOSED also. Thus, a program could initiate several SENDs followed by a CLOSE, and then continue to RECEIVE until signaled that a RECEIVE failed because the other side has CLOSED. We assume that the TCP will signal a user, even if no RECEIVEs are outstanding, that the other side has closed, so the user can terminate his side gracefully. A TCP will reliably deliver all buffers SENT before the connection was CLOSED, so a user who expects no data in return need only wait to hear the connection was CLOSED successfully to know that all his data was received at the destination TCP. Users must keep reading connections they close for sending until the TCP says no more data.
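The BSD sockets API is one widely used realization of a TCP user interface, though it is not the interface this document defines. In that API, the simplex CLOSE described above roughly corresponds to `shutdown(SHUT_WR)`. The local TCP sends its FIN, but the application can keep reading until `recv()` returns an empty result, which is how that API reports that the other side has closed too. The Python sketch below runs both ends over the loopback interface. The function name `demo_half_close` and the upper-casing "server" are made up for illustration.

```python
import socket
import threading


def demo_half_close() -> bytes:
    """Client sends a request, closes its sending side, then reads until the peer closes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # port 0: let the OS choose a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def server() -> None:
        conn, _ = listener.accept()
        with conn:
            request = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:              # b"" means the client's FIN has arrived
                    break
                request += chunk
            conn.sendall(request.upper())  # this direction is still open
        # Leaving the with-block closes conn, which sends the server's FIN.

    t = threading.Thread(target=server)
    t.start()
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(b"hello")
        client.shutdown(socket.SHUT_WR)    # roughly "CLOSE": no more data to send
        reply = b""
        while True:                        # keep RECEIVEing until the server closes too
            chunk = client.recv(4096)
            if not chunk:
                break
            reply += chunk
    t.join()
    listener.close()
    return reply


print(demo_half_close())  # b'HELLO'
```

The client only learns that all of its data arrived, and that the peer is done, when its read loop sees end-of-stream. This mirrors the advice above that users must keep reading connections they have closed for sending.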
There are essentially three cases:

   1) The user initiates by telling the TCP to CLOSE the connection

   2) The remote TCP initiates by sending a FIN control signal

   3) Both users CLOSE simultaneously

Case 1: Local user initiates the close

   In this case, a FIN segment can be constructed and placed on the outgoing segment queue. No further SENDs from the user will be accepted by the TCP, and it enters the FIN-WAIT-1 state. RECEIVEs are allowed in this state. All segments preceding and including FIN will be retransmitted until acknowledged. When the other TCP has both acknowledged the FIN and sent a FIN of its own, the first TCP can ACK this FIN. Note that a TCP receiving a FIN will ACK but not send its own FIN until its user has CLOSED the connection also.

Case 2: TCP receives a FIN from the network

   If an unsolicited FIN arrives from the network, the receiving TCP can ACK it and tell the user that the connection is closing. The user will respond with a CLOSE, upon which the TCP can send a FIN to the other TCP after sending any remaining data. The TCP then waits until its own FIN is acknowledged, whereupon it deletes the connection. If an ACK is not forthcoming, after the user timeout the connection is aborted and the user is told.

Case 3: Both users close simultaneously

   A simultaneous CLOSE by users at both ends of a connection causes FIN segments to be exchanged. When all segments preceding the FINs have been processed and acknowledged, each TCP can ACK the FIN it has received. Both will, upon receiving these ACKs, delete the connection.

      TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  --> CLOSE-WAIT

  3.  FIN-WAIT-2  <-- <SEQ=300><ACK=101><CTL=ACK>      <-- CLOSE-WAIT

  4.                                                       (Close)
      TIME-WAIT   <-- <SEQ=300><ACK=101><CTL=FIN,ACK>  <-- LAST-ACK

  5.  TIME-WAIT   --> <SEQ=101><ACK=301><CTL=ACK>      --> CLOSED

  6.  (2 MSL)
      CLOSED

                         Normal Close Sequence

                               Figure 11

      TCP A                                                TCP B

  1.  ESTABLISHED                                          ESTABLISHED

  2.  (Close)                                              (Close)
      FIN-WAIT-1  --> <SEQ=100><ACK=300><CTL=FIN,ACK>  ... FIN-WAIT-1
                  <-- <SEQ=300><ACK=100><CTL=FIN,ACK>  <--
                  ... <SEQ=100><ACK=300><CTL=FIN,ACK>  -->

  3.  CLOSING     --> <SEQ=101><ACK=301><CTL=ACK>      ... CLOSING
                  <-- <SEQ=301><ACK=101><CTL=ACK>      <--
                  ... <SEQ=101><ACK=301><CTL=ACK>      -->

  4.  TIME-WAIT                                            TIME-WAIT
      (2 MSL)                                              (2 MSL)
      CLOSED                                               CLOSED

                      Simultaneous Close Sequence

                               Figure 12

A TCP connection may terminate in two ways: (1) the normal TCP close sequence using a FIN handshake, and (2) an "abort" in which one or more RST segments are sent and the connection state is immediately discarded. If a TCP connection is closed by the remote site, the local application MUST be informed whether it closed normally or was aborted.

3.5.1.  Half-Closed Connections

The normal TCP close sequence delivers buffered data reliably in both directions. Since the two directions of a TCP connection are closed independently, it is possible for a connection to be "half closed," i.e., closed in only one direction, and a host is permitted to continue sending data in the open direction on a half-closed connection.

A host MAY implement a "half-duplex" TCP close sequence, so that an application that has called CLOSE cannot continue to read data from the connection. If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.

When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime).
However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME-WAIT state, if it:

   (1) assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and

   (2) returns to TIME-WAIT state if the SYN turns out to be an old duplicate.

3.6.  Precedence and Security

The intent is that connections be allowed only between ports operating with exactly the same security and compartment values and at the higher of the precedence levels requested by the two ports.

The precedence and security parameters used in TCP are exactly those defined in the Internet Protocol (IP) [1]. Throughout this TCP specification the term "security/compartment" is intended to indicate the security parameters used in IP, including security, compartment, user group, and handling restriction.

A connection attempt with mismatched security/compartment values or a lower precedence value must be rejected by sending a reset. Rejecting a connection due to too low a precedence only occurs after an acknowledgment of the SYN has been received.

Note that TCP modules that operate only at the default value of precedence will still have to check the precedence of incoming segments and possibly raise the precedence level they use on the connection.

The security parameters may be used even in a non-secure environment (the values would indicate unclassified data); thus, hosts in non-secure environments must be prepared to receive the security parameters, though they need not send them.

3.7.  Segmentation

The term "segmentation" refers to the activity TCP performs when ingesting a stream of bytes from a sending application and packetizing that stream of bytes into TCP segments.
   Individual TCP segments often do not correspond one-for-one to
   individual send (or socket write) calls from the application.
   Applications may perform writes at the granularity of messages in
   the upper-layer protocol, but TCP guarantees no correspondence
   between the boundaries of the segments it sends and receives and the
   boundaries of the application's read or write buffers.  In some
   specific protocols, such as RDMA using DDP and MPA [14], performance
   optimizations are possible when the relation between TCP segments
   and application data units can be controlled, and MPA includes a
   specific mechanism for detecting and verifying this relationship
   between TCP segments and application message data structures, but
   this is specific to applications like RDMA.  In general, multiple
   goals influence the sizing of TCP segments created by a TCP
   implementation.

   Goals driving the sending of larger segments include:

   o  Reducing the number of packets in flight within the network.

   o  Increasing processing efficiency and potential performance by
      enabling a smaller number of interrupts and inter-layer
      interactions.

   o  Limiting the overhead of TCP headers.

   Note that the performance benefits of sending larger segments may
   decrease as the size increases, and there may be boundaries where
   advantages are reversed.  For instance, on some machines 1025 bytes
   within a segment could lead to worse performance than 1024 bytes, due
   purely to data alignment on copy operations.

   Goals driving the sending of smaller segments include:

   o  Avoiding sending segments larger than the smallest MTU within an
      IP network path, because this results in either packet loss or
      fragmentation.  Making matters worse, some firewalls or
      middleboxes may drop fragmented packets or ICMP messages related
      to fragmentation.

   o  Preventing delays to the application data stream, especially when
      TCP is waiting on the application to generate more data, or when
      the application is waiting on an event or input from its peer in
      order to generate more data.

   o  Enabling "fate sharing" between TCP segments and lower-layer data
      units (e.g., below IP, for links with cell or frame sizes smaller
      than the IP MTU).

   Towards meeting these competing sets of goals, TCP includes several
   mechanisms, including the Maximum Segment Size option, Path MTU
   Discovery, the Nagle algorithm, and support for IPv6 Jumbograms, as
   discussed in the following subsections.

3.7.1.  Maximum Segment Size Option

   TCP MUST implement both sending and receiving the MSS option.

   TCP SHOULD send an MSS option in every SYN segment when its receive
   MSS differs from the default 536 for IPv4 or 1220 for IPv6, and MAY
   send it always.

   If an MSS option is not received at connection setup, TCP MUST assume
   a default send MSS of 536 (576 - 40) for IPv4 or 1220 (1280 - 60) for
   IPv6.

   The maximum size of a segment that TCP actually sends, the "effective
   send MSS," MUST be the smaller of the send MSS (which reflects the
   available reassembly buffer size at the remote host) and the largest
   size permitted by the IP layer:

      Eff.snd.MSS =

         min(SendMSS+20, MMS_S) - TCPhdrsize - IPoptionsize

   where:

   o  SendMSS is the MSS value received from the remote host, or the
      default 536 for IPv4 or 1220 for IPv6, if no MSS option is
      received.

   o  MMS_S is the maximum size for a transport-layer message that TCP
      may send.

   o  TCPhdrsize is the size of the fixed TCP header and any options.
      This is 20 in the (rare) case that no options are present, but may
      be larger if TCP options are to be sent.
      Note that some options may not be included on all segments, but
      that for each segment sent, the sender should adjust the data
      length accordingly, within the Eff.snd.MSS.

   o  IPoptionsize is the size of any IP options associated with a TCP
      connection.  Note that some options may not be included on all
      packets, but that for each segment sent, the sender should adjust
      the data length accordingly, within the Eff.snd.MSS.

   The MSS value to be sent in an MSS option should be equal to the
   effective MTU minus the fixed IP and TCP headers.  Because both IP
   and TCP options are ignored when calculating the value for the MSS
   option, a sender that includes any IP or TCP options in a packet must
   decrease the size of the TCP data accordingly.  RFC 6691 [21]
   discusses this in greater detail.

   The MSS value to be sent in an MSS option must be less than or equal
   to:

      MMS_R - 20

   where MMS_R is the maximum size for a transport-layer message that
   can be received (and reassembled).  TCP obtains MMS_R and MMS_S from
   the IP layer; see the generic call GET_MAXSIZES in Section 3.4 of RFC
   1122.

   When TCP is used in a situation where either the IP or TCP headers
   are not fixed, the sender must reduce the amount of TCP data in any
   given packet by the number of octets used by the IP and TCP options.
   This has been a point of confusion historically, as explained in RFC
   6691, Section 3.1.

3.7.2.  Path MTU Discovery

   A TCP implementation may be aware of the MTU on directly connected
   links, but will rarely have insight about MTUs across an entire
   network path.  For IPv4, RFC 1122 provides an IP-layer recommendation
   that the default effective MTU for sending be less than or equal to
   576 for destinations not directly connected.  For IPv6, this would be
   1280.
   In all cases, however, implementation of Path MTU Discovery (PMTUD)
   and Packetization Layer Path MTU Discovery (PLPMTUD) is strongly
   recommended in order for TCP to improve segmentation decisions.

   PMTUD for IPv4 [2] or IPv6 [3] is implemented jointly by TCP, IP, and
   ICMP.  Several adjustments to a TCP implementation with PMTUD are
   described in RFC 2923 in order to deal with problems experienced in
   practice [6].  PLPMTUD [13] is a Standards Track improvement to PMTUD
   that relaxes the requirement for ICMP support across a path, and
   improves performance in cases where ICMP is not consistently
   conveyed.  The mechanisms in all four of these RFCs are recommended
   to be included in TCP implementations.

   The TCP MSS option specifies an upper bound for the size of packets
   that can be received.  Hence, setting the value in the MSS option too
   small can impair the ability of PMTUD or PLPMTUD to find a larger
   path MTU.  RFC 1191 discusses this implication of many older TCP
   implementations setting MSS to 536 for non-local destinations, rather
   than deriving it from the MTUs of connected interfaces as
   recommended.

3.7.3.  Interfaces with Variable MTU Values

   The effective MTU can sometimes vary, as when used with variable
   compression, e.g., RObust Header Compression (ROHC) [17].  It is
   tempting for TCP to advertise the largest possible MSS, to support
   the most efficient use of compressed payloads.  Unfortunately, some
   compression schemes occasionally need to transmit full headers (and
   thus smaller payloads) to resynchronize state at their endpoint
   compressors/decompressors.  If the largest MTU is used to calculate
   the value to advertise in the MSS option, TCP retransmission may
   interfere with compressor resynchronization.
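
   The MSS arithmetic of Section 3.7.1, together with the variable-MTU
   concern just described, can be sketched as follows.  This is an
   illustrative sketch only: the function names, the restriction to
   IPv4 fixed-header sizes (20 octets each for IP and TCP), and the
   example values (such as MMS_S = 1480 for a 1500-octet MTU) are
   assumptions, not part of this specification.

```python
# Illustrative sketch of the MSS arithmetic in Section 3.7.1 (IPv4 only).
# All names are hypothetical; sizes are in octets.

IP_FIXED_HDR = 20            # IPv4 header without options
TCP_FIXED_HDR = 20           # TCP header without options
DEFAULT_SEND_MSS_IPV4 = 536  # 576 - 40, assumed when no MSS option arrives


def eff_snd_mss(mms_s, send_mss=None, tcp_hdr_size=TCP_FIXED_HDR,
                ip_option_size=0):
    """Eff.snd.MSS = min(SendMSS+20, MMS_S) - TCPhdrsize - IPoptionsize."""
    if send_mss is None:
        send_mss = DEFAULT_SEND_MSS_IPV4
    return min(send_mss + 20, mms_s) - tcp_hdr_size - ip_option_size


def mss_to_advertise(effective_mtus, mms_r):
    """MSS option value: the smallest effective MTU of the interface minus
    the fixed IP and TCP headers, never more than MMS_R - 20."""
    smallest_mtu = min(effective_mtus)
    return min(smallest_mtu - IP_FIXED_HDR - TCP_FIXED_HDR, mms_r - 20)


if __name__ == "__main__":
    # Peer advertised 1460 over a 1500-octet MTU path (MMS_S = 1480).
    print(eff_snd_mss(1480, 1460))                      # 1460
    # Same path, but 12 octets of TCP options on every segment.
    print(eff_snd_mss(1480, 1460, tcp_hdr_size=32))     # 1448
    # No MSS option received: fall back to the IPv4 default.
    print(eff_snd_mss(1480))                            # 536
    # An interface whose effective MTU varies between 1400 and 1500.
    print(mss_to_advertise([1500, 1400], mms_r=65515))  # 1360
```

   Using the largest effective MTU (1500) in the last example would
   instead advertise 1460, which is larger than what the interface can
   carry while the compressor is sending full headers.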

   As a result, when the effective MTU of an interface varies, TCP
   SHOULD use the smallest effective MTU of the interface to calculate
   the value to advertise in the MSS option.

3.7.4.  Nagle Algorithm

   The "Nagle algorithm" was described in RFC 896 [11] and was
   recommended in RFC 1122 [12] for mitigation of an early problem of
   too many small packets being generated.  It has been implemented in
   most current TCP code bases, sometimes with minor variations.

   If there is unacknowledged data (i.e., SND.NXT > SND.UNA), then the
   sending TCP buffers all user data (regardless of the PSH bit), until
   the outstanding data has been acknowledged or until the TCP can send
   a full-sized segment (Eff.snd.MSS bytes).

   TODO - see if SEND description later should be updated to reflect
   this

   A TCP SHOULD implement the Nagle Algorithm to coalesce short
   segments.  However, there MUST be a way for an application to disable
   the Nagle algorithm on an individual connection.  In all cases,
   sending data is also subject to the limitation imposed by the Slow
   Start algorithm [16].

3.7.5.  IPv6 Jumbograms

   In order to support TCP over IPv6 jumbograms, implementations need to
   be able to send TCP segments larger than the 64 KB limit that the MSS
   option can convey.  RFC 2675 [5] defines that an MSS value of 65,535
   bytes is to be treated as infinity, and Path MTU Discovery [3] is
   used to determine the actual MSS.

3.8.  Data Communication

   Once the connection is established, data is communicated by the
   exchange of segments.  Because segments may be lost due to errors
   (checksum test failure) or network congestion, TCP uses
   retransmission (after a timeout) to ensure delivery of every segment.
   Duplicate segments may arrive due to network or TCP retransmission.
   As discussed in the section on sequence numbers, TCP performs
   certain tests on the sequence and acknowledgment numbers in the
   segments to verify their acceptability.

   The sender of data keeps track of the next sequence number to use in
   the variable SND.NXT.  The receiver of data keeps track of the next
   sequence number to expect in the variable RCV.NXT.  The sender of
   data keeps track of the oldest unacknowledged sequence number in the
   variable SND.UNA.  If the data flow is momentarily idle and all data
   sent has been acknowledged, then the three variables will be equal.

   When the sender creates a segment and transmits it, the sender
   advances SND.NXT.  When the receiver accepts a segment, it advances
   RCV.NXT and sends an acknowledgment.  When the data sender receives
   an acknowledgment, it advances SND.UNA.  The extent to which the
   values of these variables differ is a measure of the delay in the
   communication.  The amount by which the variables are advanced is the
   length of the data and SYN or FIN flags in the segment.  Note that
   once in the ESTABLISHED state all segments must carry current
   acknowledgment information.

   The CLOSE user call implies a push function, as does the FIN control
   flag in an incoming segment.

3.8.1.  Retransmission Timeout

   Because of the variability of the networks that compose an
   internetwork system and the wide range of uses of TCP connections,
   the retransmission timeout (RTO) must be dynamically determined.

   The RTO MUST be computed according to the algorithm in [8], including
   Karn's algorithm for taking RTT samples.

   RFC 793 contains an early example procedure for computing the RTO.
   This was then replaced by the algorithm described in RFC 1122, and
   subsequently updated in RFC 2988, and then again in RFC 6298.
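
   The RTO computation of RFC 6298, including Karn's algorithm, can be
   sketched as below.  This is a simplified, illustrative sketch: the
   class and method names are hypothetical, the clock granularity G is
   assumed to be 1 ms, and the 60-second upper bound is one permitted
   choice rather than a requirement.  The constants (alpha = 1/8,
   beta = 1/4, K = 4, an initial RTO of 1 second, and a 1-second
   minimum) follow RFC 6298.

```python
# Simplified sketch of the RFC 6298 RTO estimator with Karn's algorithm.
# Class and method names are hypothetical; all times are in seconds.

ALPHA = 1 / 8    # SRTT gain
BETA = 1 / 4     # RTTVAR gain
K = 4
G = 0.001        # assumed clock granularity (1 ms)
MIN_RTO = 1.0    # RTO is rounded up to 1 second
MAX_RTO = 60.0   # optional upper bound; RFC 6298 allows any value >= 60 s


class RtoEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # before any RTT measurement has been made

    def _clamp(self, value):
        return min(MAX_RTO, max(MIN_RTO, value))

    def on_rtt_sample(self, r, segment_was_retransmitted):
        # Karn's algorithm: an ACK for a retransmitted segment cannot be
        # matched to a specific transmission, so the sample is discarded.
        if segment_was_retransmitted:
            return
        if self.srtt is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = self._clamp(self.srtt + max(G, K * self.rttvar))

    def on_retransmission_timeout(self):
        # Exponential backoff ("back off the timer"), bounded by MAX_RTO.
        self.rto = self._clamp(self.rto * 2)
```

   For example, a first sample of 0.1 s yields an RTO of 0.3 s, which is
   rounded up to 1 s; a second sample of 2.0 s then gives SRTT = 0.3375
   and RTTVAR = 0.5125, so RTO = 2.3875 s, and one timeout doubles it to
   4.775 s.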

   If a retransmitted packet is identical to the original packet (which
   implies not only that the data boundaries have not changed, but also
   that the window and acknowledgment fields of the header have not
   changed), then the same IP Identification field MAY be used (see
   Section 3.2.1.5 of RFC 1122).

3.8.2.  TCP Congestion Control

   RFC 1122 required implementation of Van Jacobson's congestion control
   algorithm combining slow start with congestion avoidance.  RFC 2581
   provided an IETF Standards Track description of this, along with fast
   retransmit and fast recovery.  RFC 5681 is the current description of
   these algorithms and is the current standard for TCP congestion
   control.

   A TCP MUST implement RFC 5681.

   Explicit Congestion Notification (ECN) was defined in RFC 3168 and is
   an IETF Standards Track enhancement that has many benefits [24].

   A TCP SHOULD implement ECN as described in RFC 3168.

3.8.3.  TCP Connection Failures

   Excessive retransmission of the same segment by TCP indicates some
   failure of the remote host or the Internet path.  This failure may be
   of short or long duration.  The following procedure MUST be used to
   handle excessive retransmissions of data segments:

   (a)  There are two thresholds R1 and R2 measuring the amount of
        retransmission that has occurred for the same segment.  R1 and
        R2 might be measured in time units or as a count of
        retransmissions.

   (b)  When the number of transmissions of the same segment reaches or
        exceeds threshold R1, pass negative advice (see Section 3.3.1.4
        of [12]) to the IP layer, to trigger dead-gateway diagnosis.

   (c)  When the number of transmissions of the same segment reaches a
        threshold R2 greater than R1, close the connection.

   (d)  An application MUST be able to set the value for R2 for a
        particular connection.
        For example, an interactive application might set R2 to
        "infinity," giving the user control over when to disconnect.

   (e)  TCP SHOULD inform the application of the delivery problem
        (unless such information has been disabled by the application;
        see RFC 1122 Section 4.2.4.1 - TODO update to error reporting
        description in this document), when R1 is reached and before R2.
        This will allow a remote login (User Telnet) application program
        to inform the user, for example.

   The value of R1 SHOULD correspond to at least 3 retransmissions, at
   the current RTO.  The value of R2 SHOULD correspond to at least 100
   seconds.

   An attempt to open a TCP connection could fail with excessive
   retransmissions of the SYN segment or by receipt of a RST segment or
   an ICMP Port Unreachable.  SYN retransmissions MUST be handled in the
   general way just described for data retransmissions, including
   notification of the application layer.

   However, the values of R1 and R2 may be different for SYN and data
   segments.  In particular, R2 for a SYN segment MUST be set large
   enough to provide retransmission of the segment for at least 3
   minutes.  The application can close the connection (i.e., give up on
   the open attempt) sooner, of course.

3.8.4.  TCP Keep-Alives

   Implementors MAY include "keep-alives" in their TCP implementations,
   although this practice is not universally accepted.  If keep-alives
   are included, the application MUST be able to turn them on or off for
   each TCP connection, and they MUST default to off.

   Keep-alive packets MUST only be sent when no data or acknowledgement
   packets have been received for the connection within an interval.
   This interval MUST be configurable and MUST default to no less than
   two hours.

   It is extremely important to remember that ACK segments that contain
   no data are not reliably transmitted by TCP.
   Consequently, if a keep-alive mechanism is implemented it MUST NOT
   interpret failure to respond to any specific probe as a dead
   connection.

   An implementation SHOULD send a keep-alive segment with no data;
   however, it MAY be configurable to send a keep-alive segment
   containing one garbage octet, for compatibility with erroneous TCP
   implementations.

3.8.5.  The Communication of Urgent Information

   As a result of implementation differences and middlebox interactions,
   new applications SHOULD NOT employ the TCP urgent mechanism.
   However, TCP implementations MUST still include support for the
   urgent mechanism.  Details can be found in RFC 6093 [18].

   The objective of the TCP urgent mechanism is to allow the sending
   user to stimulate the receiving user to accept some urgent data and
   to permit the receiving TCP to indicate to the receiving user when
   all the currently known urgent data has been received by the user.

   This mechanism permits a point in the data stream to be designated as
   the end of urgent information.  Whenever this point is in advance of
   the receive sequence number (RCV.NXT) at the receiving TCP, that TCP
   must tell the user to go into "urgent mode"; when the receive
   sequence number catches up to the urgent pointer, the TCP must tell
   the user to go into "normal mode".  If the urgent pointer is updated
   while the user is in "urgent mode", the update will be invisible to
   the user.

   The method employs an urgent field which is carried in all segments
   transmitted.  The URG control flag indicates that the urgent field is
   meaningful and must be added to the segment sequence number to yield
   the urgent pointer.  The absence of this flag indicates that there is
   no urgent data outstanding.

   To send an urgent indication, the user must also send at least one
   data octet.
   If the sending user also indicates a push, timely delivery of the
   urgent information to the destination process is enhanced.

   A TCP MUST support a sequence of urgent data of any length [12].

   A TCP MUST inform the application layer asynchronously whenever it
   receives an Urgent pointer and there was previously no pending urgent
   data, or whenever the Urgent pointer advances in the data stream.
   There MUST be a way for the application to learn how much urgent data
   remains to be read from the connection, or at least to determine
   whether or not more urgent data remains to be read [12].

3.8.6.  Managing the Window

   The window sent in each segment indicates the range of sequence
   numbers the sender of the window (the data receiver) is currently
   prepared to accept.  This window is assumed to be related to the
   data buffer space currently available for this connection.

   The sending TCP packages the data to be transmitted into segments
   which fit the current window, and may repackage segments on the
   retransmission queue.  Such repackaging is not required, but may be
   helpful.

   In a connection with a one-way data flow, the window information will
   be carried in acknowledgment segments that all have the same sequence
   number, so there will be no way to reorder them if they arrive out of
   order.  This is not a serious problem, but it will allow the window
   information to be on occasion temporarily based on old reports from
   the data receiver.  A refinement to avoid this problem is to act on
   the window information from segments that carry the highest
   acknowledgment number (that is, segments with an acknowledgment
   number equal to or greater than the highest previously received).

   Indicating a large window encourages transmissions.  If more data
   arrives than can be accepted, it will be discarded.
   This will result in excessive retransmissions, adding unnecessarily
   to the load on the network and the TCPs.  Indicating a small window
   may restrict the transmission of data to the point of introducing a
   round-trip delay between each new segment transmitted.

   The mechanisms provided allow a TCP to advertise a large window and
   to subsequently advertise a much smaller window without having
   accepted that much data.  This so-called "shrinking the window" is
   strongly discouraged.  The robustness principle dictates that TCPs
   will not shrink the window themselves, but will be prepared for such
   behavior on the part of other TCPs.

   A TCP receiver SHOULD NOT shrink the window, i.e., move the right
   window edge to the left.  However, a sending TCP MUST be robust
   against window shrinking, which may cause the "useable window" (see
   Section 3.8.6.2.1) to become negative.

   If this happens, the sender SHOULD NOT send new data, but SHOULD
   retransmit normally the old unacknowledged data between SND.UNA and
   SND.UNA+SND.WND.  The sender MAY also retransmit old data beyond
   SND.UNA+SND.WND, but SHOULD NOT time out the connection if data
   beyond the right window edge is not acknowledged.  If the window
   shrinks to zero, the TCP MUST probe it in the standard way (described
   below).

3.8.6.1.  Zero Window Probing

   The sending TCP must be prepared to accept from the user and send at
   least one octet of new data even if the send window is zero.  The
   sending TCP must regularly retransmit to the receiving TCP even when
   the window is zero, in order to "probe" the window.  Two minutes is
   recommended for the retransmission interval when the window is zero.
   This retransmission is essential to guarantee that when either TCP
   has a zero window the reopening of the window will be reliably
   reported to the other.
   This is referred to as Zero-Window Probing (ZWP) in other
   documents.

   Probing of zero (offered) windows MUST be supported.

   A TCP MAY keep its offered receive window closed indefinitely.  As
   long as the receiving TCP continues to send acknowledgments in
   response to the probe segments, the sending TCP MUST allow the
   connection to stay open.  This enables TCP to function in scenarios
   such as the "printer ran out of paper" situation described in
   Section 4.2.2.17 of RFC 1122.  The behavior is subject to the
   implementation's resource management concerns, as noted in [19].

   When the receiving TCP has a zero window and a segment arrives, it
   must still send an acknowledgment showing its next expected sequence
   number and current window (zero).

3.8.6.2.  Silly Window Syndrome Avoidance

   The "Silly Window Syndrome" (SWS) is a stable pattern of small
   incremental window movements resulting in extremely poor TCP
   performance.  Algorithms to avoid SWS are described below for both
   the sending side and the receiving side.  RFC 1122 contains more
   detailed discussion of the SWS problem.  Note that the Nagle
   algorithm and the sender SWS avoidance algorithm play complementary
   roles in improving performance.  The Nagle algorithm discourages
   sending tiny segments when the data to be sent increases in small
   increments, while the SWS avoidance algorithm discourages small
   segments resulting from the right window edge advancing in small
   increments.

3.8.6.2.1.  Sender's Algorithm - When to Send Data

   A TCP MUST include a SWS avoidance algorithm in the sender.

   A TCP SHOULD implement the Nagle Algorithm to coalesce short
   segments.  However, there MUST be a way for an application to disable
   the Nagle algorithm on an individual connection.  In all cases,
   sending data is also subject to the limitation imposed by the Slow
   Start algorithm.
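
   The Nagle rule from Section 3.7.4 reduces to a small predicate,
   sketched below.  The function name and the "nodelay" parameter are
   hypothetical, and the sketch deliberately ignores the send window,
   congestion control, and sequence-number wraparound, all of which a
   real sender must also respect.  In the widely used BSD sockets API,
   the per-connection switch that disables the Nagle algorithm is the
   TCP_NODELAY socket option.

```python
def nagle_permits_send(snd_nxt, snd_una, queued, eff_snd_mss, nodelay=False):
    """Return True if the Nagle algorithm lets the sender transmit now.

    snd_nxt, snd_una: send sequence variables (wraparound ignored here).
    queued:           octets of user data buffered but not yet sent.
    nodelay:          hypothetical per-connection switch disabling Nagle.
    """
    if queued == 0:
        return False                 # nothing to send
    if nodelay:
        return True                  # application has disabled Nagle
    if snd_nxt == snd_una:
        return True                  # no unacknowledged data outstanding
    # Data is outstanding: hold small writes until a full-sized segment
    # can be sent (an ACK covering everything makes SND.NXT == SND.UNA).
    return queued >= eff_snd_mss
```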

   The sender's SWS avoidance algorithm is more difficult than the
   receiver's, because the sender does not know (directly) the
   receiver's total buffer space RCV.BUFF.  An approach that has been
   found to work well is for the sender to calculate Max(SND.WND), the
   maximum send window it has seen so far on the connection, and to use
   this value as an estimate of RCV.BUFF.  Unfortunately, this can only
   be an estimate; the receiver may at any time reduce the size of
   RCV.BUFF.  To avoid a resulting deadlock, it is necessary to have a
   timeout to force transmission of data, overriding the SWS avoidance
   algorithm.  In practice, this timeout should seldom occur.

   The "useable window" is:

      U = SND.UNA + SND.WND - SND.NXT

   i.e., the offered window less the amount of data sent but not
   acknowledged.  If D is the amount of data queued in the sending TCP
   but not yet sent, then the following set of rules is recommended.

   Send data:

   (1)  if a maximum-sized segment can be sent, i.e., if:

           min(D,U) >= Eff.snd.MSS;

   (2)  or if the data is pushed and all queued data can be sent now,
        i.e., if:

           [SND.NXT = SND.UNA and] PUSHED and D <= U

        (the bracketed condition is imposed by the Nagle algorithm);

   (3)  or if at least a fraction Fs of the maximum window can be sent,
        i.e., if:

           [SND.NXT = SND.UNA and]

              min(D,U) >= Fs * Max(SND.WND);

   (4)  or if data is PUSHed and the override timeout occurs.

   Here Fs is a fraction whose recommended value is 1/2.  The override
   timeout should be in the range 0.1 - 1.0 seconds.  It may be
   convenient to combine this timer with the timer used to probe zero
   windows (Section 3.8.6.1).

3.8.6.2.2.  Receiver's Algorithm - When to Send a Window Update

   A TCP MUST include a SWS avoidance algorithm in the receiver.
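
   Rules (1) through (4) of the sender's algorithm can be combined into
   a single decision function, sketched below.  The function and
   parameter names are hypothetical; sequence numbers are treated as
   plain integers (no wraparound), Max(SND.WND) and the state of the
   override timer are assumed to be tracked elsewhere, and zero-window
   probing is assumed to be handled separately, so no new data is sent
   when U <= 0.  Passing nagle=False drops the bracketed conditions, as
   for an application that has disabled the Nagle algorithm.

```python
def sws_sender_should_send(snd_una, snd_wnd, snd_nxt, d, eff_snd_mss,
                           max_snd_wnd, pushed, override_timeout,
                           nagle=True, fs=0.5):
    """Sender-side SWS avoidance: decide whether to send queued data now.

    d is the amount of data queued but not yet sent; override_timeout is
    True once the (hypothetical) override timer has fired.
    """
    u = snd_una + snd_wnd - snd_nxt    # the "useable window" U
    if d <= 0 or u <= 0:
        return False                   # nothing queued, or no usable window
    # Bracketed condition [SND.NXT = SND.UNA and], imposed by Nagle.
    nagle_ok = (snd_nxt == snd_una) or not nagle
    if min(d, u) >= eff_snd_mss:                      # rule (1)
        return True
    if nagle_ok and pushed and d <= u:                # rule (2)
        return True
    if nagle_ok and min(d, u) >= fs * max_snd_wnd:    # rule (3)
        return True
    if pushed and override_timeout:                   # rule (4)
        return True
    return False
```

   With 100 octets pushed while 500 octets are still unacknowledged,
   the function waits (rules (2) and (3) are blocked by the Nagle
   condition) until either an ACK arrives or the override timer fires.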

   The receiver's SWS avoidance algorithm determines when the right
   window edge may be advanced; this is customarily known as "updating
   the window".  This algorithm combines with the delayed ACK algorithm
   (see Section 3.8.6.3) to determine when an ACK segment containing the
   current window will actually be sent to the data sender.

   The solution to receiver SWS is to avoid advancing the right window
   edge RCV.NXT+RCV.WND in small increments, even if data is received
   from the network in small segments.

   Suppose the total receive buffer space is RCV.BUFF.  At any given
   moment, RCV.USER octets of this total may be tied up with data that
   has been received and acknowledged but which the user process has not
   yet consumed.  When the connection is quiescent, RCV.WND = RCV.BUFF
   and RCV.USER = 0.

   Keeping the right window edge fixed as data arrives and is
   acknowledged requires that the receiver offer less than its full
   buffer space, i.e., the receiver must specify a RCV.WND that keeps
   RCV.NXT+RCV.WND constant as RCV.NXT increases.  Thus, the total
   buffer space RCV.BUFF is generally divided into three parts:

              |<------- RCV.BUFF ---------------->|
                   1             2            3
          ----|---------|------------------|------|----
                     RCV.NXT               ^
                                        (Fixed)

          1 - RCV.USER =  data received but not yet consumed;
          2 - RCV.WND =   space advertised to sender;
          3 - Reduction = space available but not yet
                          advertised.

   The suggested SWS avoidance algorithm for the receiver is to keep
   RCV.NXT+RCV.WND fixed until the reduction satisfies:

      RCV.BUFF - RCV.USER - RCV.WND  >=

         min( Fr * RCV.BUFF, Eff.snd.MSS )

   where Fr is a fraction whose recommended value is 1/2, and
   Eff.snd.MSS is the effective send MSS for the connection (see
   Section 3.7.1).  When the inequality is satisfied, RCV.WND is set to
   RCV.BUFF-RCV.USER.
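
   The receiver's rule can be sketched as a small state holder, shown
   below.  The class and method names are hypothetical, sequence numbers
   are plain integers (no wraparound), and only in-window data is
   modeled.  RCV.WND shrinks as data arrives so that RCV.NXT+RCV.WND
   stays fixed, and the right edge moves only once the unadvertised
   reduction reaches min(Fr * RCV.BUFF, Eff.snd.MSS).

```python
class ReceiverWindow:
    """Receiver-side SWS avoidance (illustrative; no sequence wraparound)."""

    def __init__(self, rcv_buff, eff_snd_mss, rcv_nxt=0, fr=0.5):
        self.rcv_buff = rcv_buff
        self.eff_snd_mss = eff_snd_mss
        self.fr = fr
        self.rcv_nxt = rcv_nxt
        self.rcv_user = 0          # received and ACKed, not yet consumed
        self.rcv_wnd = rcv_buff    # quiescent: RCV.WND = RCV.BUFF

    def right_edge(self):
        return self.rcv_nxt + self.rcv_wnd

    def on_data(self, n):
        """Accept n new in-window octets, keeping the right edge fixed."""
        if not 0 < n <= self.rcv_wnd:
            raise ValueError("data must fall within the offered window")
        self.rcv_nxt += n
        self.rcv_user += n
        self.rcv_wnd -= n
        self._maybe_update_window()

    def on_user_read(self, n):
        """The application consumes up to n buffered octets."""
        self.rcv_user -= min(n, self.rcv_user)
        self._maybe_update_window()

    def _maybe_update_window(self):
        reduction = self.rcv_buff - self.rcv_user - self.rcv_wnd
        if reduction >= min(self.fr * self.rcv_buff, self.eff_snd_mss):
            self.rcv_wnd = self.rcv_buff - self.rcv_user
```

   For example, with RCV.BUFF = 8192 and Eff.snd.MSS = 1460, receiving
   and consuming 1000 octets leaves the right edge at 8192, because the
   reduction (1000) is below 1460.  Only after a second 1000-octet chunk
   is consumed does the reduction (2000) pass the threshold, so RCV.WND
   returns to 8192 and the right edge moves to 10192.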

   Note that the general effect of this algorithm is to advance RCV.WND
   in increments of Eff.snd.MSS (for realistic receive buffers:
   Eff.snd.MSS < RCV.BUFF/2).  Note also that the receiver must use its
   own Eff.snd.MSS, assuming it is the same as the sender's.

3.8.6.3.  Delayed Acknowledgements - When to Send an ACK Segment

   A host that is receiving a stream of TCP data segments can increase
   efficiency in both the Internet and the hosts by sending fewer than
   one ACK (acknowledgment) segment per data segment received; this is
   known as a "delayed ACK".

   A TCP SHOULD implement a delayed ACK, but an ACK should not be
   excessively delayed; in particular, the delay MUST be less than 0.5
   seconds, and in a stream of full-sized segments there SHOULD be an
   ACK for at least every second segment.  Excessive delays on ACKs can
   disturb the round-trip timing and packet "clocking" algorithms.

3.9.  Interfaces

   There are of course two interfaces of concern: the user/TCP interface
   and the TCP/lower-level interface.  We have a fairly elaborate model
   of the user/TCP interface, but the interface to the lower-level
   protocol module is left unspecified here, since it will be specified
   in detail by the specification of the lower-level protocol.  For the
   case that the lower level is IP, we note some of the parameter values
   that TCPs might use.

3.9.1.  User/TCP Interface

   The following functional description of user commands to the TCP is,
   at best, fictional, since every operating system will have different
   facilities.  Consequently, we must warn readers that different TCP
   implementations may have different user interfaces.  However, all
   TCPs must provide a certain minimum set of services to guarantee that
   all TCP implementations can support the same protocol hierarchy.
   This section specifies the functional interfaces required of all TCP
   implementations.

   TCP User Commands

      The following sections functionally characterize a USER/TCP
      interface.  The notation used is similar to most procedure or
      function calls in high-level languages, but this usage is not
      meant to rule out trap-type service calls (e.g., SVCs, UUOs,
      EMTs).

      The user commands described below specify the basic functions the
      TCP must perform to support interprocess communication.
      Individual implementations must define their own exact format, and
      may provide combinations or subsets of the basic functions in
      single calls.  In particular, some implementations may wish to
      automatically OPEN a connection on the first SEND or RECEIVE
      issued by the user for a given connection.

      In providing interprocess communication facilities, the TCP must
      not only accept commands, but must also return information to the
      processes it serves.  The latter consists of:

         (a) general information about a connection (e.g., interrupts,
             remote close, binding of unspecified foreign socket).

         (b) replies to specific user commands indicating success or
             various types of failure.

      Open

         Format: OPEN (local port, foreign socket, active/passive [,
         timeout] [, precedence] [, security/compartment] [, local IP
         address] [, options]) -> local connection name

         We assume that the local TCP is aware of the identity of the
         processes it serves and will check the authority of the process
         to use the connection specified.  Depending upon the
         implementation of the TCP, the local network and TCP
         identifiers for the source address will either be supplied by
         the TCP or the lower-level protocol (e.g., IP).  These
         considerations are the result of concern about security, to the
         extent that no TCP be able to masquerade as another one, and so
         on.  Similarly, no process can masquerade as another without
         the collusion of the TCP.

         If the active/passive flag is set to passive, then this is a
         call to LISTEN for an incoming connection.  A passive OPEN may
         have either a fully specified foreign socket to wait for a
         particular connection or an unspecified foreign socket to wait
         for any call.  A fully specified passive call can be made
         active by the subsequent execution of a SEND.

         A transmission control block (TCB) is created and partially
         filled in with data from the OPEN command parameters.

         Every passive OPEN call either creates a new connection record
         in LISTEN state, or it returns an error; it MUST NOT affect any
         previously created connection record.

         A TCP that supports multiple concurrent users MUST provide an
         OPEN call that will functionally allow an application to LISTEN
         on a port while a connection block with the same local port is
         in SYN-SENT or SYN-RECEIVED state.

         On an active OPEN command, the TCP will begin the procedure to
         synchronize (i.e., establish) the connection at once.

         The timeout, if present, permits the caller to set up a timeout
         for all data submitted to TCP.  If data is not successfully
         delivered to the destination within the timeout period, the TCP
         will abort the connection.  The present global default is five
         minutes.

         The TCP or some component of the operating system will verify
         the user's authority to open a connection with the specified
         precedence or security/compartment.  The absence of precedence
         or security/compartment specification in the OPEN call
         indicates the default values must be used.

         TCP will accept incoming requests as matching only if the
         security/compartment information is exactly the same and only
         if the precedence is equal to or higher than the precedence
         requested in the OPEN call.
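
         The matching rule above, together with the rule from Section
         3.6 that a connection runs at the higher of the two precedence
         levels, can be sketched as follows.  The type and function
         names are hypothetical, precedence is modeled as the 3-bit IP
         precedence value (0 through 7), and the security/compartment
         information is treated as an opaque tuple compared for exact
         equality.  A None result stands for a request that the TCP
         would reject by sending a reset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SecurityParams:
    precedence: int              # IP precedence, 0 (routine) .. 7
    security_compartment: tuple  # opaque; compared for exact equality


def match_incoming(opened: SecurityParams,
                   incoming: SecurityParams) -> Optional[int]:
    """Return the connection's precedence if the incoming request matches
    the OPEN call, or None if it must be rejected with a reset."""
    if incoming.security_compartment != opened.security_compartment:
        return None   # security/compartment must match exactly
    if incoming.precedence < opened.precedence:
        return None   # precedence below the level requested in OPEN
    # The connection runs at the higher of the two precedence values.
    return max(opened.precedence, incoming.precedence)
```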

         The precedence for the connection is the higher of the values requested in the OPEN call and received from the incoming request, and is fixed at that value for the life of the connection. Implementers may want to give the user control of this precedence negotiation. For example, the user might be allowed to specify that the precedence must be exactly matched, or that any attempt to raise the precedence be confirmed by the user.

         A local connection name will be returned to the user by the TCP. The local connection name can then be used as a shorthand term for the connection defined by the <local socket, foreign socket> pair.

         The optional "local IP address" parameter MUST be supported to allow the specification of the local IP address. This enables applications that need to select the local IP address used when multihoming is present.

         A passive OPEN call with a specified "local IP address" parameter will await an incoming connection request to that address. If the parameter is unspecified, a passive OPEN will await an incoming connection request to any local IP address, and then bind the local IP address of the connection to the particular address that is used.

         For an active OPEN call, a specified "local IP address" parameter MUST be used for opening the connection. If the parameter is unspecified, the TCP will choose an appropriate local IP address (see RFC 1122 section 3.3.4.2).

         TODO - the previous and next paragraphs are mildly in conflict. The previous paragraph says that the TCP chooses an address, but the next paragraph says that it asks IP to choose ... need to make this consistent.

         If an application on a multihomed host does not specify the local IP address when actively opening a TCP connection, then the TCP MUST ask the IP layer to select a local IP address before sending the (first) SYN.
         See the function GET_SRCADDR() in Section 3.4 of RFC 1122.

         At all other times, a previous segment has either been sent or received on this connection, and TCP MUST use the same local address that was used in those previous segments.

      Send

         Format: SEND (local connection name, buffer address, byte count, PUSH flag, URGENT flag [, timeout])

         This call causes the data contained in the indicated user buffer to be sent on the indicated connection. If the connection has not been opened, the SEND is considered an error. Some implementations may allow users to SEND first, in which case an automatic OPEN would be done. If the calling process is not authorized to use this connection, an error is returned.

         If the PUSH flag is set, the data must be transmitted promptly to the receiver, and the PUSH bit will be set in the last TCP segment created from the buffer. If the PUSH flag is not set, the data may be combined with data from subsequent SENDs for transmission efficiency.

         New applications SHOULD NOT set the URGENT flag [18] due to implementation differences and middlebox issues.

         If the URGENT flag is set, segments sent to the destination TCP will have the urgent pointer set. The receiving TCP will signal the urgent condition to the receiving process if the urgent pointer indicates that data preceding the urgent pointer has not been consumed by the receiving process. The purpose of urgent is to stimulate the receiver to process the urgent data and to indicate to the receiver when all the currently known urgent data has been received. The number of times the sending user's TCP signals urgent will not necessarily be equal to the number of times the receiving user will be notified of the presence of urgent data.
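The PUSH rule above (set the PSH bit only on the last segment created from the buffer) can be illustrated with a small segmentization sketch. The Segment record, the fixed maximum segment size mss, and the function name segmentize are assumptions for illustration; a real TCP would also coalesce unpushed data with later SENDs and apply flow and congestion control, all of which this sketch omits, and it leaves the URGENT flag out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int           # sequence number of the first data octet
    data: bytes
    psh: bool = False  # PUSH bit


def segmentize(buf: bytes, snd_nxt: int, mss: int, push: bool) -> List[Segment]:
    """Split one SEND buffer into segments of at most mss octets (hypothetical helper)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = [
        Segment(seq=(snd_nxt + off) % 2**32, data=buf[off:off + mss])
        for off in range(0, len(buf), mss)
    ]
    # Only the last segment carved from this buffer carries PSH.
    if push and segments:
        segments[-1].psh = True
    return segments
```

Sequence numbers are reduced modulo 2**32, so a buffer that starts just below the top of the sequence space wraps around to 0.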

         If no foreign socket was specified in the OPEN, but the connection is established (e.g., because a LISTENing connection has become specific due to a foreign segment arriving for the local socket), then the designated buffer is sent to the implied foreign socket. Users who make use of OPEN with an unspecified foreign socket can make use of SEND without ever explicitly knowing the foreign socket address.

         However, if a SEND is attempted before the foreign socket becomes specified, an error will be returned. Users can use the STATUS call to determine the status of the connection. In some implementations the TCP may notify the user when an unspecified socket is bound.

         If a timeout is specified, the current user timeout for this connection is changed to the new one.

         In the simplest implementation, SEND would not return control to the sending process until either the transmission was complete or the timeout had been exceeded. However, this simple method is both subject to deadlocks (for example, both sides of the connection might try to do SENDs before doing any RECEIVEs) and offers poor performance, so it is not recommended. A more sophisticated implementation would return immediately to allow the process to run concurrently with network I/O, and, furthermore, to allow multiple SENDs to be in progress. Multiple SENDs are served in first come, first served order, so the TCP will queue those it cannot service immediately.

         We have implicitly assumed an asynchronous user interface in which a SEND later elicits some kind of SIGNAL or pseudo-interrupt from the serving TCP. An alternative is to return a response immediately. For instance, SENDs might return immediate local acknowledgment, even if the segment sent had not been acknowledged by the distant TCP. We could optimistically assume eventual success.
         If we are wrong, the connection will close anyway due to the timeout. In implementations of this kind (synchronous), there will still be some asynchronous signals, but these will deal with the connection itself, and not with specific segments or buffers.

         In order for the process to distinguish among error or success indications for different SENDs, it might be appropriate for the buffer address to be returned along with the coded response to the SEND request. TCP-to-user signals are discussed below, indicating the information which should be returned to the calling process.

      Receive

         Format: RECEIVE (local connection name, buffer address, byte count) -> byte count, urgent flag, push flag

         This command allocates a receiving buffer associated with the specified connection. If no OPEN precedes this command or the calling process is not authorized to use this connection, an error is returned.

         In the simplest implementation, control would not return to the calling program until either the buffer was filled, or some error occurred, but this scheme is highly subject to deadlocks. A more sophisticated implementation would permit several RECEIVEs to be outstanding at once. These would be filled as segments arrive. This strategy permits increased throughput at the cost of a more elaborate scheme (possibly asynchronous) to notify the calling program that a PUSH has been seen or a buffer filled.

         If enough data arrive to fill the buffer before a PUSH is seen, the PUSH flag will not be set in the response to the RECEIVE. The buffer will be filled with as much data as it can hold. If a PUSH is seen before the buffer is filled, the buffer will be returned partially filled and PUSH indicated.

         If there is urgent data, the user will have been informed as soon as it arrived via a TCP-to-user signal.
         The receiving user should thus be in "urgent mode". If the URGENT flag is on, additional urgent data remains. If the URGENT flag is off, this call to RECEIVE has returned all the urgent data, and the user may now leave "urgent mode". Note that data following the urgent pointer (non-urgent data) cannot be delivered to the user in the same buffer with preceding urgent data unless the boundary is clearly marked for the user.

         To distinguish among several outstanding RECEIVEs and to take care of the case that a buffer is not completely filled, the return code is accompanied by both a buffer pointer and a byte count indicating the actual length of the data received.

         Alternative implementations of RECEIVE might have the TCP allocate buffer storage, or the TCP might share a ring buffer with the user.

      Close

         Format: CLOSE (local connection name)

         This command causes the connection specified to be closed. If the connection is not open or the calling process is not authorized to use this connection, an error is returned. Closing connections is intended to be a graceful operation in the sense that outstanding SENDs will be transmitted (and retransmitted), as flow control permits, until all have been serviced. Thus, it should be acceptable to make several SEND calls, followed by a CLOSE, and expect all the data to be sent to the destination. It should also be clear that users should continue to RECEIVE on CLOSING connections, since the other side may be trying to transmit the last of its data. Thus, CLOSE means "I have no more to send" but does not mean "I will not receive any more." It may happen (if the user-level protocol is not well thought out) that the closing side is unable to get rid of all its data before timing out. In this event, CLOSE turns into ABORT, and the closing TCP gives up.
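On BSD-sockets stacks, the "I have no more to send" meaning of CLOSE is commonly exposed as shutdown() with SHUT_WR, which sends a FIN while leaving the receive side open; close() additionally releases the socket. The Python sketch below assumes a loopback interface is available and shows one side half-closing and then still receiving the peer's reply; the function name half_close_exchange is hypothetical.

```python
import socket
from typing import Tuple


def half_close_exchange(request: bytes, reply: bytes) -> Tuple[bytes, bytes]:
    """Return (what the server read, what the client read) over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        client.sendall(request)
        client.shutdown(socket.SHUT_WR)  # CLOSE: "I have no more to send"

        received = b""
        while True:
            chunk = server.recv(4096)
            if not chunk:                # b"" means the client's FIN arrived
                break
            received += chunk

        server.sendall(reply)            # the half-closed client can still receive
        server.close()

        answer = b""
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            answer += chunk
        return received, answer
    finally:
        for s in (client, server, listener):
            s.close()                    # closing an already-closed socket is a no-op
```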

         The user may CLOSE the connection at any time on his own initiative, or in response to various prompts from the TCP (e.g., remote close executed, transmission timeout exceeded, destination inaccessible).

         Because closing a connection requires communication with the foreign TCP, connections may remain in the closing state for a short time. Attempts to reopen the connection before the TCP replies to the CLOSE command will result in error responses.

         CLOSE also implies the push function.

      Status

         Format: STATUS (local connection name) -> status data

         This is an implementation-dependent user command and could be excluded without adverse effect. Information returned would typically come from the TCB associated with the connection.

         This command returns a data block containing the following information:

            local socket,
            foreign socket,
            local connection name,
            receive window,
            send window,
            connection state,
            number of buffers awaiting acknowledgment,
            number of buffers pending receipt,
            urgent state,
            precedence,
            security/compartment,
            and transmission timeout.

         Depending on the state of the connection, or on the implementation itself, some of this information may not be available or meaningful. If the calling process is not authorized to use this connection, an error is returned. This prevents unauthorized processes from gaining information about a connection.

      Abort

         Format: ABORT (local connection name)

         This command causes all pending SENDs and RECEIVEs to be aborted, the TCB to be removed, and a special RESET message to be sent to the TCP on the other side of the connection. Depending on the implementation, users may receive abort indications for each outstanding SEND or RECEIVE, or may simply receive an ABORT-acknowledgment.
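Many BSD-sockets stacks have no call named ABORT; a common idiom is to enable SO_LINGER with a zero timeout before close(), which makes the stack discard queued data and send a RST instead of a FIN. The sketch below assumes a Linux-like stack (struct linger packed as two C ints, and a peer recv() reporting the reset as ConnectionResetError) and a loopback interface; abort_and_observe is a hypothetical name.

```python
import socket
import struct


def abort_and_observe() -> bool:
    """Abort one end of a loopback connection; report whether the peer saw a reset."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    server.settimeout(5.0)  # avoid hanging if no RST or FIN ever arrives
    try:
        # l_onoff=1, l_linger=0: close() now aborts the connection with a RST.
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        client.close()
        try:
            server.recv(1)
        except ConnectionResetError:
            return True   # the peer was told "connection reset"
        return False      # e.g., an orderly end-of-stream instead of a reset
    finally:
        for s in (client, server, listener):
            s.close()
```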

      Flush

         Some TCP implementations have included a FLUSH call, which will empty the TCP send queue of any data for which the user has issued SEND calls but which is still to the right of the current send window. That is, it flushes as much queued send data as possible without losing sequence number synchronization.

      Set TOS

         The application layer MUST be able to specify the Type-of-Service (TOS) for segments that are sent on a connection. It is not required, but the application SHOULD be able to change the TOS during the connection lifetime. TCP SHOULD pass the current TOS value without change to the IP layer when it sends segments on the connection.

         The TOS will be specified independently in each direction on the connection, so that the receiver application will specify the TOS used for ACK segments.

         TCP MAY pass the most recently received TOS up to the application.

   TCP-to-User Messages

      It is assumed that the operating system environment provides a means for the TCP to asynchronously signal the user program. When the TCP does signal a user program, certain information is passed to the user. Often in the specification the information will be an error message. In other cases there will be information relating to the completion of processing a SEND or RECEIVE or other user call.

      The following information is provided:

         Local Connection Name                 Always
         Response String                       Always
         Buffer Address                        Send & Receive
         Byte count (counts bytes received)    Receive
         Push flag                             Receive
         Urgent flag                           Receive

3.9.2.  TCP/Lower-Level Interface

   The TCP calls on a lower level protocol module to actually send and receive information over a network. One case is that of the ARPA internetwork system where the lower level module is the Internet Protocol (IP) [1].
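As a sketch of the Set TOS command on a sockets-based stack, and of the "binary XXX00000" precedence layout given below for the IP Type of Service argument, the Python below builds a TOS octet from a 3-bit precedence and hands it to IP through the IP_TOS socket option. It assumes a platform where socket.IP_TOS is available (e.g., Linux) and reads the value back to confirm it; the helper names are hypothetical.

```python
import socket


def tos_from_precedence(precedence: int) -> int:
    """Place a 3-bit precedence in the top bits of the TOS octet (XXX00000)."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must fit in three bits")
    # Delay, throughput, and reliability bits are left at "normal" (0).
    return precedence << 5


def set_tos(sock: socket.socket, tos: int) -> int:
    """Ask IP to use `tos` for this socket's segments; return the value now in effect."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
```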

   If the lower level protocol is IP, it provides arguments for a type of service and for a time to live. TCP uses the following settings for these parameters:

      Type of Service = Precedence: given by user, Delay: normal, Throughput: normal, Reliability: normal; or binary XXX00000, where XXX are the three bits determining precedence, e.g., 000 means routine precedence.

      Time to Live (TTL): The TTL value used to send TCP segments MUST be configurable.

         Note that RFC 793 specified one minute (60 seconds) as a constant for the TTL, because the assumed maximum segment lifetime was two minutes. This was intended to explicitly ask that a segment be destroyed if it cannot be delivered by the internet system within one minute. RFC 1122 changed this specification to require that the TTL be configurable.

   Any lower level protocol will have to provide the source address, destination address, and protocol fields, and some way to determine the "TCP length", both to provide the functional equivalent service of IP and to be used in the TCP checksum.

   When received options are passed up to TCP from the IP layer, TCP MUST ignore options that it does not understand.

   A TCP MAY support the Time Stamp and Record Route options.

3.9.2.1.  Source Routing

   If the lower level is IP (or other protocol that provides this feature) and source routing is used, the interface must allow the route information to be communicated. This is especially important so that the source and destination addresses used in the TCP checksum are the originating source and ultimate destination. It is also important to preserve the return route to answer connection requests.

   An application MUST be able to specify a source route when it actively opens a TCP connection, and this MUST take precedence over a source route received in a datagram.

   When a TCP connection is OPENed passively and a packet arrives with a completed IP Source Route option (containing a return route), TCP MUST save the return route and use it for all segments sent on this connection. If a different source route arrives in a later segment, the later definition SHOULD override the earlier one.

3.9.2.2.  ICMP Messages

   TCP MUST act on an ICMP error message passed up from the IP layer, directing it to the connection that created the error. The necessary demultiplexing information can be found in the IP header contained within the ICMP message.

   This applies to ICMPv6 in addition to IPv4 ICMP.

   [15] contains discussion of specific ICMP and ICMPv6 messages classified as either "soft" or "hard" errors that may bear different responses. Treatment for classes of ICMP messages is described below:

   Source Quench

      TCP MUST silently discard any received ICMP Source Quench messages. See [9] for discussion.

   Soft Errors

      For ICMP these include: Destination Unreachable -- codes 0, 1, 5; Time Exceeded -- codes 0, 1; and Parameter Problem.

      For ICMPv6 these include: Destination Unreachable -- codes 0 and 3; Time Exceeded -- codes 0, 1; and Parameter Problem -- codes 0, 1, 2.

      Since these messages indicate soft error conditions, TCP MUST NOT abort the connection, and it SHOULD make the information available to the application.

   Hard Errors

      For ICMP these include Destination Unreachable -- codes 2-4.

      These are hard error conditions, so TCP SHOULD abort the connection. [15] notes that some implementations do not abort connections when an ICMP hard error is received for a connection that is in any of the synchronized states.

   Note that [15] section 4 describes widespread implementation behavior that treats soft errors as hard errors during connection establishment.

3.10.  Event Processing

   The processing depicted in this section is an example of one possible implementation. Other implementations may have slightly different processing sequences, but they should differ from those in this section only in detail, not in substance.

   The activity of the TCP can be characterized as responding to events. The events that occur can be cast into three categories: user calls, arriving segments, and timeouts. This section describes the processing the TCP does in response to each of the events. In many cases the processing required depends on the state of the connection.

   Events that occur:

      User Calls

         OPEN
         SEND
         RECEIVE
         CLOSE
         ABORT
         STATUS

      Arriving Segments

         SEGMENT ARRIVES

      Timeouts

         USER TIMEOUT
         RETRANSMISSION TIMEOUT
         TIME-WAIT TIMEOUT

   The model of the TCP/user interface is that user commands receive an immediate return and possibly a delayed response via an event or pseudo-interrupt. In the following descriptions, the term "signal" means cause a delayed response.

   Error responses are given as character strings. For example, user commands referencing connections that do not exist receive "error: connection not open".

   Please note in the following that all arithmetic on sequence numbers, acknowledgment numbers, windows, et cetera, is modulo 2**32, the size of the sequence number space. Also note that "=<" means less than or equal to (modulo 2**32).

   A natural way to think about processing incoming segments is to imagine that they are first tested for proper sequence number (i.e., that their contents lie in the range of the expected "receive window" in the sequence number space) and then that they are generally queued and processed in sequence number order.
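The modulo 2**32 comparisons above, and the four-case acceptability test given later under "first check sequence number", can be sketched as follows. The helper names are hypothetical, and seq_lt uses the common convention of treating a forward distance of less than 2**31 as "less than", an assumption the text does not spell out. The special allowance for valid ACKs, URGs, and RSTs when RCV.WND is zero is not modeled.

```python
MOD = 2**32


def seq_add(a: int, n: int) -> int:
    """a + n in sequence number space."""
    return (a + n) % MOD


def seq_lt(a: int, b: int) -> bool:
    """a < b, read as: b lies less than half the space ahead of a."""
    return 0 < (b - a) % MOD < MOD // 2


def seq_le(a: int, b: int) -> bool:
    """a =< b (modulo 2**32)."""
    return a == b or seq_lt(a, b)


def in_window(start: int, x: int, size: int) -> bool:
    """start =< x < start+size, computed modulo 2**32."""
    return (x - start) % MOD < size


def acceptable(seg_seq: int, seg_len: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """The four-case acceptability test for an incoming segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return in_window(rcv_nxt, seg_seq, rcv_wnd)
    if rcv_wnd == 0:
        return False
    last = seq_add(seg_seq, seg_len - 1)
    # Acceptable if either the first or the last octet falls in the window.
    return in_window(rcv_nxt, seg_seq, rcv_wnd) or in_window(rcv_nxt, last, rcv_wnd)
```

Because in_window measures the forward distance from RCV.NXT, a window that straddles the wrap from 2**32-1 to 0 needs no special case.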

   When a segment overlaps other already received segments, we reconstruct the segment to contain just the new data and adjust the header fields to be consistent.

   Note that if no state change is mentioned, the TCP stays in the same state.

   OPEN Call

      CLOSED STATE (i.e., TCB does not exist)

         Create a new transmission control block (TCB) to hold connection state information. Fill in local socket identifier, foreign socket, precedence, security/compartment, and user timeout information. Note that some parts of the foreign socket may be unspecified in a passive OPEN and are to be filled in by the parameters of the incoming SYN segment. Verify the security and precedence requested are allowed for this user; if not, return "error: precedence not allowed" or "error: security/compartment not allowed". If passive, enter the LISTEN state and return. If active and the foreign socket is unspecified, return "error: foreign socket unspecified"; if active and the foreign socket is specified, issue a SYN segment. An initial send sequence number (ISS) is selected. A SYN segment of the form <SEQ=ISS><CTL=SYN> is sent. Set SND.UNA to ISS, SND.NXT to ISS+1, enter SYN-SENT state, and return.

         If the caller does not have access to the local socket specified, return "error: connection illegal for this process". If there is no room to create a new connection, return "error: insufficient resources".

      LISTEN STATE

         If active and the foreign socket is specified, then change the connection from passive to active, select an ISS. Send a SYN segment, set SND.UNA to ISS, SND.NXT to ISS+1. Enter SYN-SENT state. Data associated with SEND may be sent with the SYN segment or queued for transmission after entering ESTABLISHED state. The urgent bit, if requested in the command, must be sent with the data segments sent as a result of this command.
         If there is no room to queue the request, respond with "error: insufficient resources". If the foreign socket was not specified, then return "error: foreign socket unspecified".

      SYN-SENT STATE
      SYN-RECEIVED STATE
      ESTABLISHED STATE
      FIN-WAIT-1 STATE
      FIN-WAIT-2 STATE
      CLOSE-WAIT STATE
      CLOSING STATE
      LAST-ACK STATE
      TIME-WAIT STATE

         Return "error: connection already exists".

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         If the user does not have access to such a connection, then return "error: connection illegal for this process".

         Otherwise, return "error: connection does not exist".

      LISTEN STATE

         If the foreign socket is specified, then change the connection from passive to active, select an ISS. Send a SYN segment, set SND.UNA to ISS, SND.NXT to ISS+1. Enter SYN-SENT state. Data associated with SEND may be sent with the SYN segment or queued for transmission after entering ESTABLISHED state. The urgent bit, if requested in the command, must be sent with the data segments sent as a result of this command. If there is no room to queue the request, respond with "error: insufficient resources". If the foreign socket was not specified, then return "error: foreign socket unspecified".

      SYN-SENT STATE
      SYN-RECEIVED STATE

         Queue the data for transmission after entering ESTABLISHED state. If there is no space to queue, respond with "error: insufficient resources".

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked acknowledgment (acknowledgment value = RCV.NXT). If there is insufficient space to remember this buffer, simply return "error: insufficient resources".

         If the urgent flag is set, then SND.UP <- SND.NXT and set the urgent pointer in the outgoing segments.

      FIN-WAIT-1 STATE
      FIN-WAIT-2 STATE
      CLOSING STATE
      LAST-ACK STATE
      TIME-WAIT STATE

         Return "error: connection closing" and do not service the request.

   RECEIVE Call

      CLOSED STATE (i.e., TCB does not exist)

         If the user does not have access to such a connection, return "error: connection illegal for this process".

         Otherwise, return "error: connection does not exist".

      LISTEN STATE
      SYN-SENT STATE
      SYN-RECEIVED STATE

         Queue for processing after entering ESTABLISHED state. If there is no room to queue this request, respond with "error: insufficient resources".

      ESTABLISHED STATE
      FIN-WAIT-1 STATE
      FIN-WAIT-2 STATE

         If insufficient incoming segments are queued to satisfy the request, queue the request. If there is no queue space to remember the RECEIVE, respond with "error: insufficient resources".

         Reassemble queued incoming segments into the receive buffer and return to the user. Mark "push seen" (PUSH) if this is the case.

         If RCV.UP is in advance of the data currently being passed to the user, notify the user of the presence of urgent data.

         When the TCP takes responsibility for delivering data to the user, that fact must be communicated to the sender via an acknowledgment. The formation of such an acknowledgment is described below in the discussion of processing an incoming segment.

      CLOSE-WAIT STATE

         Since the remote side has already sent FIN, RECEIVEs must be satisfied by text already on hand, but not yet delivered to the user. If no text is awaiting delivery, the RECEIVE will get an "error: connection closing" response. Otherwise, any remaining text can be used to satisfy the RECEIVE.

      CLOSING STATE
      LAST-ACK STATE
      TIME-WAIT STATE

         Return "error: connection closing".
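The ESTABLISHED-state handling above (reassemble queued data into the receive buffer and mark "push seen") follows the semantics described under the Receive user command: a RECEIVE returns when the buffer is full or when a PUSH has been consumed, and a buffer that fills before the PUSH is reached is returned without the push flag. A minimal sketch, assuming in-order data is already queued as (bytes, push) pairs; the queue representation and the name fill_receive_buffer are hypothetical, and urgent data is ignored.

```python
from collections import deque
from typing import Deque, Tuple

Chunk = Tuple[bytes, bool]  # (in-order data, PUSH seen on its last octet)


def fill_receive_buffer(queue: Deque[Chunk], size: int) -> Tuple[bytes, bool]:
    """Serve one RECEIVE of up to `size` octets; return (data, push flag)."""
    out = bytearray()
    while queue and len(out) < size:
        data, psh = queue[0]
        room = size - len(out)
        if len(data) > room:
            # Buffer fills before the end of this chunk: any PUSH stays pending.
            out += data[:room]
            queue[0] = (data[room:], psh)
            return bytes(out), False
        out += data
        queue.popleft()
        if psh:
            # PUSH seen: hand back a possibly partially filled buffer.
            return bytes(out), True
    return bytes(out), False
```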

   CLOSE Call

      CLOSED STATE (i.e., TCB does not exist)

         If the user does not have access to such a connection, return "error: connection illegal for this process".

         Otherwise, return "error: connection does not exist".

      LISTEN STATE

         Any outstanding RECEIVEs are returned with "error: closing" responses. Delete TCB, enter CLOSED state, and return.

      SYN-SENT STATE

         Delete the TCB and return "error: closing" responses to any queued SENDs or RECEIVEs.

      SYN-RECEIVED STATE

         If no SENDs have been issued and there is no pending data to send, then form a FIN segment and send it, and enter FIN-WAIT-1 state; otherwise, queue for processing after entering ESTABLISHED state.

      ESTABLISHED STATE

         Queue this until all preceding SENDs have been segmentized, then form a FIN segment and send it. In any case, enter FIN-WAIT-1 state.

      FIN-WAIT-1 STATE
      FIN-WAIT-2 STATE

         Strictly speaking, this is an error and should receive an "error: connection closing" response. An "ok" response would be acceptable, too, as long as a second FIN is not emitted (the first FIN may be retransmitted, though).

      CLOSE-WAIT STATE

         Queue this request until all preceding SENDs have been segmentized; then send a FIN segment and enter LAST-ACK state.

      CLOSING STATE
      LAST-ACK STATE
      TIME-WAIT STATE

         Respond with "error: connection closing".

   ABORT Call

      CLOSED STATE (i.e., TCB does not exist)

         If the user should not have access to such a connection, return "error: connection illegal for this process".

         Otherwise, return "error: connection does not exist".

      LISTEN STATE

         Any outstanding RECEIVEs should be returned with "error: connection reset" responses. Delete TCB, enter CLOSED state, and return.

      SYN-SENT STATE

         All queued SENDs and RECEIVEs should be given "connection reset" notification; delete the TCB, enter CLOSED state, and return.

      SYN-RECEIVED STATE
      ESTABLISHED STATE
      FIN-WAIT-1 STATE
      FIN-WAIT-2 STATE
      CLOSE-WAIT STATE

         Send a reset segment:

            <SEQ=SND.NXT><CTL=RST>

         All queued SENDs and RECEIVEs should be given "connection reset" notification; all segments queued for transmission (except for the RST formed above) or retransmission should be flushed; delete the TCB, enter CLOSED state, and return.

      CLOSING STATE
      LAST-ACK STATE
      TIME-WAIT STATE

         Respond with "ok" and delete the TCB, enter CLOSED state, and return.

   STATUS Call

      CLOSED STATE (i.e., TCB does not exist)

         If the user should not have access to such a connection, return "error: connection illegal for this process".

         Otherwise, return "error: connection does not exist".

      LISTEN STATE

         Return "state = LISTEN", and the TCB pointer.

      SYN-SENT STATE

         Return "state = SYN-SENT", and the TCB pointer.

      SYN-RECEIVED STATE

         Return "state = SYN-RECEIVED", and the TCB pointer.

      ESTABLISHED STATE

         Return "state = ESTABLISHED", and the TCB pointer.

      FIN-WAIT-1 STATE

         Return "state = FIN-WAIT-1", and the TCB pointer.

      FIN-WAIT-2 STATE

         Return "state = FIN-WAIT-2", and the TCB pointer.

      CLOSE-WAIT STATE

         Return "state = CLOSE-WAIT", and the TCB pointer.

      CLOSING STATE

         Return "state = CLOSING", and the TCB pointer.

      LAST-ACK STATE

         Return "state = LAST-ACK", and the TCB pointer.

      TIME-WAIT STATE

         Return "state = TIME-WAIT", and the TCB pointer.

   SEGMENT ARRIVES

      If the state is CLOSED (i.e., TCB does not exist) then

         all data in the incoming segment is discarded. An incoming segment containing a RST is discarded. An incoming segment not containing a RST causes a RST to be sent in response.
         The acknowledgment and sequence field values are selected to make the reset sequence acceptable to the TCP that sent the offending segment.

         If the ACK bit is off, sequence number zero is used,

            <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK>

         If the ACK bit is on,

            <SEQ=SEG.ACK><CTL=RST>

         Return.

      If the state is LISTEN then

         first check for an RST

            An incoming RST should be ignored. Return.

         second check for an ACK

            Any acknowledgment is bad if it arrives on a connection still in the LISTEN state. An acceptable reset segment should be formed for any arriving ACK-bearing segment. The RST should be formatted as follows:

               <SEQ=SEG.ACK><CTL=RST>

            Return.

         third check for a SYN

            If the SYN bit is set, check the security. If the security/compartment on the incoming segment does not exactly match the security/compartment in the TCB, then send a reset and return.

            If the SEG.PRC is greater than the TCB.PRC, then if allowed by the user and the system set TCB.PRC<-SEG.PRC; if not allowed, send a reset and return.

            If the SEG.PRC is less than the TCB.PRC, then continue.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ, and any other control or text should be queued for processing later. ISS should be selected and a SYN segment sent of the form:

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

            SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection state should be changed to SYN-RECEIVED. Note that any other incoming control or data (combined with SYN) will be processed in the SYN-RECEIVED state, but processing of SYN and ACK should not be repeated. If the listen was not fully specified (i.e., the foreign socket was not fully specified), then the unspecified fields should be filled in now.

         fourth other text or control

            Any other control or text-bearing segment (not containing SYN) must have an ACK and thus would be discarded by the ACK processing.
            An incoming RST segment could not be valid, since it could not have been sent in response to anything sent by this incarnation of the connection. So you are unlikely to get here, but if you do, drop the segment and return.

      If the state is SYN-SENT then

         first check the ACK bit

            If the ACK bit is set

               If SEG.ACK =< ISS or SEG.ACK > SND.NXT, send a reset (unless the RST bit is set; if so, drop the segment and return)

                  <SEQ=SEG.ACK><CTL=RST>

               and discard the segment. Return.

               If SND.UNA < SEG.ACK =< SND.NXT, then the ACK is acceptable. (TODO: in processing Errata ID 3300, it was noted that some stacks in the wild that do not send data on the SYN are just checking that SEG.ACK == SND.NXT ... think about whether anything should be said about that here)

         second check the RST bit

            If the RST bit is set

               If the ACK was acceptable, then signal the user "error: connection reset", drop the segment, enter CLOSED state, delete TCB, and return. Otherwise (no ACK), drop the segment and return.

         third check the security and precedence

            If the security/compartment in the segment does not exactly match the security/compartment in the TCB, send a reset:

               If there is an ACK

                  <SEQ=SEG.ACK><CTL=RST>

               Otherwise

                  <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK>

            If there is an ACK

               The precedence in the segment must match the precedence in the TCB; if not, send a reset

                  <SEQ=SEG.ACK><CTL=RST>

            If there is no ACK

               If the precedence in the segment is higher than the precedence in the TCB, then if allowed by the user and the system raise the precedence in the TCB to that in the segment; if not allowed to raise the precedence, then send a reset.

                  <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK>

               If the precedence in the segment is lower than the precedence in the TCB, continue.

            If a reset was sent, discard the segment and return.
         fourth check the SYN bit

            This step should be reached only if the ACK is ok, or there
            is no ACK, and the segment did not contain a RST.

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ.  SND.UNA should be advanced to equal
            SEG.ACK (if there is an ACK), and any segments on the
            retransmission queue which are thereby acknowledged should
            be removed.

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an ACK segment

               <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

            and send it.  Data or controls which were queued for
            transmission may be included.  If there are other controls
            or text in the segment then continue processing at the sixth
            step below where the URG bit is checked, otherwise return.

            Otherwise enter SYN-RECEIVED, form a SYN,ACK segment

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

            and send it.  Set the variables:

               SND.WND <- SEG.WND
               SND.WL1 <- SEG.SEQ
               SND.WL2 <- SEG.ACK

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise,

         first check sequence number

            SYN-RECEIVED STATE
            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE
            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT STATE

               Segments are processed in sequence.  Initial tests on
               arrival are used to discard old duplicates, but further
               processing is done in SEG.SEQ order.  If a segment's
               contents straddle the boundary between old and new, only
               the new parts should be processed.
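One way to keep only the new parts of a straddling segment, and to cut anything past the right window edge, is sketched below in Python.  It is an illustration under stated assumptions, not a definitive implementation: the `Segment` type and the `seq_diff` and `trim_to_window` names are hypothetical, the segment is assumed to have already passed the acceptability test, and SYN (first) and FIN (last) are each assumed to occupy one sequence number, as the glossary defines them.

```python
from dataclasses import dataclass

MOD = 1 << 32


def seq_diff(a: int, b: int) -> int:
    """Signed distance a - b in 32-bit sequence space (valid while |a - b| < 2**31)."""
    d = (a - b) % MOD
    return d - MOD if d >= (1 << 31) else d


@dataclass
class Segment:
    """Hypothetical arriving segment: SYN, then data, then FIN in sequence space."""
    seq: int
    data: bytes = b""
    syn: bool = False
    fin: bool = False

    @property
    def length(self) -> int:
        # SEG.LEN counts the data plus one for each of SYN and FIN.
        return len(self.data) + int(self.syn) + int(self.fin)


def trim_to_window(seg: Segment, rcv_nxt: int, rcv_wnd: int) -> Segment:
    """Drop the parts of seg before RCV.NXT or at/after RCV.NXT+RCV.WND."""
    seq, data, syn, fin = seg.seq, seg.data, seg.syn, seg.fin

    # Left edge: sequence numbers below RCV.NXT are old duplicates.
    old = seq_diff(rcv_nxt, seq)
    if old > 0:
        if syn:                  # the SYN occupies the first sequence number
            syn, old = False, old - 1
        cut = min(old, len(data))
        data, old = data[cut:], old - cut
        if old > 0:              # the FIN was old as well
            fin = False
        seq = rcv_nxt % MOD

    # Right edge: sequence numbers at or beyond RCV.NXT+RCV.WND do not fit.
    length = len(data) + int(syn) + int(fin)
    over = seq_diff(seq + length, rcv_nxt + rcv_wnd)
    if over > 0 and fin:         # the FIN occupies the last sequence number
        fin, over = False, over - 1
    if over > 0:
        cut = min(over, len(data))
        data, over = data[:len(data) - cut], over - cut
    if over > 0:
        syn = False
    return Segment(seq, data, syn, fin)
```

Following the text, further processing would continue only if the trimmed segment begins at RCV.NXT; a segment that still starts later would be held for later processing.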
               There are four cases for the acceptability test for an
               incoming segment:

               Segment  Receive  Test
               Length   Window
               -------  -------  ----------------------------------------

                  0        0     SEG.SEQ = RCV.NXT

                  0       >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

                 >0        0     not acceptable

                 >0       >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
                              or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

               If the RCV.WND is zero, no segments will be acceptable,
               but special allowance should be made to accept valid
               ACKs, URGs and RSTs.

               If an incoming segment is not acceptable, an
               acknowledgment should be sent in reply (unless the RST
               bit is set, if so drop the segment and return):

                  <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

               After sending the acknowledgment, drop the unacceptable
               segment and return.

               In the following it is assumed that the segment is the
               idealized segment that begins at RCV.NXT and does not
               exceed the window.  One could tailor actual segments to
               fit this assumption by trimming off any portions that lie
               outside the window (including SYN and FIN), and only
               processing further if the segment then begins at RCV.NXT.
               Segments with higher beginning sequence numbers should be
               held for later processing.

               In general, the processing of received segments MUST be
               implemented to aggregate ACK segments whenever possible.
               For example, if the TCP is processing a series of queued
               segments, it MUST process them all before sending any ACK
               segments.  (TODO - see if there's a better place for this
               paragraph - taken from RFC1122)

         second check the RST bit,

            SYN-RECEIVED STATE

               If the RST bit is set

                  If this connection was initiated with a passive OPEN
                  (i.e., came from the LISTEN state), then return this
                  connection to LISTEN state and return.  The user need
                  not be informed.
                  If this connection was initiated with an active OPEN
                  (i.e., came from SYN-SENT state) then the connection
                  was refused, signal the user "connection refused".  In
                  either case, all segments on the retransmission queue
                  should be removed.  And in the active OPEN case, enter
                  the CLOSED state and delete the TCB, and return.

            ESTABLISHED
            FIN-WAIT-1
            FIN-WAIT-2
            CLOSE-WAIT

               If the RST bit is set then, any outstanding RECEIVEs and
               SEND should receive "reset" responses.  All segment
               queues should be flushed.  Users should also receive an
               unsolicited general "connection reset" signal.  Enter the
               CLOSED state, delete the TCB, and return.

            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT

               If the RST bit is set then, enter the CLOSED state,
               delete the TCB, and return.

         third check security and precedence

            SYN-RECEIVED

               If the security/compartment and precedence in the segment
               do not exactly match the security/compartment and
               precedence in the TCB then send a reset, and return.

            ESTABLISHED
            FIN-WAIT-1
            FIN-WAIT-2
            CLOSE-WAIT
            CLOSING
            LAST-ACK
            TIME-WAIT

               If the security/compartment and precedence in the segment
               do not exactly match the security/compartment and
               precedence in the TCB then send a reset, any outstanding
               RECEIVEs and SEND should receive "reset" responses.  All
               segment queues should be flushed.  Users should also
               receive an unsolicited general "connection reset" signal.
               Enter the CLOSED state, delete the TCB, and return.

            Note this check is placed following the sequence check to
            prevent a segment from an old connection between these ports
            with a different security or precedence from causing an
            abort of the current connection.
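The four-case acceptability test from the first step (sequence number check) reduces to a window-membership check evaluated modulo 2**32.  In the following minimal Python sketch the function names are hypothetical, the receive window is assumed to be smaller than 2**31, and the special allowance for valid ACKs, URGs and RSTs when RCV.WND is zero is deliberately left to the caller.

```python
MOD = 1 << 32  # sequence number arithmetic is modulo 2**32


def in_window(x: int, left: int, size: int) -> bool:
    """True if left =< x < left + size in sequence space (always False when size is 0)."""
    return (x - left) % MOD < size


def acceptable(seg_seq: int, seg_len: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Apply the four cases of the segment acceptability table."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return (seg_seq - rcv_nxt) % MOD == 0     # SEG.SEQ = RCV.NXT
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        return False                                  # data cannot fit a zero window
    last = seg_seq + seg_len - 1                      # last sequence number occupied
    return in_window(seg_seq, rcv_nxt, rcv_wnd) or in_window(last, rcv_nxt, rcv_wnd)
```

For example, with RCV.NXT = 100 and RCV.WND = 20, a 10-octet segment starting at 95 is acceptable because its last octet (104) falls inside the window, even though it starts before RCV.NXT; a 10-octet segment starting at 80 is not.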
         fourth, check the SYN bit,

            SYN-RECEIVED
            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE
            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT STATE

               TODO: need to incorporate RFC 1122 4.2.2.20(e) here

               If the SYN is in the window it is an error, send a reset,
               any outstanding RECEIVEs and SEND should receive "reset"
               responses, all segment queues should be flushed, the user
               should also receive an unsolicited general "connection
               reset" signal, enter the CLOSED state, delete the TCB,
               and return.

               If the SYN is not in the window this step would not be
               reached and an ACK would have been sent in the first step
               (sequence number check).

         fifth check the ACK field,

            if the ACK bit is off drop the segment and return

            if the ACK bit is on

               SYN-RECEIVED STATE

                  If SND.UNA < SEG.ACK =< SND.NXT then enter ESTABLISHED
                  state and continue processing with variables below set
                  to:

                     SND.WND <- SEG.WND
                     SND.WL1 <- SEG.SEQ
                     SND.WL2 <- SEG.ACK

                  If the segment acknowledgment is not acceptable, form
                  a reset segment,

                     <SEQ=SEG.ACK><CTL=RST>

                  and send it.

               ESTABLISHED STATE

                  If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <-
                  SEG.ACK.  Any segments on the retransmission queue
                  which are thereby entirely acknowledged are removed.
                  Users should receive positive acknowledgments for
                  buffers which have been SENT and fully acknowledged
                  (i.e., SEND buffer should be returned with "ok"
                  response).  If the ACK is a duplicate (SEG.ACK =<
                  SND.UNA), it can be ignored.  If the ACK acks
                  something not yet sent (SEG.ACK > SND.NXT) then send
                  an ACK, drop the segment, and return.

                  If SND.UNA =< SEG.ACK =< SND.NXT, the send window
                  should be updated.  If (SND.WL1 < SEG.SEQ or (SND.WL1
                  = SEG.SEQ and SND.WL2 =< SEG.ACK)), set SND.WND <-
                  SEG.WND, set SND.WL1 <- SEG.SEQ, and set SND.WL2 <-
                  SEG.ACK.
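The ESTABLISHED-state ACK processing just described, including the SND.WL1/SND.WL2 guard against updating the window from old segments, might look like the following Python sketch.  The `SendState` type, the function name, and the returned strings are hypothetical; all sequence values are assumed to be stored already reduced modulo 2**32, and removal of acknowledged segments from the retransmission queue is only indicated by a comment.

```python
from dataclasses import dataclass

MOD = 1 << 32
HALF = 1 << 31


def seq_lt(a: int, b: int) -> bool:
    """a < b in modulo-2**32 sequence space (assumes the values are < 2**31 apart)."""
    return 0 < (b - a) % MOD < HALF


def seq_le(a: int, b: int) -> bool:
    """a =< b in modulo-2**32 sequence space."""
    return (a - b) % MOD == 0 or seq_lt(a, b)


@dataclass
class SendState:
    """Hypothetical subset of the TCB send variables."""
    una: int  # SND.UNA
    nxt: int  # SND.NXT
    wnd: int  # SND.WND
    wl1: int  # SND.WL1
    wl2: int  # SND.WL2


def established_ack(s: SendState, seg_seq: int, seg_ack: int, seg_wnd: int) -> str:
    """Process SEG.ACK in ESTABLISHED; returns 'drop-and-ack' or 'continue'."""
    if seq_lt(s.nxt, seg_ack):
        return "drop-and-ack"  # SEG.ACK > SND.NXT: acks something not yet sent
    if seq_lt(s.una, seg_ack):
        s.una = seg_ack        # new data acknowledged; the caller would now remove
                               # fully acknowledged segments from the retransmission queue
    # A duplicate ACK changes nothing above, but SEG.ACK = SND.UNA
    # may still carry a window update.
    if seq_le(s.una, seg_ack):  # SND.UNA =< SEG.ACK =< SND.NXT
        if seq_lt(s.wl1, seg_seq) or (s.wl1 == seg_seq and seq_le(s.wl2, seg_ack)):
            s.wnd, s.wl1, s.wl2 = seg_wnd, seg_seq, seg_ack
    return "continue"
```

A segment whose SEG.SEQ is older than SND.WL1 can still advance SND.UNA, but it cannot overwrite SND.WND, which is the purpose of the check described in the text.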
                  Note that SND.WND is an offset from SND.UNA, that
                  SND.WL1 records the sequence number of the last
                  segment used to update SND.WND, and that SND.WL2
                  records the acknowledgment number of the last segment
                  used to update SND.WND.  The check here prevents using
                  old segments to update the window.

               FIN-WAIT-1 STATE

                  In addition to the processing for the ESTABLISHED
                  state, if our FIN is now acknowledged then enter
                  FIN-WAIT-2 and continue processing in that state.

               FIN-WAIT-2 STATE

                  In addition to the processing for the ESTABLISHED
                  state, if the retransmission queue is empty, the
                  user's CLOSE can be acknowledged ("ok") but do not
                  delete the TCB.

               CLOSE-WAIT STATE

                  Do the same processing as for the ESTABLISHED state.

               CLOSING STATE

                  In addition to the processing for the ESTABLISHED
                  state, if the ACK acknowledges our FIN then enter the
                  TIME-WAIT state, otherwise ignore the segment.

               LAST-ACK STATE

                  The only thing that can arrive in this state is an
                  acknowledgment of our FIN.  If our FIN is now
                  acknowledged, delete the TCB, enter the CLOSED state,
                  and return.

               TIME-WAIT STATE

                  The only thing that can arrive in this state is a
                  retransmission of the remote FIN.  Acknowledge it, and
                  restart the 2 MSL timeout.

         sixth, check the URG bit,

            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE

               If the URG bit is set, RCV.UP <- max(RCV.UP,SEG.UP), and
               signal the user that the remote side has urgent data if
               the urgent pointer (RCV.UP) is in advance of the data
               consumed.  If the user has already been signaled (or is
               still in the "urgent mode") for this continuous sequence
               of urgent data, do not signal the user again.

            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT

               This should not occur, since a FIN has been received from
               the remote side.  Ignore the URG.
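The per-state additions to ACK processing in the fifth step (FIN-WAIT-1 through TIME-WAIT) can be summarized as a small transition function.  The Python sketch below is hypothetical and non-authoritative: it assumes the ESTABLISHED-style processing has already run, that the caller can tell whether our FIN is now acknowledged and whether the retransmission queue is empty, and it reports side effects as strings rather than performing them.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    ESTABLISHED = "ESTABLISHED"
    FIN_WAIT_1 = "FIN-WAIT-1"
    FIN_WAIT_2 = "FIN-WAIT-2"
    CLOSE_WAIT = "CLOSE-WAIT"
    CLOSING = "CLOSING"
    LAST_ACK = "LAST-ACK"
    TIME_WAIT = "TIME-WAIT"
    CLOSED = "CLOSED"


def after_ack_processing(state: State, our_fin_acked: bool,
                         rexmt_queue_empty: bool) -> Tuple[State, List[str]]:
    """Apply the state-specific steps that follow ESTABLISHED-style ACK handling."""
    actions: List[str] = []
    if state is State.FIN_WAIT_1 and our_fin_acked:
        state = State.FIN_WAIT_2              # and continue processing in FIN-WAIT-2
    if state is State.FIN_WAIT_2:
        if rexmt_queue_empty:
            actions.append("ack user CLOSE")  # but do not delete the TCB
    elif state is State.CLOSING:
        if our_fin_acked:
            state = State.TIME_WAIT
        else:
            actions.append("ignore segment")
    elif state is State.LAST_ACK:
        if our_fin_acked:
            actions.append("delete TCB")
            state = State.CLOSED
    elif state is State.TIME_WAIT:
        # Only a retransmission of the remote FIN is expected here.
        actions += ["ack remote FIN", "restart 2 MSL timer"]
    # ESTABLISHED and CLOSE-WAIT need nothing beyond the common processing.
    return state, actions
```

Checking FIN-WAIT-1 first and then falling through to the FIN-WAIT-2 case mirrors the text's instruction to "continue processing in that state".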
         seventh, process the segment text,

            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE

               Once in the ESTABLISHED state, it is possible to deliver
               segment text to user RECEIVE buffers.  Text from segments
               can be moved into buffers until either the buffer is full
               or the segment is empty.  If the segment empties and
               carries a PUSH flag, then the user is informed, when the
               buffer is returned, that a PUSH has been received.

               When the TCP takes responsibility for delivering the data
               to the user it must also acknowledge the receipt of the
               data.

               Once the TCP takes responsibility for the data it
               advances RCV.NXT over the data accepted, and adjusts
               RCV.WND as appropriate to the current buffer
               availability.  The total of RCV.NXT and RCV.WND should
               not be reduced.

               A TCP MAY send an ACK segment acknowledging RCV.NXT when
               a valid segment arrives that is in the window but not at
               the left window edge.

               Please note the window management suggestions in section
               3.7.

               Send an acknowledgment of the form:

                  <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

               This acknowledgment should be piggybacked on a segment
               being transmitted if possible without incurring undue
               delay.

            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT STATE

               This should not occur, since a FIN has been received from
               the remote side.  Ignore the segment text.

         eighth, check the FIN bit,

            Do not process the FIN if the state is CLOSED, LISTEN or
            SYN-SENT since the SEG.SEQ cannot be validated; drop the
            segment and return.

            If the FIN bit is set, signal the user "connection closing"
            and return any pending RECEIVEs with same message, advance
            RCV.NXT over the FIN, and send an acknowledgment for the
            FIN.  Note that FIN implies PUSH for any segment text not
            yet delivered to the user.

               SYN-RECEIVED STATE
               ESTABLISHED STATE

                  Enter the CLOSE-WAIT state.
               FIN-WAIT-1 STATE

                  If our FIN has been ACKed (perhaps in this segment),
                  then enter TIME-WAIT, start the time-wait timer, turn
                  off the other timers; otherwise enter the CLOSING
                  state.

               FIN-WAIT-2 STATE

                  Enter the TIME-WAIT state.  Start the time-wait timer,
                  turn off the other timers.

               CLOSE-WAIT STATE

                  Remain in the CLOSE-WAIT state.

               CLOSING STATE

                  Remain in the CLOSING state.

               LAST-ACK STATE

                  Remain in the LAST-ACK state.

               TIME-WAIT STATE

                  Remain in the TIME-WAIT state.  Restart the 2 MSL
                  time-wait timeout.

            and return.

   USER TIMEOUT

      USER TIMEOUT

         For any state if the user timeout expires, flush all queues,
         signal the user "error: connection aborted due to user timeout"
         in general and for any outstanding calls, delete the TCB, enter
         the CLOSED state and return.

      RETRANSMISSION TIMEOUT

         For any state if the retransmission timeout expires on a
         segment in the retransmission queue, send the segment at the
         front of the retransmission queue again, reinitialize the
         retransmission timer, and return.

      TIME-WAIT TIMEOUT

         If the time-wait timeout expires on a connection delete the
         TCB, enter the CLOSED state and return.

3.11. Glossary

   1822
      BBN Report 1822, "The Specification of the Interconnection of a
      Host and an IMP".  The specification of the interface between a
      host and the ARPANET.

   ACK
      A control bit (acknowledge) occupying no sequence space, which
      indicates that the acknowledgment field of this segment specifies
      the next sequence number the sender of this segment is expecting
      to receive, hence acknowledging receipt of all previous sequence
      numbers.

   ARPANET message
      The unit of transmission between a host and an IMP in the
      ARPANET.  The maximum size is about 1012 octets (8096 bits).
   ARPANET packet
      A unit of transmission used internally in the ARPANET between
      IMPs.  The maximum size is about 126 octets (1008 bits).

   connection
      A logical communication path identified by a pair of sockets.

   datagram
      A message sent in a packet switched computer communications
      network.

   Destination Address
      The destination address, usually the network and host
      identifiers.

   FIN
      A control bit (finis) occupying one sequence number, which
      indicates that the sender will send no more data or control
      occupying sequence space.

   fragment
      A portion of a logical unit of data, in particular an internet
      fragment is a portion of an internet datagram.

   FTP
      A file transfer protocol.

   header
      Control information at the beginning of a message, segment,
      fragment, packet or block of data.

   host
      A computer.  In particular a source or destination of messages
      from the point of view of the communication network.

   Identification
      An Internet Protocol field.  This identifying value assigned by
      the sender aids in assembling the fragments of a datagram.

   IMP
      The Interface Message Processor, the packet switch of the
      ARPANET.

   internet address
      A source or destination address specific to the host level.

   internet datagram
      The unit of data exchanged between an internet module and the
      higher level protocol together with the internet header.

   internet fragment
      A portion of the data of an internet datagram with an internet
      header.

   IP
      Internet Protocol.

   IRS
      The Initial Receive Sequence number.  The first sequence number
      used by the remote sender on a connection.

   ISN
      The Initial Sequence Number.  The first sequence number used on
      a connection (either ISS or IRS).  Selected in a way that is
      unique within a given period of time and is unpredictable to
      attackers.
   ISS
      The Initial Send Sequence number.  The first sequence number used
      by the sender on a connection.

   leader
      Control information at the beginning of a message or block of
      data.  In particular, in the ARPANET, the control information on
      an ARPANET message at the host-IMP interface.

   left sequence
      This is the next sequence number to be acknowledged by the data
      receiving TCP (or the lowest currently unacknowledged sequence
      number) and is sometimes referred to as the left edge of the send
      window.

   local packet
      The unit of transmission within a local network.

   module
      An implementation, usually in software, of a protocol or other
      procedure.

   MSL
      Maximum Segment Lifetime, the time a TCP segment can exist in the
      internetwork system.  Arbitrarily defined to be 2 minutes.

   octet
      An eight bit byte.

   Options
      An Option field may contain several options, and each option may
      be several octets in length.  The options are used primarily in
      testing situations; for example, to carry timestamps.  Both the
      Internet Protocol and TCP provide for options fields.

   packet
      A package of data with a header which may or may not be
      logically complete.  More often a physical packaging than a
      logical packaging of data.

   port
      The portion of a socket that specifies which logical input or
      output channel of a process is associated with the data.

   process
      A program in execution.  A source or destination of data from the
      point of view of the TCP or other host-to-host protocol.

   PUSH
      A control bit occupying no sequence space, indicating that this
      segment contains data that must be pushed through to the
      receiving user.
   RCV.NXT
      receive next sequence number

   RCV.UP
      receive urgent pointer

   RCV.WND
      receive window

   receive next sequence number
      This is the next sequence number the local TCP is expecting to
      receive.

   receive window
      This represents the sequence numbers the local (receiving) TCP is
      willing to receive.  Thus, the local TCP considers that segments
      overlapping the range RCV.NXT to RCV.NXT + RCV.WND - 1 carry
      acceptable data or control.  Segments containing sequence numbers
      entirely outside of this range are considered duplicates and
      discarded.

   RST
      A control bit (reset), occupying no sequence space, indicating
      that the receiver should delete the connection without further
      interaction.  The receiver can determine, based on the sequence
      number and acknowledgment fields of the incoming segment, whether
      it should honor the reset command or ignore it.  In no case does
      receipt of a segment containing RST give rise to a RST in
      response.

   RTP
      Real Time Protocol: A host-to-host protocol for communication of
      time critical information.

   SEG.ACK
      segment acknowledgment

   SEG.LEN
      segment length

   SEG.PRC
      segment precedence value

   SEG.SEQ
      segment sequence

   SEG.UP
      segment urgent pointer field

   SEG.WND
      segment window field

   segment
      A logical unit of data, in particular a TCP segment is the unit
      of data transferred between a pair of TCP modules.

   segment acknowledgment
      The sequence number in the acknowledgment field of the arriving
      segment.

   segment length
      The amount of sequence number space occupied by a segment,
      including any controls which occupy sequence space.

   segment sequence
      The number in the sequence field of the arriving segment.

   send sequence
      This is the next sequence number the local (sending) TCP will use
      on the connection.
      It is initially selected from an initial sequence number curve
      (ISN) and is incremented for each octet of data or sequenced
      control transmitted.

   send window
      This represents the sequence numbers which the remote (receiving)
      TCP is willing to receive.  It is the value of the window field
      specified in segments from the remote (data receiving) TCP.  The
      range of new sequence numbers which may be emitted by a TCP lies
      between SND.NXT and SND.UNA + SND.WND - 1.  (Retransmissions of
      sequence numbers between SND.UNA and SND.NXT are expected, of
      course.)

   SND.NXT
      send sequence

   SND.UNA
      left sequence

   SND.UP
      send urgent pointer

   SND.WL1
      segment sequence number at last window update

   SND.WL2
      segment acknowledgment number at last window update

   SND.WND
      send window

   socket
      An address which specifically includes a port identifier, that
      is, the concatenation of an Internet Address with a TCP port.

   Source Address
      The source address, usually the network and host identifiers.

   SYN
      A control bit in the incoming segment, occupying one sequence
      number, used at the initiation of a connection, to indicate where
      the sequence numbering will start.

   TCB
      Transmission control block, the data structure that records the
      state of a connection.

   TCB.PRC
      The precedence of the connection.

   TCP
      Transmission Control Protocol: A host-to-host protocol for
      reliable communication in internetwork environments.

   TOS
      Type of Service, an Internet Protocol field.

   Type of Service
      An Internet Protocol field which indicates the type of service
      for this internet fragment.
   URG
      A control bit (urgent), occupying no sequence space, used to
      indicate that the receiving user should be notified to do urgent
      processing as long as there is data to be consumed with sequence
      numbers less than the value indicated in the urgent pointer.

   urgent pointer
      A control field meaningful only when the URG bit is on.  This
      field communicates the value of the urgent pointer which
      indicates the data octet associated with the sending user's
      urgent call.

4. Changes from RFC 793

   This document obsoletes RFC 793 as well as RFC 6093 and 6528, which
   updated 793.  In all cases, only the normative protocol
   specification and requirements have been incorporated into this
   document, and the informational text with background and rationale
   has not been carried in.  The informational content of those
   documents is still valuable in learning about and understanding
   TCP, and they are valid Informational references, even though their
   normative content has been incorporated into this document.

   The main body of this document was adapted from RFC 793's Section
   3, titled "FUNCTIONAL SPECIFICATION", with an attempt to keep
   formatting and layout as close as possible.

   The applicable RFC Errata that have been reported and either
   accepted or held for an update to RFC 793 were incorporated (Errata
   IDs: 573, 574, 700, 701, 1283, 1561, 1562, 1564, 1565, 1571, 1572,
   2296, 2297, 2298, 2748, 2749, 2934, 3213, 3300, 3301).  Some errata
   were not applicable due to other changes (Errata IDs: 572, 575,
   1569, 3602).  TODO: 3305

   Changes to the specification of the Urgent Pointer described in RFC
   1122 and 6093 were incorporated.  See RFC 6093 for detailed
   discussion of why these changes were necessary.

   The discussion of the RTO from RFC 793 was updated to refer to RFC
   6298.
   The RFC 1122 text on the RTO originally replaced the 793 text;
   however, RFC 2988 should have updated 1122, and has subsequently
   been obsoleted by 6298.

   RFC 1122 contains a collection of other changes and clarifications
   to RFC 793.  The normative items impacting the protocol have been
   incorporated here, though some historically useful implementation
   advice and informative discussion from RFC 1122 is not included
   here.

   RFC 1122 contains more than just TCP requirements, so this document
   can't obsolete RFC 1122 entirely.  It is only marked as "updating"
   1122; however, it should be understood to effectively obsolete all
   of the RFC 1122 material on TCP.

   The more secure Initial Sequence Number generation algorithm from
   RFC 6528 was incorporated.  See RFC 6528 for discussion of the
   attacks that this mitigates, as well as advice on selecting PRF
   algorithms and managing secret key data.

   A note based on RFC 6429 was added to explicitly clarify that system
   resource management concerns allow connection resources to be
   reclaimed.  RFC 6429 is obsoleted in the sense that this
   clarification has been reflected in this update to the base TCP
   specification now.

   RFC EDITOR'S NOTE: the content below is for detailed change tracking
   and planning, and not to be included with the final revision of the
   document.

   This document started as draft-eddy-rfc793bis-00, which was merely a
   proposal and rough plan for updating RFC 793.

   The -01 revision of draft-eddy-rfc793bis incorporates the content of
   RFC 793 Section 3 titled "FUNCTIONAL SPECIFICATION".  Other content
   from RFC 793 has not been incorporated.  The -01 revision of this
   document makes some minor formatting changes to the RFC 793 content
   in order to convert the content into XML2RFC format and account for
   left-out parts of RFC 793.
   For instance, figure numbering differs and some indentation is not
   exactly the same.

   The -02 revision of draft-eddy-rfc793bis incorporates errata that
   have been verified:

      Errata ID 573: Reported by Bob Braden (note: This errata basically
         is just a reminder that RFC 1122 updates 793.  Some of the
         associated changes are left pending to a separate revision
         that incorporates 1122.  Bob's mention of PUSH in 793 section
         2.8 was not applicable here because that section was not part
         of the "functional specification".  The 1122 text on the
         retransmission timeout has also been updated by subsequent
         RFCs, so the change here deviates from Bob's suggestion to
         apply the 1122 text.)
      Errata ID 574: Reported by Yin Shuming
      Errata ID 700: Reported by Yin Shuming
      Errata ID 701: Reported by Yin Shuming
      Errata ID 1283: Reported by Pei-chun Cheng
      Errata ID 1561: Reported by Constantin Hagemeier
      Errata ID 1562: Reported by Constantin Hagemeier
      Errata ID 1564: Reported by Constantin Hagemeier
      Errata ID 1565: Reported by Constantin Hagemeier
      Errata ID 1571: Reported by Constantin Hagemeier
      Errata ID 1572: Reported by Constantin Hagemeier
      Errata ID 2296: Reported by Vishwas Manral
      Errata ID 2297: Reported by Vishwas Manral
      Errata ID 2298: Reported by Vishwas Manral
      Errata ID 2748: Reported by Mykyta Yevstifeyev
      Errata ID 2749: Reported by Mykyta Yevstifeyev
      Errata ID 2934: Reported by Constantin Hagemeier
      Errata ID 3213: Reported by EungJun Yi
      Errata ID 3300: Reported by Botong Huang
      Errata ID 3301: Reported by Botong Huang
      Note: Some verified errata were not used in this update, as they
         relate to sections of RFC 793 elided from this document.
         These include Errata ID 572, 575, and 1569.
      Note: Errata ID 3602 was not applied in this revision as it is
         duplicative of the 1122 corrections.
      Errata ID 3305 is currently reported and needs to be verified,
         held, or rejected by the ADs; it addresses the same issue as
         draft-gont-tcpm-tcp-seq-validation, and no attempt was made to
         apply it to this document.

   Not related to RFC 793 content, this revision also makes small
   tweaks to the introductory text, fixes indentation of the
   pseudoheader diagram, and notes that the Security Considerations
   should also include privacy, when this section is written.

   The -03 revision of draft-eddy-rfc793bis revises all discussion of
   the urgent pointer in order to comply with RFC 6093, 1122, and 1011.
   Since 1122 held requirements on the urgent pointer, the full list of
   requirements was brought into an appendix of this document, so that
   it can be updated as needed.

   The -04 revision of draft-eddy-rfc793bis includes the ISN generation
   changes from RFC 6528.

   The -05 revision of draft-eddy-rfc793bis incorporates MSS
   requirements and definitions from RFC 879, 1122, and 6691, as well
   as option-handling requirements from RFC 1122.

   The -00 revision of draft-ietf-tcpm-rfc793bis incorporates several
   additional clarifications and updates to the section on
   segmentation, many of which are based on feedback from Joe Touch
   improving on the initial text in the previous revision.

   The -01 revision incorporates the change to Reserved bits due to
   ECN, as well as many other changes that come from RFC 1122.

   The -02 revision has small formatting modifications in order to
   address xml2rfc warnings about long lines.  It was a quick update to
   avoid document expiration.  TCPM working group discussion in 2015
   also indicated that we should not try to add sections on
   implementation advice or similar non-normative information.
   The -03 revision incorporates more content from RFC 1122: Passive
   OPEN Calls, Time-To-Live, Multihoming, IP Options, ICMP messages,
   Data Communications, When to Send Data, When to Send a Window
   Update, Managing the Window, Probing Zero Windows, When to Send an
   ACK Segment.  The section on data communications was re-organized
   into clearer subsections (previously headings were embedded in the
   793 text), and window management advice from 793 was removed (as
   reviewed by TCPM working group) in favor of the 1122 additions on
   SWS, ZWP, and related topics.

   The -04 revision includes reference to RFC 6429 on the ZWP
   condition, RFC1122 material on TCP Connection Failures, TCP
   Keep-Alives, Acknowledging Queued Segments, and Remote Address
   Validation.  RTO computation is referenced from RFC 6298 rather than
   RFC 1122.

   The -05 revision includes the requirement to implement TCP
   congestion control with recommendation to implement ECN, the RFC
   6633 update to 1122, which changed the requirement on responding to
   source quench ICMP messages, and discussion of ICMP (and ICMPv6)
   soft and hard errors per RFC 5461 (ICMPv6 handling for TCP doesn't
   seem to be mentioned elsewhere in standards track).

   TODO list of other planned changes (these can be added to or made
   more specific, as the document proceeds):

   1.  mention 5961 state machine option
   2.  mention 6161 (reducing TIME-WAIT)
   3.  TOS material does not take DSCP changes into account
   4.  there is inconsistency between use of SYN_RCVD and SYN-RECEIVED
       in diagrams and text in various places
   5.  make sure that clarifications in RFC 1011 are captured

   TODO list of other potential changes, if there is TCPM consensus:

   1.  see draft-gont-tcpm-tcp-seccomp-prec
   2.  incorporate Fernando's new number-checking fixes (if past the
       IESG in time)
   3.
       look at Tony Sabatini suggestion for describing DO field
   4.  clearly specify treatment of reserved bits (see TCPM thread on
       EDO draft April 25, 2014)
   5.  look at possible mention of draft-minshall-nagle (e.g., as in
       Linux)
   6.  per discussion with Joe Touch (TAPS list, 6/20/2015), the
       description of the API could be revisited

5. IANA Considerations

   This memo includes no request to IANA.  Existing IANA registries for
   TCP parameters are sufficient.

   TODO: check whether entries pointing to 793 and other documents
   obsoleted by this one should be updated to point to this one
   instead.

6. Security and Privacy Considerations

   TODO

   See RFC 6093 [18] for discussion of security considerations related
   to the urgent pointer field.

   Editor's Note: Scott Brim mentioned that this should include a
   PERPASS/privacy review.

7. Acknowledgements

   This document is largely a revision of RFC 793, of which Jon Postel
   was the editor.  Due to his excellent work, it was able to last for
   three decades before we felt the need to revise it.

   Andre Oppermann was a contributor and helped to edit the first
   revision of this document.

   We are thankful for the assistance of the IETF TCPM working group
   chairs:

      Michael Scharf
      Yoshifumi Nishida
      Pasi Sarolahti

   During early discussion of this work on the TCPM mailing list, and
   at the IETF 88 meeting in Vancouver, helpful comments, critiques,
   and reviews were received from (listed alphabetically): David
   Borman, Yuchung Cheng, Martin Duke, Kevin Lahey, Kevin Mason, Matt
   Mathis, Hagen Paul Pfeifer, Anthony Sabatini, Joe Touch, Reji
   Varghese, Lloyd Wood, and Alex Zimmermann.

   This document includes content from errata that were reported by
   (listed chronologically): Yin Shuming, Bob Braden, Morris M.
Keesan, 3932 Pei-chun Cheng, Constantin Hagemeier, Vishwas Manral, Mykyta 3933 Yevstifeyev, EungJun Yi, Botong Huang. 3935 8. References 3937 8.1. Normative References 3939 [1] Postel, J., "Internet Protocol", STD 5, RFC 791, 3940 DOI 10.17487/RFC0791, September 1981, 3941 . 3943 [2] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 3944 DOI 10.17487/RFC1191, November 1990, 3945 . 3947 [3] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 3948 for IP version 6", RFC 1981, DOI 10.17487/RFC1981, August 3949 1996, . 3951 [4] Bradner, S., "Key words for use in RFCs to Indicate 3952 Requirement Levels", BCP 14, RFC 2119, 3953 DOI 10.17487/RFC2119, March 1997, 3954 . 3956 [5] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 3957 RFC 2675, DOI 10.17487/RFC2675, August 1999, 3958 . 3960 [6] Lahey, K., "TCP Problems with Path MTU Discovery", 3961 RFC 2923, DOI 10.17487/RFC2923, September 2000, 3962 . 3964 [7] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 3965 of Explicit Congestion Notification (ECN) to IP", 3966 RFC 3168, DOI 10.17487/RFC3168, September 2001, 3967 . 3969 [8] Paxson, V., Allman, M., Chu, J., and M. Sargent, 3970 "Computing TCP's Retransmission Timer", RFC 6298, 3971 DOI 10.17487/RFC6298, June 2011, 3972 . 3974 [9] Gont, F., "Deprecation of ICMP Source Quench Messages", 3975 RFC 6633, DOI 10.17487/RFC6633, May 2012, 3976 . 3978 8.2. Informative References 3980 [10] Postel, J., "Transmission Control Protocol", STD 7, 3981 RFC 793, DOI 10.17487/RFC0793, September 1981, 3982 . 3984 [11] Nagle, J., "Congestion Control in IP/TCP Internetworks", 3985 RFC 896, DOI 10.17487/RFC0896, January 1984, 3986 . 3988 [12] Braden, R., Ed., "Requirements for Internet Hosts - 3989 Communication Layers", STD 3, RFC 1122, 3990 DOI 10.17487/RFC1122, October 1989, 3991 . 3993 [13] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 3994 Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007, 3995 . 
[14] Culley, P., Elzur, U., Recio, R., Bailey, S., and J. Carrier, "Marker PDU Aligned Framing for TCP Specification", RFC 5044, DOI 10.17487/RFC5044, October 2007.

[15] Gont, F., "TCP's Reaction to Soft Errors", RFC 5461, DOI 10.17487/RFC5461, February 2009.

[16] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

[17] Sandlund, K., Pelletier, G., and L-E. Jonsson, "The RObust Header Compression (ROHC) Framework", RFC 5795, DOI 10.17487/RFC5795, March 2010.

[18] Gont, F. and A. Yourtchenko, "On the Implementation of the TCP Urgent Mechanism", RFC 6093, DOI 10.17487/RFC6093, January 2011.

[19] Bashyam, M., Jethanandani, M., and A. Ramaiah, "TCP Sender Clarification for Persist Condition", RFC 6429, DOI 10.17487/RFC6429, December 2011.

[20] Gont, F. and S. Bellovin, "Defending against Sequence Number Attacks", RFC 6528, DOI 10.17487/RFC6528, February 2012.

[21] Borman, D., "TCP Options and Maximum Segment Size (MSS)", RFC 6691, DOI 10.17487/RFC6691, July 2012.

[22] Borman, D., Braden, B., Jacobson, V., and R. Scheffenegger, Ed., "TCP Extensions for High Performance", RFC 7323, DOI 10.17487/RFC7323, September 2014.

[23] Duke, M., Braden, R., Eddy, W., Blanton, E., and A. Zimmermann, "A Roadmap for Transmission Control Protocol (TCP) Specification Documents", RFC 7414, DOI 10.17487/RFC7414, February 2015.

[24] Fairhurst, G. and M. Welzl, "The Benefits of Using Explicit Congestion Notification (ECN)", RFC 8087, DOI 10.17487/RFC8087, March 2017.

Appendix A. TCP Requirement Summary

This section is adapted from RFC 1122.
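Several rows of the requirement summary (the IPv4 and IPv6 Send-MSS defaults and "Calculate effective send seg size") combine into a single calculation. The Python sketch below illustrates it under stated assumptions: it uses the RFC 1122 (Section 4.2.2.6) formula Eff.snd.MSS = min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize, it takes MMS_S to be the MTU minus the fixed IP header, and the names effective_send_mss, DEFAULT_SEND_MSS, and FIXED_TCP_HEADER are hypothetical, not part of any real TCP implementation or API.

```python
# Minimal sketch of the effective send MSS rule (RFC 1122, Section 4.2.2.6):
#   Eff.snd.MSS = min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize
# All names here are hypothetical illustrations, not a real TCP API.
from typing import Optional

FIXED_TCP_HEADER = 20  # octets in a TCP header with no options

# Send-MSS to assume when the peer sent no MSS option (per the table).
DEFAULT_SEND_MSS = {4: 536, 6: 1220}


def effective_send_mss(
    mms_s: int,
    tcp_options_size: int = 0,
    ip_options_size: int = 0,
    peer_mss: Optional[int] = None,
    ip_version: int = 4,
) -> int:
    """Return the largest data payload, in octets, to put in one segment.

    mms_s: largest transport-layer message IP can send; assumed here to be
        the MTU minus the fixed IP header (20 octets IPv4, 40 octets IPv6).
    tcp_options_size: octets of TCP options carried in each data segment.
    ip_options_size: octets of IP options TCP passes to IP.
    peer_mss: value from the peer's MSS option, or None if none was received.
    """
    send_mss = peer_mss if peer_mss is not None else DEFAULT_SEND_MSS[ip_version]
    tcp_hdr_size = FIXED_TCP_HEADER + tcp_options_size
    # The "+ 20" turns the MSS (which excludes the fixed TCP header)
    # into a transport-layer message size comparable with MMS_S.
    return min(send_mss + FIXED_TCP_HEADER, mms_s) - tcp_hdr_size - ip_options_size


# Ethernet path, IPv4: MMS_S = 1500 - 20; peer MSS 1460;
# 12 octets of TCP options (e.g., Timestamps plus NOP padding).
print(effective_send_mss(1480, tcp_options_size=12, peer_mss=1460))  # 1448
# No MSS option received: fall back to the IPv4 / IPv6 defaults.
print(effective_send_mss(1480))                # 536
print(effective_send_mss(1460, ip_version=6))  # 1220
```

The "+ 20" term matters because the MSS option value excludes the fixed TCP header while MMS_S includes it; any TCP options sent in each segment then come out of the payload, which is why a peer MSS of 1460 with 12 octets of options leaves 1448 octets of data.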
TODO: this needs to be seriously redone: the table should use 793bis section numbers instead of RFC 1122 ones, the RFC1122 heading should be removed, and all RFC 1122 requirements need to be reflected in the 793bis text.

TODO: NOTE that PMTUD+PLPMTUD is not included in this table of recommendations.

                                                 |        | | | |S| |
                                                 |        | | | |H| |F
                                                 |        | | | |O|M|o
                                                 |        | |S| |U|U|o
                                                 |        | |H| |L|S|t
                                                 |        |M|O| |D|T|n
                                                 |        |U|U|M| | |o
                                                 |        |S|L|A|N|N|t
                                                 |RFC1122 |T|D|Y|O|O|e
FEATURE                                          |SECTION | | | |T|T|
-------------------------------------------------|--------|-|-|-|-|-|--
                                                 |        | | | | | |
Push flag                                        |        | | | | | |
  Aggregate or queue un-pushed data              |4.2.2.2 | | |x| | |
  Sender collapse successive PSH flags           |4.2.2.2 | |x| | | |
  SEND call can specify PUSH                     |4.2.2.2 | | |x| | |
    If cannot: sender buffer indefinitely        |4.2.2.2 | | | | |x|
    If cannot: PSH last segment                  |4.2.2.2 |x| | | | |
  Notify receiving ALP of PSH                    |4.2.2.2 | | |x| | |1
  Send max size segment when possible            |4.2.2.2 | |x| | | |
                                                 |        | | | | | |
Window                                           |        | | | | | |
  Treat as unsigned number                       |4.2.2.3 |x| | | | |
  Handle as 32-bit number                        |4.2.2.3 | |x| | | |
  Shrink window from right                       |4.2.2.16| | | |x| |
  Robust against shrinking window                |4.2.2.16|x| | | | |
  Receiver's window closed indefinitely          |4.2.2.17| | |x| | |
  Sender probe zero window                       |4.2.2.17|x| | | | |
    First probe after RTO                        |4.2.2.17| |x| | | |
    Exponential backoff                          |4.2.2.17| |x| | | |
  Allow window stay zero indefinitely            |4.2.2.17|x| | | | |
  Sender timeout OK conn with zero wind          |4.2.2.17| | | | |x|
                                                 |        | | | | | |
Urgent Data                                      |        | | | | | |
  Pointer indicates first non-urgent octet       |4.2.2.4 |x| | | | |
  Arbitrary length urgent data sequence          |4.2.2.4 |x| | | | |
  Inform ALP asynchronously of urgent data       |4.2.2.4 |x| | | | |1
  ALP can learn if/how much urgent data Q'd      |4.2.2.4 |x| | | | |1
                                                 |        | | | | | |
TCP Options                                      |        | | | | | |
  Receive TCP option in any segment              |4.2.2.5 |x| | | | |
  Ignore unsupported options                     |4.2.2.5 |x| | | | |
  Cope with illegal option length                |4.2.2.5 |x| | | | |
  Implement sending & receiving MSS option       |4.2.2.6 |x| | | | |
  IPv4 Send MSS option unless 536                |4.2.2.6 | |x| | | |
  IPv6 Send MSS option unless 1220               | N/A    | |x| | | |
  Send MSS option always                         |4.2.2.6 | | |x| | |
  IPv4 Send-MSS default is 536                   |4.2.2.6 |x| | | | |
  IPv6 Send-MSS default is 1220                  | N/A    |x| | | | |
  Calculate effective send seg size              |4.2.2.6 |x| | | | |
  MSS accounts for varying MTU                   | N/A    | |x| | | |
                                                 |        | | | | | |
TCP Checksums                                    |        | | | | | |
  Sender compute checksum                        |4.2.2.7 |x| | | | |
  Receiver check checksum                        |4.2.2.7 |x| | | | |
                                                 |        | | | | | |
ISN Selection                                    |        | | | | | |
  Include a clock-driven ISN generator component |4.2.2.9 |x| | | | |
  Secure ISN generator with a PRF component      | N/A    | |x| | | |
                                                 |        | | | | | |
Opening Connections                              |        | | | | | |
  Support simultaneous open attempts             |4.2.2.10|x| | | | |
  SYN-RCVD remembers last state                  |4.2.2.11|x| | | | |
  Passive Open call interfere with others        |4.2.2.18| | | | |x|
  Function: simultan. LISTENs for same port      |4.2.2.18|x| | | | |
  Ask IP for src address for SYN if necc.        |4.2.3.7 |x| | | | |
    Otherwise, use local addr of conn.           |4.2.3.7 |x| | | | |
  OPEN to broadcast/multicast IP Address         |4.2.3.14| | | | |x|
  Silently discard seg to bcast/mcast addr       |4.2.3.14|x| | | | |
                                                 |        | | | | | |
Closing Connections                              |        | | | | | |
  RST can contain data                           |4.2.2.12| |x| | | |
  Inform application of aborted conn             |4.2.2.13|x| | | | |
  Half-duplex close connections                  |4.2.2.13| | |x| | |
    Send RST to indicate data lost               |4.2.2.13| |x| | | |
  In TIME-WAIT state for 2MSL seconds            |4.2.2.13|x| | | | |
    Accept SYN from TIME-WAIT state              |4.2.2.13| | |x| | |
                                                 |        | | | | | |
Retransmissions                                  |        | | | | | |
  Jacobson Slow Start algorithm                  |4.2.2.15|x| | | | |
  Jacobson Congestion-Avoidance algorithm        |4.2.2.15|x| | | | |
  Retransmit with same IP ident                  |4.2.2.15| | |x| | |
  Karn's algorithm                               |4.2.3.1 |x| | | | |
  Jacobson's RTO estimation alg.                 |4.2.3.1 |x| | | | |
  Exponential backoff                            |4.2.3.1 |x| | | | |
  SYN RTO calc same as data                      |4.2.3.1 | |x| | | |
  Recommended initial values and bounds          |4.2.3.1 | |x| | | |
                                                 |        | | | | | |
Generating ACKs:                                 |        | | | | | |
  Queue out-of-order segments                    |4.2.2.20| |x| | | |
  Process all Q'd before send ACK                |4.2.2.20|x| | | | |
  Send ACK for out-of-order segment              |4.2.2.21| | |x| | |
  Delayed ACKs                                   |4.2.3.2 | |x| | | |
    Delay < 0.5 seconds                          |4.2.3.2 |x| | | | |
    Every 2nd full-sized segment ACK'd           |4.2.3.2 |x| | | | |
  Receiver SWS-Avoidance Algorithm               |4.2.3.3 |x| | | | |
                                                 |        | | | | | |
Sending data                                     |        | | | | | |
  Configurable TTL                               |4.2.2.19|x| | | | |
  Sender SWS-Avoidance Algorithm                 |4.2.3.4 |x| | | | |
  Nagle algorithm                                |4.2.3.4 | |x| | | |
    Application can disable Nagle algorithm      |4.2.3.4 |x| | | | |
                                                 |        | | | | | |
Connection Failures:                             |        | | | | | |
  Negative advice to IP on R1 retxs              |4.2.3.5 |x| | | | |
  Close connection on R2 retxs                   |4.2.3.5 |x| | | | |
  ALP can set R2                                 |4.2.3.5 |x| | | | |1
  Inform ALP of R1<=retxs inform ALP             |4.2.3.9 | |x| | | |
  Dest. Unreach (0,1,5) => abort conn            |4.2.3.9 | | | | |x|
  Dest. Unreach (2-4) => abort conn              |4.2.3.9 | |x| | | |
  Source Quench => silent discard                |4.2.3.9 | |x| | | |
  Time Exceeded => tell ALP, don't abort         |4.2.3.9 | |x| | | |
  Param Problem => tell ALP, don't abort         |4.2.3.9 | |x| | | |
                                                 |        | | | | | |
Address Validation                               |        | | | | | |
  Reject OPEN call to invalid IP address         |4.2.3.10|x| | | | |
  Reject SYN from invalid IP address             |4.2.3.10|x| | | | |
  Silently discard SYN to bcast/mcast addr       |4.2.3.10|x| | | | |
                                                 |        | | | | | |
TCP/ALP Interface Services                       |        | | | | | |
  Error Report mechanism                         |4.2.4.1 |x| | | | |
  ALP can disable Error Report Routine           |4.2.4.1 | |x| | | |
  ALP can specify TOS for sending                |4.2.4.2 |x| | | | |
    Passed unchanged to IP                       |4.2.4.2 | |x| | | |
  ALP can change TOS during connection           |4.2.4.2 | |x| | | |
  Pass received TOS up to ALP                    |4.2.4.2 | | |x| | |
  FLUSH call                                     |4.2.4.3 | | |x| | |
  Optional local IP addr parm. in OPEN           |4.2.4.4 |x| | | | |
-------------------------------------------------|--------|-|-|-|-|-|--

FOOTNOTES: (1) "ALP" means Application-Layer program.

Author's Address

   Wesley M. Eddy (editor)
   MTI Systems
   US

   Email: wes@mti-systems.com