TCP Maintenance (TCPM)                                         D. Borman
Internet-Draft                                       Quantum Corporation
Intended status: Standards Track                               B. Braden
Expires: February 7, 2014                         University of Southern
                                                              California
                                                             V. Jacobson
                                                            Google, Inc.
                                                   R. Scheffenegger, Ed.
                                                            NetApp, Inc.
                                                          August 6, 2013


                  TCP Extensions for High Performance
                       draft-ietf-tcpm-1323bis-15

Abstract

   This document specifies a set of TCP extensions to improve
   performance over paths with a large bandwidth * delay product and to
   provide reliable operation over very high-speed paths.  It defines
   TCP options for scaled windows and timestamps.  The timestamps are
   used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
   and PAWS (Protection Against Wrapped Sequences).

   This document obsoletes RFC 1323 and describes changes from it.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 7, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  TCP Performance
     1.2.  TCP Reliability
     1.3.  Using TCP options
     1.4.  Terminology
   2.  TCP Window Scale Option
     2.1.  Introduction
     2.2.  Window Scale Option
     2.3.  Using the Window Scale Option
     2.4.  Addressing Window Retraction
   3.  TCP Timestamps option
     3.1.  Introduction
     3.2.  Timestamps option
     3.3.  The RTTM Mechanism
     3.4.  Updating the RTO value
     3.5.  Which Timestamp to Echo
   4.  PAWS - Protection Against Wrapped Sequence Numbers
     4.1.  Introduction
     4.2.  The PAWS Mechanism
     4.3.  Basic PAWS Algorithm
     4.4.  Timestamp Clock
     4.5.  Outdated Timestamps
     4.6.  Header Prediction
     4.7.  IP Fragmentation
     4.8.  Duplicates from Earlier Incarnations of Connection
   5.  Conclusions and Acknowledgements
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Implementation Suggestions
   Appendix B.  Duplicates from Earlier Connection Incarnations
     B.1.  System Crash with Loss of State
     B.2.  Closing and Reopening a Connection
   Appendix C.  Summary of Notation
   Appendix D.  Event Processing Summary
   Appendix E.  Timestamps Edge Cases
   Appendix F.  Window Retraction Example
   Appendix G.  RTO calculation modification
   Appendix H.  Changes from RFC 1323
   Authors' Addresses

1.  Introduction

   The TCP protocol [RFC0793] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments.  Over the
   years, advances in networking technology have resulted in ever-higher
   transmission speeds, and the fastest paths are well beyond the domain
   for which TCP was originally engineered.

   This document defines a set of modest extensions to TCP to extend the
   domain of its application to match the increasing network capability.
   It is an update to and obsoletes [RFC1323], which in turn is based
   upon and obsoletes [RFC1072] and [RFC1185].

   Changes between [RFC1323] and this document are detailed in
   Appendix H.

   For brevity, the full discussions of the merits and history behind
   the TCP options defined within this document have been omitted.
   [RFC1323] should be consulted for reference.  It is recommended that
   a modern TCP stack implement and make use of the extensions
   described in this document.

1.1.  TCP Performance

   TCP performance problems arise when the bandwidth * delay product is
   large.  A network having such paths is referred to as a "long, fat
   network" (LFN).

   There are two fundamental performance problems with basic TCP over
   LFN paths:

   (1)  Window Size Limit

        The TCP header uses a 16-bit field to report the receive window
        size to the sender.  Therefore, the largest window that can be
        used is 2^16 = 64 KiB.  For LFN paths where the bandwidth *
        delay product exceeds 64 KiB, the receive window limits the
        maximum throughput of the TCP connection over the path, i.e.,
        the amount of unacknowledged data that TCP can send in order to
        keep the pipeline full.

        To circumvent this problem, Section 2 of this memo defines a TCP
        option, "Window Scale", to allow windows larger than 2^16.  This
        option defines an implicit scale factor, which is used to
        multiply the window size value found in a TCP header to obtain
        the true window size.

   (2)  Recovery from Losses

        Packet losses in an LFN can have a catastrophic effect on
        throughput.

        To generalize the Fast Retransmit / Fast Recovery mechanism to
        handle multiple packets dropped per window, Selective
        Acknowledgments are required.  Unlike the normal cumulative
        acknowledgments of TCP, Selective Acknowledgments give the
        sender a complete picture of which segments are queued at the
        receiver and which have not yet arrived.

        Selective acknowledgments and their use are specified in
        separate documents, "TCP Selective Acknowledgment Options"
        [RFC2018], "An Extension to the Selective Acknowledgement (SACK)
        Option for TCP" [RFC2883], and "A Conservative Selective
        Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
        [RFC6675], and are not discussed further in this document.

1.2.  TCP Reliability

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments.  TCP reliability
   depends upon the existence of a bound on the lifetime of a segment:
   the "Maximum Segment Lifetime" or MSL.

   Duplication of sequence numbers might happen in either of two ways:

   (1)  Sequence number wrap-around on the current connection

        A TCP sequence number contains 32 bits.  At a high enough
        transfer rate, the 32-bit sequence space may be "wrapped"
        (cycled) within the time that a segment is delayed in queues.

   (2)  Earlier incarnation of the connection

        Suppose that a connection terminates, either by a proper close
        sequence or due to a host crash, and the same connection (i.e.,
        using the same pair of port numbers) is immediately reopened.  A
        delayed segment from the terminated connection could fall within
        the current window for the new incarnation and be accepted as
        valid.

   Duplicates from earlier incarnations, case (2), are avoided by
   enforcing the current fixed MSL of the TCP specification, as
   explained in Section 4.8 and Appendix B.  However, case (1), avoiding
   the reuse of sequence numbers within the same connection, requires an
   upper bound on MSL that depends upon the transfer rate, and at high
   enough rates, a dedicated mechanism is required.
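
   To make the rate dependence of case (1) concrete, the following
   sketch estimates how long a sender needs to cycle once through the
   32-bit sequence space at a sustained rate.  It is an illustration
   only: the function name and the sample rates are assumptions made
   for this example, and the calculation simply divides the 2^32-byte
   sequence space by the byte rate.

```python
# Illustrative sketch only; the names and sample rates are assumptions
# made for this example and are not defined by this document.

SEQ_SPACE_BYTES = 2 ** 32  # a TCP sequence number contains 32 bits


def seconds_to_wrap(rate_bits_per_second: float) -> float:
    """Time needed to send 2^32 bytes, i.e. to cycle the sequence space once."""
    bytes_per_second = rate_bits_per_second / 8
    return SEQ_SPACE_BYTES / bytes_per_second


if __name__ == "__main__":
    samples = [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]
    for label, rate in samples:
        print(f"{label:>10}: {seconds_to_wrap(rate):10.1f} s")
```

   At 1 Gbit/s the sequence space is cycled in roughly 34 seconds, so
   the bound on segment lifetime needed to rule out case (1) shrinks
   as the transfer rate grows.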

   A possible fix for the problem of cycling the sequence space would be
   to increase the size of the TCP sequence number field.  For example,
   the sequence number field (and also the acknowledgment field) could
   be expanded to 64 bits.  This could be done either by changing the
   TCP header or by means of an additional option.

   Section 4 presents a different mechanism, which we call PAWS
   (Protection Against Wrapped Sequence numbers), to extend TCP
   reliability to transfer rates well beyond the foreseeable upper limit
   of network bandwidths.  PAWS uses the TCP Timestamps option defined
   in Section 3.2 to protect against old duplicates from the same
   connection.

1.3.  Using TCP options

   The extensions defined in this document all use TCP options.

   When [RFC1323] was published, there was concern that some buggy TCP
   implementation might be crashed by the first appearance of an option
   on a non-<SYN> segment.  However, bugs like that can lead to DoS
   attacks against a TCP.  Research has shown that most TCP
   implementations will properly handle unknown options on non-<SYN>
   segments ([Medina04], [Medina05]).  But it is still prudent to be
   conservative in what you send, and avoiding buggy TCP implementations
   is not the only reason for negotiating TCP options on <SYN> segments.

   The window scale option negotiates fundamental parameters of the TCP
   session.  Therefore, it is only sent during the initial handshake.
   Furthermore, the window scale option will be sent in a <SYN,ACK>
   segment only if the corresponding option was received in the initial
   <SYN> segment.

   The Timestamps option may appear in any data or <ACK> segment, adding
   12 bytes to the 20-byte TCP header.  It is required that this TCP
   option will be sent on all non-<RST> segments after an exchange of
   options on the <SYN> segments has indicated that both sides
   understand this extension.

   Research has shown that the use of the Timestamps option to arrive at
   an optimal retransmission timeout value has only limited benefit
   ([Allman99]).  However, there are other uses of the Timestamps
   option, such as the Eifel mechanism ([RFC3522], [RFC4015]) and PAWS
   (see Section 4), which improve overall TCP security and performance.
   The extra header bandwidth used by this option should be evaluated
   for the gains in performance and security in an actual deployment.

   Appendix A contains a recommended layout of the options in TCP
   headers to achieve reasonable data field alignment.

   Finally, we observe that most of the mechanisms defined in this
   document are important for LFNs and/or very high-speed networks.
   For low-speed networks, it might be a performance optimization to NOT
   use these mechanisms.  A TCP vendor concerned about optimal
   performance over low-speed paths might consider turning these
   extensions off for low-speed paths, or allow a user or installation
   manager to disable them.

1.4.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in UPPER CASE.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

2.  TCP Window Scale Option

2.1.  Introduction

   The window scale extension expands the definition of the TCP window
   to 30 bits and then uses an implicit scale factor to carry this
   30-bit value in the 16-bit Window field of the TCP header (SEG.WND in
   [RFC0793]).  The exponent of the scale factor is carried in a TCP
   option, Window Scale.  This option is sent only in a <SYN> segment (a
   segment with the SYN bit on), hence the window scale is fixed in each
   direction when a connection is opened.

   The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space.  In a typical modern
   implementation, this maximum buffer space is set by default but can
   be overridden by a user program before a TCP connection is opened.
   This determines the scale factor, and therefore no new user interface
   is needed for window scaling.

2.2.  Window Scale Option

   The three-byte Window Scale option MAY be sent in a <SYN> segment by
   a TCP.  It has two purposes: (1) indicate that the TCP is prepared to
   do both send and receive window scaling, and (2) communicate the
   exponent of a scale factor to be applied to its receive window.
   Thus, a TCP that is prepared to scale windows SHOULD send the option,
   even if its own scale factor is 1 and the exponent 0.  The scale
   factor is limited to a power of two and encoded logarithmically, so
   it may be implemented by binary shift operations.  The maximum scale
   exponent is limited to 14 for a maximum permissible receive window
   size of 1 GiB (2^(14+16)).

   TCP Window Scale Option (WSopt):

   Kind: 3

   Length: 3 bytes

          +---------+---------+---------+
          | Kind=3  |Length=3 |shift.cnt|
          +---------+---------+---------+
               1         1         1

   This option is an offer, not a promise; both sides MUST send Window
   Scale options in their <SYN> segments to enable window scaling in
   either direction.  If window scaling is enabled, then the TCP that
   sent this option will right-shift its true receive-window values by
   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
   MAY be zero (offering to scale, while applying a scale factor of 1 to
   the receive window).

   This option MAY be sent in an initial <SYN> segment (i.e., a segment
   with the SYN bit on and the ACK bit off).  It MAY also be sent in a
   <SYN,ACK> segment, but only if a Window Scale option was received in
   the initial <SYN> segment.  A Window Scale option in a segment
   without a SYN bit MUST be ignored.

   The window field in a segment where the SYN bit is set (i.e., a <SYN>
   or <SYN,ACK>) is never scaled.

2.3.  Using the Window Scale Option

   A model implementation of window scaling is as follows, using the
   notation of [RFC0793]:

   o  All windows are treated as 32-bit quantities for storage in the
      connection control block and for local calculations.  This
      includes the send-window (SND.WND) and the receive-window
      (RCV.WND) values, as well as the congestion window.

   o  The connection state is augmented by two window shift counters,
      Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
      and outgoing window fields, respectively.

   o  If a TCP receives a <SYN> segment containing a Window Scale
      option, it sends its own Window Scale option in the <SYN,ACK>
      segment.

   o  The Window Scale option is sent with shift.cnt = R, where R is the
      value that the TCP would like to use for its receive window.

   o  Upon receiving a <SYN> segment with a Window Scale option
      containing shift.cnt = S, a TCP sets Snd.Wind.Shift to S and sets
      Rcv.Wind.Shift to R; otherwise, it sets both Snd.Wind.Shift and
      Rcv.Wind.Shift to zero.

   o  The window field (SEG.WND) in the header of every incoming
      segment, with the exception of <SYN> segments, is left-shifted by
      Snd.Wind.Shift bits before updating SND.WND:

         SND.WND = SEG.WND << Snd.Wind.Shift

      (assuming the other conditions of [RFC0793] are met, and using the
      "C" notation "<<" for left-shift).

   o  The window field (SEG.WND) of every outgoing segment, with the
      exception of <SYN> segments, is right-shifted by Rcv.Wind.Shift
      bits:

         SEG.WND = RCV.WND >> Rcv.Wind.Shift

   TCP determines if a data segment is "old" or "new" by testing whether
   its sequence number is within 2^31 bytes of the left edge of the
   window, and if it is not, discarding the data as "old".  To ensure
   that new data is never mistakenly considered old and vice versa, the
   left edge of the sender's window has to be at most 2^31 away from the
   right edge of the receiver's window.  The same holds for the sender's
   right edge and the receiver's left edge.  Since the right and left
   edges of either the sender's or receiver's window differ by the
   window size, and since the sender and receiver windows can be out of
   phase by at most the window size, the above constraints imply that
   two times the maximum window size must be less than 2^31, or

      max window < 2^30

   Since the max window is 2^S (where S is the scaling shift count)
   times at most 2^16 - 1 (the maximum unscaled window), the maximum
   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
   MUST be limited to 14 (which allows windows of 2^30 = 1 GiB).  If a
   Window Scale option is received with a shift.cnt value larger than
   14, the TCP SHOULD log the error but MUST use 14 instead of the
   specified value.  This is safe as a sender can always choose to only
   partially use any signaled receive window.

   The scale factor applies only to the Window field as transmitted in
   the TCP header; each TCP using extended windows will maintain the
   window values locally as 32-bit numbers.  For example, the
   "congestion window" computed by Slow Start and Congestion Avoidance
   (see [RFC5681]) is not affected by the scale factor, so window
   scaling will not introduce quantization into the congestion window.

2.4.  Addressing Window Retraction

   When a non-zero scale factor is in use, there are instances when a
   retracted window can be offered (see Appendix F for a detailed
   example).  The end of the window will be on a boundary based on the
   granularity of the scale factor being used.  If the sequence number
   is then updated by a number of bytes smaller than that granularity,
   the TCP will have to either advertise a new window that is beyond
   what it previously advertised (and perhaps beyond the buffer), or
   will have to advertise a smaller window, which will cause the TCP
   window to shrink.  Implementations MUST ensure that they handle a
   shrinking window, as specified in Section 4.2.2.16 of [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in-window, any segment that would
       have been in-window for any <ACK> sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent <ACK>, if
       more than one segment has arrived since the application consumed
       any data in the receive buffer).

   On the sender side:

   3)  The initial transmission MUST be within the window announced by
       the most recent <ACK>.

   4)  On first retransmission, or if the sequence number is
       out-of-window by less than 2^Rcv.Wind.Shift, then do normal
       retransmission(s) without regard to the receiver window as long
       as the original segment was in window when it was sent.

   5)  Subsequent retransmissions MAY only be sent if they are within
       the window announced by the most recent <ACK>.

3.  TCP Timestamps option

3.1.  Introduction

   TCP measures the round trip time (RTT), primarily for the purpose of
   arriving at a reasonable value for the Retransmission Timeout (RTO)
   timer interval.
   Accurate and current RTT estimates are necessary to adapt to
   changing traffic conditions, while a conservative estimate of the
   RTO interval is necessary to minimize spurious RTOs.

   When [RFC1323] was originally written, it was perceived that taking
   RTT measurements for each segment, and also during retransmissions,
   would help reduce spurious RTOs, while maintaining the timeliness of
   necessary RTOs.  At the time, RTO was also the only mechanism to
   make use of the measured RTT.  It has since been shown that taking
   more RTT samples has only a very limited effect on optimizing RTOs
   [Allman99].

   This document makes a clear distinction between the round trip time
   measurement (RTTM) mechanism, and subsequent mechanisms using the RTT
   signal as input, such as RTO (see Section 3.4).

   The Timestamps option is important when large receive windows are
   used, to allow the use of the PAWS mechanism (see Section 4).
   Furthermore, the option is useful for all TCPs, since it simplifies
   the sender and allows the use of additional optimizations such as
   Eifel ([RFC3522], [RFC4015]) and others.

3.2.  Timestamps option

   TCP is a symmetric protocol, allowing data to be sent at any time in
   either direction, and therefore timestamp echoing may occur in either
   direction.  For simplicity and symmetry, we specify that timestamps
   always be sent and echoed in both directions.  For efficiency, we
   combine the timestamp and timestamp reply fields into a single TCP
   Timestamps option.

   TCP Timestamps option (TSopt):

   Kind: 8

   Length: 10 bytes

          +-------+-------+---------------------+---------------------+
          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
          +-------+-------+---------------------+---------------------+
              1       1              4                     4

   The Timestamps option carries two four-byte timestamp fields.
   The Timestamp Value field (TSval) contains the current value of the
   timestamp clock of the TCP sending the option.

   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
   in the TCP header; if it is valid, it echoes a timestamp value that
   was sent by the remote TCP in the TSval field of a Timestamps option.
   When TSecr is not valid, its value MUST be zero.  However, a value of
   zero does not imply TSecr being invalid.  The TSecr value will
   generally be from the most recent Timestamps option that was
   received; however, there are exceptions that are explained below.

   A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
   segment (i.e., segment containing a SYN bit and no ACK bit), and MAY
   send a TSopt in other segments only if it received a TSopt in the
   initial <SYN> or <SYN,ACK> segment for the connection.

   Once TSopt has been successfully negotiated (sent and received)
   during the <SYN>, <SYN,ACK> exchange, TSopt MUST be sent in every
   non-<RST> segment for the duration of the connection, and SHOULD be
   sent in an <RST> segment (see Section 4.2 for details).  If a
   non-<RST> segment is received without a TSopt, a TCP SHOULD silently
   drop the segment.  A TCP MUST NOT abort a TCP connection because any
   segment lacks an expected TSopt.

   Implementations are strongly encouraged to follow the above rules for
   handling a missing Timestamps option, and the order of precedence
   mentioned in Section 4.3 when deciding on the acceptance of a
   segment.

   If a receiver chooses to accept a segment without an expected
   Timestamps option, it must be clear that undetectable data corruption
   may occur.

   Such a TCP receiver may experience undetectable wrapped-sequence
   effects, such as data (payload) corruption or session stalls.  In
   order to maintain the integrity of the payload data, in particular on
   high-speed networks, it is paramount to follow the described
   processing rules.
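
   The option layout and the rules for a missing option can be
   sketched as follows.  This is a minimal illustration, not a
   reference implementation: the function names, the string return
   values, and the use of network byte order for the two 32-bit
   fields are assumptions made for this example.

```python
# Minimal sketch; all names here are hypothetical and chosen for illustration.
import struct
from typing import Optional, Tuple

TSOPT_KIND = 8
TSOPT_LENGTH = 10  # Kind (1) + Length (1) + TSval (4) + TSecr (4)


def build_tsopt(tsval: int, tsecr: int, ack_bit_set: bool) -> bytes:
    """Encode a Timestamps option (assumes network byte order)."""
    if not ack_bit_set:
        tsecr = 0  # TSecr is not valid without the ACK bit and MUST be zero
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LENGTH,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def parse_tsopt(option: bytes) -> Optional[Tuple[int, int]]:
    """Return (TSval, TSecr), or None if the bytes are not a Timestamps option."""
    if len(option) != TSOPT_LENGTH:
        return None
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TSOPT_KIND or length != TSOPT_LENGTH:
        return None
    return tsval, tsecr


def action_for_segment(ts_negotiated: bool, has_tsopt: bool, is_rst: bool) -> str:
    """Decide how to treat an incoming segment under the missing-option rules."""
    if not ts_negotiated:
        return "process"  # no Timestamps option is expected on this connection
    if has_tsopt or is_rst:
        return "process"  # only segments without the RST bit must carry it
    return "drop"  # SHOULD silently drop; MUST NOT abort the connection
```

   For example, build_tsopt(1, 120, True) yields the ten bytes
   08 0a 00 00 00 01 00 00 00 78, and a segment without the RST bit
   that lacks the option on a negotiated connection maps to "drop"
   rather than to a connection abort.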

   However, it has been mentioned that under some circumstances, the
   above guidelines are too strict, and some paths sporadically suppress
   the Timestamps option, while maintaining payload integrity.  A path
   behaving in this manner should be deemed unacceptable, but it has
   been noted that some implementations relax the acceptance rules as a
   workaround, and allow TCP to run across such paths.

   If a TSopt is received on a connection where TSopt was not negotiated
   in the initial three-way handshake, the TSopt MUST be ignored and the
   packet processed normally.

   In the case of crossing <SYN> segments where one <SYN> contains a
   TSopt and the other doesn't, both sides MAY send a TSopt in the
   <SYN,ACK> segment.

   TSopt is required for the two mechanisms described in Sections 3.3
   and 4.2.  There are also other mechanisms that rely on the presence
   of the TSopt, e.g., [RFC3522].  If a TCP stopped sending TSopt at any
   time during an established session, it would interfere with these
   mechanisms.  This update to [RFC1323] describes explicitly the
   previous assumption (see Section 4.2), that each TCP segment must
   have TSopt, once negotiated.

3.3.  The RTTM Mechanism

   RTTM places a Timestamps option in every segment, with a TSval that
   is obtained from a (virtual) "timestamp clock".  Values of this clock
   MUST be at least approximately proportional to real time, in order to
   measure actual RTT.

   These TSval values are echoed in TSecr values in the reverse
   direction.  The difference between a received TSecr value and the
   current timestamp clock value provides an RTT measurement.

   When timestamps are used, every segment that is received will contain
   a TSecr value.  However, these values cannot all be used to update
   the measured RTT.  The following example illustrates why.  It shows a
   one-way data flow with segments arriving in sequence without loss.
   Here A, B, C... represent data blocks occupying successive blocks of
   sequence numbers, and ACK(A),... represent the corresponding
   cumulative acknowledgments.  The two timestamp fields of the
   Timestamps option are shown symbolically as <TSval=x,TSecr=y>.  Each
   TSecr field contains the value most recently received in a TSval
   field.

          TCP  A                                     TCP B

                          <A,TSval=1,TSecr=120> ----->

               <---- <ACK(A),TSval=127,TSecr=1>

                          <B,TSval=5,TSecr=127> ----->

               <---- <ACK(B),TSval=131,TSecr=5>

               . . . . . . . . . . . . . . . . . . . . . .

                          <C,TSval=65,TSecr=131> ---->

               <---- <ACK(C),TSval=191,TSecr=65>

                                   (etc.)

   The dotted line marks a pause (60 time units long) in which A had
   nothing to send.  Note that this pause inflates the RTT which B could
   infer from receiving TSecr=131 in data segment C.  Thus, in one-way
   data flows, RTTM in the reverse direction measures a value that is
   inflated by gaps in sending data.  However, the following rule
   prevents a resulting inflation of the measured RTT:

   RTTM Rule: A TSecr value received in a segment MAY be used to update
              the averaged RTT measurement only if the segment advances
              the left edge of the send window, i.e., SND.UNA is
              increased.

   Since TCP B is not sending data, the data segment C does not
   acknowledge any new data when it arrives at B.  Thus, the inflated
   RTTM measurement is not used to update B's RTTM measurement.

3.4.  Updating the RTO value

   [Ludwig00] and [Floyd05] have highlighted the problem that an
   unmodified RTO calculation, which is updated with per-packet RTT
   samples, will truncate the path history too soon.  This can lead to
   an increase in spurious retransmissions, when the path properties
   vary on the order of a few RTTs, but a high number of RTT samples are
   taken on a much shorter timescale.

   Implementers should note that with timestamps multiple RTTMs can be
   taken per RTT.  The [RFC6298] RTO estimator has weighting factors,
   alpha and beta, based on an implicit assumption that at most one RTTM
   will be sampled per RTT.
When multiple RTTMs per RTT are available 617 to update the RTO estimator, this implicit assumption must be 618 considered. An implementation suggestion is detailed in Appendix G. 620 3.5. Which Timestamp to Echo 622 If more than one Timestamps option is received before a reply segment 623 is sent, the TCP must choose only one of the TSvals to echo, ignoring 624 the others. To minimize the state kept in the receiver (i.e., the 625 number of unprocessed TSvals), the receiver should be required to 626 retain at most one timestamp in the connection control block. 628 There are three situations to consider: 630 (A) Delayed ACKs. 632 Many TCPs acknowledge only every second segment out of a group 633 of segments arriving within a short time interval; this policy 634 is known generally as "delayed ACKs". The data-sender TCP must 635 measure the effective RTT, including the additional time due to 636 delayed ACKs, or else it will retransmit unnecessarily. Thus, 637 when delayed ACKs are in use, the receiver SHOULD reply with the 638 TSval field from the earliest unacknowledged segment. 640 (B) A hole in the sequence space (segment(s) have been lost). 642 The sender will continue sending until the window is filled, and 643 the receiver may be generating <ACK>s as these out-of-order 644 segments arrive (e.g., to aid "fast retransmit"). 646 The lost segment is probably a sign of congestion, and in that 647 situation the sender should be conservative about 648 retransmission. Furthermore, it is better to overestimate than 649 underestimate the RTT. An <ACK> for an out-of-order segment 650 SHOULD therefore contain the timestamp from the most recent 651 segment that advanced RCV.NXT. 653 The same situation occurs if segments are re-ordered by the 654 network. 656 (C) A filled hole in the sequence space. 658 The segment that fills the hole and advances the window 659 represents the most recent measurement of the network 660 characteristics.
An RTT computed from an earlier segment would 661 probably include the sender's retransmit time-out, badly biasing 662 the sender's average RTT estimate. Thus, the timestamp from the 663 latest segment (which filled the hole) MUST be echoed. 665 An algorithm that covers all three cases is described in the 666 following rules for Timestamps option processing on a synchronized 667 connection: 669 (1) The connection state is augmented with two 32-bit slots: 671 TS.Recent holds a timestamp to be echoed in TSecr whenever a 672 segment is sent, and Last.ACK.sent holds the ACK field from the 673 last segment sent. Last.ACK.sent will equal RCV.NXT except when 674 <ACK>s have been delayed. 676 (2) If: 678 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 680 then SEG.TSval is copied to TS.Recent; otherwise, it is ignored. 682 (3) When a TSopt is sent, its TSecr field is set to the current 683 TS.Recent value. 685 The following examples illustrate these rules. Here A, B, C... 686 represent data segments occupying successive blocks of sequence 687 numbers, and ACK(A),... represent the corresponding acknowledgment 688 segments. Note that ACK(A) has the same sequence number as B. We 689 show only one direction of timestamp echoing, for clarity. 691 o Segments arrive in sequence, and some of the <ACK>s are delayed. 693 By case (A), the timestamp from the oldest unacknowledged segment 694 is echoed.
696                                                 TS.Recent
697                 <A, TSval=1> ------------------->
698                                                     1
699                 <B, TSval=2> ------------------->
700                                                     1
701                 <C, TSval=3> ------------------->
702                                                     1
703                          <---- <ACK(C), TSecr=1>
704                 (etc)
706 o Segments arrive out of order, and every segment is acknowledged. 708 By case (B), the timestamp from the last segment that advanced the 709 left window edge is echoed, until the missing segment arrives; it 710 is echoed according to Case (C). The same sequence would occur if 711 segments B and D were lost and retransmitted.
713                                                 TS.Recent
714                 <A, TSval=1> ------------------->
715                                                     1
716                          <---- <ACK(A), TSecr=1>
717                                                     1
718                 <C, TSval=3> ------------------->
719                                                     1
720                          <---- <ACK(A), TSecr=1>
721                                                     1
722                 <B, TSval=2> ------------------->
723                                                     2
724                          <---- <ACK(C), TSecr=2>
725                                                     2
726                 <E, TSval=5> ------------------->
727                                                     2
728                          <---- <ACK(C), TSecr=2>
729                                                     2
730                 <D, TSval=4> ------------------->
731                                                     4
732                          <---- <ACK(E), TSecr=4>
733                 (etc)
735 4. PAWS - Protection Against Wrapped Sequence Numbers 737 4.1. Introduction 739 Section 4.2 describes a simple mechanism to reject old duplicate 740 segments that might corrupt an open TCP connection; we call this 741 mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS 742 operates within a single TCP connection, using state that is saved in 743 the connection control block. Section 4.8 and Appendix B discuss the 744 implications of the PAWS mechanism for avoiding old duplicates from 745 previous incarnations of the same connection. 747 4.2. The PAWS Mechanism 749 PAWS uses the same TCP Timestamps option as the RTTM mechanism 750 described earlier, and assumes that every received TCP segment 751 (including data and <ACK> segments) contains a timestamp SEG.TSval 752 whose values are monotonically non-decreasing in time. The basic 753 idea is that a segment can be discarded as an old duplicate if it is 754 received with a timestamp SEG.TSval less than some timestamp recently 755 received on this connection. 757 In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit 758 unsigned integers in a modular 32-bit space. Thus, "less than" is 759 defined the same way it is for TCP sequence numbers, and the same 760 implementation techniques apply. If s and t are timestamp values, 762 s < t if 0 < (t - s) < 2^31, 764 computed in unsigned 32-bit arithmetic. 766 The choice of incoming timestamps to be saved for this comparison 767 MUST guarantee a value that is monotonically increasing. For 768 example, we might save the timestamp from the segment that last 769 advanced the left edge of the receive window, i.e., the most recent 770 in-sequence segment.
Instead, we choose the value TS.Recent 771 introduced in Section 3.5 for the RTTM mechanism, since using a 772 common value for both PAWS and RTTM simplifies the implementation of 773 both. As Section 3.5 explained, TS.Recent differs from the timestamp 774 from the last in-sequence segment only in the case of delayed <ACK>s, 775 and therefore by less than one window. Either choice will therefore 776 protect against sequence number wrap-around. 778 RTTM was specified in a symmetrical manner, so that TSval timestamps 779 are carried in both data and <ACK> segments and are echoed in TSecr 780 fields carried in returning <ACK> or data segments. PAWS submits all 781 incoming segments to the same test, and therefore protects against 782 duplicate <ACK> segments as well as data segments. (An alternative 783 non-symmetric algorithm would protect against old duplicate <ACK>s: 784 the sender of data would reject incoming <ACK> segments whose TSecr 785 values were less than the TSecr saved from the last segment whose ACK 786 field advanced the left edge of the send window. This algorithm was 787 deemed to lack economy of mechanism and symmetry.) 789 TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to 790 initialize PAWS. PAWS protects against old duplicate non-<SYN> 791 segments, and duplicate <SYN> segments received while there is a 792 synchronized connection. Duplicate <SYN> and <SYN,ACK> segments 793 received when there is no connection will be discarded by the normal 794 3-way handshake and sequence number checks of TCP. 796 [RFC1323] recommended that <RST> segments NOT carry timestamps, and 797 that they be acceptable regardless of their timestamp. At that time, 798 the thinking was that old duplicate <RST> segments should be 799 exceedingly unlikely, and their cleanup function should take 800 precedence over timestamps.
More recently, discussions about various 801 blind attacks on TCP connections have raised the suggestion that if 802 the Timestamps option is present, SEG.TSecr could be used to provide 803 stricter acceptance tests for <RST> segments. While still under 804 discussion, to enable research into this area it is now RECOMMENDED 805 that when generating an <RST>, if the segment causing the <RST> 806 to be generated contained a Timestamps option, the <RST> also 807 contain a Timestamps option. In the <RST> segment, SEG.TSecr SHOULD 808 be set to SEG.TSval from the incoming segment and SEG.TSval SHOULD be 809 set to zero. If an <RST> is being generated because of a user abort, 810 and Snd.TS.OK is set, then a Timestamps option SHOULD be included in 811 the <RST>. When an <RST> segment is received, it MUST NOT be 812 subjected to PAWS checks, and information from the Timestamps option 813 MUST NOT be used to update connection state information. SEG.TSecr 814 MAY be used to provide stricter <RST> acceptance checks. 816 4.3. Basic PAWS Algorithm 818 If the PAWS algorithm is used, the following processing MUST be 819 performed on all incoming segments for a synchronized connection. 820 Also, PAWS processing MUST take precedence over the regular TCP 821 acceptability check (Section 3.3 in [RFC0793]), which is performed 822 after verification of the received Timestamps option: 824 R1) If there is a Timestamps option in the arriving segment, 825 SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion) 826 and the RST bit is not set, then treat the arriving segment as 827 not acceptable: 829 Send an acknowledgement in reply as specified in [RFC0793] 830 page 69 and drop the segment. 832 Note: it is necessary to send an <ACK> segment in order to 833 retain TCP's mechanisms for detecting and recovering from 834 half-open connections. For example, see Figure 10 of 835 [RFC0793].
837 R2) If the segment is outside the window, reject it (normal TCP 838 processing) 840 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see 841 Section 3.5), then record its timestamp in TS.Recent. 843 R4) If an arriving segment is in-sequence (i.e., at the left window 844 edge), then accept it normally. 846 R5) Otherwise, treat the segment as a normal in-window, out-of- 847 sequence TCP segment (e.g., queue it for later delivery to the 848 user). 850 Steps R2, R4, and R5 are the normal TCP processing steps specified by 851 [RFC0793]. 853 It is important to note that the timestamp MUST be checked only when 854 a segment first arrives at the receiver, regardless of whether it is 855 in-sequence or it must be queued for later delivery. 857 Consider the following example. 859 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been 860 sent, where the letter indicates the sequence number and the digit 861 represents the timestamp. Suppose also that segment B.1 has been 862 lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ..., 863 Z.1 are considered acceptable and are queued. When B is 864 retransmitted as segment B.2 (using the latest timestamp), it 865 fills the hole and causes all the segments through Z to be 866 acknowledged and passed to the user. The timestamps of the queued 867 segments are *not* inspected again at this time, since they have 868 already been accepted. When B.2 is accepted, TS.Recent is set to 869 2. 871 This rule allows reasonable performance under loss. A full window of 872 data is in transit at all times, and after a loss a full window less 873 one segment will show up out-of-sequence to be queued at the receiver 874 (e.g., up to ~2^30 bytes of data); the Timestamps option must not 875 result in discarding this data.
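The processing order R1 to R5 can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the `Receiver` class, its field names, the return strings, and the choice to acknowledge every accepted segment immediately (so that Last.ACK.sent always equals RCV.NXT) are assumptions made for the example. The window test is simplified, and trimming of partially overlapping segments is omitted.

```python
from dataclasses import dataclass, field

MOD32 = 1 << 32
HALF = 1 << 31


def ts_lt(s, t):
    """Modular 'less than' for 32-bit timestamps: s < t iff 0 < (t - s) < 2^31."""
    return 0 < (t - s) % MOD32 < HALF


def seq_leq(a, b):
    """Modular 'a <= b' for 32-bit sequence numbers."""
    return (b - a) % MOD32 < HALF


@dataclass
class Receiver:
    """Hypothetical receiver state; names loosely mirror the draft's variables."""
    rcv_nxt: int = 0
    rcv_wnd: int = 65535
    ts_recent: int = 0
    ts_recent_valid: bool = True
    last_ack_sent: int = 0
    queued: dict = field(default_factory=dict)  # seq -> length of out-of-order data

    def in_window(self, seq, length):
        # Simplified acceptability test: first or last byte lies in the window.
        first = (seq - self.rcv_nxt) % MOD32
        last = (seq + max(length, 1) - 1 - self.rcv_nxt) % MOD32
        return first < self.rcv_wnd or last < self.rcv_wnd

    def receive(self, seq, length, tsval, rst=False):
        # R1: PAWS check, performed once, when the segment first arrives.
        if ts_lt(tsval, self.ts_recent) and self.ts_recent_valid and not rst:
            return "drop-and-ack"
        # R2: normal TCP window check.
        if not self.in_window(seq, length):
            return "reject"
        # R3: record the timestamp if SEG.SEQ <= Last.ACK.sent.
        if seq_leq(seq, self.last_ack_sent):
            self.ts_recent = tsval
        # R4: in-sequence segment; also deliver queued data that is now contiguous.
        if seq == self.rcv_nxt:
            self.rcv_nxt = (seq + length) % MOD32
            while self.rcv_nxt in self.queued:
                self.rcv_nxt = (self.rcv_nxt + self.queued.pop(self.rcv_nxt)) % MOD32
            self.last_ack_sent = self.rcv_nxt  # assumption: ACK sent immediately
            return "accept"
        # R5: in-window but out of sequence; queue without re-checking later.
        self.queued[seq] = length
        return "queue"
```

Replaying the example above with 100-byte segments (A at sequence 0, B at 100, C at 200): A.1 is accepted and sets TS.Recent to 1, C.1 is queued, and the retransmission B.2 fills the hole, sets TS.Recent to 2, and releases the queued data without inspecting its timestamp again.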
877 In certain unlikely circumstances, the algorithm of rules R1-R5 could 878 lead to discarding some segments unnecessarily, as shown in the 879 following example: 881 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been 882 sent in sequence and that segment B.1 has been lost. Furthermore, 883 suppose delivery of some of C.1, ... Z.1 is delayed until *after* 884 the retransmission B.2 arrives at the receiver. These delayed 885 segments will be discarded unnecessarily when they do arrive, 886 since their timestamps are now out of date. 888 This case is very unlikely to occur. If the retransmission was 889 triggered by a timeout, some of the segments C.1, ... Z.1 must have 890 been delayed longer than the RTO time. This is presumably an 891 unlikely event, or there would be many spurious timeouts and 892 retransmissions. If B's retransmission was triggered by the "fast 893 retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued 894 segments that caused these <ACK>s must have been received already. 896 Even if a segment were delayed past the RTO, the Fast Retransmit 897 mechanism [Jacobson90c] will cause the delayed segments to be 898 retransmitted at the same time as B.2, avoiding an extra RTT and 899 therefore causing a very small performance penalty. 901 We know of no case with a significant probability of occurrence in 902 which timestamps will cause performance degradation by unnecessarily 903 discarding segments. 905 4.4. Timestamp Clock 907 It is important to understand that the PAWS algorithm does not 908 require clock synchronization between sender and receiver. The 909 sender's timestamp clock is used to stamp the segments, and the 910 sender uses the echoed timestamp to measure RTTs. However, the 911 receiver treats the timestamp as simply a monotonically increasing 912 serial number, without any necessary connection to its clock.
From 913 the receiver's viewpoint, the timestamp is acting as a logical 914 extension of the high-order bits of the sequence number. 916 The receiver algorithm does place some requirements on the frequency 917 of the timestamp clock. 919 (a) The timestamp clock must not be "too slow". 921 It MUST tick at least once for each 2^31 bytes sent. In fact, 922 in order to be useful to the sender for round trip timing, the 923 clock SHOULD tick at least once per window's worth of data, and 924 even with the window extension defined in Section 2.2, 2^31 925 bytes must be at least two windows. 927 To make this more quantitative, any clock faster than 1 tick/sec 928 will reject old duplicate segments for link speeds of ~8 Gbps. 929 A 1 ms timestamp clock will work at link speeds up to 8 Tbps 930 (8*10^12) bps! 932 (b) The timestamp clock must not be "too fast". 934 The recycling time of the timestamp clock MUST be greater than 935 MSL seconds. Since the clock (timestamp) is 32 bits and the 936 worst-case MSL is 255 seconds, the maximum acceptable clock 937 frequency is one tick every 59 ns. 939 However, it is desirable to establish a much longer recycle 940 period, in order to handle outdated timestamps on idle 941 connections (see Section 4.5), and to relax the MSL requirement 942 for preventing sequence number wrap-around. With a 1 ms 943 timestamp clock, the 32-bit timestamp will wrap its sign bit in 944 24.8 days. Thus, it will reject old duplicates on the same 945 connection if MSL is 24.8 days or less. This appears to be a 946 very safe figure; an MSL of 24.8 days or longer can probably be 947 assumed in the internet without requiring precise MSL 948 enforcement. 950 Based upon these considerations, we choose a timestamp clock 951 frequency in the range 1 ms to 1 sec per tick. 
This range also 952 matches the requirements of the RTTM mechanism, which does not need 953 much more resolution than the granularity of the retransmit timer, 954 e.g., tens or hundreds of milliseconds. 956 The PAWS mechanism also puts a strong monotonicity requirement on the 957 sender's timestamp clock. The method of implementation of the 958 timestamp clock to meet this requirement depends upon the system 959 hardware and software. 961 o Some hosts have a hardware clock that is guaranteed to be 962 monotonic between hardware resets. 964 o A clock interrupt may be used to simply increment a binary integer 965 by 1 periodically. 967 o The timestamp clock may be derived from a system clock that is 968 subject to being abruptly changed, by adding a variable offset 969 value. This offset is initialized to zero. When a new timestamp 970 clock value is needed, the offset can be adjusted as necessary to 971 make the new value equal to or larger than the previous value 972 (which was saved for this purpose). 974 4.5. Outdated Timestamps 976 If a connection remains idle long enough for the timestamp clock of 977 the other TCP to wrap its sign bit, then the value saved in TS.Recent 978 will become too old; as a result, the PAWS mechanism will cause all 979 subsequent segments to be rejected, freezing the connection (until 980 the timestamp clock wraps its sign bit again). 982 With the chosen range of timestamp clock frequencies (1 sec to 1 ms), 983 the time to wrap the sign bit will be between 24.8 days and 24800 984 days. A TCP connection that is idle for more than 24 days and then 985 comes to life is exceedingly unusual. However, it is undesirable in 986 principle to place any limitation on TCP connection lifetimes. 988 We therefore require that an implementation of PAWS include a 989 mechanism to "invalidate" the TS.Recent value when a connection is 990 idle for more than 24 days. 
(An alternative solution to the problem 991 of outdated timestamps would be to send keep-alive segments at a very 992 low rate, but still more often than the wrap-around time for 993 timestamps, e.g., once a day. This would impose negligible overhead. 994 However, the TCP specification has never included keep-alives, so the 995 solution based upon invalidation was chosen.) 997 Note that a TCP does not know the frequency, and therefore, the 998 wraparound time, of the other TCP, so it must assume the worst. The 999 validity of TS.Recent needs to be checked only if the basic PAWS 1000 timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If 1001 TS.Recent is found to be invalid, then the segment is accepted, 1002 regardless of the failure of the timestamp check, and rule R3 updates 1003 TS.Recent with the TSval from the new segment. 1005 To detect how long the connection has been idle, the TCP MAY update a 1006 clock or timestamp value associated with the connection whenever 1007 TS.Recent is updated, for example. The details will be 1008 implementation-dependent. 1010 4.6. Header Prediction 1012 "Header prediction" [Jacobson90a] is a high-performance transport 1013 protocol implementation technique that is most important for high- 1014 speed links. This technique optimizes the code for the most common 1015 case, receiving a segment correctly and in order. Using header 1016 prediction, the receiver asks the question, "Is this segment the next 1017 in sequence?" This question can be answered in fewer machine 1018 instructions than the question, "Is this segment within the window?" 
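To illustrate why the header prediction question is cheaper, the following Python sketch contrasts the two tests. The function names are hypothetical. The window test follows the segment acceptability table of [RFC0793], written with modular 32-bit arithmetic. The comparison is only indicative of the relative amount of work and says nothing about actual instruction counts in a real stack.

```python
MOD32 = 1 << 32


def is_next_in_sequence(seg_seq, rcv_nxt):
    """Header prediction question: a single equality comparison."""
    return seg_seq == rcv_nxt


def is_within_window(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """General question: the RFC 793 acceptability test for an arriving segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return (seg_seq - rcv_nxt) % MOD32 < rcv_wnd
    if rcv_wnd == 0:
        return False  # data cannot be accepted into a zero window
    first = (seg_seq - rcv_nxt) % MOD32                # offset of first byte
    last = (seg_seq + seg_len - 1 - rcv_nxt) % MOD32   # offset of last byte
    return first < rcv_wnd or last < rcv_wnd
```

The first function needs one comparison, while the second needs several branches and two modular subtractions, which is the asymmetry that header prediction exploits.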
1020 Adding header prediction to our timestamp procedure leads to the 1021 following recommended sequence for processing an arriving TCP 1022 segment: 1024 H1) Check timestamp (same as step R1 above) 1026 H2) Do header prediction: if segment is next in sequence and if 1027 there are no special conditions requiring additional processing, 1028 accept the segment, record its timestamp, and skip H3. 1030 H3) Process the segment normally, as specified in RFC 793. This 1031 includes dropping segments that are outside the window and 1032 possibly sending acknowledgments, and queuing in-window, out-of- 1033 sequence segments. 1035 Another possibility would be to interchange steps H1 and H2, i.e., to 1036 perform the header prediction step H2 *first*, and perform H1 and H3 1037 only when header prediction fails. This could be a performance 1038 improvement, since the timestamp check in step H1 is very unlikely to 1039 fail, and it requires unsigned modulo arithmetic. To perform this 1040 check on every single segment is contrary to the philosophy of header 1041 prediction. We believe that this change might produce a measurable 1042 reduction in CPU time for TCP protocol processing on high-speed 1043 networks. 1045 However, putting H2 first would create a hazard: a segment from 2^32 1046 bytes in the past might arrive at exactly the wrong time and be 1047 accepted mistakenly by the header-prediction step. The following 1048 reasoning has been introduced in [RFC1185] to show that the 1049 probability of this failure is negligible. 1051 If all segments are equally likely to show up as old duplicates, 1052 then the probability of an old duplicate exactly matching the left 1053 window edge is the maximum segment size (MSS) divided by the size 1054 of the sequence space. This ratio must be less than 2^-16, since 1055 MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20 1056 for a 100 Mbit/s link. 
However, the older a segment is, the less 1057 likely it is to be retained in the Internet, and under any 1058 reasonable model of segment lifetime the probability of an old 1059 duplicate exactly at the left window edge must be much smaller 1060 than 2^-16. 1062 The 16 bit TCP checksum also allows a basic unreliability of one 1063 part in 2^16. A protocol mechanism whose reliability exceeds the 1064 reliability of the TCP checksum should be considered "good 1065 enough", i.e., it won't contribute significantly to the overall 1066 error rate. We therefore believe we can ignore the problem of an 1067 old duplicate being accepted by doing header prediction before 1068 checking the timestamp. 1070 However, this probabilistic argument is not universally accepted, and 1071 the consensus at present is that the performance gain does not 1072 justify the hazard in the general case. It is therefore recommended 1073 that H2 follow H1. 1075 4.7. IP Fragmentation 1077 At high data rates, the protection against old segments provided by 1078 PAWS can be circumvented by errors in IP fragment reassembly (see 1079 [RFC4963]). The only way to protect against incorrect IP fragment 1080 reassembly is to not allow the segments to be fragmented. This is 1081 done by setting the Don't Fragment (DF) bit in the IP header. 1082 Setting the DF bit implies the use of Path MTU Discovery as described 1083 in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation 1084 that implements PAWS MUST also implement Path MTU Discovery. 1086 4.8. Duplicates from Earlier Incarnations of Connection 1088 The PAWS mechanism protects against errors due to sequence number 1089 wrap-around on high-speed connections. Segments from an earlier 1090 incarnation of the same connection are also a potential cause of old 1091 duplicate errors. 
In both cases, the TCP mechanisms to prevent such 1092 errors depend upon the enforcement of a maximum segment lifetime 1093 (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a 1094 detailed discussion). Unlike the case of sequence space wrap-around, 1095 the MSL required to prevent old duplicate errors from earlier 1096 incarnations does not depend upon the transfer rate. If the IP layer 1097 enforces the recommended 2 minute MSL of TCP, and if the TCP rules 1098 are followed, TCP connections will be safe from earlier incarnations, 1099 no matter how high the network speed. Thus, the PAWS mechanism is 1100 not required for this case. 1102 We may still ask whether the PAWS mechanism can provide additional 1103 security against old duplicates from earlier connections, allowing us 1104 to relax the enforcement of MSL by the IP layer. Appendix B explores 1105 this question, showing that further assumptions and/or mechanisms are 1106 required, beyond those of PAWS. This is not part of the current 1107 extension. 1109 5. Conclusions and Acknowledgements 1111 This memo presented a set of extensions to TCP to provide efficient 1112 operation over large bandwidth * delay product paths and reliable 1113 operation over very high-speed paths. These extensions are designed 1114 to provide compatible interworking with TCP stacks that do not 1115 implement the extensions. 1117 These mechanisms are implemented using TCP options for scaled windows 1118 and timestamps. The timestamps are used for two distinct mechanisms: 1119 RTTM (Round Trip Time Measurement) and PAWS (Protection Against 1120 Wrapped Sequences). 1122 The Window Scale option was originally suggested by Mike St. Johns of 1123 USAF/DCA. The present form of the option was suggested by Mike 1124 Karels of UC Berkeley in response to a more cumbersome scheme defined 1125 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1126 description in [RFC1185]. 
1128 Finally, much of this work originated as the result of discussions 1129 within the End-to-End Task Force on the theoretical limitations of 1130 transport protocols in general and TCP in particular. Task force 1131 members and others on the end2end-interest list have made valuable 1132 contributions by pointing out flaws in the algorithms and the 1133 documentation. Continued discussion and development since the 1134 publication of [RFC1323] originally occurred in the IETF TCP Large 1135 Windows Working Group, later on in the End-to-End Task Force, and 1136 most recently in the IETF TCP Maintenance Working Group. The authors 1137 are grateful for all these contributions. 1139 6. Security Considerations 1141 The TCP sequence space is a fixed size, and as the window becomes 1142 larger it becomes easier for an attacker to generate forged packets 1143 that can fall within the TCP window, and be accepted as valid 1144 segments. While use of timestamps and PAWS can help to mitigate 1145 this, when using PAWS, if an attacker is able to forge a packet that 1146 is acceptable to the TCP connection, a timestamp that is in the 1147 future would cause valid segments to be dropped due to PAWS checks. 1148 Hence, implementers should take care to not open the TCP window 1149 drastically beyond the requirements of the connection. 1151 A naive implementation that derives the timestamp clock value 1152 directly from a system uptime clock may unintentionally leak the 1153 system uptime to an attacker. This does not directly compromise any of 1154 the mechanisms described in this document. However, this may be 1155 valuable information to a potential attacker. An implementer should 1156 evaluate the potential impact and mitigate this accordingly (e.g., by 1157 using a random offset for the timestamp clock on each connection, or 1158 using an external, real-time derived timestamp clock source).
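The random-offset mitigation can be sketched as follows. The `TimestampClock` class, its parameters, and the 1 ms default tick are assumptions chosen for illustration; the sketch uses Python's `time.monotonic_ns` as the underlying monotonic source and `secrets.randbits` for the per-connection offset.

```python
import secrets
import time

MOD32 = 1 << 32


class TimestampClock:
    """Hypothetical per-connection timestamp clock with a random offset.

    The offset is drawn once when the connection is created, so TSval still
    advances monotonically (modulo 2^32) at the tick rate, but its absolute
    value no longer reveals the underlying system clock.
    """

    def __init__(self, tick_ns=1_000_000, now_ns=time.monotonic_ns):
        self._tick_ns = tick_ns              # 1 ms per tick by default
        self._now_ns = now_ns                # injectable for testing
        self._offset = secrets.randbits(32)  # per-connection random offset

    def tsval(self):
        ticks = self._now_ns() // self._tick_ns
        return (ticks + self._offset) % MOD32
```

One instance would be created per connection. Differences between successive TSval values still measure elapsed ticks, so RTTM and PAWS are unaffected by the offset.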
1160 Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms 1161 [RFC2675] to be used when the local network supports packets larger 1162 than 64 KiB. When larger TCP segments are used, the TCP checksum 1163 becomes weaker. 1165 Mechanisms to protect the TCP header from modification should also 1166 protect the TCP options. 1168 Middleboxes and TCP options: 1170 Some middleboxes have been known to remove the TCP options 1171 described in this document from TCP segments [Honda11]. 1172 Middleboxes that remove TCP options described in this document 1173 from the <SYN> segment interfere with the selection of parameters 1174 appropriate for the session. Removing any of these options in a 1175 <SYN,ACK> segment will leave the end hosts in a state that 1176 destroys the proper operation of the protocol. 1178 * If a Window Scale option is removed from a <SYN,ACK> segment, 1179 the end hosts will not negotiate the window scaling factor 1180 correctly. Middleboxes must not remove or modify the Window 1181 Scale option from segments. 1183 * If a stateful firewall uses the window field to detect whether 1184 a received segment is inside the current window, and does not 1185 support the Window Scale option, it will not be able to 1186 correctly determine whether or not a packet is in the window. 1187 These middleboxes must also support the Window Scale option 1188 and apply the scale factor when processing segments. If the 1189 window scale factor cannot be determined, it must not do window-based 1190 processing. 1192 * If the Timestamps option is removed from the <SYN> or 1193 <SYN,ACK> segment, high speed connections that need PAWS would not have 1194 that protection. Successful negotiation of Timestamps option 1195 enforces a stricter verification of incoming segments at the 1196 receiver. If the Timestamps option was removed from a 1197 subsequent data segment after a successful negotiation (e.g. as 1198 part of re-segmentation), the segment is discarded by the 1199 receiver without further processing.
Middleboxes should not 1200 remove the Timestamps option. 1202 * It must be noted that [RFC1323] doesn't address the case of the 1203 Timestamps option being dropped or selectively omitted after 1204 being negotiated, and that the update in this document may 1205 cause some broken middlebox behavior to be detected 1206 (potentially unresponsive TCP sessions). 1208 Implementations that depend on PAWS could provide a mechanism for the 1209 application to determine whether or not PAWS is in use on the 1210 connection, and choose to terminate the connection if that protection 1211 doesn't exist. This is not just to protect the connection against 1212 middleboxes that might remove the Timestamps option, but also against 1213 remote hosts that do not have Timestamps support. 1215 7. IANA Considerations 1217 This document has no actions for IANA. 1219 8. References 1221 8.1. Normative References 1223 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1224 RFC 793, September 1981. 1226 [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 1227 November 1990. 1229 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1230 Requirement Levels", BCP 14, RFC 2119, March 1997. 1232 8.2. Informative References 1234 [Allman99] 1235 Allman, M. and V. Paxson, "On Estimating End-to-End 1236 Network Path Properties", Proc. ACM SIGCOMM Technical 1237 Symposium, Cambridge, MA, September 1999, 1238 . 1240 [Ekstroem04] 1241 Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New End- 1242 to-End Retransmission Timer for Reliable Unicast 1243 Transport", INFOCOM 2004 IEEE, March 2004, . 1247 [Floyd05] Floyd, S., "[tcpm] How the RTO should be estimated with 1248 timestamps", Message from 26.Jan.2007 to the tcpm mailing 1249 list, August 2005, . 1252 [Garlick77] 1253 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable 1254 Host-to-Host Protocols", Proc. Second Berkeley Workshop on 1255 Distributed Data Management and Computer Networks, 1256 May 1977, .
1258 [Hamming77] 1259 Hamming, R., "Digital Filters", Prentice Hall, Englewood 1260 Cliffs, N.J. ISBN 0-13-212571-4, 1977. 1262 [Honda11] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., 1263 Handley, M., and H. Tokuda, "Is it still possible to 1264 extend TCP?", Proc. of ACM Internet Measurement 1265 Conference (IMC) '11, November 2011. 1267 [Jacobson88a] 1268 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1269 '88, Stanford, CA., August 1988, 1270 . 1272 [Jacobson90a] 1273 Jacobson, V., "4BSD Header Prediction", ACM Computer 1274 Communication Review, April 1990. 1276 [Jacobson90c] 1277 Jacobson, V., "Modified TCP congestion avoidance 1278 algorithm", Message to the end2end-interest mailing list, 1279 April 1990, 1280 . 1282 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1283 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and 1284 Comm., Scottsdale, Arizona, March 1986, 1285 . 1287 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in 1288 Reliable Transport Protocols", Proc. SIGCOMM '87, 1289 August 1987. 1291 [Ludwig00] 1292 Ludwig, R. and K. Sklower, "The Eifel Retransmission 1293 Timer", ACM SIGCOMM Computer Communication Review Volume 1294 30 Issue 3, July 2000, . 1297 [Martin03] 1298 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg 1299 mailing list, September 2003, . 1302 [Mathis08] 1303 Mathis, M., "[tcpm] Example of 1323 window retraction 1304 problem", Message to the tcpm mailing list, March 2008, . 1308 [Medina04] 1309 Medina, A., Allman, M., and S. Floyd, "Measuring 1310 Interactions Between Transport Protocols and Middleboxes", 1311 Proc. ACM SIGCOMM/USENIX Internet Measurement Conference. 1312 October 2004, August 2004, 1313 . 1315 [Medina05] 1316 Medina, A., Allman, M., and S. Floyd, "Measuring the 1317 Evolution of Transport Protocols in the Internet", ACM 1318 Computer Communication Review 35(2), April 2005, 1319 . 
1321 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1322 RFC 896, January 1984. 1324 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay 1325 paths", RFC 1072, October 1988. 1327 [RFC1110] McKenzie, A., "Problem with the TCP big window option", 1328 RFC 1110, August 1989. 1330 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1331 Communication Layers", STD 3, RFC 1122, October 1989. 1333 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for 1334 High-Speed Paths", RFC 1185, October 1990. 1336 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 1337 for High Performance", RFC 1323, May 1992. 1339 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 1340 for IP version 6", RFC 1981, August 1996. 1342 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1343 Selective Acknowledgment Options", RFC 2018, October 1996. 1345 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1346 Control", RFC 2581, April 1999. 1348 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1349 RFC 2675, August 1999. 1351 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1352 Extension to the Selective Acknowledgement (SACK) Option 1353 for TCP", RFC 2883, July 2000. 1355 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm 1356 for TCP", RFC 3522, April 2003. 1358 [RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm 1359 for TCP", RFC 4015, February 2005. 1361 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1362 Discovery", RFC 4821, March 2007. 1364 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1365 Errors at High Data Rates", RFC 4963, July 2007. 1367 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1368 Control", RFC 5681, September 2009. 1370 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. 
Sargent, 1371 "Computing TCP's Retransmission Timer", RFC 6298, 1372 June 2011. 1374 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1375 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1376 Based on Selective Acknowledgment (SACK) for TCP", 1377 RFC 6675, August 2012. 1379 [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 1380 RFC 6691, July 2012. 1382 [Watson81] 1383 Watson, R., "Timer-based Mechanisms in Reliable Transport 1384 Protocol Connection Management", Computer Networks, Vol. 1385 5, 1981. 1387 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM 1388 '86, Stowe, VT, August 1986. 1390 Appendix A. Implementation Suggestions 1392 TCP Option Layout 1394 The following layout is recommended for sending options on non-<SYN> 1395 segments, to achieve maximum feasible alignment of 32-bit 1396 and 64-bit machines.
1398 +--------+--------+--------+--------+
1399 |   NOP  |  NOP   |  TSopt |   10   |
1400 +--------+--------+--------+--------+
1401 |          TSval timestamp          |
1402 +--------+--------+--------+--------+
1403 |          TSecr timestamp          |
1404 +--------+--------+--------+--------+
1406 Interaction with the TCP Urgent Pointer 1408 The TCP Urgent pointer, like the TCP window, is a 16-bit value. 1409 Some of the original discussion for the TCP Window Scale option 1410 included proposals to increase the Urgent pointer to 32 bits. As 1411 it turns out, this is unnecessary. There are two observations 1412 that should be made: 1414 (1) With IP Version 4, the largest amount of TCP data that can be 1415 sent in a single packet is 65495 bytes (64 KiB - 1, minus the size of the 1416 fixed IP and TCP headers). 1418 (2) Updates to the urgent pointer while the user is in "urgent 1419 mode" are invisible to the user. 1421 This means that if the Urgent Pointer points beyond the end of the 1422 TCP data in the current segment, then the user will remain in 1423 urgent mode until the next TCP segment arrives.
That segment will 1424 update the urgent pointer to a new offset, and the user will never 1425 have left urgent mode. 1427 Thus, to properly implement the Urgent Pointer, the sending TCP 1428 only has to check for overflow of the 16-bit Urgent Pointer field 1429 before filling it in. If it does overflow, then a value of 65535 1430 should be inserted into the Urgent Pointer. 1432 The same technique applies to IP Version 6, except in the case of 1433 IPv6 Jumbograms. When IPv6 Jumbograms are supported, [RFC2675] 1434 requires additional steps for dealing with the Urgent Pointer; 1435 these are described in section 5.2 of [RFC2675]. 1437 Appendix B. Duplicates from Earlier Connection Incarnations 1439 There are two cases to be considered: (1) a system crashing (and 1440 losing connection state) and restarting, and (2) the same connection 1441 being closed and reopened without a loss of host state. These will 1442 be described in the following two sections. 1444 B.1. System Crash with Loss of State 1446 TCP's quiet time of one MSL upon system startup handles the loss of 1447 connection state in a system crash/restart. For an explanation, see 1448 for example "When to Keep Quiet" in the TCP protocol specification 1449 [RFC0793]. The MSL that is required here does not depend upon the 1450 transfer speed. The current TCP MSL of 2 minutes seemed acceptable 1451 as an operational compromise, when many host systems used to take 1452 this long to boot after a crash. Current host systems can boot 1453 considerably faster. 1455 The Timestamps option may be used to ease the MSL requirements (or to 1456 provide additional security against data corruption). If timestamps 1457 are being used and if the timestamp clock can be guaranteed to be 1458 monotonic over a system crash/restart, i.e., if the first value of 1459 the sender's timestamp clock after a crash/restart can be guaranteed 1460 to be greater than the last value before the restart, then a quiet 1461 time is unnecessary.
1463 To dispense totally with the quiet time would require that the host 1464 clock be synchronized to a time source that is stable over the crash/ 1465 restart period, with an accuracy of one timestamp clock tick or 1466 better. We can back off from this strict requirement to take 1467 advantage of approximate clock synchronization. Suppose that the 1468 clock is always re-synchronized to within N timestamp clock ticks and 1469 that booting (extended with a quiet time, if necessary) takes more 1470 than N ticks. This will guarantee monotonicity of the timestamps, 1471 which can then be used to reject old duplicates even without an 1472 enforced MSL. 1474 B.2. Closing and Reopening a Connection 1476 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state 1477 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). 1478 Applications built upon TCP that close one connection and open a new 1479 one (e.g., an FTP data transfer connection using Stream mode) must 1480 choose a new socket pair each time. The TIME-WAIT delay serves two 1481 different purposes: 1483 (a) Implement the full-duplex reliable close handshake of TCP. 1485 The proper time to delay the final close step is not really 1486 related to the MSL; it depends instead upon the RTO for the FIN 1487 segments and therefore upon the RTT of the path. (It could be 1488 argued that the side that is sending a FIN knows what degree of 1489 reliability it needs, and therefore it should be able to 1490 determine the length of the TIME-WAIT delay for the FIN's 1491 recipient. This could be accomplished with an appropriate TCP 1492 option in FIN segments.) 1494 Although there is no formal upper bound on RTT, common network 1495 engineering practice makes an RTT greater than 1 minute very 1496 unlikely. Thus, the 4 minute delay in TIME-WAIT state works 1497 satisfactorily to provide a reliable full-duplex TCP close.
1498 Note again that this is independent of MSL enforcement and 1499 network speed. 1501 The TIME-WAIT state could cause an indirect performance problem 1502 if an application needed to repeatedly close one connection and 1503 open another at a very high frequency, since the number of 1504 available TCP ports on a host is less than 2^16. However, high 1505 network speeds are not the major contributor to this problem; 1506 the RTT is the limiting factor in how quickly connections can be 1507 opened and closed. Therefore, this problem will be no worse at 1508 high transfer speeds. 1510 (b) Allow old duplicate segments to expire. 1512 To replace this function of TIME-WAIT state, a mechanism would 1513 have to operate across connections. PAWS is defined strictly 1514 within a single connection; the last timestamp (TS.Recent) is 1515 kept in the connection control block, and discarded when a 1516 connection is closed. 1518 An additional mechanism could be added to TCP: a per-host 1519 cache of the last timestamp received from any connection. This 1520 value could then be used in the PAWS mechanism to reject old 1521 duplicate segments from earlier incarnations of the connection, 1522 if the timestamp clock can be guaranteed to have ticked at least 1523 once since the old connection was open. This would require that 1524 the TIME-WAIT delay plus the RTT together must be at least one 1525 tick of the sender's timestamp clock. Such an extension is not 1526 part of the proposal of this RFC. 1528 Note that this is a variant on the mechanism proposed by 1529 Garlick, Rom, and Postel [Garlick77], which required each host 1530 to maintain connection records containing the highest sequence 1531 numbers on every connection. Using timestamps instead, it is 1532 only necessary to keep one quantity per remote host, regardless 1533 of the number of simultaneous connections to that host. 1535 Appendix C.
Summary of Notation 1537 The following notation has been used in this document. 1539 Options 1541 WSopt: TCP Window Scale option 1542 TSopt: TCP Timestamps option 1544 Option Fields 1546 shift.cnt: Window scale byte in WSopt 1547 TSval: 32-bit Timestamp Value field in TSopt 1548 TSecr: 32-bit Timestamp Reply field in TSopt 1550 Option Fields in Current Segment 1551 SEG.TSval: TSval field from TSopt in current segment 1552 SEG.TSecr: TSecr field from TSopt in current segment 1553 SEG.WSopt: 8-bit value in WSopt 1555 Clock Values 1557 my.TSclock: System wide source of 32-bit timestamp values 1558 my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec) 1559 Snd.TSoffset: An offset for randomizing Snd.TSclock 1560 Snd.TSclock: my.TSclock + Snd.TSoffset 1562 Per-Connection State Variables 1564 TS.Recent: Latest received Timestamp 1565 Last.ACK.sent: Last ACK field sent 1566 Snd.TS.OK: 1-bit flag 1567 Snd.WS.OK: 1-bit flag 1568 Rcv.Wind.Shift: Receive window scale exponent 1569 Snd.Wind.Shift: Send window scale exponent 1570 Start.Time: Snd.TSclock value when segment being timed was 1571 sent (used by pre-1323 code). 1573 Procedure 1575 Update_SRTT(m) Procedure to update the smoothed RTT and RTT 1576 variance estimates, using the rules of 1577 [Jacobson88a], given m, a new RTT measurement 1579 Appendix D. Event Processing Summary 1581 OPEN Call 1583 ... 1585 An initial send sequence number (ISS) is selected. Send a <SYN> 1586 segment of the form: 1588 <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift> 1590 ... 1592 SEND Call 1594 CLOSED STATE (i.e., TCB does not exist) 1595 ... 1597 LISTEN STATE 1599 If the foreign socket is specified, then change the connection 1600 from passive to active, select an ISS. Send a <SYN> segment 1601 containing the options: <TSval=Snd.TSclock> and 1602 <WSopt=Rcv.Wind.Shift>. Set SND.UNA to ISS, SND.NXT to ISS+1. 1603 Enter SYN-SENT state. ... 1605 SYN-SENT STATE 1606 SYN-RECEIVED STATE 1608 ...
1610 ESTABLISHED STATE 1611 CLOSE-WAIT STATE 1613 Segmentize the buffer and send it with a piggybacked 1614 acknowledgment (acknowledgment value = RCV.NXT). ... 1616 If the urgent flag is set ... 1618 If the Snd.TS.OK flag is set, then include the TCP Timestamps 1619 option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data 1620 segment. 1622 Scale the receive window for transmission in the segment 1623 header: 1625 SEG.WND = (RCV.WND >> Rcv.Wind.Shift). 1627 SEGMENT ARRIVES 1629 ... 1631 If the state is LISTEN then 1633 first check for an RST 1635 ... 1637 second check for an ACK 1639 ... 1641 third check for a SYN 1643 if the SYN bit is set, check the security. If the ... 1645 ... 1647 if the SEG.PRC is less than the TCB.PRC then continue. 1649 Check for a Window Scale option (WSopt); if one is found, 1650 save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on. 1651 Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to 1652 zero and clear Snd.WS.OK flag. 1654 Check for a TSopt option; if one is found, save SEG.TSval in 1655 the variable TS.Recent and turn on the Snd.TS.OK bit. 1657 Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any 1658 other control or text should be queued for processing later. 1659 ISS should be selected and a <SYN> segment sent of the form: 1661 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1663 If the Snd.WS.OK bit is on, include a WSopt option 1664 <WSopt=Rcv.Wind.Shift> in this segment. If the Snd.TS.OK 1665 bit is on, include a TSopt <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. Last.ACK.sent is set to 1667 RCV.NXT. 1669 SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection 1670 state should be changed to SYN-RECEIVED. Note that any 1671 other incoming control or data (combined with SYN) will be 1672 processed in the SYN-RECEIVED state, but processing of SYN 1673 and ACK should not be repeated. If the listen was not fully 1674 specified (i.e., the foreign socket was not fully 1675 specified), then the unspecified fields should be filled in 1676 now. 1678 fourth other text or control 1680 ...
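As a rough illustration only, the LISTEN-state Window Scale rule and the send-side scaling of SEG.WND summarized above might be sketched in Python as follows. The names Tcb, on_syn_in_listen, and seg_wnd_for_header are hypothetical and not part of this specification, and an absent WSopt is modeled as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tcb:
    """Hypothetical subset of a TCP control block (illustrative names)."""
    rcv_wind_shift: int        # Rcv.Wind.Shift: shift this host wants to use
    snd_wind_shift: int = 0    # Snd.Wind.Shift: shift announced by the peer
    snd_ws_ok: bool = False    # Snd.WS.OK flag


def on_syn_in_listen(tcb: Tcb, seg_wsopt: Optional[int]) -> None:
    """Apply the LISTEN-state WSopt rule; seg_wsopt is None if no WSopt arrived."""
    if seg_wsopt is not None:
        # WSopt found: remember the peer's shift and enable window scaling.
        tcb.snd_wind_shift = seg_wsopt
        tcb.snd_ws_ok = True
    else:
        # No WSopt: scaling is disabled in both directions.
        tcb.snd_wind_shift = 0
        tcb.rcv_wind_shift = 0
        tcb.snd_ws_ok = False


def seg_wnd_for_header(tcb: Tcb, rcv_wnd: int) -> int:
    """SEG.WND = RCV.WND >> Rcv.Wind.Shift, as used when sending data segments."""
    return rcv_wnd >> tcb.rcv_wind_shift
```

With Rcv.Wind.Shift = 7, a receive window of 1300 bytes would be advertised as 10 in this sketch; if the incoming SYN carried no WSopt, both shifts fall back to zero and the window is advertised unscaled.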
1682 If the state is SYN-SENT then 1684 first check the ACK bit 1686 ... 1688 ... 1690 fourth check the SYN bit 1692 ... 1694 If the SYN bit is on and the security/compartment and 1695 precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1, 1696 IRS is set to SEG.SEQ, and any acknowledgements on the 1697 retransmission queue which are thereby acknowledged should 1698 be removed. 1700 Check for a Window Scale option (WSopt); if it is found, 1701 save SEG.WSopt in Snd.Wind.Shift; otherwise, set both 1702 Snd.Wind.Shift and Rcv.Wind.Shift to zero. 1704 Check for a TSopt option; if one is found, save SEG.TSval in 1705 variable TS.Recent and turn on the Snd.TS.OK bit in the 1706 connection control block. If the ACK bit is set, use 1707 Snd.TSclock - SEG.TSecr as the initial RTT estimate. 1709 If SND.UNA > ISS (our <SYN> has been ACKed), change the 1710 connection state to ESTABLISHED, form an <ACK> segment: 1712 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1714 and send it. If the Snd.TS.OK bit is on, include a TSopt 1715 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> 1716 segment. Last.ACK.sent is set to RCV.NXT. 1718 Data or controls which were queued for transmission may be 1719 included. If there are other controls or text in the 1720 segment then continue processing at the sixth step below 1721 where the URG bit is checked, otherwise return. 1723 Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment: 1725 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1727 and send it. If the Snd.TS.OK bit is on, include a TSopt 1728 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. 1729 If the Snd.WS.OK bit is on, include a WSopt option 1730 <WSopt=Rcv.Wind.Shift> in this segment. Last.ACK.sent is 1731 set to RCV.NXT. 1733 If there are other controls or text in the segment, queue 1734 them for processing after the ESTABLISHED state has been 1735 reached, return. 1737 fifth, if neither of the SYN or RST bits is set then drop the 1738 segment and return.
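The Timestamps part of the SYN-SENT processing above might look roughly like the following Python sketch. TsState and on_syn_in_syn_sent are hypothetical names, a missing TSopt is modeled as None, and the modulo-2^32 subtraction is an assumption that follows from TSval and TSecr being 32-bit fields; the summary above simply writes Snd.TSclock - SEG.TSecr.

```python
from dataclasses import dataclass
from typing import Optional

TS_MODULUS = 1 << 32  # TSval and TSecr are 32-bit fields


@dataclass
class TsState:
    """Hypothetical per-connection timestamp state (illustrative names)."""
    ts_recent: int = 0       # TS.Recent
    snd_ts_ok: bool = False  # Snd.TS.OK


def on_syn_in_syn_sent(state: TsState,
                       seg_tsval: Optional[int],
                       seg_tsecr: Optional[int],
                       ack_bit: bool,
                       snd_tsclock: int) -> Optional[int]:
    """Handle TSopt on a SYN received in SYN-SENT.

    Returns the initial RTT estimate in timestamp clock ticks, or None
    when no estimate can be taken from this segment.
    """
    if seg_tsval is None:
        # No TSopt in the segment: timestamps stay disabled.
        return None
    state.ts_recent = seg_tsval
    state.snd_ts_ok = True
    if ack_bit and seg_tsecr is not None:
        # Assumed wrap-safe form of Snd.TSclock - SEG.TSecr.
        return (snd_tsclock - seg_tsecr) % TS_MODULUS
    return None
```

The result is expressed in ticks of the sender's timestamp clock; converting it to seconds depends on the clock period, which this sketch leaves to the caller.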
1740 Otherwise, 1742 First, check sequence number 1744 SYN-RECEIVED STATE 1745 ESTABLISHED STATE 1746 FIN-WAIT-1 STATE 1747 FIN-WAIT-2 STATE 1748 CLOSE-WAIT STATE 1749 CLOSING STATE 1750 LAST-ACK STATE 1751 TIME-WAIT STATE 1753 Segments are processed in sequence. Initial tests on 1754 arrival are used to discard old duplicates, but further 1755 processing is done in SEG.SEQ order. If a segment's 1756 contents straddle the boundary between old and new, only the 1757 new parts should be processed. 1759 Rescale the received window field: 1761 TrueWindow = SEG.WND << Snd.Wind.Shift, 1763 and use "TrueWindow" in place of SEG.WND in the following 1764 steps. 1766 Check whether the segment contains a Timestamps option and 1767 whether the Snd.TS.OK bit is on. If so: 1769 If SEG.TSval < TS.Recent and the RST bit is off, then 1770 test whether connection has been idle less than 24 days; 1771 if all are true, then the segment is not acceptable; 1772 follow steps below for an unacceptable segment. 1774 If SEG.SEQ is less than or equal to Last.ACK.sent, then 1775 save SEG.TSval in variable TS.Recent. 1777 There are four cases for the acceptability test for an 1778 incoming segment: 1780 ... 1782 If an incoming segment is not acceptable, an acknowledgment 1783 should be sent in reply (unless the RST bit is set, if so 1784 drop the segment and return): 1786 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1788 Last.ACK.sent is set to SEG.ACK of the acknowledgment. If 1789 the Snd.TS.OK bit is on, include the Timestamps option 1790 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment. 1791 Set Last.ACK.sent to SEG.ACK and send the <ACK> segment. 1792 After sending the acknowledgment, drop the unacceptable 1793 segment and return. 1795 ... 1797 fifth check the ACK field. 1799 if the ACK bit is off drop the segment and return. 1801 if the ACK bit is on 1803 ... 1805 ESTABLISHED STATE 1807 If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <- 1808 SEG.ACK. Also compute a new estimate of round-trip time.
1809 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr; 1810 otherwise use the elapsed time since the first segment in 1811 the retransmission queue was sent. Any segments on the 1812 retransmission queue which are thereby entirely 1813 acknowledged... 1815 ... 1817 Seventh, process the segment text. 1819 ESTABLISHED STATE 1820 FIN-WAIT-1 STATE 1821 FIN-WAIT-2 STATE 1823 ... 1825 Send an acknowledgment of the form: 1827 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1829 If the Snd.TS.OK bit is on, include the Timestamps option 1830 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment. 1831 Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send 1832 it. This acknowledgment should be piggy-backed on a segment 1833 being transmitted if possible without incurring undue delay. 1835 ... 1837 Appendix E. Timestamps Edge Cases 1839 While the rules laid out for when to calculate RTTM produce the 1840 correct results most of the time, there are some edge cases where an 1841 incorrect RTTM can be calculated. All of these situations involve 1842 the loss of segments. It is felt that these scenarios are rare, and 1843 that if they should happen, they will cause a single RTTM measurement 1844 to be inflated, which mitigates its effects on RTO calculations. 1846 [Martin03] cites two similar cases when the returning <ACK> is lost, 1847 and before the retransmission timer fires, another returning <ACK> 1848 segment arrives, which acknowledges the data. In this case, the RTTM 1849 calculated will be inflated:
1851 clock
1852 tc=1   <A, TSval=1> ------------------->
1854 tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
1855        (RTTM would have been 1)
1857        (receive window opens, window update is sent)
1858 tc=5          <---- <ACK(A), TSecr=1, win=m>
1859        (RTTM is calculated at 4)
1861 One thing to note about this situation is that it is somewhat bounded 1862 by RTO + RTT, limiting how far off the RTTM calculation will be.
1863 While more complex scenarios can be constructed that produce larger 1864 inflations (e.g., retransmissions are lost), those scenarios involve 1865 multiple segment losses, and the connection will have other more 1866 serious operational problems than using an inflated RTTM in the RTO 1867 calculation. 1869 Appendix F. Window Retraction Example 1871 Consider an established TCP connection using a scale factor of 128, 1872 Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very 1873 small window because the receiver is bottlenecked and both ends are 1874 doing small reads and writes. 1876 Consider the ACKs coming back:
1878 SEG.ACK  SEG.WIN  computed SND.WIN  receiver's actual window
1879 1000     2        1256              1300
1880 The sender writes 40 bytes and receiver ACKs:
1882 1040     2        1296              1300
1884 The sender writes 5 additional bytes and the receiver has a problem. 1885 Two choices:
1887 1045     2        1301              1300 - BEYOND BUFFER
1889 1045     1        1173              1300 - RETRACTED WINDOW
1891 This is a general problem and can happen any time the sender does a 1892 write which is smaller than the window scale factor. 1894 In most stacks it is at least partially obscured when the window size 1895 is larger than some small number of segments because the stacks 1896 prefer to announce windows that are an integral number of segments, 1897 rounded up to the next scale factor. This plus silly window 1898 suppression tends to cause less frequent, larger window updates. If 1899 the window were rounded down to a segment size, there would be more 1900 opportunity to advance the window (the BEYOND BUFFER case above) 1901 rather than retracting it. 1903 Appendix G. RTO calculation modification 1905 Taking multiple RTT samples per window would shorten the history 1906 calculated by the RTO mechanism in [RFC6298], and the algorithm below 1907 aims to maintain a history similar to that originally intended by 1908 [RFC6298].
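For orientation, a minimal Python sketch of the per-sample scaling that this appendix defines (ExpectedSamples, alpha' and beta') is given here. The function names are hypothetical, alpha = 1/8 and beta = 1/4 are the values recommended in [RFC6298], and clamping ExpectedSamples to at least 1 is an added assumption to avoid dividing by zero when FlightSize is 0.

```python
ALPHA = 1 / 8  # alpha as recommended in RFC 6298
BETA = 1 / 4   # beta as recommended in RFC 6298


def expected_samples(flight_size: int, smss: int) -> int:
    """ExpectedSamples = ceiling(FlightSize / (SMSS * 2))."""
    samples = -(-flight_size // (smss * 2))  # integer ceiling division
    # Assumption: never return 0, so alpha' and beta' stay defined.
    return max(samples, 1)


def update_rtt_estimators(srtt: float, rttvar: float, r_sample: float,
                          flight_size: int, smss: int) -> tuple:
    """Apply one RTT sample R' using alpha' and beta' in place of alpha and beta."""
    n = expected_samples(flight_size, smss)
    alpha_p = ALPHA / n
    beta_p = BETA / n
    # RTTVAR is updated first, using the previous SRTT, as in RFC 6298.
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - r_sample)
    srtt = (1 - alpha_p) * srtt + alpha_p * r_sample
    return srtt, rttvar
```

When FlightSize is at most two segments, ExpectedSamples is 1 and the update reduces to the unmodified [RFC6298] rules; with more data in flight, each individual sample is weighted proportionally less.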
1910 It is roughly known how many samples a congestion window worth of 1911 data will yield, not accounting for ACK compression and ACK losses. 1912 Such events will result in more history of the path being reflected 1913 in the final value for RTO, and are not critical. This modification 1914 will ensure that a similar amount of time is taken into account for 1915 the RTO estimation, regardless of how many samples are taken per 1916 window: 1918 ExpectedSamples = ceiling(FlightSize / (SMSS * 2)) 1920 alpha' = alpha / ExpectedSamples 1922 beta' = beta / ExpectedSamples 1924 Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs". 1926 In the algorithm of [RFC6298], use 1927 alpha' and beta' in place of alpha and beta: 1929 RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'| 1931 SRTT <- (1 - alpha') * SRTT + alpha' * R' 1933 (for each sample R') 1935 Appendix H. Changes from RFC 1323 1937 Several important updates and clarifications to the specification in 1938 RFC 1323 are made in this document. The technical changes are 1939 summarized below: 1941 (a) A wrong reference to SND.WND was corrected to SEG.WND in 1942 Section 2.3. 1944 (b) Section 2.4 was added describing the unavoidable window 1945 retraction issue, and explicitly describing the mitigation steps 1946 necessary. 1948 (c) In Section 3.2 the wording on how the Timestamps option negotiation 1949 is to be performed was updated with RFC2119 wording. Further, a 1950 number of paragraphs were added to clarify the expected behavior 1951 with a compliant implementation using TSopt, as RFC1323 left 1952 room for interpretation, e.g., potential late enablement of 1953 TSopt. 1955 (d) The description of which TSecr values can be used to update the 1956 measured RTT has been clarified. Specifically, with timestamps, 1957 the Karn algorithm [Karn87] is disabled.
The Karn algorithm 1958 disables all RTT measurements during retransmission, since it is 1959 ambiguous whether the <ACK> is for the original segment, or the 1960 retransmitted segment. With timestamps, that ambiguity is 1961 removed since the TSecr in the <ACK> will contain the TSval from 1962 whichever data segment made it to the destination. 1964 (e) RTTM update processing explicitly excludes segments not updating 1965 SND.UNA. The original text could be interpreted to allow taking 1966 RTT samples when SACK acknowledges some new, non-continuous 1967 data. 1969 (f) In RFC1323, section 3.4, step (2) of the algorithm to control 1970 which timestamp is echoed was incorrect in two regards: 1972 (1) It failed to update TS.recent for a retransmitted segment 1973 that resulted from a lost <ACK>. 1975 (2) It failed if SEG.LEN = 0. 1977 In the new algorithm, the case of SEG.TSval >= TS.recent is 1978 included for consistency with the PAWS test. 1980 (g) It is now recommended that the Timestamps option is included in 1981 <RST> segments if the incoming segment contained a Timestamps 1982 option. 1984 (h) <RST> segments are explicitly excluded from PAWS processing. 1986 (i) Added text to clarify the precedence between regular TCP 1987 [RFC0793] and this document's Timestamps option / PAWS processing. 1988 Discussion about combined acceptability checks is ongoing. 1990 (j) Snd.TSoffset and Snd.TSclock variables have been added. 1991 Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This 1992 allows the starting points for timestamp values to be randomized 1993 on a per-connection basis. Setting Snd.TSoffset to zero yields 1994 the same results as [RFC1323]. 1996 (k) Appendix A has been expanded with information about the TCP 1997 Urgent Pointer. An earlier revision contained text around the 1998 TCP MSS option, which was split off into [RFC6691]. 2000 (l) One correction was made to the Event Processing Summary in 2001 Appendix D.
In SEND CALL/ESTABLISHED STATE, RCV.WND is used to 2002 fill in the SEG.WND value, not SND.WND. 2004 (m) Appendix G was added to exemplify how an RTO calculation might 2005 be updated to properly take the much higher RTT sampling 2006 frequency enabled by the Timestamps option into account. 2008 Editorial changes to the document that don't impact the 2009 implementation or function of the mechanisms described in this 2010 document include: 2012 (a) Removed much of the discussion in Section 1 to streamline the 2013 document. However, detailed examples and discussions in 2014 Section 2, Section 3 and Section 4 are kept as guidelines for 2015 implementers. 2017 (b) Removed references to "new" options, as the options were 2018 introduced in [RFC1323] already. Changed the text in 2019 Section 1.3 to specifically address TS and WS options. 2021 (c) Section 1.4 was added for [RFC2119] wording. Normative text was 2022 updated with the appropriate phrases. 2024 (d) Added < > brackets to mark specific types of segments, and 2025 replaced most occurrences of "packet" with "segment", where TCP 2026 segments are referred to. 2028 (e) Updated the text in Section 3 to take into account what has been 2029 learned since [RFC1323]. 2031 (f) Removed the list of changes between [RFC1323] and prior 2032 versions. These changes are mentioned in Appendix C of 2033 [RFC1323]. 2035 (g) Moved the appendix "Changes from RFC 1323" to the end of the 2036 appendices for easier lookup. In addition, the entries were 2037 split into a technical and an editorial part, and sorted to 2038 roughly correspond with the sections in the text where they 2039 apply. 2041 Authors' Addresses 2043 David Borman 2044 Quantum Corporation 2045 Mendota Heights MN 55120 2046 USA 2048 Email: david.borman@quantum.com 2050 Bob Braden 2051 University of Southern California 2052 4676 Admiralty Way 2053 Marina del Rey CA 90292 2054 USA 2056 Email: braden@isi.edu 2057 Van Jacobson 2058 Google, Inc.
2059 1600 Amphitheatre Parkway 2060 Mountain View CA 94043 2061 USA 2063 Email: vanj@google.com 2065 Richard Scheffenegger (editor) 2066 NetApp, Inc. 2067 Am Euro Platz 2 2068 Vienna, 1120 2069 Austria 2071 Email: rs@netapp.com