idnits 2.17.1 draft-ietf-tcpm-1323bis-16.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The abstract seems to indicate that this document obsoletes RFC1323, but the header doesn't have an 'Obsoletes:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 12, 2013) is 3815 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'Ekstroem04' is defined on line 1294, but no explicit reference was found in the text == Unused Reference: 'Hamming77' is defined on line 1312, but no explicit reference was found in the text == Unused Reference: 'Jain86' is defined on line 1336, but no explicit reference was found in the text == Unused Reference: 'Mathis08' is defined on line 1366, but no explicit reference was found in the text == Unused Reference: 'RFC0896' is defined on line 1391, but no explicit reference was found in the text == Unused Reference: 'RFC1110' is defined on line 1397, but no explicit reference was found in the text == Unused Reference: 'RFC2581' is defined on line 1415, but no explicit reference was found in the text == Unused Reference: 'Watson81' is defined on line 1459, but no explicit reference was found in the text == Unused Reference: 'Zhang86' is defined on line 1464, but no explicit reference was found in the text ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 1072 (Obsoleted by RFC 1323, RFC 2018, RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1110 (Obsoleted by RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1185 (Obsoleted by RFC 1323) -- Obsolete informational reference (is this intentional?): RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 1981 (Obsoleted by RFC 8201) -- Obsolete informational reference (is this intentional?): RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 6528 
(Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 6691 (Obsoleted by RFC 9293) Summary: 1 error (**), 0 flaws (~~), 10 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance (TCPM) D. Borman 3 Internet-Draft Quantum Corporation 4 Intended status: Standards Track B. Braden 5 Expires: May 16, 2014 University of Southern 6 California 7 V. Jacobson 8 Google, Inc. 9 R. Scheffenegger, Ed. 10 NetApp, Inc. 11 November 12, 2013 13 TCP Extensions for High Performance 14 draft-ietf-tcpm-1323bis-16 16 Abstract 18 This document specifies a set of TCP extensions to improve 19 performance over paths with a large bandwidth * delay product and to 20 provide reliable operation over very high-speed paths. It defines 21 TCP options for scaled windows and timestamps. The timestamps can be 22 used for two distinct mechanisms, PAWS (Protection Against Wrapped 23 Sequences) and RTTM (Round Trip Time Measurement). 25 This document obsoletes RFC 1323 and describes changes from it. 27 Status of this Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on May 16, 2014. 44 Copyright Notice 46 Copyright (c) 2013 IETF Trust and the persons identified as the 47 document authors. 
All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4 63 1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5 64 1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6 65 1.4. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 66 2. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 8 67 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 8 68 2.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 8 69 2.3. Using the Window Scale Option . . . . . . . . . . . . . . 9 70 2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10 71 3. TCP Timestamps option . . . . . . . . . . . . . . . . . . . . 12 72 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 12 73 3.2. Timestamps option . . . . . . . . . . . . . . . . . . . . 12 74 4. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . . . 15 75 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 15 76 4.2. Updating the RTO value . . . . . . . . . . . . . . . . . . 16 77 4.3. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 16 78 5. PAWS - Protection Against Wrapped Sequence Numbers . . . . . . 20 79 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 20 80 5.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 
20 81 5.3. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . . . 21 82 5.4. Timestamp Clock . . . . . . . . . . . . . . . . . . . . . 23 83 5.5. Outdated Timestamps . . . . . . . . . . . . . . . . . . . 25 84 5.6. Header Prediction . . . . . . . . . . . . . . . . . . . . 25 85 5.7. IP Fragmentation . . . . . . . . . . . . . . . . . . . . . 27 86 5.8. Duplicates from Earlier Incarnations of Connection . . . . 27 87 6. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 28 88 7. Security Considerations . . . . . . . . . . . . . . . . . . . 28 89 7.1. Privacy Considerations . . . . . . . . . . . . . . . . . . 30 90 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 91 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 30 92 9.1. Normative References . . . . . . . . . . . . . . . . . . . 30 93 9.2. Informative References . . . . . . . . . . . . . . . . . . 31 94 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 34 95 Appendix B. Duplicates from Earlier Connection Incarnations . . . 35 96 B.1. System Crash with Loss of State . . . . . . . . . . . . . 36 97 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 36 98 Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 37 99 Appendix D. Event Processing Summary . . . . . . . . . . . . . . 38 100 Appendix E. Timestamps Edge Cases . . . . . . . . . . . . . . . . 44 101 Appendix F. Window Retraction Example . . . . . . . . . . . . . . 45 102 Appendix G. RTO calculation modification . . . . . . . . . . . . 45 103 Appendix H. Changes from RFC 1323 . . . . . . . . . . . . . . . . 46 104 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 48 106 1. Introduction 108 The TCP protocol [RFC0793] was designed to operate reliably over 109 almost any transmission medium regardless of transmission rate, 110 delay, corruption, duplication, or reordering of segments. 
Over the 111 years, advances in networking technology have resulted in ever-higher 112 transmission speeds, and the fastest paths are well beyond the domain 113 for which TCP was originally engineered. 115 This document defines a set of modest extensions to TCP to extend the 116 domain of its application to match the increasing network capability. 117 It is an update to and obsoletes [RFC1323], which in turn is based 118 upon and obsoletes [RFC1072] and [RFC1185]. 120 Changes between [RFC1323] and this document are detailed in 121 Appendix H. These changes are partly due to errata in [RFC1323], and 122 partly due to the improved understanding of how the involved 123 components interact. 125 For brevity, the full discussions of the merits and history behind 126 the TCP options defined within this document have been omitted. 127 [RFC1323] should be consulted for reference. It is recommended that 128 a modern TCP stack implement and make use of the extensions 129 described in this document. 131 1.1. TCP Performance 133 TCP performance problems arise when the bandwidth * delay product is 134 large. A network having such paths is referred to as a "long, fat 135 network" (LFN). 137 There are two fundamental performance problems with basic TCP over 138 LFN paths: 140 (1) Window Size Limit 142 The TCP header uses a 16-bit field to report the receive window 143 size to the sender. Therefore, the largest window that can be 144 used is 2^16 = 64 KiB. For LFN paths where the bandwidth * 145 delay product exceeds 64 KiB, the receive window limits the 146 maximum throughput of the TCP connection over the path, i.e., 147 the amount of unacknowledged data that TCP can send in order to 148 keep the pipeline full. 150 To circumvent this problem, Section 2 of this memo defines a TCP 151 option, "Window Scale", to allow windows larger than 2^16.
This 152 option defines an implicit scale factor, which is used to 153 multiply the window size value found in a TCP header to obtain 154 the true window size. 156 Note that the use of large receive windows 157 increases the chance of sequence numbers wrapping too quickly, 158 as described below in Section 1.2, (1). 160 (2) Recovery from Losses 162 Packet losses in an LFN can have a catastrophic effect on 163 throughput. 165 To generalize the Fast Retransmit / Fast Recovery mechanism to 166 handle multiple packets dropped per window, Selective 167 Acknowledgments are required. Unlike the normal cumulative 168 acknowledgments of TCP, Selective Acknowledgments give the 169 sender a complete picture of which segments are queued at the 170 receiver and which have not yet arrived. 172 Selective Acknowledgments and their use are specified in 173 separate documents, "TCP Selective Acknowledgment Options" 174 [RFC2018], "An Extension to the Selective Acknowledgement (SACK) 175 Option for TCP" [RFC2883], and "A Conservative Selective 176 Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP" 177 [RFC6675], and are not further discussed in this document. 179 1.2. TCP Reliability 181 An especially serious kind of error may result from an accidental 182 reuse of TCP sequence numbers in data segments. TCP reliability 183 depends upon the existence of a bound on the lifetime of a segment: 184 the "Maximum Segment Lifetime" or MSL. 186 Duplication of sequence numbers might happen in either of two ways: 188 (1) Sequence number wrap-around on the current connection 190 A TCP sequence number contains 32 bits. At a high enough 191 transfer rate of large volumes of data (at least 4 GiB in the 192 same session), the 32-bit sequence space may be "wrapped" 193 (cycled) within the time that a segment is delayed in queues.
195 (2) Earlier incarnation of the connection 197 Suppose that a connection terminates, either by a proper close 198 sequence or due to a host crash, and the same connection (i.e., 199 using the same pair of port numbers) is immediately reopened. A 200 delayed segment from the terminated connection could fall within 201 the current window for the new incarnation and be accepted as 202 valid. 204 Duplicates from earlier incarnations, case (2), are avoided by 205 enforcing the current fixed MSL of the TCP specification, as 206 explained in Section 5.8 and Appendix B. In addition, the randomizing 207 of ephemeral ports can help to probabilistically reduce the 208 chances of duplicates from earlier connections. However, case (1), 209 avoiding the reuse of sequence numbers within the same connection, 210 requires an upper bound on MSL that depends upon the transfer rate, 211 and at high enough rates, a dedicated mechanism is required. 213 A possible fix for the problem of cycling the sequence space would be 214 to increase the size of the TCP sequence number field. For example, 215 the sequence number field (and also the acknowledgment field) could 216 be expanded to 64 bits. This could be done either by changing the 217 TCP header or by means of an additional option. 219 Section 5 presents a different mechanism, which we call PAWS 220 (Protection Against Wrapped Sequence numbers), to extend TCP 221 reliability to transfer rates well beyond the foreseeable upper limit 222 of network bandwidths. PAWS uses the TCP Timestamps option defined 223 in Section 3.2 to protect against old duplicates from the same 224 connection. 226 1.3. Using TCP options 228 The extensions defined in this document all use TCP options. 230 When [RFC1323] was published, there was concern that some buggy TCP 231 implementation might crash on the first appearance of an option on a 232 non-<SYN> segment. However, bugs like that can lead to DoS attacks 233 against a TCP.
Research has shown that most TCP implementations will 234 properly handle unknown options on non-<SYN> segments ([Medina04], 235 [Medina05]). But it is still prudent to be conservative in what you 236 send, and avoiding buggy TCP implementations is not the only reason 237 for negotiating TCP options on <SYN> segments. 239 The Window Scale option negotiates fundamental parameters of the TCP 240 session. Therefore, it is only sent during the initial handshake. 241 Furthermore, the Window Scale option will be sent in a <SYN,ACK> 242 segment only if the corresponding option was received in the initial 243 <SYN> segment. 245 The Timestamps option may appear in any data or <ACK> segment, adding 246 10 bytes (up to 12 bytes including padding) to the 20-byte TCP 247 header. It is required that this TCP option be sent on all non-<SYN> 248 segments after an exchange of options on the <SYN> segments has 249 indicated that both sides understand this extension. 251 Research has shown that the use of the Timestamps option to take 252 additional RTT samples within each RTT has little effect on the 253 ultimate retransmission timeout value [Allman99]. However, there are 254 other uses of the Timestamps option, such as the Eifel mechanism 255 ([RFC3522], [RFC4015]) and PAWS (see Section 5), which improve overall 256 TCP security and performance. The extra header bandwidth used by 257 this option should be weighed against the gains in performance and 258 security in an actual deployment. 260 Appendix A contains a recommended layout of the options in TCP 261 headers to achieve reasonable data field alignment. 263 Finally, we observe that most of the mechanisms defined in this 264 document are important for LFNs and/or very high-speed networks. 265 For low-speed networks, it might be a performance optimization to NOT 266 use these mechanisms.
A TCP vendor concerned about optimal 267 performance over low-speed paths might consider turning these 268 extensions off for low-speed paths, or allowing a user or installation 269 manager to disable them. 271 1.4. Terminology 273 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 274 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 275 document are to be interpreted as described in [RFC2119]. 277 In this document, these words will appear with that interpretation 278 only when in UPPER CASE. Lower case uses of these words are not to 279 be interpreted as carrying [RFC2119] significance. 281 2. TCP Window Scale Option 283 2.1. Introduction 285 The window scale extension expands the definition of the TCP window 286 to 30 bits and then uses an implicit scale factor to carry this 30- 287 bit value in the 16-bit Window field of the TCP header (SEG.WND in 288 [RFC0793]). The exponent of the scale factor is carried in a TCP 289 option, Window Scale. This option is sent only in a <SYN> segment (a 290 segment with the SYN bit on), hence the window scale is fixed in each 291 direction when a connection is opened. 293 The maximum receive window, and therefore the scale factor, is 294 determined by the maximum receive buffer space. In a typical modern 295 implementation, this maximum buffer space is set by default but can 296 be overridden by a user program before a TCP connection is opened. 297 This determines the scale factor, and therefore no new user interface 298 is needed for window scaling. 300 2.2. Window Scale Option 302 The three-byte Window Scale option MAY be sent in a <SYN> segment by 303 a TCP. It has two purposes: (1) indicate that the TCP is prepared to 304 both send and receive window scaling, and (2) communicate the 305 exponent of a scale factor to be applied to its receive window. 306 Thus, a TCP that is prepared to scale windows SHOULD send the option, 307 even if its own scale factor is 1 and the exponent 0.
The scale 308 factor is limited to a power of two and encoded logarithmically, so 309 it may be implemented by binary shift operations. The maximum scale 310 exponent is limited to 14 for a maximum permissible receive window 311 size of 1 GiB (2^(14+16)). 313 TCP Window Scale Option (WSopt): 315 Kind: 3 317 Length: 3 bytes
319 +---------+---------+---------+
320 | Kind=3  |Length=3 |shift.cnt|
321 +---------+---------+---------+
322      1         1         1
324 This option is an offer, not a promise; both sides MUST send Window 325 Scale options in their <SYN> segments to enable window scaling in 326 either direction. If window scaling is enabled, then the TCP that 327 sent this option will right-shift its true receive-window values by 328 'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt' 329 MAY be zero (offering to scale, while applying a scale factor of 1 to 330 the receive window). 332 This option MAY be sent in an initial <SYN> segment (i.e., a segment 333 with the SYN bit on and the ACK bit off). It MAY also be sent in a 334 <SYN,ACK> segment, but only if a Window Scale option was received in 335 the initial <SYN> segment. A Window Scale option in a segment 336 without a SYN bit MUST be ignored. 338 The window field in a segment where the SYN bit is set (i.e., a <SYN> 339 or <SYN,ACK>) MUST NOT be scaled. 341 2.3. Using the Window Scale Option 343 A model implementation of window scaling is as follows, using the 344 notation of [RFC0793]: 346 o The connection state MUST be augmented by two window shift 347 counters, Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the 348 incoming and outgoing window fields, respectively. 350 o If a TCP receives a <SYN> segment containing a Window Scale 351 option, it SHOULD send its own Window Scale option in the 352 <SYN,ACK> segment. 354 o The Window Scale option MUST be sent with shift.cnt = R, where R 355 is the value that the TCP would like to use for its receive 356 window.
358 o Upon receiving a <SYN> segment with a Window Scale option 359 containing shift.cnt = S, a TCP MUST set Snd.Wind.Shift to S and 360 MUST set Rcv.Wind.Shift to R; otherwise, it MUST set both 361 Snd.Wind.Shift and Rcv.Wind.Shift to zero. 363 o The window field (SEG.WND) in the header of every incoming 364 segment, with the exception of <SYN> segments, MUST be left- 365 shifted by Snd.Wind.Shift bits before updating SND.WND: 367 SND.WND = SEG.WND << Snd.Wind.Shift 369 (assuming the other conditions of [RFC0793] are met, and using the 370 "C" notation "<<" for left-shift). 372 o The window field (SEG.WND) of every outgoing segment, with the 373 exception of <SYN> segments, MUST be right-shifted by 374 Rcv.Wind.Shift bits: 376 SEG.WND = RCV.WND >> Rcv.Wind.Shift 378 TCP determines if a data segment is "old" or "new" by testing whether 379 its sequence number is within 2^31 bytes of the left edge of the 380 window, and if it is not, discarding the data as "old". To ensure 381 that new data is never mistakenly considered old and vice versa, the 382 left edge of the sender's window has to be at most 2^31 away from the 383 right edge of the receiver's window. The same applies to the sender's 384 right edge and the receiver's left edge. Since the right and left edges 385 of either the sender's or receiver's window differ by the window 386 size, and since the sender and receiver windows can be out of phase 387 by at most the window size, the above constraints imply that two 388 times the maximum window size must be less than 2^31, or 390 max window < 2^30 392 Since the max window is 2^S (where S is the scaling shift count) 393 times at most 2^16 - 1 (the maximum unscaled window), the maximum 394 window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count 395 MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a 396 Window Scale option is received with a shift.cnt value larger than 397 14, the TCP SHOULD log the error but MUST use 14 instead of the 398 specified value.
This is safe as a sender can always choose to only 399 partially use any signaled receive window. If the receiver is 400 scaling by a shift count larger than 14 and the sender is only scaling by 401 14, then the receive window used by the sender will appear smaller 402 than it is in reality. 404 The scale factor applies only to the Window field as transmitted in 405 the TCP header; each TCP using extended windows will maintain the 406 window values locally as 32-bit numbers. For example, the 407 "congestion window" computed by Slow Start and Congestion Avoidance 408 (see [RFC5681]) is not affected by the scale factor, so window 409 scaling will not introduce quantization into the congestion window. 411 2.4. Addressing Window Retraction 413 When a non-zero scale factor is in use, there are instances when a 414 retracted window can be offered (see Appendix F for a detailed 415 example). The end of the window will be on a boundary based on the 416 granularity of the scale factor being used. If the sequence number 417 is then updated by a number of bytes smaller than that granularity, 418 the TCP will have to either advertise a new window that is beyond 419 what it previously advertised (and perhaps beyond the buffer), or 420 will have to advertise a smaller window, which will cause the TCP 421 window to shrink. Implementations MUST ensure that they handle a 422 shrinking window, as specified in Section 4.2.2.16 of [RFC1122]. 424 For the receiver, this implies that: 426 1) The receiver MUST honor, as in-window, any segment that would 427 have been in-window for any <ACK> sent by the receiver. 429 2) When window scaling is in effect, the receiver SHOULD track the 430 actual maximum window sequence number (which is likely to be 431 greater than the window announced by the most recent <ACK>, if 432 more than one segment has arrived since the application consumed 433 any data in the receive buffer).
435 On the sender side: 437 3) The initial transmission MUST be within the window announced by 438 the most recent <ACK>. 440 4) On first retransmission, or if the sequence number is out-of- 441 window by less than 2^Rcv.Wind.Shift, then do normal 442 retransmission(s) without regard to receiver window as long as 443 the original segment was in window when it was sent. 445 5) Subsequent retransmissions MAY only be sent if they are within 446 the window announced by the most recent <ACK>. 448 3. TCP Timestamps option 450 3.1. Introduction 452 The Timestamps option is introduced to address some of the issues 453 mentioned in Section 1.1 and Section 1.2. The Timestamps option is 454 specified in a symmetrical manner, so that TSval timestamps are 455 carried in both data and <ACK> segments and are echoed in TSecr 456 fields carried in returning <ACK> or data segments. Originally used 457 primarily for timestamping individual segments, the properties of the 458 Timestamps option allow not only the use for taking time measurements 459 (Section 4), but additional uses as well (Section 5). 461 It is necessary to remember that there is a distinction between the 462 Timestamps option conveying timestamp information, and the use of 463 that information. In particular, the Round Trip Time Measurement 464 (RTTM) mechanism must be viewed independently from updating the 465 Retransmission Timeout (RTO) (see Section 4.2). In this case, the 466 sample granularity also needs to be taken into account. Other 467 mechanisms, such as PAWS or Eifel, are not built upon the timestamp 468 information itself, but are based on the intrinsic property of 469 monotonically increasing values. 471 The Timestamps option is important when large receive windows are 472 used, to allow the use of the PAWS mechanism (see Section 5).
473 Furthermore, the option may be useful for all TCPs, since it 474 simplifies the sender and allows the use of additional optimizations 475 such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817], 476 [Kuzmanovic03], [Kuehlewind10]). 478 3.2. Timestamps option 480 TCP is a symmetric protocol, allowing data to be sent at any time in 481 either direction, and therefore timestamp echoing may occur in either 482 direction. For simplicity and symmetry, we specify that timestamps 483 always be sent and echoed in both directions. For efficiency, we 484 combine the timestamp and timestamp reply fields into a single TCP 485 Timestamps option. 487 TCP Timestamps option (TSopt): 489 Kind: 8 491 Length: 10 bytes
493 +-------+-------+---------------------+---------------------+
494 |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
495 +-------+-------+---------------------+---------------------+
496     1       1              4                     4
498 The Timestamps option carries two four-byte timestamp fields. The 499 Timestamp Value field (TSval) contains the current value of the 500 timestamp clock of the TCP sending the option. 502 The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set 503 in the TCP header; if it is valid, it echoes a timestamp value that 504 was sent by the remote TCP in the TSval field of a Timestamps option. 505 When TSecr is not valid, its value MUST be zero. However, a value of 506 zero does not imply that TSecr is invalid. The TSecr value will 507 generally be from the most recent Timestamps option that was 508 received; however, there are exceptions that are explained below. 510 A TCP MAY send the Timestamps option (TSopt) in an initial <SYN> 511 segment (i.e., segment containing a SYN bit and no ACK bit), and MAY 512 send a TSopt in <SYN,ACK> only if it received a TSopt in the initial 513 <SYN> segment for the connection.
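As an illustration of the TSopt layout described above, the following is a minimal Python sketch of encoding and decoding the option. The function names encode_tsopt and decode_tsopt are hypothetical and are not defined by this document; the sketch assumes the 10 option bytes have already been located within the TCP options area, and it does not handle padding.

```python
import struct

TSOPT_KIND = 8      # Kind value of the Timestamps option
TSOPT_LENGTH = 10   # total option length in bytes


def encode_tsopt(tsval, tsecr, ack_set):
    """Build the 10-byte TSopt (hypothetical helper, for illustration).

    TSecr is only valid when the ACK bit is set; otherwise it is sent as zero.
    """
    if not ack_set:
        tsecr = 0
    # "!" selects network byte order without padding: 1 + 1 + 4 + 4 bytes.
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LENGTH,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def decode_tsopt(data):
    """Return (TSval, TSecr), or None if data is not a well-formed TSopt."""
    if len(data) != TSOPT_LENGTH:
        return None
    kind, length, tsval, tsecr = struct.unpack("!BBII", data)
    if kind != TSOPT_KIND or length != TSOPT_LENGTH:
        return None
    return tsval, tsecr
```

Note that a decoded TSecr of zero does not by itself indicate an invalid echo; the ACK bit of the segment decides that.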
515 Once TSopt has been successfully negotiated, that is, both <SYN> and 516 <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> 517 segment for the duration of the connection, and SHOULD be sent in an 518 <RST> segment (see Section 5.2 for details). The TCP SHOULD remember 519 this state by setting a flag, referred to as Snd.TS.OK, to one. If a 520 non-<RST> segment is received without a TSopt, a TCP SHOULD silently 521 drop the segment. A TCP MUST NOT abort a TCP connection because any 522 segment lacks an expected TSopt. 524 Implementations are strongly encouraged to follow the above rules for 525 handling a missing Timestamps option, and the order of precedence 526 mentioned in Section 5.3 when deciding on the acceptance of a 527 segment. 529 If a receiver chooses to accept a segment without an expected 530 Timestamps option, it must be clear that undetectable data corruption 531 may occur. 533 Such a TCP receiver may experience undetectable wrapped-sequence 534 effects, such as data (payload) corruption or session stalls. In 535 order to maintain the integrity of the payload data, in particular on 536 high-speed networks, it is paramount to follow the described 537 processing rules. 539 However, it has been mentioned that under some circumstances, the 540 above guidelines are too strict, and some paths sporadically suppress 541 the Timestamps option, while maintaining payload integrity. A path 542 behaving in this manner should be deemed unacceptable, but it has 543 been noted that some implementations relax the acceptance rules as a 544 workaround, and allow TCP to run across such paths [Oppermann13]. 546 If a TSopt is received on a connection where TSopt was not negotiated 547 in the initial three-way handshake, the TSopt MUST be ignored and the 548 packet processed normally. 550 In the case of crossing <SYN> segments where one <SYN> contains a 551 TSopt and the other doesn't, both sides MAY send a TSopt in the 552 <SYN,ACK> segment.
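The receive-side rules of this section for a present or missing TSopt can be sketched as a small decision function. This is an illustrative Python sketch, not a normative algorithm: the function name tsopt_action and its return strings are hypothetical, and the treatment of an <RST> segment that arrives without a TSopt (processed normally here) is an assumption of the sketch.

```python
def tsopt_action(ts_negotiated, has_tsopt, is_rst):
    """Classify an incoming segment (hypothetical helper, for illustration).

    ts_negotiated: True if both <SYN> and <SYN,ACK> carried a TSopt
                   (the Snd.TS.OK flag of this section).
    has_tsopt:     True if the incoming segment carries a TSopt.
    is_rst:        True if the incoming segment has the RST bit set.

    Returns "process", "ignore_option", or "drop".
    """
    if not ts_negotiated:
        # TSopt was not negotiated: ignore the option, process the packet.
        return "ignore_option" if has_tsopt else "process"
    if has_tsopt:
        return "process"
    if is_rst:
        # Assumption of this sketch: an <RST> without TSopt is not dropped.
        return "process"
    # Negotiated, non-<RST>, TSopt missing: silently drop the segment.
    # The connection itself is never aborted for this reason.
    return "drop"
```

A stack that relaxes the acceptance rules as a workaround for paths that strip the option would return "process" in the last case, at the cost of losing the protection described above.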
554 TSopt is required for the two mechanisms described in Sections 4 and 555 5. There are also other mechanisms that rely on the presence of the 556 TSopt, e.g., [RFC3522]. If a TCP stops sending TSopt at any time 557 during an established session, it interferes with these mechanisms. 558 This update to [RFC1323] describes explicitly the previous assumption 559 (see Section 5.2), that each TCP segment must have TSopt, once 560 negotiated. 562 4. The RTTM Mechanism 564 4.1. Introduction 566 One use of the Timestamps option is to measure the round trip time of 567 virtually every packet acknowledged. The Round Trip Time Measurement 568 (RTTM) mechanism requires a Timestamps option in every measured 569 segment, with a TSval that is obtained from a (virtual) "timestamp 570 clock". Values of this clock MUST be at least approximately 571 proportional to real time, in order to measure actual RTT. 573 TCP measures the round trip time (RTT), primarily for the purpose of 574 arriving at a reasonable value for the Retransmission Timeout (RTO) 575 timer interval. Accurate and current RTT estimates are necessary to 576 adapt to changing traffic conditions, while a conservative estimate 577 of the RTO interval is necessary to minimize spurious RTOs. 579 These TSval values are echoed in TSecr values in the reverse 580 direction. The difference between a received TSecr value and the 581 current timestamp clock value provides an RTT measurement. 583 When timestamps are used, every segment that is received will contain 584 a TSecr value. However, these values cannot all be used to update 585 the measured RTT. The following example illustrates why. It shows a 586 one-way data flow with segments arriving in sequence without loss. 587 Here A, B, C... represent data blocks occupying successive blocks of 588 sequence numbers, and ACK(A),... represent the corresponding 589 cumulative acknowledgments. The two timestamp fields of the 590 Timestamps option are shown symbolically as <TSval=x,TSecr=y>.
Each 591 TSecr field contains the value most recently received in a TSval 592 field.
594 TCP A                                            TCP B
596            <A,TSval=1,TSecr=120> ----->
598       <---- <ACK(A),TSval=127,TSecr=1>
600            <B,TSval=5,TSecr=127> ----->
602       <---- <ACK(B),TSval=131,TSecr=5>
604 . . . . . . . . . . . . . . . . . . . . . .
606            <C,TSval=65,TSecr=131> ---->
608       <---- <ACK(C),TSval=191,TSecr=65>
609            (etc.)
611 The dotted line marks a pause (60 time units long) in which A had 612 nothing to send. Note that this pause inflates the RTT which B could 613 infer from receiving TSecr=131 in data segment C. Thus, in one-way 614 data flows, RTTM in the reverse direction measures a value that is 615 inflated by gaps in sending data. However, the following rule 616 prevents a resulting inflation of the measured RTT: 618 RTTM Rule: A TSecr value received in a segment MAY be used to update 619 the averaged RTT measurement only if the segment advances 620 the left edge of the send window, i.e. SND.UNA is 621 increased. 623 Since TCP B is not sending data, the data segment C does not 624 acknowledge any new data when it arrives at B. Thus, the inflated 625 RTTM measurement is not used to update B's RTTM measurement. 627 4.2. Updating the RTO value 629 When [RFC1323] was originally written, it was perceived that taking 630 RTT measurements for each segment, and also during retransmissions, 631 would help to reduce spurious RTOs, while maintaining the 632 timeliness of necessary RTOs. At the time, RTO was also the only 633 mechanism to make use of the measured RTT. It has since been shown that 634 taking more RTT samples has only a very limited effect on optimizing 635 RTOs [Allman99]. 637 Implementers should note that with timestamps multiple RTTMs can be 638 taken per RTT. The [RFC6298] RTO estimator has weighting factors, 639 alpha and beta, based on an implicit assumption that at most one RTTM 640 will be sampled per RTT. When multiple RTTMs per RTT are available 641 to update the RTO estimator, an implementation SHOULD try to adhere 642 to the spirit of the history specified in [RFC6298]. An 643 implementation suggestion is detailed in Appendix G.
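The RTTM rule and its interaction with the [RFC6298] estimator can be illustrated with a short sketch. This is a minimal Python example under stated assumptions, not a reference implementation: the class name `RttmEstimator` and its method are hypothetical, times are expressed in timestamp clock ticks, and the smoothing uses the unmodified [RFC6298] weights (alpha = 1/8, beta = 1/4). It therefore does not attempt the per-RTT adjustment that Appendix G suggests when several samples arrive per RTT.

```python
from typing import Optional

MASK32 = 0xFFFFFFFF   # timestamps live in a modular 32-bit space
ALPHA = 1 / 8         # SRTT gain from RFC 6298
BETA = 1 / 4          # RTTVAR gain from RFC 6298


class RttmEstimator:
    """Hypothetical holder for SRTT and RTTVAR, measured in clock ticks."""

    def __init__(self) -> None:
        self.srtt: Optional[float] = None
        self.rttvar: Optional[float] = None

    def on_tsecr(self, now_ticks: int, seg_tsecr: int,
                 advances_snd_una: bool) -> Optional[int]:
        """Apply the RTTM rule to one received segment.

        Returns the RTT sample that was used, or None if the segment did
        not advance SND.UNA and its TSecr was therefore not used.
        """
        if not advances_snd_una:
            return None
        # Modular subtraction keeps the sample correct across clock wrap.
        sample = (now_ticks - seg_tsecr) & MASK32
        if self.srtt is None:
            # First measurement: initialise as specified in RFC 6298.
            self.srtt = float(sample)
            self.rttvar = sample / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT value.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample
        return sample
```

In the one-way flow example of Section 4.1, data segment C would reach this function with `advances_snd_una` set to false, so the inflated value never enters the estimate.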
645 [Ludwig00] and [Floyd05] have highlighted the problem that an 646 unmodified RTO calculation, which is updated with per-packet RTT 647 samples, will truncate the path history too soon. This can lead to 648 an increase in spurious retransmissions, when the path properties 649 vary in the order of a few RTTs, but a high number of RTT samples are 650 taken on a much shorter timescale. 652 4.3. Which Timestamp to Echo 654 If more than one Timestamps option is received before a reply segment 655 is sent, the TCP must choose only one of the TSvals to echo, ignoring 656 the others. To minimize the state kept in the receiver (i.e., the 657 number of unprocessed TSvals), the receiver should be required to 658 retain at most one timestamp in the connection control block. 660 There are three situations to consider: 662 (A) Delayed ACKs. 664 Many TCPs acknowledge only every second segment out of a group 665 of segments arriving within a short time interval; this policy 666 is known generally as "delayed ACKs". The data-sender TCP must 667 measure the effective RTT, including the additional time due to 668 delayed ACKs, or else it will retransmit unnecessarily. Thus, 669 when delayed ACKs are in use, the receiver SHOULD reply with the 670 TSval field from the earliest unacknowledged segment. 672 (B) A hole in the sequence space (segment(s) have been lost). 674 The sender will continue sending until the window is filled, and 675 the receiver may be generating <ACK>s as these out-of-order 676 segments arrive (e.g., to aid "fast retransmit"). 678 The lost segment is probably a sign of congestion, and in that 679 situation the sender should be conservative about 680 retransmission. Furthermore, it is better to overestimate than 681 underestimate the RTT. An <ACK> for an out-of-order segment 682 SHOULD therefore contain the timestamp from the most recent 683 segment that advanced RCV.NXT. 685 The same situation occurs if segments are re-ordered by the 686 network.
688 (C) A filled hole in the sequence space. 690 The segment that fills the hole and advances the window 691 represents the most recent measurement of the network 692 characteristics. An RTT computed from an earlier segment would 693 probably include the sender's retransmit time-out, badly biasing 694 the sender's average RTT estimate. Thus, the timestamp from the 695 latest segment (which filled the hole) MUST be echoed. 697 An algorithm that covers all three cases is described in the 698 following rules for Timestamps option processing on a synchronized 699 connection: 701 (1) The connection state is augmented with two 32-bit slots: 703 TS.Recent holds a timestamp to be echoed in TSecr whenever a 704 segment is sent, and Last.ACK.sent holds the ACK field from the 705 last segment sent. Last.ACK.sent will equal RCV.NXT except when 706 <ACK>s have been delayed. 708 (2) If: 710 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 712 then SEG.TSval is copied to TS.Recent; otherwise, it is ignored. 714 (3) When a TSopt is sent, its TSecr field is set to the current 715 TS.Recent value. 717 The following examples illustrate these rules. Here A, B, C... 718 represent data segments occupying successive blocks of sequence 719 numbers, and ACK(A),... represent the corresponding acknowledgment 720 segments. Note that ACK(A) has the same sequence number as B. We 721 show only one direction of timestamp echoing, for clarity. 723 o Segments arrive in sequence, and some of the <ACK>s are delayed. 725 By case (A), the timestamp from the oldest unacknowledged segment 726 is echoed.
728                                              TS.Recent
729      <A, TSval=1> ------------------->
730                                                  1
731      <B, TSval=2> ------------------->
732                                                  1
733      <C, TSval=3> ------------------->
734                                                  1
735               <---- <ACK(C), TSecr=1>
736      (etc)
738 o Segments arrive out of order, and every segment is acknowledged. 740 By case (B), the timestamp from the last segment that advanced the 741 left window edge is echoed, until the missing segment arrives; it 742 is echoed according to Case (C).
The same sequence would occur if 743 segments B and D were lost and retransmitted.
745                                              TS.Recent
746      <A, TSval=1> ------------------->
747                                                  1
748               <---- <ACK(A), TSecr=1>
749                                                  1
750      <C, TSval=3> ------------------->
751                                                  1
752               <---- <ACK(A), TSecr=1>
753                                                  1
754      <B, TSval=2> ------------------->
755                                                  2
756               <---- <ACK(C), TSecr=2>
757                                                  2
758      <E, TSval=5> ------------------->
759                                                  2
760               <---- <ACK(C), TSecr=2>
761                                                  2
762      <D, TSval=4> ------------------->
763                                                  4
764               <---- <ACK(E), TSecr=4>
765      (etc)
767 5. PAWS - Protection Against Wrapped Sequence Numbers 769 5.1. Introduction 771 Another use for the Timestamps option is the mechanism to Protect 772 Against Wrapped Sequence numbers (PAWS). Section 5.2 describes a 773 simple mechanism to reject old duplicate segments that might corrupt 774 an open TCP connection. PAWS operates within a single TCP 775 connection, using state that is saved in the connection control 776 block. Section 5.8 and Appendix H discuss the implications of the 777 PAWS mechanism for avoiding old duplicates from previous incarnations 778 of the same connection. 780 5.2. The PAWS Mechanism 782 PAWS uses the TCP Timestamps option described earlier, and assumes 783 that every received TCP segment (including data and <ACK> segments) 784 contains a timestamp SEG.TSval whose values are monotonically non- 785 decreasing in time. The basic idea is that a segment can be 786 discarded as an old duplicate if it is received with a timestamp 787 SEG.TSval less than some timestamp recently received on this 788 connection. 790 In the PAWS mechanism, the "timestamps" are 32-bit unsigned integers 791 in a modular 32-bit space. Thus, "less than" is defined the same way 792 it is for TCP sequence numbers, and the same implementation 793 techniques apply. If s and t are timestamp values, 795 s < t if 0 < (t - s) < 2^31, 797 computed in unsigned 32-bit arithmetic. 799 The choice of incoming timestamps to be saved for this comparison 800 MUST guarantee a value that is monotonically non-decreasing.
For 801 example, an implementation might save the timestamp from the segment 802 that last advanced the left edge of the receive window, i.e., the 803 most recent in-sequence segment. For simplicity, the value TS.Recent 804 introduced in Section 4.3 is used instead, as using a common value 805 for both PAWS and RTTM simplifies the implementation. As Section 4.3 806 explained, TS.Recent differs from the timestamp from the last in- 807 sequence segment only in the case of delayed <ACK>s, and therefore by 808 less than one window. Either choice will therefore protect against 809 sequence number wrap-around. 811 PAWS submits all incoming segments to the same test, and therefore 812 protects against duplicate <ACK> segments as well as data segments. 813 (An alternative non-symmetric algorithm would protect against old 814 duplicate <ACK>s: the sender of data would reject incoming <ACK> 815 segments whose TSecr values were less than the TSecr saved from the 816 last segment whose ACK field advanced the left edge of the send 817 window. This algorithm was deemed to lack economy of mechanism and 818 symmetry.) 820 TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to 821 initialize PAWS. PAWS protects against old duplicate non-<SYN> 822 segments, and duplicate <SYN> segments received while there is a 823 synchronized connection. Duplicate <SYN> and <SYN,ACK> segments 824 received when there is no connection will be discarded by the normal 825 3-way handshake and sequence number checks of TCP. 827 [RFC1323] recommended that <RST> segments NOT carry timestamps, and 828 that they be acceptable regardless of their timestamp. At that time, 829 the thinking was that old duplicate <RST> segments should be 830 exceedingly unlikely, and their cleanup function should take 831 precedence over timestamps. More recently, discussions about various 832 blind attacks on TCP connections have raised the suggestion that if 833 the Timestamps option is present, SEG.TSecr could be used to provide 834 stricter acceptance tests for <RST> segments.
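The modular "less than" relation defined earlier in this section (s < t if 0 < (t - s) < 2^31, in unsigned 32-bit arithmetic) can be written down directly. The following Python sketch is illustrative only; the function name `ts_lt` is hypothetical, and the explicit mask stands in for the fixed-width unsigned arithmetic that a C implementation would get from a 32-bit type.

```python
MASK32 = 0xFFFFFFFF   # emulate unsigned 32-bit arithmetic
HALF_SPACE = 1 << 31  # 2^31, half of the timestamp space


def ts_lt(s: int, t: int) -> bool:
    """Return True if timestamp s is "less than" t in modular 32-bit space."""
    return 0 < ((t - s) & MASK32) < HALF_SPACE
```

Under this definition a value just before the wrap point compares as older than a small value just after it, so `ts_lt(0xFFFFFFF0, 5)` holds, while two values exactly 2^31 apart are not ordered in either direction.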
836 While still under discussion, to enable research into this area it is 837 now RECOMMENDED that, when generating an <RST>, if the segment 838 causing the <RST> to be generated contained a Timestamps option, 839 the <RST> also contain a Timestamps option. In the <RST> segment, 840 SEG.TSecr SHOULD be set to SEG.TSval from the incoming segment and 841 SEG.TSval SHOULD be set to zero. If an <RST> is being generated 842 because of a user abort, and Snd.TS.OK is set, then a Timestamps 843 option SHOULD be included in the <RST>. When an <RST> segment is 844 received, it MUST NOT be subjected to the PAWS check by verifying an 845 acceptable value in SEG.TSval, and information from the Timestamps 846 option MUST NOT be used to update connection state information. 847 SEG.TSecr MAY be used to provide stricter <RST> acceptance checks. 849 5.3. Basic PAWS Algorithm 851 If the PAWS algorithm is used, the following processing MUST be 852 performed on all incoming segments for a synchronized connection. 853 Also, PAWS processing MUST take precedence over the regular TCP 854 acceptability check (Section 3.3 in [RFC0793]), which is performed 855 after verification of the received Timestamps option: 857 R1) If there is a Timestamps option in the arriving segment, 858 SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion) 859 and the RST bit is not set, then treat the arriving segment as 860 not acceptable: 862 Send an acknowledgement in reply as specified in [RFC0793] 863 page 69 and drop the segment. 865 Note: it is necessary to send an <ACK> segment in order to 866 retain TCP's mechanisms for detecting and recovering from 867 half-open connections. For example, see Figure 10 of 868 [RFC0793]. 870 R2) If the segment is outside the window, reject it (normal TCP 871 processing). 873 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see 874 Section 4.3), then record its timestamp in TS.Recent. 876 R4) If an arriving segment is in-sequence (i.e., at the left window 877 edge), then accept it normally.
879 R5) Otherwise, treat the segment as a normal in-window, out-of- 880 sequence TCP segment (e.g., queue it for later delivery to the 881 user). 883 Steps R2, R4, and R5 are the normal TCP processing steps specified by 884 [RFC0793]. 886 It is important to note that the timestamp MUST be checked only when 887 a segment first arrives at the receiver, regardless of whether it is 888 in-sequence or it must be queued for later delivery. 890 Consider the following example. 892 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been 893 sent, where the letter indicates the sequence number and the digit 894 represents the timestamp. Suppose also that segment B.1 has been 895 lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ..., 896 Z.1 are considered acceptable and are queued. When B is 897 retransmitted as segment B.2 (using the latest timestamp), it 898 fills the hole and causes all the segments through Z to be 899 acknowledged and passed to the user. The timestamps of the queued 900 segments are *not* inspected again at this time, since they have 901 already been accepted. When B.2 is accepted, TS.Recent is set to 902 2. 904 This rule allows reasonable performance under loss. A full window of 905 data is in transit at all times, and after a loss a full window less 906 one segment will show up out-of-sequence to be queued at the receiver 907 (e.g., up to ~2^30 bytes of data); the Timestamps option must not 908 result in discarding this data. 910 In certain unlikely circumstances, the algorithm of rules R1-R5 could 911 lead to discarding some segments unnecessarily, as shown in the 912 following example: 914 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been 915 sent in sequence and that segment B.1 has been lost. Furthermore, 916 suppose delivery of some of C.1, ... Z.1 is delayed until *after* 917 the retransmission B.2 arrives at the receiver.
These delayed 918 segments will be discarded unnecessarily when they do arrive, 919 since their timestamps are now out of date. 921 This case is very unlikely to occur. If the retransmission was 922 triggered by a timeout, some of the segments C.1, ... Z.1 must have 923 been delayed longer than the RTO time. This is presumably an 924 unlikely event, or there would be many spurious timeouts and 925 retransmissions. If B's retransmission was triggered by the "fast 926 retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued 927 segments that caused these <ACK>s must have been received already. 929 Even if a segment were delayed past the RTO, the Fast Retransmit 930 mechanism [Jacobson90c] will cause the delayed segments to be 931 retransmitted at the same time as B.2, avoiding an extra RTT and 932 therefore causing a very small performance penalty. 934 We know of no case with a significant probability of occurrence in 935 which timestamps will cause performance degradation by unnecessarily 936 discarding segments. 938 5.4. Timestamp Clock 940 It is important to understand that the PAWS algorithm does not 941 require clock synchronization between sender and receiver. The 942 sender's timestamp clock is used as a source of monotonic non- 943 decreasing values to stamp the segments. The receiver treats the 944 timestamp value as simply a monotonically non-decreasing serial 945 number, without any connection to time. From the receiver's 946 viewpoint, the timestamp is acting as a logical extension of the 947 high-order bits of the sequence number. 949 The receiver algorithm does place some requirements on the frequency 950 of the timestamp clock. 952 (a) The timestamp clock must not be "too slow". 954 It MUST tick at least once for each 2^31 bytes sent.
In fact, 955 in order to be useful to the sender for round trip timing, the 956 clock SHOULD tick at least once per window's worth of data, and 957 even with the window extension defined in Section 2.2, 2^31 958 bytes must be at least two windows. 960 To make this more quantitative, any clock faster than 1 tick/sec 961 will reject old duplicate segments for link speeds of ~8 Gbps. 962 A 1 ms timestamp clock will work at link speeds up to 8 Tbps 963 (8*10^12 bps)! 965 (b) The timestamp clock must not be "too fast". 967 The recycling time of the timestamp clock MUST be greater than 968 MSL seconds. Since the clock (timestamp) is 32 bits and the 969 worst-case MSL is 255 seconds, the maximum acceptable clock 970 frequency is one tick every 59 ns. 972 However, it is desirable to establish a much longer recycle 973 period, in order to handle outdated timestamps on idle 974 connections (see Section 5.5), and to relax the MSL requirement 975 for preventing sequence number wrap-around. With a 1 ms 976 timestamp clock, the 32-bit timestamp will wrap its sign bit in 977 24.8 days. Thus, it will reject old duplicates on the same 978 connection if MSL is 24.8 days or less. This appears to be a 979 very safe figure; an MSL of 24.8 days or longer can probably be 980 assumed in the Internet without requiring precise MSL 981 enforcement. 983 Based upon these considerations, we choose a timestamp clock 984 frequency in the range 1 ms to 1 sec per tick. This range also 985 matches the requirements of the RTTM mechanism, which does not need 986 much more resolution than the granularity of the retransmit timer, 987 e.g., tens or hundreds of milliseconds. 989 The PAWS mechanism also puts a strong monotonicity requirement on the 990 sender's timestamp clock. The method of implementation of the 991 timestamp clock to meet this requirement depends upon the system 992 hardware and software.
994 o Some hosts have a hardware clock that is guaranteed to be 995 monotonic between hardware resets. 997 o A clock interrupt may be used to simply increment a binary integer 998 by 1 periodically. 1000 o The timestamp clock may be derived from a system clock that is 1001 subject to being abruptly changed, by adding a variable offset 1002 value. This offset is initialized to zero. When a new timestamp 1003 clock value is needed, the offset can be adjusted as necessary to 1004 make the new value equal to or larger than the previous value 1005 (which was saved for this purpose). 1007 o A random offset may be added to the timestamp clock on a per- 1008 connection basis. See [RFC6528], section 3, on randomizing the 1009 initial sequence number (ISN). The same function with a different 1010 secret key can be used to generate the per-connection timestamp 1011 offset. 1013 5.5. Outdated Timestamps 1015 If a connection remains idle long enough for the timestamp clock of 1016 the other TCP to wrap its sign bit, then the value saved in TS.Recent 1017 will become too old; as a result, the PAWS mechanism will cause all 1018 subsequent segments to be rejected, freezing the connection (until 1019 the timestamp clock wraps its sign bit again). 1021 With the chosen range of timestamp clock frequencies (1 sec to 1 ms), 1022 the time to wrap the sign bit will be between 24.8 days and 24800 1023 days. A TCP connection that is idle for more than 24 days and then 1024 comes to life is exceedingly unusual. However, it is undesirable in 1025 principle to place any limitation on TCP connection lifetimes. 1027 We therefore require that an implementation of PAWS include a 1028 mechanism to "invalidate" the TS.Recent value when a connection is 1029 idle for more than 24 days. (An alternative solution to the problem 1030 of outdated timestamps would be to send keep-alive segments at a very 1031 low rate, but still more often than the wrap-around time for 1032 timestamps, e.g., once a day.
This would impose negligible overhead. 1033 However, the TCP specification has never included keep-alives, so the 1034 solution based upon invalidation was chosen.) 1036 Note that a TCP does not know the frequency, and therefore, the 1037 wraparound time, of the other TCP, so it must assume the worst. The 1038 validity of TS.Recent needs to be checked only if the basic PAWS 1039 timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If 1040 TS.Recent is found to be invalid, then the segment is accepted, 1041 regardless of the failure of the timestamp check, and rule R3 updates 1042 TS.Recent with the TSval from the new segment. 1044 To detect how long the connection has been idle, the TCP MAY update a 1045 clock or timestamp value associated with the connection whenever 1046 TS.Recent is updated, for example. The details will be 1047 implementation-dependent. 1049 5.6. Header Prediction 1051 "Header prediction" [Jacobson90a] is a high-performance transport 1052 protocol implementation technique that is most important for high- 1053 speed links. This technique optimizes the code for the most common 1054 case, receiving a segment correctly and in order. Using header 1055 prediction, the receiver asks the question, "Is this segment the next 1056 in sequence?" This question can be answered in fewer machine 1057 instructions than the question, "Is this segment within the window?" 1059 Adding header prediction to our timestamp procedure leads to the 1060 following recommended sequence for processing an arriving TCP 1061 segment: 1063 H1) Check timestamp (same as step R1 above) 1065 H2) Do header prediction: if segment is next in sequence and if 1066 there are no special conditions requiring additional processing, 1067 accept the segment, record its timestamp, and skip H3. 1069 H3) Process the segment normally, as specified in RFC 793. 
This 1070 includes dropping segments that are outside the window and 1071 possibly sending acknowledgments, and queuing in-window, out-of- 1072 sequence segments. 1074 Another possibility would be to interchange steps H1 and H2, i.e., to 1075 perform the header prediction step H2 *first*, and perform H1 and H3 1076 only when header prediction fails. This could be a performance 1077 improvement, since the timestamp check in step H1 is very unlikely to 1078 fail, and it requires unsigned modulo arithmetic. To perform this 1079 check on every single segment is contrary to the philosophy of header 1080 prediction. We believe that this change might produce a measurable 1081 reduction in CPU time for TCP protocol processing on high-speed 1082 networks. 1084 However, putting H2 first would create a hazard: a segment from 2^32 1085 bytes in the past might arrive at exactly the wrong time and be 1086 accepted mistakenly by the header-prediction step. The following 1087 reasoning has been introduced in [RFC1185] to show that the 1088 probability of this failure is negligible. 1090 If all segments are equally likely to show up as old duplicates, 1091 then the probability of an old duplicate exactly matching the left 1092 window edge is the maximum segment size (MSS) divided by the size 1093 of the sequence space. This ratio must be less than 2^-16, since 1094 MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20 1095 for a 100 Mbit/s link. However, the older a segment is, the less 1096 likely it is to be retained in the Internet, and under any 1097 reasonable model of segment lifetime the probability of an old 1098 duplicate exactly at the left window edge must be much smaller 1099 than 2^-16. 1101 The 16 bit TCP checksum also allows a basic unreliability of one 1102 part in 2^16. 
A protocol mechanism whose reliability exceeds the 1103 reliability of the TCP checksum should be considered "good 1104 enough", i.e., it won't contribute significantly to the overall 1105 error rate. We therefore believe we can ignore the problem of an 1106 old duplicate being accepted by doing header prediction before 1107 checking the timestamp. 1109 However, this probabilistic argument is not universally accepted, and 1110 the consensus at present is that the performance gain does not 1111 justify the hazard in the general case. It is therefore recommended 1112 that H2 follow H1. 1114 5.7. IP Fragmentation 1116 At high data rates, the protection against old segments provided by 1117 PAWS can be circumvented by errors in IP fragment reassembly (see 1118 [RFC4963]). The only way to protect against incorrect IP fragment 1119 reassembly is to not allow the segments to be fragmented. This is 1120 done by setting the Don't Fragment (DF) bit in the IP header. 1121 Setting the DF bit implies the use of Path MTU Discovery as described 1122 in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation 1123 that implements PAWS MUST also implement Path MTU Discovery. 1125 5.8. Duplicates from Earlier Incarnations of Connection 1127 The PAWS mechanism protects against errors due to sequence number 1128 wrap-around on high-speed connections. Segments from an earlier 1129 incarnation of the same connection are also a potential cause of old 1130 duplicate errors. In both cases, the TCP mechanisms to prevent such 1131 errors depend upon the enforcement of a maximum segment lifetime 1132 (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a 1133 detailed discussion). Unlike the case of sequence space wrap-around, 1134 the MSL required to prevent old duplicate errors from earlier 1135 incarnations does not depend upon the transfer rate. 
If the IP layer 1136 enforces the recommended 2 minute MSL of TCP, and if the TCP rules 1137 are followed, TCP connections will be safe from earlier incarnations, 1138 no matter how high the network speed. Thus, the PAWS mechanism is 1139 not required for this case. 1141 We may still ask whether the PAWS mechanism can provide additional 1142 security against old duplicates from earlier connections, allowing us 1143 to relax the enforcement of MSL by the IP layer. Appendix B explores 1144 this question, showing that further assumptions and/or mechanisms are 1145 required, beyond those of PAWS. This is not part of the current 1146 extension. 1148 6. Conclusions and Acknowledgements 1150 This memo presented a set of extensions to TCP to provide efficient 1151 operation over large bandwidth * delay product paths and reliable 1152 operation over very high-speed paths. These extensions are designed 1153 to provide compatible interworking with TCP stacks that do not 1154 implement the extensions. 1156 These mechanisms are implemented using TCP options for scaled windows 1157 and timestamps. The timestamps are used for two distinct mechanisms: 1158 RTTM (Round Trip Time Measurement) and PAWS (Protection Against 1159 Wrapped Sequences). 1161 The Window Scale option was originally suggested by Mike St. Johns of 1162 USAF/DCA. The present form of the option was suggested by Mike 1163 Karels of UC Berkeley in response to a more cumbersome scheme defined 1164 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1165 description in [RFC1185]. 1167 Finally, much of this work originated as the result of discussions 1168 within the End-to-End Task Force on the theoretical limitations of 1169 transport protocols in general and TCP in particular. Task force 1170 members and others on the end2end-interest list have made valuable 1171 contributions by pointing out flaws in the algorithms and the 1172 documentation.
Continued discussion and development since the 1173 publication of [RFC1323] originally occurred in the IETF TCP Large 1174 Windows Working Group, later on in the End-to-End Task Force, and 1175 most recently in the IETF TCP Maintenance Working Group. The authors 1176 are grateful for all these contributions. 1178 7. Security Considerations 1180 The TCP sequence space is a fixed size, and as the window becomes 1181 larger it becomes easier for an attacker to generate forged packets 1182 that can fall within the TCP window, and be accepted as valid 1183 segments. While use of timestamps and PAWS can help to mitigate 1184 this, when using PAWS, if an attacker is able to forge a packet that 1185 is acceptable to the TCP connection, a timestamp that is in the 1186 future would cause valid segments to be dropped due to PAWS checks. 1187 Hence, implementers should take care to not open the TCP window 1188 drastically beyond the requirements of the connection. 1190 A naive implementation that derives the timestamp clock value 1191 directly from a system uptime clock may unintentionally leak this 1192 information to an attacker. This does not directly compromise any of 1193 the mechanisms described in this document. However, this may be 1194 valuable information to a potential attacker. An implementer should 1195 evaluate the potential impact and mitigate this accordingly (i.e. by 1196 using a random offset for the timestamp clock on each connection, or 1197 using an external, real-time derived timestamp clock source). 1199 Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms 1200 [RFC2675] to be used when the local network supports packets larger 1201 than 64 KiB. When larger TCP segments are used, the TCP checksum 1202 becomes weaker. 1204 Mechanisms to protect the TCP header from modification should also 1205 protect the TCP options. 
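The random per-connection timestamp offset mentioned above as a mitigation for uptime leakage could be derived along the following lines. This is a minimal Python sketch under stated assumptions: the function names `timestamp_offset` and `tsval_for` are hypothetical, and HMAC-SHA-256 over the connection 4-tuple is used here as an illustrative keyed function. It is a stand-in, not the function that [RFC6528] itself specifies for initial sequence numbers.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def timestamp_offset(secret: bytes, local_ip: str, local_port: int,
                     remote_ip: str, remote_port: int) -> int:
    """Derive a 32-bit offset that is stable for one connection 4-tuple.

    The secret is assumed to be distinct from any key used for ISN
    generation, as the text on timestamp clocks suggests.
    """
    msg = f"{local_ip}|{local_port}|{remote_ip}|{remote_port}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    # Keep only the first four bytes to fit the 32-bit timestamp space.
    return int.from_bytes(digest[:4], "big")


def tsval_for(clock_ticks: int, offset: int) -> int:
    """Apply the per-connection offset to the raw timestamp clock value."""
    return (clock_ticks + offset) & MASK32
```

Because the offset is constant for the lifetime of a connection, the resulting TSval sequence remains monotonically non-decreasing in modular arithmetic, which is the property PAWS depends on.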
1207 Middleboxes and TCP options: 1209 Some middleboxes have been known to remove the TCP options 1210 described in this document from TCP segments [Honda11]. 1211 Middleboxes that remove TCP options described in this document 1212 from the <SYN> segment interfere with the selection of parameters 1213 appropriate for the session. Removing any of these options in a 1214 <SYN,ACK> segment will leave the end hosts in a state that 1215 destroys the proper operation of the protocol. 1217 * If a Window Scale option is removed from a <SYN,ACK> segment, 1218 the end hosts will not negotiate the window scaling factor 1219 correctly. Middleboxes must not remove or modify the Window 1220 Scale option from <SYN,ACK> segments. 1222 * If a stateful firewall uses the window field to detect whether 1223 a received segment is inside the current window, and does not 1224 support the Window Scale option, it will not be able to 1225 correctly determine whether or not a packet is in the window. 1226 These middleboxes must also support the Window Scale option 1227 and apply the scale factor when processing segments. If the 1228 window scale factor cannot be determined, it must not do window 1229 based processing. 1231 * If the Timestamps option is removed from the <SYN> or <SYN,ACK> 1232 segment, high speed connections that need PAWS would not have 1233 that protection. Successful negotiation of Timestamps option 1234 enforces a stricter verification of incoming segments at the 1235 receiver. If the Timestamps option was removed from a 1236 subsequent data segment after a successful negotiation (e.g. as 1237 part of re-segmentation), the segment is discarded by the 1238 receiver without further processing. Middleboxes should not 1239 remove the Timestamps option.
1241 * It must be noted that [RFC1323] doesn't address the case of the 1242 Timestamps option being dropped or selectively omitted after 1243 being negotiated, and that the update in this document may 1244 cause some broken middlebox behavior to be detected 1245 (potentially unresponsive TCP sessions). 1247 Implementations that depend on PAWS could provide a mechanism for the 1248 application to determine whether or not PAWS is in use on the 1249 connection, and choose to terminate the connection if that protection 1250 doesn't exist. This is not just to protect the connection against 1251 middleboxes that might remove the Timestamps option, but also against 1252 remote hosts that do not have Timestamp support. 1254 7.1. Privacy Considerations 1256 The TCP options described in this document do not expose individual 1257 users' data. However, a naive implementation simply using the system 1258 clock as source for the Timestamps option will reveal characteristics 1259 of the TCP potentially allowing more targeted attacks. It is 1260 therefore RECOMMENDED to generate a random, per-connection offset to 1261 be used with the clock source when generating the Timestamps option 1262 value (see Section 5.4). 1264 Furthermore, the combination, relative ordering and padding of the 1265 TCP options described in Section 2.2 and Section 3.2 will reveal 1266 additional clues to allow the fingerprinting of the system. 1268 8. IANA Considerations 1270 This document has no actions for IANA. The described TCP options are 1271 well known from the superseded [RFC1323]. 1273 9. References 1275 9.1. Normative References 1277 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1278 RFC 793, September 1981. 1280 [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 1281 November 1990. 1283 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1284 Requirement Levels", BCP 14, RFC 2119, March 1997. 1286 9.2.
Informative References 1288 [Allman99] 1289 Allman, M. and V. Paxson, "On Estimating End-to-End 1290 Network Path Properties", Proc. ACM SIGCOMM Technical 1291 Symposium, Cambridge, MA, September 1999, 1292 . 1294 [Ekstroem04] 1295 Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New End- 1296 to-End Retransmission Timer for Reliable Unicast 1297 Transport", INFOCOM 2004 IEEE, March 2004, . 1301 [Floyd05] Floyd, S., "[tcpm] How the RTO should be estimated with 1302 timestamps", Message from 26.Jan.2007 to the tcpm mailing 1303 list, August 2005, . 1306 [Garlick77] 1307 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable 1308 Host-to-Host Protocols", Proc. Second Berkeley Workshop on 1309 Distributed Data Management and Computer Networks, 1310 May 1977, . 1312 [Hamming77] 1313 Hamming, R., "Digital Filters", Prentice Hall, Englewood 1314 Cliffs, N.J. ISBN 0-13-212571-4, 1977. 1316 [Honda11] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., 1317 Handley, M., and H. Tokuda, "Is it still possible to 1318 extend TCP?", Proc. of ACM Internet Measurement 1319 Conference (IMC) '11, November 2011. 1321 [Jacobson88a] 1322 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1323 '88, Stanford, CA., August 1988, 1324 . 1326 [Jacobson90a] 1327 Jacobson, V., "4BSD Header Prediction", ACM Computer 1328 Communication Review, April 1990. 1330 [Jacobson90c] 1331 Jacobson, V., "Modified TCP congestion avoidance 1332 algorithm", Message to the end2end-interest mailing list, 1333 April 1990, 1334 . 1336 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1337 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and 1338 Comm., Scottsdale, Arizona, March 1986, 1339 . 1341 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in 1342 Reliable Transport Protocols", Proc. SIGCOMM '87, 1343 August 1987. 1345 [Kuehlewind10] 1346 Kuehlewind, M. and B. Briscoe, "Chirping for Congestion 1347 Control - Implementation Feasibility", November 2010, 1348 . 
1350 [Kuzmanovic03] 1351 Kuzmanovic, A. and E. Knightly, "TCP-LP: Low-Priority 1352 Service via End-Point Congestion Control", 2003, 1353 . 1355 [Ludwig00] 1356 Ludwig, R. and K. Sklower, "The Eifel Retransmission 1357 Timer", ACM SIGCOMM Computer Communication Review Volume 1358 30 Issue 3, July 2000, . 1361 [Martin03] 1362 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg 1363 mailing list, September 2003, . 1366 [Mathis08] 1367 Mathis, M., "[tcpm] Example of 1323 window retraction 1368 problem", Message to the tcpm mailing list, March 2008, . 1372 [Medina04] 1373 Medina, A., Allman, M., and S. Floyd, "Measuring 1374 Interactions Between Transport Protocols and Middleboxes", 1375 Proc. ACM SIGCOMM/USENIX Internet Measurement Conference. 1376 October 2004, August 2004, 1377 . 1379 [Medina05] 1380 Medina, A., Allman, M., and S. Floyd, "Measuring the 1381 Evolution of Transport Protocols in the Internet", ACM 1382 Computer Communication Review 35(2), April 2005, 1383 . 1385 [Oppermann13] 1386 Oppermann, A., "[tcpm] Explanation to the relaxation of 1387 TSopt acceptance rules", Message to the tcpm mailing list, 1388 Jun 2013, . 1391 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1392 RFC 896, January 1984. 1394 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay 1395 paths", RFC 1072, October 1988. 1397 [RFC1110] McKenzie, A., "Problem with the TCP big window option", 1398 RFC 1110, August 1989. 1400 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1401 Communication Layers", STD 3, RFC 1122, October 1989. 1403 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for 1404 High-Speed Paths", RFC 1185, October 1990. 1406 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 1407 for High Performance", RFC 1323, May 1992. 1409 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 1410 for IP version 6", RFC 1981, August 1996. 
1412 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1413 Selective Acknowledgment Options", RFC 2018, October 1996. 1415 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1416 Control", RFC 2581, April 1999. 1418 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1419 RFC 2675, August 1999. 1421 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1422 Extension to the Selective Acknowledgement (SACK) Option 1423 for TCP", RFC 2883, July 2000. 1425 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm 1426 for TCP", RFC 3522, April 2003. 1428 [RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm 1429 for TCP", RFC 4015, February 2005. 1431 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1432 Discovery", RFC 4821, March 2007. 1434 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1435 Errors at High Data Rates", RFC 4963, July 2007. 1437 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1438 Control", RFC 5681, September 2009. 1440 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1441 "Computing TCP's Retransmission Timer", RFC 6298, 1442 June 2011. 1444 [RFC6528] Gont, F. and S. Bellovin, "Defending against Sequence 1445 Number Attacks", RFC 6528, February 2012. 1447 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1448 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1449 Based on Selective Acknowledgment (SACK) for TCP", 1450 RFC 6675, August 2012. 1452 [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 1453 RFC 6691, July 2012. 1455 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 1456 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 1457 December 2012. 1459 [Watson81] 1460 Watson, R., "Timer-based Mechanisms in Reliable Transport 1461 Protocol Connection Management", Computer Networks, Vol. 1462 5, 1981. 
1464 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM 1465 '86, Stowe, VT, August 1986. 1467 Appendix A. Implementation Suggestions 1469 TCP Option Layout 1471 The following layout is recommended for sending options on 1472 non-<SYN> segments, to achieve maximum feasible alignment of 32-bit 1473 and 64-bit machines. 1475 +--------+--------+--------+--------+ 1476 | NOP | NOP | TSopt | 10 | 1477 +--------+--------+--------+--------+ 1478 | TSval timestamp | 1479 +--------+--------+--------+--------+ 1480 | TSecr timestamp | 1481 +--------+--------+--------+--------+ 1483 Interaction with the TCP Urgent Pointer 1485 The TCP Urgent Pointer, like the TCP window, is a 16-bit value. 1486 Some of the original discussion for the TCP Window Scale option 1487 included proposals to increase the Urgent Pointer to 32 bits. As 1488 it turns out, this is unnecessary. There are two observations 1489 that should be made: 1491 (1) With IP Version 4, the largest amount of TCP data that can be 1492 sent in a single packet is 65495 bytes (64 KiB - 1 - size of 1493 fixed IP and TCP headers). 1495 (2) Updates to the Urgent Pointer while the user is in "urgent 1496 mode" are invisible to the user. 1498 This means that if the Urgent Pointer points beyond the end of the 1499 TCP data in the current segment, then the user will remain in 1500 urgent mode until the next TCP segment arrives. That segment will 1501 update the Urgent Pointer to a new offset, and the user will never 1502 have left urgent mode. 1504 Thus, to properly implement the Urgent Pointer, the sending TCP 1505 only has to check for overflow of the 16-bit Urgent Pointer field 1506 before filling it in. If it does overflow, then a value of 65535 1507 should be inserted into the Urgent Pointer. 1509 The same technique applies to IP Version 6, except in the case of 1510 IPv6 Jumbograms.
When IPv6 Jumbograms are supported, [RFC2675] 1511 requires additional steps for dealing with the Urgent Pointer; 1512 these are described in Section 5.2 of [RFC2675]. 1514 Appendix B. Duplicates from Earlier Connection Incarnations 1516 There are two cases to be considered: (1) a system crashing (and 1517 losing connection state) and restarting, and (2) the same connection 1518 being closed and reopened without a loss of host state. These will 1519 be described in the following two sections. 1521 B.1. System Crash with Loss of State 1523 TCP's quiet time of one MSL upon system startup handles the loss of 1524 connection state in a system crash/restart. For an explanation, see 1525 for example "When to Keep Quiet" in the TCP protocol specification 1526 [RFC0793]. The MSL that is required here does not depend upon the 1527 transfer speed. The current TCP MSL of 2 minutes seemed acceptable 1528 as an operational compromise when many host systems used to take 1529 this long to boot after a crash. Current host systems can boot 1530 considerably faster. 1532 The Timestamps option may be used to ease the MSL requirements (or to 1533 provide additional security against data corruption). If timestamps 1534 are being used and if the timestamp clock can be guaranteed to be 1535 monotonic over a system crash/restart, i.e., if the first value of 1536 the sender's timestamp clock after a crash/restart can be guaranteed 1537 to be greater than the last value before the restart, then a quiet 1538 time is unnecessary. 1540 To dispense totally with the quiet time would require that the host 1541 clock be synchronized to a time source that is stable over the crash/ 1542 restart period, with an accuracy of one timestamp clock tick or 1543 better. We can back off from this strict requirement to take 1544 advantage of approximate clock synchronization.
Suppose that the 1545 clock is always re-synchronized to within N timestamp clock ticks and 1546 that booting (extended with a quiet time, if necessary) takes more 1547 than N ticks. This will guarantee monotonicity of the timestamps, 1548 which can then be used to reject old duplicates even without an 1549 enforced MSL. 1551 B.2. Closing and Reopening a Connection 1553 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state 1554 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). 1555 Applications built upon TCP that close one connection and open a new 1556 one (e.g., an FTP data transfer connection using Stream mode) must 1557 choose a new socket pair each time. The TIME-WAIT delay serves two 1558 different purposes: 1560 (a) Implement the full-duplex reliable close handshake of TCP. 1562 The proper time to delay the final close step is not really 1563 related to the MSL; it depends instead upon the RTO for the FIN 1564 segments and therefore upon the RTT of the path. (It could be 1565 argued that the side that is sending a FIN knows what degree of 1566 reliability it needs, and therefore it should be able to 1567 determine the length of the TIME-WAIT delay for the FIN's 1568 recipient. This could be accomplished with an appropriate TCP 1569 option in FIN segments.) 1571 Although there is no formal upper bound on RTT, common network 1572 engineering practice makes an RTT greater than 1 minute very 1573 unlikely. Thus, the 4 minute delay in TIME-WAIT state works 1574 satisfactorily to provide a reliable full-duplex TCP close. 1575 Note again that this is independent of MSL enforcement and 1576 network speed. 1578 The TIME-WAIT state could cause an indirect performance problem 1579 if an application needed to repeatedly close one connection and 1580 open another at a very high frequency, since the number of 1581 available TCP ports on a host is less than 2^16.
However, high 1582 network speeds are not the major contributor to this problem; 1583 the RTT is the limiting factor in how quickly connections can be 1584 opened and closed. Therefore, this problem will be no worse at 1585 high transfer speeds. 1587 (b) Allow old duplicate segments to expire. 1589 To replace this function of TIME-WAIT state, a mechanism would 1590 have to operate across connections. PAWS is defined strictly 1591 within a single connection; the last timestamp (TS.Recent) is 1592 kept in the connection control block, and discarded when a 1593 connection is closed. 1595 An additional mechanism could be added to the TCP, a per-host 1596 cache of the last timestamp received from any connection. This 1597 value could then be used in the PAWS mechanism to reject old 1598 duplicate segments from earlier incarnations of the connection, 1599 if the timestamp clock can be guaranteed to have ticked at least 1600 once since the old connection was open. This would require that 1601 the TIME-WAIT delay plus the RTT together must be at least one 1602 tick of the sender's timestamp clock. Such an extension is not 1603 part of the proposal of this RFC. 1605 Note that this is a variant on the mechanism proposed by 1606 Garlick, Rom, and Postel [Garlick77], which required each host 1607 to maintain connection records containing the highest sequence 1608 numbers on every connection. Using timestamps instead, it is 1609 only necessary to keep one quantity per remote host, regardless 1610 of the number of simultaneous connections to that host. 1612 Appendix C. Summary of Notation 1614 The following notation has been used in this document. 
1616 Options 1618 WSopt: TCP Window Scale option 1619 TSopt: TCP Timestamps option 1621 Option Fields 1623 shift.cnt: Window scale byte in WSopt 1624 TSval: 32-bit Timestamp Value field in TSopt 1625 TSecr: 32-bit Timestamp Reply field in TSopt 1627 Option Fields in Current Segment 1629 SEG.TSval: TSval field from TSopt in current segment 1630 SEG.TSecr: TSecr field from TSopt in current segment 1631 SEG.WSopt: 8-bit value in WSopt 1633 Clock Values 1635 my.TSclock: System-wide source of 32-bit timestamp values 1636 my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec) 1637 Snd.TSoffset: An offset for randomizing Snd.TSclock 1638 Snd.TSclock: my.TSclock + Snd.TSoffset 1640 Per-Connection State Variables 1642 TS.Recent: Latest received Timestamp 1643 Last.ACK.sent: Last ACK field sent 1644 Snd.TS.OK: 1-bit flag 1645 Snd.WS.OK: 1-bit flag 1646 Rcv.Wind.Shift: Receive window scale exponent 1647 Snd.Wind.Shift: Send window scale exponent 1648 Start.Time: Snd.TSclock value when segment being timed was 1649 sent (used by pre-1323 code). 1651 Procedure 1653 Update_SRTT(m) Procedure to update the smoothed RTT and RTT 1654 variance estimates, using the rules of 1655 [Jacobson88a], given m, a new RTT measurement 1657 Appendix D. Event Processing Summary 1659 OPEN Call 1661 ... 1663 An initial send sequence number (ISS) is selected. Send a 1664 <SYN> segment of the form: 1666 <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift> 1668 ... 1670 SEND Call 1672 CLOSED STATE (i.e., TCB does not exist) 1674 ... 1676 LISTEN STATE 1678 If the foreign socket is specified, then change the connection 1679 from passive to active, select an ISS. Send a <SYN> segment 1680 containing the options: <TSval=Snd.TSclock> and 1681 <WSopt=Rcv.Wind.Shift>. Set SND.UNA to ISS, SND.NXT to ISS+1. 1682 Enter SYN-SENT state. ... 1684 SYN-SENT STATE 1685 SYN-RECEIVED STATE 1687 ... 
1689 ESTABLISHED STATE 1690 CLOSE-WAIT STATE 1692 Segmentize the buffer and send it with a piggybacked 1693 acknowledgment (acknowledgment value = RCV.NXT). ... 1695 If the urgent flag is set ...
1697 If the Snd.TS.OK flag is set, then include the TCP Timestamps 1698 option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data 1699 segment. 1701 Scale the receive window for transmission in the segment 1702 header: 1704 SEG.WND = (RCV.WND >> Rcv.Wind.Shift). 1706 SEGMENT ARRIVES 1708 ... 1710 If the state is LISTEN then 1712 first check for an RST 1714 ... 1716 second check for an ACK 1718 ... 1720 third check for a SYN 1722 if the SYN bit is set, check the security. If the ... 1724 ... 1726 if the SEG.PRC is less than the TCB.PRC then continue. 1728 Check for a Window Scale option (WSopt); if one is found, 1729 save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on. 1730 Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to 1731 zero and clear Snd.WS.OK flag. 1733 Check for a TSopt option; if one is found, save SEG.TSval in 1734 the variable TS.Recent and turn on the Snd.TS.OK bit. 1736 Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any 1737 other control or text should be queued for processing later. 1738 ISS should be selected and a segment sent of the form: 1740 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1742 If the Snd.WS.OK bit is on, include a WSopt option 1743 <WSopt=Rcv.Wind.Shift> in this segment. If the Snd.TS.OK 1744 bit is on, include a TSopt <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. Last.ACK.sent is set to 1746 RCV.NXT. 1748 SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection 1749 state should be changed to SYN-RECEIVED. Note that any 1750 other incoming control or data (combined with SYN) will be 1751 processed in the SYN-RECEIVED state, but processing of SYN 1752 and ACK should not be repeated. If the listen was not fully 1753 specified (i.e., the foreign socket was not fully 1754 specified), then the unspecified fields should be filled in 1755 now. 1757 fourth other text or control 1759 ... 1761 If the state is SYN-SENT then 1763 first check the ACK bit 1765 ... 1767 ... 1769 fourth check the SYN bit 1771 ...
1773 If the SYN bit is on and the security/compartment and 1774 precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1, 1775 IRS is set to SEG.SEQ, and any segments on the 1776 retransmission queue which are thereby acknowledged should 1777 be removed. 1779 Check for a Window Scale option (WSopt); if it is found, 1780 save SEG.WSopt in Snd.Wind.Shift; otherwise, set both 1781 Snd.Wind.Shift and Rcv.Wind.Shift to zero. 1783 Check for a TSopt option; if one is found, save SEG.TSval in 1784 variable TS.Recent and turn on the Snd.TS.OK bit in the 1785 connection control block. If the ACK bit is set, use 1786 Snd.TSclock - SEG.TSecr as the initial RTT estimate. 1788 If SND.UNA > ISS (our <SYN> has been ACKed), change the 1789 connection state to ESTABLISHED, form an <ACK> segment: 1791 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1793 and send it. If the Snd.TS.OK bit is on, include a TSopt 1794 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> 1795 segment. Last.ACK.sent is set to RCV.NXT. 1797 Data or controls which were queued for transmission may be 1798 included. If there are other controls or text in the 1799 segment then continue processing at the sixth step below 1800 where the URG bit is checked, otherwise return. 1802 Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment: 1804 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1806 and send it. If the Snd.TS.OK bit is on, include a TSopt 1807 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. 1808 If the Snd.WS.OK bit is on, include a WSopt option 1809 <WSopt=Rcv.Wind.Shift> in this segment. Last.ACK.sent is 1810 set to RCV.NXT. 1812 If there are other controls or text in the segment, queue 1813 them for processing after the ESTABLISHED state has been 1814 reached, return. 1816 fifth, if neither of the SYN or RST bits is set then drop the 1817 segment and return. 
1819 Otherwise, 1821 First, check sequence number 1823 SYN-RECEIVED STATE 1824 ESTABLISHED STATE 1825 FIN-WAIT-1 STATE 1826 FIN-WAIT-2 STATE 1827 CLOSE-WAIT STATE 1828 CLOSING STATE 1829 LAST-ACK STATE 1830 TIME-WAIT STATE 1832 Segments are processed in sequence.
Initial tests on 1833 arrival are used to discard old duplicates, but further 1834 processing is done in SEG.SEQ order. If a segment's 1835 contents straddle the boundary between old and new, only the 1836 new parts should be processed. 1838 Rescale the received window field: 1840 TrueWindow = SEG.WND << Snd.Wind.Shift, 1842 and use "TrueWindow" in place of SEG.WND in the following 1843 steps. 1845 Check whether the segment contains a Timestamps option and 1846 bit Snd.TS.OK is on. If so: 1848 If SEG.TSval < TS.Recent and the RST bit is off, then 1849 test whether connection has been idle less than 24 days; 1850 if all are true, then the segment is not acceptable; 1851 follow steps below for an unacceptable segment. 1853 If SEG.SEQ is less than or equal to Last.ACK.sent, then 1854 save SEG.TSval in variable TS.Recent. 1856 There are four cases for the acceptability test for an 1857 incoming segment: 1859 ... 1861 If an incoming segment is not acceptable, an acknowledgment 1862 should be sent in reply (unless the RST bit is set, if so 1863 drop the segment and return): 1865 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1867 Last.ACK.sent is set to SEG.ACK of the acknowledgment. If 1868 the Snd.TS.OK bit is on, include the Timestamps option 1869 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment. 1870 Set Last.ACK.sent to SEG.ACK and send the <ACK> segment. 1871 After sending the acknowledgment, drop the unacceptable 1872 segment and return. 1874 ... 1876 fifth check the ACK field. 1878 if the ACK bit is off drop the segment and return. 1880 if the ACK bit is on 1882 ... 1884 ESTABLISHED STATE 1886 If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <- 1887 SEG.ACK. Also compute a new estimate of round-trip time. 1888 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr; 1889 otherwise use the elapsed time since the first segment in 1890 the retransmission queue was sent. 
Any segments on the 1891 retransmission queue which are thereby entirely 1892 acknowledged... 1894 ... 1896 Seventh, process the segment text.
1898 ESTABLISHED STATE 1899 FIN-WAIT-1 STATE 1900 FIN-WAIT-2 STATE 1901 ... 1903 Send an acknowledgment of the form: 1905 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1907 If the Snd.TS.OK bit is on, include the Timestamps option 1908 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment. 1909 Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send 1910 it. This acknowledgment should be piggy-backed on a segment 1911 being transmitted if possible without incurring undue delay. 1913 ... 1915 Appendix E. Timestamps Edge Cases 1917 While the rules laid out for when to calculate RTTM produce the 1918 correct results most of the time, there are some edge cases where an 1919 incorrect RTTM can be calculated. All of these situations involve 1920 the loss of segments. It is felt that these scenarios are rare, and 1921 that if they should happen, they will cause a single RTTM measurement 1922 to be inflated, which mitigates its effects on RTO calculations. 1924 [Martin03] cites two similar cases when the returning <ACK> is lost, 1925 and before the retransmission timer fires, another returning <ACK> 1926 segment arrives, which acknowledges the data. In this case, the RTTM 1927 calculated will be inflated: 1929 clock 1930 tc=1 -------------------> 1932 tc=2 (lost) <---- 1933 (RTTM would have been 1) 1935 (receive window opens, window update is sent) 1936 tc=5 <---- 1937 (RTTM is calculated at 4) 1939 One thing to note about this situation is that it is somewhat bounded 1940 by RTO + RTT, limiting how far off the RTTM calculation will be. 1941 While more complex scenarios can be constructed that produce larger 1942 inflations (e.g., retransmissions are lost), those scenarios involve 1943 multiple segment losses, and the connection will have other more 1944 serious operational problems than using an inflated RTTM in the RTO 1945 calculation. 1947 Appendix F.
Window Retraction Example 1949 Consider an established TCP connection using a scale factor of 128, 1950 Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very 1951 small window because the receiver is bottlenecked and both ends are 1952 doing small reads and writes. 1954 Consider the ACKs coming back: 1956 SEG.ACK SEG.WIN computed SND.WIN receiver's actual window 1957 1000 2 1256 1300 1959 The sender writes 40 bytes and receiver ACKs: 1961 1040 2 1296 1300 1963 The sender writes 5 additional bytes and the receiver has a problem. 1964 Two choices: 1966 1045 2 1301 1300 - BEYOND BUFFER 1968 1045 1 1173 1300 - RETRACTED WINDOW 1970 This is a general problem and can happen any time the sender does a 1971 write which is smaller than the window scale factor. 1973 In most stacks it is at least partially obscured when the window size 1974 is larger than some small number of segments because the stacks 1975 prefer to announce windows that are an integral number of segments, 1976 rounded up to the next scale factor. This plus silly window 1977 suppression tends to cause less frequent, larger window updates. If 1978 the window was rounded down to a segment size there is more 1979 opportunity to advance the window, the BEYOND BUFFER case above, 1980 rather than retracting it. 1982 Appendix G. RTO calculation modification 1984 Taking multiple RTT samples per window would shorten the history 1985 calculated by the RTO mechanism in [RFC6298], and the below algorithm 1986 aims to maintain a similar history as originally intended by 1987 [RFC6298]. 1989 It is roughly known how many samples a congestion window worth of 1990 data will yield, not accounting for ACK compression, and ACK losses. 1991 Such events will result in more history of the path being reflected 1992 in the final value for RTO, and are uncritical. 
This modification 1993 will ensure that a similar amount of time is taken into account for 1994 the RTO estimation, regardless of how many samples are taken per 1995 window: 1997 ExpectedSamples = ceiling(FlightSize / (SMSS * 2)) 1999 alpha' = alpha / ExpectedSamples 2001 beta' = beta / ExpectedSamples 2003 Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs". 2005 In the algorithm of [RFC6298], use alpha' and beta' in place of 2006 alpha and beta: 2008 RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'| 2010 SRTT <- (1 - alpha') * SRTT + alpha' * R' 2012 (for each sample R') 2014 Appendix H. Changes from RFC 1323 2016 Several important updates and clarifications to the specification in 2017 RFC 1323 are made in this document. The technical changes are 2018 summarized below: 2020 (a) A wrong reference to SND.WND was corrected to SEG.WND in 2021 Section 2.3. 2023 (b) Section 2.4 was added describing the unavoidable window 2024 retraction issue, and explicitly describing the mitigation steps 2025 necessary. 2027 (c) In Section 3.2 the wording of how the Timestamps option negotiation 2028 is to be performed was updated with RFC2119 wording. Further, a 2029 number of paragraphs were added to clarify the expected behavior 2030 with a compliant implementation using TSopt, as RFC1323 left 2031 room for interpretation, e.g., potential late enablement of 2032 TSopt. 2034 (d) The description of which TSecr values can be used to update the 2035 measured RTT has been clarified. Specifically, with timestamps, 2036 the Karn algorithm [Karn87] is disabled. The Karn algorithm 2037 disables all RTT measurements during retransmission, since it is 2038 ambiguous whether the <ACK> is for the original segment, or the 2039 retransmitted segment. With timestamps, that ambiguity is 2040 removed since the TSecr in the <ACK> will contain the TSval from 2041 whichever data segment made it to the destination.
2043 (e) RTTM update processing explicitly excludes segments not updating 2044 SND.UNA. The original text could be interpreted to allow taking 2045 RTT samples when SACK acknowledges some new, non-continuous 2046 data. 2048 (f) In RFC1323, section 3.4, step (2) of the algorithm to control 2049 which timestamp is echoed was incorrect in two regards: 2051 (1) It failed to update TS.Recent for a retransmitted segment 2052 that resulted from a lost <ACK>. 2054 (2) It failed if SEG.LEN = 0. 2056 In the new algorithm, the case of SEG.TSval >= TS.Recent is 2057 included for consistency with the PAWS test. 2059 (g) It is now recommended that the Timestamps option is included in 2060 <RST> segments if the incoming segment contained a Timestamps 2061 option. 2063 (h) <RST> segments are explicitly excluded from PAWS processing. 2065 (i) Added text to clarify the precedence between regular TCP 2066 [RFC0793] and this document's Timestamps option / PAWS processing. 2067 Discussion about combined acceptability checks is ongoing. 2069 (j) Snd.TSoffset and Snd.TSclock variables have been added. 2070 Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This 2071 allows the starting points for timestamp values to be randomized 2072 on a per-connection basis. Setting Snd.TSoffset to zero yields 2073 the same results as [RFC1323]. Text was added to guide 2074 implementors to the proper selection of these offsets, as 2075 entirely random offsets for each new connection will conflict 2076 with PAWS. 2078 (k) Appendix A has been expanded with information about the TCP 2079 Urgent Pointer. An earlier revision contained text around the 2080 TCP MSS option, which was split off into [RFC6691]. 2082 (l) One correction was made to the Event Processing Summary in 2083 Appendix D. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to 2084 fill in the SEG.WND value, not SND.WND.
2086 (m) Appendix G was added to exemplify how an RTO calculation might 2087 be updated to properly take the much higher RTT sampling 2088 frequency enabled by the Timestamps option into account. 2090 Editorial changes to the document that don't impact the 2091 implementation or function of the mechanisms described in this 2092 document include: 2094 (a) Removed much of the discussion in Section 1 to streamline the 2095 document. However, detailed examples and discussions in 2096 Section 2, Section 3 and Section 5 are kept as guidelines for 2097 implementers. 2099 (b) Added short text that the use of WS increases the chances of 2100 sequence number wrap, thus the PAWS mechanism is required in 2101 certain environments. 2103 (c) Removed references to "new" options, as the options were 2104 introduced in [RFC1323] already. Changed the text in 2105 Section 1.3 to specifically address TS and WS options. 2107 (d) Section 1.4 was added for [RFC2119] wording. Normative text was 2108 updated with the appropriate phrases. 2110 (e) Added < > brackets to mark specific types of segments, and 2111 replaced most occurrences of "packet" with "segment", where TCP 2112 segments are referred to. 2114 (f) Updated the text in Section 3 to take into account what has been 2115 learned since [RFC1323]. 2117 (g) Removed the list of changes between [RFC1323] and prior 2118 versions. These changes are mentioned in Appendix C of 2119 [RFC1323]. 2121 (h) Moved the appendix "Changes from RFC 1323" to the end of the 2122 appendices for easier lookup. In addition, the entries were 2123 split into a technical and an editorial part, and sorted to 2124 roughly correspond with the sections in the text where they 2125 apply.
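As an illustration of the RTO calculation modification of Appendix G (technical change (m) above), the following is a minimal sketch, not a definitive implementation. It assumes the gains alpha = 1/8 and beta = 1/4 from [RFC6298]. The function names expected_samples and update_rtt are hypothetical, and the lower bound of one sample (to avoid a division by zero when FlightSize is zero) is an assumption that Appendix G does not state.

```python
import math
from typing import Tuple

ALPHA = 1 / 8  # SRTT gain from RFC 6298
BETA = 1 / 4   # RTTVAR gain from RFC 6298


def expected_samples(flight_size: int, smss: int) -> int:
    """ExpectedSamples = ceiling(FlightSize / (SMSS * 2)).

    The factor 2 accounts for delayed ACKs. The floor of 1 is an
    assumption of this sketch, guarding against FlightSize == 0.
    """
    return max(1, math.ceil(flight_size / (smss * 2)))


def update_rtt(srtt: float, rttvar: float, sample: float,
               flight_size: int, smss: int) -> Tuple[float, float]:
    """Apply one RTT sample R' using the scaled gains alpha' and beta'."""
    n = expected_samples(flight_size, smss)
    alpha_p = ALPHA / n
    beta_p = BETA / n
    # RTTVAR is updated first, using the previous SRTT, as in RFC 6298.
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - sample)
    srtt = (1 - alpha_p) * srtt + alpha_p * sample
    return srtt, rttvar
```

With ten full-sized segments in flight, ExpectedSamples is 5, so each sample moves SRTT and RTTVAR by one fifth of the usual gain; across a whole window of samples the estimator then weighs roughly the same span of time as a one-sample-per-RTT implementation would.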
2127 Authors' Addresses 2129 David Borman 2130 Quantum Corporation 2131 Mendota Heights MN 55120 2132 USA 2134 Email: david.borman@quantum.com 2136 Bob Braden 2137 University of Southern California 2138 4676 Admiralty Way 2139 Marina del Rey CA 90292 2140 USA 2142 Email: braden@isi.edu 2144 Van Jacobson 2145 Google, Inc. 2146 1600 Amphitheatre Parkway 2147 Mountain View CA 94043 2148 USA 2150 Email: vanj@google.com 2152 Richard Scheffenegger (editor) 2153 NetApp, Inc. 2154 Am Euro Platz 2 2155 Vienna, 1120 2156 Austria 2158 Email: rs@netapp.com