idnits 2.17.1

draft-ietf-tcpm-1323bis-17.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The abstract seems to indicate that this document obsoletes RFC1323, but
     the header doesn't have an 'Obsoletes:' line to match this.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 15, 2013) is 3814 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'Ekstroem04' is defined on line 1298, but no explicit
     reference was found in the text

  == Unused Reference: 'Hamming77' is defined on line 1316, but no explicit
     reference was found in the text

  == Unused Reference: 'Jain86' is defined on line 1340, but no explicit
     reference was found in the text

  == Unused Reference: 'Mathis08' is defined on line 1370, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC0896' is defined on line 1395, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1110' is defined on line 1401, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2581' is defined on line 1419, but no explicit
     reference was found in the text

  == Unused Reference: 'Watson81' is defined on line 1463, but no explicit
     reference was found in the text

  == Unused Reference: 'Zhang86' is defined on line 1468, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1110
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185
     (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 6528
     (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 6691
     (Obsoleted by RFC 9293)

     Summary: 1 error (**), 0 flaws (~~), 10 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

TCP Maintenance (TCPM)                                         D. Borman
Internet-Draft                                       Quantum Corporation
Intended status: Standards Track                               B. Braden
Expires: May 19, 2014                             University of Southern
                                                              California
                                                             V. Jacobson
                                                            Google, Inc.
                                                   R. Scheffenegger, Ed.
                                                            NetApp, Inc.
                                                       November 15, 2013


                  TCP Extensions for High Performance
                       draft-ietf-tcpm-1323bis-17

Abstract

   This document specifies a set of TCP extensions to improve
   performance over paths with a large bandwidth * delay product and to
   provide reliable operation over very high-speed paths.  It defines
   the TCP Window Scale (WS) option and the TCP Timestamps (TS) option
   and their semantics.  The Window Scale option is used to support
   larger receive windows, while the Timestamps option can be used for
   at least two distinct mechanisms, PAWS (Protection Against Wrapped
   Sequences) and RTTM (Round Trip Time Measurement), that are also
   described herein.

   This document obsoletes RFC1323 and describes changes from it.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 19, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  TCP Performance
     1.2.  TCP Reliability
     1.3.  Using TCP options
     1.4.  Terminology
   2.  TCP Window Scale option
     2.1.  Introduction
     2.2.  Window Scale option
     2.3.  Using the Window Scale option
     2.4.  Addressing Window Retraction
   3.  TCP Timestamps option
     3.1.  Introduction
     3.2.  Timestamps option
   4.  The RTTM Mechanism
     4.1.  Introduction
     4.2.  Updating the RTO value
     4.3.  Which Timestamp to Echo
   5.  PAWS - Protection Against Wrapped Sequence Numbers
     5.1.  Introduction
     5.2.  The PAWS Mechanism
     5.3.  Basic PAWS Algorithm
     5.4.  Timestamp Clock
     5.5.  Outdated Timestamps
     5.6.  Header Prediction
     5.7.  IP Fragmentation
     5.8.  Duplicates from Earlier Incarnations of Connection
   6.  Conclusions and Acknowledgements
   7.  Security Considerations
     7.1.  Privacy Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Implementation Suggestions
   Appendix B.  Duplicates from Earlier Connection Incarnations
     B.1.  System Crash with Loss of State
     B.2.  Closing and Reopening a Connection
   Appendix C.  Summary of Notation
   Appendix D.  Event Processing Summary
   Appendix E.  Timestamps Edge Cases
   Appendix F.  Window Retraction Example
   Appendix G.  RTO calculation modification
   Appendix H.  Changes from RFC 1323
   Authors' Addresses

1.  Introduction

   The TCP protocol [RFC0793] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments.  Over the
   years, advances in networking technology have resulted in ever-higher
   transmission speeds, and the fastest paths are well beyond the domain
   for which TCP was originally engineered.

   This document defines a set of modest extensions to TCP to extend the
   domain of its application to match the increasing network capability.
   It is an update to and obsoletes [RFC1323], which in turn is based
   upon and obsoletes [RFC1072] and [RFC1185].

   Changes between [RFC1323] and this document are detailed in
   Appendix H.  These changes are partly due to errata in [RFC1323], and
   partly due to the improved understanding of how the involved
   components interact.

   For brevity, the full discussions of the merits and history behind
   the TCP options defined within this document have been omitted.
   [RFC1323] should be consulted for reference.  It is recommended that
   a modern TCP stack implement and make use of the extensions
   described in this document.

1.1.  TCP Performance

   TCP performance problems arise when the bandwidth * delay product is
   large.  A network having such paths is referred to as a "long, fat
   network" (LFN).

   There are two fundamental performance problems with basic TCP over
   LFN paths:

   (1)  Window Size Limit

       The TCP header uses a 16-bit field to report the receive window
       size to the sender.  Therefore, the largest window that can be
       used is 2^16 = 64 KiB.  For LFN paths where the bandwidth *
       delay product exceeds 64 KiB, the receive window limits the
       maximum throughput of the TCP connection over the path, i.e.,
       the amount of unacknowledged data that TCP can send in order to
       keep the pipeline full.

       To circumvent this problem, Section 2 of this memo defines a TCP
       option, "Window Scale", to allow windows larger than 2^16.  This
       option defines an implicit scale factor, which is used to
       multiply the window size value found in a TCP header to obtain
       the true window size.

       Note that the use of large receive windows increases the chance
       of wrapping sequence numbers too quickly, as described below in
       Section 1.2, (1).

   (2)  Recovery from Losses

       Packet losses in an LFN can have a catastrophic effect on
       throughput.

       To generalize the Fast Retransmit / Fast Recovery mechanism to
       handle multiple packets dropped per window, Selective
       Acknowledgments are required.  Unlike the normal cumulative
       acknowledgments of TCP, Selective Acknowledgments give the
       sender a complete picture of which segments are queued at the
       receiver and which have not yet arrived.

       Selective acknowledgments and their use are specified in
       separate documents, "TCP Selective Acknowledgment options"
       [RFC2018], "An Extension to the Selective Acknowledgement (SACK)
       option for TCP" [RFC2883], and "A Conservative Selective
       Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
       [RFC6675], and are not discussed further in this document.

1.2.  TCP Reliability

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments.  TCP reliability
   depends upon the existence of a bound on the lifetime of a segment:
   the "Maximum Segment Lifetime" or MSL.

   Duplication of sequence numbers might happen in either of two ways:

   (1)  Sequence number wrap-around on the current connection

       A TCP sequence number contains 32 bits.  At a high enough
       transfer rate of large volumes of data (at least 4 GiB in the
       same session), the 32-bit sequence space may be "wrapped"
       (cycled) within the time that a segment is delayed in queues.

   (2)  Earlier incarnation of the connection

       Suppose that a connection terminates, either by a proper close
       sequence or due to a host crash, and the same connection (i.e.,
       using the same pair of port numbers) is immediately reopened.  A
       delayed segment from the terminated connection could fall within
       the current window for the new incarnation and be accepted as
       valid.

   Duplicates from earlier incarnations, case (2), are avoided by
   enforcing the current fixed MSL of the TCP specification, as
   explained in Section 5.8 and Appendix B.  In addition, the
   randomizing of ephemeral ports can help to probabilistically reduce
   the chances of duplicates from earlier connections.  However, case
   (1), avoiding the reuse of sequence numbers within the same
   connection, requires an upper bound on MSL that depends upon the
   transfer rate, and at high enough rates, a dedicated mechanism is
   required.

   A possible fix for the problem of cycling the sequence space would be
   to increase the size of the TCP sequence number field.  For example,
   the sequence number field (and also the acknowledgment field) could
   be expanded to 64 bits.  This could be done either by changing the
   TCP header or by means of an additional option.

   Section 5 presents a different mechanism, which we call PAWS
   (Protection Against Wrapped Sequence numbers), to extend TCP
   reliability to transfer rates well beyond the foreseeable upper limit
   of network bandwidths.  PAWS uses the TCP Timestamps option defined
   in Section 3.2 to protect against old duplicates from the same
   connection.

1.3.  Using TCP options

   The extensions defined in this document all use TCP options.

   When [RFC1323] was published, there was concern that some buggy TCP
   implementation might crash on the first appearance of an option on a
   non-<SYN> segment.  However, bugs like that can lead to DoS attacks
   against a TCP.
   Research has shown that most TCP implementations will
   properly handle unknown options on non-<SYN> segments ([Medina04],
   [Medina05]).  But it is still prudent to be conservative in what you
   send, and avoiding buggy TCP implementations is not the only reason
   for negotiating TCP options on <SYN> segments.

   The window scale option negotiates fundamental parameters of the TCP
   session.  Therefore, it is only sent during the initial handshake.
   Furthermore, the window scale option will be sent in a <SYN,ACK>
   segment only if the corresponding option was received in the initial
   <SYN> segment.

   The Timestamps option may appear in any data or <ACK> segment, adding
   10 bytes (up to 12 bytes including padding) to the 20-byte TCP
   header.  It is required that this TCP option will be sent on all
   non-<SYN> segments after an exchange of options on the <SYN> segments
   has indicated that both sides understand this extension.

   Research has shown that the use of the Timestamps option to take
   additional RTT samples within each RTT has little effect on the
   ultimate retransmission timeout value [Allman99].  However, there are
   other uses of the Timestamps option, such as the Eifel mechanism
   [RFC3522], [RFC4015], and PAWS (see Section 5) which improve overall
   TCP security and performance.  The extra header bandwidth used by
   this option should be evaluated for the gains in performance and
   security in an actual deployment.

   Appendix A contains a recommended layout of the options in TCP
   headers to achieve reasonable data field alignment.

   Finally, we observe that most of the mechanisms defined in this
   document are important for LFN's and/or very high-speed networks.
   For low-speed networks, it might be a performance optimization to NOT
   use these mechanisms.
   A TCP vendor concerned about optimal
   performance over low-speed paths might consider turning these
   extensions off for low-speed paths, or allowing a user or
   installation manager to disable them.

1.4.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in UPPER CASE.  Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

2.  TCP Window Scale option

2.1.  Introduction

   The window scale extension expands the definition of the TCP window
   to 30 bits and then uses an implicit scale factor to carry this
   30-bit value in the 16-bit Window field of the TCP header (SEG.WND in
   [RFC0793]).  The exponent of the scale factor is carried in a TCP
   option, Window Scale.  This option is sent only in a <SYN> segment (a
   segment with the SYN bit on), hence the window scale is fixed in each
   direction when a connection is opened.

   The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space.  In a typical modern
   implementation, this maximum buffer space is set by default but can
   be overridden by a user program before a TCP connection is opened.
   This determines the scale factor, and therefore no new user interface
   is needed for window scaling.

2.2.  Window Scale option

   The three-byte Window Scale option MAY be sent in a <SYN> segment by
   a TCP.  It has two purposes: (1) indicate that the TCP is prepared to
   both send and receive window scaling, and (2) communicate the
   exponent of a scale factor to be applied to its receive window.
   Thus, a TCP that is prepared to scale windows SHOULD send the option,
   even if its own scale factor is 1 and the exponent 0.
   The scale
   factor is limited to a power of two and encoded logarithmically, so
   it may be implemented by binary shift operations.  The maximum scale
   exponent is limited to 14 for a maximum permissible receive window
   size of 1 GiB (2^(14+16)).

   TCP Window Scale option (WSopt):

   Kind: 3

   Length: 3 bytes

          +---------+---------+---------+
          | Kind=3  |Length=3 |shift.cnt|
          +---------+---------+---------+
               1         1         1

   This option is an offer, not a promise; both sides MUST send Window
   Scale options in their <SYN> segments to enable window scaling in
   either direction.  If window scaling is enabled, then the TCP that
   sent this option will right-shift its true receive-window values by
   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
   MAY be zero (offering to scale, while applying a scale factor of 1 to
   the receive window).

   This option MAY be sent in an initial <SYN> segment (i.e., a segment
   with the SYN bit on and the ACK bit off).  It MAY also be sent in a
   <SYN,ACK> segment, but only if a Window Scale option was received in
   the initial <SYN> segment.  A Window Scale option in a segment
   without a SYN bit MUST be ignored.

   The window field in a segment where the SYN bit is set (i.e., a <SYN>
   or <SYN,ACK>) MUST NOT be scaled.

2.3.  Using the Window Scale option

   A model implementation of window scaling is as follows, using the
   notation of [RFC0793]:

   o  The connection state is augmented by two window shift counters,
      Snd.Wind.Shift and Rcv.Wind.Shift, to be applied to the incoming
      and outgoing window fields, respectively.

   o  If a TCP receives a <SYN> segment containing a Window Scale
      option, it SHOULD send its own Window Scale option in the
      <SYN,ACK> segment.

   o  The Window Scale option MUST be sent with shift.cnt = R, where R
      is the value that the TCP would like to use for its receive
      window.

   o  Upon receiving a <SYN> segment with a Window Scale option
      containing shift.cnt = S, a TCP MUST set Snd.Wind.Shift to S and
      MUST set Rcv.Wind.Shift to R; otherwise, it MUST set both
      Snd.Wind.Shift and Rcv.Wind.Shift to zero.

   o  The window field (SEG.WND) in the header of every incoming
      segment, with the exception of <SYN> segments, MUST be
      left-shifted by Snd.Wind.Shift bits before updating SND.WND:

         SND.WND = SEG.WND << Snd.Wind.Shift

      (assuming the other conditions of [RFC0793] are met, and using the
      "C" notation "<<" for left-shift).

   o  The window field (SEG.WND) of every outgoing segment, with the
      exception of <SYN> segments, MUST be right-shifted by
      Rcv.Wind.Shift bits:

         SEG.WND = RCV.WND >> Rcv.Wind.Shift

   TCP determines if a data segment is "old" or "new" by testing whether
   its sequence number is within 2^31 bytes of the left edge of the
   window, and if it is not, discarding the data as "old".  To ensure
   that new data is never mistakenly considered old and vice versa, the
   left edge of the sender's window has to be at most 2^31 away from the
   right edge of the receiver's window.  The same holds for the sender's
   right edge and the receiver's left edge.  Since the right and left
   edges of either the sender's or receiver's window differ by the
   window size, and since the sender and receiver windows can be out of
   phase by at most the window size, the above constraints imply that
   two times the maximum window size must be less than 2^31, or

                            max window < 2^30

   Since the max window is 2^S (where S is the scaling shift count)
   times at most 2^16 - 1 (the maximum unscaled window), the maximum
   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
   MUST be limited to 14 (which allows windows of 2^30 = 1 GiB).  If a
   Window Scale option is received with a shift.cnt value larger than
   14, the TCP SHOULD log the error but MUST use 14 instead of the
   specified value.
   This is safe as a sender can always choose to only
   partially use any signaled receive window.  If the receiver is
   scaling by a factor larger than 14 and the sender is only scaling by
   14, then the receive window used by the sender will appear smaller
   than it is in reality.

   The scale factor applies only to the Window field as transmitted in
   the TCP header; each TCP using extended windows will maintain the
   window values locally as 32-bit numbers.  For example, the
   "congestion window" computed by Slow Start and Congestion Avoidance
   (see [RFC5681]) is not affected by the scale factor, so window
   scaling will not introduce quantization into the congestion window.

2.4.  Addressing Window Retraction

   When a non-zero scale factor is in use, there are instances when a
   retracted window can be offered - see Appendix F for a detailed
   example.  The end of the window will be on a boundary based on the
   granularity of the scale factor being used.  If the sequence number
   is then updated by a number of bytes smaller than that granularity,
   the TCP will have to either advertise a new window that is beyond
   what it previously advertised (and perhaps beyond the buffer), or
   will have to advertise a smaller window, which will cause the TCP
   window to shrink.  Implementations MUST ensure that they handle a
   shrinking window, as specified in section 4.2.2.16 of [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in-window, any segment that would
       have been in-window for any <ACK> sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent <ACK>, if
       more than one segment has arrived since the application consumed
       any data in the receive buffer).
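
   The shift rules of Section 2.3 and the receiver-side rules above can
   be illustrated with a small sketch.  It is not taken from this
   document: the Receiver class, its field names, and the sample
   numbers are hypothetical, and sequence-number wraparound is ignored
   to keep the arithmetic visible.

```python
MAX_SHIFT = 14          # largest shift.cnt permitted by Section 2.3
MAX_FIELD = 0xFFFF      # largest value of the 16-bit SEG.WND field


def clamp_shift(shift_cnt):
    """Use 14 in place of any received shift.cnt larger than 14."""
    return min(shift_cnt, MAX_SHIFT)


def to_seg_wnd(rcv_wnd, rcv_wind_shift):
    """Outgoing direction: SEG.WND = RCV.WND >> Rcv.Wind.Shift."""
    return min(rcv_wnd >> rcv_wind_shift, MAX_FIELD)


def to_snd_wnd(seg_wnd, snd_wind_shift):
    """Incoming direction: SND.WND = SEG.WND << Snd.Wind.Shift."""
    return seg_wnd << snd_wind_shift


class Receiver:
    """Hypothetical receiver state; sequence wraparound is ignored."""

    def __init__(self, rcv_nxt, buffer_end, shift):
        self.rcv_nxt = rcv_nxt        # next sequence number expected
        self.buffer_end = buffer_end  # first sequence number past the buffer
        self.shift = shift            # Rcv.Wind.Shift
        self.max_edge = rcv_nxt       # highest right edge ever advertised

    def advertise(self):
        """Return (SEG.WND field, right edge the peer will compute)."""
        field = to_seg_wnd(self.buffer_end - self.rcv_nxt, self.shift)
        edge = self.rcv_nxt + to_snd_wnd(field, self.shift)
        self.max_edge = max(self.max_edge, edge)
        return field, edge

    def in_window(self, seq):
        """Honor anything that was in-window for any earlier <ACK>."""
        return self.rcv_nxt <= seq < self.max_edge


rx = Receiver(rcv_nxt=1000, buffer_end=103400, shift=7)
first = rx.advertise()    # 102400 bytes is a multiple of 2^7 = 128
rx.rcv_nxt += 10          # 10 bytes arrive; the application reads nothing
second = rx.advertise()   # 102390 bytes no longer fit the 128-byte grid
print(first, second, rx.max_edge)
```

   In this run the advertised right edge moves back from 103400 to
   103282, a retraction of 118 bytes, even though the buffer itself did
   not shrink.  Tracking max_edge is one way to satisfy rules 1) and
   2): a segment ending just below 103400 is still treated as
   in-window.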

   On the sender side:

   3)  The initial transmission MUST be within the window announced by
       the most recent <ACK>.

   4)  On first retransmission, or if the sequence number is
       out-of-window by less than 2^Rcv.Wind.Shift, then do normal
       retransmission(s) without regard to the receiver window as long
       as the original segment was in window when it was sent.

   5)  Subsequent retransmissions MAY only be sent if they are within
       the window announced by the most recent <ACK>.

3.  TCP Timestamps option

3.1.  Introduction

   The Timestamps option is introduced to address some of the issues
   mentioned in Section 1.1 and Section 1.2.  The Timestamps option is
   specified in a symmetrical manner, so that TSval timestamps are
   carried in both data and <ACK> segments and are echoed in TSecr
   fields carried in returning <ACK> or data segments.  Although
   originally used primarily for timestamping individual segments, the
   Timestamps option has properties that allow its use not only for
   taking time measurements (Section 4) but for additional purposes as
   well (Section 5).

   It is necessary to remember that there is a distinction between the
   Timestamps option conveying timestamp information, and the use of
   that information.  In particular, the Round Trip Time Measurement
   (RTTM) mechanism must be viewed independently from updating the
   Retransmission Timeout (RTO) (see Section 4.2).  In this case, the
   sample granularity also needs to be taken into account.  Other
   mechanisms, such as PAWS, or Eifel, are not built upon the timestamp
   information itself, but are based on the intrinsic property of
   monotonically non-decreasing values.

   The Timestamps option is important when large receive windows are
   used, to allow the use of the PAWS mechanism (see Section 5).
   Furthermore, the option may be useful for all TCP's, since it
   simplifies the sender and allows the use of additional optimizations
   such as Eifel ([RFC3522], [RFC4015]) and others ([RFC6817],
   [Kuzmanovic03], [Kuehlewind10]).

3.2.  Timestamps option

   TCP is a symmetric protocol, allowing data to be sent at any time in
   either direction, and therefore timestamp echoing may occur in either
   direction.  For simplicity and symmetry, we specify that timestamps
   always be sent and echoed in both directions.  For efficiency, we
   combine the timestamp and timestamp reply fields into a single TCP
   Timestamps option.

   TCP Timestamps option (TSopt):

   Kind: 8

   Length: 10 bytes

          +-------+-------+---------------------+---------------------+
          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
          +-------+-------+---------------------+---------------------+
              1       1              4                     4

   The Timestamps option carries two four-byte timestamp fields.  The
   Timestamp Value field (TSval) contains the current value of the
   timestamp clock of the TCP sending the option.

   The Timestamp Echo Reply (TSecr) field is valid if the ACK bit is set
   in the TCP header.  If the ACK bit is not set in the outgoing TCP
   header, the sender of that segment SHOULD set the TSecr field to
   zero.  When the ACK bit is set in an outgoing segment, the sender
   MUST echo, in the TSecr field, a recently received Timestamp Value
   (TSval) that the remote TCP sent in the TSval field of a Timestamps
   option.  The exact rules on which TSval MUST be echoed are given in
   Section 4.3.  When the ACK bit is not set, the receiver MUST ignore
   the value of the TSecr field.

   A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
   segment (i.e., segment containing a SYN bit and no ACK bit), and MAY
   send a TSopt in <SYN,ACK> only if it received a TSopt in the initial
   <SYN> segment for the connection.
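
   The option layout above can be sketched as a pair of encode and
   decode helpers.  This is an illustration only: the function names
   are hypothetical, the option is handled in isolation (no padding and
   no surrounding TCP header), and returning None for an invalid TSecr
   is a design choice of the sketch, not something this document
   prescribes.

```python
import struct

TSOPT_KIND = 8
TSOPT_LEN = 10
TSOPT_FORMAT = "!BBII"   # Kind, Length, TSval, TSecr in network byte order


def build_tsopt(tsval, tsecr, ack_set):
    """Encode a 10-byte Timestamps option.

    When the ACK bit is not set, TSecr is not valid and is sent as zero.
    """
    if not ack_set:
        tsecr = 0
    return struct.pack(TSOPT_FORMAT, TSOPT_KIND, TSOPT_LEN,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def parse_tsopt(option, ack_set):
    """Decode a Timestamps option into (TSval, TSecr).

    TSecr is returned as None when the ACK bit is not set, because the
    receiver must ignore the field in that case.
    """
    if len(option) != TSOPT_LEN:
        raise ValueError("Timestamps option must be 10 bytes long")
    kind, length, tsval, tsecr = struct.unpack(TSOPT_FORMAT, option)
    if kind != TSOPT_KIND or length != TSOPT_LEN:
        raise ValueError("not a Timestamps option")
    return tsval, (tsecr if ack_set else None)


syn_option = build_tsopt(tsval=120, tsecr=999, ack_set=False)
ack_option = build_tsopt(tsval=127, tsecr=1, ack_set=True)
print(syn_option.hex(), parse_tsopt(ack_option, ack_set=True))
```

   The "!" prefix selects network byte order with no alignment padding,
   so the packed option is exactly the 10 bytes shown in the diagram.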

   Once TSopt has been successfully negotiated, that is both <SYN>, and
   <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
   segment for the duration of the connection, and SHOULD be sent in an
   <RST> segment (see Section 5.2 for details).  The TCP SHOULD remember
   this state by setting a flag, referred to as Snd.TS.OK, to one.  If a
   non-<RST> segment is received without a TSopt, a TCP SHOULD silently
   drop the segment.  A TCP MUST NOT abort a TCP connection because any
   segment lacks an expected TSopt.

   Implementations are strongly encouraged to follow the above rules for
   handling a missing Timestamps option, and the order of precedence
   mentioned in Section 5.3 when deciding on the acceptance of a
   segment.

   If a receiver chooses to accept a segment without an expected
   Timestamps option, it must be clear that undetectable data corruption
   may occur.

   Such a TCP receiver may experience undetectable wrapped-sequence
   effects, such as data (payload) corruption or session stalls.  In
   order to maintain the integrity of the payload data, in particular on
   high speed networks, it is paramount to follow the described
   processing rules.

   However, it has been mentioned that under some circumstances, the
   above guidelines are too strict, and some paths sporadically suppress
   the Timestamps option, while maintaining payload integrity.  A path
   behaving in this manner should be deemed unacceptable, but it has
   been noted that some implementations relax the acceptance rules as a
   workaround, and allow TCP to run across such paths [Oppermann13].

   If a TSopt is received on a connection where TSopt was not negotiated
   in the initial three-way handshake, the TSopt MUST be ignored and the
   packet processed normally.

   In the case of crossing <SYN> segments where one <SYN> contains a
   TSopt and the other doesn't, both sides MAY send a TSopt in the
   <SYN,ACK> segment.
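
   The negotiation and acceptance rules above can be read as a small
   decision function.  The sketch below is one reading of those rules,
   not a normative algorithm: the Action names and both function names
   are hypothetical, the crossing-<SYN> case is not modeled, and the
   PAWS checks of Section 5.3 are left out.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process the segment and use its TSopt"
    IGNORE_TSOPT = "process the segment but ignore its TSopt"
    PROCESS_WITHOUT = "process the segment, which carries no TSopt"
    DROP = "silently drop the segment"


def negotiate(syn_has_tsopt, synack_has_tsopt):
    """Return Snd.TS.OK: set only if <SYN> and <SYN,ACK> both had TSopt."""
    return syn_has_tsopt and synack_has_tsopt


def classify(snd_ts_ok, has_tsopt, is_rst):
    """Decide how to treat a segment that arrives after the handshake."""
    if not snd_ts_ok:
        # TSopt was never negotiated: a stray option MUST be ignored
        # and the packet processed normally.
        return Action.IGNORE_TSOPT if has_tsopt else Action.PROCESS_WITHOUT
    if has_tsopt:
        return Action.PROCESS
    if is_rst:
        # TSopt is only a SHOULD on <RST>, so its absence is tolerated.
        return Action.PROCESS_WITHOUT
    # A non-<RST> segment lacks the expected TSopt: it SHOULD be dropped,
    # and the connection MUST NOT be aborted because of it.
    return Action.DROP


snd_ts_ok = negotiate(syn_has_tsopt=True, synack_has_tsopt=True)
print(classify(snd_ts_ok, has_tsopt=False, is_rst=False))
```

   Note that no branch tears the connection down; the strictest outcome
   is a silent drop of the offending segment.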

   TSopt is required for the two mechanisms described in Sections 4 and
   5.  There are also other mechanisms that rely on the presence of the
   TSopt, e.g., [RFC3522].  If a TCP stops sending TSopt at any time
   during an established session, it interferes with these mechanisms.
   This update to [RFC1323] describes explicitly the previous assumption
   (see Section 5.2), that each TCP segment must have TSopt, once
   negotiated.

4.  The RTTM Mechanism

4.1.  Introduction

   One use of the Timestamps option is to measure the round trip time of
   virtually every packet acknowledged.  The Round Trip Time Measurement
   (RTTM) mechanism requires a Timestamps option in every measured
   segment, with a TSval that is obtained from a (virtual) "timestamp
   clock".  Values of this clock MUST be at least approximately
   proportional to real time, in order to measure actual RTT.

   TCP measures the round trip time (RTT), primarily for the purpose of
   arriving at a reasonable value for the Retransmission Timeout (RTO)
   timer interval.  Accurate and current RTT estimates are necessary to
   adapt to changing traffic conditions, while a conservative estimate
   of the RTO interval is necessary to minimize spurious RTOs.

   These TSval values are echoed in TSecr values in the reverse
   direction.  The difference between a received TSecr value and the
   current timestamp clock value provides an RTT measurement.

   When timestamps are used, every segment that is received will contain
   a TSecr value.  However, these values cannot all be used to update
   the measured RTT.  The following example illustrates why.  It shows a
   one-way data flow with segments arriving in sequence without loss.
   Here A, B, C... represent data blocks occupying successive blocks of
   sequence numbers, and ACK(A),... represent the corresponding
   cumulative acknowledgments.  The two timestamp fields of the
   Timestamps option are shown symbolically as <TSval=x,TSecr=y>.
Each TSecr field contains the value most recently received in a TSval
field.

   TCP A                                          TCP B

           <A,TSval=1,TSecr=120> ----->

           <---- <ACK(A),TSval=127,TSecr=1>

           <B,TSval=5,TSecr=127> ----->

           <---- <ACK(B),TSval=131,TSecr=5>

   . . . . . . . . . . . . . . . . . . . . . .

           <C,TSval=65,TSecr=131> ---->

           <---- <ACK(C),TSval=191,TSecr=65>
           (etc.)

The dotted line marks a pause (60 time units long) in which A had
nothing to send. Note that this pause inflates the RTT which B could
infer from receiving TSecr=131 in data segment C. Thus, in one-way
data flows, RTTM in the reverse direction measures a value that is
inflated by gaps in sending data. However, the following rule
prevents a resulting inflation of the measured RTT:

   RTTM Rule: A TSecr value received in a segment MAY be used to update
              the averaged RTT measurement only if the segment advances
              the left edge of the send window, i.e. SND.UNA is
              increased.

Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement.

4.2. Updating the RTO value

When [RFC1323] was originally written, it was perceived that taking
RTT measurements for each segment, and also during retransmissions,
would help to reduce spurious RTOs, while maintaining the timeliness
of necessary RTOs. At the time, RTO was also the only mechanism to
make use of the measured RTT. It has since been shown that taking
more RTT samples has only a very limited effect on optimizing RTOs
[Allman99].

Implementers should note that with timestamps multiple RTTMs can be
taken per RTT. The [RFC6298] RTO estimator has weighting factors,
alpha and beta, based on an implicit assumption that at most one RTTM
will be sampled per RTT. When multiple RTTMs per RTT are available
to update the RTO estimator, an implementation SHOULD try to adhere
to the spirit of the history specified in [RFC6298]. An
implementation suggestion is detailed in Appendix G.
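As an illustration of the RTTM Rule and of feeding several samples per
RTT into the estimator, the sketch below gates each sample on SND.UNA
advancing and then applies the [RFC6298] smoothing with alpha and beta
divided by the number of samples expected per RTT. That scaling is an
assumption of this sketch, chosen as one plausible way to keep the
estimator's history from shrinking; it is not presented as the text of
Appendix G. The names rtt_sample, RtoEstimator and expected_samples
are hypothetical, and times are in timestamp clock ticks.

```python
MASK32 = 0xFFFFFFFF


def rtt_sample(ts_now, seg_tsecr, snd_una_old, snd_una_new):
    """Return an RTT sample in clock ticks, or None if the RTTM Rule
    says the TSecr of this segment must not be used."""
    # The segment must advance the left edge of the send window
    # (modular 32-bit comparison, as for sequence numbers).
    advanced = 0 < ((snd_una_new - snd_una_old) & MASK32) < 2**31
    if not advanced:
        return None
    # Difference between the current clock and the echoed timestamp.
    return (ts_now - seg_tsecr) & MASK32


class RtoEstimator:
    """RFC 6298 style SRTT/RTTVAR smoothing with scaled gains."""

    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self, clock_granularity):
        self.g = clock_granularity
        self.srtt = None
        self.rttvar = None

    def update(self, r, expected_samples=1):
        """Fold sample r into the estimate and return the new RTO.
        expected_samples is the assumed number of samples per RTT."""
        if self.srtt is None:
            # First measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            alpha = self.ALPHA / expected_samples
            beta = self.BETA / expected_samples
            self.rttvar = (1 - beta) * self.rttvar + beta * abs(self.srtt - r)
            self.srtt = (1 - alpha) * self.srtt + alpha * r
        return self.srtt + max(self.g, 4 * self.rttvar)
```

With expected_samples left at 1 the update reduces to the unmodified
[RFC6298] computation, so a stack that samples once per RTT behaves
exactly as before.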
[Ludwig00] and [Floyd05] have highlighted the problem that an
unmodified RTO calculation, which is updated with per-packet RTT
samples, will truncate the path history too soon. This can lead to
an increase in spurious retransmissions, when the path properties
vary in the order of a few RTTs, but a high number of RTT samples are
taken on a much shorter timescale.

4.3. Which Timestamp to Echo

If more than one Timestamps option is received before a reply segment
is sent, the TCP must choose only one of the TSvals to echo, ignoring
the others. To minimize the state kept in the receiver (i.e., the
number of unprocessed TSvals), the receiver should be required to
retain at most one timestamp in the connection control block.

There are three situations to consider:

(A) Delayed ACKs.

    Many TCPs acknowledge only every second segment out of a group
    of segments arriving within a short time interval; this policy
    is known generally as "delayed ACKs". The data-sender TCP must
    measure the effective RTT, including the additional time due to
    delayed ACKs, or else it will retransmit unnecessarily. Thus,
    when delayed ACKs are in use, the receiver SHOULD reply with the
    TSval field from the earliest unacknowledged segment.

(B) A hole in the sequence space (segment(s) have been lost).

    The sender will continue sending until the window is filled, and
    the receiver may be generating <ACK>s as these out-of-order
    segments arrive (e.g., to aid "fast retransmit").

    The lost segment is probably a sign of congestion, and in that
    situation the sender should be conservative about
    retransmission. Furthermore, it is better to overestimate than
    underestimate the RTT. An <ACK> for an out-of-order segment
    SHOULD therefore contain the timestamp from the most recent
    segment that advanced RCV.NXT.

    The same situation occurs if segments are re-ordered by the
    network.
(C) A filled hole in the sequence space.

    The segment that fills the hole and advances the window
    represents the most recent measurement of the network
    characteristics. An RTT computed from an earlier segment would
    probably include the sender's retransmit time-out, badly biasing
    the sender's average RTT estimate. Thus, the timestamp from the
    latest segment (which filled the hole) MUST be echoed.

An algorithm that covers all three cases is described in the
following rules for Timestamps option processing on a synchronized
connection:

(1) The connection state is augmented with two 32-bit slots:

    TS.Recent holds a timestamp to be echoed in TSecr whenever a
    segment is sent, and Last.ACK.sent holds the ACK field from the
    last segment sent. Last.ACK.sent will equal RCV.NXT except when
    <ACK>s have been delayed.

(2) If:

       SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

    then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.

(3) When a TSopt is sent, its TSecr field is set to the current
    TS.Recent value.

The following examples illustrate these rules. Here A, B, C...
represent data segments occupying successive blocks of sequence
numbers, and ACK(A),... represent the corresponding acknowledgment
segments. Note that ACK(A) has the same sequence number as B. We
show only one direction of timestamp echoing, for clarity.

o  Segments arrive in sequence, and some of the <ACK>s are delayed.

   By case (A), the timestamp from the oldest unacknowledged segment
   is echoed.

                                               TS.Recent
               <A, TSval=1> ------------------->
                                                   1
               <B, TSval=2> ------------------->
                                                   1
               <C, TSval=3> ------------------->
                                                   1
                        <---- <ACK(C), TSecr=1>
               (etc)

o  Segments arrive out of order, and every segment is acknowledged.

   By case (B), the timestamp from the last segment that advanced the
   left window edge is echoed, until the missing segment arrives; it
   is echoed according to Case (C).
   The same sequence would occur if
   segments B and D were lost and retransmitted.

                                               TS.Recent
               <A, TSval=1> ------------------->
                                                   1
                        <---- <ACK(A), TSecr=1>
                                                   1
               <C, TSval=3> ------------------->
                                                   1
                        <---- <ACK(A), TSecr=1>
                                                   1
               <B, TSval=2> ------------------->
                                                   2
                        <---- <ACK(C), TSecr=2>
                                                   2
               <E, TSval=5> ------------------->
                                                   2
                        <---- <ACK(C), TSecr=2>
                                                   2
               <D, TSval=4> ------------------->
                                                   4
                        <---- <ACK(E), TSecr=4>
               (etc)

5. PAWS - Protection Against Wrapped Sequence Numbers

5.1. Introduction

Another use for the Timestamps option is the mechanism to Protect
Against Wrapped Sequence numbers (PAWS). Section 5.2 describes a
simple mechanism to reject old duplicate segments that might corrupt
an open TCP connection. PAWS operates within a single TCP
connection, using state that is saved in the connection control
block. Section 5.8 and Appendix H discuss the implications of the
PAWS mechanism for avoiding old duplicates from previous incarnations
of the same connection.

5.2. The PAWS Mechanism

PAWS uses the TCP Timestamps option described earlier, and assumes
that every received TCP segment (including data and <ACK> segments)
contains a timestamp SEG.TSval whose values are monotonically
non-decreasing in time. The basic idea is that a segment can be
discarded as an old duplicate if it is received with a timestamp
SEG.TSval less than some timestamp recently received on this
connection.

In the PAWS mechanism, the "timestamps" are 32-bit unsigned integers
in a modular 32-bit space. Thus, "less than" is defined the same way
it is for TCP sequence numbers, and the same implementation
techniques apply. If s and t are timestamp values,

   s < t if 0 < (t - s) < 2^31,

computed in unsigned 32-bit arithmetic.

The choice of incoming timestamps to be saved for this comparison
MUST guarantee a value that is monotonically non-decreasing.
For example, an implementation might save the timestamp from the
segment that last advanced the left edge of the receive window, i.e.,
the most recent in-sequence segment. For simplicity, the value
TS.Recent introduced in Section 4.3 is used instead, as using a
common value for both PAWS and RTTM simplifies the implementation.
As Section 4.3 explained, TS.Recent differs from the timestamp from
the last in-sequence segment only in the case of delayed <ACK>s, and
therefore by less than one window. Either choice will therefore
protect against sequence number wrap-around.

PAWS submits all incoming segments to the same test, and therefore
protects against duplicate <ACK> segments as well as data segments.
(An alternative non-symmetric algorithm would protect against old
duplicate <ACK>s: the sender of data would reject incoming <ACK>
segments whose TSecr values were less than the TSecr saved from the
last segment whose ACK field advanced the left edge of the send
window. This algorithm was deemed to lack economy of mechanism and
symmetry.)

TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
initialize PAWS. PAWS protects against old duplicate non-<SYN>
segments, and duplicate <SYN> segments received while there is a
synchronized connection. Duplicate <SYN> and <SYN,ACK> segments
received when there is no connection will be discarded by the normal
3-way handshake and sequence number checks of TCP.

[RFC1323] recommended that <RST> segments NOT carry timestamps, and
that they be acceptable regardless of their timestamp. At that time,
the thinking was that old duplicate <RST> segments should be
exceedingly unlikely, and their cleanup function should take
precedence over timestamps. More recently, discussions about various
blind attacks on TCP connections have raised the suggestion that if
the Timestamps option is present, SEG.TSecr could be used to provide
stricter acceptance tests for <RST> segments.
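The modular "less than" defined earlier in this section, and the
TS.Recent update condition from Section 4.3 that PAWS shares with
RTTM, can be sketched as follows. This is an illustrative sketch
only: the function names ts_lt, seq_leq and update_ts_recent are
hypothetical, and treating a difference of exactly 2^31 as "not less
than" is a choice made here for simplicity.

```python
MASK32 = 0xFFFFFFFF


def ts_lt(s, t):
    """Modular 32-bit comparison: s < t if 0 < (t - s) < 2^31."""
    return 0 < ((t - s) & MASK32) < 2**31


def seq_leq(a, b):
    """a <= b in the same modular space (used for sequence numbers)."""
    return a == b or ts_lt(a, b)


def update_ts_recent(ts_recent, last_ack_sent, seg_tsval, seg_seq):
    """Return the new TS.Recent after an accepted segment arrives.

    SEG.TSval is copied to TS.Recent only if
    SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent.
    """
    tsval_ok = not ts_lt(seg_tsval, ts_recent)   # SEG.TSval >= TS.Recent
    seq_ok = seq_leq(seg_seq, last_ack_sent)     # SEG.SEQ <= Last.ACK.sent
    return seg_tsval if (tsval_ok and seq_ok) else ts_recent
```

For instance, with TS.Recent = 1 and Last.ACK.sent = 2000, an
out-of-order segment at sequence number 3000 leaves TS.Recent
unchanged, while the segment at 2000 that fills the hole replaces it.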
While still under discussion, to enable research into this area it is
now RECOMMENDED that, when generating an <RST>, if the segment
causing the <RST> to be generated contained a Timestamps option, the
<RST> also contain a Timestamps option. In the <RST> segment,
SEG.TSecr SHOULD be set to SEG.TSval from the incoming segment and
SEG.TSval SHOULD be set to zero. If an <RST> is being generated
because of a user abort, and Snd.TS.OK is set, then a Timestamps
option SHOULD be included in the <RST>. When an <RST> segment is
received, it MUST NOT be subjected to the PAWS check by verifying an
acceptable value in SEG.TSval, and information from the Timestamps
option MUST NOT be used to update connection state information.
SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

5.3. Basic PAWS Algorithm

If the PAWS algorithm is used, the following processing MUST be
performed on all incoming segments for a synchronized connection.
Also, PAWS processing MUST take precedence over the regular TCP
acceptability check (Section 3.3 in [RFC0793]), which is performed
after verification of the received Timestamps option:

R1) If there is a Timestamps option in the arriving segment,
    SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion)
    and the RST bit is not set, then treat the arriving segment as
    not acceptable:

       Send an acknowledgement in reply as specified in [RFC0793]
       page 69 and drop the segment.

       Note: it is necessary to send an <ACK> segment in order to
       retain TCP's mechanisms for detecting and recovering from
       half-open connections. For example, see Figure 10 of
       [RFC0793].

R2) If the segment is outside the window, reject it (normal TCP
    processing).

R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
    Section 4.3), then record its timestamp in TS.Recent.

R4) If an arriving segment is in-sequence (i.e., at the left window
    edge), then accept it normally.
R5) Otherwise, treat the segment as a normal in-window,
    out-of-sequence TCP segment (e.g., queue it for later delivery
    to the user).

Steps R2, R4, and R5 are the normal TCP processing steps specified by
[RFC0793].

It is important to note that the timestamp MUST be checked only when
a segment first arrives at the receiver, regardless of whether it is
in-sequence or it must be queued for later delivery.

Consider the following example.

   Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been
   sent, where the letter indicates the sequence number and the digit
   represents the timestamp. Suppose also that segment B.1 has been
   lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ...,
   Z.1 are considered acceptable and are queued. When B is
   retransmitted as segment B.2 (using the latest timestamp), it
   fills the hole and causes all the segments through Z to be
   acknowledged and passed to the user. The timestamps of the queued
   segments are *not* inspected again at this time, since they have
   already been accepted. When B.2 is accepted, TS.Recent is set to
   2.

This rule allows reasonable performance under loss. A full window of
data is in transit at all times, and after a loss a full window less
one segment will show up out-of-sequence to be queued at the receiver
(e.g., up to ~2^30 bytes of data); the Timestamps option must not
result in discarding this data.

In certain unlikely circumstances, the algorithm of rules R1-R5 could
lead to discarding some segments unnecessarily, as shown in the
following example:

   Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
   sent in sequence and that segment B.1 has been lost. Furthermore,
   suppose delivery of some of C.1, ... Z.1 is delayed until *after*
   the retransmission B.2 arrives at the receiver.
   These delayed
   segments will be discarded unnecessarily when they do arrive,
   since their timestamps are now out of date.

This case is very unlikely to occur. If the retransmission was
triggered by a timeout, some of the segments C.1, ... Z.1 must have
been delayed longer than the RTO time. This is presumably an
unlikely event, or there would be many spurious timeouts and
retransmissions. If B's retransmission was triggered by the "fast
retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued
segments that caused these <ACK>s must have been received already.

Even if a segment were delayed past the RTO, the Fast Retransmit
mechanism [Jacobson90c] will cause the delayed segments to be
retransmitted at the same time as B.2, avoiding an extra RTT and
therefore causing a very small performance penalty.

We know of no case with a significant probability of occurrence in
which timestamps will cause performance degradation by unnecessarily
discarding segments.

5.4. Timestamp Clock

It is important to understand that the PAWS algorithm does not
require clock synchronization between sender and receiver. The
sender's timestamp clock is used as a source of monotonic
non-decreasing values to stamp the segments. The receiver treats the
timestamp value as simply a monotonically non-decreasing serial
number, without any connection to time. From the receiver's
viewpoint, the timestamp is acting as a logical extension of the
high-order bits of the sequence number.

The receiver algorithm does place some requirements on the frequency
of the timestamp clock.

(a) The timestamp clock must not be "too slow".

    It MUST tick at least once for each 2^31 bytes sent.
    In fact,
    in order to be useful to the sender for round trip timing, the
    clock SHOULD tick at least once per window's worth of data, and
    even with the window extension defined in Section 2.2, 2^31
    bytes must be at least two windows.

    To make this more quantitative, any clock faster than 1 tick/sec
    will reject old duplicate segments for link speeds of ~8 Gbps.
    A 1 ms timestamp clock will work at link speeds up to 8 Tbps
    (8*10^12 bps)!

(b) The timestamp clock must not be "too fast".

    The recycling time of the timestamp clock MUST be greater than
    MSL seconds. Since the clock (timestamp) is 32 bits and the
    worst-case MSL is 255 seconds, the maximum acceptable clock
    frequency is one tick every 59 ns.

    However, it is desirable to establish a much longer recycle
    period, in order to handle outdated timestamps on idle
    connections (see Section 5.5), and to relax the MSL requirement
    for preventing sequence number wrap-around. With a 1 ms
    timestamp clock, the 32-bit timestamp will wrap its sign bit in
    24.8 days. Thus, it will reject old duplicates on the same
    connection if MSL is 24.8 days or less. This appears to be a
    very safe figure; an MSL of 24.8 days or longer can probably be
    assumed in the Internet without requiring precise MSL
    enforcement.

Based upon these considerations, we choose a timestamp clock
frequency in the range 1 ms to 1 sec per tick. This range also
matches the requirements of the RTTM mechanism, which does not need
much more resolution than the granularity of the retransmit timer,
e.g., tens or hundreds of milliseconds.

The PAWS mechanism also puts a strong monotonicity requirement on the
sender's timestamp clock. The method of implementation of the
timestamp clock to meet this requirement depends upon the system
hardware and software.
o  Some hosts have a hardware clock that is guaranteed to be
   monotonic between hardware resets.

o  A clock interrupt may be used to simply increment a binary integer
   by 1 periodically.

o  The timestamp clock may be derived from a system clock that is
   subject to being abruptly changed, by adding a variable offset
   value. This offset is initialized to zero. When a new timestamp
   clock value is needed, the offset can be adjusted as necessary to
   make the new value equal to or larger than the previous value
   (which was saved for this purpose).

o  A random offset may be added to the timestamp clock on a per
   connection basis. See [RFC6528], section 3, on randomizing the
   initial sequence number (ISN). The same function with a different
   secret key can be used to generate the per connection timestamp
   offset.

5.5. Outdated Timestamps

If a connection remains idle long enough for the timestamp clock of
the other TCP to wrap its sign bit, then the value saved in TS.Recent
will become too old; as a result, the PAWS mechanism will cause all
subsequent segments to be rejected, freezing the connection (until
the timestamp clock wraps its sign bit again).

With the chosen range of timestamp clock frequencies (1 sec to 1 ms),
the time to wrap the sign bit will be between 24.8 days and 24800
days. A TCP connection that is idle for more than 24 days and then
comes to life is exceedingly unusual. However, it is undesirable in
principle to place any limitation on TCP connection lifetimes.

We therefore require that an implementation of PAWS include a
mechanism to "invalidate" the TS.Recent value when a connection is
idle for more than 24 days. (An alternative solution to the problem
of outdated timestamps would be to send keep-alive segments at a very
low rate, but still more often than the wrap-around time for
timestamps, e.g., once a day.
This would impose negligible overhead.
However, the TCP specification has never included keep-alives, so the
solution based upon invalidation was chosen.)

Note that a TCP does not know the frequency, and therefore, the
wraparound time, of the other TCP, so it must assume the worst. The
validity of TS.Recent needs to be checked only if the basic PAWS
timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If
TS.Recent is found to be invalid, then the segment is accepted,
regardless of the failure of the timestamp check, and rule R3 updates
TS.Recent with the TSval from the new segment.

To detect how long the connection has been idle, the TCP MAY update a
clock or timestamp value associated with the connection whenever
TS.Recent is updated, for example. The details will be
implementation-dependent.

5.6. Header Prediction

"Header prediction" [Jacobson90a] is a high-performance transport
protocol implementation technique that is most important for
high-speed links. This technique optimizes the code for the most
common case, receiving a segment correctly and in order. Using
header prediction, the receiver asks the question, "Is this segment
the next in sequence?" This question can be answered in fewer
machine instructions than the question, "Is this segment within the
window?"

Adding header prediction to our timestamp procedure leads to the
following recommended sequence for processing an arriving TCP
segment:

H1) Check timestamp (same as step R1 above).

H2) Do header prediction: if segment is next in sequence and if
    there are no special conditions requiring additional processing,
    accept the segment, record its timestamp, and skip H3.

H3) Process the segment normally, as specified in RFC 793.
    This
    includes dropping segments that are outside the window and
    possibly sending acknowledgments, and queuing in-window,
    out-of-sequence segments.

Another possibility would be to interchange steps H1 and H2, i.e., to
perform the header prediction step H2 *first*, and perform H1 and H3
only when header prediction fails. This could be a performance
improvement, since the timestamp check in step H1 is very unlikely to
fail, and it requires unsigned modulo arithmetic. To perform this
check on every single segment is contrary to the philosophy of header
prediction. We believe that this change might produce a measurable
reduction in CPU time for TCP protocol processing on high-speed
networks.

However, putting H2 first would create a hazard: a segment from 2^32
bytes in the past might arrive at exactly the wrong time and be
accepted mistakenly by the header-prediction step. The following
reasoning has been introduced in [RFC1185] to show that the
probability of this failure is negligible.

   If all segments are equally likely to show up as old duplicates,
   then the probability of an old duplicate exactly matching the left
   window edge is the maximum segment size (MSS) divided by the size
   of the sequence space. This ratio must be less than 2^-16, since
   MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20
   for a 100 Mbit/s link. However, the older a segment is, the less
   likely it is to be retained in the Internet, and under any
   reasonable model of segment lifetime the probability of an old
   duplicate exactly at the left window edge must be much smaller
   than 2^-16.

   The 16 bit TCP checksum also allows a basic unreliability of one
   part in 2^16.
   A protocol mechanism whose reliability exceeds the
   reliability of the TCP checksum should be considered "good
   enough", i.e., it won't contribute significantly to the overall
   error rate. We therefore believe we can ignore the problem of an
   old duplicate being accepted by doing header prediction before
   checking the timestamp.

However, this probabilistic argument is not universally accepted, and
the consensus at present is that the performance gain does not
justify the hazard in the general case. It is therefore recommended
that H2 follow H1.

5.7. IP Fragmentation

At high data rates, the protection against old segments provided by
PAWS can be circumvented by errors in IP fragment reassembly (see
[RFC4963]). The only way to protect against incorrect IP fragment
reassembly is to not allow the segments to be fragmented. This is
done by setting the Don't Fragment (DF) bit in the IP header.
Setting the DF bit implies the use of Path MTU Discovery as described
in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation
that implements PAWS MUST also implement Path MTU Discovery.

5.8. Duplicates from Earlier Incarnations of Connection

The PAWS mechanism protects against errors due to sequence number
wrap-around on high-speed connections. Segments from an earlier
incarnation of the same connection are also a potential cause of old
duplicate errors. In both cases, the TCP mechanisms to prevent such
errors depend upon the enforcement of a maximum segment lifetime
(MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a
detailed discussion). Unlike the case of sequence space wrap-around,
the MSL required to prevent old duplicate errors from earlier
incarnations does not depend upon the transfer rate.
If the IP layer
enforces the recommended 2 minute MSL of TCP, and if the TCP rules
are followed, TCP connections will be safe from earlier incarnations,
no matter how high the network speed. Thus, the PAWS mechanism is
not required for this case.

We may still ask whether the PAWS mechanism can provide additional
security against old duplicates from earlier connections, allowing us
to relax the enforcement of MSL by the IP layer. Appendix B explores
this question, showing that further assumptions and/or mechanisms are
required, beyond those of PAWS. This is not part of the current
extension.

6. Conclusions and Acknowledgements

This memo presented a set of extensions to TCP to provide efficient
operation over large bandwidth * delay product paths and reliable
operation over very high-speed paths. These extensions are designed
to provide compatible interworking with TCP stacks that do not
implement the extensions.

These mechanisms are implemented using TCP options for scaled windows
and timestamps. The timestamps are used for two distinct mechanisms:
RTTM (Round Trip Time Measurement) and PAWS (Protection Against
Wrapped Sequences).

The Window Scale option was originally suggested by Mike St. Johns of
USAF/DCA. The present form of the option was suggested by Mike
Karels of UC Berkeley in response to a more cumbersome scheme defined
by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism
description in [RFC1185].

Finally, much of this work originated as the result of discussions
within the End-to-End Task Force on the theoretical limitations of
transport protocols in general and TCP in particular. Task force
members and others on the end2end-interest list have made valuable
contributions by pointing out flaws in the algorithms and the
documentation.
Continued discussion and development since the
publication of [RFC1323] originally occurred in the IETF TCP Large
Windows Working Group, later on in the End-to-End Task Force, and
most recently in the IETF TCP Maintenance Working Group. The authors
are grateful for all these contributions.

7. Security Considerations

The TCP sequence space is a fixed size, and as the window becomes
larger it becomes easier for an attacker to generate forged packets
that can fall within the TCP window, and be accepted as valid
segments. While use of timestamps and PAWS can help to mitigate
this, when using PAWS, if an attacker is able to forge a packet that
is acceptable to the TCP connection, a timestamp that is in the
future would cause valid segments to be dropped due to PAWS checks.
Hence, implementers should take care to not open the TCP window
drastically beyond the requirements of the connection.

A naive implementation that derives the timestamp clock value
directly from a system uptime clock may unintentionally leak this
information to an attacker. This does not directly compromise any of
the mechanisms described in this document. However, this may be
valuable information to a potential attacker. An implementer should
evaluate the potential impact and mitigate this accordingly (e.g., by
using a random offset for the timestamp clock on each connection, or
using an external, real-time derived timestamp clock source).

Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
[RFC2675] to be used when the local network supports packets larger
than 64 KiB. When larger TCP segments are used, the TCP checksum
becomes weaker.

Mechanisms to protect the TCP header from modification should also
protect the TCP options.
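The per-connection random offset mentioned above as a mitigation for
uptime leakage could be derived from a keyed hash over the connection
identifiers, in the same spirit as the ISN randomization of
[RFC6528]. The following is a sketch under stated assumptions, not a
prescribed construction: HMAC-SHA-256 is substituted here for whatever
function a given stack uses, and the names ts_offset and tsval as
well as the encoding of the 4-tuple are hypothetical.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def ts_offset(secret_key, local_ip, local_port, remote_ip, remote_port):
    """Derive a 32-bit per-connection timestamp offset from a secret
    key and the connection 4-tuple (assumed encoding: '|'-joined text)."""
    msg = f"{local_ip}|{local_port}|{remote_ip}|{remote_port}".encode()
    digest = hmac.new(secret_key, msg, hashlib.sha256).digest()
    # Use the first four bytes of the digest as the offset.
    return int.from_bytes(digest[:4], "big")


def tsval(clock_ticks, offset):
    """TSval placed on the wire: timestamp clock plus offset, modulo 2^32."""
    return (clock_ticks + offset) & MASK32
```

Because the offset is constant for the lifetime of a connection,
differences between successive TSvals are unchanged, so RTTM and the
PAWS monotonicity requirement are unaffected; the secret key should
differ from the one used for ISN generation.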
Middleboxes and TCP options:

   Some middleboxes have been known to remove the TCP options
   described in this document from TCP segments [Honda11].
   Middleboxes that remove TCP options described in this document
   from the <SYN> segment interfere with the selection of parameters
   appropriate for the session. Removing any of these options in a
   <SYN,ACK> segment will leave the end hosts in a state that
   destroys the proper operation of the protocol.

   *  If a Window Scale option is removed from a <SYN,ACK> segment,
      the end hosts will not negotiate the window scaling factor
      correctly. Middleboxes must not remove or modify the Window
      Scale option from <SYN,ACK> segments.

   *  If a stateful firewall uses the window field to detect whether
      a received segment is inside the current window, and does not
      support the Window Scale option, it will not be able to
      correctly determine whether or not a packet is in the window.
      These middleboxes must also support the Window Scale option
      and apply the scale factor when processing segments. If the
      window scale factor cannot be determined, it must not do window
      based processing.

   *  If the Timestamps option is removed from the <SYN> or <SYN,ACK>
      segment, high speed connections that need PAWS would not have
      that protection. Successful negotiation of the Timestamps
      option enforces a stricter verification of incoming segments at
      the receiver. If the Timestamps option was removed from a
      subsequent data segment after a successful negotiation (e.g. as
      part of re-segmentation), the segment is discarded by the
      receiver without further processing. Middleboxes should not
      remove the Timestamps option.
1245 * It must be noted that [RFC1323] doesn't address the case of the 1246 Timestamps option being dropped or selectively omitted after 1247 being negotiated, and that the update in this document may 1248 cause some broken middlebox behavior to be detected 1249 (potentially unresponsive TCP sessions). 1251 Implementations that depend on PAWS could provide a mechanism for the 1252 application to determine whether or not PAWS is in use on the 1253 connection, and chose to terminate the connection if that protection 1254 doesn't exist. This is not just to protect the connection against 1255 middleboxes that might remove the Timestamps option, but also against 1256 remote hosts that do not have Timestamp support. 1258 7.1. Privacy Considerations 1260 The TCP options described in this document do not expose individual 1261 users data. However, a naive implementation simply using the system 1262 clock as source for the Timestamps option will reveal characteristics 1263 of the TCP potentially allowing more targeted attacks. It is 1264 therefore RECOMMENDED to generate a random, per-connection offset to 1265 be used with the clock source when generating the Timestamps option 1266 value (see Section 5.4). 1268 Furthermore, the combination, relative ordering and padding of the 1269 TCP options described in Section 2.2 and Section 3.2 will reveal 1270 additional clues to allow the fingerprinting of the system. 1272 8. IANA Considerations 1274 This document has no actions for IANA. The described TCP options are 1275 well known from the superceded [RFC1323]. 1277 9. References 1279 9.1. Normative References 1281 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1282 RFC 793, September 1981. 1284 [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 1285 November 1990. 1287 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1288 Requirement Levels", BCP 14, RFC 2119, March 1997. 1290 9.2. 
Informative References

1292 [Allman99]
1293 Allman, M. and V. Paxson, "On Estimating End-to-End
1294 Network Path Properties", Proc. ACM SIGCOMM Technical
1295 Symposium, Cambridge, MA, September 1999.

1298 [Ekstroem04]
1299 Ekstroem, H. and R. Ludwig, "The Peak-Hopper: A New End-
1300 to-End Retransmission Timer for Reliable Unicast
1301 Transport", INFOCOM 2004 IEEE, March 2004.

1305 [Floyd05] Floyd, S., "[tcpm] How the RTO should be estimated with
1306 timestamps", Message from 26.Jan.2007 to the tcpm mailing
1307 list, August 2005.

1310 [Garlick77]
1311 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
1312 Host-to-Host Protocols", Proc. Second Berkeley Workshop on
1313 Distributed Data Management and Computer Networks,
1314 May 1977.

1316 [Hamming77]
1317 Hamming, R., "Digital Filters", Prentice Hall, Englewood
1318 Cliffs, N.J. ISBN 0-13-212571-4, 1977.

1320 [Honda11] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
1321 Handley, M., and H. Tokuda, "Is it still possible to
1322 extend TCP?", Proc. of ACM Internet Measurement
1323 Conference (IMC) '11, November 2011.

1325 [Jacobson88a]
1326 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
1327 '88, Stanford, CA., August 1988.

1330 [Jacobson90a]
1331 Jacobson, V., "4BSD Header Prediction", ACM Computer
1332 Communication Review, April 1990.

1334 [Jacobson90c]
1335 Jacobson, V., "Modified TCP congestion avoidance
1336 algorithm", Message to the end2end-interest mailing list,
1337 April 1990.

1340 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
1341 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
1342 Comm., Scottsdale, Arizona, March 1986.

1345 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in
1346 Reliable Transport Protocols", Proc. SIGCOMM '87,
1347 August 1987.

1349 [Kuehlewind10]
1350 Kuehlewind, M. and B. Briscoe, "Chirping for Congestion
1351 Control - Implementation Feasibility", November 2010.
1354 [Kuzmanovic03]
1355 Kuzmanovic, A. and E. Knightly, "TCP-LP: Low-Priority
1356 Service via End-Point Congestion Control", 2003.

1359 [Ludwig00]
1360 Ludwig, R. and K. Sklower, "The Eifel Retransmission
1361 Timer", ACM SIGCOMM Computer Communication Review Volume
1362 30 Issue 3, July 2000.

1365 [Martin03]
1366 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
1367 mailing list, September 2003.

1370 [Mathis08]
1371 Mathis, M., "[tcpm] Example of 1323 window retraction
1372 problem", Message to the tcpm mailing list, March 2008.

1376 [Medina04]
1377 Medina, A., Allman, M., and S. Floyd, "Measuring
1378 Interactions Between Transport Protocols and Middleboxes",
1379 Proc. ACM SIGCOMM/USENIX Internet Measurement Conference.
1380 October 2004, August 2004.

1383 [Medina05]
1384 Medina, A., Allman, M., and S. Floyd, "Measuring the
1385 Evolution of Transport Protocols in the Internet", ACM
1386 Computer Communication Review 35(2), April 2005.

1389 [Oppermann13]
1390 Oppermann, A., "[tcpm] Explanation to the relaxation of
1391 TSopt acceptance rules", Message to the tcpm mailing list,
1392 Jun 2013.

1395 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks",
1396 RFC 896, January 1984.

1398 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay
1399 paths", RFC 1072, October 1988.

1401 [RFC1110] McKenzie, A., "Problem with the TCP big window option",
1402 RFC 1110, August 1989.

1404 [RFC1122] Braden, R., "Requirements for Internet Hosts -
1405 Communication Layers", STD 3, RFC 1122, October 1989.

1407 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
1408 High-Speed Paths", RFC 1185, October 1990.

1410 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
1411 for High Performance", RFC 1323, May 1992.

1413 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
1414 for IP version 6", RFC 1981, August 1996.
1416 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1417 Selective Acknowledgment Options", RFC 2018, October 1996. 1419 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1420 Control", RFC 2581, April 1999. 1422 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1423 RFC 2675, August 1999. 1425 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1426 Extension to the Selective Acknowledgement (SACK) Option 1427 for TCP", RFC 2883, July 2000. 1429 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm 1430 for TCP", RFC 3522, April 2003. 1432 [RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm 1433 for TCP", RFC 4015, February 2005. 1435 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1436 Discovery", RFC 4821, March 2007. 1438 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1439 Errors at High Data Rates", RFC 4963, July 2007. 1441 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1442 Control", RFC 5681, September 2009. 1444 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 1445 "Computing TCP's Retransmission Timer", RFC 6298, 1446 June 2011. 1448 [RFC6528] Gont, F. and S. Bellovin, "Defending against Sequence 1449 Number Attacks", RFC 6528, February 2012. 1451 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1452 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1453 Based on Selective Acknowledgment (SACK) for TCP", 1454 RFC 6675, August 2012. 1456 [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 1457 RFC 6691, July 2012. 1459 [RFC6817] Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 1460 "Low Extra Delay Background Transport (LEDBAT)", RFC 6817, 1461 December 2012. 1463 [Watson81] 1464 Watson, R., "Timer-based Mechanisms in Reliable Transport 1465 Protocol Connection Management", Computer Networks, Vol. 1466 5, 1981. 
1468 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
1469 '86, Stowe, VT, August 1986.

1471 Appendix A. Implementation Suggestions

1473 TCP Option Layout

1475 The following layout is recommended for sending options on non-<SYN>
1476 segments, to achieve maximum feasible alignment of 32-bit
1477 and 64-bit machines.

1479 +--------+--------+--------+--------+
1480 |  NOP   |  NOP   | TSopt  |   10   |
1481 +--------+--------+--------+--------+
1482 |          TSval timestamp          |
1483 +--------+--------+--------+--------+
1484 |          TSecr timestamp          |
1485 +--------+--------+--------+--------+

1487 Interaction with the TCP Urgent Pointer

1489 The TCP Urgent Pointer, like the TCP window, is a 16-bit value.
1490 Some of the original discussion for the TCP Window Scale option
1491 included proposals to increase the Urgent Pointer to 32 bits. As
1492 it turns out, this is unnecessary. There are two observations
1493 that should be made:

1495 (1) With IP Version 4, the largest amount of TCP data that can be
1496 sent in a single packet is 65495 bytes (64 KiB - 1 - size of
1497 fixed IP and TCP headers).

1499 (2) Updates to the Urgent Pointer while the user is in "urgent
1500 mode" are invisible to the user.

1502 This means that if the Urgent Pointer points beyond the end of the
1503 TCP data in the current segment, then the user will remain in
1504 urgent mode until the next TCP segment arrives. That segment will
1505 update the Urgent Pointer to a new offset, and the user will never
1506 have left urgent mode.

1508 Thus, to properly implement the Urgent Pointer, the sending TCP
1509 only has to check for overflow of the 16-bit Urgent Pointer field
1510 before filling it in. If it does overflow, then a value of 65535
1511 should be inserted into the Urgent Pointer.

1513 The same technique applies to IP Version 6, except in the case of
1514 IPv6 Jumbograms.
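The overflow check for the Urgent Pointer described above can be sketched as follows. This is a minimal illustration rather than part of the specification: the function name urgent_pointer_field and its urgent_offset parameter are hypothetical, and the offset is assumed to be a non-negative byte count relative to the sequence number of the segment being sent.

```python
URG_PTR_MAX = 0xFFFF  # largest value the 16-bit Urgent Pointer field can hold (65535)


def urgent_pointer_field(urgent_offset: int) -> int:
    """Return the value to place in the 16-bit Urgent Pointer field.

    urgent_offset (hypothetical name) is the distance in bytes from the
    segment's sequence number to the urgent data. With a scaled window it
    may not fit in 16 bits, in which case it is clamped to 65535.
    """
    if urgent_offset < 0:
        raise ValueError("urgent offset must be non-negative")
    # Clamp on overflow: the receiver stays in urgent mode until a later
    # segment carries an Urgent Pointer that fits in the field.
    return min(urgent_offset, URG_PTR_MAX)
```

An offset that fits is sent unchanged, while any larger offset is sent as 65535, so the receiver remains in urgent mode until a later segment narrows the pointer.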
When IPv6 Jumbograms are supported, [RFC2675]
1515 requires additional steps for dealing with the Urgent Pointer;
1516 these are described in Section 5.2 of [RFC2675].

1518 Appendix B. Duplicates from Earlier Connection Incarnations

1520 There are two cases to be considered: (1) a system crashing (and
1521 losing connection state) and restarting, and (2) the same connection
1522 being closed and reopened without a loss of host state. These will
1523 be described in the following two sections.

1525 B.1. System Crash with Loss of State

1527 TCP's quiet time of one MSL upon system startup handles the loss of
1528 connection state in a system crash/restart. For an explanation, see
1529 for example "When to Keep Quiet" in the TCP protocol specification
1530 [RFC0793]. The MSL that is required here does not depend upon the
1531 transfer speed. The current TCP MSL of 2 minutes seemed acceptable
1532 as an operational compromise, when many host systems used to take
1533 this long to boot after a crash. Current host systems can boot
1534 considerably faster.

1536 The Timestamps option may be used to ease the MSL requirements (or to
1537 provide additional security against data corruption). If timestamps
1538 are being used and if the timestamp clock can be guaranteed to be
1539 monotonic over a system crash/restart, i.e., if the first value of
1540 the sender's timestamp clock after a crash/restart can be guaranteed
1541 to be greater than the last value before the restart, then a quiet
1542 time is unnecessary.

1544 To dispense totally with the quiet time would require that the host
1545 clock be synchronized to a time source that is stable over the crash/
1546 restart period, with an accuracy of one timestamp clock tick or
1547 better. We can back off from this strict requirement to take
1548 advantage of approximate clock synchronization.
Suppose that the
1549 clock is always re-synchronized to within N timestamp clock ticks and
1550 that booting (extended with a quiet time, if necessary) takes more
1551 than N ticks. This will guarantee monotonicity of the timestamps,
1552 which can then be used to reject old duplicates even without an
1553 enforced MSL.

1555 B.2. Closing and Reopening a Connection

1557 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
1558 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
1559 Applications built upon TCP that close one connection and open a new
1560 one (e.g., an FTP data transfer connection using Stream mode) must
1561 choose a new socket pair each time. The TIME-WAIT delay serves two
1562 different purposes:

1564 (a) Implement the full-duplex reliable close handshake of TCP.

1566 The proper time to delay the final close step is not really
1567 related to the MSL; it depends instead upon the RTO for the FIN
1568 segments and therefore upon the RTT of the path. (It could be
1569 argued that the side that is sending a FIN knows what degree of
1570 reliability it needs, and therefore it should be able to
1571 determine the length of the TIME-WAIT delay for the FIN's
1572 recipient. This could be accomplished with an appropriate TCP
1573 option in FIN segments.)

1575 Although there is no formal upper bound on RTT, common network
1576 engineering practice makes an RTT greater than 1 minute very
1577 unlikely. Thus, the 4 minute delay in TIME-WAIT state works
1578 satisfactorily to provide a reliable full-duplex TCP close.
1579 Note again that this is independent of MSL enforcement and
1580 network speed.

1582 The TIME-WAIT state could cause an indirect performance problem
1583 if an application needed to repeatedly close one connection and
1584 open another at a very high frequency, since the number of
1585 available TCP ports on a host is less than 2^16.
However, high 1586 network speeds are not the major contributor to this problem; 1587 the RTT is the limiting factor in how quickly connections can be 1588 opened and closed. Therefore, this problem will be no worse at 1589 high transfer speeds. 1591 (b) Allow old duplicate segments to expire. 1593 To replace this function of TIME-WAIT state, a mechanism would 1594 have to operate across connections. PAWS is defined strictly 1595 within a single connection; the last timestamp (TS.Recent) is 1596 kept in the connection control block, and discarded when a 1597 connection is closed. 1599 An additional mechanism could be added to the TCP, a per-host 1600 cache of the last timestamp received from any connection. This 1601 value could then be used in the PAWS mechanism to reject old 1602 duplicate segments from earlier incarnations of the connection, 1603 if the timestamp clock can be guaranteed to have ticked at least 1604 once since the old connection was open. This would require that 1605 the TIME-WAIT delay plus the RTT together must be at least one 1606 tick of the sender's timestamp clock. Such an extension is not 1607 part of the proposal of this RFC. 1609 Note that this is a variant on the mechanism proposed by 1610 Garlick, Rom, and Postel [Garlick77], which required each host 1611 to maintain connection records containing the highest sequence 1612 numbers on every connection. Using timestamps instead, it is 1613 only necessary to keep one quantity per remote host, regardless 1614 of the number of simultaneous connections to that host. 1616 Appendix C. Summary of Notation 1618 The following notation has been used in this document. 
1620 Options

1622 WSopt:            TCP Window Scale option
1623 TSopt:            TCP Timestamps option

1625 Option Fields

1627 shift.cnt:        Window scale byte in WSopt
1628 TSval:            32-bit Timestamp Value field in TSopt
1629 TSecr:            32-bit Timestamp Reply field in TSopt

1631 Option Fields in Current Segment

1633 SEG.TSval:        TSval field from TSopt in current segment
1634 SEG.TSecr:        TSecr field from TSopt in current segment
1635 SEG.WSopt:        8-bit value in WSopt

1637 Clock Values

1639 my.TSclock:       System-wide source of 32-bit timestamp values
1640 my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
1641 Snd.TSoffset:     An offset for randomizing Snd.TSclock
1642 Snd.TSclock:      my.TSclock + Snd.TSoffset

1644 Per-Connection State Variables

1646 TS.Recent:        Latest received Timestamp
1647 Last.ACK.sent:    Last ACK field sent
1648 Snd.TS.OK:        1-bit flag
1649 Snd.WS.OK:        1-bit flag
1650 Rcv.Wind.Shift:   Receive window scale exponent
1651 Snd.Wind.Shift:   Send window scale exponent
1652 Start.Time:       Snd.TSclock value when segment being timed was
1653                   sent (used by pre-1323 code).

1655 Procedure

1657 Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
1658                   variance estimates, using the rules of
1659                   [Jacobson88a], given m, a new RTT measurement

1661 Appendix D. Event Processing Summary

1663 OPEN Call

1665 ...

1667 An initial send sequence number (ISS) is selected. Send a <SYN>
1668 segment of the form:

1670    <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Shift>

1672 ...

1674 SEND Call

1676 CLOSED STATE (i.e., TCB does not exist)

1678 ...

1680 LISTEN STATE

1682 If the foreign socket is specified, then change the connection
1683 from passive to active, select an ISS. Send a <SYN> segment
1684 containing the options: <TSval=Snd.TSclock> and
1685 <WSopt=Rcv.Wind.Shift>. Set SND.UNA to ISS, SND.NXT to ISS+1.
1686 Enter SYN-SENT state. ...

1688 SYN-SENT STATE
1689 SYN-RECEIVED STATE

1691 ...

1693 ESTABLISHED STATE
1694 CLOSE-WAIT STATE

1696 Segmentize the buffer and send it with a piggybacked
1697 acknowledgment (acknowledgment value = RCV.NXT). ...

1699 If the urgent flag is set ...
1701 If the Snd.TS.OK flag is set, then include the TCP Timestamps
1702 option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
1703 segment.

1705 Scale the receive window for transmission in the segment
1706 header:

1708    SEG.WND = (RCV.WND >> Rcv.Wind.Shift).

1710 SEGMENT ARRIVES

1712 ...

1714 If the state is LISTEN then

1716 first check for an RST

1718 ...

1720 second check for an ACK

1722 ...

1724 third check for a SYN

1726 if the SYN bit is set, check the security. If the ...

1728 ...

1730 if the SEG.PRC is less than the TCB.PRC then continue.

1732 Check for a Window Scale option (WSopt); if one is found,
1733 save SEG.WSopt in Snd.Wind.Shift and set Snd.WS.OK flag on.
1734 Otherwise, set both Snd.Wind.Shift and Rcv.Wind.Shift to
1735 zero and clear Snd.WS.OK flag.

1737 Check for a TSopt option; if one is found, save SEG.TSval in
1738 the variable TS.Recent and turn on the Snd.TS.OK bit.

1740 Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
1741 other control or text should be queued for processing later.
1742 ISS should be selected and a segment sent of the form:

1744    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1746 If the Snd.WS.OK bit is on, include a WSopt option
1747 <WSopt=Rcv.Wind.Shift> in this segment. If the Snd.TS.OK
1748 bit is on, include a TSopt <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. Last.ACK.sent is set to
1750 RCV.NXT.

1752 SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection
1753 state should be changed to SYN-RECEIVED. Note that any
1754 other incoming control or data (combined with SYN) will be
1755 processed in the SYN-RECEIVED state, but processing of SYN
1756 and ACK should not be repeated. If the listen was not fully
1757 specified (i.e., the foreign socket was not fully
1758 specified), then the unspecified fields should be filled in
1759 now.

1761 fourth other text or control

1763 ...

1765 If the state is SYN-SENT then

1767 first check the ACK bit

1769 ...

1771 ...

1773 fourth check the SYN bit

1775 ...
1777 If the SYN bit is on and the security/compartment and
1778 precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
1779 IRS is set to SEG.SEQ, and any acknowledgements on the
1780 retransmission queue which are thereby acknowledged should
1781 be removed.

1783 Check for a Window Scale option (WSopt); if it is found,
1784 save SEG.WSopt in Snd.Wind.Shift; otherwise, set both
1785 Snd.Wind.Shift and Rcv.Wind.Shift to zero.

1787 Check for a TSopt option; if one is found, save SEG.TSval in
1788 variable TS.Recent and turn on the Snd.TS.OK bit in the
1789 connection control block. If the ACK bit is set, use
1790 Snd.TSclock - SEG.TSecr as the initial RTT estimate.

1792 If SND.UNA > ISS (our <SYN> has been ACKed), change the
1793 connection state to ESTABLISHED, form an <ACK> segment:

1795    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1797 and send it. If the Snd.TS.OK bit is on, include a TSopt
1798 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
1799 segment. Last.ACK.sent is set to RCV.NXT.

1801 Data or controls which were queued for transmission may be
1802 included. If there are other controls or text in the
1803 segment then continue processing at the sixth step below
1804 where the URG bit is checked, otherwise return.

1806 Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment:

1808    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1810 and send it. If the Snd.TS.OK bit is on, include a TSopt
1811 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
1812 If the Snd.WS.OK bit is on, include a WSopt option
1813 <WSopt=Rcv.Wind.Shift> in this segment. Last.ACK.sent is
1814 set to RCV.NXT.

1816 If there are other controls or text in the segment, queue
1817 them for processing after the ESTABLISHED state has been
1818 reached, return.

1820 fifth, if neither of the SYN or RST bits is set then drop the
1821 segment and return.

1823 Otherwise,

1825 First, check sequence number

1827 SYN-RECEIVED STATE
1828 ESTABLISHED STATE
1829 FIN-WAIT-1 STATE
1830 FIN-WAIT-2 STATE
1831 CLOSE-WAIT STATE
1832 CLOSING STATE
1833 LAST-ACK STATE
1834 TIME-WAIT STATE

1836 Segments are processed in sequence.
Initial tests on
1837 arrival are used to discard old duplicates, but further
1838 processing is done in SEG.SEQ order. If a segment's
1839 contents straddle the boundary between old and new, only the
1840 new parts should be processed.

1842 Rescale the received window field:

1844    TrueWindow = SEG.WND << Snd.Wind.Shift,

1846 and use "TrueWindow" in place of SEG.WND in the following
1847 steps.

1849 Check whether the segment contains a Timestamps option and
1850 bit Snd.TS.OK is on. If so:

1852 If SEG.TSval < TS.Recent and the RST bit is off, then
1853 test whether connection has been idle less than 24 days;
1854 if all are true, then the segment is not acceptable;
1855 follow steps below for an unacceptable segment.

1857 If SEG.SEQ is less than or equal to Last.ACK.sent, then
1858 save SEG.TSval in variable TS.Recent.

1860 There are four cases for the acceptability test for an
1861 incoming segment:

1863 ...

1865 If an incoming segment is not acceptable, an acknowledgment
1866 should be sent in reply (unless the RST bit is set, if so
1867 drop the segment and return):

1869    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1871 Last.ACK.sent is set to SEG.ACK of the acknowledgment. If
1872 the Snd.TS.OK bit is on, include the Timestamps option
1873 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
1874 Set Last.ACK.sent to SEG.ACK and send the <ACK> segment.
1875 After sending the acknowledgment, drop the unacceptable
1876 segment and return.

1878 ...

1880 fifth check the ACK field.

1882 if the ACK bit is off drop the segment and return.

1884 if the ACK bit is on

1886 ...

1888 ESTABLISHED STATE

1890 If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
1891 SEG.ACK. Also compute a new estimate of round-trip time.
1892 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
1893 otherwise use the elapsed time since the first segment in
1894 the retransmission queue was sent. Any segments on the
1895 retransmission queue which are thereby entirely
1896 acknowledged...

1898 ...

1900 Seventh, process the segment text.
1902 ESTABLISHED STATE
1903 FIN-WAIT-1 STATE
1904 FIN-WAIT-2 STATE
1905 ...

1907 Send an acknowledgment of the form:

1909    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1911 If the Snd.TS.OK bit is on, include the Timestamps option
1912 <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
1913 Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
1914 it. This acknowledgment should be piggy-backed on a segment
1915 being transmitted if possible without incurring undue delay.

1917 ...

1919 Appendix E. Timestamps Edge Cases

1921 While the rules laid out for when to calculate RTTM produce the
1922 correct results most of the time, there are some edge cases where an
1923 incorrect RTTM can be calculated. All of these situations involve
1924 the loss of segments. It is felt that these scenarios are rare, and
1925 that if they should happen, they will cause a single RTTM measurement
1926 to be inflated, which mitigates its effects on RTO calculations.

1928 [Martin03] cites two similar cases when the returning <ACK> is lost,
1929 and before the retransmission timer fires, another returning
1930 segment arrives, which acknowledges the data. In this case, the RTTM
1931 calculated will be inflated:

1933    clock
1934    tc=1   ------------------->

1936    tc=2   (lost) <----
1937           (RTTM would have been 1)

1939           (receive window opens, window update is sent)
1940    tc=5   <----
1941           (RTTM is calculated at 4)

1943 One thing to note about this situation is that it is somewhat bounded
1944 by RTO + RTT, limiting how far off the RTTM calculation will be.
1945 While more complex scenarios can be constructed that produce larger
1946 inflations (e.g., retransmissions are lost), those scenarios involve
1947 multiple segment losses, and the connection will have other more
1948 serious operational problems than using an inflated RTTM in the RTO
1949 calculation.

1951 Appendix F.
Window Retraction Example

1953 Consider an established TCP connection using a scale factor of 128,
1954 Snd.Wind.Shift=7 and Rcv.Wind.Shift=7, that is running with a very
1955 small window because the receiver is bottlenecked and both ends are
1956 doing small reads and writes.

1958 Consider the ACKs coming back:

1960    SEG.ACK  SEG.WIN  computed SND.WIN  receiver's actual window
1961    1000     2        1256              1300

1963 The sender writes 40 bytes and receiver ACKs:

1965    1040     2        1296              1300

1967 The sender writes 5 additional bytes and the receiver has a problem.
1968 Two choices:

1970    1045     2        1301              1300 - BEYOND BUFFER

1972    1045     1        1173              1300 - RETRACTED WINDOW

1974 This is a general problem and can happen any time the sender does a
1975 write which is smaller than the window scale factor.

1977 In most stacks it is at least partially obscured when the window size
1978 is larger than some small number of segments because the stacks
1979 prefer to announce windows that are an integral number of segments,
1980 rounded up to the next scale factor. This plus silly window
1981 suppression tends to cause less frequent, larger window updates. If
1982 the window were rounded down to a segment size, there is more
1983 opportunity to advance the window, the BEYOND BUFFER case above,
1984 rather than retracting it.

1986 Appendix G. RTO calculation modification

1988 Taking multiple RTT samples per window would shorten the history
1989 calculated by the RTO mechanism in [RFC6298], and the below algorithm
1990 aims to maintain a similar history as originally intended by
1991 [RFC6298].

1993 It is roughly known how many samples a congestion window worth of
1994 data will yield, not accounting for ACK compression, and ACK losses.
1995 Such events will result in more history of the path being reflected
1996 in the final value for RTO, and are uncritical.
This modification
1997 will ensure that a similar amount of time is taken into account for
1998 the RTO estimation, regardless of how many samples are taken per
1999 window:

2001    ExpectedSamples = ceiling(FlightSize / (SMSS * 2))

2003    alpha' = alpha / ExpectedSamples

2005    beta' = beta / ExpectedSamples

2007 Note that the factor 2 in ExpectedSamples is due to "Delayed ACKs".

2009 Instead of using alpha and beta in the algorithm of [RFC6298], use
2010 alpha' and beta':

2012    RTTVAR <- (1 - beta') * RTTVAR + beta' * |SRTT - R'|

2014    SRTT <- (1 - alpha') * SRTT + alpha' * R'

2016    (for each sample R')

2018 Appendix H. Changes from RFC 1323

2020 Several important updates and clarifications to the specification in
2021 RFC 1323 are made in this document. The technical changes are
2022 summarized below:

2024 (a) A wrong reference to SND.WND was corrected to SEG.WND in
2025 Section 2.3.

2027 (b) Section 2.4 was added describing the unavoidable window
2028 retraction issue, and explicitly describing the mitigation steps
2029 necessary.

2031 (c) In Section 3.2 the wording of how the Timestamps option negotiation
2032 is to be performed was updated with RFC2119 wording. Further, a
2033 number of paragraphs were added to clarify the expected behavior
2034 with a compliant implementation using TSopt, as RFC1323 left
2035 room for interpretation - e.g. potential late enablement of
2036 TSopt.

2038 (d) The description of which TSecr values can be used to update the
2039 measured RTT has been clarified. Specifically, with timestamps,
2040 the Karn algorithm [Karn87] is disabled. The Karn algorithm
2041 disables all RTT measurements during retransmission, since it is
2042 ambiguous whether the <ACK> is for the original segment, or the
2043 retransmitted segment. With timestamps, that ambiguity is
2044 removed since the TSecr in the <ACK> will contain the TSval from
2045 whichever data segment made it to the destination.
2047 (e) RTTM update processing explicitly excludes segments not updating
2048 SND.UNA. The original text could be interpreted to allow taking
2049 RTT samples when SACK acknowledges some new, non-contiguous
2050 data.

2052 (f) In RFC1323, Section 3.4, step (2) of the algorithm to control
2053 which timestamp is echoed was incorrect in two regards:

2055 (1) It failed to update TS.Recent for a retransmitted segment
2056 that resulted from a lost <ACK>.

2058 (2) It failed if SEG.LEN = 0.

2060 In the new algorithm, the case of SEG.TSval >= TS.Recent is
2061 included for consistency with the PAWS test.

2063 (g) It is now recommended that the Timestamps option is included in
2064 <RST> segments if the incoming segment contained a Timestamps
2065 option.

2067 (h) <RST> segments are explicitly excluded from PAWS processing.

2069 (i) Added text to clarify the precedence between regular TCP
2070 [RFC0793] and this document's Timestamps option / PAWS processing.
2071 Discussion about combined acceptability checks is ongoing.

2073 (j) Snd.TSoffset and Snd.TSclock variables have been added.
2074 Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
2075 allows the starting points for timestamp values to be randomized
2076 on a per-connection basis. Setting Snd.TSoffset to zero yields
2077 the same results as [RFC1323]. Text was added to guide
2078 implementors to the proper selection of these offsets, as
2079 entirely random offsets for each new connection will conflict
2080 with PAWS.

2082 (k) Appendix A has been expanded with information about the TCP
2083 Urgent Pointer. An earlier revision contained text around the
2084 TCP MSS option, which was split off into [RFC6691].

2086 (l) One correction was made to the Event Processing Summary in
2087 Appendix D. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
2088 fill in the SEG.WND value, not SND.WND.
2090 (m) Appendix G was added to exemplify how an RTO calculation might
2091 be updated to properly take the much higher RTT sampling
2092 frequency enabled by the Timestamps option into account.

2094 Editorial changes to the document that don't impact the
2095 implementation or function of the mechanisms described in this
2096 document include:

2098 (a) Removed much of the discussion in Section 1 to streamline the
2099 document. However, detailed examples and discussions in
2100 Section 2, Section 3 and Section 5 are kept as guidelines for
2101 implementers.

2103 (b) Added short text that the use of WS increases the chances of
2104 sequence number wrap, thus the PAWS mechanism is required in
2105 certain environments.

2107 (c) Removed references to "new" options, as the options were
2108 introduced in [RFC1323] already. Changed the text in
2109 Section 1.3 to specifically address TS and WS options.

2111 (d) Section 1.4 was added for [RFC2119] wording. Normative text was
2112 updated with the appropriate phrases.

2114 (e) Added < > brackets to mark specific types of segments, and
2115 replaced most occurrences of "packet" with "segment", where TCP
2116 segments are referred to.

2118 (f) Updated the text in Section 3 to take into account what has been
2119 learned since [RFC1323].

2121 (g) Removed the list of changes between [RFC1323] and prior
2122 versions. These changes are mentioned in Appendix C of
2123 [RFC1323].

2125 (h) Moved Appendix Changes from RFC 1323 to the end of the
2126 appendices for easier lookup. In addition, the entries were
2127 split into a technical and an editorial part, and sorted to
2128 roughly correspond with the sections in the text where they
2129 apply.
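As a hedged illustration of the Appendix G adjustment referred to in item (m) of the technical changes, the following Python sketch scales the [RFC6298] gains by the expected number of RTT samples per window. The function names are hypothetical, and clamping ExpectedSamples to at least 1 (so that an empty flight does not cause a division by zero) is an assumption of this sketch, not something the document states. The values alpha = 1/8 and beta = 1/4 are those given in [RFC6298].

```python
import math

ALPHA = 1 / 8  # SRTT gain from RFC 6298
BETA = 1 / 4   # RTTVAR gain from RFC 6298


def expected_samples(flight_size: int, smss: int) -> int:
    """ExpectedSamples = ceiling(FlightSize / (SMSS * 2)).

    The factor 2 accounts for delayed ACKs. The clamp to 1 is an
    assumption of this sketch to keep the division below well defined.
    """
    return max(1, math.ceil(flight_size / (smss * 2)))


def update_rtt(srtt: float, rttvar: float, sample: float,
               flight_size: int, smss: int) -> tuple[float, float]:
    """Apply one RTT sample R' using the scaled gains alpha' and beta'."""
    n = expected_samples(flight_size, smss)
    alpha_p = ALPHA / n
    beta_p = BETA / n
    # RTTVAR is updated first, using the SRTT value from before this sample.
    rttvar = (1 - beta_p) * rttvar + beta_p * abs(srtt - sample)
    srtt = (1 - alpha_p) * srtt + alpha_p * sample
    return srtt, rttvar
```

With a flight of ten full-sized segments, ExpectedSamples is 5, so each individual sample moves SRTT and RTTVAR by one fifth of the usual gain, and a whole window of samples carries roughly the weight of a single sample in the unmodified algorithm.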
2131 Authors' Addresses 2133 David Borman 2134 Quantum Corporation 2135 Mendota Heights MN 55120 2136 USA 2138 Email: david.borman@quantum.com 2140 Bob Braden 2141 University of Southern California 2142 4676 Admiralty Way 2143 Marina del Rey CA 90292 2144 USA 2146 Email: braden@isi.edu 2148 Van Jacobson 2149 Google, Inc. 2150 1600 Amphitheatre Parkway 2151 Mountain View CA 94043 2152 USA 2154 Email: vanj@google.com 2156 Richard Scheffenegger (editor) 2157 NetApp, Inc. 2158 Am Euro Platz 2 2159 Vienna, 1120 2160 Austria 2162 Email: rs@netapp.com