idnits 2.17.1

draft-ietf-tcpm-1323bis-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The abstract seems to indicate that this document obsoletes RFC1323, but
     the header doesn't have an 'Obsoletes:' line to match this.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 21, 2013) is 4053 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC1110' is defined on line 1226, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2018' is defined on line 1241, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2581' is defined on line 1244, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 1250, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5681' is defined on line 1260, but no explicit
     reference was found in the text

  == Unused Reference: 'Watson81' is defined on line 1271, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1110
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185
     (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 6691
     (Obsoleted by RFC 9293)

     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    TCP Maintenance (TCPM)                                         D. Borman
3    Internet-Draft                                       Quantum Corporation
4    Intended status: Standards Track                               B. Braden
5    Expires: September 22, 2013                       University of Southern
6                                                                  California
7                                                                 V. Jacobson
8                                                               Packet Design
9                                                       R. Scheffenegger, Ed.
10                                                               NetApp, Inc.
11                                                             March 21, 2013

13                     TCP Extensions for High Performance
14                          draft-ietf-tcpm-1323bis-07

16   Abstract

18      This document specifies a set of TCP extensions to improve
19      performance over paths with a large bandwidth * delay product and to
20      provide reliable operation over very high-speed paths. It defines
21      TCP options for scaled windows and timestamps. The timestamps are
22      used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
23      and PAWS (Protection Against Wrapped Sequences).

25      This document updates and obsoletes RFC 1323.

27   Status of this Memo

29      This Internet-Draft is submitted in full conformance with the
30      provisions of BCP 78 and BCP 79.

32      Internet-Drafts are working documents of the Internet Engineering
33      Task Force (IETF). Note that other groups may also distribute
34      working documents as Internet-Drafts. The list of current Internet-
35      Drafts is at http://datatracker.ietf.org/drafts/current/.

37      Internet-Drafts are draft documents valid for a maximum of six months
38      and may be updated, replaced, or obsoleted by other documents at any
39      time. It is inappropriate to use Internet-Drafts as reference
40      material or to cite them other than as "work in progress."

42      This Internet-Draft will expire on September 22, 2013.

44   Copyright Notice

46      Copyright (c) 2013 IETF Trust and the persons identified as the
47      document authors. All rights reserved.

49      This document is subject to BCP 78 and the IETF Trust's Legal
50      Provisions Relating to IETF Documents
51      (http://trustee.ietf.org/license-info) in effect on the date of
52      publication of this document. Please review these documents
53      carefully, as they describe your rights and restrictions with respect
54      to this document.
        Code Components extracted from this document must
55      include Simplified BSD License text as described in Section 4.e of
56      the Trust Legal Provisions and are provided without warranty as
57      described in the Simplified BSD License.

59   Table of Contents

61      1.  Introduction
62        1.1.  TCP Performance
63        1.2.  TCP Reliability
64        1.3.  Using TCP options
65        1.4.  Terminology
66      2.  TCP Window Scale Option
67        2.1.  Introduction
68        2.2.  Window Scale Option
69        2.3.  Using the Window Scale Option
70        2.4.  Addressing Window Retraction
71      3.  RTTM -- Round-Trip Time Measurement
72        3.1.  Introduction
73        3.2.  TCP Timestamps Option
74        3.3.  The RTTM Mechanism
75        3.4.  Which Timestamp to Echo
76      4.  PAWS -- Protection Against Wrapped Sequence Numbers
77        4.1.  Introduction
78        4.2.  The PAWS Mechanism
79        4.3.  Basic PAWS Algorithm
80        4.4.  Timestamp Clock
81        4.5.  Outdated Timestamps
82        4.6.  Header Prediction
83        4.7.  IP Fragmentation
84        4.8.  Duplicates from Earlier Incarnations of Connection
85      5.  Conclusions and Acknowledgements
86      6.  Security Considerations
87      7.  IANA Considerations
88      8.  References
89        8.1.  Normative References
90        8.2.  Informative References
91      Appendix A.  Implementation Suggestions
92      Appendix B.  Duplicates from Earlier Connection Incarnations
93        B.1.  System Crash with Loss of State
94        B.2.  Closing and Reopening a Connection
95      Appendix C.  Summary of Notation
96      Appendix D.  Event Processing Summary
97      Appendix E.  Timestamps Edge Cases
98      Appendix F.  Changes from RFC 1323
99      Authors' Addresses

101  1.  Introduction

103     The TCP protocol [RFC0793] was designed to operate reliably over
104     almost any transmission medium regardless of transmission rate,
105     delay, corruption, duplication, or reordering of segments. Over the
106     years, advances in networking technology have resulted in ever-higher
107     transmission speeds, and the fastest paths are well beyond the domain
108     for which TCP was originally engineered.

110     This document defines a set of modest extensions to TCP to extend the
111     domain of its application to match the increasing network capability.
112     It is an update to and obsoletes [RFC1323], which in turn is based
113     upon and obsoletes [RFC1072] and [RFC1185].

115     For brevity, the full discussions of the merits and history behind
116     the TCP options defined within this document have been omitted.
117     [RFC1323] should be consulted for reference. It is recommended that
118     a modern TCP stack implement and make use of the extensions
119     described in this document.

121  1.1.  TCP Performance

123     TCP performance problems arise when the bandwidth * delay product is
124     large. A network having such paths is referred to as a "long, fat
125     network" (LFN).

127     There are three fundamental performance problems with basic TCP over
128     LFN paths:

130     (1)  Window Size Limit

132          The TCP header uses a 16-bit field to report the receive window
133          size to the sender. Therefore, the largest window that can be
134          used is 2^16 - 1 = 65,535 bytes.

136          To circumvent this problem, Section 2 of this memo defines a TCP
137          option, "Window Scale", to allow windows larger than 2^16. This
138          option defines an implicit scale factor, which is used to
139          multiply the window size value found in a TCP header to obtain
140          the true window size.

142     (2)  Recovery from Losses

144          Packet losses in an LFN can have a catastrophic effect on
145          throughput.

147          To generalize the Fast Retransmit/Fast Recovery mechanism to
148          handle multiple packets dropped per window, selective
149          acknowledgments are required. Unlike the normal cumulative
150          acknowledgments of TCP, selective acknowledgments give the
151          sender a complete picture of which segments are queued at the
152          receiver and which have not yet arrived.

154          Selective acknowledgments are specified in a separate document,
155          "A Conservative Selective Acknowledgment (SACK)-based Loss
156          Recovery Algorithm for TCP" [RFC6675], and not further discussed
157          in this document.

159     (3)  Round-Trip Measurement

161          TCP implements reliable data delivery by retransmitting segments
162          that are not acknowledged within some retransmission timeout
163          (RTO) interval. Accurate dynamic determination of an
164          appropriate RTO is essential to TCP performance. RTO is
165          determined by estimating the mean and variance of the measured
166          round-trip time (RTT), i.e., the time interval between sending a
167          segment and receiving an acknowledgment for it [Jacobson88a].

169          Section 3.2 defines a TCP option, "Timestamps", and then
170          specifies a mechanism using this option that allows nearly every
171          segment, including retransmissions, to be timed at negligible
172          computational cost. We use the mnemonic RTTM (Round Trip Time
173          Measurement) for this mechanism, to distinguish it from other
174          uses of the Timestamps option.

176  1.2.  TCP Reliability

178     An especially serious kind of error may result from an accidental
179     reuse of TCP sequence numbers in data segments. TCP reliability
180     depends upon the existence of a bound on the lifetime of a segment:
181     the "Maximum Segment Lifetime" or MSL.

183     Duplication of sequence numbers might happen in either of two ways:

185     (1)  Sequence number wrap-around on the current connection

187          A TCP sequence number contains 32 bits. At a high enough
188          transfer rate, the 32-bit sequence space may be "wrapped"
189          (cycled) within the time that a segment is delayed in queues.

191     (2)  Earlier incarnation of the connection

193          Suppose that a connection terminates, either by a proper close
194          sequence or due to a host crash, and the same connection (i.e.,
195          using the same pair of port numbers) is immediately reopened. A
196          delayed segment from the terminated connection could fall within
197          the current window for the new incarnation and be accepted as
198          valid.

200     Duplicates from earlier incarnations, case (2), are avoided by
201     enforcing the current fixed MSL of the TCP specification, as
202     explained in Section 4.8 and Appendix B. However, case (1), avoiding
203     the reuse of sequence numbers within the same connection, requires an
204     upper bound on MSL that depends upon the transfer rate, and at high
205     enough rates, a dedicated mechanism is required.

207     A possible fix for the problem of cycling the sequence space would be
208     to increase the size of the TCP sequence number field.
        For example,
209     the sequence number field (and also the acknowledgment field) could
210     be expanded to 64 bits. This could be done either by changing the
211     TCP header or by means of an additional option.

213     Section 4 presents a different mechanism, which we call PAWS
214     (Protection Against Wrapped Sequence numbers), to extend TCP
215     reliability to transfer rates well beyond the foreseeable upper limit
216     of network bandwidths. PAWS uses the TCP timestamp option defined in
217     Section 3.2 to protect against old duplicates from the same
218     connection.

220  1.3.  Using TCP options

222     The extensions defined in this document all use TCP options.

224     When [RFC1323] was published, there was concern that some buggy TCP
225     implementation might be crashed by the first appearance of an option
226     on a non-<SYN> segment. However, bugs like that can lead to DoS
227     attacks against a TCP, so it is now expected that most TCP
228     implementations will properly handle unknown options on non-<SYN>
229     segments. But it is still prudent to be conservative in what you
230     send, and avoiding buggy TCP implementations is not the only reason
231     for negotiating TCP options on <SYN> segments.

233     The window scale option negotiates fundamental parameters of the TCP
234     session. Therefore, it is only sent during the initial handshake.
235     Furthermore, the window scale option will be sent in a <SYN,ACK>
236     segment only if the corresponding option was received in the initial
237     <SYN> segment.

239     The timestamp option may appear in any data or <ACK> segment, adding
240     12 bytes to the 20-byte TCP header. We recognize there is a trade-
241     off between the bandwidth saved by reducing unnecessary
242     retransmission timeouts, and the extra header bandwidth used by this
243     option. It is required that this TCP option will be sent on non-<SYN>
244     segments only after an exchange of options on the <SYN>
245     segments has indicated that both sides understand this extension.

247     Appendix A contains a recommended layout of the options in TCP
248     headers to achieve reasonable data field alignment.

250     Finally, we observe that most of the mechanisms defined in this memo
251     are important for LFNs and/or very high-speed networks. For low-
252     speed networks, it might be a performance optimization to NOT use
253     these mechanisms. A TCP vendor concerned about optimal performance
254     over low-speed paths might consider turning these extensions off for
255     low-speed paths, or allow a user or installation manager to disable
256     them.

258  1.4.  Terminology

260     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
261     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
262     document are to be interpreted as described in [RFC2119].

264     In this document, these words will appear with that interpretation
265     only when in UPPER CASE. Lower case uses of these words are not to
266     be interpreted as carrying [RFC2119] significance.

268  2.  TCP Window Scale Option

270  2.1.  Introduction

272     The window scale extension expands the definition of the TCP window
273     to 32 bits and then uses a scale factor to carry this 32-bit value in
274     the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The
275     scale factor is carried in a TCP option, Window Scale. This option
276     is sent only in a <SYN> segment (a segment with the SYN bit on),
277     hence the window scale is fixed in each direction when a connection
278     is opened.

280     The maximum receive window, and therefore the scale factor, is
281     determined by the maximum receive buffer space. In a typical modern
282     implementation, this maximum buffer space is set by default but can
283     be overridden by a user program before a TCP connection is opened.
284     This determines the scale factor, and therefore no new user interface
285     is needed for window scaling.

287  2.2.  Window Scale Option

289     The three-byte Window Scale option MAY be sent in a <SYN> segment by
290     a TCP.
        It has two purposes: (1) indicate that the TCP is prepared to
291     do both send and receive window scaling, and (2) communicate a scale
292     factor to be applied to its receive window. Thus, a TCP that is
293     prepared to scale windows SHOULD send the option, even if its own
294     scale factor is 1. The scale factor is limited to a power of two and
295     encoded logarithmically, so it may be implemented by binary shift
296     operations.

298     TCP Window Scale Option (WSopt):

300     Kind: 3

302     Length: 3 bytes

304            +---------+---------+---------+
305            | Kind=3  |Length=3 |shift.cnt|
306            +---------+---------+---------+
307                 1         1         1

309     This option is an offer, not a promise; both sides MUST send Window
310     Scale options in their <SYN> segments to enable window scaling in
311     either direction. If window scaling is enabled, then the TCP that
312     sent this option will right-shift its true receive-window values by
313     'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt'
314     MAY be zero (offering to scale, while applying a scale factor of 1 to
315     the receive window).

317     This option MAY be sent in an initial <SYN> segment (i.e., a segment
318     with the SYN bit on and the ACK bit off). It MAY also be sent in a
319     <SYN,ACK> segment, but only if a Window Scale option was received in
320     the initial <SYN> segment. A Window Scale option in a segment
321     without a SYN bit SHOULD be ignored.

323     The window field in a segment where the SYN bit is set (i.e., a <SYN>
324     or <SYN,ACK>) is never scaled.

326  2.3.  Using the Window Scale Option

328     A model implementation of window scaling is as follows, using the
329     notation of [RFC0793]:

331     o  All windows are treated as 32-bit quantities for storage in the
332        connection control block and for local calculations. This
333        includes the send-window (SND.WND) and the receive-window
334        (RCV.WND) values, as well as the congestion window.

336     o  The connection state is augmented by two window shift counts,
337        Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
338        and outgoing window fields, respectively.

340     o  If a TCP receives a <SYN> segment containing a Window Scale
341        option, it sends its own Window Scale option in the <SYN,ACK>
342        segment.

344     o  The Window Scale option is sent with shift.cnt = R, where R is the
345        value that the TCP would like to use for its receive window.

347     o  Upon receiving a <SYN> segment with a Window Scale option
348        containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
349        Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
350        Rcv.Wind.Scale to zero.

352     o  The window field (SEG.WND) in the header of every incoming
353        segment, with the exception of <SYN> segments, is left-shifted by
354        Snd.Wind.Scale bits before updating SND.WND:

356           SND.WND = SEG.WND << Snd.Wind.Scale

358        (assuming the other conditions of [RFC0793] are met, and using the
359        "C" notation "<<" for left-shift).

361     o  The window field (SEG.WND) of every outgoing segment, with the
362        exception of <SYN> segments, is right-shifted by Rcv.Wind.Scale
363        bits:

365           SEG.WND = RCV.WND >> Rcv.Wind.Scale

367     TCP determines if a data segment is "old" or "new" by testing whether
368     its sequence number is within 2^31 bytes of the left edge of the
369     window, and if it is not, discarding the data as "old". To ensure
370     that new data is never mistakenly considered old and vice versa, the
371     left edge of the sender's window has to be at most 2^31 away from the
372     right edge of the receiver's window. The same holds for the sender's
373     right edge and the receiver's left edge.
        Since the right and left edges
374     of either the sender's or receiver's window differ by the window
375     size, and since the sender and receiver windows can be out of phase
376     by at most the window size, the above constraints imply that two
377     times the max window size must be less than 2^31, or

379        max window < 2^30

381     Since the max window is 2^S (where S is the scaling shift count)
382     times at most 2^16 - 1 (the maximum unscaled window), the maximum
383     window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
384     MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a
385     Window Scale option is received with a shift.cnt value exceeding 14,
386     the TCP SHOULD log the error but use 14 instead of the specified
387     value.

389     The scale factor applies only to the Window field as transmitted in
390     the TCP header; each TCP using extended windows will maintain the
391     window values locally as 32-bit numbers. For example, the
392     "congestion window" computed by Slow Start and Congestion Avoidance
393     is not affected by the scale factor, so window scaling will not
394     introduce quantization into the congestion window.

396  2.4.  Addressing Window Retraction

398     When a non-zero scale factor is in use, there are instances when a
399     retracted window can be offered [Mathis08]. The end of the window
400     will be on a boundary based on the granularity of the scale factor
401     being used. If the sequence number is then updated by a number of
402     bytes smaller than that granularity, the TCP will have to either
403     advertise a new window that is beyond what it previously advertised
404     (and perhaps beyond the buffer), or will have to advertise a smaller
405     window, which will cause the TCP window to shrink. Implementations
406     MUST ensure that they handle a shrinking window, as specified in
407     section 4.2.2.16 of [RFC1122].
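
The shift rules of Section 2.3 and the granularity effect described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (clamp_shift, to_wire, from_wire) and the example window of 100,000 bytes are invented for this sketch; only the shift operations and the limit of 14 come from the text.

```python
MAX_SHIFT = 14  # Section 2.3: the shift count is limited to 14


def clamp_shift(shift_cnt: int) -> int:
    """Use 14 in place of a received shift.cnt that exceeds 14."""
    return min(shift_cnt, MAX_SHIFT)


def to_wire(rcv_wnd: int, rcv_wind_scale: int) -> int:
    """SEG.WND of an outgoing non-SYN segment: RCV.WND >> Rcv.Wind.Scale.

    The result is capped at the largest value the 16-bit field can hold.
    """
    return min(rcv_wnd >> rcv_wind_scale, 0xFFFF)


def from_wire(seg_wnd: int, snd_wind_scale: int) -> int:
    """SND.WND from an incoming non-SYN segment: SEG.WND << Snd.Wind.Scale."""
    return seg_wnd << snd_wind_scale


# Hypothetical example: a 100,000-byte receive window with a shift of 7.
scale = clamp_shift(7)
advertised = to_wire(100_000, scale)   # 781 goes into the 16-bit field
seen_by_peer = from_wire(advertised, scale)  # 99,968 bytes
lost_to_rounding = 100_000 - seen_by_peer    # 32 bytes, below 2**7
```

The right shift discards the low-order bits, so the peer sees a window rounded down to a multiple of 2^shift.cnt. That rounding is the granularity that can force a TCP to choose between advertising past its earlier right edge and offering a retracted window.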

409     For the receiver, this implies that:

411     1)  The receiver MUST honor, as in-window, any segment that would
412         have been in-window for any <ACK> sent by the receiver.

414     2)  When window scaling is in effect, the receiver SHOULD track the
415         actual maximum window sequence number (which is likely to be
416         greater than the window announced by the most recent <ACK>, if
417         more than one segment has arrived since the application consumed
418         any data in the receive buffer).

420     On the sender side:

422     3)  The initial transmission MUST honor the window on the most
            recent <ACK>.

424     4)  On first retransmission, or if the sequence number is out-of-
425         window by less than (2^Rcv.Wind.Scale), then do normal
426         retransmission(s) without regard to the receiver window as long
427         as the original segment was in window when it was sent.

429     5)  On subsequent retransmissions, treat such <ACK>s as zero window
430         probes.

432  3.  RTTM -- Round-Trip Time Measurement

434  3.1.  Introduction

436     Accurate and current RTT estimates are necessary to adapt to changing
437     traffic conditions and to avoid an instability known as "congestion
438     collapse" [RFC0896] in a busy network. However, accurate measurement
439     of RTT may be difficult both in theory and in implementation.

441     Many TCP implementations base their RTT measurements upon a sample of
442     one segment per window or less. While this yields an adequate
443     approximation to the RTT for small windows, it results in an
444     unacceptably poor RTT estimate for a LFN. If we look at RTT
445     estimation as a signal processing problem (which it is), a data
446     signal at some frequency, the packet rate, is being sampled at a
447     lower frequency, the window rate. This lower sampling frequency
448     violates Nyquist's criterion and may therefore introduce "aliasing"
449     artifacts into the estimated RTT [Hamming77].

451     A good RTT estimator with a conservative retransmission timeout
452     calculation can tolerate aliasing when the sampling frequency is
453     "close" to the data frequency. For example, with a window of 8
454     segments, the sample rate is 1/8 the data frequency -- less than an
455     order of magnitude different. However, when the window is tens or
456     hundreds of segments, the RTT estimator may be seriously in error,
457     resulting in spurious retransmissions.

459     If there are dropped segments, the problem becomes worse. Zhang
460     [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
461     possible to accumulate reliable RTT estimates if retransmitted
462     segments are included in the estimate. Since a full window of data
463     will have been transmitted prior to a retransmission, all of the
464     segments in that window will have to be ACKed before the next RTT
465     sample can be taken. This means at least an additional window's
466     worth of time between RTT measurements and, as the error rate
467     approaches one per window of data (e.g., 10^-6 errors per bit for the
468     Wideband satellite network), it becomes effectively impossible to
469     obtain a valid RTT measurement.

471     A solution to these problems, which actually simplifies the sender
472     substantially, is as follows: using TCP options, the sender places a
473     timestamp in each data segment, and the receiver reflects these
474     timestamps back in <ACK> segments. Then a single subtract gives the
475     sender an accurate RTT measurement for every <ACK> segment (which
476     will correspond to every other data segment, with a sensible
477     receiver). We call this the RTTM (Round-Trip Time Measurement)
478     mechanism.

480     It is vitally important to use the RTTM mechanism with big windows;
481     otherwise, the door is opened to some dangerous instabilities due to
482     aliasing. Furthermore, the option is probably useful for all TCPs,
483     since it simplifies the sender.

485  3.2.  TCP Timestamps Option

487     TCP is a symmetric protocol, allowing data to be sent at any time in
488     either direction, and therefore timestamp echoing may occur in either
489     direction. For simplicity and symmetry, we specify that timestamps
490     always be sent and echoed in both directions. For efficiency, we
491     combine the timestamp and timestamp reply fields into a single TCP
492     Timestamps Option.

494     TCP Timestamps Option (TSopt):

496     Kind: 8

498     Length: 10 bytes

500        +-------+-------+---------------------+---------------------+
501        |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
502        +-------+-------+---------------------+---------------------+
503            1       1              4                     4

505     The Timestamps option carries two four-byte timestamp fields. The
506     Timestamp Value field (TSval) contains the current value of the
507     timestamp clock of the TCP sending the option.

509     The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
510     in the TCP header; if it is valid, it echoes a timestamp value that
511     was sent by the remote TCP in the TSval field of a Timestamp option.
512     When TSecr is not valid, its value MUST be zero. However, a value of
513     zero does not imply TSecr being invalid. The TSecr value will
514     generally be from the most recent Timestamps Option that was
515     received; however, there are exceptions that are explained below.

517     A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
518     segment. When used, the Timestamp option SHOULD be negotiated during
519     the initial <SYN> and <SYN,ACK> unless another mechanism allows it to
520     be enabled during an established session. However, such a mechanism
521     is outside the scope of this document. When TSopt has been sent or
522     received in a non-<SYN> segment, it MUST be sent in all segments.
523     Once a TSopt has been received in a non-<SYN> segment, then any
524     successive segment that is received without the RST bit and without a
525     TSopt MAY be dropped without further processing, and an <ACK> of the
526     current SND.UNA generated.
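
The ten-byte layout of the Timestamps option shown above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the helper names pack_tsopt and unpack_tsopt are hypothetical, and the fields are packed in network byte order, as is usual for TCP header fields.

```python
import struct

TSOPT_KIND = 8
TSOPT_LENGTH = 10
# "!" selects network byte order without padding: two 1-byte fields
# (Kind, Length) followed by two 4-byte fields (TSval, TSecr).
_TSOPT_FORMAT = "!BBII"


def pack_tsopt(tsval: int, tsecr: int) -> bytes:
    """Encode a Timestamps option; values wrap into 32 bits."""
    return struct.pack(
        _TSOPT_FORMAT,
        TSOPT_KIND,
        TSOPT_LENGTH,
        tsval & 0xFFFFFFFF,
        tsecr & 0xFFFFFFFF,
    )


def unpack_tsopt(option: bytes) -> tuple[int, int]:
    """Decode a Timestamps option and return (TSval, TSecr)."""
    if len(option) != TSOPT_LENGTH:
        raise ValueError("Timestamps option must be 10 bytes long")
    kind, length, tsval, tsecr = struct.unpack(_TSOPT_FORMAT, option)
    if kind != TSOPT_KIND or length != TSOPT_LENGTH:
        raise ValueError("not a Timestamps option")
    return tsval, tsecr


# Hypothetical values: a segment carrying TSval=1 and echoing TSecr=120.
option = pack_tsopt(1, 120)
```

A sender that has no valid value to echo (the ACK bit is not set) would pass 0 as tsecr, matching the rule that an invalid TSecr MUST be zero.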

528     In the case of crossing <SYN> segments where one <SYN> contains a
529     TSopt and the other doesn't, both sides SHOULD put a TSopt in the
530     <SYN,ACK> segment.

532  3.3.  The RTTM Mechanism

534     RTTM places a Timestamps option in every segment, with a TSval that
535     is obtained from a (virtual) "timestamp clock". Values of this clock
536     MUST be at least approximately proportional to real time, in order to
537     measure actual RTT.

539     These TSval values are echoed in TSecr values in the reverse
540     direction. The difference between a received TSecr value and the
541     current timestamp clock value provides a RTT measurement.

543     When timestamps are used, every segment that is received will contain
544     a TSecr value. However, these values cannot all be used to update
545     the measured RTT. The following example illustrates why. It shows a
546     one-way data flow with segments arriving in sequence without loss.
547     Here A, B, C... represent data blocks occupying successive blocks of
548     sequence numbers, and ACK(A),... represent the corresponding
549     cumulative acknowledgments. The two timestamp fields of the
550     Timestamps option are shown symbolically as <TSval=x,TSecr=y>. Each
551     TSecr field contains the value most recently received in a TSval
552     field.

554        TCP A                                       TCP B

556                  <A,TSval=1,TSecr=120> ----->

558           <---- <ACK(A),TSval=127,TSecr=1>

560                  <B,TSval=5,TSecr=127> ----->

562           <---- <ACK(B),TSval=131,TSecr=5>

564           . . . . . . . . . . . . . . . . . . . . . .

566                  <C,TSval=65,TSecr=131> ---->

568           <---- <ACK(C),TSval=191,TSecr=65>

570                           (etc.)

572     The dotted line marks a pause (60 time units long) in which A had
573     nothing to send. Note that this pause inflates the RTT which B could
574     infer from receiving TSecr=131 in data segment C. Thus, in one-way
575     data flows, RTTM in the reverse direction measures a value that is
576     inflated by gaps in sending data.
        However, the following rule
577     prevents a resulting inflation of the measured RTT:

579        RTTM Rule: A TSecr value received in a segment is used to update
580        the averaged RTT measurement only if

582        a) the segment acknowledges some new data, i.e., only if it
583           advances the left edge of the send window, and

585        b) the segment does not indicate any loss or reordering, i.e.,
586           contains no SACK options

588     Since TCP B is not sending data, the data segment C does not
589     acknowledge any new data when it arrives at B. Thus, the inflated
590     RTTM measurement is not used to update B's RTTM measurement.

592     Implementers should note that with Timestamps multiple RTTMs can be
593     taken per RTT. Many RTO estimators have a weighting factor based on
594     an implicit assumption that at most one RTTM will be sampled per RTT.
595     When using multiple RTTMs per RTT to update the RTO estimator, the
596     weighting factor needs to be decreased to take into account the more
597     frequent RTTMs. For example, an implementation could choose to just
598     use one sample per RTT to update the RTO estimator, or vary the gain
599     based on the congestion window, or take an average of all the RTT
600     measurements received over one RTT, and then use that value to update
601     the RTO estimator. This document does not prescribe any particular
602     method for modifying the RTO estimator.

604  3.4.  Which Timestamp to Echo

606     If more than one Timestamps option is received before a reply segment
607     is sent, the TCP must choose only one of the TSvals to echo, ignoring
608     the others. To minimize the state kept in the receiver (i.e., the
609     number of unprocessed TSvals), the receiver should be required to
610     retain at most one timestamp in the connection control block.

612     There are three situations to consider:

614     (A)  Delayed ACKs.

616          Many TCPs acknowledge only every Kth segment out of a group of
617          segments arriving within a short time interval; this policy is
618          known generally as "delayed ACKs".
             The data-sender TCP must
619          measure the effective RTT, including the additional time due to
620          delayed ACKs, or else it will retransmit unnecessarily. Thus,
621          when delayed ACKs are in use, the receiver SHOULD reply with the
622          TSval field from the earliest unacknowledged segment.

624     (B)  A hole in the sequence space (segment(s) have been lost).

626          The sender will continue sending until the window is filled, and
627          the receiver may be generating <ACK>s as these out-of-order
628          segments arrive (e.g., to aid "fast retransmit").

630          The lost segment is probably a sign of congestion, and in that
631          situation the sender should be conservative about
632          retransmission. Furthermore, it is better to overestimate than
633          underestimate the RTT. An <ACK> for an out-of-order segment
634          SHOULD therefore contain the timestamp from the most recent
635          segment that advanced the window.

637          The same situation occurs if segments are re-ordered by the
638          network.

640     (C)  A filled hole in the sequence space.

642          The segment that fills the hole represents the most recent
643          measurement of the network characteristics. A RTT computed from
644          an earlier segment would probably include the sender's
645          retransmit time-out, badly biasing the sender's average RTT
646          estimate. Thus, the timestamp from the latest segment (which
647          filled the hole) MUST be echoed.

649     An algorithm that covers all three cases is described in the
650     following rules for Timestamps option processing on a synchronized
651     connection:

653     (1)  The connection state is augmented with two 32-bit slots:

655          TS.Recent holds a timestamp to be echoed in TSecr whenever a
656          segment is sent, and Last.ACK.sent holds the ACK field from the
657          last segment sent. Last.ACK.sent will equal RCV.NXT except when
658          <ACK>s have been delayed.

660     (2)  If:

662             SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

664          then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.
666 (3) When a TSopt is sent, its TSecr field is set to the current 667 TS.Recent value. 669 The following examples illustrate these rules. Here A, B, C... 670 represent data segments occupying successive blocks of sequence 671 numbers, and ACK(A),... represent the corresponding acknowledgment 672 segments. Note that ACK(A) has the same sequence number as B. We 673 show only one direction of timestamp echoing, for clarity. 675 o Segments arrive in sequence, and some of the <ACK>s are delayed. 677 By case (A), the timestamp from the oldest unacknowledged segment 678 is echoed.
680                                                TS.Recent
681                <A, TSval=1> ------------------->
682                                                    1
683                <B, TSval=2> ------------------->
684                                                    1
685                <C, TSval=3> ------------------->
686                                                    1
687                         <---- <ACK(C), TSecr=1>
688                (etc)
690 o Segments arrive out of order, and every segment is acknowledged. 692 By case (B), the timestamp from the last segment that advanced the 693 left window edge is echoed, until the missing segment arrives; it 694 is echoed according to Case (C). The same sequence would occur if 695 segments B and D were lost and retransmitted.
697                                                TS.Recent
698                <A, TSval=1> ------------------->
699                                                    1
700                         <---- <ACK(A), TSecr=1>
701                                                    1
702                <C, TSval=3> ------------------->
703                                                    1
704                         <---- <ACK(A), TSecr=1>
705                                                    1
706                <B, TSval=2> ------------------->
707                                                    2
708                         <---- <ACK(C), TSecr=2>
709                                                    2
710                <E, TSval=5> ------------------->
711                                                    2
712                         <---- <ACK(C), TSecr=2>
713                                                    2
714                <D, TSval=4> ------------------->
715                                                    4
716                         <---- <ACK(E), TSecr=4>
717                (etc)
719 4. PAWS -- Protection Against Wrapped Sequence Numbers 721 4.1. Introduction 723 Section 4.2 describes a simple mechanism to reject old duplicate 724 segments that might corrupt an open TCP connection; we call this 725 mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS 726 operates within a single TCP connection, using state that is saved in 727 the connection control block. Section 4.8 and Appendix B discuss the 728 implications of the PAWS mechanism for avoiding old duplicates from 729 previous incarnations of the same connection. 731 4.2.
The PAWS Mechanism 733 PAWS uses the same TCP Timestamps option as the RTTM mechanism 734 described earlier, and assumes that every received TCP segment 735 (including data and <ACK> segments) contains a timestamp SEG.TSval 736 whose values are monotonically non-decreasing in time. The basic 737 idea is that a segment can be discarded as an old duplicate if it is 738 received with a timestamp SEG.TSval less than some timestamp recently 739 received on this connection. 741 In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit 742 unsigned integers in a modular 32-bit space. Thus, "less than" is 743 defined the same way it is for TCP sequence numbers, and the same 744 implementation techniques apply. If s and t are timestamp values, 746 s < t if 0 < (t - s) < 2^31, 748 computed in unsigned 32-bit arithmetic. 750 The choice of incoming timestamps to be saved for this comparison 751 MUST guarantee a value that is monotonically increasing. For 752 example, we might save the timestamp from the segment that last 753 advanced the left edge of the receive window, i.e., the most recent 754 in-sequence segment. Instead, we choose the value TS.Recent 755 introduced in Section 3.4 for the RTTM mechanism, since using a 756 common value for both PAWS and RTTM simplifies the implementation of 757 both. As Section 3.4 explained, TS.Recent differs from the timestamp 758 from the last in-sequence segment only in the case of delayed <ACK>s, 759 and therefore by less than one window. Either choice will therefore 760 protect against sequence number wrap-around. 762 RTTM was specified in a symmetrical manner, so that TSval timestamps 763 are carried in both data and <ACK> segments and are echoed in TSecr 764 fields carried in returning <ACK> or data segments. PAWS submits all 765 incoming segments to the same test, and therefore protects against 766 duplicate <ACK> segments as well as data segments.
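As an illustration of the modular "less than" relation defined in this section, the following Python sketch masks the difference to 32 bits to imitate unsigned 32-bit arithmetic. The function name ts_less_than and the two constants are hypothetical, chosen for this example only; an implementation in C would more likely rely on native unsigned 32-bit subtraction.

```python
# Sketch only: names are illustrative and not part of the specification.
MOD32 = 1 << 32   # size of the modular 32-bit timestamp space
HALF32 = 1 << 31  # half of that space


def ts_less_than(s: int, t: int) -> bool:
    """Return True if timestamp s is "less than" t, i.e. 0 < (t - s) < 2^31."""
    diff = (t - s) % MOD32  # emulate unsigned 32-bit subtraction
    return 0 < diff < HALF32
```

Under this definition a timestamp taken shortly before the 32-bit counter wraps still compares as older than a small timestamp taken shortly after the wrap, so ts_less_than(0xFFFFFFF0, 5) holds while ts_less_than(5, 0xFFFFFFF0) does not.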
(An alternative 767 non-symmetric algorithm would protect against old duplicate <ACK>s: 768 the sender of data would reject incoming <ACK> segments whose TSecr 769 values were less than the TSecr saved from the last segment whose ACK 770 field advanced the left edge of the send window. This algorithm was 771 deemed to lack economy of mechanism and symmetry.) 773 TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to 774 initialize PAWS. PAWS protects against old duplicate non-<SYN> 775 segments, and duplicate <SYN> segments received while there is a 776 synchronized connection. Duplicate <SYN> and <SYN,ACK> segments 777 received when there is no connection will be discarded by the normal 778 3-way handshake and sequence number checks of TCP. 780 [RFC1323] recommended that <RST> segments NOT carry timestamps, and 781 that they be acceptable regardless of their timestamp. At that time, 782 the thinking was that old duplicate <RST> segments should be 783 exceedingly unlikely, and their cleanup function should take 784 precedence over timestamps. More recently, discussions about various 785 blind attacks on TCP connections have raised the suggestion that if 786 the timestamp option is present, SEG.TSecr could be used to provide 787 stricter acceptance tests for <RST> segments. While still under 788 discussion, to enable research into this area it is now RECOMMENDED 789 that when generating a <RST>, that if the segment causing the <RST> 790 to be generated contained a timestamp option, that the <RST> also 791 contain a timestamp option. In the <RST> segment, SEG.TSecr SHOULD 792 be set to SEG.TSval from the incoming segment and SEG.TSval SHOULD be 793 set to zero. If a <RST> is being generated because of a user abort, 794 and Snd.TS.OK is set, then a timestamp option SHOULD be included in 795 the <RST>. When a <RST> segment is received, it MUST NOT be 796 subjected to PAWS checks, and information from the timestamp option 797 MUST NOT be used to update connection state information. SEG.TSecr 798 MAY be used to provide stricter acceptance checks. 800 4.3.
Basic PAWS Algorithm 802 The PAWS algorithm requires the following processing to be performed 803 on all incoming segments for a synchronized connection: 805 R1) If there is a Timestamps option in the arriving segment, 806 SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion) 807 and the RST bit is not set, then treat the arriving segment as 808 not acceptable: 810 Send an acknowledgement in reply as specified in [RFC0793] 811 page 69 and drop the segment. 813 Note: it is necessary to send an <ACK> segment in order to 814 retain TCP's mechanisms for detecting and recovering from 815 half-open connections. For example, see Figure 10 of 816 [RFC0793]. 818 R2) If the segment is outside the window, reject it (normal TCP 819 processing). 821 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see 822 Section 3.4), then record its timestamp in TS.Recent. 824 R4) If an arriving segment is in-sequence (i.e., at the left window 825 edge), then accept it normally. 827 R5) Otherwise, treat the segment as a normal in-window, out-of- 828 sequence TCP segment (e.g., queue it for later delivery to the 829 user). 831 Steps R2, R4, and R5 are the normal TCP processing steps specified by 832 [RFC0793]. 834 It is important to note that the timestamp is checked only when a 835 segment first arrives at the receiver, regardless of whether it is 836 in-sequence or it must be queued for later delivery. 838 Consider the following example. 840 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been 841 sent, where the letter indicates the sequence number and the digit 842 represents the timestamp. Suppose also that segment B.1 has been 843 lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ..., 844 Z.1 are considered acceptable and are queued. When B is 845 retransmitted as segment B.2 (using the latest timestamp), it 846 fills the hole and causes all the segments through Z to be 847 acknowledged and passed to the user.
The timestamps of the queued 848 segments are *not* inspected again at this time, since they have 849 already been accepted. When B.2 is accepted, TS.Recent is set to 850 2. 852 This rule allows reasonable performance under loss. A full window of 853 data is in transit at all times, and after a loss a full window less 854 one segment will show up out-of-sequence to be queued at the receiver 855 (e.g., up to ~2^30 bytes of data); the timestamp option must not 856 result in discarding this data. 858 In certain unlikely circumstances, the algorithm of rules R1-R5 could 859 lead to discarding some segments unnecessarily, as shown in the 860 following example: 862 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been 863 sent in sequence and that segment B.1 has been lost. Furthermore, 864 suppose delivery of some of C.1, ... Z.1 is delayed until AFTER 865 the retransmission B.2 arrives at the receiver. These delayed 866 segments will be discarded unnecessarily when they do arrive, 867 since their timestamps are now out of date. 869 This case is very unlikely to occur. If the retransmission was 870 triggered by a timeout, some of the segments C.1, ... Z.1 must have 871 been delayed longer than the RTO time. This is presumably an 872 unlikely event, or there would be many spurious timeouts and 873 retransmissions. If B's retransmission was triggered by the "fast 874 retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued 875 segments that caused these <ACK>s must have been received already. 877 Even if a segment were delayed past the RTO, the Fast Retransmit 878 mechanism [Jacobson90c] will cause the delayed segments to be 879 retransmitted at the same time as B.2, avoiding an extra RTT and 880 therefore causing a very small performance penalty. 882 We know of no case with a significant probability of occurrence in 883 which timestamps will cause performance degradation by unnecessarily 884 discarding segments. 886 4.4.
Timestamp Clock 888 It is important to understand that the PAWS algorithm does not 889 require clock synchronization between sender and receiver. The 890 sender's timestamp clock is used to stamp the segments, and the 891 sender uses the echoed timestamp to measure RTTs. However, the 892 receiver treats the timestamp as simply a monotonically increasing 893 serial number, without any necessary connection to its clock. From 894 the receiver's viewpoint, the timestamp is acting as a logical 895 extension of the high-order bits of the sequence number. 897 The receiver algorithm does place some requirements on the frequency 898 of the timestamp clock. 900 (a) The timestamp clock must not be "too slow". 902 It MUST tick at least once for each 2^31 bytes sent. In fact, 903 in order to be useful to the sender for round trip timing, the 904 clock SHOULD tick at least once per window's worth of data, and 905 even with the window extension defined in Section 2.2, 2^31 906 bytes must be at least two windows. 908 To make this more quantitative, any clock faster than 1 tick/sec 909 will reject old duplicate segments for link speeds of ~8 Gbps. 911 A 1 ms timestamp clock will work at link speeds up to 8 Tbps 912 (8*10^12) bps! 914 (b) The timestamp clock must not be "too fast". 916 The recycling time of the timestamp clock MUST be greater than 917 MSL seconds. Since the clock (timestamp) is 32 bits and the 918 worst-case MSL is 255 seconds, the maximum acceptable clock 919 frequency is one tick every 59 ns. 921 However, it is desirable to establish a much longer recycle 922 period, in order to handle outdated timestamps on idle 923 connections (see Section 4.5), and to relax the MSL requirement 924 for preventing sequence number wrap-around. With a 1 ms 925 timestamp clock, the 32-bit timestamp will wrap its sign bit in 926 24.8 days. Thus, it will reject old duplicates on the same 927 connection if MSL is 24.8 days or less. 
This appears to be a 928 very safe figure; an MSL of 24.8 days or longer can probably be 929 assumed in the internet without requiring precise MSL 930 enforcement. 932 Based upon these considerations, we choose a timestamp clock 933 frequency in the range 1 ms to 1 sec per tick. This range also 934 matches the requirements of the RTTM mechanism, which does not need 935 much more resolution than the granularity of the retransmit timer, 936 e.g., tens or hundreds of milliseconds. 938 The PAWS mechanism also puts a strong monotonicity requirement on the 939 sender's timestamp clock. The method of implementation of the 940 timestamp clock to meet this requirement depends upon the system 941 hardware and software. 943 o Some hosts have a hardware clock that is guaranteed to be 944 monotonic between hardware resets. 946 o A clock interrupt may be used to simply increment a binary integer 947 by 1 periodically. 949 o The timestamp clock may be derived from a system clock that is 950 subject to being abruptly changed, by adding a variable offset 951 value. This offset is initialized to zero. When a new timestamp 952 clock value is needed, the offset can be adjusted as necessary to 953 make the new value equal to or larger than the previous value 954 (which was saved for this purpose). 956 4.5. Outdated Timestamps 958 If a connection remains idle long enough for the timestamp clock of 959 the other TCP to wrap its sign bit, then the value saved in TS.Recent 960 will become too old; as a result, the PAWS mechanism will cause all 961 subsequent segments to be rejected, freezing the connection (until 962 the timestamp clock wraps its sign bit again). 964 With the chosen range of timestamp clock frequencies (1 sec to 1 ms), 965 the time to wrap the sign bit will be between 24.8 days and 24800 966 days. A TCP connection that is idle for more than 24 days and then 967 comes to life is exceedingly unusual. 
However, it is undesirable in 968 principle to place any limitation on TCP connection lifetimes. 970 We therefore require that an implementation of PAWS include a 971 mechanism to "invalidate" the TS.Recent value when a connection is 972 idle for more than 24 days. (An alternative solution to the problem 973 of outdated timestamps would be to send keep-alive segments at a very 974 low rate, but still more often than the wrap-around time for 975 timestamps, e.g., once a day. This would impose negligible overhead. 976 However, the TCP specification has never included keep-alives, so the 977 solution based upon invalidation was chosen.) 979 Note that a TCP does not know the frequency, and therefore, the 980 wraparound time, of the other TCP, so it must assume the worst. The 981 validity of TS.Recent needs to be checked only if the basic PAWS 982 timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If 983 TS.Recent is found to be invalid, then the segment is accepted, 984 regardless of the failure of the timestamp check, and rule R3 updates 985 TS.Recent with the TSval from the new segment. 987 To detect how long the connection has been idle, the TCP MAY update a 988 clock or timestamp value associated with the connection whenever 989 TS.Recent is updated, for example. The details will be 990 implementation-dependent. 992 4.6. Header Prediction 994 "Header prediction" [Jacobson90a] is a high-performance transport 995 protocol implementation technique that is most important for high- 996 speed links. This technique optimizes the code for the most common 997 case, receiving a segment correctly and in order. Using header 998 prediction, the receiver asks the question, "Is this segment the next 999 in sequence?" This question can be answered in fewer machine 1000 instructions than the question, "Is this segment within the window?" 
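The two questions can be contrasted in a short Python sketch. It is a simplification under stated assumptions: only the first octet of the segment is tested (the full acceptability test of RFC 793 also considers the segment length and zero windows), and the function names are hypothetical.

```python
# Sketch only: names are illustrative and not part of the specification.
MOD32 = 1 << 32  # TCP sequence numbers live in a modular 32-bit space


def is_next_in_sequence(seg_seq: int, rcv_nxt: int) -> bool:
    """Header-prediction question: a single equality comparison."""
    return seg_seq == rcv_nxt


def is_within_window(seg_seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Window question: a modular subtraction followed by a range comparison."""
    return (seg_seq - rcv_nxt) % MOD32 < rcv_wnd
```

The first function needs one comparison, while the second needs a subtraction, a reduction modulo 2^32, and a comparison, which is the kind of extra work that header prediction avoids on the common path.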
1002 Adding header prediction to our timestamp procedure leads to the 1003 following recommended sequence for processing an arriving TCP 1004 segment: 1006 H1) Check timestamp (same as step R1 above) 1008 H2) Do header prediction: if segment is next in sequence and if 1009 there are no special conditions requiring additional processing, 1010 accept the segment, record its timestamp, and skip H3. 1012 H3) Process the segment normally, as specified in RFC 793. This 1013 includes dropping segments that are outside the window and 1014 possibly sending acknowledgments, and queuing in-window, out-of- 1015 sequence segments. 1017 Another possibility would be to interchange steps H1 and H2, i.e., to 1018 perform the header prediction step H2 FIRST, and perform H1 and H3 1019 only when header prediction fails. This could be a performance 1020 improvement, since the timestamp check in step H1 is very unlikely to 1021 fail, and it requires unsigned modulo arithmetic. To perform this 1022 check on every single segment is contrary to the philosophy of header 1023 prediction. We believe that this change might produce a measurable 1024 reduction in CPU time for TCP protocol processing on high-speed 1025 networks. 1027 However, putting H2 first would create a hazard: a segment from 2^32 1028 bytes in the past might arrive at exactly the wrong time and be 1029 accepted mistakenly by the header-prediction step. The following 1030 reasoning has been introduced in [RFC1185] to show that the 1031 probability of this failure is negligible. 1033 If all segments are equally likely to show up as old duplicates, 1034 then the probability of an old duplicate exactly matching the left 1035 window edge is the maximum segment size (MSS) divided by the size 1036 of the sequence space. This ratio must be less than 2^-16, since 1037 MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20 1038 for a FDDI link. 
However, the older a segment is, the less likely 1039 it is to be retained in the Internet, and under any reasonable 1040 model of segment lifetime the probability of an old duplicate 1041 exactly at the left window edge must be much smaller than 2^-16. 1043 The 16 bit TCP checksum also allows a basic unreliability of one 1044 part in 2^16. A protocol mechanism whose reliability exceeds the 1045 reliability of the TCP checksum should be considered "good 1046 enough", i.e., it won't contribute significantly to the overall 1047 error rate. We therefore believe we can ignore the problem of an 1048 old duplicate being accepted by doing header prediction before 1049 checking the timestamp. 1051 However, this probabilistic argument is not universally accepted, and 1052 the consensus at present is that the performance gain does not 1053 justify the hazard in the general case. It is therefore recommended 1054 that H2 follow H1. 1056 4.7. IP Fragmentation 1058 At high data rates, the protection against old segments provided by 1059 PAWS can be circumvented by errors in IP fragment reassembly (see 1060 [RFC4963]). The only way to protect against incorrect IP fragment 1061 reassembly is to not allow the segments to be fragmented. This is 1062 done by setting the Don't Fragment (DF) bit in the IP header. 1063 Setting the DF bit implies the use of Path MTU Discovery as described 1064 in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation 1065 that implements PAWS MUST also implement Path MTU Discovery. 1067 4.8. Duplicates from Earlier Incarnations of Connection 1069 The PAWS mechanism protects against errors due to sequence number 1070 wrap-around on high-speed connections. Segments from an earlier 1071 incarnation of the same connection are also a potential cause of old 1072 duplicate errors. 
In both cases, the TCP mechanisms to prevent such 1073 errors depend upon the enforcement of a maximum segment lifetime 1074 (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a 1075 detailed discussion). Unlike the case of sequence space wrap-around, 1076 the MSL required to prevent old duplicate errors from earlier 1077 incarnations does not depend upon the transfer rate. If the IP layer 1078 enforces the recommended 2 minute MSL of TCP, and if the TCP rules 1079 are followed, TCP connections will be safe from earlier incarnations, 1080 no matter how high the network speed. Thus, the PAWS mechanism is 1081 not required for this case. 1083 We may still ask whether the PAWS mechanism can provide additional 1084 security against old duplicates from earlier connections, allowing us 1085 to relax the enforcement of MSL by the IP layer. Appendix B explores 1086 this question, showing that further assumptions and/or mechanisms are 1087 required, beyond those of PAWS. This is not part of the current 1088 extension. 1090 5. Conclusions and Acknowledgements 1092 This memo presented a set of extensions to TCP to provide efficient 1093 operation over large-bandwidth*delay-product paths and reliable 1094 operation over very high-speed paths. These extensions are designed 1095 to provide compatible interworking with TCP's that do not implement 1096 the extensions. 1098 These mechanisms are implemented using TCP options for scaled windows 1099 and timestamps. The timestamps are used for two distinct mechanisms: 1100 RTTM (Round Trip Time Measurement) and PAWS (Protection Against 1101 Wrapped Sequences). 1103 The Window Scale option was originally suggested by Mike St. Johns of 1104 USAF/DCA. The present form of the option was suggested by Mike 1105 Karels of UC Berkeley in response to a more cumbersome scheme defined 1106 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1107 description in [RFC1185]. 
1109 Finally, much of this work originated as the result of discussions 1110 within the End-to-End Task Force on the theoretical limitations of 1111 transport protocols in general and TCP in particular. Task force 1112 members and others on the end2end-interest list have made valuable 1113 contributions by pointing out flaws in the algorithms and the 1114 documentation. Continued discussion and development since the 1115 publication of [RFC1323] originally occurred in the IETF TCP Large 1116 Windows Working Group, later on in the End-to-End Task Force, and 1117 most recently in the IETF TCP Maintenance Working Group. The authors 1118 are grateful for all these contributions. 1120 6. Security Considerations 1122 The TCP sequence space is a fixed size, and as the window becomes 1123 larger it becomes easier for an attacker to generate forged packets 1124 that can fall within the TCP window, and be accepted as valid 1125 segments. While use of Timestamps and PAWS can help to mitigate 1126 this, when using PAWS, if an attacker is able to forge a packet that 1127 is acceptable to the TCP connection, a timestamp that is in the 1128 future would cause valid segments to be dropped due to PAWS checks. 1129 Hence, implementers should take care to not open the TCP window 1130 drastically beyond the requirements of the connection. 1132 Middle boxes and options: If a middle box removes TCP options from 1133 the segment, such as TSopt, a high speed connection that needs 1134 PAWS would not have that protection. In this situation, an 1135 implementer could provide a mechanism for the application to 1136 determine whether or not PAWS is in use on the connection, and choose 1137 to terminate the connection if that protection doesn't exist. 1139 Mechanisms to protect the TCP header from modification should also 1140 protect the TCP options.
1142 A naive implementation that derives the timestamp clock value 1143 directly from a system uptime clock may unintentionally leak the 1144 system uptime to an attacker. This does not directly compromise any of 1145 the mechanisms described in this document. However, this may be 1146 valuable information to a potential attacker. An implementer should 1147 evaluate the potential impact and mitigate this accordingly (e.g., by 1148 using a random offset for the timestamp clock on each connection, or 1149 using an external, real-time derived timestamp clock source). 1151 Expanding the TCP window beyond 64K for IPv6 allows Jumbograms 1152 [RFC2675] to be used when the local network supports packets larger 1153 than 64K. When larger TCP segments are used, the TCP checksum becomes 1154 weaker. 1156 7. IANA Considerations 1158 This document has no actions for IANA. 1160 8. References 1162 8.1. Normative References 1164 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1165 RFC 793, September 1981. 1167 [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 1168 November 1990. 1170 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1171 Requirement Levels", BCP 14, RFC 2119, March 1997. 1173 8.2. Informative References 1175 [Garlick77] 1176 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable 1177 Host-to-Host Protocols", Proc. Second Berkeley Workshop on 1178 Distributed Data Management and Computer Networks, 1179 May 1977, . 1181 [Hamming77] 1182 Hamming, R., "Digital Filters", Prentice Hall, Englewood 1183 Cliffs, N.J. ISBN 0-13-212571-4, 1977. 1185 [Jacobson88a] 1186 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1187 '88, Stanford, CA., August 1988, 1188 . 1190 [Jacobson90a] 1191 Jacobson, V., "4BSD Header Prediction", ACM Computer 1192 Communication Review, April 1990.
1194 [Jacobson90c] 1195 Jacobson, V., "Modified TCP congestion avoidance 1196 algorithm", Message to the end2end-interest mailing list, 1197 April 1990, 1198 . 1200 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1201 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and 1202 Comm., Scottsdale, Arizona, March 1986, 1203 . 1205 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in 1206 Reliable Transport Protocols", Proc. SIGCOMM '87, 1207 August 1987. 1209 [Martin03] 1210 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg 1211 mailing list, September 2003, . 1214 [Mathis08] 1215 Mathis, M., "[tcpm] Example of 1323 window retraction 1216 problem", Message to the tcpm mailing list, March 2008, 1217 . 1220 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1221 RFC 896, January 1984. 1223 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay 1224 paths", RFC 1072, October 1988. 1226 [RFC1110] McKenzie, A., "Problem with the TCP big window option", 1227 RFC 1110, August 1989. 1229 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1230 Communication Layers", STD 3, RFC 1122, October 1989. 1232 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for 1233 High-Speed Paths", RFC 1185, October 1990. 1235 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 1236 for High Performance", RFC 1323, May 1992. 1238 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 1239 for IP version 6", RFC 1981, August 1996. 1241 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1242 Selective Acknowledgment Options", RFC 2018, October 1996. 1244 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1245 Control", RFC 2581, April 1999. 1247 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1248 RFC 2675, August 1999. 1250 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. 
Podolsky, "An 1251 Extension to the Selective Acknowledgement (SACK) Option 1252 for TCP", RFC 2883, July 2000. 1254 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1255 Discovery", RFC 4821, March 2007. 1257 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1258 Errors at High Data Rates", RFC 4963, July 2007. 1260 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1261 Control", RFC 5681, September 2009. 1263 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1264 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1265 Based on Selective Acknowledgment (SACK) for TCP", 1266 RFC 6675, August 2012. 1268 [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 1269 RFC 6691, July 2012. 1271 [Watson81] 1272 Watson, R., "Timer-based Mechanisms in Reliable Transport 1273 Protocol Connection Management", Computer Networks, Vol. 1274 5, 1981. 1276 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM 1277 '86, Stowe, VT, August 1986. 1279 Appendix A. Implementation Suggestions 1281 TCP Option Layout 1283 The following layout is recommended for sending options on non-<SYN> 1284 segments, to achieve maximum feasible alignment of 32-bit 1285 and 64-bit machines.
1287           +--------+--------+--------+--------+
1288           |   NOP  |  NOP   |  TSopt |   10   |
1289           +--------+--------+--------+--------+
1290           |          TSval timestamp          |
1291           +--------+--------+--------+--------+
1292           |          TSecr timestamp          |
1293           +--------+--------+--------+--------+
1295 Interaction with the TCP Urgent Pointer 1297 The TCP Urgent pointer, like the TCP window, is a 16 bit value. 1298 Some of the original discussion for the TCP Window Scale option 1299 included proposals to increase the Urgent pointer to 32 bits. As 1300 it turns out, this is unnecessary.
There are two observations 1301 that should be made: 1303 (1) With IP Version 4, the largest amount of TCP data that can be 1304 sent in a single packet is 65495 bytes (64K - 1 -- size of 1305 fixed IP and TCP headers). 1307 (2) Updates to the urgent pointer while the user is in "urgent 1308 mode" are invisible to the user. 1310 This means that if the Urgent Pointer points beyond the end of the 1311 TCP data in the current segment, then the user will remain in 1312 urgent mode until the next TCP segment arrives. That segment will 1313 update the urgent pointer to a new offset, and the user will never 1314 have left urgent mode. 1316 Thus, to properly implement the Urgent Pointer, the sending TCP 1317 only has to check for overflow of the 16 bit Urgent Pointer field 1318 before filling it in. If it does overflow, then a value of 65535 1319 should be inserted into the Urgent Pointer. 1321 The same technique applies to IP Version 6, except in the case of 1322 IPv6 Jumbograms. When IPv6 Jumbograms are supported, [RFC2675] 1323 requires additional steps for dealing with the Urgent Pointer; 1324 these are described in section 5.2 of [RFC2675]. 1326 Appendix B. Duplicates from Earlier Connection Incarnations 1328 There are two cases to be considered: (1) a system crashing (and 1329 losing connection state) and restarting, and (2) the same connection 1330 being closed and reopened without a loss of host state. These will 1331 be described in the following two sections. 1333 B.1. System Crash with Loss of State 1335 TCP's quiet time of one MSL upon system startup handles the loss of 1336 connection state in a system crash/restart. For an explanation, see 1337 for example "When to Keep Quiet" in the TCP protocol specification 1338 [RFC0793]. The MSL that is required here does not depend upon the 1339 transfer speed. The current TCP MSL of 2 minutes seemed acceptable 1340 as an operational compromise, when many host systems used to take 1341 this long to boot after a crash.
Current host systems can boot 1342 considerably faster. 1344 The timestamp option may be used to ease the MSL requirements (or to 1345 provide additional security against data corruption). If timestamps 1346 are being used and if the timestamp clock can be guaranteed to be 1347 monotonic over a system crash/restart, i.e., if the first value of 1348 the sender's timestamp clock after a crash/restart can be guaranteed 1349 to be greater than the last value before the restart, then a quiet 1350 time is unnecessary. 1352 To dispense totally with the quiet time would require that the host 1353 clock be synchronized to a time source that is stable over the crash/ 1354 restart period, with an accuracy of one timestamp clock tick or 1355 better. We can back off from this strict requirement to take 1356 advantage of approximate clock synchronization. Suppose that the 1357 clock is always re-synchronized to within N timestamp clock ticks and 1358 that booting (extended with a quiet time, if necessary) takes more 1359 than N ticks. This will guarantee monotonicity of the timestamps, 1360 which can then be used to reject old duplicates even without an 1361 enforced MSL. 1363 B.2. Closing and Reopening a Connection 1365 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state 1366 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). 1367 Applications built upon TCP that close one connection and open a new 1368 one (e.g., an FTP data transfer connection using Stream mode) must 1369 choose a new socket pair each time. The TIME-WAIT delay serves two 1370 different purposes: 1372 (a) Implement the full-duplex reliable close handshake of TCP. 1374 The proper time to delay the final close step is not really 1375 related to the MSL; it depends instead upon the RTO for the FIN 1376 segments and therefore upon the RTT of the path.
        (It could be argued that the side that is sending a FIN knows
        what degree of reliability it needs, and therefore it should be
        able to determine the length of the TIME-WAIT delay for the
        FIN's recipient.  This could be accomplished with an appropriate
        TCP option in FIN segments.)

        Although there is no formal upper bound on RTT, common network
        engineering practice makes an RTT greater than 1 minute very
        unlikely.  Thus, the 4 minute delay in TIME-WAIT state works
        satisfactorily to provide a reliable full-duplex TCP close.
        Note again that this is independent of MSL enforcement and
        network speed.

        The TIME-WAIT state could cause an indirect performance problem
        if an application needed to repeatedly close one connection and
        open another at a very high frequency, since the number of
        available TCP ports on a host is less than 2^16.  However, high
        network speeds are not the major contributor to this problem;
        the RTT is the limiting factor in how quickly connections can be
        opened and closed.  Therefore, this problem will be no worse at
        high transfer speeds.

   (b)  Allow old duplicate segments to expire.

        To replace this function of TIME-WAIT state, a mechanism would
        have to operate across connections.  PAWS is defined strictly
        within a single connection; the last timestamp (TS.Recent) is
        kept in the connection control block, and discarded when a
        connection is closed.

        An additional mechanism could be added to the TCP: a per-host
        cache of the last timestamp received from any connection.  This
        value could then be used in the PAWS mechanism to reject old
        duplicate segments from earlier incarnations of the connection,
        if the timestamp clock can be guaranteed to have ticked at least
        once since the old connection was open.
        This would require that the TIME-WAIT delay plus the RTT
        together be at least one tick of the sender's timestamp clock.
        Such an extension is not part of the proposal of this RFC.

        Note that this is a variant on the mechanism proposed by
        Garlick, Rom, and Postel [Garlick77], which required each host
        to maintain connection records containing the highest sequence
        numbers on every connection.  Using timestamps instead, it is
        only necessary to keep one quantity per remote host, regardless
        of the number of simultaneous connections to that host.

Appendix C.  Summary of Notation

   The following notation has been used in this document.

   Options

      WSopt:           TCP Window Scale Option
      TSopt:           TCP Timestamps Option

   Option Fields

      shift.cnt:       Window scale byte in WSopt
      TSval:           32-bit Timestamp Value field in TSopt
      TSecr:           32-bit Timestamp Reply field in TSopt

   Option Fields in Current Segment

      SEG.TSval:       TSval field from TSopt in current segment
      SEG.TSecr:       TSecr field from TSopt in current segment
      SEG.WSopt:       8-bit value in WSopt

   Clock Values

      my.TSclock:      System wide source of 32-bit timestamp values
      my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec)
      Snd.TSoffset:    An offset for randomizing Snd.TSclock
      Snd.TSclock:     my.TSclock + Snd.TSoffset

   Per-Connection State Variables

      TS.Recent:       Latest received Timestamp
      Last.ACK.sent:   Last ACK field sent
      Snd.TS.OK:       1-bit flag
      Snd.WS.OK:       1-bit flag
      Rcv.Wind.Scale:  Receive window scale power
      Snd.Wind.Scale:  Send window scale power
      Start.Time:      Snd.TSclock value when segment being timed was
                       sent (used by pre-1323 code).

   Procedure

      Update_SRTT(m)   Procedure to update the smoothed RTT and RTT
                       variance estimates, using the rules of
                       [Jacobson88a], given m, a new RTT measurement

Appendix D.  Event Processing Summary

   OPEN Call

      ...

      An initial send sequence number (ISS) is selected.  Send a <SYN>
      segment of the form:

         <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale>

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If the foreign socket is specified, then change the connection
         from passive to active, select an ISS.  Send a <SYN> segment
         containing the options: <TSval=Snd.TSclock> and
         <WSopt=Rcv.Wind.Scale>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
         Enter SYN-SENT state. ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT). ...

         If the urgent flag is set ...

         If the Snd.TS.OK flag is set, then include the TCP Timestamps
         option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
         segment.

         Scale the receive window for transmission in the segment
         header:

            SEG.WND = (RCV.WND >> Rcv.Wind.Scale).

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            if the SYN bit is set, check the security.  If the ...

               ...

            if the SEG.PRC is less than the TCB.PRC then continue.

            Check for a Window Scale option (WSopt); if one is found,
            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
            zero and clear Snd.WS.OK flag.

            Check for a TSopt option; if one is found, save SEG.TSval in
            the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
            other control or text should be queued for processing later.
            ISS should be selected and a <SYN> segment sent of the form:

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

            If the Snd.WS.OK bit is on, include a WSopt option
            <WSopt=Rcv.Wind.Scale> in this segment.  If the Snd.TS.OK
            bit is on, include a TSopt
            <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
            Last.ACK.sent is set to RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS.
            The connection state should be changed to SYN-RECEIVED.
            Note that any other incoming control or data (combined with
            SYN) will be processed in the SYN-RECEIVED state, but
            processing of SYN and ACK should not be repeated.  If the
            listen was not fully specified (i.e., the foreign socket was
            not fully specified), then the unspecified fields should be
            filled in now.

         fourth other text or control

            ...

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ, and any acknowledgements on the
            retransmission queue which are thereby acknowledged should
            be removed.

            Check for a Window Scale option (WSopt); if it is found,
            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

            Check for a TSopt option; if one is found, save SEG.TSval in
            variable TS.Recent and turn on the Snd.TS.OK bit in the
            connection control block.  If the ACK bit is set, use
            Snd.TSclock - SEG.TSecr as the initial RTT estimate.

            If SND.UNA > ISS (our <SYN> has been ACKed), change the
            connection state to ESTABLISHED, form an <ACK> segment:

               <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

            and send it.  If the Snd.TS.OK bit is on, include a TSopt
            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
            segment.  Last.ACK.sent is set to RCV.NXT.

            Data or controls which were queued for transmission may be
            included.  If there are other controls or text in the
            segment then continue processing at the sixth step below
            where the URG bit is checked, otherwise return.

            Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment:

               <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

            and send it.  If the Snd.TS.OK bit is on, include a TSopt
            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
            If the Snd.WS.OK bit is on, include a WSopt option
            <WSopt=Rcv.Wind.Scale> in this segment.  Last.ACK.sent is
            set to RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, then return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise,

      First, check sequence number

         SYN-RECEIVED STATE
         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE
         CLOSE-WAIT STATE
         CLOSING STATE
         LAST-ACK STATE
         TIME-WAIT STATE

            Segments are processed in sequence.  Initial tests on
            arrival are used to discard old duplicates, but further
            processing is done in SEG.SEQ order.  If a segment's
            contents straddle the boundary between old and new, only the
            new parts should be processed.

            Rescale the received window field:

               TrueWindow = SEG.WND << Snd.Wind.Scale,

            and use "TrueWindow" in place of SEG.WND in the following
            steps.

            Check whether the segment contains a Timestamps option and
            bit Snd.TS.OK is on.  If so:

               If SEG.TSval < TS.Recent and the RST bit is off, then
               test whether connection has been idle less than 24 days;
               if all are true, then the segment is not acceptable;
               follow steps below for an unacceptable segment.

               If SEG.SEQ is less than or equal to Last.ACK.sent, then
               save SEG.TSval in variable TS.Recent.

            There are four cases for the acceptability test for an
            incoming segment:

               ...

            If an incoming segment is not acceptable, an acknowledgment
            should be sent in reply (unless the RST bit is set, if so
            drop the segment and return):

               <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

            Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
            the Snd.TS.OK bit is on, include the Timestamps option
            <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
            Set Last.ACK.sent to SEG.ACK and send the <ACK> segment.
            After sending the acknowledgment, drop the unacceptable
            segment and return.

            ...

      fifth check the ACK field.

         if the ACK bit is off drop the segment and return.

         if the ACK bit is on

            ...

            ESTABLISHED STATE

               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
               SEG.ACK.  Also compute a new estimate of round-trip time.
               If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
               otherwise use the elapsed time since the first segment in
               the retransmission queue was sent.  Any segments on the
               retransmission queue which are thereby entirely
               acknowledged...

            ...

      Seventh, process the segment text.

         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE

            ...

            Send an acknowledgment of the form:

               <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

            If the Snd.TS.OK bit is on, include Timestamps option
            <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
            Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
            it.  This acknowledgment should be piggy-backed on a segment
            being transmitted if possible without incurring undue delay.

            ...

Appendix E.  Timestamps Edge Cases

   While the rules laid out for when to calculate RTTM produce the
   correct results most of the time, there are some edge cases where an
   incorrect RTTM can be calculated.  All of these situations involve
   the loss of segments.  It is felt that these scenarios are rare, and
   that if they should happen, they will inflate only a single RTTM
   measurement, which limits the effect on RTO calculations.

   [Martin03] cites two similar cases when the returning <ACK> is lost,
   and before the retransmission timer fires, another returning <ACK>
   segment arrives, which acknowledges the data.  In this case, the RTTM
   calculated will be inflated:

          clock
           tc=1   <A, TSval=1> ------------------->

           tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
               (RTTM would have been 1)

                  (receive window opens, window update is sent)
           tc=5          <---- <ACK(A), TSecr=1, win=m>
                  (RTTM is calculated at 4)

   One thing to note about this situation is that it is somewhat bounded
   by RTO + RTT, limiting how far off the RTTM calculation will be.
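
   As a rough illustration of the lost-acknowledgment case above, the
   following Python sketch recomputes the two RTTM values from the
   diagram.  The helper name rttm() and the use of plain integers for
   clock ticks are assumptions made for this example only; they are not
   part of this specification.

```python
MASK32 = 0xFFFFFFFF  # timestamp values are 32 bits wide


def rttm(now_ticks: int, seg_tsecr: int) -> int:
    """Hypothetical helper: RTT measurement in timestamp clock ticks.

    Computes (now - SEG.TSecr) modulo 2**32 so the result stays
    correct when the timestamp clock wraps around.
    """
    return (now_ticks - seg_tsecr) & MASK32


# The data segment is sent at tc=1 and carries TSval=1.
# Had the acknowledgment sent at tc=2 arrived, it would echo TSecr=1.
rttm_if_ack_not_lost = rttm(2, 1)

# That acknowledgment is lost.  The window update arriving at tc=5
# still echoes TSecr=1, so the measurement is taken against tc=5.
rttm_from_window_update = rttm(5, 1)

# Amount by which this single measurement is inflated, in ticks.
inflation = rttm_from_window_update - rttm_if_ack_not_lost
```

   In this example the single measurement is inflated by 3 ticks, which
   matches the values 1 and 4 shown in the diagram.
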
   While more complex scenarios can be constructed that produce larger
   inflations (e.g., retransmissions are lost), those scenarios involve
   multiple segment losses, and the connection will have other more
   serious operational problems than using an inflated RTTM in the RTO
   calculation.

Appendix F.  Changes from RFC 1323

   Several important updates and clarifications to the specification in
   RFC 1323 are made in this document.  The technical changes are
   summarized below:

   (a)  The description of which TSecr values can be used to update the
        measured RTT has been clarified.  Specifically, with timestamps,
        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
        disables all RTT measurements during retransmission, since it is
        ambiguous whether the <ACK> is for the original segment, or the
        retransmitted segment.  With timestamps, that ambiguity is
        removed since the TSecr in the <ACK> will contain the TSval from
        whichever data segment made it to the destination.

   (b)  In RFC1323, section 3.4, step (2) of the algorithm to control
        which timestamp is echoed was incorrect in two regards:

        (1)  It failed to update TS.Recent for a retransmitted segment
             that resulted from a lost <ACK>.

        (2)  It failed if SEG.LEN = 0.

        In the new algorithm, the case of SEG.TSval >= TS.Recent is
        included for consistency with the PAWS test.

   (c)  One correction was made to the Event Processing Summary in
        Appendix D.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
        fill in the SEG.WND value, not SND.WND.

   (d)  Appendix A has been expanded with information about the TCP
        Urgent Pointer.  An earlier revision contained text around the
        TCP MSS option, which was split off into [RFC6691].

   (e)  It is now recommended that Timestamps options be included in
        <RST> segments if the incoming segment contained a timestamp
        option.

   (f)  <RST> segments are explicitly excluded from PAWS processing.

   (g)  Snd.TSoffset and Snd.TSclock variables have been added.
        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
        allows the starting points for timestamp values to be randomized
        on a per-connection basis.  Setting Snd.TSoffset to zero yields
        the same results as [RFC1323].

   (h)  RTTM update processing explicitly excludes segments containing
        SACK options.  This addresses inflation of the RTT during
        episodes of segment loss in both directions.

   (i)  In Section 3.2 the wording of how timestamp option negotiation
        is to be performed was updated with RFC2119 wording.  Text was
        also added to subsequently allow late timestamp negotiation.

   (j)  Section 2.4 was added describing the unavoidable window
        retraction issue, and explicitly describing the mitigation steps
        necessary.

   Editorial changes to the document that do not affect the
   implementation or function of the mechanisms described in this
   document include:

   (a)  Section 1.4 was added for RFC2119 wording.  Normative text was
        updated with the appropriate phrases.

   (b)  Removed much of the discussion in Section 1 to streamline the
        document.  However, detailed examples and discussions in
        Section 2, Section 3 and Section 4 are kept as guidelines for
        implementers.

   (c)  Moved Appendix "Changes" to the end of the appendices for easier
        lookup.

   (d)  Removed references to "new" options, as they were introduced in
        [RFC1323] already.  Changed the text in Section 1.3 to
        specifically address TS and WS options.

   (e)  Removed the list of changes between RFC 1323 and prior versions.
        These changes are mentioned in Appendix C of RFC 1323.

   (f)  Added < > brackets to mark specific types of segments, and
        replaced most occurrences of "packet" with "segment", where TCP
        segments are referred to.
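
   To make technical changes (f) and (g) in this appendix more
   concrete, the following Python sketch shows one possible way to
   derive Snd.TSclock from my.TSclock and Snd.TSoffset, and to skip the
   PAWS rejection for segments with the RST bit set.  The function
   names, the idle_lt_24_days flag, and the modulo-2**32 comparison
   rule (s < t if 0 < (t - s) < 2**31) are assumptions made for this
   example; it is a sketch, not a definitive implementation.

```python
MASK32 = 0xFFFFFFFF  # timestamp values are 32 bits wide
HALF32 = 0x80000000  # 2**31, half of the timestamp number space


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Snd.TSclock = my.TSclock + Snd.TSoffset, truncated to 32 bits."""
    return (my_tsclock + snd_tsoffset) & MASK32


def ts_lt(s: int, t: int) -> bool:
    """Assumed comparison rule: s < t if 0 < (t - s) < 2**31 (mod 2**32)."""
    return 0 < ((t - s) & MASK32) < HALF32


def paws_reject(seg_tsval: int, ts_recent: int, rst: bool,
                idle_lt_24_days: bool = True) -> bool:
    """Hypothetical PAWS check for an arriving segment.

    Returns True when the segment should be treated as unacceptable:
    SEG.TSval < TS.Recent, the RST bit is off, and the connection has
    been idle for less than 24 days.
    """
    if rst:
        # Segments with the RST bit set are excluded from PAWS.
        return False
    return idle_lt_24_days and ts_lt(seg_tsval, ts_recent)


# With a zero offset the clock value is used unchanged.
unchanged = snd_tsclock(100, 0)
# The sum wraps around at 2**32.
wrapped = snd_tsclock(0xFFFFFFFF, 2)
```

   A per-connection random Snd.TSoffset only shifts the starting point
   of the timestamp values; the comparison in ts_lt() depends only on
   differences between timestamps, so it is unaffected by the offset.
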

Authors' Addresses

   David Borman
   Quantum Corporation
   Mendota Heights  MN 55120
   USA

   Email: david.borman@quantum.com


   Bob Braden
   University of Southern California
   4676 Admiralty Way
   Marina del Rey  CA 90292
   USA

   Email: braden@isi.edu


   Van Jacobson
   Packet Design
   2465 Latham Street
   Mountain View  CA 94040
   USA

   Email: van@packetdesign.com


   Richard Scheffenegger (editor)
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,   1120
   Austria

   Email: rs@netapp.com