TCP Maintenance (TCPM)                                         D. Borman
Internet-Draft                                       Quantum Corporation
Intended status: Standards Track                               B. Braden
Expires: August 29, 2013                          University of Southern
                                                              California
                                                             V. Jacobson
                                                           Packet Design
                                                   R. Scheffenegger, Ed.
                                                            NetApp, Inc.
                                                       February 25, 2013


                  TCP Extensions for High Performance
                       draft-ietf-tcpm-1323bis-06

Abstract

   This document specifies a set of TCP extensions to improve
   performance over paths with a large bandwidth*delay product and to
   provide reliable operation over very high-speed paths. It defines
   TCP options for scaled windows and timestamps. The timestamps are
   used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
   and PAWS (Protection Against Wrapped Sequences).

   This document updates and obsoletes RFC 1323.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 29, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  TCP Performance
     1.2.  TCP Reliability
     1.3.  Using TCP options
   2.  Terminology
   3.  TCP Window Scale Option
     3.1.  Introduction
     3.2.  Window Scale Option
     3.3.  Using the Window Scale Option
     3.4.  Addressing Window Retraction
   4.  RTTM -- Round-Trip Time Measurement
     4.1.  Introduction
     4.2.  TCP Timestamps Option
     4.3.  The RTTM Mechanism
     4.4.  Which Timestamp to Echo
   5.  PAWS -- Protection Against Wrapped Sequence Numbers
     5.1.  Introduction
     5.2.  The PAWS Mechanism
       5.2.1.  Basic PAWS Algorithm
       5.2.2.  Timestamp Clock
       5.2.3.  Outdated Timestamps
       5.2.4.  Header Prediction
       5.2.5.  IP Fragmentation
     5.3.  Duplicates from Earlier Incarnations of Connection
   6.  Conclusions and Acknowledgements
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Implementation Suggestions
   Appendix B.  Duplicates from Earlier Connection Incarnations
     B.1.  System Crash with Loss of State
     B.2.  Closing and Reopening a Connection
   Appendix C.  Summary of Notation
   Appendix D.  Pseudo-code Summary
   Appendix E.  Event Processing Summary
   Appendix F.  Timestamps Edge Cases
   Appendix G.  Changes from RFC 1072, RFC 1185, and RFC 1323
   Authors' Addresses

1.  Introduction

   The TCP protocol [RFC0793] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments. Over the
   years, advances in networking technology have resulted in ever-higher
   transmission speeds, and the fastest paths are well beyond the domain
   for which TCP was originally engineered.

   This document defines a set of modest extensions to TCP to extend the
   domain of its application to match the increasing network capability.
   It is an update to and obsoletes [RFC1323], which in turn is based
   upon and obsoletes [RFC1072] and [RFC1185].

   For brevity, the full discussions of the merits and history behind
   the TCP options defined within this document have been omitted.
   [RFC1323] should be consulted for reference.
   A modern TCP implementation SHOULD implement and make use of the
   extensions described in this document.

1.1.  TCP Performance

   TCP performance problems arise when the bandwidth*delay product is
   large. A network having such paths is referred to as a "long, fat
   network" (LFN).

   There are three fundamental performance problems with the current TCP
   over LFN paths:

   (1)  Window Size Limit

        The TCP header uses a 16-bit field to report the receive window
        size to the sender. Therefore, the largest window that can be
        used is 2^16 bytes = 64 KiB.

        To circumvent this problem, Section 3 of this memo defines a new
        TCP option, "Window Scale", to allow windows larger than 2^16.
        This option defines an implicit scale factor, which is used to
        multiply the window size value found in a TCP header to obtain
        the true window size.

   (2)  Recovery from Losses

        Packet losses in an LFN can have a catastrophic effect on
        throughput.

        To generalize the Fast Retransmit/Fast Recovery mechanism to
        handle multiple packets dropped per window, selective
        acknowledgments are required. Unlike the normal cumulative
        acknowledgments of TCP, selective acknowledgments give the
        sender a complete picture of which segments are queued at the
        receiver and which have not yet arrived.

        Selective acknowledgments are specified in a separate document,
        "A Conservative Selective Acknowledgment (SACK)-based Loss
        Recovery Algorithm for TCP" [RFC6675], and are not discussed
        further in this document.

   (3)  Round-Trip Measurement

        TCP implements reliable data delivery by retransmitting segments
        that are not acknowledged within some retransmission timeout
        (RTO) interval. Accurate dynamic determination of an
        appropriate RTO is essential to TCP performance. RTO is
        determined by estimating the mean and variance of the measured
        round-trip time (RTT), i.e., the time interval between sending a
        segment and receiving an acknowledgment for it [Jacobson88a].

        Section 4.2 introduces a new TCP option, "Timestamps", and then
        defines a mechanism using this option that allows nearly every
        segment, including retransmissions, to be timed at negligible
        computational cost. We use the mnemonic RTTM (Round Trip Time
        Measurement) for this mechanism, to distinguish it from other
        uses of the Timestamps option.

1.2.  TCP Reliability

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments. TCP reliability
   depends upon the existence of a bound on the lifetime of a segment:
   the "Maximum Segment Lifetime" or MSL.

   Duplication of sequence numbers might happen in either of two ways:

   (1)  Sequence number wrap-around on the current connection

        A TCP sequence number contains 32 bits. At a high enough
        transfer rate, the 32-bit sequence space may be "wrapped"
        (cycled) within the time that a segment is delayed in queues.

   (2)  Earlier incarnation of the connection

        Suppose that a connection terminates, either by a proper close
        sequence or due to a host crash, and the same connection (i.e.,
        using the same pair of port numbers) is immediately reopened. A
        delayed segment from the terminated connection could fall within
        the current window for the new incarnation and be accepted as
        valid.

   Duplicates from earlier incarnations, case (2), are avoided by
   enforcing the current fixed MSL of the TCP spec, as explained in
   Section 5.3 and Appendix B. However, case (1), avoiding the reuse of
   sequence numbers within the same connection, requires an MSL bound
   that depends upon the transfer rate, and at high enough rates, a new
   mechanism is required.
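
   To make the rate dependence in case (1) concrete, the following
   Python sketch estimates how long a sender at a steady rate takes to
   consume the entire 32-bit sequence space. The function name
   seconds_to_wrap and the link rates are illustrative assumptions, not
   values taken from this document, and the calculation assumes one
   sequence number per payload byte with no idle time.

```python
# Assumption: the sender transmits payload continuously at the given rate,
# and every payload byte consumes exactly one sequence number.
SEQ_SPACE_BYTES = 2 ** 32  # a TCP sequence number contains 32 bits


def seconds_to_wrap(rate_bits_per_second: float) -> float:
    """Time for a steady sender to cycle through the whole sequence space."""
    bytes_per_second = rate_bits_per_second / 8
    return SEQ_SPACE_BYTES / bytes_per_second


# Hypothetical link rates, chosen only to show the trend.
for label, rate in [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label:>10}: {seconds_to_wrap(rate):10.1f} s")
```

   Under these assumptions the sequence space is consumed in roughly 34
   seconds at 1 Gbit/s, which is shorter than the 2-minute MSL of
   [RFC0793], while at 10 Mbit/s it takes nearly an hour.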

   A possible fix for the problem of cycling the sequence space would be
   to increase the size of the TCP sequence number field. For example,
   the sequence number field (and also the acknowledgment field) could
   be expanded to 64 bits. This could be done either by changing the
   TCP header or by means of an additional option.

   Section 5 presents a different mechanism, which we call PAWS
   (Protection Against Wrapped Sequence numbers), to extend TCP
   reliability to transfer rates well beyond the foreseeable upper limit
   of network bandwidths. PAWS uses the TCP Timestamps option defined
   in Section 4.2 to protect against old duplicates from the same
   connection.

1.3.  Using TCP options

   The extensions defined in this document all use new TCP options.

   When RFC 1323 was published, there was concern that some buggy TCP
   implementation might be crashed by the first appearance of an option
   on a non-SYN segment. However, bugs like that can lead to DoS
   attacks against a TCP, so it is now expected that most TCP
   implementations will properly handle unknown options on non-SYN
   segments. But it is still prudent to be conservative in what you
   send, and avoiding buggy TCP implementations is not the only reason
   for negotiating TCP options on SYN segments. Therefore, for each of
   the extensions defined below, it is recommended that TCP options be
   sent on non-SYN segments only after an exchange of options on the
   SYN segments has indicated that both sides understand the extension.
   Furthermore, an extension option will be sent in a <SYN,ACK> segment
   only if the corresponding option was received in the initial <SYN>
   segment.

   The Timestamps option may appear in any data or ACK segment, adding
   12 bytes (the 10-byte option plus two bytes of padding) to the
   20-byte TCP header. We believe that the bandwidth saved by reducing
   unnecessary retransmission timeouts will more than pay for the extra
   header bandwidth.

   Appendix A contains a recommended layout of the options in TCP
   headers to achieve reasonable data field alignment.

   Finally, we observe that most of the mechanisms defined in this memo
   are important for LFNs and/or very high-speed networks. For
   low-speed networks, it might be a performance optimization to NOT use
   these mechanisms. A TCP vendor concerned about optimal performance
   over low-speed paths might consider turning these extensions off for
   low-speed paths, or allowing a user or installation manager to
   disable them.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  TCP Window Scale Option

3.1.  Introduction

   The window scale extension expands the definition of the TCP window
   to 32 bits and then uses a scale factor to carry this 32-bit value in
   the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The
   scale factor is carried in a new TCP option, Window Scale. This
   option is sent only in a SYN segment (a segment with the SYN bit on),
   hence the window scale is fixed in each direction when a connection
   is opened.

   The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space. In a typical modern
   implementation, this maximum buffer space is set by default but can
   be overridden by a user program before a TCP connection is opened.
   This determines the scale factor, and therefore no new user interface
   is needed for window scaling.

3.2.  Window Scale Option

   The three-byte Window Scale option MAY be sent in a SYN segment by a
   TCP. It has two purposes: (1) indicate that the TCP is prepared to
   do both send and receive window scaling, and (2) communicate a scale
   factor to be applied to its receive window.
   Thus, a TCP that is prepared to scale windows SHOULD send the option,
   even if its own scale factor is 1. The scale factor is limited to a
   power of two and encoded logarithmically, so it may be implemented by
   binary shift operations.

   TCP Window Scale Option (WSopt):

      Kind: 3

      Length: 3 bytes

             +---------+---------+---------+
             | Kind=3  |Length=3 |shift.cnt|
             +---------+---------+---------+
                  1         1         1

   This option is an offer, not a promise; both sides MUST send Window
   Scale options in their SYN segments to enable window scaling in
   either direction. If window scaling is enabled, then the TCP that
   sent this option will right-shift its true receive-window values by
   'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt'
   MAY be zero (offering to scale, while applying a scale factor of 1 to
   the receive window).

   This option MAY be sent in an initial <SYN> segment (i.e., a segment
   with the SYN bit on and the ACK bit off). It MAY also be sent in a
   <SYN,ACK> segment, but only if a Window Scale option was received in
   the initial <SYN> segment. A Window Scale option in a segment
   without a SYN bit SHOULD be ignored.

   The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself
   is never scaled.

3.3.  Using the Window Scale Option

   A model implementation of window scaling is as follows, using the
   notation of [RFC0793]:

   o  All windows are treated as 32-bit quantities for storage in the
      connection control block and for local calculations. This
      includes the send-window (SND.WND) and the receive-window
      (RCV.WND) values, as well as the congestion window.

   o  The connection state is augmented by two window shift counts,
      Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
      and outgoing window fields, respectively.

   o  If a TCP receives a <SYN> segment containing a Window Scale
      option, it sends its own Window Scale option in the <SYN,ACK>
      segment.

   o  The Window Scale option is sent with shift.cnt = R, where R is the
      value that the TCP would like to use for its receive window.

   o  Upon receiving a SYN segment with a Window Scale option containing
      shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
      Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
      Rcv.Wind.Scale to zero.

   o  The window field (SEG.WND) in the header of every incoming
      segment, with the exception of SYN segments, is left-shifted by
      Snd.Wind.Scale bits before updating SND.WND:

         SND.WND = SEG.WND << Snd.Wind.Scale

      (assuming the other conditions of [RFC0793] are met, and using the
      "C" notation "<<" for left-shift).

   o  The window field (SEG.WND) of every outgoing segment, with the
      exception of SYN segments, is set to RCV.WND right-shifted by
      Rcv.Wind.Scale bits:

         SEG.WND = RCV.WND >> Rcv.Wind.Scale

   TCP determines if a data segment is "old" or "new" by testing whether
   its sequence number is within 2^31 bytes of the left edge of the
   window, and if it is not, discarding the data as "old". To ensure
   that new data is never mistakenly considered old and vice versa, the
   left edge of the sender's window has to be at most 2^31 away from the
   right edge of the receiver's window. The same holds for the sender's
   right edge and the receiver's left edge. Since the right and left
   edges of either the sender's or receiver's window differ by the
   window size, and since the sender and receiver windows can be out of
   phase by at most the window size, the above constraints imply that
   two times the max window size must be less than 2^31, or

      max window < 2^30

   Since the max window is 2^S (where S is the scaling shift count)
   times at most 2^16 - 1 (the maximum unscaled window), the maximum
   window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
   MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a
   Window Scale option is received with a shift.cnt value exceeding 14,
   the TCP SHOULD log the error but use 14 instead of the specified
   value.

   The scale factor applies only to the Window field as transmitted in
   the TCP header; each TCP using extended windows will maintain the
   window values locally as 32-bit numbers. For example, the
   "congestion window" computed by Slow Start and Congestion Avoidance
   is not affected by the scale factor, so window scaling will not
   introduce quantization into the congestion window.

3.4.  Addressing Window Retraction

   When a non-zero scale factor is in use, there are instances when a
   retracted window can be offered [Mathis08]. The end of the window
   will be on a boundary based on the granularity of the scale factor
   being used. If the sequence number is then updated by a number of
   bytes smaller than that granularity, the TCP will have to advertise
   either a new window that is beyond what it previously advertised
   (and perhaps beyond the buffer), or a smaller window, which will
   cause the TCP window to shrink. Implementations MUST ensure that
   they handle a shrinking window, as specified in Section 4.2.2.16 of
   [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in-window, any segment that would
       have been in-window for any ACK sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent ACK, if more
       than one segment has arrived since the application consumed any
       data in the receive buffer).

   On the sender side:

   3)  The initial transmission MUST honor the window announced by the
       most recent ACK.

   4)  On first retransmission, or if the sequence number is
       out-of-window by less than (2^Rcv.Wind.Scale), then do normal
       retransmission(s) without regard to receiver window as long as
       the original segment was in window when it was sent.

   5)  On subsequent retransmissions, treat such ACKs as zero window
       probes.

4.  RTTM -- Round-Trip Time Measurement

4.1.  Introduction

   Accurate and current RTT estimates are necessary to adapt to changing
   traffic conditions and to avoid an instability known as "congestion
   collapse" [RFC0896] in a busy network. However, accurate measurement
   of RTT may be difficult both in theory and in implementation.

   Many TCP implementations base their RTT measurements upon a sample of
   one packet per window or less. While this yields an adequate
   approximation to the RTT for small windows, it results in an
   unacceptably poor RTT estimate for an LFN. If we look at RTT
   estimation as a signal processing problem (which it is), a data
   signal at some frequency, the packet rate, is being sampled at a
   lower frequency, the window rate. This lower sampling frequency
   violates Nyquist's criterion and may therefore introduce "aliasing"
   artifacts into the estimated RTT [Hamming77].

   A good RTT estimator with a conservative retransmission timeout
   calculation can tolerate aliasing when the sampling frequency is
   "close" to the data frequency. For example, with a window of 8
   packets, the sample rate is 1/8 the data frequency -- less than an
   order of magnitude different. However, when the window is tens or
   hundreds of packets, the RTT estimator may be seriously in error,
   resulting in spurious retransmissions.

   If there are dropped packets, the problem becomes worse. Zhang
   [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
   possible to accumulate reliable RTT estimates if retransmitted
   segments are included in the estimate. Since a full window of data
   will have been transmitted prior to a retransmission, all of the
   segments in that window will have to be ACKed before the next RTT
   sample can be taken. This means at least an additional window's
   worth of time between RTT measurements and, as the error rate
   approaches one per window of data (e.g., 10^-6 errors per bit for the
   Wideband satellite network), it becomes effectively impossible to
   obtain a valid RTT measurement.

   A solution to these problems, which actually simplifies the sender
   substantially, is as follows: using TCP options, the sender places a
   timestamp in each data segment, and the receiver reflects these
   timestamps back in ACK segments. Then a single subtraction gives the
   sender an accurate RTT measurement for every ACK segment (which will
   correspond to every other data segment, with a sensible receiver).
   We call this the RTTM (Round-Trip Time Measurement) mechanism.

   It is vitally important to use the RTTM mechanism with big windows;
   otherwise, the door is opened to some dangerous instabilities due to
   aliasing. Furthermore, the option is probably useful for all TCPs,
   since it simplifies the sender.

4.2.  TCP Timestamps Option

   TCP is a symmetric protocol, allowing data to be sent at any time in
   either direction, and therefore timestamp echoing may occur in either
   direction. For simplicity and symmetry, we specify that timestamps
   always be sent and echoed in both directions. For efficiency, we
   combine the timestamp and timestamp reply fields into a single TCP
   Timestamps Option.

   TCP Timestamps Option (TSopt):

      Kind: 8

      Length: 10 bytes

      +-------+-------+---------------------+---------------------+
      |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
      +-------+-------+---------------------+---------------------+
          1       1              4                     4

   The Timestamps option carries two four-byte timestamp fields.
   The Timestamp Value field (TSval) contains the current value of the
   timestamp clock of the TCP sending the option.

   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
   in the TCP header; if it is valid, it echoes a timestamp value that
   was sent by the remote TCP in the TSval field of a Timestamps option.
   When TSecr is not valid, its value MUST be zero. However, a value of
   zero does not imply that TSecr is invalid. The TSecr value will
   generally be from the most recent Timestamps Option that was
   received; however, there are exceptions that are explained below.

   A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
   segment (i.e., a segment containing a SYN bit and no ACK bit). Once
   a TSopt has been sent or received in a non-<SYN> segment, it MUST be
   sent in all segments. Once a TSopt has been received in a non-<SYN>
   segment, then any successive segment that is received without the RST
   bit and without a TSopt MAY be dropped without further processing,
   and an ACK of the current SND.UNA generated.

   In the case of crossing SYN packets where one SYN contains a TSopt
   and the other doesn't, both sides SHOULD put a TSopt in the <SYN,ACK>
   segment.

4.3.  The RTTM Mechanism

   RTTM places a Timestamps option in every segment, with a TSval that
   is obtained from a (virtual) "timestamp clock". Values of this clock
   MUST be at least approximately proportional to real time, in order to
   measure actual RTT.

   These TSval values are echoed in TSecr values in the reverse
   direction. The difference between a received TSecr value and the
   current timestamp clock value provides an RTT measurement.

   When timestamps are used, every segment that is received will contain
   a TSecr value. However, these values cannot all be used to update
   the measured RTT. The following example illustrates why. It shows a
   one-way data flow with segments arriving in sequence without loss.
   Here A, B, C... represent data blocks occupying successive blocks of
   sequence numbers, and ACK(A),... represent the corresponding
   cumulative acknowledgments. The two timestamp fields of the
   Timestamps option are shown symbolically as <TSval=x,TSecr=y>. Each
   TSecr field contains the value most recently received in a TSval
   field.

          TCP  A                                     TCP B

                          <A,TSval=1,TSecr=120> ----->

               <---- <ACK(A),TSval=127,TSecr=1>

                          <B,TSval=5,TSecr=127> ----->

               <---- <ACK(B),TSval=131,TSecr=5>

          . . . . . . . . . . . . . . . . . . . . . .

                          <C,TSval=65,TSecr=131> ---->

               <---- <ACK(C),TSval=191,TSecr=65>

                          (etc.)

   The dotted line marks a pause (60 time units long) in which A had
   nothing to send. Note that this pause inflates the RTT which B could
   infer from receiving TSecr=131 in data segment C. Thus, in one-way
   data flows, RTTM in the reverse direction measures a value that is
   inflated by gaps in sending data. However, the following rule
   prevents a resulting inflation of the measured RTT:

      RTTM Rule: A TSecr value received in a segment is used to update
      the averaged RTT measurement only if

      a)  the segment acknowledges some new data, i.e., only if it
          advances the left edge of the send window, and

      b)  the segment does not indicate any loss or reordering, i.e.,
          it contains no SACK options.

   Since TCP B is not sending data, the data segment C does not
   acknowledge any new data when it arrives at B. Thus, the inflated
   sample is not used to update B's RTT estimate.

   Implementers should note that with Timestamps multiple RTTMs can be
   taken per RTT. Many RTO estimators have a weighting factor based on
   an implicit assumption that at most one RTTM will be sampled per RTT.
   When using multiple RTTMs per RTT to update the RTO estimator, the
   weighting factor needs to be decreased to take into account the more
   frequent RTTMs.
For example, an implementation could choose to just 588 use one sample per RTT to update the RTO estimator, or vary the gain 589 based on the congestion window, or take an average of all the RTTM 590 measurements received over one RTT, and then use that value to update 591 the RTO estimator. This document does not prescribe any particular 592 method for modifying the RTO estimator. 594 4.4. Which Timestamp to Echo 596 If more than one Timestamps option is received before a reply segment 597 is sent, the TCP must choose only one of the TSvals to echo, ignoring 598 the others. To minimize the state kept in the receiver (i.e., the 599 number of unprocessed TSvals), the receiver should be required to 600 retain at most one timestamp in the connection control block. 602 There are three situations to consider: 604 (A) Delayed ACKs. 606 Many TCP's acknowledge only every Kth segment out of a group of 607 segments arriving within a short time interval; this policy is 608 known generally as "delayed ACKs". The data-sender TCP must 609 measure the effective RTT, including the additional time due to 610 delayed ACKs, or else it will retransmit unnecessarily. Thus, 611 when delayed ACKs are in use, the receiver SHOULD reply with the 612 TSval field from the earliest unacknowledged segment. 614 (B) A hole in the sequence space (segment(s) have been lost). 616 The sender will continue sending until the window is filled, and 617 the receiver may be generating ACKs as these out-of-order 618 segments arrive (e.g., to aid "fast retransmit"). 620 The lost segment is probably a sign of congestion, and in that 621 situation the sender should be conservative about 622 retransmission. Furthermore, it is better to overestimate than 623 underestimate the RTT. An ACK for an out-of-order segment 624 SHOULD therefore contain the timestamp from the most recent 625 segment that advanced the window. 627 The same situation occurs if segments are re-ordered by the 628 network. 
630 (C) A filled hole in the sequence space. 632 The segment that fills the hole represents the most recent 633 measurement of the network characteristics. An RTT computed from 634 an earlier segment would probably include the sender's 635 retransmit time-out, badly biasing the sender's average RTT 636 estimate. Thus, the timestamp from the latest segment (which 637 filled the hole) MUST be echoed. 639 An algorithm that covers all three cases is described in the 640 following rules for Timestamps option processing on a synchronized 641 connection: 643 (1) The connection state is augmented with two 32-bit slots: 645 TS.Recent holds a timestamp to be echoed in TSecr whenever a 646 segment is sent, and Last.ACK.sent holds the ACK field from the 647 last segment sent. Last.ACK.sent will equal RCV.NXT except when 648 ACKs have been delayed. 650 (2) If: 652 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 654 then SEG.TSval is copied to TS.Recent; otherwise, it is ignored. 656 (3) When a TSopt is sent, its TSecr field is set to the current 657 TS.Recent value. 659 The following examples illustrate these rules. Here A, B, C... 660 represent data segments occupying successive blocks of sequence 661 numbers, and ACK(A),... represent the corresponding acknowledgment 662 segments. Note that ACK(A) has the same sequence number as B. We 663 show only one direction of timestamp echoing, for clarity. 665 o Packets arrive in sequence, and some of the ACKs are delayed. 667 By case (A), the timestamp from the oldest unacknowledged segment 668 is echoed.
670 TS.Recent
671 <A, TSval=1> ------------------->
672 1
673 <B, TSval=2> ------------------->
674 1
675 <C, TSval=3> ------------------->
676 1
677 <---- <ACK(C), TSecr=1>
678 (etc)
680 o Packets arrive out of order, and every packet is acknowledged. 682 By case (B), the timestamp from the last segment that advanced the 683 left window edge is echoed, until the missing segment arrives; it 684 is echoed according to case (C).
The same sequence would occur if 685 segments B and D were lost and retransmitted.
687 TS.Recent
688 <A, TSval=1> ------------------->
689 1
690 <---- <ACK(A), TSecr=1>
691 1
692 <C, TSval=3> ------------------->
693 1
694 <---- <ACK(A), TSecr=1>
695 1
696 <B, TSval=2> ------------------->
697 2
698 <---- <ACK(C), TSecr=2>
699 2
700 <E, TSval=5> ------------------->
701 2
702 <---- <ACK(C), TSecr=2>
703 2
704 <D, TSval=4> ------------------->
705 4
706 <---- <ACK(E), TSecr=4>
707 (etc)
709 5. PAWS -- Protection Against Wrapped Sequence Numbers 711 5.1. Introduction 713 Section 5.2 describes a simple mechanism to reject old duplicate 714 segments that might corrupt an open TCP connection; we call this 715 mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS 716 operates within a single TCP connection, using state that is saved in 717 the connection control block. Section 5.3 and Appendix B discuss the 718 implications of the PAWS mechanism for avoiding old duplicates from 719 previous incarnations of the same connection. 721 5.2. The PAWS Mechanism 723 PAWS uses the same TCP Timestamps option as the RTTM mechanism 724 described earlier, and assumes that every received TCP segment 725 (including data and ACK segments) contains a timestamp SEG.TSval 726 whose values are monotonically non-decreasing in time. The basic 727 idea is that a segment can be discarded as an old duplicate if it is 728 received with a timestamp SEG.TSval less than some timestamp recently 729 received on this connection. 731 In both the PAWS and the RTTM mechanisms, the "timestamps" are 32-bit 732 unsigned integers in a modular 32-bit space. Thus, "less than" is 733 defined the same way it is for TCP sequence numbers, and the same 734 implementation techniques apply. If s and t are timestamp values, 736 s < t if 0 < (t - s) < 2^31, 738 computed in unsigned 32-bit arithmetic. 740 The choice of incoming timestamps to be saved for this comparison 741 MUST guarantee a value that is monotonically increasing.
For 742 example, we might save the timestamp from the segment that last 743 advanced the left edge of the receive window, i.e., the most recent 744 in-sequence segment. Instead, we choose the value TS.Recent 745 introduced in Section 4.4 for the RTTM mechanism, since using a 746 common value for both PAWS and RTTM simplifies the implementation of 747 both. As Section 4.4 explained, TS.Recent differs from the timestamp 748 from the last in-sequence segment only in the case of delayed ACKs, 749 and therefore by less than one window. Either choice will therefore 750 protect against sequence number wrap-around. 752 RTTM was specified in a symmetrical manner, so that TSval timestamps 753 are carried in both data and ACK segments and are echoed in TSecr 754 fields carried in returning ACK or data segments. PAWS submits all 755 incoming segments to the same test, and therefore protects against 756 duplicate ACK segments as well as data segments. (An alternative 757 non-symmetric algorithm would protect against old duplicate ACKs: the 758 sender of data would reject incoming ACK segments whose TSecr values 759 were less than the TSecr saved from the last segment whose ACK field 760 advanced the left edge of the send window. This algorithm was deemed 761 to lack economy of mechanism and symmetry.) 763 TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to 764 initialize PAWS. PAWS protects against old duplicate non-SYN 765 segments, and duplicate SYN segments received while there is a 766 synchronized connection. Duplicate <SYN> and <SYN,ACK> segments 767 received when there is no connection will be discarded by the normal 768 3-way handshake and sequence number checks of TCP. 770 [RFC1323] recommended that RST segments NOT carry timestamps, and 771 that they be acceptable regardless of their timestamp. At that time, 772 the thinking was that old duplicate RST segments should be 773 exceedingly unlikely, and their cleanup function should take 774 precedence over timestamps.
More recently, discussions about various 775 blind attacks on TCP connections have raised the suggestion that if 776 the Timestamps option is present, SEG.TSecr could be used to provide 777 stricter acceptance tests for RST packets. While this is still under 778 discussion, to enable research into this area it is now RECOMMENDED 779 that, when generating a RST, if the packet causing the RST to be 780 generated contained a Timestamps option, the RST also contain a 781 Timestamps option. In the RST segment, SEG.TSecr SHOULD be set to 782 SEG.TSval from the incoming packet and SEG.TSval SHOULD be set to 783 zero. If a RST is being generated because of a user abort, and 784 Snd.TS.OK is set, then a Timestamps option SHOULD be included in the 785 RST. When a RST packet is received, it MUST NOT be subjected to PAWS 786 checks, and information from the Timestamps option MUST NOT be used 787 to update connection state information. SEG.TSecr MAY be used to 788 provide stricter RST acceptance checks. 790 5.2.1. Basic PAWS Algorithm 792 The PAWS algorithm requires the following processing to be performed 793 on all incoming segments for a synchronized connection: 795 R1) If there is a Timestamps option in the arriving segment, 796 SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion) 797 and the RST bit is not set, then treat the arriving segment as 798 not acceptable: 800 Send an acknowledgement in reply as specified in [RFC0793] 801 page 69 and drop the segment. 803 Note: it is necessary to send an ACK segment in order to 804 retain TCP's mechanisms for detecting and recovering from 805 half-open connections. For example, see Figure 10 of 806 [RFC0793]. 808 R2) If the segment is outside the window, reject it (normal TCP 809 processing). 811 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see 812 Section 4.4), then record its timestamp in TS.Recent.
814 R4) If an arriving segment is in-sequence (i.e., at the left window 815 edge), then accept it normally. 817 R5) Otherwise, treat the segment as a normal in-window, out-of- 818 sequence TCP segment (e.g., queue it for later delivery to the 819 user). 821 Steps R2, R4, and R5 are the normal TCP processing steps specified by 822 [RFC0793]. 824 It is important to note that the timestamp is checked only when a 825 segment first arrives at the receiver, regardless of whether it is 826 in-sequence or it must be queued for later delivery. 828 Consider the following example. 830 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been 831 sent, where the letter indicates the sequence number and the digit 832 represents the timestamp. Suppose also that segment B.1 has been 833 lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ..., 834 Z.1 are considered acceptable and are queued. When B is 835 retransmitted as segment B.2 (using the latest timestamp), it 836 fills the hole and causes all the segments through Z to be 837 acknowledged and passed to the user. The timestamps of the queued 838 segments are *not* inspected again at this time, since they have 839 already been accepted. When B.2 is accepted, TS.Recent is set to 840 2. 842 This rule allows reasonable performance under loss. A full window of 843 data is in transit at all times, and after a loss a full window less 844 one packet will show up out-of-sequence to be queued at the receiver 845 (e.g., up to ~2^30 bytes of data); the timestamp option must not 846 result in discarding this data. 848 In certain unlikely circumstances, the algorithm of rules R1-R5 could 849 lead to discarding some segments unnecessarily, as shown in the 850 following example: 852 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been 853 sent in sequence and that segment B.1 has been lost. Furthermore, 854 suppose delivery of some of C.1, ... 
Z.1 is delayed until AFTER 855 the retransmission B.2 arrives at the receiver. These delayed 856 segments will be discarded unnecessarily when they do arrive, 857 since their timestamps are now out of date. 859 This case is very unlikely to occur. If the retransmission was 860 triggered by a timeout, some of the segments C.1, ... Z.1 must have 861 been delayed longer than the RTO time. This is presumably an 862 unlikely event, or there would be many spurious timeouts and 863 retransmissions. If B's retransmission was triggered by the "fast 864 retransmit" algorithm, i.e., by duplicate ACKs, then the queued 865 segments that caused these ACKs must have been received already. 867 Even if a segment were delayed past the RTO, the Fast Retransmit 868 mechanism [Jacobson90c] will cause the delayed packets to be 869 retransmitted at the same time as B.2, avoiding an extra RTT and 870 therefore causing a very small performance penalty. 872 We know of no case with a significant probability of occurrence in 873 which timestamps will cause performance degradation by unnecessarily 874 discarding segments. 876 5.2.2. Timestamp Clock 878 It is important to understand that the PAWS algorithm does not 879 require clock synchronization between sender and receiver. The 880 sender's timestamp clock is used to stamp the segments, and the 881 sender uses the echoed timestamp to measure RTTs. However, the 882 receiver treats the timestamp as simply a monotonically increasing 883 serial number, without any necessary connection to its clock. From 884 the receiver's viewpoint, the timestamp is acting as a logical 885 extension of the high-order bits of the sequence number. 887 The receiver algorithm does place some requirements on the frequency 888 of the timestamp clock. 890 (a) The timestamp clock must not be "too slow". 892 It MUST tick at least once for each 2^31 bytes sent. 
In fact, 893 in order to be useful to the sender for round trip timing, the 894 clock SHOULD tick at least once per window's worth of data, and 895 even with the window extension defined in Section 3.2, 2^31 896 bytes must be at least two windows. 898 To make this more quantitative, any clock faster than 1 tick/sec 899 will reject old duplicate segments for link speeds of ~8 Gbps. 901 A 1 ms timestamp clock will work at link speeds up to 8 Tbps 902 (8*10^12) bps! 904 (b) The timestamp clock must not be "too fast". 906 The recycling time of the timestamp clock MUST be greater than 907 MSL seconds. Since the clock (timestamp) is 32 bits and the 908 worst-case MSL is 255 seconds, the maximum acceptable clock 909 frequency is one tick every 59 ns. 911 However, it is desirable to establish a much longer recycle 912 period, in order to handle outdated timestamps on idle 913 connections (see Section 5.2.3), and to relax the MSL 914 requirement for preventing sequence number wrap-around. With a 915 1 ms timestamp clock, the 32-bit timestamp will wrap its sign 916 bit in 24.8 days. Thus, it will reject old duplicates on the 917 same connection if MSL is 24.8 days or less. This appears to be 918 a very safe figure; an MSL of 24.8 days or longer can probably 919 be assumed in the internet without requiring precise MSL 920 enforcement. 922 Based upon these considerations, we choose a timestamp clock 923 frequency in the range 1 ms to 1 sec per tick. This range also 924 matches the requirements of the RTTM mechanism, which does not need 925 much more resolution than the granularity of the retransmit timer, 926 e.g., tens or hundreds of milliseconds. 928 The PAWS mechanism also puts a strong monotonicity requirement on the 929 sender's timestamp clock. The method of implementation of the 930 timestamp clock to meet this requirement depends upon the system 931 hardware and software. 
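One way to meet the monotonicity requirement when the underlying system clock may be stepped backwards is sketched below. This is only an illustration under stated assumptions: the class name, the injected millisecond clock source `system_ms`, and the 1 ms tick are hypothetical choices made for the example and are not defined by this document. The sketch keeps a variable offset, initially zero, and enlarges it whenever a new reading would otherwise be smaller than the previous one.

```python
from typing import Callable, Optional


class MonotonicTimestampClock:
    """Illustrative timestamp clock that never runs backwards.

    `system_ms` is a hypothetical callable returning the current system
    time in milliseconds; it may jump backwards when the system clock is
    changed abruptly.
    """

    def __init__(self, system_ms: Callable[[], int]) -> None:
        self._system_ms = system_ms
        self._offset = 0                    # offset is initialized to zero
        self._last: Optional[int] = None    # previous (unwrapped) value

    def now(self) -> int:
        value = self._system_ms() + self._offset
        if self._last is not None and value < self._last:
            # The system clock stepped backwards: grow the offset so the
            # new value equals the previously returned value.
            self._offset += self._last - value
            value = self._last
        self._last = value
        # TSval is a 32-bit quantity, so only the low 32 bits are sent.
        return value & 0xFFFFFFFF
```

With a clock source that reads 1000, 1005, 200, 210 in turn, the sketch returns 1000, 1005, 1005, 1015: the backward step is absorbed by the offset and later readings continue to advance.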
933 o Some hosts have a hardware clock that is guaranteed to be 934 monotonic between hardware resets. 936 o A clock interrupt may be used to simply increment a binary integer 937 by 1 periodically. 939 o The timestamp clock may be derived from a system clock that is 940 subject to being abruptly changed, by adding a variable offset 941 value. This offset is initialized to zero. When a new timestamp 942 clock value is needed, the offset can be adjusted as necessary to 943 make the new value equal to or larger than the previous value 944 (which was saved for this purpose). 946 5.2.3. Outdated Timestamps 948 If a connection remains idle long enough for the timestamp clock of 949 the other TCP to wrap its sign bit, then the value saved in TS.Recent 950 will become too old; as a result, the PAWS mechanism will cause all 951 subsequent segments to be rejected, freezing the connection (until 952 the timestamp clock wraps its sign bit again). 954 With the chosen range of timestamp clock frequencies (1 sec to 1 ms), 955 the time to wrap the sign bit will be between 24.8 days and 24800 956 days. A TCP connection that is idle for more than 24 days and then 957 comes to life is exceedingly unusual. However, it is undesirable in 958 principle to place any limitation on TCP connection lifetimes. 960 We therefore require that an implementation of PAWS include a 961 mechanism to "invalidate" the TS.Recent value when a connection is 962 idle for more than 24 days. (An alternative solution to the problem 963 of outdated timestamps would be to send keep-alive segments at a very 964 low rate, but still more often than the wrap-around time for 965 timestamps, e.g., once a day. This would impose negligible overhead. 966 However, the TCP specification has never included keep-alives, so the 967 solution based upon invalidation was chosen.) 969 Note that a TCP does not know the frequency, and therefore, the 970 wraparound time, of the other TCP, so it must assume the worst. 
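The idle-connection rule above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function names, the use of seconds for the idle time, and the idea of recording when TS.Recent was last updated are assumptions made for the example. It combines the modular "less than" comparison of Section 5.2 with a 24-day validity limit on TS.Recent, and it treats a segment as acceptable once TS.Recent is considered stale.

```python
TS_MOD = 1 << 32                         # timestamps live in a 32-bit modular space
PAWS_IDLE_LIMIT_S = 24 * 24 * 60 * 60    # 24 days, the idle limit used in the text


def ts_lt(s: int, t: int) -> bool:
    """Return True if s < t, i.e. 0 < (t - s) < 2^31 in unsigned 32-bit arithmetic."""
    return 0 < ((t - s) % TS_MOD) < (1 << 31)


def ts_recent_valid(last_update_s: float, now_s: float) -> bool:
    """TS.Recent is treated as valid if it was updated within the idle limit."""
    return (now_s - last_update_s) <= PAWS_IDLE_LIMIT_S


def paws_rejects(seg_tsval: int, ts_recent: int,
                 last_update_s: float, now_s: float) -> bool:
    """Reject only if the timestamp check fails AND TS.Recent is still valid."""
    return ts_lt(seg_tsval, ts_recent) and ts_recent_valid(last_update_s, now_s)
```

The validity test is evaluated only when the timestamp comparison fails, so the common case costs a single modular comparison.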
The 971 validity of TS.Recent needs to be checked only if the basic PAWS 972 timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If 973 TS.Recent is found to be invalid, then the segment is accepted, 974 regardless of the failure of the timestamp check, and rule R3 updates 975 TS.Recent with the TSval from the new segment. 977 To detect how long the connection has been idle, the TCP MAY update a 978 clock or timestamp value associated with the connection whenever 979 TS.Recent is updated, for example. The details will be 980 implementation-dependent. 982 5.2.4. Header Prediction 984 "Header prediction" [Jacobson90a] is a high-performance transport 985 protocol implementation technique that is most important for high- 986 speed links. This technique optimizes the code for the most common 987 case, receiving a segment correctly and in order. Using header 988 prediction, the receiver asks the question, "Is this segment the next 989 in sequence?" This question can be answered in fewer machine 990 instructions than the question, "Is this segment within the window?" 992 Adding header prediction to our timestamp procedure leads to the 993 following recommended sequence for processing an arriving TCP 994 segment: 996 H1) Check timestamp (same as step R1 above) 998 H2) Do header prediction: if segment is next in sequence and if 999 there are no special conditions requiring additional processing, 1000 accept the segment, record its timestamp, and skip H3. 1002 H3) Process the segment normally, as specified in RFC 793. This 1003 includes dropping segments that are outside the window and 1004 possibly sending acknowledgments, and queuing in-window, out-of- 1005 sequence segments. 1007 Another possibility would be to interchange steps H1 and H2, i.e., to 1008 perform the header prediction step H2 FIRST, and perform H1 and H3 1009 only when header prediction fails. 
This could be a performance 1010 improvement, since the timestamp check in step H1 is very unlikely to 1011 fail, and it requires unsigned modulo arithmetic. To perform this 1012 check on every single segment is contrary to the philosophy of header 1013 prediction. We believe that this change might produce a measurable 1014 reduction in CPU time for TCP protocol processing on high-speed 1015 networks. 1017 However, putting H2 first would create a hazard: a segment from 2^32 1018 bytes in the past might arrive at exactly the wrong time and be 1019 accepted mistakenly by the header-prediction step. The following 1020 reasoning has been introduced in [RFC1185] to show that the 1021 probability of this failure is negligible. 1023 If all segments are equally likely to show up as old duplicates, 1024 then the probability of an old duplicate exactly matching the left 1025 window edge is the maximum segment size (MSS) divided by the size 1026 of the sequence space. This ratio must be less than 2^-16, since 1027 MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20 1028 for a FDDI link. However, the older a segment is, the less likely 1029 it is to be retained in the Internet, and under any reasonable 1030 model of segment lifetime the probability of an old duplicate 1031 exactly at the left window edge must be much smaller than 2^-16. 1033 The 16 bit TCP checksum also allows a basic unreliability of one 1034 part in 2^16. A protocol mechanism whose reliability exceeds the 1035 reliability of the TCP checksum should be considered "good 1036 enough", i.e., it won't contribute significantly to the overall 1037 error rate. We therefore believe we can ignore the problem of an 1038 old duplicate being accepted by doing header prediction before 1039 checking the timestamp. 1041 However, this probabilistic argument is not universally accepted, and 1042 the consensus at present is that the performance gain does not 1043 justify the hazard in the general case. 
It is therefore recommended 1044 that H2 follow H1. 1046 5.2.5. IP Fragmentation 1048 At high data rates, the protection against old packets provided by 1049 PAWS can be circumvented by errors in IP fragment reassembly (see 1050 [RFC4963]). The only way to protect against incorrect IP fragment 1051 reassembly is to not allow the packets to be fragmented. This is 1052 done by setting the Don't Fragment (DF) bit in the IP header. 1053 Setting the DF bit implies the use of Path MTU Discovery as described 1054 in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation 1055 that implements PAWS MUST also implement Path MTU Discovery. 1057 5.3. Duplicates from Earlier Incarnations of Connection 1059 The PAWS mechanism protects against errors due to sequence number 1060 wrap-around on high-speed connections. Segments from an earlier 1061 incarnation of the same connection are also a potential cause of old 1062 duplicate errors. In both cases, the TCP mechanisms to prevent such 1063 errors depend upon the enforcement of a maximum segment lifetime 1064 (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a 1065 detailed discussion). Unlike the case of sequence space wrap-around, 1066 the MSL required to prevent old duplicate errors from earlier 1067 incarnations does not depend upon the transfer rate. If the IP layer 1068 enforces the recommended 2 minute MSL of TCP, and if the TCP rules 1069 are followed, TCP connections will be safe from earlier incarnations, 1070 no matter how high the network speed. Thus, the PAWS mechanism is 1071 not required for this case. 1073 We may still ask whether the PAWS mechanism can provide additional 1074 security against old duplicates from earlier connections, allowing us 1075 to relax the enforcement of MSL by the IP layer. Appendix B explores 1076 this question, showing that further assumptions and/or mechanisms are 1077 required, beyond those of PAWS. This is not part of the current 1078 extension. 1080 6. 
Conclusions and Acknowledgements 1082 This memo presented a set of extensions to TCP to provide efficient 1083 operation over large-bandwidth*delay-product paths and reliable 1084 operation over very high-speed paths. These extensions are designed 1085 to provide compatible interworking with TCP's that do not implement 1086 the extensions. 1088 These mechanisms are implemented using new TCP options for scaled 1089 windows and timestamps. The timestamps are used for two distinct 1090 mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protection 1091 Against Wrapped Sequences). 1093 The Window Scale option was originally suggested by Mike St. Johns of 1094 USAF/DCA. The present form of the option was suggested by Mike 1095 Karels of UC Berkeley in response to a more cumbersome scheme defined 1096 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1097 description in [RFC1185]. 1099 Finally, much of this work originated as the result of discussions 1100 within the End-to-End Task Force on the theoretical limitations of 1101 transport protocols in general and TCP in particular. Task force 1102 members and others on the end2end-interest list have made valuable 1103 contributions by pointing out flaws in the algorithms and the 1104 documentation. Continued discussion and development since the 1105 publication of [RFC1323] originally occurred in the IETF TCP Large 1106 Windows Working Group, later on in the End-to-End Task Force, and 1107 most recently in the IETF TCP Maintenance Working Group. The authors 1108 are grateful for all these contributions. 1110 7. Security Considerations 1112 The TCP sequence space is a fixed size, and as the window becomes 1113 larger it becomes easier for an attacker to generate forged packets 1114 that can fall within the TCP window, and be accepted as valid 1115 packets.
While use of Timestamps and PAWS can help to mitigate this, 1116 when using PAWS, if an attacker is able to forge a packet that is 1117 acceptable to the TCP connection, a timestamp that is in the future 1118 would cause valid packets to be dropped due to PAWS checks. Hence, 1119 implementers should take care to not open the TCP window drastically 1120 beyond the requirements of the connection. 1122 Middle boxes and options: If a middle box removes TCP options from 1123 the SYN, such as TSopt, a high speed connection that needs PAWS would 1124 not have that protection. In this situation, an implementer could 1125 provide a mechanism for the application to determine whether or not 1126 PAWS is in use on the connection, and choose to terminate the 1127 connection if that protection doesn't exist. 1129 Mechanisms to protect the TCP header from modification should also 1130 protect the TCP options. 1132 A naive implementation that derives the timestamp clock value 1133 directly from a system uptime clock may unintentionally leak the 1134 system uptime to an attacker. This does not directly compromise any of 1135 the mechanisms described in this document. However, this may be 1136 valuable information to a potential attacker. An implementer should 1137 evaluate the potential impact and mitigate this accordingly (e.g., by 1138 using a random offset for the timestamp clock on each connection, or 1139 using an external, real-time derived timestamp clock source). 1141 Expanding the TCP window beyond 64K for IPv6 allows Jumbograms 1142 [RFC2675] to be used when the local network supports packets larger 1143 than 64K. When larger TCP packets are used, the TCP checksum becomes 1144 weaker. 1146 8. IANA Considerations 1148 This document has no actions for IANA. 1150 9. References 1152 9.1. Normative References 1154 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1155 RFC 793, September 1981. 1157 [RFC1191] Mogul, J. and S.
Deering, "Path MTU discovery", RFC 1191, 1158 November 1990. 1160 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1161 Requirement Levels", BCP 14, RFC 2119, March 1997. 1163 9.2. Informative References 1165 [Garlick77] 1166 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable 1167 Host-to-Host Protocols", Proc. Second Berkeley Workshop on 1168 Distributed Data Management and Computer Networks, 1169 May 1977. 1171 [Hamming77] 1172 Hamming, R., "Digital Filters", Prentice Hall, Englewood 1173 Cliffs, N.J. ISBN 0-13-212571-4, 1977. 1175 [Jacobson88a] 1176 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1177 '88, Stanford, CA., August 1988. 1180 [Jacobson90a] 1181 Jacobson, V., "4BSD Header Prediction", ACM Computer 1182 Communication Review, April 1990. 1184 [Jacobson90c] 1185 Jacobson, V., "Modified TCP congestion avoidance 1186 algorithm", Message to the end2end-interest mailing list, 1187 April 1990. 1190 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1191 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and 1192 Comm., Scottsdale, Arizona, March 1986. 1195 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in 1196 Reliable Transport Protocols", Proc. SIGCOMM '87, 1197 August 1987. 1199 [Martin03] 1200 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg 1201 mailing list, September 2003. 1204 [Mathis08] 1205 Mathis, M., "[tcpm] Example of 1323 window retraction 1206 problem", Message to the tcpm mailing list, March 2008. 1210 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1211 RFC 896, January 1984. 1213 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay 1214 paths", RFC 1072, October 1988. 1216 [RFC1110] McKenzie, A., "Problem with the TCP big window option", 1217 RFC 1110, August 1989. 1219 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1220 Communication Layers", STD 3, RFC 1122, October 1989.
1222 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for 1223 High-Speed Paths", RFC 1185, October 1990. 1225 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 1226 for High Performance", RFC 1323, May 1992. 1228 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 1229 for IP version 6", RFC 1981, August 1996. 1231 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1232 Selective Acknowledgment Options", RFC 2018, October 1996. 1234 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1235 Control", RFC 2581, April 1999. 1237 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1238 RFC 2675, August 1999. 1240 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1241 Extension to the Selective Acknowledgement (SACK) Option 1242 for TCP", RFC 2883, July 2000. 1244 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1245 Discovery", RFC 4821, March 2007. 1247 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1248 Errors at High Data Rates", RFC 4963, July 2007. 1250 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1251 Control", RFC 5681, September 2009. 1253 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 1254 and Y. Nishida, "A Conservative Loss Recovery Algorithm 1255 Based on Selective Acknowledgment (SACK) for TCP", 1256 RFC 6675, August 2012. 1258 [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", 1259 RFC 6691, July 2012. 1261 [Watson81] 1262 Watson, R., "Timer-based Mechanisms in Reliable Transport 1263 Protocol Connection Management", Computer Networks, Vol. 1264 5, 1981. 1266 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM 1267 '86, Stowe, VT, August 1986. 1269 Appendix A. 
Implementation Suggestions 1271 TCP Option Layout 1273 The following layout is recommended for sending options on 1274 non-SYN segments, to achieve maximum feasible alignment of 32-bit and 1275 64-bit machines.
1277 +--------+--------+--------+--------+
1278 |  NOP   |  NOP   | TSopt  |   10   |
1279 +--------+--------+--------+--------+
1280 |          TSval timestamp          |
1281 +--------+--------+--------+--------+
1282 |          TSecr timestamp          |
1283 +--------+--------+--------+--------+
1285 Interaction with the TCP Urgent Pointer 1287 The TCP Urgent Pointer, like the TCP window, is a 16 bit value. 1288 Some of the original discussion for the TCP Window Scale option 1289 included proposals to increase the Urgent Pointer to 32 bits. As 1290 it turns out, this is unnecessary. There are two observations 1291 that should be made: 1293 (1) With IP Version 4, the largest amount of TCP data that can be 1294 sent in a single packet is 65495 bytes (64K - 1 -- size of 1295 fixed IP and TCP headers). 1297 (2) Updates to the Urgent Pointer while the user is in "urgent 1298 mode" are invisible to the user. 1300 This means that if the Urgent Pointer points beyond the end of the 1301 TCP data in the current packet, then the user will remain in 1302 urgent mode until the next TCP packet arrives. That packet will 1303 update the Urgent Pointer to a new offset, and the user will never 1304 have left urgent mode. 1306 Thus, to properly implement the Urgent Pointer, the sending TCP 1307 only has to check for overflow of the 16 bit Urgent Pointer field 1308 before filling it in. If it does overflow, then a value of 65535 1309 should be inserted into the Urgent Pointer. 1311 The same technique applies to IP Version 6, except in the case of 1312 IPv6 Jumbograms. When IPv6 Jumbograms are supported, [RFC2675] 1313 requires additional steps for dealing with the Urgent Pointer; 1314 these are described in section 5.2 of [RFC2675]. 1316 Appendix B.
Duplicates from Earlier Connection Incarnations 1318 There are two cases to be considered: (1) a system crashing (and 1319 losing connection state) and restarting, and (2) the same connection 1320 being closed and reopened without a loss of host state. These will 1321 be described in the following two sections. 1323 B.1. System Crash with Loss of State 1325 TCP's quiet time of one MSL upon system startup handles the loss of 1326 connection state in a system crash/restart. For an explanation, see 1327 for example "When to Keep Quiet" in the TCP protocol specification 1328 [RFC0793]. The MSL that is required here does not depend upon the 1329 transfer speed. The current TCP MSL of 2 minutes seemed acceptable 1330 as an operational compromise, when many host systems used to take 1331 this long to boot after a crash. Current host systems can boot 1332 considerably faster. 1334 The timestamp option may be used to ease the MSL requirements (or to 1335 provide additional security against data corruption). If timestamps 1336 are being used and if the timestamp clock can be guaranteed to be 1337 monotonic over a system crash/restart, i.e., if the first value of 1338 the sender's timestamp clock after a crash/restart can be guaranteed 1339 to be greater than the last value before the restart, then a quiet 1340 time is unnecessary. 1342 To dispense totally with the quiet time would require that the host 1343 clock be synchronized to a time source that is stable over the crash/ 1344 restart period, with an accuracy of one timestamp clock tick or 1345 better. We can back off from this strict requirement to take 1346 advantage of approximate clock synchronization. Suppose that the 1347 clock is always re-synchronized to within N timestamp clock ticks and 1348 that booting (extended with a quiet time, if necessary) takes more 1349 than N ticks. 
This will guarantee monotonicity of the timestamps, 1350 which can then be used to reject old duplicates even without an 1351 enforced MSL. 1353 B.2. Closing and Reopening a Connection 1355 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state 1356 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]. 1357 Applications built upon TCP that close one connection and open a new 1358 one (e.g., an FTP data transfer connection using Stream mode) must 1359 choose a new socket pair each time. The TIME-WAIT delay serves two 1360 different purposes: 1362 (a) Implement the full-duplex reliable close handshake of TCP. 1364 The proper time to delay the final close step is not really 1365 related to the MSL; it depends instead upon the RTO for the FIN 1366 segments and therefore upon the RTT of the path. (It could be 1367 argued that the side that is sending a FIN knows what degree of 1368 reliability it needs, and therefore it should be able to 1369 determine the length of the TIME-WAIT delay for the FIN's 1370 recipient. This could be accomplished with an appropriate TCP 1371 option in FIN segments.) 1373 Although there is no formal upper-bound on RTT, common network 1374 engineering practice makes an RTT greater than 1 minute very 1375 unlikely. Thus, the 4 minute delay in TIME-WAIT state works 1376 satisfactorily to provide a reliable full-duplex TCP close. 1377 Note again that this is independent of MSL enforcement and 1378 network speed. 1380 The TIME-WAIT state could cause an indirect performance problem 1381 if an application needed to repeatedly close one connection and 1382 open another at a very high frequency, since the number of 1383 available TCP ports on a host is less than 2^16. However, high 1384 network speeds are not the major contributor to this problem; 1385 the RTT is the limiting factor in how quickly connections can be 1386 opened and closed. Therefore, this problem will be no worse at 1387 high transfer speeds. 

   (b)  Allow old duplicate segments to expire.

        To replace this function of TIME-WAIT state, a mechanism would
        have to operate across connections.  PAWS is defined strictly
        within a single connection; the last timestamp (TS.Recent) is
        kept in the connection control block, and discarded when a
        connection is closed.

        An additional mechanism could be added to the TCP: a per-host
        cache of the last timestamp received from any connection.  This
        value could then be used in the PAWS mechanism to reject old
        duplicate segments from earlier incarnations of the connection,
        if the timestamp clock can be guaranteed to have ticked at least
        once since the old connection was open.  This would require that
        the TIME-WAIT delay plus the RTT together must be at least one
        tick of the sender's timestamp clock.  Such an extension is not
        part of the proposal of this RFC.

        Note that this is a variant on the mechanism proposed by
        Garlick, Rom, and Postel [Garlick77], which required each host
        to maintain connection records containing the highest sequence
        numbers on every connection.  Using timestamps instead, it is
        only necessary to keep one quantity per remote host, regardless
        of the number of simultaneous connections to that host.

Appendix C.  Summary of Notation

   The following notation has been used in this document.

   Options

       WSopt:       TCP Window Scale Option
       TSopt:       TCP Timestamps Option

   Option Fields

       shift.cnt:   Window scale byte in WSopt
       TSval:       32-bit Timestamp Value field in TSopt
       TSecr:       32-bit Timestamp Reply field in TSopt

   Option Fields in Current Segment

       SEG.TSval:   TSval field from TSopt in current segment
       SEG.TSecr:   TSecr field from TSopt in current segment
       SEG.WSopt:   8-bit value in WSopt

   Clock Values

       my.TSclock:       System wide source of 32-bit timestamp values
       my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
       Snd.TSoffset:     An offset for randomizing Snd.TSclock
       Snd.TSclock:      my.TSclock + Snd.TSoffset

   Per-Connection State Variables

       TS.Recent:        Latest received Timestamp
       Last.ACK.sent:    Last ACK field sent
       Snd.TS.OK:        1-bit flag
       Snd.WS.OK:        1-bit flag
       Rcv.Wind.Scale:   Receive window scale power
       Snd.Wind.Scale:   Send window scale power
       Start.Time:       Snd.TSclock value when segment being timed was
                         sent (used by pre-1323 code).

   Procedure

       Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
                         variance estimates, using the rules of
                         [Jacobson88a], given m, a new RTT measurement

Appendix D.  Pseudo-code Summary

   Create new TCB => {
       Rcv.wind.scale =
           MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
       Snd.wind.scale = 0;
       Last.ACK.sent = 0;
       Snd.TS.OK = Snd.WS.OK = FALSE;
       Snd.TSoffset = random 32 bit value
   }

   Send initial SYN segment => {
       SEG.WND = MIN( RCV.WND, 65535 );
       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
       Include in segment: WSopt = Rcv.wind.scale;
   }

   Send SYN,ACK segment => {
       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
       if (Snd.WS.OK) then
           Include in segment: WSopt = Rcv.wind.scale;
   }

   Receive SYN or SYN,ACK segment => {
       if (Segment contains TSopt) then {
           TS.Recent = SEG.TSval;
           Snd.TS.OK = TRUE;
           if (is SYN,ACK segment) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
       }
       if (Segment contains WSopt) then {
           Snd.wind.scale = SEG.WSopt;
           Snd.WS.OK = TRUE;
           if (the ACK bit is not set, and Rcv.wind.scale has not been
               initialized by the user) then
               Rcv.wind.scale = Snd.wind.scale;
       }
       else
           Rcv.wind.scale = Snd.wind.scale = 0;
   }

   Send non-SYN segment => {
       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
   }

   Receive non-SYN segment in (state >= ESTABLISHED) => {
       Window = (SEG.WND << Snd.wind.scale);
           /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
            * in rest of processing.
            */
       if (Segment contains TSopt) then {
           if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
               if (Snd.TS.OK AND (NOT RST) ) then {
                   /* Timestamp too old =>
                    * segment is unacceptable.
                    */
                   Send ACK segment;
                   Discard segment and return;
               }
           }
           else {
               if (SEG.SEQ <= Last.ACK.sent) then
                   TS.Recent = SEG.TSval;
           }
       }
       if (SEG.ACK > SND.UNA) then {
           /* (At least part of) first segment in
            * retransmission queue has been ACKed
            */
           if (Segment contains TSopt) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
           else
               Update_SRTT(            /* for compatibility */
                   (Snd.TSclock - Start.Time)/my.TSclock.rate);
       }
   }

Appendix E.  Event Processing Summary

   OPEN Call

      ...

      An initial send sequence number (ISS) is selected.  Send a SYN
      segment of the form:

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If the foreign socket is specified, then change the connection
         from passive to active, select an ISS.  Send a SYN segment
         containing the options: and .  Set SND.UNA to ISS, SND.NXT to
         ISS+1.  Enter SYN-SENT state. ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT). ...

         If the urgent flag is set ...

         If the Snd.TS.OK flag is set, then include the TCP Timestamps
         option in each data segment.

         Scale the receive window for transmission in the segment
         header:

            SEG.WND = (RCV.WND >> Rcv.Wind.Scale).

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            if the SYN bit is set, check the security.  If the ...

            ...

            if the SEG.PRC is less than the TCB.PRC then continue.

            Check for a Window Scale option (WSopt); if one is found,
            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
            zero and clear Snd.WS.OK flag.

            Check for a TSopt option; if one is found, save SEG.TSval in
            the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
            other control or text should be queued for processing later.
            ISS should be selected and a SYN segment sent of the form:

            If the Snd.WS.OK bit is on, include a WSopt option in this
            segment.  If the Snd.TS.OK bit is on, include a TSopt in
            this segment.  Last.ACK.sent is set to RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
            state should be changed to SYN-RECEIVED.  Note that any
            other incoming control or data (combined with SYN) will be
            processed in the SYN-RECEIVED state, but processing of SYN
            and ACK should not be repeated.  If the listen was not fully
            specified (i.e., the foreign socket was not fully
            specified), then the unspecified fields should be filled in
            now.

         fourth other text or control

            ...

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
            IRS is set to SEG.SEQ, and any acknowledgements on the
            retransmission queue which are thereby acknowledged should
            be removed.

            Check for a Window Scale option (WSopt); if it is found,
            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

            Check for a TSopt option; if one is found, save SEG.TSval in
            variable TS.Recent and turn on the Snd.TS.OK bit in the
            connection control block.  If the ACK bit is set, use
            Snd.TSclock - SEG.TSecr as the initial RTT estimate.
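
            The subtraction Snd.TSclock - SEG.TSecr in the step above
            operates on 32-bit timestamp values that can wrap.  The
            following is a minimal sketch of that calculation, not part
            of the specification.  The function names and the
            tick_period_s parameter (the length of one timestamp clock
            tick, in seconds) are hypothetical names chosen for
            illustration.

```python
MOD32 = 1 << 32  # timestamp values are 32-bit unsigned integers


def rtt_ticks(snd_tsclock: int, seg_tsecr: int) -> int:
    """Ticks elapsed between sending a segment and receiving the
    segment that echoes its TSval, computed modulo 2**32 so that a
    wrapped timestamp clock still yields a small positive value."""
    return (snd_tsclock - seg_tsecr) % MOD32


def rtt_seconds(snd_tsclock: int, seg_tsecr: int,
                tick_period_s: float) -> float:
    """Convert the tick difference into seconds.  tick_period_s is an
    assumed parameter; the document only bounds the timestamp clock
    period to the range 1 ms to 1 sec."""
    return rtt_ticks(snd_tsclock, seg_tsecr) * tick_period_s


if __name__ == "__main__":
    # Timestamp clock wrapped between sending and receiving the echo.
    print(rtt_ticks(3, 0xFFFFFFFE))
```

            Reducing the difference modulo 2**32 is one way to keep the
            estimate correct across a wrap of the timestamp clock; an
            implementation using fixed-width unsigned integers would
            get the same effect from ordinary unsigned subtraction.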

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an ACK segment:

            and send it.  If the Snd.TS.OK bit is on, include a TSopt
            option in this ACK segment.  Last.ACK.sent is set to
            RCV.NXT.

            Data or controls which were queued for transmission may be
            included.  If there are other controls or text in the
            segment then continue processing at the sixth step below
            where the URG bit is checked, otherwise return.

            Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

            and send it.  If the Snd.TS.OK bit is on, include a TSopt
            option in this segment.  If the Snd.WS.OK bit is on, include
            a WSopt option in this segment.  Last.ACK.sent is set to
            RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop the
         segment and return.

      Otherwise,

         First, check sequence number

            SYN-RECEIVED STATE
            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE
            CLOSE-WAIT STATE
            CLOSING STATE
            LAST-ACK STATE
            TIME-WAIT STATE

               Segments are processed in sequence.  Initial tests on
               arrival are used to discard old duplicates, but further
               processing is done in SEG.SEQ order.  If a segment's
               contents straddle the boundary between old and new, only
               the new parts should be processed.

               Rescale the received window field:

                  TrueWindow = SEG.WND << Snd.Wind.Scale,

               and use "TrueWindow" in place of SEG.WND in the following
               steps.

               Check whether the segment contains a Timestamps option
               and bit Snd.TS.OK is on.  If so:

                  If SEG.TSval < TS.Recent and the RST bit is off, then
                  test whether connection has been idle less than 24
                  days; if all are true, then the segment is not
                  acceptable; follow steps below for an unacceptable
                  segment.

                  If SEG.SEQ is less than or equal to Last.ACK.sent,
                  then save SEG.TSval in variable TS.Recent.

               There are four cases for the acceptability test for an
               incoming segment:

               ...

               If an incoming segment is not acceptable, an
               acknowledgment should be sent in reply (unless the RST
               bit is set, if so drop the segment and return):

               If the Snd.TS.OK bit is on, include the Timestamps option
               in this ACK segment.  Set Last.ACK.sent to SEG.ACK of the
               acknowledgment and send the ACK segment.  After sending
               the acknowledgment, drop the unacceptable segment and
               return.

         ...

         fifth check the ACK field.

            if the ACK bit is off drop the segment and return.

            if the ACK bit is on

               ...

               ESTABLISHED STATE

                  If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
                  SEG.ACK.  Also compute a new estimate of round-trip
                  time.  If Snd.TS.OK bit is on, use Snd.TSclock -
                  SEG.TSecr; otherwise use the elapsed time since the
                  first segment in the retransmission queue was sent.
                  Any segments on the retransmission queue which are
                  thereby entirely acknowledged...

         ...

         Seventh, process the segment text.

            ESTABLISHED STATE
            FIN-WAIT-1 STATE
            FIN-WAIT-2 STATE

               ...

               Send an acknowledgment of the form:

               If the Snd.TS.OK bit is on, include Timestamps option in
               this ACK segment.  Set Last.ACK.sent to SEG.ACK of the
               acknowledgment, and send it.  This acknowledgment should
               be piggy-backed on a segment being transmitted if
               possible without incurring undue delay.

               ...

Appendix F.  Timestamps Edge Cases

   While the rules laid out for when to calculate RTTM produce the
   correct results most of the time, there are some edge cases where an
   incorrect RTTM can be calculated.  All of these situations involve
   the loss of packets.  It is felt that these scenarios are rare, and
   that if they should happen, they will cause a single RTTM measurement
   to be inflated, which mitigates its effects on RTO calculations.

   [Martin03] cites two similar cases in which the returning ACK is lost
   and, before the retransmission timer fires, another returning packet
   arrives that ACKs the data.  In this case, the RTTM calculated will
   be inflated:

      clock
       tc=1   ------------------->

       tc=2   (lost) <----
              (RTTM would have been 1)

              (receive window opens, window update is sent)
       tc=5          <----
              (RTTM is calculated at 4)

   One thing to note about this situation is that it is somewhat bounded
   by RTO + RTT, limiting how far off the RTTM calculation will be.
   While more complex scenarios can be constructed that produce larger
   inflations (e.g., retransmissions are lost), those scenarios involve
   multiple packet losses, and the connection will have other more
   serious operational problems than using an inflated RTTM in the RTO
   calculation.

Appendix G.  Changes from RFC 1072, RFC 1185, and RFC 1323

   The protocol extensions defined in RFC 1323 differ in several
   important ways from those defined in RFC 1072 and RFC 1185.

   (a)  SACK has been split off into a separate document, [RFC2018].

   (b)  The detailed rules for sending timestamp replies (see
        Section 4.4) differ in important ways.  The earlier rules could
        result in an under-estimate of the RTT in certain cases (packets
        dropped or out of order).

   (c)  The same value TS.Recent is now shared by the two distinct
        mechanisms RTTM and PAWS.  This simplification became possible
        because of change (b).

   (d)  An ambiguity in RFC 1185 was resolved in favor of putting
        timestamps on ACK as well as data segments.  This supports the
        symmetry of the underlying TCP protocol.

   (e)  The echo and echo reply options of RFC 1072 were combined into a
        single Timestamps option, to reflect the symmetry and to
        simplify processing.

   (f)  The problem of outdated timestamps on long-idle connections,
        discussed in Section 5.2.2, was realized and resolved.

   (g)  RFC 1185 recommended that header prediction take precedence over
        the timestamp check.  Based upon some skepticism about the
        probabilistic arguments given in Section 5.2.4, it was decided
        to recommend that the timestamp check be performed first.

   (h)  The spec was modified so that the extended options will be sent
        on SYN,ACK segments only when they are received in the
        corresponding SYN segments.  This provides the most conservative
        possible conditions for interoperation with implementations
        without the extensions.

   In addition to these substantive changes, the present RFC attempts to
   specify the algorithms unambiguously by presenting modifications to
   the Event Processing rules of RFC 793; see Appendix E.

   There are additional changes in this document from RFC 1323.  These
   changes are:

   (a)  The description of which TSecr values can be used to update the
        measured RTT has been clarified.  Specifically, with Timestamps,
        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
        disables all RTT measurements during retransmission, since it is
        ambiguous whether the ACK is for the original packet, or the
        retransmitted packet.  With Timestamps, that ambiguity is
        removed since the TSecr in the ACK will contain the TSval from
        whichever data packet made it to the destination.

   (b)  In RFC 1323, section 3.4, step (2) of the algorithm to control
        which timestamp is echoed was incorrect in two regards:

        (1)  It failed to update TS.Recent for a retransmitted segment
             that resulted from a lost ACK.

        (2)  It failed if SEG.LEN = 0.

        In the new algorithm, the case of SEG.TSval >= TS.Recent is
        included for consistency with the PAWS test.

   (c)  One correction was made to the Event Processing Summary in
        Appendix E.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
        fill in the SEG.WND value, not SND.WND.

   (d)  A new pseudo-code summary has been added in Appendix D.

   (e)  Appendix A has been expanded with information about the TCP
        Urgent Pointer.  An earlier revision contained text around the
        TCP MSS option, which was split off into [RFC6691].

   (f)  It is now recommended that Timestamps options be included in RST
        packets if the incoming packet contained a Timestamps option.

   (g)  RST packets are explicitly excluded from PAWS processing.

   (h)  Snd.TSoffset and Snd.TSclock variables have been added.
        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
        allows the starting points for timestamps to be randomized on a
        per-connection basis.  Setting Snd.TSoffset to zero yields the
        same results as [RFC1323].

   (i)  RTTM update processing explicitly excludes packets containing
        SACK options.  This addresses inflation of the RTT during
        episodes of packet loss in both directions.

   (j)  In Section 4.2 the if-clause allowing sending of timestamps only
        when received in a SYN or SYN,ACK was removed, to allow for late
        timestamp negotiation.

   (k)  Section 3.4 was added describing the unavoidable window
        retraction issue, and explicitly describing the mitigation steps
        necessary.

   (l)  Section 2 was added for RFC 2119 wording.  Normative text was
        updated with the appropriate phrases.

   (m)  Removed much of the discussion in Section 1 to streamline the
        document.  However, detailed examples and discussions in
        Section 3, Section 4 and Section 5 are kept as guidelines for
        implementers.

   (n)  Moved Appendix "Changes" to the end of the appendices for easier
        lookup.
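
   Changes (g) and (h) in the list above can be illustrated with a short
   sketch.  It is a non-normative example under stated assumptions: the
   function names are hypothetical, the use of Python's secrets module
   for the random offset is one possible choice and not something this
   document prescribes, and the idle_24_days_or_more flag stands in for
   the idle test used by the pseudo-code in Appendix D.

```python
import secrets

MOD32 = 1 << 32  # timestamp values are 32-bit unsigned integers


def new_snd_tsoffset() -> int:
    """Pick a per-connection random 32-bit offset (Snd.TSoffset)."""
    return secrets.randbits(32)


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Snd.TSclock = my.TSclock + Snd.TSoffset, wrapped to 32 bits.
    An offset of zero gives back my.TSclock unchanged."""
    return (my_tsclock + snd_tsoffset) % MOD32


def ts_before(a: int, b: int) -> bool:
    """True if timestamp a is older than b in modular arithmetic."""
    return 0 < (b - a) % MOD32 < (1 << 31)


def paws_rejects(seg_tsval: int, ts_recent: int, is_rst: bool,
                 idle_24_days_or_more: bool = False) -> bool:
    """PAWS test: reject a segment whose TSval is older than
    TS.Recent, except for RST segments and long-idle connections."""
    if is_rst or idle_24_days_or_more:
        return False
    return ts_before(seg_tsval, ts_recent)
```

   The modular comparison in ts_before is an assumption of this sketch;
   it treats one timestamp as older than another when the forward
   distance between them is less than 2**31 ticks.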

Authors' Addresses

   David Borman
   Quantum Corporation
   Mendota Heights  MN 55120
   USA

   Email: david.borman@quantum.com


   Bob Braden
   University of Southern California
   4676 Admiralty Way
   Marina del Rey  CA 90292
   USA

   Email: braden@isi.edu


   Van Jacobson
   Packet Design
   2465 Latham Street
   Mountain View  CA 94040
   USA

   Email: van@packetdesign.com


   Richard Scheffenegger (editor)
   NetApp, Inc.
   Am Euro Platz 2
   Vienna,   1120
   Austria

   Email: rs@netapp.com