TCP Maintenance (TCPM)                                         D. Borman
Internet-Draft                                       Quantum Corporation
Intended status: Standards Track                               B. Braden
Expires: October 18, 2013                         University of Southern
                                                              California
                                                             V. Jacobson
                                                           Packet Design
                                                   R. Scheffenegger, Ed.
                                                            NetApp, Inc.
                                                          April 16, 2013

                  TCP Extensions for High Performance
                       draft-ietf-tcpm-1323bis-10

Abstract

   This document specifies a set of TCP extensions to improve
   performance over paths with a large bandwidth * delay product and to
   provide reliable operation over very high-speed paths. It defines
   TCP options for scaled windows and timestamps. The timestamps are
   used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
   and PAWS (Protection Against Wrapped Sequences).

   This document updates and obsoletes RFC 1323.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 18, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. TCP Performance
      1.2. TCP Reliability
      1.3. Using TCP options
      1.4. Terminology
   2. TCP Window Scale Option
      2.1. Introduction
      2.2. Window Scale Option
      2.3. Using the Window Scale Option
      2.4. Addressing Window Retraction
   3. RTTM -- Round-Trip Time Measurement
      3.1. Introduction
      3.2. TCP Timestamp Option
      3.3. The RTTM Mechanism
      3.4. Which Timestamp to Echo
   4. PAWS -- Protection Against Wrapped Sequence Numbers
      4.1. Introduction
      4.2. The PAWS Mechanism
      4.3. Basic PAWS Algorithm
      4.4. Timestamp Clock
      4.5. Outdated Timestamps
      4.6. Header Prediction
      4.7. IP Fragmentation
      4.8. Duplicates from Earlier Incarnations of Connection
   5. Conclusions and Acknowledgements
   6. Security Considerations
   7. IANA Considerations
   8. References
      8.1. Normative References
      8.2. Informative References
   Appendix A. Implementation Suggestions
   Appendix B. Duplicates from Earlier Connection Incarnations
      B.1. System Crash with Loss of State
      B.2. Closing and Reopening a Connection
   Appendix C. Summary of Notation
   Appendix D. Event Processing Summary
   Appendix E. Timestamps Edge Cases
   Appendix F. Window Retraction Example
   Appendix G. Changes from RFC 1323
   Authors' Addresses

1. Introduction

   The TCP protocol [RFC0793] was designed to operate reliably over
   almost any transmission medium regardless of transmission rate,
   delay, corruption, duplication, or reordering of segments. Over the
   years, advances in networking technology have resulted in ever-higher
   transmission speeds, and the fastest paths are well beyond the domain
   for which TCP was originally engineered.

   This document defines a set of modest extensions to TCP to extend the
   domain of its application to match the increasing network capability.
   It is an update to and obsoletes [RFC1323], which in turn is based
   upon and obsoletes [RFC1072] and [RFC1185].

   Changes between [RFC1323] and this document are detailed in
   Appendix G.

   For brevity, the full discussions of the merits and history behind
   the TCP options defined within this document have been omitted.
   [RFC1323] should be consulted for reference. It is recommended that
   a modern TCP stack implement and make use of the extensions
   described in this document.

1.1. TCP Performance

   TCP performance problems arise when the bandwidth * delay product is
   large. A network having such paths is referred to as a "long, fat
   network" (LFN).

   There are three fundamental performance problems with basic TCP over
   LFN paths:

   (1) Window Size Limit

       The TCP header uses a 16-bit field to report the receive window
       size to the sender. Therefore, the largest window that can be
       used is 2^16 = 64 KiB.

       To circumvent this problem, Section 2 of this memo defines a TCP
       option, "Window Scale", to allow windows larger than 2^16. This
       option defines an implicit scale factor, which is used to
       multiply the window size value found in a TCP header to obtain
       the true window size.

   (2) Recovery from Losses

       Packet losses in an LFN can have a catastrophic effect on
       throughput.

       To generalize the Fast Retransmit/Fast Recovery mechanism to
       handle multiple packets dropped per window, selective
       acknowledgments are required. Unlike the normal cumulative
       acknowledgments of TCP, selective acknowledgments give the
       sender a complete picture of which segments are queued at the
       receiver and which have not yet arrived.

       Selective acknowledgments are specified in a separate document,
       "A Conservative Selective Acknowledgment (SACK)-based Loss
       Recovery Algorithm for TCP" [RFC6675], and are not discussed
       further in this document.

   (3) Round-Trip Measurement

       TCP implements reliable data delivery by retransmitting segments
       that are not acknowledged within some retransmission timeout
       (RTO) interval. Accurate dynamic determination of an
       appropriate RTO is essential to TCP performance. RTO is
       determined by estimating the mean and variance of the measured
       round-trip time (RTT), i.e., the time interval between sending a
       segment and receiving an acknowledgment for it [Jacobson88a].

       Section 3.2 defines a TCP option, "Timestamp", and then
       specifies a mechanism using this option that allows nearly every
       segment, including retransmissions, to be timed at negligible
       computational cost. We use the mnemonic RTTM (Round Trip Time
       Measurement) for this mechanism, to distinguish it from other
       uses of the Timestamp Option.

1.2. TCP Reliability

   An especially serious kind of error may result from an accidental
   reuse of TCP sequence numbers in data segments. TCP reliability
   depends upon the existence of a bound on the lifetime of a segment:
   the "Maximum Segment Lifetime" or MSL.

   Duplication of sequence numbers might happen in either of two ways:

   (1) Sequence number wrap-around on the current connection

       A TCP sequence number contains 32 bits. At a high enough
       transfer rate, the 32-bit sequence space may be "wrapped"
       (cycled) within the time that a segment is delayed in queues.

   (2) Earlier incarnation of the connection

       Suppose that a connection terminates, either by a proper close
       sequence or due to a host crash, and the same connection (i.e.,
       using the same pair of port numbers) is immediately reopened. A
       delayed segment from the terminated connection could fall within
       the current window for the new incarnation and be accepted as
       valid.
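
   As a rough illustration of case (1), the following C sketch computes
   how long it takes to cycle through the full 32-bit sequence space at
   a given transfer rate. The function name seq_wrap_seconds is an
   illustrative assumption and not part of this specification. The
   calculation assumes that every transmitted byte consumes one
   sequence number and ignores header overhead.

```c
#include <assert.h>

/*
 * Illustrative helper (hypothetical name): seconds needed to send
 * 2^32 bytes, i.e. one full cycle of the 32-bit sequence space, at
 * the given rate in bits per second. Header overhead is ignored.
 */
static double seq_wrap_seconds(double bits_per_second)
{
    const double seq_space_bytes = 4294967296.0; /* 2^32 */

    assert(bits_per_second > 0.0);
    return seq_space_bytes * 8.0 / bits_per_second;
}
```

   For example, seq_wrap_seconds(1e9) evaluates to roughly 34 seconds,
   so at 1 Gbit/s a segment delayed for about half a minute could
   reappear after the sequence space has cycled.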

   Duplicates from earlier incarnations, case (2), are avoided by
   enforcing the current fixed MSL of the TCP specification, as
   explained in Section 4.8 and Appendix B. However, case (1), avoiding
   the reuse of sequence numbers within the same connection, requires an
   upper bound on MSL that depends upon the transfer rate, and at high
   enough rates, a dedicated mechanism is required.

   A possible fix for the problem of cycling the sequence space would be
   to increase the size of the TCP sequence number field. For example,
   the sequence number field (and also the acknowledgment field) could
   be expanded to 64 bits. This could be done either by changing the
   TCP header or by means of an additional option.

   Section 4 presents a different mechanism, which we call PAWS
   (Protection Against Wrapped Sequence numbers), to extend TCP
   reliability to transfer rates well beyond the foreseeable upper limit
   of network bandwidths. PAWS uses the TCP timestamp option defined in
   Section 3.2 to protect against old duplicates from the same
   connection.

1.3. Using TCP options

   The extensions defined in this document all use TCP options.

   When [RFC1323] was published, there was concern that some buggy TCP
   implementation might be crashed by the first appearance of an option
   on a non-<SYN> segment. However, bugs like that can lead to DoS
   attacks against a TCP, so it is now expected that most TCP
   implementations will properly handle unknown options on non-<SYN>
   segments. But it is still prudent to be conservative in what you
   send, and avoiding buggy TCP implementations is not the only reason
   for negotiating TCP options on <SYN> segments.

   The window scale option negotiates fundamental parameters of the TCP
   session. Therefore, it is only sent during the initial handshake.
   Furthermore, the window scale option will be sent in a <SYN,ACK>
   segment only if the corresponding option was received in the initial
   <SYN> segment.

   The timestamp option may appear in any data or <ACK> segment, adding
   12 bytes to the 20-byte TCP header. We recognize there is a
   trade-off between the bandwidth saved by reducing unnecessary
   retransmission timeouts, and the extra header bandwidth used by this
   option. It is required that this TCP option be sent on non-<SYN>
   segments only after an exchange of options on the <SYN> segments has
   indicated that both sides understand this extension.

   Appendix A contains a recommended layout of the options in TCP
   headers to achieve reasonable data field alignment.

   Finally, we observe that most of the mechanisms defined in this
   document are important for LFNs and/or very high-speed networks.
   For low-speed networks, it might be a performance optimization to NOT
   use these mechanisms. A TCP vendor concerned about optimal
   performance over low-speed paths might consider turning these
   extensions off for low-speed paths, or allowing a user or
   installation manager to disable them.

1.4. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   In this document, these words will appear with that interpretation
   only when in UPPER CASE. Lower case uses of these words are not to
   be interpreted as carrying [RFC2119] significance.

2. TCP Window Scale Option

2.1. Introduction

   The window scale extension expands the definition of the TCP window
   to 32 bits and then uses a scale factor to carry this 32-bit value in
   the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The
   scale factor is carried in a TCP option, Window Scale.
   This option is sent only in a <SYN> segment (a segment with the SYN
   bit on), hence the window scale is fixed in each direction when a
   connection is opened.

   The maximum receive window, and therefore the scale factor, is
   determined by the maximum receive buffer space. In a typical modern
   implementation, this maximum buffer space is set by default but can
   be overridden by a user program before a TCP connection is opened.
   This determines the scale factor, and therefore no new user interface
   is needed for window scaling.

2.2. Window Scale Option

   The three-byte Window Scale option MAY be sent in a <SYN> segment by
   a TCP. It has two purposes: (1) indicate that the TCP is prepared to
   do both send and receive window scaling, and (2) communicate a scale
   factor to be applied to its receive window. Thus, a TCP that is
   prepared to scale windows SHOULD send the option, even if its own
   scale factor is 1. The scale factor is limited to a power of two and
   encoded logarithmically, so it may be implemented by binary shift
   operations.

   TCP Window Scale Option (WSopt):

   Kind: 3

   Length: 3 bytes

      +---------+---------+---------+
      | Kind=3  |Length=3 |shift.cnt|
      +---------+---------+---------+
           1         1         1

   This option is an offer, not a promise; both sides MUST send Window
   Scale options in their <SYN> segments to enable window scaling in
   either direction. If window scaling is enabled, then the TCP that
   sent this option will right-shift its true receive-window values by
   'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt'
   MAY be zero (offering to scale, while applying a scale factor of 1 to
   the receive window).

   This option MAY be sent in an initial <SYN> segment (i.e., a segment
   with the SYN bit on and the ACK bit off). It MAY also be sent in a
   <SYN,ACK> segment, but only if a Window Scale option was received in
   the initial <SYN> segment.
   A Window Scale option in a segment without a SYN bit SHOULD be
   ignored.

   The window field in a segment where the SYN bit is set (i.e., a <SYN>
   or <SYN,ACK>) is never scaled.

2.3. Using the Window Scale Option

   A model implementation of window scaling is as follows, using the
   notation of [RFC0793]:

   o  All windows are treated as 32-bit quantities for storage in the
      connection control block and for local calculations. This
      includes the send-window (SND.WND) and the receive-window
      (RCV.WND) values, as well as the congestion window.

   o  The connection state is augmented by two window shift counts,
      Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
      and outgoing window fields, respectively.

   o  If a TCP receives a <SYN> segment containing a Window Scale
      option, it sends its own Window Scale option in the <SYN,ACK>
      segment.

   o  The Window Scale option is sent with shift.cnt = R, where R is the
      value that the TCP would like to use for its receive window.

   o  Upon receiving a <SYN> segment with a Window Scale option
      containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
      Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
      Rcv.Wind.Scale to zero.

   o  The window field (SEG.WND) in the header of every incoming
      segment, with the exception of <SYN> segments, is left-shifted by
      Snd.Wind.Scale bits before updating SND.WND:

         SND.WND = SEG.WND << Snd.Wind.Scale

      (assuming the other conditions of [RFC0793] are met, and using the
      "C" notation "<<" for left-shift).

   o  The window field (SEG.WND) of every outgoing segment, with the
      exception of <SYN> segments, is right-shifted by Rcv.Wind.Scale
      bits:

         SEG.WND = RCV.WND >> Rcv.Wind.Scale

   TCP determines if a data segment is "old" or "new" by testing whether
   its sequence number is within 2^31 bytes of the left edge of the
   window, and if it is not, discarding the data as "old".
   To ensure that new data is never mistakenly considered old and vice
   versa, the left edge of the sender's window has to be at most 2^31
   away from the right edge of the receiver's window. The same applies
   to the sender's right edge and the receiver's left edge. Since the
   right and left edges of either the sender's or receiver's window
   differ by the window size, and since the sender and receiver windows
   can be out of phase by at most the window size, the above constraints
   imply that two times the maximum window size must be less than 2^31,
   or

      max window < 2^30

   Since the max window is 2^S (where S is the scaling shift count)
   times at most 2^16 - 1 (the maximum unscaled window), the maximum
   window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
   MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a
   Window Scale option is received with a shift.cnt value exceeding 14,
   the TCP SHOULD log the error but use 14 instead of the specified
   value.

   The scale factor applies only to the Window field as transmitted in
   the TCP header; each TCP using extended windows will maintain the
   window values locally as 32-bit numbers. For example, the
   "congestion window" computed by Slow Start and Congestion Avoidance
   is not affected by the scale factor, so window scaling will not
   introduce quantization into the congestion window.

2.4. Addressing Window Retraction

   When a non-zero scale factor is in use, there are instances when a
   retracted window can be offered; see Appendix F for a detailed
   example. The end of the window will be on a boundary based on the
   granularity of the scale factor being used.
   If the sequence number is then updated by a number of bytes smaller
   than that granularity, the TCP will have to either advertise a new
   window that is beyond what it previously advertised (and perhaps
   beyond the buffer), or advertise a smaller window, which will cause
   the TCP window to shrink. Implementations MUST ensure that they
   handle a shrinking window, as specified in section 4.2.2.16 of
   [RFC1122].

   For the receiver, this implies that:

   1)  The receiver MUST honor, as in-window, any segment that would
       have been in-window for any <ACK> sent by the receiver.

   2)  When window scaling is in effect, the receiver SHOULD track the
       actual maximum window sequence number (which is likely to be
       greater than the window announced by the most recent <ACK>, if
       more than one segment has arrived since the application consumed
       any data in the receive buffer).

   On the sender side:

   3)  The initial transmission MUST be within the window announced by
       the most recent <ACK>.

   4)  On first retransmission, or if the sequence number is
       out-of-window by less than (2^Rcv.Wind.Scale), do normal
       retransmission(s) without regard to the receiver window, as long
       as the original segment was in window when it was sent.

   5)  Subsequent retransmissions MAY only be sent if they are within
       the window announced by the most recent <ACK>.

3. RTTM -- Round-Trip Time Measurement

3.1. Introduction

   Accurate and current RTT estimates are necessary to adapt to changing
   traffic conditions and to avoid an instability known as "congestion
   collapse" [RFC0896] in a busy network. However, accurate measurement
   of RTT may be difficult both in theory and in implementation.

   Many TCP implementations base their RTT measurements upon a sample of
   one segment per window or less. While this yields an adequate
   approximation to the RTT for small windows, it results in an
   unacceptably poor RTT estimate for an LFN.
   If we look at RTT estimation as a signal processing problem (which it
   is), a data signal at some frequency, the packet rate, is being
   sampled at a lower frequency, the window rate. This lower sampling
   frequency violates Nyquist's criterion and may therefore introduce
   "aliasing" artifacts into the estimated RTT [Hamming77].

   A good RTT estimator with a conservative retransmission timeout
   calculation can tolerate aliasing when the sampling frequency is
   "close" to the data frequency. For example, with a window of 8
   segments, the sample rate is 1/8 the data frequency -- less than an
   order of magnitude different. However, when the window is tens or
   hundreds of segments, the RTT estimator may be seriously in error,
   resulting in spurious retransmissions.

   If there are dropped segments, the problem becomes worse. Zhang
   [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
   possible to accumulate reliable RTT estimates if retransmitted
   segments are included in the estimate. Since a full window of data
   will have been transmitted prior to a retransmission, all of the
   segments in that window will have to be ACKed before the next RTT
   sample can be taken. This means at least an additional window's
   worth of time between RTT measurements and, as the error rate
   approaches one per window of data (e.g., 10^-6 errors per bit for the
   Wideband satellite network), it becomes effectively impossible to
   obtain a valid RTT measurement.

   A solution to these problems, which actually simplifies the sender
   substantially, is as follows: using TCP options, the sender places a
   timestamp in each data segment, and the receiver reflects these
   timestamps back in <ACK> segments. Then a single subtraction gives
   the sender an accurate RTT measurement for every <ACK> segment (which
   will correspond to every other data segment, with a sensible
   receiver).
   We call this the RTTM (Round-Trip Time Measurement) mechanism.

   It is vitally important to use the RTTM mechanism with big windows;
   otherwise, the door is opened to some dangerous instabilities due to
   aliasing. Furthermore, the option is probably useful for all TCPs,
   since it simplifies the sender.

3.2. TCP Timestamp Option

   TCP is a symmetric protocol, allowing data to be sent at any time in
   either direction, and therefore timestamp echoing may occur in either
   direction. For simplicity and symmetry, we specify that timestamps
   always be sent and echoed in both directions. For efficiency, we
   combine the timestamp and timestamp reply fields into a single TCP
   Timestamp Option.

   TCP Timestamp Option (TSopt):

   Kind: 8

   Length: 10 bytes

      +-------+-------+---------------------+---------------------+
      |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
      +-------+-------+---------------------+---------------------+
          1       1              4                     4

   The Timestamp Option carries two four-byte timestamp fields. The
   Timestamp Value field (TSval) contains the current value of the
   timestamp clock of the TCP sending the option.

   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
   in the TCP header; if it is valid, it echoes a timestamp value that
   was sent by the remote TCP in the TSval field of a Timestamp option.
   When TSecr is not valid, its value MUST be zero. However, a value of
   zero does not imply that TSecr is invalid. The TSecr value will
   generally be from the most recent Timestamp Option that was received;
   however, there are exceptions that are explained below.

   A TCP MAY send the Timestamp option (TSopt) in an initial <SYN>
   segment (i.e., a segment containing a SYN bit and no ACK bit), and
   MAY send a TSopt in other segments only if it received a TSopt in the
   initial <SYN> or <SYN,ACK> segment for the connection.
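
   The option layout above can be sketched in C as follows. This is a
   minimal illustration, not a definitive implementation: the names
   TSOPT_KIND, TSOPT_LEN, tsopt_encode and tsopt_decode are hypothetical
   and do not come from this document, and the sketch assumes that the
   two timestamp fields are carried in network (big-endian) byte order
   like other multi-byte TCP header fields. It handles only the 10-byte
   option itself, not any padding used for alignment.

```c
#include <stddef.h>
#include <stdint.h>

#define TSOPT_KIND 8   /* Kind value from the layout above */
#define TSOPT_LEN  10  /* Length value from the layout above */

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit value stored in big-endian (network) byte order. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write a TSopt into buf (at least TSOPT_LEN bytes); returns its size. */
static size_t tsopt_encode(uint8_t *buf, uint32_t tsval, uint32_t tsecr)
{
    buf[0] = TSOPT_KIND;
    buf[1] = TSOPT_LEN;
    put_be32(buf + 2, tsval);
    put_be32(buf + 6, tsecr);
    return TSOPT_LEN;
}

/* Read a TSopt from buf; returns 1 on success, 0 if it is malformed. */
static int tsopt_decode(const uint8_t *buf, size_t len,
                        uint32_t *tsval, uint32_t *tsecr)
{
    if (len < TSOPT_LEN || buf[0] != TSOPT_KIND || buf[1] != TSOPT_LEN)
        return 0;
    *tsval = get_be32(buf + 2);
    *tsecr = get_be32(buf + 6);
    return 1;
}
```

   A caller would place the encoded bytes in the options area of the
   TCP header; per the text above, TSecr would be set to zero in a
   segment that does not have the ACK bit set.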

   Once TSopt has been successfully negotiated (sent and received)
   during the <SYN>, <SYN,ACK> exchange, TSopt MUST be sent in every
   non-<RST> segment for the duration of the connection. If a non-<RST>
   segment is received without a TSopt, a TCP MAY drop the segment and
   send an <ACK> for the last in-sequence segment. A TCP MUST NOT abort
   a TCP connection if a non-<RST> segment is received without a TSopt.

   If a TSopt is received on a connection where TSopt was not negotiated
   in the initial three-way handshake, the TSopt MUST be ignored and the
   packet processed normally.

   In the case of crossing <SYN> segments where one <SYN> contains a
   TSopt and the other doesn't, both sides MAY send a TSopt in the
   <SYN,ACK> segment.

   TSopt is required for the two mechanisms described in sections 3.3
   and 4.2. There are also other mechanisms that rely on the presence
   of the TSopt, e.g., [RFC3522]. If a TCP stopped sending TSopt at any
   time during an established session, it would interfere with these
   mechanisms. This update to [RFC1323] describes explicitly the
   previous assumption (see Section 4.2), that each TCP segment must
   have TSopt, once negotiated.

3.3. The RTTM Mechanism

   RTTM places a Timestamp Option in every segment, with a TSval that is
   obtained from a (virtual) "timestamp clock". Values of this clock
   MUST be at least approximately proportional to real time, in order to
   measure actual RTT.

   These TSval values are echoed in TSecr values in the reverse
   direction. The difference between a received TSecr value and the
   current timestamp clock value provides an RTT measurement.

   When timestamps are used, every segment that is received will contain
   a TSecr value. However, these values cannot all be used to update
   the measured RTT. The following example illustrates why. It shows a
   one-way data flow with segments arriving in sequence without loss.
   Here A, B, C...
   represent data blocks occupying successive blocks of sequence
   numbers, and ACK(A),... represent the corresponding cumulative
   acknowledgments. The two timestamp fields of the Timestamp Option
   are shown symbolically as <TSval=x,TSecr=y>. Each TSecr field
   contains the value most recently received in a TSval field.

          TCP A                                     TCP B

                          <A,TSval=1,TSecr=120> ----->

               <---- <ACK(A),TSval=127,TSecr=1>

                          <B,TSval=5,TSecr=127> ----->

               <---- <ACK(B),TSval=131,TSecr=5>

               . . . . . . . . . . . . . . . . . . . . . .

                          <C,TSval=65,TSecr=131> ---->

               <---- <ACK(C),TSval=191,TSecr=65>

                                    (etc.)

   The dotted line marks a pause (60 time units long) in which A had
   nothing to send. Note that this pause inflates the RTT which B could
   infer from receiving TSecr=131 in data segment C. Thus, in one-way
   data flows, RTTM in the reverse direction measures a value that is
   inflated by gaps in sending data. However, the following rule
   prevents a resulting inflation of the measured RTT:

   RTTM Rule: A TSecr value received in a segment MAY be used to update
              the averaged RTT measurement only if the segment advances
              the left edge of the send window, i.e., SND.UNA is
              increased.

   Since TCP B is not sending data, the data segment C does not
   acknowledge any new data when it arrives at B. Thus, the inflated
   RTTM measurement is not used to update B's RTTM measurement.

   Implementers should note that with timestamps multiple RTTMs can be
   taken per RTT. Many RTO estimators have a weighting factor based on
   an implicit assumption that at most one RTTM will be sampled per RTT.
   When using multiple RTTMs per RTT to update the RTO estimator, the
   weighting factor needs to be decreased to take into account the more
   frequent RTTMs. For example, an implementation could choose to just
   use one sample per RTT to update the RTO estimator, or vary the gain
   based on the congestion window, or take an average of all the RTT
   measurements received over one RTT, and then use that value to update
   the RTO estimator.
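
   The RTTM Rule of this section can be sketched in C as follows. This
   is an illustration under stated assumptions rather than a prescribed
   implementation: the structure rttm_state and the functions seq_gt and
   rttm_sample are hypothetical names, the check that the acknowledgment
   does not exceed SND.NXT is omitted, and how the resulting sample is
   fed into an RTO estimator is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical per-connection state; only SND.UNA is needed here. */
struct rttm_state {
    uint32_t snd_una; /* oldest unacknowledged sequence number */
};

/* Sequence-space comparison modulo 2^32: nonzero if a is after b. */
static int seq_gt(uint32_t a, uint32_t b)
{
    return a != b && (uint32_t)(a - b) < 0x80000000u;
}

/*
 * Apply the RTTM Rule to an incoming ACK. If seg_ack advances SND.UNA,
 * SND.UNA is updated, the RTT sample (current timestamp clock minus
 * the echoed TSecr, in clock ticks) is stored in *rtt_ticks, and 1 is
 * returned. Otherwise nothing is changed and 0 is returned.
 */
static int rttm_sample(struct rttm_state *st, uint32_t seg_ack,
                       uint32_t tsecr, uint32_t now, uint32_t *rtt_ticks)
{
    if (!seq_gt(seg_ack, st->snd_una))
        return 0;               /* left window edge did not advance */
    st->snd_una = seg_ack;
    *rtt_ticks = now - tsecr;   /* unsigned arithmetic tolerates wrap */
    return 1;
}
```

   With this sketch, a segment that carries a TSecr but acknowledges no
   new data, such as data segment C arriving at TCP B in the example
   above, yields no sample.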
   This document does not prescribe any particular
   method for modifying the RTO estimator.

3.4.  Which Timestamp to Echo

   If more than one Timestamp Option is received before a reply segment
   is sent, the TCP must choose only one of the TSvals to echo, ignoring
   the others. To minimize the state kept in the receiver (i.e., the
   number of unprocessed TSvals), the receiver should be required to
   retain at most one timestamp in the connection control block.

   There are three situations to consider:

   (A)  Delayed ACKs.

        Many TCPs acknowledge only every Kth segment out of a group of
        segments arriving within a short time interval; this policy is
        known generally as "delayed ACKs". The data-sender TCP must
        measure the effective RTT, including the additional time due to
        delayed ACKs, or else it will retransmit unnecessarily. Thus,
        when delayed ACKs are in use, the receiver SHOULD reply with the
        TSval field from the earliest unacknowledged segment.

   (B)  A hole in the sequence space (segment(s) have been lost).

        The sender will continue sending until the window is filled, and
        the receiver may be generating <ACK>s as these out-of-order
        segments arrive (e.g., to aid "fast retransmit").

        The lost segment is probably a sign of congestion, and in that
        situation the sender should be conservative about
        retransmission. Furthermore, it is better to overestimate than
        underestimate the RTT. An <ACK> for an out-of-order segment
        SHOULD therefore contain the timestamp from the most recent
        segment that advanced the window.

        The same situation occurs if segments are re-ordered by the
        network.

   (C)  A filled hole in the sequence space.

        The segment that fills the hole represents the most recent
        measurement of the network characteristics. A RTT computed from
        an earlier segment would probably include the sender's
        retransmit time-out, badly biasing the sender's average RTT
        estimate.
        Thus, the timestamp from the latest segment (which
        filled the hole) MUST be echoed.

   An algorithm that covers all three cases is described in the
   following rules for Timestamp Option processing on a synchronized
   connection:

   (1)  The connection state is augmented with two 32-bit slots:

        TS.Recent holds a timestamp to be echoed in TSecr whenever a
        segment is sent, and Last.ACK.sent holds the ACK field from the
        last segment sent. Last.ACK.sent will equal RCV.NXT except when
        <ACK>s have been delayed.

   (2)  If:

           SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

        then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.

   (3)  When a TSopt is sent, its TSecr field is set to the current
        TS.Recent value.

   The following examples illustrate these rules. Here A, B, C...
   represent data segments occupying successive blocks of sequence
   numbers, and ACK(A),... represent the corresponding acknowledgment
   segments. Note that ACK(A) has the same sequence number as B. We
   show only one direction of timestamp echoing, for clarity.

   o  Segments arrive in sequence, and some of the <ACK>s are delayed.

      By case (A), the timestamp from the oldest unacknowledged segment
      is echoed.

                                                  TS.Recent
                  <A, TSval=1> ------------------->
                                                      1
                  <B, TSval=2> ------------------->
                                                      1
                  <C, TSval=3> ------------------->
                                                      1
                           <---- <ACK(C), TSecr=1>
                  (etc)

   o  Segments arrive out of order, and every segment is acknowledged.

      By case (B), the timestamp from the last segment that advanced the
      left window edge is echoed, until the missing segment arrives; it
      is echoed according to Case (C). The same sequence would occur if
      segments B and D were lost and retransmitted.

                                                  TS.Recent
                  <A, TSval=1> ------------------->
                                                      1
                           <---- <ACK(A), TSecr=1>
                                                      1
                  <C, TSval=3> ------------------->
                                                      1
                           <---- <ACK(A), TSecr=1>
                                                      1
                  <B, TSval=2> ------------------->
                                                      2
                           <---- <ACK(C), TSecr=2>
                                                      2
                  <E, TSval=5> ------------------->
                                                      2
                           <---- <ACK(C), TSecr=2>
                                                      2
                  <D, TSval=4> ------------------->
                                                      4
                           <---- <ACK(E), TSecr=4>
                  (etc)

4.  PAWS -- Protection Against Wrapped Sequence Numbers

4.1.  Introduction

   Section 4.2 describes a simple mechanism to reject old duplicate
   segments that might corrupt an open TCP connection; we call this
   mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS
   operates within a single TCP connection, using state that is saved in
   the connection control block. Section 4.8 and Appendix B discuss the
   implications of the PAWS mechanism for avoiding old duplicates from
   previous incarnations of the same connection.

4.2.  The PAWS Mechanism

   PAWS uses the same TCP Timestamp Option as the RTTM mechanism
   described earlier, and assumes that every received TCP segment
   (including data and <ACK> segments) contains a timestamp SEG.TSval
   whose values are monotonically non-decreasing in time. The basic
   idea is that a segment can be discarded as an old duplicate if it is
   received with a timestamp SEG.TSval less than some timestamp recently
   received on this connection.

   In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit
   unsigned integers in a modular 32-bit space. Thus, "less than" is
   defined the same way it is for TCP sequence numbers, and the same
   implementation techniques apply. If s and t are timestamp values,

      s < t  if  0 < (t - s) < 2^31,

   computed in unsigned 32-bit arithmetic.

   The choice of incoming timestamps to be saved for this comparison
   MUST guarantee a value that is monotonically increasing. For
   example, we might save the timestamp from the segment that last
   advanced the left edge of the receive window, i.e., the most recent
   in-sequence segment. Instead, we choose the value TS.Recent
   introduced in Section 3.4 for the RTTM mechanism, since using a
   common value for both PAWS and RTTM simplifies the implementation of
   both.
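
   The modular "less than" test defined earlier in this section can be
   sketched in Python as follows. The function name ts_lt is an
   assumption made for illustration only; the subtraction is reduced
   modulo 2^32 to imitate the unsigned 32-bit arithmetic the text calls
   for.

```python
MOD32 = 1 << 32   # size of the timestamp space
HALF = 1 << 31    # half of the space: the limit of "recent enough"


def ts_lt(s, t):
    """True if s < t in modular 32-bit space, i.e. 0 < (t - s) < 2^31."""
    return 0 < ((t - s) % MOD32) < HALF


# A timestamp just before the wrap point still compares as older than one
# just after it, which is what allows PAWS to survive clock wrap-around.
older_across_wrap = ts_lt(0xFFFFFFFF, 0)
```

   Note that two values exactly 2^31 apart are not ordered by this test
   in either direction, which is one reason the text later bounds how
   old TS.Recent may become.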
   As Section 3.4 explained, TS.Recent differs from the timestamp
   from the last in-sequence segment only in the case of delayed <ACK>s,
   and therefore by less than one window. Either choice will therefore
   protect against sequence number wrap-around.

   RTTM was specified in a symmetrical manner, so that TSval timestamps
   are carried in both data and <ACK> segments and are echoed in TSecr
   fields carried in returning <ACK> or data segments. PAWS submits all
   incoming segments to the same test, and therefore protects against
   duplicate <ACK> segments as well as data segments. (An alternative
   non-symmetric algorithm would protect against old duplicate <ACK>s:
   the sender of data would reject incoming <ACK> segments whose TSecr
   values were less than the TSecr saved from the last segment whose ACK
   field advanced the left edge of the send window. This algorithm was
   deemed to lack economy of mechanism and symmetry.)

   TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
   initialize PAWS. PAWS protects against old duplicate non-<SYN>
   segments, and duplicate <SYN> segments received while there is a
   synchronized connection. Duplicate <SYN> and <SYN,ACK> segments
   received when there is no connection will be discarded by the normal
   3-way handshake and sequence number checks of TCP.

   [RFC1323] recommended that <RST> segments NOT carry timestamps, and
   that they be acceptable regardless of their timestamp. At that time,
   the thinking was that old duplicate <RST> segments should be
   exceedingly unlikely, and their cleanup function should take
   precedence over timestamps. More recently, discussions about various
   blind attacks on TCP connections have raised the suggestion that if
   the timestamp option is present, SEG.TSecr could be used to provide
   stricter acceptance tests for <RST> segments.
   While this is still under discussion, to enable research into this
   area it is now RECOMMENDED that, when generating a <RST>, if the
   segment causing the <RST> to be generated contained a timestamp
   option, the <RST> also contain a timestamp option. In the <RST>
   segment, SEG.TSecr SHOULD be set to SEG.TSval from the incoming
   segment and SEG.TSval SHOULD be set to zero. If a <RST> is being
   generated because of a user abort, and Snd.TS.OK is set, then a
   timestamp option SHOULD be included in the <RST>. When a <RST>
   segment is received, it MUST NOT be subjected to PAWS checks, and
   information from the timestamp option MUST NOT be used to update
   connection state information. SEG.TSecr MAY be used to provide
   stricter <RST> acceptance checks.

4.3.  Basic PAWS Algorithm

   The PAWS algorithm REQUIRES the following processing to be performed
   on all incoming segments for a synchronized connection. Also, PAWS
   processing MUST take precedence over the regular TCP acceptability
   check (Section 3.3 in [RFC0793]), which is performed after
   verification of the received timestamp option:

   R1)  If there is a Timestamp Option in the arriving segment,
        SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion)
        and the RST bit is not set, then treat the arriving segment as
        not acceptable:

           Send an acknowledgement in reply as specified in [RFC0793]
           page 69 and drop the segment.

           Note: it is necessary to send an <ACK> segment in order to
           retain TCP's mechanisms for detecting and recovering from
           half-open connections. For example, see Figure 10 of
           [RFC0793].

   R2)  If the segment is outside the window, reject it (normal TCP
        processing).

   R3)  If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
        Section 3.4), then record its timestamp in TS.Recent.

   R4)  If an arriving segment is in-sequence (i.e., at the left window
        edge), then accept it normally.

   R5)  Otherwise, treat the segment as a normal in-window, out-of-
        sequence TCP segment (e.g., queue it for later delivery to the
        user).

   Steps R2, R4, and R5 are the normal TCP processing steps specified by
   [RFC0793].

   It is important to note that the timestamp MUST be checked only when
   a segment first arrives at the receiver, regardless of whether it is
   in-sequence or it must be queued for later delivery.

   Consider the following example.

      Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been
      sent, where the letter indicates the sequence number and the digit
      represents the timestamp. Suppose also that segment B.1 has been
      lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ...,
      Z.1 are considered acceptable and are queued. When B is
      retransmitted as segment B.2 (using the latest timestamp), it
      fills the hole and causes all the segments through Z to be
      acknowledged and passed to the user. The timestamps of the queued
      segments are *not* inspected again at this time, since they have
      already been accepted. When B.2 is accepted, TS.Recent is set to
      2.

   This rule allows reasonable performance under loss. A full window of
   data is in transit at all times, and after a loss a full window less
   one segment will show up out-of-sequence to be queued at the receiver
   (e.g., up to ~2^30 bytes of data); the timestamp option must not
   result in discarding this data.

   In certain unlikely circumstances, the algorithm of rules R1-R5 could
   lead to discarding some segments unnecessarily, as shown in the
   following example:

      Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
      sent in sequence and that segment B.1 has been lost. Furthermore,
      suppose delivery of some of C.1, ... Z.1 is delayed until *after*
      the retransmission B.2 arrives at the receiver.
      These delayed
      segments will be discarded unnecessarily when they do arrive,
      since their timestamps are now out of date.

   This case is very unlikely to occur. If the retransmission was
   triggered by a timeout, some of the segments C.1, ... Z.1 must have
   been delayed longer than the RTO time. This is presumably an
   unlikely event, or there would be many spurious timeouts and
   retransmissions. If B's retransmission was triggered by the "fast
   retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued
   segments that caused these <ACK>s must have been received already.

   Even if a segment were delayed past the RTO, the Fast Retransmit
   mechanism [Jacobson90c] will cause the delayed segments to be
   retransmitted at the same time as B.2, avoiding an extra RTT and
   therefore causing a very small performance penalty.

   We know of no case with a significant probability of occurrence in
   which timestamps will cause performance degradation by unnecessarily
   discarding segments.

4.4.  Timestamp Clock

   It is important to understand that the PAWS algorithm does not
   require clock synchronization between sender and receiver. The
   sender's timestamp clock is used to stamp the segments, and the
   sender uses the echoed timestamp to measure RTTs. However, the
   receiver treats the timestamp as simply a monotonically increasing
   serial number, without any necessary connection to its clock. From
   the receiver's viewpoint, the timestamp is acting as a logical
   extension of the high-order bits of the sequence number.

   The receiver algorithm does place some requirements on the frequency
   of the timestamp clock.

   (a)  The timestamp clock must not be "too slow".

        It MUST tick at least once for each 2^31 bytes sent.
        In fact,
        in order to be useful to the sender for round trip timing, the
        clock SHOULD tick at least once per window's worth of data, and
        even with the window extension defined in Section 2.2, 2^31
        bytes must be at least two windows.

        To make this more quantitative, any clock faster than 1 tick/sec
        will reject old duplicate segments for link speeds of ~8 Gbps.
        A 1 ms timestamp clock will work at link speeds up to 8 Tbps
        (8*10^12) bps!

   (b)  The timestamp clock must not be "too fast".

        The recycling time of the timestamp clock MUST be greater than
        MSL seconds. Since the clock (timestamp) is 32 bits and the
        worst-case MSL is 255 seconds, the maximum acceptable clock
        frequency is one tick every 59 ns.

        However, it is desirable to establish a much longer recycle
        period, in order to handle outdated timestamps on idle
        connections (see Section 4.5), and to relax the MSL requirement
        for preventing sequence number wrap-around. With a 1 ms
        timestamp clock, the 32-bit timestamp will wrap its sign bit in
        24.8 days. Thus, it will reject old duplicates on the same
        connection if MSL is 24.8 days or less. This appears to be a
        very safe figure; an MSL of 24.8 days or longer can probably be
        assumed in the internet without requiring precise MSL
        enforcement.

   Based upon these considerations, we choose a timestamp clock
   frequency in the range 1 ms to 1 sec per tick. This range also
   matches the requirements of the RTTM mechanism, which does not need
   much more resolution than the granularity of the retransmit timer,
   e.g., tens or hundreds of milliseconds.

   The PAWS mechanism also puts a strong monotonicity requirement on the
   sender's timestamp clock. The method of implementation of the
   timestamp clock to meet this requirement depends upon the system
   hardware and software.

   o  Some hosts have a hardware clock that is guaranteed to be
      monotonic between hardware resets.

   o  A clock interrupt may be used to simply increment a binary integer
      by 1 periodically.

   o  The timestamp clock may be derived from a system clock that is
      subject to being abruptly changed, by adding a variable offset
      value. This offset is initialized to zero. When a new timestamp
      clock value is needed, the offset can be adjusted as necessary to
      make the new value equal to or larger than the previous value
      (which was saved for this purpose).

4.5.  Outdated Timestamps

   If a connection remains idle long enough for the timestamp clock of
   the other TCP to wrap its sign bit, then the value saved in TS.Recent
   will become too old; as a result, the PAWS mechanism will cause all
   subsequent segments to be rejected, freezing the connection (until
   the timestamp clock wraps its sign bit again).

   With the chosen range of timestamp clock frequencies (1 sec to 1 ms),
   the time to wrap the sign bit will be between 24.8 days and 24800
   days. A TCP connection that is idle for more than 24 days and then
   comes to life is exceedingly unusual. However, it is undesirable in
   principle to place any limitation on TCP connection lifetimes.

   We therefore require that an implementation of PAWS include a
   mechanism to "invalidate" the TS.Recent value when a connection is
   idle for more than 24 days. (An alternative solution to the problem
   of outdated timestamps would be to send keep-alive segments at a very
   low rate, but still more often than the wrap-around time for
   timestamps, e.g., once a day. This would impose negligible overhead.
   However, the TCP specification has never included keep-alives, so the
   solution based upon invalidation was chosen.)

   Note that a TCP does not know the frequency, and therefore, the
   wraparound time, of the other TCP, so it must assume the worst.
   The
   validity of TS.Recent needs to be checked only if the basic PAWS
   timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If
   TS.Recent is found to be invalid, then the segment is accepted,
   regardless of the failure of the timestamp check, and rule R3 updates
   TS.Recent with the TSval from the new segment.

   To detect how long the connection has been idle, the TCP MAY update a
   clock or timestamp value associated with the connection whenever
   TS.Recent is updated, for example. The details will be
   implementation-dependent.

4.6.  Header Prediction

   "Header prediction" [Jacobson90a] is a high-performance transport
   protocol implementation technique that is most important for
   high-speed links. This technique optimizes the code for the most
   common case, receiving a segment correctly and in order. Using header
   prediction, the receiver asks the question, "Is this segment the next
   in sequence?" This question can be answered in fewer machine
   instructions than the question, "Is this segment within the window?"

   Adding header prediction to our timestamp procedure leads to the
   following recommended sequence for processing an arriving TCP
   segment:

   H1)  Check timestamp (same as step R1 above)

   H2)  Do header prediction: if segment is next in sequence and if
        there are no special conditions requiring additional processing,
        accept the segment, record its timestamp, and skip H3.

   H3)  Process the segment normally, as specified in RFC 793. This
        includes dropping segments that are outside the window and
        possibly sending acknowledgments, and queuing in-window, out-of-
        sequence segments.

   Another possibility would be to interchange steps H1 and H2, i.e., to
   perform the header prediction step H2 *first*, and perform H1 and H3
   only when header prediction fails.
   This could be a performance
   improvement, since the timestamp check in step H1 is very unlikely to
   fail, and it requires unsigned modulo arithmetic. To perform this
   check on every single segment is contrary to the philosophy of header
   prediction. We believe that this change might produce a measurable
   reduction in CPU time for TCP protocol processing on high-speed
   networks.

   However, putting H2 first would create a hazard: a segment from 2^32
   bytes in the past might arrive at exactly the wrong time and be
   accepted mistakenly by the header-prediction step. The following
   reasoning has been introduced in [RFC1185] to show that the
   probability of this failure is negligible.

      If all segments are equally likely to show up as old duplicates,
      then the probability of an old duplicate exactly matching the left
      window edge is the maximum segment size (MSS) divided by the size
      of the sequence space. This ratio must be less than 2^-16, since
      MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20
      for a 100 Mbit/s link. However, the older a segment is, the less
      likely it is to be retained in the Internet, and under any
      reasonable model of segment lifetime the probability of an old
      duplicate exactly at the left window edge must be much smaller
      than 2^-16.

      The 16 bit TCP checksum also allows a basic unreliability of one
      part in 2^16. A protocol mechanism whose reliability exceeds the
      reliability of the TCP checksum should be considered "good
      enough", i.e., it won't contribute significantly to the overall
      error rate. We therefore believe we can ignore the problem of an
      old duplicate being accepted by doing header prediction before
      checking the timestamp.

   However, this probabilistic argument is not universally accepted, and
   the consensus at present is that the performance gain does not
   justify the hazard in the general case.
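
   The ordering H1, H2, H3 discussed above can be sketched in Python as
   follows. This is only an illustration under simplifying assumptions:
   the dictionary keys (ts_recent, rcv_nxt, seq, len, tsval, rst,
   special) and the function process_segment are hypothetical, the
   validity check on TS.Recent is not modelled, and delayed ACKs are
   ignored (Last.ACK.sent is taken to equal RCV.NXT).

```python
MOD32 = 1 << 32


def ts_lt(s, t):
    """Modular 32-bit comparison: s < t if 0 < (t - s) < 2^31."""
    return 0 < ((t - s) % MOD32) < (1 << 31)


def process_segment(conn, seg):
    """Return the path taken: 'paws-drop', 'fast', or 'slow'."""
    tsval = seg.get("tsval")

    # H1: timestamp check first, as in rule R1.
    if tsval is not None and not seg.get("rst") and ts_lt(tsval, conn["ts_recent"]):
        return "paws-drop"

    # H2: header prediction, only reached once the timestamp is acceptable.
    if seg["seq"] == conn["rcv_nxt"] and not seg.get("special"):
        if tsval is not None:
            conn["ts_recent"] = tsval
        conn["rcv_nxt"] = (conn["rcv_nxt"] + seg["len"]) % MOD32
        return "fast"

    # H3: everything else goes through normal RFC 793 processing (not shown).
    return "slow"


conn = {"ts_recent": 100, "rcv_nxt": 5000}
# An old duplicate that happens to sit exactly at the left window edge is
# caught by H1 before header prediction can accept it.
first = process_segment(conn, {"seq": 5000, "len": 100, "tsval": 99})
second = process_segment(conn, {"seq": 5000, "len": 100, "tsval": 101})
```

   With the steps interchanged, the first segment in the usage example
   would have matched the header-prediction test and been accepted,
   which is the hazard described above.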
   It is therefore recommended
   that H2 follow H1.

4.7.  IP Fragmentation

   At high data rates, the protection against old segments provided by
   PAWS can be circumvented by errors in IP fragment reassembly (see
   [RFC4963]). The only way to protect against incorrect IP fragment
   reassembly is to not allow the segments to be fragmented. This is
   done by setting the Don't Fragment (DF) bit in the IP header.
   Setting the DF bit implies the use of Path MTU Discovery as described
   in [RFC1191], [RFC1981], and [RFC4821]; thus, any TCP implementation
   that implements PAWS MUST also implement Path MTU Discovery.

4.8.  Duplicates from Earlier Incarnations of Connection

   The PAWS mechanism protects against errors due to sequence number
   wrap-around on high-speed connections. Segments from an earlier
   incarnation of the same connection are also a potential cause of old
   duplicate errors. In both cases, the TCP mechanisms to prevent such
   errors depend upon the enforcement of a maximum segment lifetime
   (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a
   detailed discussion). Unlike the case of sequence space wrap-around,
   the MSL required to prevent old duplicate errors from earlier
   incarnations does not depend upon the transfer rate. If the IP layer
   enforces the recommended 2 minute MSL of TCP, and if the TCP rules
   are followed, TCP connections will be safe from earlier incarnations,
   no matter how high the network speed. Thus, the PAWS mechanism is
   not required for this case.

   We may still ask whether the PAWS mechanism can provide additional
   security against old duplicates from earlier connections, allowing us
   to relax the enforcement of MSL by the IP layer. Appendix B explores
   this question, showing that further assumptions and/or mechanisms are
   required, beyond those of PAWS. This is not part of the current
   extension.

5.  Conclusions and Acknowledgements

   This memo presented a set of extensions to TCP to provide efficient
   operation over large bandwidth * delay product paths and reliable
   operation over very high-speed paths. These extensions are designed
   to provide compatible interworking with TCP stacks that do not
   implement the extensions.

   These mechanisms are implemented using TCP options for scaled windows
   and timestamps. The timestamps are used for two distinct mechanisms:
   RTTM (Round Trip Time Measurement) and PAWS (Protection Against
   Wrapped Sequences).

   The Window Scale option was originally suggested by Mike St. Johns of
   USAF/DCA. The present form of the option was suggested by Mike
   Karels of UC Berkeley in response to a more cumbersome scheme defined
   by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism
   description in [RFC1185].

   Finally, much of this work originated as the result of discussions
   within the End-to-End Task Force on the theoretical limitations of
   transport protocols in general and TCP in particular. Task force
   members and others on the end2end-interest list have made valuable
   contributions by pointing out flaws in the algorithms and the
   documentation. Continued discussion and development since the
   publication of [RFC1323] originally occurred in the IETF TCP Large
   Windows Working Group, later on in the End-to-End Task Force, and
   most recently in the IETF TCP Maintenance Working Group. The authors
   are grateful for all these contributions.

6.  Security Considerations

   The TCP sequence space is a fixed size, and as the window becomes
   larger it becomes easier for an attacker to generate forged packets
   that can fall within the TCP window, and be accepted as valid
   segments.
   While use of timestamps and PAWS can help to mitigate
   this, when using PAWS, if an attacker is able to forge a packet that
   is acceptable to the TCP connection, a timestamp that is in the
   future would cause valid segments to be dropped due to PAWS checks.
   Hence, implementers should take care to not open the TCP window
   drastically beyond the requirements of the connection.

   Middle boxes and options:  If a middle box removes TCP options from
      the segment, such as TSopt, a high speed connection that needs
      PAWS would not have that protection. In this situation, an
      implementer could provide a mechanism for the application to
      determine whether or not PAWS is in use on the connection, and
      choose to terminate the connection if that protection doesn't
      exist.

   Mechanisms to protect the TCP header from modification should also
   protect the TCP options.

   A naive implementation that derives the timestamp clock value
   directly from a system uptime clock may unintentionally leak this
   information to an attacker. This does not directly compromise any of
   the mechanisms described in this document. However, this may be
   valuable information to a potential attacker. An implementer should
   evaluate the potential impact and mitigate this accordingly (e.g., by
   using a random offset for the timestamp clock on each connection, or
   using an external, real-time derived timestamp clock source).

   Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
   [RFC2675] to be used when the local network supports packets larger
   than 64 KiB. When larger TCP segments are used, the TCP checksum
   becomes weaker.

7.  IANA Considerations

   This document has no actions for IANA.

8.  References

8.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC1191]  Mogul, J. and S.
              Deering, "Path MTU discovery", RFC 1191,
              November 1990.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [Garlick77]
              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
              Host-to-Host Protocols", Proc. Second Berkeley Workshop on
              Distributed Data Management and Computer Networks,
              May 1977.

   [Hamming77]
              Hamming, R., "Digital Filters", Prentice Hall, Englewood
              Cliffs, N.J. ISBN 0-13-212571-4, 1977.

   [Jacobson88a]
              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
              '88, Stanford, CA., August 1988.

   [Jacobson90a]
              Jacobson, V., "4BSD Header Prediction", ACM Computer
              Communication Review, April 1990.

   [Jacobson90c]
              Jacobson, V., "Modified TCP congestion avoidance
              algorithm", Message to the end2end-interest mailing list,
              April 1990.

   [Jain86]   Jain, R., "Divergence of Timeout Algorithms for Packet
              Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
              Comm., Scottsdale, Arizona, March 1986.

   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
              Reliable Transport Protocols", Proc. SIGCOMM '87,
              August 1987.

   [Martin03]
              Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
              mailing list, September 2003.

   [Mathis08]
              Mathis, M., "[tcpm] Example of 1323 window retraction
              problem", Message to the tcpm mailing list, March 2008.

   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
              RFC 896, January 1984.

   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
              paths", RFC 1072, October 1988.

   [RFC1110]  McKenzie, A., "Problem with the TCP big window option",
              RFC 1110, August 1989.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
              High-Speed Paths", RFC 1185, October 1990.

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
              for IP version 6", RFC 1981, August 1996.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
              RFC 2675, August 1999.

   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
              Extension to the Selective Acknowledgement (SACK) Option
              for TCP", RFC 2883, July 2000.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, March 2007.

   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
              Errors at High Data Rates", RFC 4963, July 2007.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP",
              RFC 6675, August 2012.

   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
              RFC 6691, July 2012.

   [Watson81]
              Watson, R., "Timer-based Mechanisms in Reliable Transport
              Protocol Connection Management", Computer Networks, Vol.
              5, 1981.

   [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
              '86, Stowe, VT, August 1986.

Appendix A.  Implementation Suggestions

   TCP Option Layout

      The following layouts are recommended for sending options on
      non-<SYN> segments, to achieve maximum feasible alignment of
      32-bit and 64-bit machines.

                   +--------+--------+--------+--------+
                   |   NOP  |  NOP   |  TSopt |   10   |
                   +--------+--------+--------+--------+
                   |          TSval timestamp          |
                   +--------+--------+--------+--------+
                   |          TSecr timestamp          |
                   +--------+--------+--------+--------+

   Interaction with the TCP Urgent Pointer

      The TCP Urgent pointer, like the TCP window, is a 16 bit value.
      Some of the original discussion for the TCP Window Scale option
      included proposals to increase the Urgent pointer to 32 bits. As
      it turns out, this is unnecessary. There are two observations
      that should be made:

      (1)  With IP Version 4, the largest amount of TCP data that can be
           sent in a single packet is 65495 bytes (64 KiB - 1 -- size of
           fixed IP and TCP headers).

      (2)  Updates to the urgent pointer while the user is in "urgent
           mode" are invisible to the user.

      This means that if the Urgent Pointer points beyond the end of the
      TCP data in the current segment, then the user will remain in
      urgent mode until the next TCP segment arrives. That segment will
      update the urgent pointer to a new offset, and the user will never
      have left urgent mode.

      Thus, to properly implement the Urgent Pointer, the sending TCP
      only has to check for overflow of the 16 bit Urgent Pointer field
      before filling it in. If it does overflow, then a value of 65535
      should be inserted into the Urgent Pointer.

      The same technique applies to IP Version 6, except in the case of
      IPv6 Jumbograms. When IPv6 Jumbograms are supported, [RFC2675]
      requires additional steps for dealing with the Urgent Pointer;
      these are described in section 5.2 of [RFC2675].

Appendix B.  Duplicates from Earlier Connection Incarnations

   There are two cases to be considered: (1) a system crashing (and
   losing connection state) and restarting, and (2) the same connection
   being closed and reopened without a loss of host state. These will
   be described in the following two sections.

B.1.  System Crash with Loss of State

   TCP's quiet time of one MSL upon system startup handles the loss of
   connection state in a system crash/restart. For an explanation, see
   for example "When to Keep Quiet" in the TCP protocol specification
   [RFC0793]. The MSL that is required here does not depend upon the
   transfer speed. The current TCP MSL of 2 minutes seemed acceptable
   as an operational compromise, when many host systems used to take
   this long to boot after a crash. Current host systems can boot
   considerably faster.

   The timestamp option may be used to ease the MSL requirements (or to
   provide additional security against data corruption). If timestamps
   are being used and if the timestamp clock can be guaranteed to be
   monotonic over a system crash/restart, i.e., if the first value of
   the sender's timestamp clock after a crash/restart can be guaranteed
   to be greater than the last value before the restart, then a quiet
   time is unnecessary.

   To dispense totally with the quiet time would require that the host
   clock be synchronized to a time source that is stable over the crash/
   restart period, with an accuracy of one timestamp clock tick or
   better. We can back off from this strict requirement to take
   advantage of approximate clock synchronization. Suppose that the
   clock is always re-synchronized to within N timestamp clock ticks and
   that booting (extended with a quiet time, if necessary) takes more
   than N ticks.
This will guarantee monotonicity of the timestamps, 1380 which can then be used to reject old duplicates even without an 1381 enforced MSL. 1383 B.2. Closing and Reopening a Connection 1385 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state 1386 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). 1387 Applications built upon TCP that close one connection and open a new 1388 one (e.g., an FTP data transfer connection using Stream mode) must 1389 choose a new socket pair each time. The TIME-WAIT delay serves two 1390 different purposes: 1392 (a) Implement the full-duplex reliable close handshake of TCP. 1394 The proper time to delay the final close step is not really 1395 related to the MSL; it depends instead upon the RTO for the FIN 1396 segments and therefore upon the RTT of the path. (It could be 1397 argued that the side that is sending a FIN knows what degree of 1398 reliability it needs, and therefore it should be able to 1399 determine the length of the TIME-WAIT delay for the FIN's 1400 recipient. This could be accomplished with an appropriate TCP 1401 option in FIN segments.) 1403 Although there is no formal upper bound on RTT, common network 1404 engineering practice makes an RTT greater than 1 minute very 1405 unlikely. Thus, the 4-minute delay in TIME-WAIT state works 1406 satisfactorily to provide a reliable full-duplex TCP close. 1407 Note again that this is independent of MSL enforcement and 1408 network speed. 1410 The TIME-WAIT state could cause an indirect performance problem 1411 if an application needed to repeatedly close one connection and 1412 open another at a very high frequency, since the number of 1413 available TCP ports on a host is less than 2^16. However, high 1414 network speeds are not the major contributor to this problem; 1415 the RTT is the limiting factor in how quickly connections can be 1416 opened and closed. Therefore, this problem will be no worse at 1417 high transfer speeds.
1419 (b) Allow old duplicate segments to expire. 1421 To replace this function of TIME-WAIT state, a mechanism would 1422 have to operate across connections. PAWS is defined strictly 1423 within a single connection; the last timestamp (TS.Recent) is 1424 kept in the connection control block, and discarded when a 1425 connection is closed. 1427 An additional mechanism could be added to the TCP, a per-host 1428 cache of the last timestamp received from any connection. This 1429 value could then be used in the PAWS mechanism to reject old 1430 duplicate segments from earlier incarnations of the connection, 1431 if the timestamp clock can be guaranteed to have ticked at least 1432 once since the old connection was open. This would require that 1433 the TIME-WAIT delay plus the RTT together must be at least one 1434 tick of the sender's timestamp clock. Such an extension is not 1435 part of the proposal of this RFC. 1437 Note that this is a variant on the mechanism proposed by 1438 Garlick, Rom, and Postel [Garlick77], which required each host 1439 to maintain connection records containing the highest sequence 1440 numbers on every connection. Using timestamps instead, it is 1441 only necessary to keep one quantity per remote host, regardless 1442 of the number of simultaneous connections to that host. 1444 Appendix C. Summary of Notation 1446 The following notation has been used in this document. 
1448 Options 1450 WSopt: TCP Window Scale Option 1451 TSopt: TCP Timestamp Option 1453 Option Fields 1455 shift.cnt: Window scale byte in WSopt 1456 TSval: 32-bit Timestamp Value field in TSopt 1457 TSecr: 32-bit Timestamp Reply field in TSopt 1459 Option Fields in Current Segment 1461 SEG.TSval: TSval field from TSopt in current segment 1462 SEG.TSecr: TSecr field from TSopt in current segment 1463 SEG.WSopt: 8-bit value in WSopt 1465 Clock Values 1467 my.TSclock: System-wide source of 32-bit timestamp values 1468 my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec) 1469 Snd.TSoffset: An offset for randomizing Snd.TSclock 1470 Snd.TSclock: my.TSclock + Snd.TSoffset 1472 Per-Connection State Variables 1474 TS.Recent: Latest received Timestamp 1475 Last.ACK.sent: Last ACK field sent 1476 Snd.TS.OK: 1-bit flag 1477 Snd.WS.OK: 1-bit flag 1478 Rcv.Wind.Scale: Receive window scale power 1479 Snd.Wind.Scale: Send window scale power 1480 Start.Time: Snd.TSclock value when segment being timed was 1481 sent (used by pre-1323 code). 1483 Procedure 1485 Update_SRTT(m) Procedure to update the smoothed RTT and RTT 1486 variance estimates, using the rules of 1487 [Jacobson88a], given m, a new RTT measurement 1489 Appendix D. Event Processing Summary 1491 OPEN Call 1493 ... 1495 An initial send sequence number (ISS) is selected. Send a <SYN> 1496 segment of the form: 1498 <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale> 1500 ... 1502 SEND Call 1504 CLOSED STATE (i.e., TCB does not exist) 1506 ... 1508 LISTEN STATE 1510 If the foreign socket is specified, then change the connection 1511 from passive to active, select an ISS. Send a <SYN> segment 1512 containing the options: <TSval=Snd.TSclock> and 1513 <WSopt=Rcv.Wind.Scale>. Set SND.UNA to ISS, SND.NXT to ISS+1. 1514 Enter SYN-SENT state. ... 1516 SYN-SENT STATE 1517 SYN-RECEIVED STATE 1519 ... 1521 ESTABLISHED STATE 1522 CLOSE-WAIT STATE 1524 Segmentize the buffer and send it with a piggybacked 1525 acknowledgment (acknowledgment value = RCV.NXT). ... 1527 If the urgent flag is set ...
1529 If the Snd.TS.OK flag is set, then include the TCP Timestamp 1530 Option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data 1531 segment. 1533 Scale the receive window for transmission in the segment 1534 header: 1536 SEG.WND = (RCV.WND >> Rcv.Wind.Scale). 1538 SEGMENT ARRIVES 1540 ... 1542 If the state is LISTEN then 1544 first check for an RST 1546 ... 1548 second check for an ACK 1550 ... 1552 third check for a SYN 1554 if the SYN bit is set, check the security. If the ... 1556 ... 1558 if the SEG.PRC is less than the TCB.PRC then continue. 1560 Check for a Window Scale option (WSopt); if one is found, 1561 save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on. 1562 Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to 1563 zero and clear Snd.WS.OK flag. 1565 Check for a TSopt option; if one is found, save SEG.TSval in 1566 the variable TS.Recent and turn on the Snd.TS.OK bit. 1568 Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any 1569 other control or text should be queued for processing later. 1570 ISS should be selected and a <SYN> segment sent of the form: 1572 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1574 If the Snd.WS.OK bit is on, include a WSopt option 1575 <WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK 1576 bit is on, include a TSopt <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. Last.ACK.sent is set to 1578 RCV.NXT. 1580 SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection 1581 state should be changed to SYN-RECEIVED. Note that any 1582 other incoming control or data (combined with SYN) will be 1583 processed in the SYN-RECEIVED state, but processing of SYN 1584 and ACK should not be repeated. If the listen was not fully 1585 specified (i.e., the foreign socket was not fully 1586 specified), then the unspecified fields should be filled in 1587 now. 1589 fourth other text or control 1591 ... 1593 If the state is SYN-SENT then 1595 first check the ACK bit 1597 ... 1599 ... 1601 fourth check the SYN bit 1602 ...
1604 If the SYN bit is on and the security/compartment and 1605 precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1, 1606 IRS is set to SEG.SEQ, and any acknowledgements on the 1607 retransmission queue which are thereby acknowledged should 1608 be removed. 1610 Check for a Window Scale option (WSopt); if it is found, 1611 save SEG.WSopt in Snd.Wind.Scale; otherwise, set both 1612 Snd.Wind.Scale and Rcv.Wind.Scale to zero. 1614 Check for a TSopt option; if one is found, save SEG.TSval in 1615 variable TS.Recent and turn on the Snd.TS.OK bit in the 1616 connection control block. If the ACK bit is set, use 1617 Snd.TSclock - SEG.TSecr as the initial RTT estimate. 1619 If SND.UNA > ISS (our <SYN> has been ACKed), change the 1620 connection state to ESTABLISHED, form an <ACK> segment: 1622 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1624 and send it. If the Snd.TS.OK bit is on, include a TSopt 1625 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this 1626 segment. Last.ACK.sent is set to RCV.NXT. 1628 Data or controls which were queued for transmission may be 1629 included. If there are other controls or text in the 1630 segment then continue processing at the sixth step below 1631 where the URG bit is checked, otherwise return. 1633 Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment: 1635 <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> 1637 and send it. If the Snd.TS.OK bit is on, include a TSopt 1638 option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. 1639 If the Snd.WS.OK bit is on, include a WSopt option 1640 <WSopt=Rcv.Wind.Scale> in this segment. Last.ACK.sent is 1641 set to RCV.NXT. 1643 If there are other controls or text in the segment, queue 1644 them for processing after the ESTABLISHED state has been 1645 reached, return. 1647 fifth, if neither of the SYN or RST bits is set then drop the 1648 segment and return. 1650 Otherwise, 1652 First, check sequence number 1654 SYN-RECEIVED STATE 1655 ESTABLISHED STATE 1656 FIN-WAIT-1 STATE 1657 FIN-WAIT-2 STATE 1658 CLOSE-WAIT STATE 1659 CLOSING STATE 1660 LAST-ACK STATE 1661 TIME-WAIT STATE 1663 Segments are processed in sequence.
Initial tests on 1664 arrival are used to discard old duplicates, but further 1665 processing is done in SEG.SEQ order. If a segment's 1666 contents straddle the boundary between old and new, only the 1667 new parts should be processed. 1669 Rescale the received window field: 1671 TrueWindow = SEG.WND << Snd.Wind.Scale, 1673 and use "TrueWindow" in place of SEG.WND in the following 1674 steps. 1676 Check whether the segment contains a Timestamp Option and 1677 bit Snd.TS.OK is on. If so: 1679 If SEG.TSval < TS.Recent and the RST bit is off, then 1680 test whether the connection has been idle less than 24 days; 1681 if all are true, then the segment is not acceptable; 1682 follow steps below for an unacceptable segment. 1684 If SEG.SEQ is less than or equal to Last.ACK.sent, then 1685 save SEG.TSval in variable TS.Recent. 1687 There are four cases for the acceptability test for an 1688 incoming segment: 1690 ... 1692 If an incoming segment is not acceptable, an acknowledgment 1693 should be sent in reply (unless the RST bit is set, if so 1694 drop the segment and return): 1696 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1698 Last.ACK.sent is set to SEG.ACK of the acknowledgment. If 1699 the Snd.TS.OK bit is on, include the Timestamp Option 1700 <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. 1701 Set Last.ACK.sent to SEG.ACK and send the segment. 1702 After sending the acknowledgment, drop the unacceptable 1703 segment and return. 1705 ... 1707 fifth check the ACK field. 1709 if the ACK bit is off drop the segment and return. 1711 if the ACK bit is on 1713 ... 1715 ESTABLISHED STATE 1717 If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <- 1718 SEG.ACK. Also compute a new estimate of round-trip time. 1719 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr; 1720 otherwise use the elapsed time since the first segment in 1721 the retransmission queue was sent. Any segments on the 1722 retransmission queue which are thereby entirely 1723 acknowledged... 1725 ... 1727 Seventh, process the segment text.
1729 ESTABLISHED STATE 1730 FIN-WAIT-1 STATE 1731 FIN-WAIT-2 STATE 1733 ... 1735 Send an acknowledgment of the form: 1737 <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> 1739 If the Snd.TS.OK bit is on, include the Timestamp Option 1740 <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. 1741 Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send 1742 it. This acknowledgment should be piggy-backed on a segment 1743 being transmitted if possible without incurring undue delay. 1745 ... 1747 Appendix E. Timestamps Edge Cases 1749 While the rules laid out for when to calculate RTTM produce the 1750 correct results most of the time, there are some edge cases where an 1751 incorrect RTTM can be calculated. All of these situations involve 1752 the loss of segments. It is felt that these scenarios are rare, and 1753 that if they should happen, they will cause a single RTTM measurement 1754 to be inflated, which mitigates its effects on RTO calculations. 1756 [Martin03] cites two similar cases when the returning <ACK> is lost, 1757 and before the retransmission timer fires, another returning <ACK> 1758 segment arrives, which acknowledges the data. In this case, the RTTM 1759 calculated will be inflated: 1761 clock 1762 tc=1 <A, TSval=1> -------------------> 1764 tc=2 (lost) <---- <ACK(A), TSecr=1, win=n> 1765 (RTTM would have been 1) 1767 (receive window opens, window update is sent) 1768 tc=5 <---- <ACK(A), TSecr=1, win=m> 1769 (RTTM is calculated at 4) 1771 One thing to note about this situation is that it is somewhat bounded 1772 by RTO + RTT, limiting how far off the RTTM calculation will be. 1773 While more complex scenarios can be constructed that produce larger 1774 inflations (e.g., retransmissions are lost), those scenarios involve 1775 multiple segment losses, and the connection will have other more 1776 serious operational problems than using an inflated RTTM in the RTO 1777 calculation. 1779 Appendix F.
Window Retraction Example 1781 Consider an established TCP connection with WSCALE=7 (128-byte 1782 receiver window quantization) that is running with a very small 1783 window because the receiver is bottlenecked and both ends are doing 1784 small reads and writes. 1786 Consider the ACKs coming back: 1788 SEG.ACK SEG.WIN computed SND.WIN receiver's actual window 1789 1000 2 1256 1300 1790 The sender writes 40 bytes and the receiver ACKs: 1792 1040 2 1296 1300 1794 The sender writes 5 additional bytes and the receiver has a problem. 1795 Two choices: 1797 1045 2 1301 1300 - BEYOND BUFFER 1799 1045 1 1173 1300 - RETRACTED WINDOW 1801 This problem is completely general and can in principle happen any 1802 time the sender does a write which is smaller than the window scale 1803 quantum. 1805 In most stacks it is at least partially obscured when the window size 1806 is larger than some small number of segments because the stacks 1807 prefer to announce windows that are integral numbers of segments 1808 (rounded up to the next window quantum). This plus silly window 1809 suppression tends to cause less frequent, larger window updates. If 1810 the window was rounded down to a segment size there is more 1811 opportunity to advance it ("beyond buffer" case above) rather than 1812 retracting it. 1814 Appendix G. Changes from RFC 1323 1816 Several important updates and clarifications to the specification in 1817 RFC 1323 are made in this document. The technical changes are 1818 summarized below: 1820 (a) Section 2.4 was added describing the unavoidable window 1821 retraction issue, and explicitly describing the mitigation steps 1822 necessary. 1824 (b) In Section 3.2 the wording on how timestamp option negotiation is 1825 to be performed was updated with RFC2119 wording. Further, a 1826 number of paragraphs were added to clarify the expected behavior 1827 with a compliant implementation using TSopt, as RFC1323 left 1828 room for interpretation - e.g.
potential late enablement of 1829 TSopt. 1831 (c) The description of which TSecr values can be used to update the 1832 measured RTT has been clarified. Specifically, with timestamps, 1833 the Karn algorithm [Karn87] is disabled. The Karn algorithm 1834 disables all RTT measurements during retransmission, since it is 1835 ambiguous whether the <ACK> is for the original segment, or the 1836 retransmitted segment. With timestamps, that ambiguity is 1837 removed since the TSecr in the <ACK> will contain the TSval from 1838 whichever data segment made it to the destination. 1840 (d) RTTM update processing explicitly excludes segments not updating 1841 SND.UNA. The original text could be interpreted to allow taking 1842 RTT samples when SACK acknowledges some new, non-continuous 1843 data. 1845 (e) In RFC1323, section 3.4, step (2) of the algorithm to control 1846 which timestamp is echoed was incorrect in two regards: 1848 (1) It failed to update TS.Recent for a retransmitted segment 1849 that resulted from a lost <ACK>. 1851 (2) It failed if SEG.LEN = 0. 1853 In the new algorithm, the case of SEG.TSval >= TS.Recent is 1854 included for consistency with the PAWS test. 1856 (f) It is now recommended that Timestamp Options be included in 1857 <RST> segments if the incoming segment contained a Timestamp 1858 Option. 1860 (g) <RST> segments are explicitly excluded from PAWS processing. 1862 (h) Added text to clarify the precedence between regular TCP 1863 [RFC0793] and timestamp/PAWS [RFCxxxx] processing. Discussion 1864 about combined acceptability checks is ongoing. 1866 (i) Snd.TSoffset and Snd.TSclock variables have been added. 1867 Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This 1868 allows the starting points for timestamp values to be randomized 1869 on a per-connection basis. Setting Snd.TSoffset to zero yields 1870 the same results as [RFC1323]. 1872 (j) Appendix A has been expanded with information about the TCP 1873 Urgent Pointer.
An earlier revision contained text around the 1874 TCP MSS option, which was split off into [RFC6691]. 1876 (k) One correction was made to the Event Processing Summary in 1877 Appendix D. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to 1878 fill in the SEG.WND value, not SND.WND. 1880 Editorial changes to the document that don't impact the 1881 implementation or function of the mechanisms described in this 1882 document include: 1884 (a) Removed much of the discussion in Section 1 to streamline the 1885 document. However, detailed examples and discussions in 1886 Section 2, Section 3 and Section 4 are kept as guidelines for 1887 implementers. 1889 (b) Removed references to "new" options, as the options were 1890 introduced in [RFC1323] already. Changed the text in 1891 Section 1.3 to specifically address TS and WS options. 1893 (c) Section 1.4 was added for RFC2119 wording. Normative text was 1894 updated with the appropriate phrases. 1896 (d) Added < > brackets to mark specific types of segments, and 1897 replaced most occurrences of "packet" with "segment", where TCP 1898 segments are referred to. 1900 (e) Removed the list of changes between RFC 1323 and prior versions. 1901 These changes are mentioned in Appendix C of RFC 1323. 1903 (f) Moved Appendix "Changes" to the end of the appendices for easier 1904 lookup. In addition, the entries were split into a technical 1905 and an editorial part, and sorted to roughly correspond with the 1906 sections in the text where they apply. 1908 Authors' Addresses 1910 David Borman 1911 Quantum Corporation 1912 Mendota Heights MN 55120 1913 USA 1915 Email: david.borman@quantum.com 1917 Bob Braden 1918 University of Southern California 1919 4676 Admiralty Way 1920 Marina del Rey CA 90292 1921 USA 1923 Email: braden@isi.edu 1924 Van Jacobson 1925 Packet Design 1926 2465 Latham Street 1927 Mountain View CA 94040 1928 USA 1930 Email: van@packetdesign.com 1932 Richard Scheffenegger (editor) 1933 NetApp, Inc.
1934 Am Euro Platz 2 1935 Vienna, 1120 1936 Austria 1938 Email: rs@netapp.com
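
Illustrative sketch (not part of the specification)

The window arithmetic used in Appendix D (SEG.WND = RCV.WND >> Rcv.Wind.Scale when sending, TrueWindow = SEG.WND << Snd.Wind.Scale on receipt) and the rows of the Appendix F example can be checked with a short sketch. This is an illustration under stated assumptions only: the function names and the BUFFER_EDGE constant are hypothetical, and the "computed SND.WIN" column of Appendix F is read here as SEG.ACK plus the scaled window, i.e. the right edge of the sender's window.

```python
WSCALE = 7          # shift count from the Appendix F example (quantum of 2**7 = 128 bytes)
BUFFER_EDGE = 1300  # hypothetical name: right edge of the receiver's actual window


def scale_for_header(rcv_wnd: int, shift: int) -> int:
    """Value placed in the 16-bit window field: RCV.WND >> Rcv.Wind.Scale."""
    return rcv_wnd >> shift


def true_window(seg_wnd: int, shift: int) -> int:
    """Window recovered by the peer: SEG.WND << Snd.Wind.Scale."""
    return seg_wnd << shift


def right_edge(seg_ack: int, seg_wnd: int, shift: int = WSCALE) -> int:
    """Right edge of the window as the sender computes it (assumed reading)."""
    return seg_ack + true_window(seg_wnd, shift)


def classify(edge: int, previous_edge: int) -> str:
    """Label an advertised right edge the way the Appendix F table does."""
    if edge > BUFFER_EDGE:
        return "BEYOND BUFFER"
    if edge < previous_edge:
        return "RETRACTED WINDOW"
    return "ok"


first = right_edge(1000, 2)         # initial ACK
after_40 = right_edge(1040, 2)      # sender wrote 40 bytes
keep_field = right_edge(1045, 2)    # 5 more bytes, window field left at 2
shrink_field = right_edge(1045, 1)  # 5 more bytes, window field reduced to 1

for label, edge in [("keep", keep_field), ("shrink", shrink_field)]:
    print(label, edge, classify(edge, after_40))
```

The two choices open to the receiver correspond to the two printed rows: leaving the window field at 2 advertises an edge past the buffer, while the truncating right shift of the remaining 255 bytes yields a field of 1 and pulls the edge back below the previously advertised one.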