idnits 2.17.1 draft-ietf-tcpm-1323bis-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 546: '...1) The receiver MUST honor, as in-win...' RFC 2119 keyword, line 549: '... effect, the receiver SHOULD track the...' RFC 2119 keyword, line 557: '...ial transmission MUST honor window on ...' -- The abstract seems to indicate that this document obsoletes RFC1323, but the header doesn't have an 'Obsoletes:' line to match this. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1406 has weird spacing: '... TSval times...' == Line 1408 has weird spacing: '... TSecr times...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 11, 2012) is 4300 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '1' on line 300 ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 1072 (Obsoleted by RFC 1323, RFC 2018, RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1110 (Obsoleted by RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1185 (Obsoleted by RFC 1323) -- Obsolete informational reference (is this intentional?): RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 1981 (Obsoleted by RFC 8201) -- Obsolete informational reference (is this intentional?): RFC 2581 (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 3517 (Obsoleted by RFC 6675) Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 13 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance (TCPM) D. Borman 3 Internet-Draft Quantum Corporation 4 Intended status: Standards Track B. Braden 5 Expires: January 12, 2013 University of Southern 6 California 7 V. Jacobson 8 Packet Design 9 R. Scheffenegger, Ed. 10 NetApp, Inc. 
11 July 11, 2012 13 TCP Extensions for High Performance 14 draft-ietf-tcpm-1323bis-03 16 Abstract 18 This memo presents a set of TCP extensions to improve performance 19 over large bandwidth*delay product paths and to provide reliable 20 operation over very high-speed paths. It defines TCP options for 21 scaled windows and timestamps, which are designed to provide 22 compatible interworking with TCP's that do not implement the 23 extensions. The timestamps are used for two distinct mechanisms: 24 RTTM (Round Trip Time Measurement) and PAWS (Protection Against 25 Wrapped Sequences). Selective acknowledgments are not included in 26 this memo. 28 This memo updates and obsoletes RFC 1323. 30 Status of this Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at http://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time. It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on January 12, 2013. 47 Copyright Notice 48 Copyright (c) 2012 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents 53 (http://trustee.ietf.org/license-info) in effect on the date of 54 publication of this document. Please review these documents 55 carefully, as they describe your rights and restrictions with respect 56 to this document. 
Code Components extracted from this document must 57 include Simplified BSD License text as described in Section 4.e of 58 the Trust Legal Provisions and are provided without warranty as 59 described in the Simplified BSD License. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 64 1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4 65 1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 6 66 1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 9 67 2. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 10 68 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 10 69 2.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 10 70 2.3. Using the Window Scale Option . . . . . . . . . . . . . . 11 71 2.4. Addressing Window Retraction . . . . . . . . . . . . . . . 13 72 3. RTTM -- Round-Trip Time Measurement . . . . . . . . . . . . . 13 73 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 13 74 3.2. TCP Timestamps Option . . . . . . . . . . . . . . . . . . 14 75 3.3. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . 15 76 3.4. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 17 77 4. PAWS -- Protection Against Wrapped Sequence Numbers . . . . . 19 78 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 19 79 4.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 20 80 4.2.1. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . 21 81 4.2.2. Timestamp Clock . . . . . . . . . . . . . . . . . . . 23 82 4.2.3. Outdated Timestamps . . . . . . . . . . . . . . . . . 24 83 4.2.4. Header Prediction . . . . . . . . . . . . . . . . . . 25 84 4.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 26 85 4.3. Duplicates from Earlier Incarnations of Connection . . . . 27 86 5. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 27 87 6. Security Considerations . . . . . . . . . . . . . . . . . . . 28 88 7. 
IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 89 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 29 90 8.1. Normative References . . . . . . . . . . . . . . . . . . . 29 91 8.2. Informative References . . . . . . . . . . . . . . . . . . 29 92 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 31 93 Appendix B. Duplicates from Earlier Connection Incarnations . . . 32 94 B.1. System Crash with Loss of State . . . . . . . . . . . . . 32 95 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 33 96 Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 34 97 Appendix D. Summary of Notation . . . . . . . . . . . . . . . . . 36 98 Appendix E. Pseudo-code Summary . . . . . . . . . . . . . . . . . 37 99 Appendix F. Event Processing Summary . . . . . . . . . . . . . . 39 100 Appendix G. Timestamps Edge Cases . . . . . . . . . . . . . . . . 44 101 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45 103 1. Introduction 105 The TCP protocol [RFC0793] was designed to operate reliably over 106 almost any transmission medium regardless of transmission rate, 107 delay, corruption, duplication, or reordering of segments. 108 Production TCP implementations currently adapt to transfer rates in 109 the range of 100 bps to 10^10 bps and round-trip delays in the range 110 1 ms to 100 seconds. Work on TCP performance has shown that TCP 111 without the extensions described in this memo can work well over a 112 variety of Internet paths, ranging from 800 Mbit/sec I/O channels to 113 300 bit/sec dial-up modems. 115 Over the years, advances in networking technology have resulted in 116 ever-higher transmission speeds, and the fastest paths are well 117 beyond the domain for which TCP was originally engineered. This memo 118 defines a set of modest extensions to TCP to extend the domain of its 119 application to match this increasing network capability. 
It is an 120 update to and obsoletes [RFC1323], which in turn is based upon and 121 obsoletes [RFC1072] and [RFC1185]. 123 There is no one-line answer to the question: "How fast can TCP go?". 124 There are two separate kinds of issues, performance and reliability, 125 and each depends upon different parameters. We discuss each in turn. 127 1.1. TCP Performance 129 TCP performance depends not upon the transfer rate itself, but rather 130 upon the product of the transfer rate and the round-trip delay. This 131 "bandwidth*delay product" measures the amount of data that would 132 "fill the pipe"; it is the buffer space required at sender and 133 receiver to obtain maximum throughput on the TCP connection over the 134 path, i.e., the amount of unacknowledged data that TCP must handle in 135 order to keep the pipeline full. TCP performance problems arise when 136 the bandwidth*delay product is large. We refer to an Internet path 137 operating in this region as a "long, fat pipe", and a network 138 containing this path as an "LFN" (pronounced "elephan(t)"). 140 High-capacity packet satellite channels are LFN's. For example, a 141 DS1-speed satellite channel has a bandwidth*delay product of 10^6 142 bits or more; this corresponds to 100 outstanding TCP segments of 143 1200 bytes each. Terrestrial fiber-optical paths will also fall into 144 the LFN class; for example, a cross-country delay of 30 ms at a DS3 145 bandwidth (45Mbps) also exceeds 10^6 bits. 147 There are three fundamental performance problems with the current TCP 148 over LFN paths: 150 (1) Window Size Limit 152 The TCP header uses a 16 bit field to report the receive window 153 size to the sender. Therefore, the largest window that can be 154 used is 2^16 = 65K bytes. 156 To circumvent this problem, Section 2 of this memo defines a new 157 TCP option, "Window Scale", to allow windows larger than 2^16. 
158 This option defines an implicit scale factor, which is used to 159 multiply the window size value found in a TCP header to obtain 160 the true window size. 162 (2) Recovery from Losses 164 Packet losses in an LFN can have a catastrophic effect on 165 throughput. In the past, properly operating TCP implementations 166 would cause the data pipeline to drain with every packet loss, 167 and required a slow-start action to recover. The Fast Retransmit 168 and Fast Recovery algorithms [Jacobson90c], [RFC2581] and 169 [RFC5681] were introduced, and their combined effect was to 170 recover from one packet loss per window, without draining the 171 pipeline. However, more than one packet loss per window 172 typically resulted in a retransmission timeout and the resulting 173 pipeline drain and slow start. 175 Expanding the window size to match the capacity of an LFN 176 results in a corresponding increase in the probability of more 177 than one packet per window being dropped. This could have a 178 devastating effect upon the throughput of TCP over an LFN. In 179 addition, since the publication of RFC 1323, congestion control 180 mechanisms based upon some form of random dropping have been 181 introduced into gateways, and randomly spaced packet drops have 182 become common; this increases the probability of dropping more 183 than one packet per window. 185 To generalize the Fast Retransmit/Fast Recovery mechanism to 186 handle multiple packets dropped per window, selective 187 acknowledgments are required. Unlike the normal cumulative 188 acknowledgments of TCP, selective acknowledgments give the 189 sender a complete picture of which segments are queued at the 190 receiver and which have not yet arrived. 192 Since the publication of RFC 1323 [RFC1323], selective 193 acknowledgments (SACK) have become important in the LFN regime. 194 SACK has been published as "TCP Selective Acknowledgment 195 Options" [RFC2018]. 
Additional information about SACK can be 196 found in "An Extension to the Selective Acknowledgement (SACK) 197 option for TCP" [RFC2883], and "A Conservative Selective 198 Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP" 199 [RFC3517]. 201 (3) Round-Trip Measurement 203 TCP implements reliable data delivery by retransmitting segments 204 that are not acknowledged within some retransmission timeout 205 (RTO) interval. Accurate dynamic determination of an 206 appropriate RTO is essential to TCP performance. RTO is 207 determined by estimating the mean and variance of the measured 208 round-trip time (RTT), i.e., the time interval between sending a 209 segment and receiving an acknowledgment for it [Jacobson88a]. 211 Section 3.2 introduces a new TCP option, "Timestamps", and then 212 defines a mechanism using this option that allows nearly every 213 segment, including retransmissions, to be timed at negligible 214 computational cost. We use the mnemonic RTTM (Round Trip Time 215 Measurement) for this mechanism, to distinguish it from other 216 uses of the Timestamps option. 218 1.2. TCP Reliability 220 Now we turn from performance to reliability. High transfer rate 221 enters TCP performance through the bandwidth*delay product. However, 222 high transfer rate alone can threaten TCP reliability by violating 223 the assumptions behind the TCP mechanism for duplicate detection and 224 sequencing. 226 An especially serious kind of error may result from an accidental 227 reuse of TCP sequence numbers in data segments. Suppose that an "old 228 duplicate segment", e.g., a duplicate data segment that was delayed 229 in Internet queues, is delivered to the receiver at the wrong moment, 230 so that its sequence numbers fall somewhere within the current 231 window. There would be no checksum failure to warn of the error, and 232 the result could be an undetected corruption of the data. 
Reception 233 of an old duplicate ACK segment at the transmitter could be only 234 slightly less serious: it is likely to lock up the connection so that 235 no further progress can be made, forcing an RST on the connection. 237 TCP reliability depends upon the existence of a bound on the lifetime 238 of a segment: the "Maximum Segment Lifetime" or MSL. An MSL is 239 generally required by any reliable transport protocol, since every 240 sequence number field must be finite, and therefore any sequence 241 number may eventually be reused. In the Internet protocol suite, the 242 MSL bound is loosely enforced by an IP-layer mechanism, the "Time-to- 243 Live" (TTL) field, or "Hop Limit" field. 245 Duplication of sequence numbers might happen in either of two ways: 247 (1) Sequence number wrap-around on the current connection 249 A TCP sequence number contains 32 bits. At a high enough 250 transfer rate, the 32-bit sequence space may be "wrapped" 251 (cycled) within the time that a segment is delayed in queues. 253 (2) Earlier incarnation of the connection 255 Suppose that a connection terminates, either by a proper close 256 sequence or due to a host crash, and the same connection (i.e., 257 using the same pair of port numbers) is immediately reopened. A 258 delayed segment from the terminated connection could fall within 259 the current window for the new incarnation and be accepted as 260 valid. 262 Duplicates from earlier incarnations, Case (2), are avoided by 263 enforcing the current fixed MSL of the TCP spec, as explained in 264 Section 4.3 and Appendix B. However, case (1), avoiding the reuse of 265 sequence numbers within the same connection, requires an MSL bound 266 that depends upon the transfer rate, and at high enough rates, a new 267 mechanism is required. 
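As a rough illustration of that rate dependence, the sketch below computes how long it takes to send 2^31 bytes (half of the 32-bit sequence space) at a given line rate and compares the result with the 2-minute MSL. It is a minimal sketch, not part of the specification; the function names `wrap_time_seconds` and `wrap_is_safe` are hypothetical and chosen only for this example.

```python
# Illustrative sketch only; the names below are hypothetical and do not
# come from this memo.

MSL_SECONDS = 120  # the 2-minute MSL assumed for TCP


def wrap_time_seconds(bits_per_second: float) -> float:
    """Return the time, in seconds, needed to send 2^31 bytes at the given rate."""
    bytes_per_second = bits_per_second / 8
    return 2**31 / bytes_per_second


def wrap_is_safe(bits_per_second: float, msl: float = MSL_SECONDS) -> bool:
    """Return True when 2^31 bytes take longer than one MSL to send."""
    return wrap_time_seconds(bits_per_second) > msl


if __name__ == "__main__":
    for name, rate in [("10 Mbps", 10e6), ("100 Mbps", 100e6),
                       ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
        t = wrap_time_seconds(rate)
        print(f"{name}: about {t:.1f} s, longer than MSL: {wrap_is_safe(rate)}")
```

A result only slightly above the MSL should not be read as comfortable, since the MSL itself is only loosely enforced.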
269 More specifically, if the maximum effective bandwidth at which TCP is 270 able to transmit over a particular path is B bytes per second, then 271 the following constraint must be satisfied for error-free operation: 273 2^31 / B > MSL (secs) [1] 275 The following table shows the value for Twrap = 2^31/B in seconds, 276 for some important values of the bandwidth B:
278  +------------------+----------+-------------+--------------------+
279  | Network          | bits/sec | B bytes/sec | Twrap secs         |
280  +------------------+----------+-------------+--------------------+
281  | Dialup           | 56kbps   | 7kBps       | 3*10^5 (~3.6 days) |
282  | DS1              | 1.5Mbps  | 190kBps     | 10^4 (~3 hours)    |
283  | 10MBit Ethernet  | 10Mbps   | 1.25MBps    | 1700 (~0.5 hours)  |
284  | DS3              | 45Mbps   | 5.6MBps     | 380                |
285  | 100MBit Ethernet | 100Mbps  | 12.5MBps    | 170                |
286  | Gigabit Ethernet | 1Gbps    | 125MBps     | 17                 |
287  | 10Gig Ethernet   | 10Gbps   | 1.25GBps    | 1.7                |
288  +------------------+----------+-------------+--------------------+
290 It is clear that wrap-around of the sequence space is not a problem 291 for 56kbps packet switching or even 10Mbps Ethernets. On the other 292 hand, at DS3 and 100Mbit speeds, Twrap is comparable to the 2 minute 293 MSL assumed by the TCP specification [RFC0793]. Moving towards and 294 beyond gigabit speeds, Twrap becomes too small for reliable 295 enforcement by the Internet TTL mechanism. 297 The 16-bit window field of TCP limits the effective bandwidth B to 298 2^16/RTT, where RTT is the round-trip time in seconds [RFC1110]. If 299 the RTT is large enough, this limits B to a value that meets the 300 constraint [1] for a large MSL value. For example, consider a 301 transcontinental backbone with an RTT of 60ms (set by the laws of 302 physics). With the bandwidth*delay product limited to 64KB by the 303 TCP window size, B is then limited to 1.1MBps, no matter how high the 304 theoretical transfer rate of the path. 
This corresponds to cycling 305 the sequence number space in Twrap = 2000 secs, which is safe in 306 today's Internet. 308 It is important to understand that the culprit is not the larger 309 window but rather the high bandwidth. For example, consider a (very 310 large) FDDI LAN with a diameter of 10km. Using the speed of light, 311 we can compute the RTT across the ring as (2*10^4)/(3*10^8) = 67 312 microseconds, and the delay*bandwidth product is then 833 bytes. A 313 TCP connection across this LAN using a window of only 833 bytes will 314 run at the full 100Mbps and can wrap the sequence space in about 3 315 minutes, very close to the MSL of TCP. Thus, high speed alone can 316 cause a reliability problem with sequence number wrap-around, even 317 without extended windows. 319 Watson's Delta-T protocol [Watson81] includes network-layer 320 mechanisms for precise enforcement of an MSL. In contrast, the IP 321 mechanism for MSL enforcement is loosely defined and even more 322 loosely implemented in the Internet. Therefore, it is unwise to 323 depend upon active enforcement of MSL for TCP connections, and it is 324 unrealistic to imagine setting MSL's smaller than the current values 325 (e.g., 120 seconds specified for TCP). 327 A possible fix for the problem of cycling the sequence space would be 328 to increase the size of the TCP sequence number field. For example, 329 the sequence number field (and also the acknowledgment field) could 330 be expanded to 64 bits. This could be done either by changing the 331 TCP header or by means of an additional option. 333 Section 4 presents a different mechanism, which we call PAWS 334 (Protection Against Wrapped Sequence numbers), to extend TCP 335 reliability to transfer rates well beyond the foreseeable upper limit 336 of network bandwidths. PAWS uses the TCP Timestamps option defined 337 in Section 3.2 to protect against old duplicates from the same 338 connection. 340 1.3. 
Using TCP options 342 The extensions defined in this memo all use new TCP options. We must 343 address two possible issues concerning the use of TCP options: (1) 344 compatibility and (2) overhead. 346 We must pay careful attention to compatibility, i.e., to 347 interoperation with existing implementations. The only TCP option 348 defined previously, MSS, may appear only on a SYN segment. Every 349 implementation should (and we expect that most will) ignore unknown 350 options on SYN segments. When RFC 1323 was published, there was 351 concern that some buggy TCP implementation might be crashed by the 352 first appearance of an option on a non-SYN segment. However, bugs 353 like that can lead to DoS attacks against a TCP, so it is now 354 expected that most TCP implementations will properly handle unknown 355 options on non-SYN segments. But it is still prudent to be 356 conservative in what you send, and avoiding buggy TCP implementations 357 is not the only reason for negotiating TCP options on SYN segments. 358 Therefore, for each of the extensions defined below, TCP options will 359 be sent on non-SYN segments only after an exchange of options on the 360 SYN segments has indicated that both sides understand the extension. 361 Furthermore, an extension option will be sent in a <SYN,ACK> segment 362 only if the corresponding option was received in the initial <SYN> 363 segment. 365 A question may be raised about the bandwidth and processing overhead 366 for TCP options. Those options that occur on SYN segments are not 367 likely to cause a performance concern. Opening a TCP connection 368 requires execution of significant special-case code, and the 369 processing of options is unlikely to increase that cost 370 significantly. 372 On the other hand, a Timestamps option may appear in any data or ACK 373 segment, adding 12 bytes to the 20-byte TCP header. 
We believe that 374 the bandwidth saved by reducing unnecessary retransmissions will more 375 than pay for the extra header bandwidth. 377 There is also an issue about the processing overhead for parsing the 378 variable byte-aligned format of options, particularly with a RISC- 379 architecture CPU. Appendix A contains a recommended layout of the 380 options in TCP headers to achieve reasonable data field alignment. 381 In the spirit of Header Prediction, a TCP can quickly test for this 382 layout and if it is verified then use a fast path. Hosts that use 383 this canonical layout will effectively use the options as a set of 384 fixed-format fields appended to the TCP header. However, to retain 385 the philosophical and protocol framework of TCP options, a TCP must 386 be prepared to parse an arbitrary options field, albeit with less 387 efficiency. 389 Finally, we observe that most of the mechanisms defined in this memo 390 are important for LFN's and/or very high-speed networks. For low- 391 speed networks, it might be a performance optimization to NOT use 392 these mechanisms. A TCP vendor concerned about optimal performance 393 over low-speed paths might consider turning these extensions off for 394 low-speed paths, or allow a user or installation manager to disable 395 them. 397 2. TCP Window Scale Option 399 2.1. Introduction 401 The window scale extension expands the definition of the TCP window 402 to 32 bits and then uses a scale factor to carry this 32-bit value in 403 the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The 404 scale factor is carried in a new TCP option, Window Scale. This 405 option is sent only in a SYN segment (a segment with the SYN bit on), 406 hence the window scale is fixed in each direction when a connection 407 is opened. (Another design choice would be to specify the window 408 scale in every TCP segment. 
It would be incorrect to send a window 409 scale option only when the scale factor changed, since a TCP option 410 in an acknowledgement segment will not be delivered reliably (unless 411 the ACK happens to be piggy-backed on data in the other direction). 412 Fixing the scale when the connection is opened has the advantage of 413 lower overhead but the disadvantage that the scale factor cannot be 414 changed during the connection.) 416 The maximum receive window, and therefore the scale factor, is 417 determined by the maximum receive buffer space. In a typical modern 418 implementation, this maximum buffer space is set by default but can 419 be overridden by a user program before a TCP connection is opened. 420 This determines the scale factor, and therefore no new user interface 421 is needed for window scaling. 423 2.2. Window Scale Option 425 The three-byte Window Scale option may be sent in a SYN segment by a 426 TCP. It has two purposes: (1) indicate that the TCP is prepared to 427 do both send and receive window scaling, and (2) communicate a scale 428 factor to be applied to its receive window. Thus, a TCP that is 429 prepared to scale windows should send the option, even if its own 430 scale factor is 1. The scale factor is limited to a power of two and 431 encoded logarithmically, so it may be implemented by binary shift 432 operations. 434 TCP Window Scale Option (WSopt): 436 Kind: 3 438 Length: 3 bytes 440 +---------+---------+---------+ 441 | Kind=3 |Length=3 |shift.cnt| 442 +---------+---------+---------+ 444 This option is an offer, not a promise; both sides must send Window 445 Scale options in their SYN segments to enable window scaling in 446 either direction. If window scaling is enabled, then the TCP that 447 sent this option will right-shift its true receive-window values by 448 'shift.cnt' bits for transmission in SEG.WND. The value 'shift.cnt' 449 may be zero (offering to scale, while applying a scale factor of 1 to 450 the receive window). 
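The option layout and the right-shift described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and clamping the shifted value to 16 bits is a choice made for this sketch rather than something taken from the text.

```python
import struct

WSOPT_KIND = 3    # Kind field of the Window Scale option
WSOPT_LENGTH = 3  # total option length in bytes


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three-byte option: Kind=3, Length=3, shift.cnt."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LENGTH, shift_cnt)


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a three-byte Window Scale option."""
    if len(option) != WSOPT_LENGTH:
        raise ValueError("Window Scale option must be 3 bytes long")
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LENGTH:
        raise ValueError("not a Window Scale option")
    return shift_cnt


def window_field_for_segment(true_window: int, shift_cnt: int) -> int:
    """Right-shift the true receive window for transmission in SEG.WND.

    Clamping to 0xFFFF is an assumption of this sketch, so that the
    result always fits the 16-bit Window field.
    """
    return min(true_window >> shift_cnt, 0xFFFF)
```

For example, a true receive window of 2^20 bytes with shift.cnt = 7 would be carried as 8192 in the 16-bit Window field, and the peer recovers 2^20 by shifting left by the same count.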
452 This option may be sent in an initial <SYN> segment (i.e., a segment 453 with the SYN bit on and the ACK bit off). It may also be sent in a 454 <SYN,ACK> segment, but only if a Window Scale option was received in 455 the initial <SYN> segment. A Window Scale option in a segment 456 without a SYN bit should be ignored. 458 The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself 459 is never scaled. 461 2.3. Using the Window Scale Option 463 A model implementation of window scaling is as follows, using the 464 notation of [RFC0793]: 466 o All windows are treated as 32-bit quantities for storage in the 467 connection control block and for local calculations. This 468 includes the send-window (SND.WND) and the receive-window 469 (RCV.WND) values, as well as the congestion window. 471 o The connection state is augmented by two window shift counts, 472 Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming 473 and outgoing window fields, respectively. 475 o If a TCP receives a <SYN> segment containing a Window Scale 476 option, it sends its own Window Scale option in the <SYN,ACK> 477 segment. 479 o The Window Scale option is sent with shift.cnt = R, where R is the 480 value that the TCP would like to use for its receive window. 482 o Upon receiving a SYN segment with a Window Scale option containing 483 shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets 484 Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and 485 Rcv.Wind.Scale to zero. 487 o The window field (SEG.WND) in the header of every incoming 488 segment, with the exception of SYN segments, is left-shifted by 489 Snd.Wind.Scale bits before updating SND.WND: 491 SND.WND = SEG.WND << Snd.Wind.Scale 493 (assuming the other conditions of RFC 793 are met, and using the 494 "C" notation "<<" for left-shift). 
496 o The window field (SEG.WND) of every outgoing segment, with the 497 exception of SYN segments, is right-shifted by Rcv.Wind.Scale 498 bits: 500 SEG.WND = RCV.WND >> Rcv.Wind.Scale 502 TCP determines if a data segment is "old" or "new" by testing whether 503 its sequence number is within 2^31 bytes of the left edge of the 504 window, and if it is not, discarding the data as "old". To ensure 505 that new data is never mistakenly considered old and vice versa, the 506 left edge of the sender's window has to be at most 2^31 away from the 507 right edge of the receiver's window. The same holds for the sender's 508 right edge and the receiver's left edge. Since the right and left edges 509 of either the sender's or receiver's window differ by the window 510 size, and since the sender and receiver windows can be out of phase 511 by at most the window size, the above constraints imply that 2 * the 512 max window size must be less than 2^31, or 514 max window < 2^30 516 Since the max window is 2^S (where S is the scaling shift count) 517 times at most 2^16 - 1 (the maximum unscaled window), the maximum 518 window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count 519 must be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a 520 Window Scale option is received with a shift.cnt value exceeding 14, 521 the TCP should log the error but use 14 instead of the specified 522 value. 524 The scale factor applies only to the Window field as transmitted in 525 the TCP header; each TCP using extended windows will maintain the 526 window values locally as 32-bit numbers. For example, the 527 "congestion window" computed by Slow Start and Congestion Avoidance 528 is not affected by the scale factor, so window scaling will not 529 introduce quantization into the congestion window. 531 2.4. Addressing Window Retraction 533 When a non-zero scale factor is in use, there are instances when a 534 retracted window can be offered [Mathis08]. 
The end of the window 535 will be on a boundary based on the granularity of the scale factor 536 being used. If the sequence number is then updated by a number of 537 bytes smaller than that granularity, the TCP will have to either 538 advertise a new window that is beyond what it previously advertised 539 (and perhaps beyond the buffer), or will have to advertise a smaller 540 window, which will cause the TCP window to shrink. Implementations 541 should ensure that they handle a shrinking window, as specified in 542 section 4.2.2.16 of [RFC1122]. 544 For the receiver, this implies that: 546 1) The receiver MUST honor, as in-window, any segment that would 547 have been in-window for any ACK sent by the receiver. 549 2) When window scaling is in effect, the receiver SHOULD track the 550 actual maximum window sequence number (which is likely to be 551 greater than the window announced by the most recent ACK, if more 552 than one segment has arrived since the application consumed any 553 data in the receive buffer). 555 On the sender side: 557 3) The initial transmission MUST honor window on most recent ACK. 559 4) On first retransmission, or if the sequence number is out-of- 560 window by less than (2^Rcv.Wind.Scale) then do normal 561 retransmission(s) without regard to receiver window as long as 562 the original segment was in window when it was sent. 564 5) On subsequent retransmissions, treat such ACKs as zero window 565 probes. 567 3. RTTM -- Round-Trip Time Measurement 569 3.1. Introduction 571 Accurate and current RTT estimates are necessary to adapt to changing 572 traffic conditions and to avoid an instability known as "congestion 573 collapse" [RFC0896] in a busy network. However, accurate measurement 574 of RTT may be difficult both in theory and in implementation. 576 Many TCP implementations base their RTT measurements upon a sample of 577 one packet per window or less. 
While this yields an adequate 578 approximation to the RTT for small windows, it results in an 579 unacceptably poor RTT estimate for an LFN. If we look at RTT 580 estimation as a signal processing problem (which it is), a data 581 signal at some frequency, the packet rate, is being sampled at a 582 lower frequency, the window rate. This lower sampling frequency 583 violates Nyquist's criteria and may therefore introduce "aliasing" 584 artifacts into the estimated RTT [Hamming77]. 586 A good RTT estimator with a conservative retransmission timeout 587 calculation can tolerate aliasing when the sampling frequency is 588 "close" to the data frequency. For example, with a window of 8 589 packets, the sample rate is 1/8 the data frequency -- less than an 590 order of magnitude different. However, when the window is tens or 591 hundreds of packets, the RTT estimator may be seriously in error, 592 resulting in spurious retransmissions. 594 If there are dropped packets, the problem becomes worse. Zhang 595 [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not 596 possible to accumulate reliable RTT estimates if retransmitted 597 segments are included in the estimate. Since a full window of data 598 will have been transmitted prior to a retransmission, all of the 599 segments in that window will have to be ACKed before the next RTT 600 sample can be taken. This means at least an additional window's 601 worth of time between RTT measurements and, as the error rate 602 approaches one per window of data (e.g., 10^-6 errors per bit for the 603 Wideband satellite network), it becomes effectively impossible to 604 obtain a valid RTT measurement. 606 A solution to these problems, which actually simplifies the sender 607 substantially, is as follows: using TCP options, the sender places a 608 timestamp in each data segment, and the receiver reflects these 609 timestamps back in ACK segments. 
Then a single subtraction gives the 610 sender an accurate RTT measurement for every ACK segment (which will 611 correspond to every other data segment, with a sensible receiver). 612 We call this the RTTM (Round-Trip Time Measurement) mechanism. 614 It is vitally important to use the RTTM mechanism with big windows; 615 otherwise, the door is opened to some dangerous instabilities due to 616 aliasing. Furthermore, the option is probably useful for all TCP's, 617 since it simplifies the sender. 619 3.2. TCP Timestamps Option 621 TCP is a symmetric protocol, allowing data to be sent at any time in 622 either direction, and therefore timestamp echoing may occur in either 623 direction. For simplicity and symmetry, we specify that timestamps 624 always be sent and echoed in both directions. For efficiency, we 625 combine the timestamp and timestamp reply fields into a single TCP 626 Timestamps Option. 628 TCP Timestamps Option (TSopt): 630 Kind: 8 632 Length: 10 bytes 634 +-------+-------+---------------------+---------------------+ 635 |Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)| 636 +-------+-------+---------------------+---------------------+ 637 1 1 4 4 639 The Timestamps option carries two four-byte timestamp fields. The 640 Timestamp Value field (TSval) contains the current value of the 641 timestamp clock of the TCP sending the option. 643 The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set 644 in the TCP header; if it is valid, it echoes a timestamp value that 645 was sent by the remote TCP in the TSval field of a Timestamps option. 646 When TSecr is not valid, its value must be zero. However, a value of 647 zero does not imply that TSecr is invalid. The TSecr value will 648 generally be from the most recent Timestamps option that was received; 649 however, there are exceptions that are explained below. 651 A TCP may send the Timestamps option (TSopt) in an initial 652 <SYN> segment (i.e., a segment containing a SYN bit and no ACK bit).
Once 653 a TSopt has been sent or received in a non-<SYN> segment, it must be 654 sent in all segments. Once a TSopt has been received in a non-<SYN> 655 segment, then any successive segment that is received without the RST 656 bit and without a TSopt may be dropped without further processing, 657 and an ACK of the current SND.UNA generated. 659 In the case of crossing SYN packets where one SYN contains a TSopt 660 and the other doesn't, both sides should put a TSopt in the 661 <SYN,ACK> segment. 663 3.3. The RTTM Mechanism 665 RTTM places a Timestamps option in every segment, with a TSval that 666 is obtained from a (virtual) "timestamp clock". Values of this clock 667 must be at least approximately proportional to real time, in 668 order to measure actual RTT. 670 These TSval values are echoed in TSecr values in the reverse 671 direction. The difference between a received TSecr value and the 672 current timestamp clock value provides an RTT measurement. 674 When timestamps are used, every segment that is received will contain 675 a TSecr value; however, these values cannot all be used to update the 676 measured RTT. The following example illustrates why. It shows a 677 one-way data flow with segments arriving in sequence without loss. 678 Here A, B, C... represent data blocks occupying successive blocks of 679 sequence numbers, and ACK(A),... represent the corresponding 680 cumulative acknowledgments. The two timestamp fields of the 681 Timestamps option are shown symbolically as <TSval=x,TSecr=y>. Each 682 TSecr field contains the value most recently received in a TSval 683 field. 685 TCP A TCP B 687 <A,TSval=1,TSecr=120> ------> 689 <---- <ACK(A),TSval=127,TSecr=1> 691 <B,TSval=5,TSecr=127> ------> 693 <---- <ACK(B),TSval=131,TSecr=5> 695 . . . . . . . . . . . . . . . . . . . . . . 697 <C,TSval=65,TSecr=131> ------> 699 <---- <ACK(C),TSval=191,TSecr=65> 701 (etc) 703 The dotted line marks a pause (60 time units long) in which A had 704 nothing to send. Note that this pause inflates the RTT which B could 705 infer from receiving TSecr=131 in data segment C.
Thus, in one-way 706 data flows, RTTM in the reverse direction measures a value that is 707 inflated by gaps in sending data. However, the following rule 708 prevents a resulting inflation of the measured RTT: 710 RTTM Rule: A TSecr value received in a segment is used to update 711 the averaged RTT measurement only if 713 a) the segment acknowledges some new data, i.e., only if it 714 advances the left edge of the send window, and 716 b) the segment does not indicate any loss or reordering, i.e., 717 contains no SACK options. 719 Since TCP B is not sending data, the data segment C does not 720 acknowledge any new data when it arrives at B. Thus, the inflated 721 RTT sample is not used to update B's RTT estimate. 723 Implementors should note that with Timestamps multiple RTTMs can be 724 taken per RTT. Many RTO estimators have a weighting factor based on 725 an implicit assumption that at most one RTTM will be obtained per RTT. 726 When using multiple RTTMs per RTT to update the RTO estimator, the 727 weighting factor needs to be decreased to take into account the more 728 frequent RTTMs. For example, an implementation could choose to just 729 use one sample per RTT to update the RTO estimator, or vary the gain 730 based on the congestion window, or take an average of all the RTTM 731 measurements received over one RTT, and then use that value to update 732 the RTO estimator. This document does not prescribe any particular 733 method for modifying the RTO estimator; the important point is that 734 the implementation should do something more than just feeding 735 additional RTTM samples from one RTT into the RTO estimator. 737 3.4. Which Timestamp to Echo 739 If more than one Timestamps option is received before a reply segment 740 is sent, the TCP must choose only one of the TSvals to echo, ignoring 741 the others.
To minimize the state kept in the receiver (i.e., the 742 number of unprocessed TSvals), the receiver should be required to 743 retain at most one timestamp in the connection control block. 745 There are three situations to consider: 747 (A) Delayed ACKs. 749 Many TCP's acknowledge only every Kth segment out of a group of 750 segments arriving within a short time interval; this policy is 751 known generally as "delayed ACKs". The data-sender TCP must 752 measure the effective RTT, including the additional time due to 753 delayed ACKs, or else it will retransmit unnecessarily. Thus, 754 when delayed ACKs are in use, the receiver should reply with the 755 TSval field from the earliest unacknowledged segment. 757 (B) A hole in the sequence space (segment(s) have been lost). 759 The sender will continue sending until the window is filled, and 760 the receiver may be generating ACKs as these out-of-order 761 segments arrive (e.g., to aid "fast retransmit"). 763 The lost segment is probably a sign of congestion, and in that 764 situation the sender should be conservative about 765 retransmission. Furthermore, it is better to overestimate than 766 underestimate the RTT. An ACK for an out-of-order segment 767 should therefore contain the timestamp from the most recent 768 segment that advanced the window. 770 The same situation occurs if segments are re-ordered by the 771 network. 773 (C) A filled hole in the sequence space. 775 The segment that fills the hole represents the most recent 776 measurement of the network characteristics. On the other hand, 777 an RTT computed from an earlier segment would probably include 778 the sender's retransmit time-out, badly biasing the sender's 779 average RTT estimate. Thus, the timestamp from the latest 780 segment (which filled the hole) must be echoed. 
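As an illustration of how a single retained timestamp can serve all three situations, the following is a minimal sketch in Python. It assumes the update rule that this section goes on to specify (keep SEG.TSval only when SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent). The names EchoState, mod_leq and echoes, the fixed 100-byte segment length, and the initial TS.Recent of 0 are hypothetical choices made for the example and are not part of this document.

```python
from dataclasses import dataclass

MOD = 1 << 32  # timestamps and sequence numbers are 32-bit modular values


def mod_leq(a: int, b: int) -> bool:
    """True if a <= b in 32-bit modular arithmetic."""
    return ((b - a) % MOD) < (1 << 31)


@dataclass
class EchoState:
    """Hypothetical per-connection state for timestamp echoing."""
    ts_recent: int = 0      # TS.Recent: value echoed in TSecr (0 is assumed)
    last_ack_sent: int = 0  # Last.ACK.sent: ACK field of the last segment sent

    def on_segment(self, seg_seq: int, seg_tsval: int) -> None:
        # Keep TSval only if it is not older than TS.Recent and the segment
        # begins at or before the last ACK that was sent.
        if mod_leq(self.ts_recent, seg_tsval) and mod_leq(seg_seq, self.last_ack_sent):
            self.ts_recent = seg_tsval

    def send_ack(self, rcv_nxt: int) -> int:
        # Remember the ACK field and return the TSecr value to place in it.
        self.last_ack_sent = rcv_nxt
        return self.ts_recent


def echoes(arrivals, seg_len=100, ack_every=1):
    """Return the TSecr carried by each ACK for a list of (seq, tsval) arrivals."""
    st = EchoState()
    rcv_nxt, held, out = 0, set(), []
    for n, (seq, tsval) in enumerate(arrivals, start=1):
        st.on_segment(seq, tsval)
        held.add(seq)
        while rcv_nxt in held:  # advance the left edge over any filled hole
            rcv_nxt += seg_len
        if n % ack_every == 0:
            out.append(st.send_ack(rcv_nxt))
    return out


# Cases (B) and (C): arrival order A, C, B, E, D, every segment acknowledged.
print(echoes([(0, 1), (200, 3), (100, 2), (400, 5), (300, 4)]))
# Case (A): in-order arrival with one delayed ACK per three segments.
print(echoes([(0, 1), (100, 2), (200, 3)], ack_every=3))
```

In the first call the out-of-order segments C and E leave the retained timestamp untouched, and the hole-filling segments B and D replace it. In the second call only A, the earliest unacknowledged segment, is retained.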
782 An algorithm that covers all three cases is described in the 783 following rules for Timestamps option processing on a synchronized 784 connection: 786 (1) The connection state is augmented with two 32-bit slots: 788 TS.Recent holds a timestamp to be echoed in TSecr whenever a 789 segment is sent, and Last.ACK.sent holds the ACK field from the 790 last segment sent. Last.ACK.sent will equal RCV.NXT except when 791 ACKs have been delayed. 793 (2) If: 795 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 797 then SEG.TSval is copied to TS.Recent; otherwise, it is ignored. 799 (3) When a TSopt is sent, its TSecr field is set to the current 800 TS.Recent value. 802 The following examples illustrate these rules. Here A, B, C... 803 represent data segments occupying successive blocks of sequence 804 numbers, and ACK(A),... represent the corresponding acknowledgment 805 segments. Note that ACK(A) has the same sequence number as B. We 806 show only one direction of timestamp echoing, for clarity. 808 o Packets arrive in sequence, and some of the ACKs are delayed. 810 By Case (A), the timestamp from the oldest unacknowledged segment 811 is echoed. 813 TS.Recent 814 <A, TSval=1> -------------------> 815 1 816 <B, TSval=2> -------------------> 817 1 818 <C, TSval=3> -------------------> 819 1 820 <---- <ACK(C), TSecr=1> 821 (etc) 823 o Packets arrive out of order, and every packet is acknowledged. 825 By Case (B), the timestamp from the last segment that advanced the 826 left window edge is echoed, until the missing segment arrives; it 827 is echoed according to Case (C). The same sequence would occur if 828 segments B and D were lost and retransmitted. 830 TS.Recent 831 <A, TSval=1> -------------------> 832 1 833 <---- <ACK(A), TSecr=1> 834 1 835 <C, TSval=3> -------------------> 836 1 837 <---- <ACK(A), TSecr=1> 838 1 839 <B, TSval=2> -------------------> 840 2 841 <---- <ACK(C), TSecr=2> 842 2 843 <E, TSval=5> -------------------> 844 2 845 <---- <ACK(C), TSecr=2> 846 2 847 <D, TSval=4> -------------------> 848 4 849 <---- <ACK(E), TSecr=4> 850 (etc) 852 4. PAWS -- Protection Against Wrapped Sequence Numbers 854 4.1.
Introduction 856 Section 4.2 describes a simple mechanism to reject old duplicate 857 segments that might corrupt an open TCP connection; we call this 858 mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS 859 operates within a single TCP connection, using state that is saved in 860 the connection control block. Section 4.3 and Appendix C discuss the 861 implications of the PAWS mechanism for avoiding old duplicates from 862 previous incarnations of the same connection. 864 4.2. The PAWS Mechanism 866 PAWS uses the same TCP Timestamps option as the RTTM mechanism 867 described earlier, and assumes that every received TCP segment 868 (including data and ACK segments) contains a timestamp SEG.TSval 869 whose values are monotonically non-decreasing in time. The basic 870 idea is that a segment can be discarded as an old duplicate if it is 871 received with a timestamp SEG.TSval less than some timestamp recently 872 received on this connection. 874 In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit 875 unsigned integers in a modular 32-bit space. Thus, "less than" is 876 defined the same way it is for TCP sequence numbers, and the same 877 implementation techniques apply. If s and t are timestamp values, 879 s < t if 0 < (t - s) < 2^31, 881 computed in unsigned 32-bit arithmetic. 883 The choice of incoming timestamps to be saved for this comparison 884 must guarantee a value that is monotonically increasing. For 885 example, we might save the timestamp from the segment that last 886 advanced the left edge of the receive window, i.e., the most recent 887 in-sequence segment. Instead, we choose the value TS.Recent 888 introduced in Section 3.4 for the RTTM mechanism, since using a 889 common value for both PAWS and RTTM simplifies the implementation of 890 both. As Section 3.4 explained, TS.Recent differs from the timestamp 891 from the last in-sequence segment only in the case of delayed ACKs, 892 and therefore by less than one window. 
Either choice will therefore 893 protect against sequence number wrap-around. 895 RTTM was specified in a symmetrical manner, so that TSval timestamps 896 are carried in both data and ACK segments and are echoed in TSecr 897 fields carried in returning ACK or data segments. PAWS submits all 898 incoming segments to the same test, and therefore protects against 899 duplicate ACK segments as well as data segments. (An alternative 900 non-symmetric algorithm would protect against old duplicate ACKs: the 901 sender of data would reject incoming ACK segments whose TSecr values 902 were less than the TSecr saved from the last segment whose ACK field 903 advanced the left edge of the send window. This algorithm was deemed 904 to lack economy of mechanism and symmetry.) 906 TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to 907 initialize PAWS. PAWS protects against old duplicate non-SYN 908 segments, and duplicate SYN segments received while there is a 909 synchronized connection. Duplicate <SYN> and <SYN,ACK> segments 910 received when there is no connection will be discarded by the normal 911 3-way handshake and sequence number checks of TCP. 913 RFC 1323 recommended that RST segments NOT carry timestamps, and that 914 they be acceptable regardless of their timestamp. At that time, the 915 thinking was that old duplicate RST segments should be exceedingly 916 unlikely, and their cleanup function should take precedence over 917 timestamps. More recently, discussions about various blind attacks 918 on TCP connections have raised the suggestion that if the Timestamps 919 option is present, SEG.TSecr could be used to provide stricter 920 acceptance tests for RST packets. While this is still under discussion, to 921 enable research into this area it is now recommended that, when a 922 RST is generated in response to a packet that 923 contained a Timestamps option, the RST also contain a Timestamps 924 option.
In the RST segment, SEG.TSecr should be set to SEG.TSval 925 from the incoming packet and SEG.TSval should be set to zero. If a 926 RST is being generated because of a user abort, and Snd.TS.OK is set, 927 then a Timestamps option should be included in the RST. When a RST 928 packet is received, it must not be subjected to PAWS checks, and 929 information from the Timestamps option must not be used to update 930 connection state information. SEG.TSecr may be used to provide 931 stricter RST acceptance checks. 933 4.2.1. Basic PAWS Algorithm 935 The PAWS algorithm requires the following processing to be performed 936 on all incoming segments for a synchronized connection: 938 R1) If there is a Timestamps option in the arriving segment, 939 SEG.TSval < TS.Recent, TS.Recent is valid (see Section 4.2.3) 940 and the RST bit is not set, then treat the arriving segment as 941 not acceptable: 943 Send an acknowledgement in reply as specified in RFC 793 page 944 69 and drop the segment. 946 Note: it is necessary to send an ACK segment in order to 947 retain TCP's mechanisms for detecting and recovering from 948 half-open connections. For example, see Figure 10 of RFC 949 793. 951 R2) If the segment is outside the window, reject it (normal TCP 952 processing). 954 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see 955 Section 3.4), then record its timestamp in TS.Recent. 957 R4) If an arriving segment is in-sequence (i.e., at the left window 958 edge), then accept it normally. 960 R5) Otherwise, treat the segment as a normal in-window, out-of- 961 sequence TCP segment (e.g., queue it for later delivery to the 962 user). 964 Steps R2, R4, and R5 are the normal TCP processing steps specified by 965 RFC 793. 967 It is important to note that the timestamp is checked only when a 968 segment first arrives at the receiver, regardless of whether it is 969 in-sequence or it must be queued for later delivery. 971 Consider the following example.
973 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been 974 sent, where the letter indicates the sequence number and the digit 975 represents the timestamp. Suppose also that segment B.1 has been 976 lost. The timestamp in TS.Recent is 1 (from A.1), so C.1, ..., 977 Z.1 are considered acceptable and are queued. When B is 978 retransmitted as segment B.2 (using the latest timestamp), it 979 fills the hole and causes all the segments through Z to be 980 acknowledged and passed to the user. The timestamps of the queued 981 segments are *not* inspected again at this time, since they have 982 already been accepted. When B.2 is accepted, TS.Recent is set to 983 2. 985 This rule allows reasonable performance under loss. A full window of 986 data is in transit at all times, and after a loss a full window less 987 one packet will show up out-of-sequence to be queued at the receiver 988 (e.g., up to ~2^30 bytes of data); the timestamp option must not 989 result in discarding this data. 991 In certain unlikely circumstances, the algorithm of rules R1-R5 could 992 lead to discarding some segments unnecessarily, as shown in the 993 following example: 995 Suppose again that segments A.1, B.1, C.1, ..., Z.1 have been 996 sent in sequence and that segment B.1 has been lost. Furthermore, 997 suppose delivery of some of C.1, ... Z.1 is delayed until AFTER 998 the retransmission B.2 arrives at the receiver. These delayed 999 segments will be discarded unnecessarily when they do arrive, 1000 since their timestamps are now out of date. 1002 This case is very unlikely to occur. If the retransmission was 1003 triggered by a timeout, some of the segments C.1, ... Z.1 must have 1004 been delayed longer than the RTO. This is presumably an 1005 unlikely event, or there would be many spurious timeouts and 1006 retransmissions.
If B's retransmission was triggered by the "fast 1007 retransmit" algorithm, i.e., by duplicate ACKs, then the queued 1008 segments that caused these ACKs must have been received already. 1010 Even if a segment were delayed past the RTO, the Fast Retransmit 1011 mechanism [Jacobson90c] will cause the delayed packets to be 1012 retransmitted at the same time as B.2, avoiding an extra RTT and 1013 therefore causing a very small performance penalty. 1015 We know of no case with a significant probability of occurrence in 1016 which timestamps will cause performance degradation by unnecessarily 1017 discarding segments. 1019 4.2.2. Timestamp Clock 1021 It is important to understand that the PAWS algorithm does not 1022 require clock synchronization between sender and receiver. The 1023 sender's timestamp clock is used to stamp the segments, and the 1024 sender uses the echoed timestamp to measure RTTs. However, the 1025 receiver treats the timestamp as simply a monotonically increasing 1026 serial number, without any necessary connection to its clock. From 1027 the receiver's viewpoint, the timestamp is acting as a logical 1028 extension of the high-order bits of the sequence number. 1030 The receiver algorithm does place some requirements on the frequency 1031 of the timestamp clock. 1033 (a) The timestamp clock must not be "too slow". 1035 It must tick at least once for each 2^31 bytes sent. In fact, 1036 in order to be useful to the sender for round trip timing, the 1037 clock should tick at least once per window's worth of data, and 1038 even with the window extension defined in Section 2.2, 2^31 1039 bytes must be at least two windows. 1041 To make this more quantitative, any clock faster than 1 tick/sec 1042 will reject old duplicate segments for link speeds of ~8 Gbps. 1043 A 1ms timestamp clock will work at link speeds up to 8 Tbps 1044 (8*10^12) bps! 1046 (b) The timestamp clock must not be "too fast". 1048 Its recycling time must be greater than MSL seconds. 
Since the 1049 clock (timestamp) is 32 bits and the worst-case MSL is 255 1050 seconds, the maximum acceptable clock frequency is one tick 1051 every 59 ns. 1053 However, it is desirable to establish a much longer recycle 1054 period, in order to handle outdated timestamps on idle 1055 connections (see Section 4.2.3), and to relax the MSL 1056 requirement for preventing sequence number wrap-around. With a 1057 1 ms timestamp clock, the 32-bit timestamp will wrap its sign 1058 bit in 24.8 days. Thus, it will reject old duplicates on the 1059 same connection if MSL is 24.8 days or less. This appears to be 1060 a very safe figure; an MSL of 24.8 days or longer can probably 1061 be assumed in the internet without requiring precise MSL 1062 enforcement. 1064 Based upon these considerations, we choose a timestamp clock 1065 frequency in the range 1 ms to 1 sec per tick. This range also 1066 matches the requirements of the RTTM mechanism, which does not need 1067 much more resolution than the granularity of the retransmit timer, 1068 e.g., tens or hundreds of milliseconds. 1070 The PAWS mechanism also puts a strong monotonicity requirement on the 1071 sender's timestamp clock. The method of implementation of the 1072 timestamp clock to meet this requirement depends upon the system 1073 hardware and software. 1075 o Some hosts have a hardware clock that is guaranteed to be 1076 monotonic between hardware resets. 1078 o A clock interrupt may be used to simply increment a binary integer 1079 by 1 periodically. 1081 o The timestamp clock may be derived from a system clock that is 1082 subject to being abruptly changed, by adding a variable offset 1083 value. This offset is initialized to zero. When a new timestamp 1084 clock value is needed, the offset can be adjusted as necessary to 1085 make the new value equal to or larger than the previous value 1086 (which was saved for this purpose). 1088 4.2.3. 
Outdated Timestamps 1090 If a connection remains idle long enough for the timestamp clock of 1091 the other TCP to wrap its sign bit, then the value saved in TS.Recent 1092 will become too old; as a result, the PAWS mechanism will cause all 1093 subsequent segments to be rejected, freezing the connection (until 1094 the timestamp clock wraps its sign bit again). 1096 With the chosen range of timestamp clock frequencies (1 sec to 1 ms), 1097 the time to wrap the sign bit will be between 24.8 days and 24800 1098 days. A TCP connection that is idle for more than 24 days and then 1099 comes to life is exceedingly unusual. However, it is undesirable in 1100 principle to place any limitation on TCP connection lifetimes. 1102 We therefore require that an implementation of PAWS include a 1103 mechanism to "invalidate" the TS.Recent value when a connection is 1104 idle for more than 24 days. (An alternative solution to the problem 1105 of outdated timestamps would be to send keep-alive segments at a very 1106 low rate, but still more often than the wrap-around time for 1107 timestamps, e.g., once a day. This would impose negligible overhead. 1108 However, the TCP specification has never included keep-alives, so the 1109 solution based upon invalidation was chosen.) 1111 Note that a TCP does not know the frequency, and therefore, the 1112 wraparound time, of the other TCP, so it must assume the worst. The 1113 validity of TS.Recent needs to be checked only if the basic PAWS 1114 timestamp check fails, i.e., only if SEG.TSval < TS.Recent. If 1115 TS.Recent is found to be invalid, then the segment is accepted, 1116 regardless of the failure of the timestamp check, and rule R3 updates 1117 TS.Recent with the TSval from the new segment. 1119 To detect how long the connection has been idle, the TCP may update a 1120 clock or timestamp value associated with the connection whenever 1121 TS.Recent is updated, for example. The details will be 1122 implementation-dependent. 
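The timestamp check of rule R1, together with the invalidation of an outdated TS.Recent described in this section, can be sketched as follows. This is a minimal illustration in Python and not a prescribed implementation: the class PawsState, the field ts_recent_age, the constant PAWS_IDLE_LIMIT, and the convention of passing the local time in seconds as an argument are all assumptions made for the example. The comparison ts_before follows the modular "less than" definition given in Section 4.2.

```python
MOD = 1 << 32                        # timestamps are 32-bit modular values
HALF = 1 << 31
PAWS_IDLE_LIMIT = 24 * 24 * 60 * 60  # 24 days, in seconds (assumed local unit)


def ts_before(s: int, t: int) -> bool:
    """s < t in the modular timestamp space: 0 < (t - s) < 2^31."""
    return 0 < ((t - s) % MOD) < HALF


class PawsState:
    """Hypothetical per-connection PAWS state."""

    def __init__(self, ts_recent: int, now: float) -> None:
        self.ts_recent = ts_recent
        self.ts_recent_age = now  # local time at which TS.Recent was last updated

    def acceptable(self, seg_tsval: int, now: float, rst: bool = False) -> bool:
        """Rule R1: return False if the segment is to be dropped as an old duplicate."""
        if rst:
            return True  # RST segments are not subjected to PAWS checks
        if not ts_before(seg_tsval, self.ts_recent):
            return True  # timestamp check passes
        # The check failed. Only now is the validity of TS.Recent examined:
        # if the connection has been idle too long, TS.Recent is invalid and
        # the segment is accepted regardless.
        return (now - self.ts_recent_age) > PAWS_IDLE_LIMIT

    def record(self, seg_tsval: int, now: float) -> None:
        """Rule R3; the caller has already checked SEG.SEQ <= Last.ACK.sent."""
        self.ts_recent = seg_tsval
        self.ts_recent_age = now


st = PawsState(ts_recent=1000, now=0.0)
print(st.acceptable(999, now=1.0))             # old duplicate on an active connection
print(st.acceptable(999, now=25 * 86400.0))    # same TSval after 25 idle days
```

Checking the idle time only after the timestamp comparison fails keeps the common path to a single modular subtraction, as the text above suggests.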
1124 4.2.4. Header Prediction 1126 "Header prediction" [Jacobson90a] is a high-performance transport 1127 protocol implementation technique that is most important for high- 1128 speed links. This technique optimizes the code for the most common 1129 case, receiving a segment correctly and in order. Using header 1130 prediction, the receiver asks the question, "Is this segment the next 1131 in sequence?" This question can be answered in fewer machine 1132 instructions than the question, "Is this segment within the window?" 1134 Adding header prediction to our timestamp procedure leads to the 1135 following recommended sequence for processing an arriving TCP 1136 segment: 1138 H1) Check timestamp (same as step R1 above) 1140 H2) Do header prediction: if segment is next in sequence and if 1141 there are no special conditions requiring additional processing, 1142 accept the segment, record its timestamp, and skip H3. 1144 H3) Process the segment normally, as specified in RFC 793. This 1145 includes dropping segments that are outside the window and 1146 possibly sending acknowledgments, and queueing in-window, out- 1147 of-sequence segments. 1149 Another possibility would be to interchange steps H1 and H2, i.e., to 1150 perform the header prediction step H2 FIRST, and perform H1 and H3 1151 only when header prediction fails. This could be a performance 1152 improvement, since the timestamp check in step H1 is very unlikely to 1153 fail, and it requires unsigned modulo arithmetic. To perform this 1154 check on every single segment is contrary to the philosophy of header 1155 prediction. We believe that this change might produce a measurable 1156 reduction in CPU time for TCP protocol processing on high-speed 1157 networks. 1159 However, putting H2 first would create a hazard: a segment from 2^32 1160 bytes in the past might arrive at exactly the wrong time and be 1161 accepted mistakenly by the header-prediction step. 
The following 1162 reasoning has been introduced in [RFC1185] to show that the 1163 probability of this failure is negligible. 1165 If all segments are equally likely to show up as old duplicates, 1166 then the probability of an old duplicate exactly matching the left 1167 window edge is the maximum segment size (MSS) divided by the size 1168 of the sequence space. This ratio must be less than 2^-16, since 1169 MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20 1170 for a FDDI link. However, the older a segment is, the less likely 1171 it is to be retained in the Internet, and under any reasonable 1172 model of segment lifetime the probability of an old duplicate 1173 exactly at the left window edge must be much smaller than 2^-16. 1175 The 16 bit TCP checksum also allows a basic unreliability of one 1176 part in 2^16. A protocol mechanism whose reliability exceeds the 1177 reliability of the TCP checksum should be considered "good 1178 enough", i.e., it won't contribute significantly to the overall 1179 error rate. We therefore believe we can ignore the problem of an 1180 old duplicate being accepted by doing header prediction before 1181 checking the timestamp. 1183 However, this probabilistic argument is not universally accepted, and 1184 the consensus at present is that the performance gain does not 1185 justify the hazard in the general case. It is therefore recommended 1186 that H2 follow H1. 1188 4.2.5. IP Fragmentation 1190 At high data rates, the protection against old packets provided by 1191 PAWS can be circumvented by errors in IP fragment reassembly (see 1193 [RFC4963]). The only way to protect against incorrect IP fragment 1194 reassembly is to not allow the packets to be fragmented. This is 1195 done by setting the Don't Fragment (DF) bit in the IP header. 
1196 Setting the DF bit implies the use of Path MTU Discovery as described 1197 in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation 1198 that implements PAWS must also implement Path MTU Discovery. 1200 4.3. Duplicates from Earlier Incarnations of Connection 1202 The PAWS mechanism protects against errors due to sequence number 1203 wrap-around on high-speed connections. Segments from an earlier 1204 incarnation of the same connection are also a potential cause of old 1205 duplicate errors. In both cases, the TCP mechanisms to prevent such 1206 errors depend upon the enforcement of a maximum segment lifetime 1207 (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a 1208 detailed discussion). Unlike the case of sequence space wrap-around, 1209 the MSL required to prevent old duplicate errors from earlier 1210 incarnations does not depend upon the transfer rate. If the IP layer 1211 enforces the recommended 2 minute MSL of TCP, and if the TCP rules 1212 are followed, TCP connections will be safe from earlier incarnations, 1213 no matter how high the network speed. Thus, the PAWS mechanism is 1214 not required for this case. 1216 We may still ask whether the PAWS mechanism can provide additional 1217 security against old duplicates from earlier connections, allowing us 1218 to relax the enforcement of MSL by the IP layer. Appendix B explores 1219 this question, showing that further assumptions and/or mechanisms are 1220 required, beyond those of PAWS. This is not part of the current 1221 extension. 1223 5. Conclusions and Acknowledgements 1225 This memo presented a set of extensions to TCP to provide efficient 1226 operation over large-bandwidth*delay-product paths and reliable 1227 operation over very high-speed paths. These extensions are designed 1228 to provide compatible interworking with TCP's that do not implement 1229 the extensions. 1231 These mechanisms are implemented using new TCP options for scaled 1232 windows and timestamps. 
The timestamps are used for two distinct 1233 mechanisms: RTTM (Round-Trip Time Measurement) and PAWS (Protection 1234 Against Wrapped Sequences). 1236 The Window Scale option was originally suggested by Mike St. Johns of 1237 USAF/DCA. The present form of the option was suggested by Mike 1238 Karels of UC Berkeley in response to a more cumbersome scheme defined 1239 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1240 description in RFC 1185. 1242 Finally, much of this work originated as the result of discussions 1243 within the End-to-End Task Force on the theoretical limitations of 1244 transport protocols in general and TCP in particular. Task force 1245 members and others on the end2end-interest list have made valuable 1246 contributions by pointing out flaws in the algorithms and the 1247 documentation. Continued discussion and development since the 1248 publication of RFC 1323 originally occurred in the IETF TCP Large 1249 Windows Working Group, later on in the End-to-End Task Force, and 1250 most recently in the IETF TCP Maintenance Working Group. The authors 1251 are grateful for all these contributions. 1253 6. Security Considerations 1255 The TCP sequence space is a fixed size, and as the window becomes 1256 larger it becomes easier for an attacker to generate forged packets 1257 that can fall within the TCP window, and be accepted as valid 1258 packets. While use of Timestamps and PAWS can help to mitigate this, 1259 when using PAWS, if an attacker is able to forge a packet that is 1260 acceptable to the TCP connection, a timestamp that is in the future 1261 would cause valid packets to be dropped due to PAWS checks. Hence, 1262 implementors should take care to not open the TCP window drastically 1263 beyond the requirements of the connection. 1265 Middle boxes and options: If a middle box removes TCP options from 1266 the SYN, such as TSopt, a high speed connection that needs PAWS would 1267 not have that protection.
In this situation, an implementor could 1268 provide a mechanism for the application to determine whether or not 1269 PAWS is in use on the connection, and chose to terminate the 1270 connection if that protection doesn't exist. 1272 Mechanisms to protect the TCP header from modification should also 1273 protect the TCP options. 1275 Expanding the TCP window beyond 64K for IPv6 allows Jumbograms 1276 [RFC2675] to be used when the local network supports packets larger 1277 than 64K. When larger TCP packets are used, the TCP checksum becomes 1278 weaker. 1280 7. IANA Considerations 1282 This document has no actions for IANA. 1284 8. References 1285 8.1. Normative References 1287 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1288 RFC 793, September 1981. 1290 [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, 1291 November 1990. 1293 8.2. Informative References 1295 [Garlick77] 1296 Garlick, L., Rom, R., and J. Postel, "Issues in Reliable 1297 Host-to-Host Protocols", Proc. Second Berkeley Workshop on 1298 Distributed Data Management and Computer Networks, 1299 May 1977, . 1301 [Hamming77] 1302 Hamming, R., "Digital Filters", Prentice Hall, Englewood 1303 Cliffs, N.J. ISBN 0-13-212571-4, 1977. 1305 [Jacobson88a] 1306 Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1307 '88, Stanford, CA., August 1988, 1308 . 1310 [Jacobson90a] 1311 Jacobson, V., "4BSD Header Prediction", ACM Computer 1312 Communication Review, April 1990. 1314 [Jacobson90c] 1315 Jacobson, V., "Modified TCP congestion avoidance 1316 algorithm", Message to the end2end-interest mailing list, 1317 April 1990, 1318 . 1320 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1321 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and 1322 Comm., Scottsdale, Arizona, March 1986, 1323 . 1325 [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in 1326 Reliable Transport Protocols", Proc. SIGCOMM '87, 1327 August 1987. 
1329 [Martin03] 1330 Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg 1331 mailing list, September 2003, . 1334 [Mathis08] 1335 Mathis, M., "[tcpm] Example of 1323 window retraction 1336 problem", Message to the tcpm mailing list, March 2008, 1337 . 1340 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 1341 RFC 896, January 1984. 1343 [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay 1344 paths", RFC 1072, October 1988. 1346 [RFC1110] McKenzie, A., "Problem with the TCP big window option", 1347 RFC 1110, August 1989. 1349 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1350 Communication Layers", STD 3, RFC 1122, October 1989. 1352 [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for 1353 High-Speed Paths", RFC 1185, October 1990. 1355 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 1356 for High Performance", RFC 1323, May 1992. 1358 [RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery 1359 for IP version 6", RFC 1981, August 1996. 1361 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1362 Selective Acknowledgment Options", RFC 2018, October 1996. 1364 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1365 Control", RFC 2581, April 1999. 1367 [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", 1368 RFC 2675, August 1999. 1370 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 1371 Extension to the Selective Acknowledgement (SACK) Option 1372 for TCP", RFC 2883, July 2000. 1374 [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A 1375 Conservative Selective Acknowledgment (SACK)-based Loss 1376 Recovery Algorithm for TCP", RFC 3517, April 2003. 1378 [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU 1379 Discovery", RFC 4821, March 2007. 1381 [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly 1382 Errors at High Data Rates", RFC 4963, July 2007. 
1384 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1385 Control", RFC 5681, September 2009. 1387 [Watson81] 1388 Watson, R., "Timer-based Mechanisms in Reliable Transport 1389 Protocol Connection Management", Computer Networks, Vol. 1390 5, 1981. 1392 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM 1393 '86, Stowe, VT, August 1986. 1395 Appendix A. Implementation Suggestions 1397 TCP Option Layout 1399 The following layouts are recommended for sending options on non- 1400 SYN segments, to achieve maximum feasible alignment of 32-bit and 1401 64-bit machines. 1403 +--------+--------+--------+--------+ 1404 | NOP | NOP | TSopt | 10 | 1405 +--------+--------+--------+--------+ 1406 | TSval timestamp | 1407 +--------+--------+--------+--------+ 1408 | TSecr timestamp | 1409 +--------+--------+--------+--------+ 1411 Interaction with the TCP Urgent Pointer 1413 The TCP Urgent pointer, like the TCP window, is a 16 bit value. 1414 Some of the original discussion for the TCP Window Scale option 1415 included proposals to increase the Urgent pointer to 32 bits. As 1416 it turns out, this is unnecessary. There are two observations 1417 that should be made: 1419 (1) With IP Version 4, the largest amount of TCP data that can be 1420 sent in a single packet is 65495 bytes (64K - 1 -- size of 1421 fixed IP and TCP headers). 1423 (2) Updates to the urgent pointer while the user is in "urgent 1424 mode" are invisible to the user. 1426 This means that if the Urgent Pointer points beyond the end of the 1427 TCP data in the current packet, then the user will remain in 1428 urgent mode until the next TCP packet arrives. That packet will 1429 update the urgent pointer to a new offset, and the user will never 1430 have left urgent mode. 1432 Thus, to properly implement the Urgent Pointer, the sending TCP 1433 only has to check for overflow of the 16 bit Urgent Pointer field 1434 before filling it in. 
If it does overflow, then a value of 65535
1435 should be inserted into the Urgent Pointer.

1437 The same technique applies to IP Version 6, except in the case of
1438 IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
1439 requires additional steps for dealing with the Urgent Pointer;
1440 these are described in section 5.2 of [RFC2675].

1442 Appendix B. Duplicates from Earlier Connection Incarnations

1444 There are two cases to be considered: (1) a system crashing (and
1445 losing connection state) and restarting, and (2) the same connection
1446 being closed and reopened without a loss of host state.  These will
1447 be described in the following two sections.

1449 B.1. System Crash with Loss of State

1451 TCP's quiet time of one MSL upon system startup handles the loss of
1452 connection state in a system crash/restart.  For an explanation, see
1453 for example "When to Keep Quiet" in the TCP protocol specification
1454 [RFC0793].  The MSL that is required here does not depend upon the
1455 transfer speed.  The current TCP MSL of 2 minutes seems acceptable as
1456 an operational compromise, as many host systems take this long to
1457 boot after a crash.

1459 However, the timestamp option may be used to ease the MSL
1460 requirements (or to provide additional security against data
1461 corruption).  If timestamps are being used and if the timestamp clock
1462 can be guaranteed to be monotonic over a system crash/restart, i.e.,
1463 if the first value of the sender's timestamp clock after a crash/
1464 restart can be guaranteed to be greater than the last value before
1465 the restart, then a quiet time will be unnecessary.

1467 To dispense totally with the quiet time would require that the host
1468 clock be synchronized to a time source that is stable over the crash/
1469 restart period, with an accuracy of one timestamp clock tick or
1470 better.  We can back off from this strict requirement to take
1471 advantage of approximate clock synchronization.  Suppose that the
1472 clock is always re-synchronized to within N timestamp clock ticks and
1473 that booting (extended with a quiet time, if necessary) takes more
1474 than N ticks.  This will guarantee monotonicity of the timestamps,
1475 which can then be used to reject old duplicates even without an
1476 enforced MSL.

1478 B.2. Closing and Reopening a Connection

1480 When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
1481 ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
1482 Applications built upon TCP that close one connection and open a new
1483 one (e.g., an FTP data transfer connection using Stream mode) must
1484 choose a new socket pair each time.  The TIME-WAIT delay serves two
1485 different purposes:

1487 (a) Implement the full-duplex reliable close handshake of TCP.

1489     The proper time to delay the final close step is not really
1490     related to the MSL; it depends instead upon the RTO for the FIN
1491     segments and therefore upon the RTT of the path.  (It could be
1492     argued that the side that is sending a FIN knows what degree of
1493     reliability it needs, and therefore it should be able to
1494     determine the length of the TIME-WAIT delay for the FIN's
1495     recipient.  This could be accomplished with an appropriate TCP
1496     option in FIN segments.)

1498     Although there is no formal upper-bound on RTT, common network
1499     engineering practice makes an RTT greater than 1 minute very
1500     unlikely.  Thus, the 4 minute delay in TIME-WAIT state works
1501     satisfactorily to provide a reliable full-duplex TCP close.
1502     Note again that this is independent of MSL enforcement and
1503     network speed.

1505     The TIME-WAIT state could cause an indirect performance problem
1506     if an application needed to repeatedly close one connection and
1507     open another at a very high frequency, since the number of
1508     available TCP ports on a host is less than 2^16.
However, high 1509 network speeds are not the major contributor to this problem; 1510 the RTT is the limiting factor in how quickly connections can be 1511 opened and closed. Therefore, this problem will be no worse at 1512 high transfer speeds. 1514 (b) Allow old duplicate segments to expire. 1516 To replace this function of TIME-WAIT state, a mechanism would 1517 have to operate across connections. PAWS is defined strictly 1518 within a single connection; the last timestamp (TS.Recent) is 1519 kept in the connection control block, and discarded when a 1520 connection is closed. 1522 An additional mechanism could be added to the TCP, a per-host 1523 cache of the last timestamp received from any connection. This 1524 value could then be used in the PAWS mechanism to reject old 1525 duplicate segments from earlier incarnations of the connection, 1526 if the timestamp clock can be guaranteed to have ticked at least 1527 once since the old connection was open. This would require that 1528 the TIME-WAIT delay plus the RTT together must be at least one 1529 tick of the sender's timestamp clock. Such an extension is not 1530 part of the proposal of this RFC. 1532 Note that this is a variant on the mechanism proposed by 1533 Garlick, Rom, and Postel [Garlick77], which required each host 1534 to maintain connection records containing the highest sequence 1535 numbers on every connection. Using timestamps instead, it is 1536 only necessary to keep one quantity per remote host, regardless 1537 of the number of simultaneous connections to that host. 1539 Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 1541 The protocol extensions defined in RFC 1323 document differ in 1542 several important ways from those defined in RFC 1072 and RFC 1185. 1544 (a) SACK has been split off into a separate document, [RFC2018]. 1546 (b) The detailed rules for sending timestamp replies (see 1547 Section 3.4) differ in important ways. 
The earlier rules could
1548     result in an under-estimate of the RTT in certain cases (packets
1549     dropped or out of order).

1551 (c) The same value TS.Recent is now shared by the two distinct
1552     mechanisms RTTM and PAWS.  This simplification became possible
1553     because of change (b).

1555 (d) An ambiguity in RFC 1185 was resolved in favor of putting
1556     timestamps on ACK as well as data segments.  This supports the
1557     symmetry of the underlying TCP protocol.

1559 (e) The echo and echo reply options of RFC 1072 were combined into a
1560     single Timestamps option, to reflect the symmetry and to
1561     simplify processing.

1563 (f) The problem of outdated timestamps on long-idle connections,
1564     discussed in Section 4.2.2, was realized and resolved.

1566 (g) RFC 1185 recommended that header prediction take precedence over
1567     the timestamp check.  Based upon some skepticism about the
1568     probabilistic arguments given in Section 4.2.4, it was decided
1569     to recommend that the timestamp check be performed first.

1571 (h) The spec was modified so that the extended options will be sent
1572     on SYN,ACK segments only when they are received in the
1573     corresponding SYN segments.  This provides the most
1574     conservative possible conditions for interoperation with
1575     implementations without the extensions.

1577 In addition to these substantive changes, the present RFC attempts to
1578 specify the algorithms unambiguously by presenting modifications to
1579 the Event Processing rules of RFC 793; see Appendix F.

1581 There are additional changes in this document from RFC 1323.  These
1582 changes are:

1584 (a) The description of which TSecr values can be used to update the
1585     measured RTT has been clarified.  Specifically, with Timestamps,
1586     the Karn algorithm [Karn87] is disabled.  The Karn algorithm
1587     disables all RTT measurements during retransmission, since it is
1588     ambiguous whether the ACK is for the original packet or the
1589     retransmitted packet.  With Timestamps, that ambiguity is
1590     removed since the TSecr in the ACK will contain the TSval from
1591     whichever data packet made it to the destination.

1593 (b) In RFC 1323, section 3.4, step (2) of the algorithm to control
1594     which timestamp is echoed was incorrect in two regards:

1596     (1) It failed to update TS.Recent for a retransmitted segment
1597         that resulted from a lost ACK.

1599     (2) It failed if SEG.LEN = 0.

1601     In the new algorithm, the case of SEG.TSval >= TS.Recent is
1602     included for consistency with the PAWS test.

1604 (c) One correction was made to the Event Processing Summary in
1605     Appendix F.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
1606     fill in the SEG.WND value, not SND.WND.

1608 (d) A new pseudo-code summary has been added in Appendix E.

1610 (e) Appendix A has been expanded with information about the TCP MSS
1611     option and the TCP Urgent Pointer.

1613 (f) It is now recommended that Timestamps options be included in RST
1614     packets if the incoming packet contained a Timestamps option.

1616 (g) RST packets are explicitly excluded from PAWS processing.

1618 (h) Snd.TSoffset and Snd.TSclock variables have been added.
1619     Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
1620     allows the starting points for timestamps to be randomized on a
1621     per-connection basis.  Setting Snd.TSoffset to zero yields the
1622     same results as [RFC1323].

1624 (i) RTTM update processing explicitly excludes packets containing
1625     SACK options.  This addresses inflation of the RTT during
1626     episodes of packet loss in both directions.

1628 (j) In Section 3.2 the if-clause allowing sending of timestamps only
1629     when received in a SYN or SYN,ACK was removed, to allow for
1630     late timestamp negotiation.

1632 (k) Section 2.4 was added describing the unavoidable window
1633     retraction issue, and explicitly describing the mitigation steps
1634     necessary.

1636 Appendix D. Summary of Notation

1638 The following notation has been used in this document.

1640 Options

1642     WSopt:           TCP Window Scale Option
1643     TSopt:           TCP Timestamps Option

1645 Option Fields

1647     shift.cnt:       Window scale byte in WSopt
1648     TSval:           32-bit Timestamp Value field in TSopt
1649     TSecr:           32-bit Timestamp Reply field in TSopt

1651 Option Fields in Current Segment

1653     SEG.TSval:       TSval field from TSopt in current segment
1654     SEG.TSecr:       TSecr field from TSopt in current segment
1655     SEG.WSopt:       8-bit value in WSopt

1657 Clock Values
1658     my.TSclock:      System-wide source of 32-bit timestamp values
1659     my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec)
1660     Snd.TSoffset:    An offset for randomizing Snd.TSclock
1661     Snd.TSclock:     my.TSclock + Snd.TSoffset

1663 Per-Connection State Variables

1665     TS.Recent:       Latest received Timestamp
1666     Last.ACK.sent:   Last ACK field sent
1667     Snd.TS.OK:       1-bit flag
1668     Snd.WS.OK:       1-bit flag
1669     Rcv.Wind.Scale:  Receive window scale power
1670     Snd.Wind.Scale:  Send window scale power
1671     Start.Time:      Snd.TSclock value when segment being timed was
1672                      sent (used by pre-1323 code).

1674 Procedure

1676     Update_SRTT(m)   Procedure to update the smoothed RTT and RTT
1677                      variance estimates, using the rules of
1678                      [Jacobson88a], given m, a new RTT measurement

1680 Appendix E. Pseudo-code Summary

1682 Create new TCB => {
1683     Rcv.wind.scale =
1684         MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
1685     Snd.wind.scale = 0;
1686     Last.ACK.sent = 0;
1687     Snd.TS.OK = Snd.WS.OK = FALSE;
1688     Snd.TSoffset = random 32 bit value;
1689 }

1691 Send initial SYN segment => {
1692     SEG.WND = MIN( RCV.WND, 65535 );
1693     Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
1694     Include in segment: WSopt = Rcv.wind.scale;
1695 }

1697 Send SYN,ACK segment => {
1698     SEG.ACK = Last.ACK.sent = RCV.NXT;
1699     SEG.WND = MIN( RCV.WND, 65535 );
1700     if (Snd.TS.OK) then
1701         Include in segment:
1702             TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1703     if (Snd.WS.OK) then
1704         Include in segment: WSopt = Rcv.wind.scale;
1705 }

1707 Receive SYN or SYN,ACK segment => {
1708     if (Segment contains TSopt) then {
1709         TS.Recent = SEG.TSval;
1710         Snd.TS.OK = TRUE;
1711         if (is SYN,ACK segment) then
1712             Update_SRTT(
1713                 (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1714     }
1715     if (Segment contains WSopt) then {
1716         Snd.wind.scale = SEG.WSopt;
1717         Snd.WS.OK = TRUE;
1718         if (the ACK bit is not set, and Rcv.wind.scale has not been
1719             initialized by the user) then
1720             Rcv.wind.scale = Snd.wind.scale;
1721     }
1722     else
1723         Rcv.wind.scale = Snd.wind.scale = 0;
1724 }

1726 Send non-SYN segment => {
1727     SEG.ACK = Last.ACK.sent = RCV.NXT;
1728     SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
1729     if (Snd.TS.OK) then
1730         Include in segment:
1731             TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1732 }

1734 Receive non-SYN segment in (state >= ESTABLISHED) => {
1735     Window = (SEG.WND << Snd.wind.scale);
1736     /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
1737      * in rest of processing.
1738      */
1739     if (Segment contains TSopt) then {
1740         if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
1741             if (Snd.TS.OK AND (NOT RST) ) then {
1742                 /* Timestamp too old =>
1743                  * segment is unacceptable.
1744 */ 1745 Send ACK segment; 1746 Discard segment and return; 1747 } 1748 } 1749 else { 1750 if (SEG.SEQ <= Last.ACK.sent) then 1751 TS.Recent = SEG.TSval; 1753 } 1754 } 1755 if (SEG.ACK > SND.UNA) then { 1756 /* (At least part of) first segment in 1757 * retransmission queue has been ACKd 1758 */ 1759 if (Segment contains TSopt) then 1760 Update_SRTT( 1761 (Snd.TSclock - SEG.TSecr)/my.TSclock.rate); 1762 else 1763 Update_SRTT( /* for compatibility */ 1764 (Snd.TSclock - Start.Time)/my.TSclock.rate); 1765 } 1766 } 1768 Appendix F. Event Processing Summary 1770 OPEN Call 1772 ... 1774 An initial send sequence number (ISS) is selected. Send a SYN 1775 segment of the form: 1777 1779 ... 1781 SEND Call 1783 CLOSED STATE (i.e., TCB does not exist) 1785 ... 1787 LISTEN STATE 1789 If the foreign socket is specified, then change the connection 1790 from passive to active, select an ISS. Send a SYN segment 1791 containing the options: and 1792 . Set SND.UNA to ISS, SND.NXT to ISS+1. 1793 Enter SYN-SENT state. ... 1795 SYN-SENT STATE 1796 SYN-RECEIVED STATE 1798 ... 1800 ESTABLISHED STATE 1801 CLOSE-WAIT STATE 1803 Segmentize the buffer and send it with a piggybacked 1804 acknowledgment (acknowledgment value = RCV.NXT). ... 1806 If the urgent flag is set ... 1808 If the Snd.TS.OK flag is set, then include the TCP Timestamps 1809 option in each data 1810 segment. 1812 Scale the receive window for transmission in the segment 1813 header: 1815 SEG.WND = (RCV.WND >> Rcv.Wind.Scale). 1817 SEGMENT ARRIVES 1819 ... 1821 If the state is LISTEN then 1823 first check for an RST 1825 ... 1827 second check for an ACK 1829 ... 1831 third check for a SYN 1833 if the SYN bit is set, check the security. If the ... 1835 ... 1837 if the SEG.PRC is less than the TCB.PRC then continue. 1839 Check for a Window Scale option (WSopt); if one is found, 1840 save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on. 
1841 Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to 1842 zero and clear Snd.WS.OK flag. 1844 Check for a TSopt option; if one is found, save SEG.TSval in 1845 the variable TS.Recent and turn on the Snd.TS.OK bit. 1847 Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any 1848 other control or text should be queued for processing later. 1849 ISS should be selected and a SYN segment sent of the form: 1851 1853 If the Snd.WS.OK bit is on, include a WSopt option 1854 in this segment. If the Snd.TS.OK 1855 bit is on, include a TSopt 1856 in this segment. 1857 Last.ACK.sent is set to RCV.NXT. 1859 SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection 1860 state should be changed to SYN-RECEIVED. Note that any 1861 other incoming control or data (combined with SYN) will be 1862 processed in the SYN-RECEIVED state, but processing of SYN 1863 and ACK should not be repeated. If the listen was not fully 1864 specified (i.e., the foreign socket was not fully 1865 specified), then the unspecified fields should be filled in 1866 now. 1868 fourth other text or control 1870 ... 1872 If the state is SYN-SENT then 1874 first check the ACK bit 1876 ... 1878 ... 1880 fourth check the SYN bit 1882 ... 1884 If the SYN bit is on and the security/compartment and 1885 precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1, 1886 IRS is set to SEG.SEQ, and any acknowledgements on the 1887 retransmission queue which are thereby acknowledged should 1888 be removed. 1890 Check for a Window Scale option (WSopt); if it is found, 1891 save SEG.WSopt in Snd.Wind.Scale; otherwise, set both 1892 Snd.Wind.Scale and Rcv.Wind.Scale to zero. 1894 Check for a TSopt option; if one is found, save SEG.TSval in 1895 variable TS.Recent and turn on the Snd.TS.OK bit in the 1896 connection control block. If the ACK bit is set, use 1897 Snd.TSclock - SEG.TSecr as the initial RTT estimate. 
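
As an illustration only, the Window Scale and Timestamps checks described above for a SYN arriving in SYN-SENT state could be sketched in Python as follows. The names Tcb and process_syn_options are hypothetical and not part of this specification, and computing the initial RTT with 32-bit modular subtraction is an assumption made here so that a wrapped timestamp clock still yields a small positive value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MASK32 = 0xFFFFFFFF  # timestamp values are 32 bits wide


@dataclass
class Tcb:
    """Hypothetical subset of the connection control block."""
    snd_wind_scale: int = 0   # Snd.Wind.Scale
    rcv_wind_scale: int = 0   # Rcv.Wind.Scale
    snd_ts_ok: bool = False   # Snd.TS.OK
    ts_recent: int = 0        # TS.Recent


def process_syn_options(tcb: Tcb,
                        wsopt: Optional[int],
                        tsopt: Optional[Tuple[int, int]],
                        ack_set: bool,
                        snd_tsclock: int) -> Optional[int]:
    """Apply the WSopt and TSopt checks to a SYN received in SYN-SENT.

    wsopt is SEG.WSopt or None; tsopt is (SEG.TSval, SEG.TSecr) or None.
    Returns an initial RTT estimate in timestamp clock ticks, or None
    when the segment gives no basis for one.
    """
    if wsopt is not None:
        tcb.snd_wind_scale = wsopt
    else:
        # No WSopt from the peer: scaling is disabled in both directions.
        tcb.snd_wind_scale = 0
        tcb.rcv_wind_scale = 0

    if tsopt is None:
        return None
    seg_tsval, seg_tsecr = tsopt
    tcb.ts_recent = seg_tsval
    tcb.snd_ts_ok = True
    if ack_set:
        # Assumption: subtract modulo 2**32 so clock wrap is harmless.
        return (snd_tsclock - seg_tsecr) & MASK32
    return None
```

For example, a SYN,ACK without WSopt that echoes TSecr=5 when Snd.TSclock is 9 clears both scale factors and gives an initial RTT of 4 ticks.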

1899 If SND.UNA > ISS (our SYN has been ACKed), change the
1900 connection state to ESTABLISHED, form an ACK segment:

1902

1904 and send it.  If the Snd.TS.OK bit is on, include a TSopt
1905 option in this ACK
1906 segment.  Last.ACK.sent is set to RCV.NXT.

1908 Data or controls which were queued for transmission may be
1909 included.  If there are other controls or text in the
1910 segment then continue processing at the sixth step below
1911 where the URG bit is checked, otherwise return.

1913 Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

1915

1917 and send it.  If the Snd.TS.OK bit is on, include a TSopt
1918 option in this segment.
1919 If the Snd.WS.OK bit is on, include a WSopt option
1920 in this segment.  Last.ACK.sent is
1921 set to RCV.NXT.

1923 If there are other controls or text in the segment, queue
1924 them for processing after the ESTABLISHED state has been
1925 reached, return.

1927 fifth, if neither of the SYN or RST bits is set then drop the
1928 segment and return.

1930 Otherwise,

1932 First, check sequence number

1934 SYN-RECEIVED STATE
1935 ESTABLISHED STATE
1936 FIN-WAIT-1 STATE
1937 FIN-WAIT-2 STATE
1938 CLOSE-WAIT STATE
1939 CLOSING STATE
1940 LAST-ACK STATE
1941 TIME-WAIT STATE
1942 Segments are processed in sequence.  Initial tests on
1943 arrival are used to discard old duplicates, but further
1944 processing is done in SEG.SEQ order.  If a segment's
1945 contents straddle the boundary between old and new, only the
1946 new parts should be processed.

1948 Rescale the received window field:

1950     TrueWindow = SEG.WND << Snd.Wind.Scale,

1952 and use "TrueWindow" in place of SEG.WND in the following
1953 steps.

1955 Check whether the segment contains a Timestamps option and
1956 bit Snd.TS.OK is on.  If so:

1958 If SEG.TSval < TS.Recent and the RST bit is off, then
1959 test whether connection has been idle less than 24 days;
1960 if all are true, then the segment is not acceptable;
1961 follow steps below for an unacceptable segment.
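
The timestamp test above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the idle time is assumed to be available in seconds, and the comparison SEG.TSval < TS.Recent is assumed to be evaluated in 32-bit modular arithmetic so that it remains meaningful when the timestamp clock wraps.

```python
MASK32 = 0xFFFFFFFF                 # timestamp values are 32 bits wide
IDLE_LIMIT_SECS = 24 * 24 * 3600    # the 24-day idle bound, in seconds


def ts_before(a: int, b: int) -> bool:
    """True if timestamp a is older than b, modulo 2**32 (assumption)."""
    return ((a - b) & MASK32) >= 0x80000000


def paws_rejects(seg_tsval: int, ts_recent: int,
                 rst: bool, idle_secs: float) -> bool:
    """Return True if the segment is unacceptable under the PAWS test.

    The caller is assumed to have already verified that the segment
    carries a Timestamps option and that Snd.TS.OK is on.
    """
    return (not rst
            and ts_before(seg_tsval, ts_recent)
            and idle_secs < IDLE_LIMIT_SECS)
```

With TS.Recent = 10, a segment carrying TSval = 5 is rejected on an active connection, but the same segment is accepted if it has the RST bit set or if the connection has been idle for 24 days or more.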

1963 If SEG.SEQ is less than or equal to Last.ACK.sent, then
1964 save SEG.TSval in variable TS.Recent.

1966 There are four cases for the acceptability test for an
1967 incoming segment:

1969 ...

1971 If an incoming segment is not acceptable, an acknowledgment
1972 should be sent in reply (unless the RST bit is set; if so,
1973 drop the segment and return):

1975

1977 Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
1978 the Snd.TS.OK bit is on, include the Timestamps option
1979 in this ACK segment.
1980 Set Last.ACK.sent to SEG.ACK and send the ACK segment.
1981 After sending the acknowledgment, drop the unacceptable
1982 segment and return.

1984 ...

1986 fifth check the ACK field.

1988 if the ACK bit is off drop the segment and return.

1990 if the ACK bit is on

1992 ...

1994 ESTABLISHED STATE

1996 If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
1997 SEG.ACK.  Also compute a new estimate of round-trip time.
1998 If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
1999 otherwise use the elapsed time since the first segment in
2000 the retransmission queue was sent.  Any segments on the
2001 retransmission queue which are thereby entirely
2002 acknowledged...

2004 ...

2006 Seventh, process the segment text.

2008 ESTABLISHED STATE
2009 FIN-WAIT-1 STATE
2010 FIN-WAIT-2 STATE

2012 ...

2014 Send an acknowledgment of the form:

2016

2018 If the Snd.TS.OK bit is on, include the Timestamps option
2019 in this ACK segment.
2020 Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
2021 it.  This acknowledgment should be piggy-backed on a segment
2022 being transmitted if possible without incurring undue delay.

2024 ...

2026 Appendix G. Timestamps Edge Cases

2028 While the rules laid out for when to calculate RTTM produce the
2029 correct results most of the time, there are some edge cases where an
2030 incorrect RTTM can be calculated.  All of these situations involve
2031 the loss of packets.
It is felt that these scenarios are rare, and 2032 that if they should happen, they will cause a single RTTM measurement 2033 to be inflated, which mitigates its effects on RTO calculations. 2035 [Martin03] cites two similar cases when the returning ACK is lost, 2036 and before the retransmission timer fires, another returning packet 2037 arrives, which ACKs the data. In this case, the RTTM calculated will 2038 be inflated: 2040 clock 2041 tc=1 -------------------> 2043 tc=2 (lost) <---- 2044 (RTTM would have been 1) 2046 (receive window opens, window update is sent) 2047 tc=5 <---- 2048 (RTTM is calculated at 4) 2050 One thing to note about this situation is that it is somewhat bounded 2051 by RTO + RTT, limiting how far off the RTTM calculation will be. 2052 While more complex scenarios can be constructed that produce larger 2053 inflations (e.g., retransmissions are lost), those scenarios involve 2054 multiple packet losses, and the connection will have other more 2055 serious operational problems than using an inflated RTTM in the RTO 2056 calculation. 2058 Authors' Addresses 2060 David Borman 2061 Quantum Corporation 2062 Mendota Heights MN 55120 2063 USA 2065 Email: david.borman@quantum.com 2067 Bob Braden 2068 University of Southern California 2069 4676 Admiralty Way 2070 Marina del Rey CA 90292 2071 USA 2073 Email: braden@isi.edu 2074 Van Jacobson 2075 Packet Design 2076 2465 Latham Street 2077 Mountain View CA 94040 2078 USA 2080 Email: van@packetdesign.com 2082 Richard Scheffenegger (editor) 2083 NetApp, Inc. 2084 Am Euro Platz 2 2085 Vienna, 1120 2086 Austria 2088 Email: rs@netapp.com