idnits 2.17.1 draft-borman-1323bis-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 18. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 2045. ** The document seems to lack an RFC 3978 Section 5.4 Reference to BCP 78 -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure Invitation. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. 
** There are 9 instances of too long lines in the document, the longest one being 4 characters in excess of 72. ** The abstract seems to contain references ([Jacobson92d]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. -- The draft header indicates that this document obsoletes RFC1323, but the abstract doesn't seem to mention this, which it should. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == Line 259 has weird spacing: '...its/sec byt...' == Line 1317 has weird spacing: '... TSval times...' == Line 1319 has weird spacing: '... TSecr times...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (July 8, 2007) is 6130 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'Stevens97' is mentioned on line 1295, but not defined -- Looks like a reference, but probably isn't: '1' on line 286 ** Obsolete normative reference: RFC 3517 (ref. 'Blanton03') (Obsoleted by RFC 6675) -- Possible downref: Non-RFC (?) normative reference: ref.
'Garlick77' -- Possible downref: Non-RFC (?) normative reference: ref. 'Hamming77' -- Possible downref: Non-RFC (?) normative reference: ref. 'Jacobson88a' ** Obsolete normative reference: RFC 1072 (ref. 'Jacobson88b') (Obsoleted by RFC 1323, RFC 2018, RFC 6247) -- Possible downref: Non-RFC (?) normative reference: ref. 'Jacobson90a' ** Obsolete normative reference: RFC 1185 (ref. 'Jacobson90b') (Obsoleted by RFC 1323) -- Possible downref: Non-RFC (?) normative reference: ref. 'Jacobson90c' ** Obsolete normative reference: RFC 1323 (ref. 'Jacobson92d') (Obsoleted by RFC 7323) -- Possible downref: Non-RFC (?) normative reference: ref. 'Jain86' -- Possible downref: Non-RFC (?) normative reference: ref. 'Karn87' -- Duplicate reference: RFC1323, mentioned in 'Martin03', was also mentioned in 'Jacobson92d'. ** Obsolete normative reference: RFC 1323 (ref. 'Martin03') (Obsoleted by RFC 7323) ** Obsolete normative reference: RFC 1110 (ref. 'McKenzie89') (Obsoleted by RFC 6247) ** Obsolete normative reference: RFC 896 (ref. 'Nagle84') (Obsoleted by RFC 7805) ** Obsolete normative reference: RFC 793 (ref. 'Postel81') (Obsoleted by RFC 9293) -- Possible downref: Non-RFC (?) normative reference: ref. 'Postel83' -- Possible downref: Non-RFC (?) normative reference: ref. 'Watson81' -- Possible downref: Non-RFC (?) normative reference: ref. 'Zhang86' Summary: 18 errors (**), 0 flaws (~~), 6 warnings (==), 18 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Network Working Group 3 Internet-Draft D. Borman 4 Obsoletes: 1323 Wind River Systems 5 File: draft-borman-1323bis-00.txt R. Braden 6 ISI 7 V. 
Jacobson 8 Packet Design 9 July 8, 2007 11 TCP Extensions for High Performance 13 Status of This Memo 15 By submitting this Internet-Draft, each author represents that any 16 applicable patent or other IPR claims of which he or she is aware 17 have been or will be disclosed, and any of which he or she becomes 18 aware will be disclosed, in accordance with Section 6 of BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet- Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/1id-abstracts.html 33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html 36 This Internet-Draftw will expire on January 8, 2008. 38 Copyright 40 Copyright (C) The IETF Trust (2007). 42 Abstract 44 This memo presents a set of TCP extensions to improve performance 45 over large bandwidth*delay product paths and to provide reliable 46 operation over very high-speed paths. It defines new TCP options for 47 scaled windows and timestamps, which are designed to provide 48 compatible interworking with TCP's that do not implement the 49 extensions. The timestamps are used for two distinct mechanisms: 50 RTTM (Round Trip Time Measurement) and PAWS (Protect Against Wrapped 51 Sequences). Selective acknowledgments are not included in this memo. 53 This memo updates and obsoletes RFC-1323 [Jacobson92d]. 55 TABLE OF CONTENTS 57 1. Introduction 2 58 2. TCP Window Scale Option 8 59 3. RTTM -- Round-Trip Time Measurement 11 60 4. PAWS -- Protect Against Wrapped Sequence Numbers 17 61 5. 
Conclusions and Acknowledgments 24 62 6. References 25 63 APPENDIX A: Implementation Suggestions 27 64 APPENDIX B: Duplicates from Earlier Connection Incarnations 27 65 APPENDIX C: Changes from RFC-1072, RFC-1185, RFC-1323 30 66 APPENDIX D: Summary of Notation 32 67 APPENDIX E: Pseudo-code Summary 33 68 APPENDIX F: Event Processing 35 69 APPENDIX G: Timestamps Edge Cases 35 70 Security Considerations 41 71 Authors' Addresses 41 73 1. INTRODUCTION 75 The TCP protocol [Postel81] was designed to operate reliably over 76 almost any transmission medium regardless of transmission rate, 77 delay, corruption, duplication, or reordering of segments. 78 Production TCP implementations currently adapt to transfer rates in 79 the range of 100 bps to 10**10 bps and round-trip delays in the range 80 1 ms to 100 seconds. Work on TCP performance has shown that TCP can 81 work well over a variety of Internet paths, ranging from 800 Mbit/sec 82 I/O channels to 300 bit/sec dial-up modems [Jacobson88a]. 84 Over the years, advances in networking technology have resulted in 85 ever-higher transmission speeds, and the fastest paths are well 86 beyond the domain for which TCP was originally engineered. This memo 87 defines a set of modest extensions to TCP to extend the domain of its 88 application to match this increasing network capability. It is an 89 update to and obsoletes RFC-1323 [Jacobson92d], which in turn is 90 based upon and obsoletes RFC-1072 [Jacobson88b] and RFC-1185 91 [Jacobson90b]. 93 There is no one-line answer to the question: "How fast can TCP go?". 94 There are two separate kinds of issues, performance and reliability, 95 and each depends upon different parameters. We discuss each in turn. 97 1.1 TCP Performance 99 TCP performance depends not upon the transfer rate itself, but 100 rather upon the product of the transfer rate and the round-trip 101 delay.
This "bandwidth*delay product" measures the amount of data 102 that would "fill the pipe"; it is the buffer space required at 103 sender and receiver to obtain maximum throughput on the TCP 104 connection over the path, i.e., the amount of unacknowledged data 105 that TCP must handle in order to keep the pipeline full. TCP 106 performance problems arise when the bandwidth*delay product is 107 large. We refer to an Internet path operating in this region as a 108 "long, fat pipe", and a network containing this path as an "LFN" 109 (pronounced "elephan(t)"). 111 High-capacity packet satellite channels (e.g., DARPA's Wideband 112 Net) are LFN's. For example, a DS1-speed satellite channel has a 113 bandwidth*delay product of 10**6 bits or more; this corresponds to 114 100 outstanding TCP segments of 1200 bytes each. Terrestrial 115 fiber-optical paths will also fall into the LFN class; for 116 example, a cross-country delay of 30 ms at a DS3 bandwidth 117 (45Mbps) also exceeds 10**6 bits. 119 There are three fundamental performance problems with the current 120 TCP over LFN paths: 122 (1) Window Size Limit 124 The TCP header uses a 16 bit field to report the receive 125 window size to the sender. Therefore, the largest window 126 that can be used is 2**16 = 65K bytes. 128 To circumvent this problem, Section 2 of this memo defines a 129 new TCP option, "Window Scale", to allow windows larger than 130 2**16. This option defines an implicit scale factor, which 131 is used to multiply the window size value found in a TCP 132 header to obtain the true window size. 134 (2) Recovery from Losses 136 Packet losses in an LFN can have a catastrophic effect on 137 throughput. In the past, properly-operating TCP 138 implementations would cause the data pipeline to drain with 139 every packet loss, and require a slow-start action to 140 recover. 
The Fast Retransmit and Fast Recovery algorithms 141 [Jacobson90c] [Stevens97] were introduced, and their combined 142 effect was to recover from one packet loss per window, 143 without draining the pipeline. However, more than one packet 144 loss per window typically resulted in a retransmission 145 timeout and the resulting pipeline drain and slow start. 147 Expanding the window size to match the capacity of an LFN 148 results in a corresponding increase of the probability of 149 more than one packet per window being dropped. This could 150 have a devastating effect upon the throughput of TCP over an 151 LFN. In addition, if a congestion control mechanism based 152 upon some form of random dropping were introduced into 153 gateways, randomly spaced packet drops would become common, 154 possibly increasing the probability of dropping more than one 155 packet per window. 157 To generalize the Fast Retransmit/Fast Recovery mechanism to 158 handle multiple packets dropped per window, selective 159 acknowledgments are required. Unlike the normal cumulative 160 acknowledgments of TCP, selective acknowledgments give the 161 sender a complete picture of which segments are queued at the 162 receiver and which have not yet arrived. 164 Since the publication of RFC-1323, selective acknowledgments 165 have become important in the LFN regime. RFC-1072 defined a 166 new TCP "SACK" option to send a selective acknowledgment, but 167 at the time that RFC-1323 was published, important technical 168 issues still had to be worked out concerning both the format 169 and semantics of the SACK option, so it was split off from 170 RFC-1323. SACK has now been published as a separate 171 document, RFC-2018 [Mathis96]. Additional information about 172 SACK can be found in RFC-2883, "An Extension to the Selective 173 Acknowledgement (SACK) option for TCP" [Floyd00] and 174 RFC-3517, "A Conservative Selective Acknowledgment 175 (SACK)-based Loss Recovery Algorithm for TCP" [Blanton03].
177 (3) Round-Trip Measurement 179 TCP implements reliable data delivery by retransmitting 180 segments that are not acknowledged within some retransmission 181 timeout (RTO) interval. Accurate dynamic determination of an 182 appropriate RTO is essential to TCP performance. RTO is 183 determined by estimating the mean and variance of the 184 measured round-trip time (RTT), i.e., the time interval 185 between sending a segment and receiving an acknowledgment for 186 it [Jacobson88a]. 188 Section 3 introduces a new TCP option, "Timestamps", and then 189 defines a mechanism using this option that allows nearly 190 every segment, including retransmissions, to be timed at 191 negligible computational cost. We use the mnemonic RTTM 192 (Round Trip Time Measurement) for this mechanism, to 193 distinguish it from other uses of the Timestamps option. 195 1.2 TCP Reliability 197 Now we turn from performance to reliability. High transfer rate 198 enters TCP performance through the bandwidth*delay product. 199 However, high transfer rate alone can threaten TCP reliability by 200 violating the assumptions behind the TCP mechanism for duplicate 201 detection and sequencing. 203 An especially serious kind of error may result from an accidental 204 reuse of TCP sequence numbers in data segments. Suppose that an 205 "old duplicate segment", e.g., a duplicate data segment that was 206 delayed in Internet queues, is delivered to the receiver at the 207 wrong moment, so that its sequence numbers fall somewhere within 208 the current window. There would be no checksum failure to warn of 209 the error, and the result could be an undetected corruption of the 210 data. Reception of an old duplicate ACK segment at the 211 transmitter could be only slightly less serious: it is likely to 212 lock up the connection so that no further progress can be made, 213 forcing an RST on the connection.
215 TCP reliability depends upon the existence of a bound on the 216 lifetime of a segment: the "Maximum Segment Lifetime" or MSL. An 217 MSL is generally required by any reliable transport protocol, 218 since every sequence number field must be finite, and therefore 219 any sequence number may eventually be reused. In the Internet 220 protocol suite, the MSL bound is enforced by an IP-layer 221 mechanism, the "Time-to-Live" or TTL field. 223 Duplication of sequence numbers might happen in either of two 224 ways: 226 (1) Sequence number wrap-around on the current connection 228 A TCP sequence number contains 32 bits. At a high enough 229 transfer rate, the 32-bit sequence space may be "wrapped" 230 (cycled) within the time that a segment is delayed in queues. 232 (2) Earlier incarnation of the connection 234 Suppose that a connection terminates, either by a proper 235 close sequence or due to a host crash, and the same 236 connection (i.e., using the same pair of sockets) is 237 immediately reopened. A delayed segment from the terminated 238 connection could fall within the current window for the new 239 incarnation and be accepted as valid. 241 Duplicates from earlier incarnations, Case (2), are avoided by 242 enforcing the current fixed MSL of the TCP spec, as explained in 243 Section 4.3 and Appendix B. However, Case (1), avoiding the 244 reuse of sequence numbers within the same connection, requires an 245 MSL bound that depends upon the transfer rate, and at high enough 246 rates, a new mechanism is required.
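As a rough illustration of how the wrap-around risk in Case (1) scales with transfer rate, the following minimal sketch computes the time needed to send 2**31 bytes, the span this memo uses in the constraint that follows. The function name and the sample rates are assumptions chosen for illustration only.

```python
# Sketch: time to cycle 2**31 bytes of sequence space at a given rate.
# wrap_time_seconds and the sample rates are hypothetical, for illustration.

HALF_SEQUENCE_SPACE = 2**31  # bytes


def wrap_time_seconds(bytes_per_second: float) -> float:
    """Return the time, in seconds, needed to send 2**31 bytes."""
    if bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return HALF_SEQUENCE_SPACE / bytes_per_second


if __name__ == "__main__":
    # Rates are given in bits per second and converted to bytes per second.
    for name, bits_per_second in [("10 Mbit/s", 10e6),
                                  ("1 Gbit/s", 1e9),
                                  ("10 Gbit/s", 10e9)]:
        seconds = wrap_time_seconds(bits_per_second / 8)
        print(f"{name}: about {seconds:.1f} s")
```

Comparing the printed times with a 120-second MSL gives a feel for where the fixed-MSL assumption stops being safe.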
248 More specifically, if the maximum effective bandwidth at which TCP 249 is able to transmit over a particular path is B bytes per second, 250 then the following constraint must be satisfied for error-free 251 operation:

253       2**31 / B > MSL (secs)                                  [1]

255 The following table shows the value for Twrap = 2**31/B in 256 seconds, for some important values of the bandwidth B:

258       Network      B*8         B           Twrap
259                    bits/sec    bytes/sec   secs
260       _______      _______     ______      ______
262       ARPANET      56kbps      7KBps       3*10**5 (~3.6 days)
264       DS1          1.5Mbps     190KBps     10**4 (~3 hours)
266       Ethernet     10Mbps      1.25MBps    1700 (~30 mins)
268       DS3          45Mbps      5.6MBps     380
270       FDDI         100Mbps     12.5MBps    170
272       Gigabit      1Gbps       125MBps     17
274       10GigE       10Gbps      1.25GBps    1.7

276 It is clear that wrap-around of the sequence space is not a 277 problem for 56kbps packet switching or even 10Mbps Ethernets. On 278 the other hand, at DS3 and FDDI speeds, Twrap is comparable to the 279 2 minute MSL assumed by the TCP specification [Postel81]. Moving 280 towards and beyond gigabit speeds, Twrap becomes too small for 281 reliable enforcement by the Internet TTL mechanism. 283 The 16-bit window field of TCP limits the effective bandwidth B to 284 2**16/RTT, where RTT is the round-trip time in seconds 285 [McKenzie89]. If the RTT is large enough, this limits B to a 286 value that meets the constraint [1] for a large MSL value. For 287 example, consider a transcontinental backbone with an RTT of 60ms 288 (set by the laws of physics). With the bandwidth*delay product 289 limited to 64KB by the TCP window size, B is then limited to 290 1.1MBps, no matter how high the theoretical transfer rate of the 291 path. This corresponds to cycling the sequence number space in 292 Twrap = 2000 secs, which is safe in today's Internet. 294 It is important to understand that the culprit is not the larger 295 window but rather the high bandwidth. For example, consider a 296 (very large) FDDI LAN with a diameter of 10km.
Using the speed of 297 light, we can compute the RTT across the ring as 298 (2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth 299 product is then 833 bytes. A TCP connection across this LAN using 300 a window of only 833 bytes will run at the full 100Mbps and can 301 wrap the sequence space in about 3 minutes, very close to the MSL 302 of TCP. Thus, high speed alone can cause a reliability problem 303 with sequence number wrap-around, even without extended windows. 305 Watson's Delta-T protocol [Watson81] includes network-layer 306 mechanisms for precise enforcement of an MSL. In contrast, the IP 307 mechanism for MSL enforcement is loosely defined and even more 308 loosely implemented in the Internet. Therefore, it is unwise to 309 depend upon active enforcement of MSL for TCP connections, and it 310 is unrealistic to imagine setting MSL's smaller than the current 311 values (e.g., 120 seconds specified for TCP). 313 A possible fix for the problem of cycling the sequence space would 314 be to increase the size of the TCP sequence number field. For 315 example, the sequence number field (and also the acknowledgment 316 field) could be expanded to 64 bits. This could be done either by 317 changing the TCP header or by means of an additional option. 319 Section 4 presents a different mechanism, which we call PAWS 320 (Protect Against Wrapped Sequence numbers), to extend TCP 321 reliability to transfer rates well beyond the foreseeable upper 322 limit of network bandwidths. PAWS uses the TCP Timestamps option 323 defined in Section 3 to protect against old duplicates from the 324 same connection. 326 1.3 Using TCP options 328 The extensions defined in this memo all use new TCP options. We 329 must address two possible issues concerning the use of TCP 330 options: (1) compatibility and (2) overhead. 332 We must pay careful attention to compatibility, i.e., to 333 interoperation with existing implementations.
The only TCP option 334 defined previously, MSS, may appear only on a SYN segment. Every 335 implementation should (and we expect that most will) ignore 336 unknown options on SYN segments. When RFC-1323 was published, 337 there was concern that some buggy TCP implementation might be 338 crashed by the first appearance of an option on a non-SYN segment. 339 However, bugs like that can lead to DOS attacks against a TCP, so 340 it is now expected that most TCP implementations will properly 341 handle unknown options on non-SYN segments. But it is still 342 prudent to be conservative in what you send, and avoiding buggy 343 TCP implementations is not the only reason for negotiating TCP 344 options on SYN segments. Therefore, for each of the extensions 345 defined below, TCP options will be sent on non-SYN segments only 346 after an exchange of options on the SYN segments has indicated 347 that both sides understand the extension. Furthermore, an 348 extension option will be sent in a <SYN,ACK> segment only if the 349 corresponding option was received in the initial <SYN> segment. 351 A question may be raised about the bandwidth and processing 352 overhead for TCP options. Those options that occur on SYN 353 segments are not likely to cause a performance concern. Opening a 354 TCP connection requires execution of significant special-case 355 code, and the processing of options is unlikely to increase that 356 cost significantly. 358 On the other hand, a Timestamps option may appear in any data or 359 ACK segment, adding 12 bytes to the 20-byte TCP header. We 360 believe that the bandwidth saved by reducing unnecessary 361 retransmissions will more than pay for the extra header bandwidth. 363 There is also an issue about the processing overhead for parsing 364 the variable byte-aligned format of options, particularly with a 365 RISC-architecture CPU. Appendix A contains a recommended layout 366 of the options in TCP headers to achieve reasonable data field 367 alignment.
In the spirit of Header Prediction, a TCP can quickly 368 test for this layout and if it is verified then use a fast path. 369 Hosts that use this canonical layout will effectively use the 370 options as a set of fixed-format fields appended to the TCP 371 header. However, to retain the philosophical and protocol 372 framework of TCP options, a TCP must be prepared to parse an 373 arbitrary options field, albeit with less efficiency. 375 Finally, we observe that most of the mechanisms defined in this 376 memo are important for LFN's and/or very high-speed networks. For 377 low-speed networks, it might be a performance optimization to NOT 378 use these mechanisms. A TCP vendor concerned about optimal 379 performance over low-speed paths might consider turning these 380 extensions off for low-speed paths, or allow a user or 381 installation manager to disable them. 383 2. TCP WINDOW SCALE OPTION 385 2.1 Introduction 387 The window scale extension expands the definition of the TCP 388 window to 32 bits and then uses a scale factor to carry this 389 32-bit value in the 16-bit Window field of the TCP header (SEG.WND 390 in RFC-793). The scale factor is carried in a new TCP option, 391 Window Scale. This option is sent only in a SYN segment (a 392 segment with the SYN bit on), hence the window scale is fixed in 393 each direction when a connection is opened. (Another design 394 choice would be to specify the window scale in every TCP segment. 395 It would be incorrect to send a window scale option only when the 396 scale factor changed, since a TCP option in an acknowledgement 397 segment will not be delivered reliably (unless the ACK happens to 398 be piggy-backed on data in the other direction). Fixing the scale 399 when the connection is opened has the advantage of lower overhead 400 but the disadvantage that the scale factor cannot be changed 401 during the connection.) 
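To make the idea concrete, the following minimal sketch shows how a 32-bit window value might be carried in the 16-bit Window field using a power-of-two shift. The function names and the clamp to 65535 are assumptions made for illustration; they are not taken from this memo.

```python
# Sketch of the shift arithmetic behind window scaling.
# to_window_field / from_window_field are hypothetical helper names.

MAX_FIELD = 0xFFFF  # largest value the 16-bit Window field can carry


def to_window_field(true_window: int, shift: int) -> int:
    """Scale a 32-bit window down so it fits the 16-bit header field.

    The clamp is an assumed safety guard for windows too large for the
    chosen shift; the low-order 'shift' bits are discarded.
    """
    return min(true_window >> shift, MAX_FIELD)


def from_window_field(field: int, shift: int) -> int:
    """Recover the (quantized) window from a received header field."""
    return field << shift


if __name__ == "__main__":
    shift = 4
    field = to_window_field(1_000_003, shift)
    print(field, from_window_field(field, shift))
```

Because the low-order bits are dropped on the way out, the peer sees the window rounded down to a multiple of 2**shift.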
403 The maximum receive window, and therefore the scale factor, is 404 determined by the maximum receive buffer space. In a typical 405 modern implementation, this maximum buffer space is set by default 406 but can be overridden by a user program before a TCP connection is 407 opened. This determines the scale factor, and therefore no new 408 user interface is needed for window scaling. 410 2.2 Window Scale Option 412 The three-byte Window Scale option may be sent in a SYN segment by 413 a TCP. It has two purposes: (1) indicate that the TCP is prepared 414 to do both send and receive window scaling, and (2) communicate a 415 scale factor to be applied to its receive window. Thus, a TCP 416 that is prepared to scale windows should send the option, even if 417 its own scale factor is 1. The scale factor is limited to a power 418 of two and encoded logarithmically, so it may be implemented by 419 binary shift operations. 421 TCP Window Scale Option (WSopt): 423 Kind: 3 425 Length: 3 bytes

427       +---------+---------+---------+
428       | Kind=3  |Length=3 |shift.cnt|
429       +---------+---------+---------+

431 This option is an offer, not a promise; both sides must send 432 Window Scale options in their SYN segments to enable window 433 scaling in either direction. If window scaling is enabled, 434 then the TCP that sent this option will right-shift its true 435 receive-window values by 'shift.cnt' bits for transmission in 436 SEG.WND. The value 'shift.cnt' may be zero (offering to scale, 437 while applying a scale factor of 1 to the receive window). 439 This option may be sent in an initial <SYN> segment (i.e., a 440 segment with the SYN bit on and the ACK bit off). It may also 441 be sent in a <SYN,ACK> segment, but only if a Window Scale 442 option was received in the initial <SYN> segment. A Window 443 Scale option in a segment without a SYN bit should be ignored. 445 The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment 446 itself is never scaled.
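The three-byte layout and the "offer, not a promise" rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, None is an assumed way to represent "no option sent or received", and the cap of 14 on shift.cnt is the limit given in Section 2.3.

```python
import struct

WSOPT_KIND = 3
WSOPT_LENGTH = 3
MAX_SHIFT = 14  # upper bound on shift.cnt from Section 2.3


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three-byte Window Scale option: Kind, Length, shift.cnt."""
    if not 0 <= shift_cnt <= MAX_SHIFT:
        raise ValueError("shift.cnt must be in the range 0..14")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LENGTH, shift_cnt)


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a Window Scale option, capped at 14."""
    if len(option) != WSOPT_LENGTH:
        raise ValueError("Window Scale option must be 3 bytes long")
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LENGTH:
        raise ValueError("not a Window Scale option")
    return min(shift_cnt, MAX_SHIFT)


def negotiated_shifts(sent_shift, received_shift):
    """Return (Snd.Wind.Scale, Rcv.Wind.Scale) after the SYN exchange.

    Scaling is enabled only if both sides sent the option; otherwise
    both shift counts are zero.
    """
    if sent_shift is None or received_shift is None:
        return (0, 0)
    return (received_shift, sent_shift)


if __name__ == "__main__":
    print(encode_wsopt(7).hex())
    print(negotiated_shifts(7, decode_wsopt(b"\x03\x03\x02")))
```

Returning zero for both directions when either side omitted the option mirrors the requirement that both SYN segments carry it before any window is scaled.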
448 2.3 Using the Window Scale Option 450 A model implementation of window scaling is as follows, using the 451 notation of RFC-793 [Postel81]: 453 * All windows are treated as 32-bit quantities for storage in 454 the connection control block and for local calculations. 455 This includes the send-window (SND.WND) and the receive- 456 window (RCV.WND) values, as well as the congestion window. 458 * The connection state is augmented by two window shift counts, 459 Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the 460 incoming and outgoing window fields, respectively. 462 * If a TCP receives a <SYN> segment containing a Window Scale 463 option, it sends its own Window Scale option in the <SYN,ACK> 464 segment. 466 * The Window Scale option is sent with shift.cnt = R, where R 467 is the value that the TCP would like to use for its receive 468 window. 470 * Upon receiving a SYN segment with a Window Scale option 471 containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and 472 sets Rcv.Wind.Scale to R; otherwise, it sets both 473 Snd.Wind.Scale and Rcv.Wind.Scale to zero. 475 * The window field (SEG.WND) in the header of every incoming 476 segment, with the exception of SYN segments, is left-shifted 477 by Snd.Wind.Scale bits before updating SND.WND: 479 SND.WND = SEG.WND << Snd.Wind.Scale 481 (assuming the other conditions of RFC-793 are met, and using 482 the "C" notation "<<" for left-shift). 484 * The window field (SEG.WND) of every outgoing segment, with 485 the exception of SYN segments, is right-shifted by 486 Rcv.Wind.Scale bits: 488 SEG.WND = RCV.WND >> Rcv.Wind.Scale. 490 TCP determines if a data segment is "old" or "new" by testing 491 whether its sequence number is within 2**31 bytes of the left edge 492 of the window, and if it is not, discarding the data as "old". To 493 ensure that new data is never mistakenly considered old and vice- 494 versa, the left edge of the sender's window has to be at most 495 2**31 away from the right edge of the receiver's window.
496 The same holds for the sender's right edge and the receiver's left edge. 497 Since the right and left edges of either the sender's or 498 receiver's window differ by the window size, and since the sender 499 and receiver windows can be out of phase by at most the window 500 size, the above constraints imply that 2 * the max window size 501 must be less than 2**31, or

503       max window < 2**30

505 Since the max window is 2**S (where S is the scaling shift count) 506 times at most 2**16 - 1 (the maximum unscaled window), the maximum 507 window is guaranteed to be < 2**30 if S <= 14. Thus, the shift 508 count must be limited to 14 (which allows windows of 2**30 = 1 509 Gbyte). If a Window Scale option is received with a shift.cnt 510 value exceeding 14, the TCP should log the error but use 14 511 instead of the specified value. 513 The scale factor applies only to the Window field as transmitted 514 in the TCP header; each TCP using extended windows will maintain 515 the window values locally as 32-bit numbers. For example, the 516 "congestion window" computed by Slow Start and Congestion 517 Avoidance is not affected by the scale factor, so window scaling 518 will not introduce quantization into the congestion window. 520 3. RTTM: ROUND-TRIP TIME MEASUREMENT 522 3.1 Introduction 524 Accurate and current RTT estimates are necessary to adapt to 525 changing traffic conditions and to avoid an instability known as 526 "congestion collapse" [Nagle84] in a busy network. However, 527 accurate measurement of RTT may be difficult both in theory and in 528 implementation. 530 Many TCP implementations base their RTT measurements upon a sample 531 of one packet per window or less. While this yields an adequate 532 approximation to the RTT for small windows, it results in an 533 unacceptably poor RTT estimate for an LFN.
If we look at RTT 534 estimation as a signal processing problem (which it is), a data 535 signal at some frequency, the packet rate, is being sampled at a 536 lower frequency, the window rate. This lower sampling frequency 537 violates Nyquist's criterion and may therefore introduce "aliasing" 538 artifacts into the estimated RTT [Hamming77]. 540 A good RTT estimator with a conservative retransmission timeout 541 calculation can tolerate aliasing when the sampling frequency is 542 "close" to the data frequency. For example, with a window of 8 543 packets, the sample rate is 1/8 the data frequency -- less than an 544 order of magnitude different. However, when the window is tens or 545 hundreds of packets, the RTT estimator may be seriously in error, 546 resulting in spurious retransmissions. 548 If there are dropped packets, the problem becomes worse. Zhang 549 [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is 550 not possible to accumulate reliable RTT estimates if retransmitted 551 segments are included in the estimate. Since a full window of 552 data will have been transmitted prior to a retransmission, all of 553 the segments in that window will have to be ACKed before the next 554 RTT sample can be taken. This means at least an additional 555 window's worth of time between RTT measurements and, as the error 556 rate approaches one per window of data (e.g., 10**-6 errors per 557 bit for the Wideband satellite network), it becomes effectively 558 impossible to obtain a valid RTT measurement. 560 A solution to these problems, which actually simplifies the sender 561 substantially, is as follows: using TCP options, the sender places 562 a timestamp in each data segment, and the receiver reflects these 563 timestamps back in ACK segments. Then a single subtract gives the 564 sender an accurate RTT measurement for every ACK segment (which 565 will correspond to every other data segment, with a sensible 566 receiver).
We call this the RTTM (Round-Trip Time Measurement) 567 mechanism. 569 It is vitally important to use the RTTM mechanism with big 570 windows; otherwise, the door is opened to some dangerous 571 instabilities due to aliasing. Furthermore, the option is 572 probably useful for all TCP's, since it simplifies the sender. 574 3.2 TCP Timestamps Option 576 TCP is a symmetric protocol, allowing data to be sent at any time 577 in either direction, and therefore timestamp echoing may occur in 578 either direction. For simplicity and symmetry, we specify that 579 timestamps always be sent and echoed in both directions. For 580 efficiency, we combine the timestamp and timestamp reply fields 581 into a single TCP Timestamps Option. 583 TCP Timestamps Option (TSopt): 585 Kind: 8 587 Length: 10 bytes 589 +-------+-------+---------------------+---------------------+ 590 |Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)| 591 +-------+-------+---------------------+---------------------+ 592 1 1 4 4 594 The Timestamps option carries two four-byte timestamp fields. 595 The Timestamp Value field (TSval) contains the current value of 596 the timestamp clock of the TCP sending the option. 598 The Timestamp Echo Reply field (TSecr) is valid if the ACK bit 599 is set in the TCP header; if it is valid, it echoes a timestamp 600 value that was sent by the remote TCP in the TSval field of a 601 Timestamps option. When TSecr is not valid, its value must be 602 zero. The TSecr value will generally be from the most recent 603 Timestamp option that was received; however, there are 604 exceptions that are explained below. 606 A TCP may send the Timestamps option (TSopt) in an initial 607 <SYN> segment (i.e., a segment containing a SYN bit and no ACK 608 bit), and may send a TSopt in other segments only if it 609 received a TSopt in the initial <SYN> or <SYN,ACK> segment for 610 the connection. Once a TSopt has been sent or received in a 611 non-<SYN> segment, it must be sent in all segments.
Once a 612 TSopt has been received in a non-<SYN> segment, then any 613 successive segment that is received without the RST bit and 614 without a TSopt may be ACKed and dropped without further 615 processing. 617 3.3 The RTTM Mechanism 619 RTTM places a Timestamps option in every segment, with a TSval 620 that is obtained from a (virtual) "timestamp clock". Values of 621 this clock must be at least approximately proportional to 622 real time, in order to measure actual RTT. 624 These TSval values are echoed in TSecr values in the reverse 625 direction. The difference between a received TSecr value and the 626 current timestamp clock value provides an RTT measurement. 628 When timestamps are used, every segment that is received will 629 contain a TSecr value; however, these values cannot all be used to 630 update the measured RTT. The following example illustrates why. 631 It shows a one-way data flow with segments arriving in sequence 632 without loss. Here A, B, C... represent data blocks occupying 633 successive blocks of sequence numbers, and ACK(A),... represent 634 the corresponding cumulative acknowledgments. The two timestamp 635 fields of the Timestamps option are shown symbolically as <TSval=x, 636 TSecr=y>. Each TSecr field contains the value most recently 637 received in a TSval field; these echoed values, labelled 638 "TS.Recent", are shown in parentheses. 640 TCP A TCP B 642 (TS.Recent) (TS.Recent) 644 1. (120) ---> (1) 646 2. (125) <--- (1) 648 3. (125) ---> (6) 650 4. (130) <--- (6) 652 . . . ( Pause for 60 timestamp clock ticks ) . . . . 654 5. (130) ---> (1) 656 6. (125) <--- (1) 658 4. (127) ---> ... 660 5. ... <--- (5) 662 TCP A TCP B 664 <A,TSval=1,TSecr=120> ------> 666 <---- <ACK(A),TSval=127,TSecr=1> 668 <B,TSval=5,TSecr=127> ------> 670 <---- <ACK(B),TSval=131,TSecr=5> 672 . . . . . . . . . . . . . . . . . . . . . . 674 <C,TSval=65,TSecr=131> ------> 676 <---- <ACK(C),TSval=191,TSecr=65> 678 (etc) 680 The dotted line marks a pause (60 time units long) in which A had 681 nothing to send. Note that this pause inflates the RTT which B 682 could infer from receiving TSecr=131 in data segment C.
Thus, in 683 one-way data flows, RTTM in the reverse direction measures a value 684 that is inflated by gaps in sending data. However, the following 685 rule prevents a resulting inflation of the measured RTT: 687 RTTM Rule: A TSecr value received in a segment is used to 688 update the averaged RTT measurement only if the segment 689 acknowledges some new data, i.e., only if it advances the 690 left edge of the send window. 692 Since TCP B is not sending data, the data segment C does not 693 acknowledge any new data when it arrives at B. Thus, the inflated 694 RTTM measurement is not used to update B's RTTM measurement. 696 Implementors should note that with Timestamps multiple RTTMs can 697 be taken per RTT. Many RTO estimators have a weighting factor 698 based on an implicit assumption that at most one RTTM will be 699 gotten per RTT. When using multiple RTTMs per RTT to update the 700 RTO estimator, the weighting factor needs to be decreased to take 701 into account the more frequent RTTMs. For example, 703 3.4 Which Timestamp to Echo 705 If more than one Timestamps option is received before a reply 706 segment is sent, the TCP must choose only one of the TSvals to 707 echo, ignoring the others. To minimize the state kept in the 708 receiver (i.e., the number of unprocessed TSvals), the receiver 709 should be required to retain at most one timestamp in the 710 connection control block. 712 There are three situations to consider: 714 (A) Delayed ACKs. 716 Many TCP's acknowledge only every Kth segment out of a group 717 of segments arriving within a short time interval; this 718 policy is known generally as "delayed ACKs". The data-sender 719 TCP must measure the effective RTT, including the additional 720 time due to delayed ACKs, or else it will retransmit 721 unnecessarily. Thus, when delayed ACKs are in use, the 722 receiver should reply with the TSval field from the earliest 723 unacknowledged segment. 
725 (B) A hole in the sequence space (segment(s) have been lost). 727 The sender will continue sending until the window is filled, 728 and the receiver may be generating ACKs as these out-of-order 729 segments arrive (e.g., to aid "fast retransmit"). 731 The lost segment is probably a sign of congestion, and in 732 that situation the sender should be conservative about 733 retransmission. Furthermore, it is better to overestimate 734 than underestimate the RTT. An ACK for an out-of-order 735 segment should therefore contain the timestamp from the most 736 recent segment that advanced the window. 738 The same situation occurs if segments are re-ordered by the 739 network. 741 (C) A filled hole in the sequence space. 743 The segment that fills the hole represents the most recent 744 measurement of the network characteristics. On the other 745 hand, an RTT computed from an earlier segment would probably 746 include the sender's retransmit time-out, badly biasing the 747 sender's average RTT estimate. Thus, the timestamp from the 748 latest segment (which filled the hole) must be echoed. 750 An algorithm that covers all three cases is described in the 751 following rules for Timestamps option processing on a synchronized 752 connection: 754 (1) The connection state is augmented with two 32-bit slots: 755 TS.Recent holds a timestamp to be echoed in TSecr whenever a 756 segment is sent, and Last.ACK.sent holds the ACK field from 757 the last segment sent. Last.ACK.sent will equal RCV.NXT 758 except when ACKs have been delayed. 760 (2) If: 762 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 764 then SEG.TSval is copied to TS.Recent; otherwise, it is 765 ignored. 767 (3) When a TSopt is sent, its TSecr field is set to the current 768 TS.Recent value. 770 The following examples illustrate these rules. Here A, B, C... 771 represent data segments occupying successive blocks of sequence 772 numbers, and ACK(A),... represent the corresponding 773 acknowledgment segments.
Note that ACK(A) has the same sequence 774 number as B. We show only one direction of timestamp echoing, for 775 clarity. 777 o Packets arrive in sequence, and some of the ACKs are delayed. 779 By Case (A), the timestamp from the oldest unacknowledged 780 segment is echoed. 782 TS.Recent 783 <A, TSval=1> -------------------> 784 1 785 <B, TSval=2> -------------------> 786 1 787 <C, TSval=3> -------------------> 788 1 789 <---- <ACK(C), TSecr=1> 790 (etc) 792 o Packets arrive out of order, and every packet is 793 acknowledged. 795 By Case (B), the timestamp from the last segment that 796 advanced the left window edge is echoed, until the missing 797 segment arrives; it is echoed according to Case (C). The 798 same sequence would occur if segments B and D were lost and 799 retransmitted. 801 TS.Recent 802 <A, TSval=1> -------------------> 803 1 804 <---- <ACK(A), TSecr=1> 805 1 806 <C, TSval=3> -------------------> 807 1 808 <---- <ACK(A), TSecr=1> 809 1 810 <B, TSval=2> -------------------> 811 2 812 <---- <ACK(C), TSecr=2> 813 2 814 <E, TSval=5> -------------------> 815 2 816 <---- <ACK(C), TSecr=2> 817 2 818 <D, TSval=4> -------------------> 819 4 820 <---- <ACK(E), TSecr=4> 821 (etc) 823 4. PAWS: PROTECT AGAINST WRAPPED SEQUENCE NUMBERS 825 4.1 Introduction 827 Section 4.2 describes a simple mechanism to reject old duplicate 828 segments that might corrupt an open TCP connection; we call this 829 mechanism PAWS (Protect Against Wrapped Sequence numbers). PAWS 830 operates within a single TCP connection, using state that is saved 831 in the connection control block. Section 4.3 and Appendix B 832 discuss the implications of the PAWS mechanism for avoiding old 833 duplicates from previous incarnations of the same connection. 835 4.2 The PAWS Mechanism 837 PAWS uses the same TCP Timestamps option as the RTTM mechanism 838 described earlier, and assumes that every received TCP segment 839 (including data and ACK segments) contains a timestamp SEG.TSval 840 whose values are monotone non-decreasing in time.
The basic idea 841 is that a segment can be discarded as an old duplicate if it is 842 received with a timestamp SEG.TSval less than some timestamp 843 recently received on this connection. 845 In both the PAWS and the RTTM mechanism, the "timestamps" are 846 32-bit unsigned integers in a modular 32-bit space. Thus, "less 847 than" is defined the same way it is for TCP sequence numbers, and 848 the same implementation techniques apply. If s and t are 849 timestamp values, s < t if 0 < (t - s) < 2**31, computed in 850 unsigned 32-bit arithmetic. 852 The choice of incoming timestamps to be saved for this comparison 853 must guarantee a value that is monotone increasing. For example, 854 we might save the timestamp from the segment that last advanced 855 the left edge of the receive window, i.e., the most recent in- 856 sequence segment. Instead, we choose the value TS.Recent 857 introduced in Section 3.4 for the RTTM mechanism, since using a 858 common value for both PAWS and RTTM simplifies the implementation 859 of both. As Section 3.4 explained, TS.Recent differs from the 860 timestamp from the last in-sequence segment only in the case of 861 delayed ACKs, and therefore by less than one window. Either 862 choice will therefore protect against sequence number wrap-around. 864 RTTM was specified in a symmetrical manner, so that TSval 865 timestamps are carried in both data and ACK segments and are 866 echoed in TSecr fields carried in returning ACK or data segments. 867 PAWS submits all incoming segments to the same test, and therefore 868 protects against duplicate ACK segments as well as data segments. 869 (An alternative un-symmetric algorithm would protect against old 870 duplicate ACKs: the sender of data would reject incoming ACK 871 segments whose TSecr values were less than the TSecr saved from 872 the last segment whose ACK field advanced the left edge of the 873 send window. This algorithm was deemed to lack economy of 874 mechanism and symmetry.) 
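The modular "less than" test that this section defines for timestamps (s < t if 0 < (t - s) < 2**31 in unsigned 32-bit arithmetic) can be sketched in a few lines. This is an illustration only, not part of the specification; the names ts_lt and TS_MOD are hypothetical, and Python's unbounded integers are reduced modulo 2**32 to imitate unsigned 32-bit subtraction.

```python
TS_MOD = 1 << 32   # size of the 32-bit timestamp space
TS_HALF = 1 << 31  # half the space: the limit of "less than"


def ts_lt(s: int, t: int) -> bool:
    """Return True if timestamp s is 'less than' t in modular 32-bit space.

    Follows the definition in the text: s < t if 0 < (t - s) < 2**31,
    with the subtraction done in unsigned 32-bit arithmetic.
    """
    return 0 < ((t - s) % TS_MOD) < TS_HALF


if __name__ == "__main__":
    print(ts_lt(1, 2))            # ordinary case
    print(ts_lt(0xFFFFFFF0, 5))   # t has wrapped past zero, s has not
```

A value that has wrapped past zero still compares as newer than one just below the wrap point, which is the property PAWS relies on.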
876 TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to 877 initialize PAWS. PAWS protects against old duplicate non-SYN 878 segments, and duplicate SYN segments received while there is a 879 synchronized connection. Duplicate {SYN} and {SYN,ACK} segments 880 received when there is no connection will be discarded by the 881 normal 3-way handshake and sequence number checks of TCP. 883 RFC-1323 recommended that RST segments NOT carry timestamps, and 884 that they be acceptable regardless of their timestamp. At that 885 time, the thinking was that old duplicate RST segments should be 886 exceedingly unlikely, and their cleanup function should take 887 precedence over timestamps. More recently, discussion about 888 various blind attacks on TCP connections has raised the 889 suggestion that if the Timestamps option is present, SEG.TSecr 890 could be used to provide stricter acceptance tests for RST 891 packets. While this is still under discussion, to enable research into 892 this area it is now recommended that, when generating a RST, 893 if the packet causing the RST to be generated contained a 894 Timestamps option, the RST also contain a Timestamps option. 895 In the RST segment, SEG.TSecr should be set to SEG.TSval from the 896 incoming packet and SEG.TSval should be set to zero. If a RST is 897 being generated because of a user abort, and Snd.TS.OK is set, 898 then a Timestamps option should be included in the RST. When a 899 RST packet is received, it must not be subjected to PAWS checks, 900 and information from the Timestamps option must not be used to update 901 connection state information. SEG.TSecr may be used to provide 902 stricter RST acceptance checks.
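The recommendation above for a RST generated in response to an incoming packet can be sketched as follows. This is a minimal illustration under stated assumptions: the function name rst_tsopt and the representation of a Timestamps option as a (TSval, TSecr) tuple are hypothetical, and the user-abort case is not modelled because the text does not say which values to use there.

```python
from typing import Optional, Tuple

# A Timestamps option is modelled as a (TSval, TSecr) pair (an assumption).
TSopt = Tuple[int, int]


def rst_tsopt(incoming: Optional[TSopt]) -> Optional[TSopt]:
    """Build the Timestamps option for a RST sent in reply to a packet.

    If the offending packet carried no Timestamps option, the RST carries
    none either. Otherwise TSecr echoes the incoming TSval and TSval is zero.
    """
    if incoming is None:
        return None
    incoming_tsval, _incoming_tsecr = incoming
    return (0, incoming_tsval)


if __name__ == "__main__":
    print(rst_tsopt((1234, 99)))
    print(rst_tsopt(None))
```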
904 4.2.1 Basic PAWS Algorithm 906 The PAWS algorithm requires the following processing to be 907 performed on all incoming segments for a synchronized 908 connection: 910 R1) If there is a Timestamps option in the arriving segment, 911 SEG.TSval < TS.Recent, TS.Recent is valid (see later 912 discussion) and the RST bit is not set, then treat the 913 arriving segment as not acceptable: 915 Send an acknowledgement in reply as specified in 916 RFC-793 page 69 and drop the segment. 918 Note: it is necessary to send an ACK segment in order 919 to retain TCP's mechanisms for detecting and 920 recovering from half-open connections. For example, 921 see Figure 10 of RFC-793. 923 R2) If the segment is outside the window, reject it (normal 924 TCP processing) 926 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent 927 (see Section 3.4), then record its timestamp in TS.Recent. 929 R4) If an arriving segment is in-sequence (i.e., at the left 930 window edge), then accept it normally. 932 R5) Otherwise, treat the segment as a normal in-window, out- 933 of-sequence TCP segment (e.g., queue it for later delivery 934 to the user). 936 Steps R2, R4, and R5 are the normal TCP processing steps 937 specified by RFC-793. 939 It is important to note that the timestamp is checked only when 940 a segment first arrives at the receiver, regardless of whether 941 it is in-sequence or it must be queued for later delivery. 942 Consider the following example. 944 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has 945 been sent, where the letter indicates the sequence number 946 and the digit represents the timestamp. Suppose also that 947 segment B.1 has been lost. The timestamp in TS.Recent is 948 1 (from A.1), so C.1, ..., Z.1 are considered acceptable 949 and are queued. When B is retransmitted as segment B.2 950 (using the latest timestamp), it fills the hole and causes 951 all the segments through Z to be acknowledged and passed 952 to the user.
The timestamps of the queued segments are 953 *not* inspected again at this time, since they have 954 already been accepted. When B.2 is accepted, TS.Recent is 955 set to 2. 957 This rule allows reasonable performance under loss. A full 958 window of data is in transit at all times, and after a loss a 959 full window less one packet will show up out-of-sequence to be 960 queued at the receiver (e.g., up to ~2**30 bytes of data); the 961 timestamp option must not result in discarding this data. 963 In certain unlikely circumstances, the algorithm of rules R1-R5 964 could lead to discarding some segments unnecessarily, as shown 965 in the following example: 967 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have 968 been sent in sequence and that segment B.1 has been lost. 969 Furthermore, suppose delivery of some of C.1, ... Z.1 is 970 delayed until AFTER the retransmission B.2 arrives at the 971 receiver. These delayed segments will be discarded 972 unnecessarily when they do arrive, since their timestamps 973 are now out of date. 975 This case is very unlikely to occur. If the retransmission was 976 triggered by a timeout, some of the segments C.1, ... Z.1 must 977 have been delayed longer than the RTO time. This is presumably 978 an unlikely event, or there would be many spurious timeouts and 979 retransmissions. If B's retransmission was triggered by the 980 "fast retransmit" algorithm, i.e., by duplicate ACKs, then the 981 queued segments that caused these ACKs must have been received 982 already. 984 Even if a segment were delayed past the RTO, the Fast 985 Retransmit mechanism [Jacobson90c] will cause the delayed 986 packets to be retransmitted at the same time as B.2, avoiding 987 an extra RTT and therefore causing a very small performance 988 penalty. 990 We know of no case with a significant probability of occurrence 991 in which timestamps will cause performance degradation by 992 unnecessarily discarding segments.
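The processing steps of Section 4.2.1 can be sketched as a single receive function. This is a simplified illustration, not a definitive implementation. The names Conn and paws_input and the returned action strings are hypothetical. Sequence numbers are plain integers that do not wrap, the window test is reduced to "starting sequence number lies in the window", every accepted in-sequence segment is assumed to be ACKed immediately, and delivery of previously queued segments is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

TS_MOD = 1 << 32


def ts_lt(s: int, t: int) -> bool:
    """Modular 32-bit 'less than': 0 < (t - s) mod 2**32 < 2**31."""
    return 0 < ((t - s) % TS_MOD) < (1 << 31)


@dataclass
class Conn:
    rcv_nxt: int                  # left edge of the receive window
    rcv_wnd: int                  # receive window size
    last_ack_sent: int            # Last.ACK.sent
    ts_recent: int                # TS.Recent
    ts_recent_valid: bool = True  # see Section 4.2.3


def paws_input(c: Conn, seq: int, length: int,
               tsval: Optional[int], rst: bool = False) -> str:
    """Apply rules R1-R5 to one arriving segment and return the action."""
    # R1: old timestamp, TS.Recent valid, not a RST -> ACK and drop.
    if (tsval is not None and not rst and c.ts_recent_valid
            and ts_lt(tsval, c.ts_recent)):
        return "ack-and-drop"
    # R2: outside the window -> normal TCP rejection (simplified test).
    if not (c.rcv_nxt <= seq < c.rcv_nxt + c.rcv_wnd):
        return "reject"
    # R3: record the timestamp if SEG.SEQ <= Last.ACK.sent.
    if tsval is not None and not rst and seq <= c.last_ack_sent:
        c.ts_recent = tsval
        c.ts_recent_valid = True
    # R4: in-sequence segment -> accept (assumes an immediate ACK).
    if seq == c.rcv_nxt:
        c.rcv_nxt += length
        c.last_ack_sent = c.rcv_nxt
        return "accept"
    # R5: in-window but out of sequence -> queue for later delivery.
    return "queue"


if __name__ == "__main__":
    c = Conn(rcv_nxt=0, rcv_wnd=1000, last_ack_sent=0, ts_recent=0)
    print(paws_input(c, 0, 100, 1))    # A.1
    print(paws_input(c, 200, 100, 1))  # C.1 (B.1 was lost)
    print(paws_input(c, 100, 100, 2))  # B.2 fills the hole
    print(paws_input(c, 300, 100, 1))  # a delayed D.1 arriving after B.2
```

The last call mirrors the unlikely case described above: once B.2 has set TS.Recent to 2, a delayed segment stamped 1 is dropped even though it carries in-window data.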
994 4.2.2 Timestamp Clock 996 It is important to understand that the PAWS algorithm does not 997 require clock synchronization between sender and receiver. The 998 sender's timestamp clock is used to stamp the segments, and the 999 sender uses the echoed timestamp to measure RTT's. However, 1000 the receiver treats the timestamp as simply a monotone- 1001 increasing serial number, without any necessary connection to 1002 its clock. From the receiver's viewpoint, the timestamp is 1003 acting as a logical extension of the high-order bits of the 1004 sequence number. 1006 The receiver algorithm does place some requirements on the 1007 frequency of the timestamp clock. 1009 (a) The timestamp clock must not be "too slow". 1011 It must tick at least once for each 2**31 bytes sent. In 1012 fact, in order to be useful to the sender for round trip 1013 timing, the clock should tick at least once per window's 1014 worth of data, and even with the RFC-1072 window 1015 extension, 2**31 bytes must be at least two windows. 1017 To make this more quantitative, any clock faster than 1 1018 tick/sec will reject old duplicate segments for link 1019 speeds of ~8 Gbps. A 1ms timestamp clock will work at 1020 link speeds up to 8 Tbps (8*10**12) bps! 1022 (b) The timestamp clock must not be "too fast". 1024 Its recycling time must be greater than MSL seconds. 1025 Since the clock (timestamp) is 32 bits and the worst-case 1026 MSL is 255 seconds, the maximum acceptable clock frequency 1027 is one tick every 59 ns. 1029 However, it is desirable to establish a much longer 1030 recycle period, in order to handle outdated timestamps on 1031 idle connections (see Section 4.2.3), and to relax the MSL 1032 requirement for preventing sequence number wrap-around. 1033 With a 1 ms timestamp clock, the 32-bit timestamp will 1034 wrap its sign bit in 24.8 days. Thus, it will reject old 1035 duplicates on the same connection if MSL is 24.8 days or 1036 less. 
This appears to be a very safe figure; an MSL of 1037 24.8 days or longer can probably be assumed by the gateway 1038 system without requiring precise MSL enforcement by the 1039 TTL value in the IP layer. 1041 Based upon these considerations, we choose a timestamp clock 1042 frequency in the range 1 ms to 1 sec per tick. This range also 1043 matches the requirements of the RTTM mechanism, which does not 1044 need much more resolution than the granularity of the 1045 retransmit timer, e.g., tens or hundreds of milliseconds. 1047 The PAWS mechanism also puts a strong monotonicity requirement 1048 on the sender's timestamp clock. The method of implementation 1049 of the timestamp clock to meet this requirement depends upon 1050 the system hardware and software. 1052 * Some hosts have a hardware clock that is guaranteed to be 1053 monotonic between hardware resets. 1055 * A clock interrupt may be used to simply increment a binary 1056 integer by 1 periodically. 1058 * The timestamp clock may be derived from a system clock 1059 that is subject to being abruptly changed, by adding a 1060 variable offset value. This offset is initialized to 1061 zero. When a new timestamp clock value is needed, the 1062 offset can be adjusted as necessary to make the new value 1063 equal to or larger than the previous value (which was 1064 saved for this purpose). 1066 4.2.3 Outdated Timestamps 1068 If a connection remains idle long enough for the timestamp 1069 clock of the other TCP to wrap its sign bit, then the value 1070 saved in TS.Recent will become too old; as a result, the PAWS 1071 mechanism will cause all subsequent segments to be rejected, 1072 freezing the connection (until the timestamp clock wraps its 1073 sign bit again). 1075 With the chosen range of timestamp clock frequencies (1 sec to 1076 1 ms), the time to wrap the sign bit will be between 24.8 days 1077 and 24800 days. 
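The clock figures quoted in Sections 4.2.2 and 4.2.3 follow from simple arithmetic on the 32-bit timestamp space. The sketch below is only a check of those numbers; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def sign_bit_wrap_days(tick_seconds: float) -> float:
    """Days until a 32-bit timestamp wraps its sign bit (2**31 ticks)."""
    return (2 ** 31) * tick_seconds / SECONDS_PER_DAY


def fastest_tick_seconds(msl_seconds: float = 255.0) -> float:
    """Shortest tick for which the full 2**32 cycle still exceeds the MSL."""
    return msl_seconds / (2 ** 32)


if __name__ == "__main__":
    print(round(sign_bit_wrap_days(0.001), 1))       # 1 ms clock, in days
    print(round(sign_bit_wrap_days(1.0)))            # 1 sec clock, in days
    print(round(fastest_tick_seconds() * 1e9, 1))    # in nanoseconds
```

With a 1 ms tick the sign bit wraps after roughly 24.8 days, and with a 1 sec tick after roughly 24,855 days, which the text rounds to 24800.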
A TCP connection that is idle for more than 24 1078 days and then comes to life is exceedingly unusual. However, 1079 it is undesirable in principle to place any limitation on TCP 1080 connection lifetimes. 1082 We therefore require that an implementation of PAWS include a 1083 mechanism to "invalidate" the TS.Recent value when a connection 1084 is idle for more than 24 days. (An alternative solution to the 1085 problem of outdated timestamps would be to send keepalive 1086 segments at a very low rate, but still more often than the 1087 wrap-around time for timestamps, e.g., once a day. This would 1088 impose negligible overhead. However, the TCP specification has 1089 never included keepalives, so the solution based upon 1090 invalidation was chosen.) 1092 Note that a TCP does not know the frequency, and therefore, the 1093 wraparound time, of the other TCP, so it must assume the worst. 1094 The validity of TS.Recent needs to be checked only if the basic 1095 PAWS timestamp check fails, i.e., only if SEG.TSval < 1096 TS.Recent. If TS.Recent is found to be invalid, then the 1097 segment is accepted, regardless of the failure of the timestamp 1098 check, and rule R3 updates TS.Recent with the TSval from the 1099 new segment. 1101 To detect how long the connection has been idle, the TCP may 1102 update a clock or timestamp value associated with the 1103 connection whenever TS.Recent is updated, for example. The 1104 details will be implementation-dependent. 1106 4.2.4 Header Prediction 1108 "Header prediction" [Jacobson90a] is a high-performance 1109 transport protocol implementation technique that is most 1110 important for high-speed links. This technique optimizes the 1111 code for the most common case, receiving a segment correctly 1112 and in order. Using header prediction, the receiver asks the 1113 question, "Is this segment the next in sequence?" 
This 1114 question can be answered in fewer machine instructions than the 1115 question, "Is this segment within the window?" 1117 Adding header prediction to our timestamp procedure leads to 1118 the following recommended sequence for processing an arriving 1119 TCP segment: 1121 H1) Check timestamp (same as step R1 above) 1123 H2) Do header prediction: if segment is next in sequence and 1124 if there are no special conditions requiring additional 1125 processing, accept the segment, record its timestamp, and 1126 skip H3. 1128 H3) Process the segment normally, as specified in RFC-793. 1129 This includes dropping segments that are outside the 1130 window and possibly sending acknowledgments, and queueing 1131 in-window, out-of-sequence segments. 1133 Another possibility would be to interchange steps H1 and H2, 1134 i.e., to perform the header prediction step H2 FIRST, and 1135 perform H1 and H3 only when header prediction fails. This 1136 could be a performance improvement, since the timestamp check 1137 in step H1 is very unlikely to fail, and it requires interval 1138 arithmetic on a finite field, a relatively expensive operation. 1139 To perform this check on every single segment is contrary to 1140 the philosophy of header prediction. We believe that this 1141 change might reduce CPU time for TCP protocol processing by up 1142 to 5-10% on high-speed networks. 1144 However, putting H2 first would create a hazard: a segment from 1145 2**32 bytes in the past might arrive at exactly the wrong time 1146 and be accepted mistakenly by the header-prediction step. The 1147 following reasoning has been introduced [Jacobson90b] to show 1148 that the probability of this failure is negligible. 1150 If all segments are equally likely to show up as old 1151 duplicates, then the probability of an old duplicate 1152 exactly matching the left window edge is the maximum 1153 segment size (MSS) divided by the size of the sequence 1154 space. 
This ratio must be less than 2**-16, since MSS 1155 must be < 2**16; for example, it will be (2**12)/(2**32) = 1156 2**-20 for an FDDI link. However, the older a segment is, 1157 the less likely it is to be retained in the Internet, and 1158 under any reasonable model of segment lifetime the 1159 probability of an old duplicate exactly at the left window 1160 edge must be much smaller than 2**-16. 1162 The 16 bit TCP checksum also allows a basic unreliability 1163 of one part in 2**16. A protocol mechanism whose 1164 reliability exceeds the reliability of the TCP checksum 1165 should be considered "good enough", i.e., it won't 1166 contribute significantly to the overall error rate. We 1167 therefore believe we can ignore the problem of an old 1168 duplicate being accepted by doing header prediction before 1169 checking the timestamp. 1171 However, this probabilistic argument is not universally 1172 accepted, and the consensus at present is that the performance 1173 gain does not justify the hazard in the general case. It is 1174 therefore recommended that H2 follow H1. 1176 4.3. Duplicates from Earlier Incarnations of Connection 1178 The PAWS mechanism protects against errors due to sequence number 1179 wrap-around on high-speed connection. Segments from an earlier 1180 incarnation of the same connection are also a potential cause of 1181 old duplicate errors. In both cases, the TCP mechanisms to 1182 prevent such errors depend upon the enforcement of a maximum 1183 segment lifetime (MSL) by the Internet (IP) layer (see Appendix of 1184 RFC-1185 for a detailed discussion). Unlike the case of sequence 1185 space wrap-around, the MSL required to prevent old duplicate 1186 errors from earlier incarnations does not depend upon the transfer 1187 rate. If the IP layer enforces the recommended 2 minute MSL of 1188 TCP, and if the TCP rules are followed, TCP connections will be 1189 safe from earlier incarnations, no matter how high the network 1190 speed. 
Thus, the PAWS mechanism is not required for this case. 1192 We may still ask whether the PAWS mechanism can provide additional 1193 security against old duplicates from earlier connections, allowing 1194 us to relax the enforcement of MSL by the IP layer. Appendix B 1195 explores this question, showing that further assumptions and/or 1196 mechanisms are required, beyond those of PAWS. This is not part 1197 of the current extension. 1199 5. CONCLUSIONS AND ACKNOWLEDGMENTS 1201 This memo presented a set of extensions to TCP to provide efficient 1202 operation over large-bandwidth*delay-product paths and reliable 1203 operation over very high-speed paths. These extensions are designed 1204 to provide compatible interworking with TCP's that do not implement 1205 the extensions. 1207 These mechanisms are implemented using new TCP options for scaled 1208 windows and timestamps. The timestamps are used for two distinct 1209 mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect 1210 Against Wrapped Sequences). 1212 The Window Scale option was originally suggested by Mike St. Johns of 1213 USAF/DCA. The present form of the option was suggested by Mike 1214 Karels of UC Berkeley in response to a more cumbersome scheme defined 1215 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1216 description in RFC-1185. 1218 Finally, much of this work originated as the result of discussions 1219 within the End-to-End Task Force on the theoretical limitations of 1220 transport protocols in general and TCP in particular. Task force 1221 members and others on the end2end-interest list have made valuable 1222 contributions by pointing out flaws in the algorithms and the 1223 documentation. Continued discussion and development since the 1224 publication of RFC-1323 originally occurred in the IETF TCP Large 1225 Windows Working Group, later on in the End-to-End Task Force, and 1226 most recently in the IETF TCP Maintenance Working Group.
The authors 1227 are grateful for all these contributions. 1229 6. REFERENCES 1231 [Braden89] Braden, R., editor, "Requirements for Internet Hosts -- 1232 Communication Layers", RFC 1122, October, 1989 1234 [Floyd00] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M., "An 1235 Extension to the Selective Acknowledgement (SACK) Option for TCP", 1236 RFC 2883, July 2000. 1238 [Blanton03] Blanton, E., Allman, M., Fall, K., Wang, L., "A 1239 Conservative Selective Acknowledgment (SACK)-based Loss Recovery 1240 Algorithm for TCP", RFC 3517, April 2003. 1242 [Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in 1243 Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop 1244 on Distributed Data Management and Computer Networks, May 1977. 1246 [Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4, 1247 Prentice Hall, Englewood Cliffs, N.J., 1977. 1249 [Jacobson88a] Jacobson, V., "Congestion Avoidance and Control", 1250 SIGCOMM '88, Stanford, CA., August 1988. 1252 [Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for 1253 Long-Delay Paths", RFC-1072, LBL and USC/Information Sciences 1254 Institute, October 1988. 1256 [Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM 1257 Computer Communication Review, April 1990. 1259 [Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP 1260 Extension for High-Speed Paths", RFC-1185, LBL and USC/Information 1261 Sciences Institute, October 1990. 1263 [Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance 1264 algorithm", Message to end2end-interest mailing list, April 1990. 1266 [Jacobson92d] Jacobson, V., Braden, R., and Borman, D., "TCP 1267 Extension for High Performance", RFC-1323, LBL, USC/Information 1268 Sciences Institute and Cray Research, May 1992. 1270 [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet 1271 Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm., 1272 Scottsdale, Arizona, March 1986. 1274 [Karn87] Karn, P. and C. 
Partridge, "Estimating Round-Trip Times 1275 in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT, 1276 August 1987. 1278 [Martin03] Martin, D., "[Tsvwg] RFC 1323.bis" Message to tsvwg 1279 mailing list, September 30, 2003. 1281 [Mathis96] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A., 1282 "TCP Selective Acknowledgment Options", RFC 2018, October, 1996. 1284 [McKenzie89] McKenzie, A., "A Problem with the TCP Big Window 1285 Option", RFC 1110, BBN STC, August 1989. 1287 [Nagle84] Nagle, J., "Congestion Control in IP/TCP 1288 Internetworks", RFC 896, FACC, January 1984. 1290 [Postel81] Postel, J., "Transmission Control Protocol - DARPA 1291 Internet Program Protocol Specification", RFC 793, DARPA, 1292 September 1981. 1294 [Postel83] Postel, J., "The TCP Maximum Segment Size and Related 1295 Topics", RFC 879, ISI, November 1983. [Stevens97] Stevens, W., 1296 "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast 1297 Recovery Algorithms", RFC 2001, NOAO, January 1997. 1299 [Watson81] Watson, R., "Timer-based Mechanisms in Reliable 1300 Transport Protocol Connection Management", Computer Networks, Vol. 1301 5, 1981. 1303 [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. 1304 SIGCOMM '86, Stowe, Vt., August 1986. 1306 APPENDIX A: IMPLEMENTATION SUGGESTIONS 1308 TCP Option Layout 1310 The following layouts are recommended for sending options on 1311 non-SYN segments, to achieve maximum feasible alignment of 1312 32-bit and 64-bit machines. 1314 +--------+--------+--------+--------+ 1315 | NOP | NOP | TSopt | 10 | 1316 +--------+--------+--------+--------+ 1317 | TSval timestamp | 1318 +--------+--------+--------+--------+ 1319 | TSecr timestamp | 1320 +--------+--------+--------+--------+ 1322 Interaction with the TCP Urgent Pointer 1324 The TCP Urgent pointer, like the TCP window, is a 16 bit value. 
      Some of the original discussion for the TCP Window Scale option
      included proposals to increase the Urgent pointer to 32 bits.
      As it turns out, this is unnecessary. There are two
      observations that should be made:

      (1) With IP Version 4, the largest amount of TCP data that can
          be sent in a single packet is 65495 bytes (64K - 1 - size
          of fixed IP and TCP headers).

      (2) Updates to the urgent pointer while the user is in "urgent
          mode" are invisible to the user.

      This means that if the Urgent Pointer points beyond the end of
      the TCP data in the current packet, then the user will remain
      in urgent mode until the next TCP packet arrives. That packet
      will update the urgent pointer to a new offset, and the user
      will never have left urgent mode.

      Thus, to properly implement the Urgent Pointer, the sending TCP
      only has to check for overflow of the 16 bit Urgent Pointer
      field before filling it in. If it does overflow, then a value
      of 65535 should be inserted into the Urgent Pointer.

   TCP Options and MSS

      There has been some confusion as to what value should be filled
      in the TCP MSS option when using TCP options. RFC-879
      [Postel83] stated:

           The MSS counts only data octets in the segment, it does
           not count the TCP header or the IP header.

      which is unclear about what to do about TCP options. RFC-1122
      [Braden89] attempted to clarify this in section 4.2.2.6, but
      there still seems to be confusion.

      So, the MSS value to be sent in an MSS option should be equal
      to the effective MTU minus the fixed IP and TCP headers. Since
      both IP and TCP options are ignored when calculating the value
      for the MSS option, if there are any IP or TCP options to be
      sent in a packet, then the sender must decrease the size of the
      TCP data accordingly.
      The reason for this can be seen in the following table:

                       +--------------------+--------------------+
                       | MSS is adjusted    | MSS isn't adjusted |
                       | to include options | to include options |
      +----------------+--------------------+--------------------+
      | Sender adjusts | Packets are too    | Packets are the    |
      | length for     | short              | correct length     |
      | options        |                    |                    |
      +----------------+--------------------+--------------------+
      | Sender doesn't | Packets are the    | Packets are too    |
      | adjust length  | correct length     | long.              |
      | for options    |                    |                    |
      +----------------+--------------------+--------------------+

      Since the goal is to not send IP datagrams that have to be
      fragmented, and packets sent with the constraints in the lower
      right of this grid will cause IP fragmentation, the only way to
      guarantee that this doesn't happen is for the data sender to
      decrease the TCP data length by the size of the IP and TCP
      options. And since the sender will be adjusting the TCP data
      length when sending IP and TCP options, there is no need to
      include the IP and TCP option lengths in the MSS value.

APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS

   There are two cases to be considered: (1) a system crashing (and
   losing connection state) and restarting, and (2) the same
   connection being closed and reopened without a loss of host state.
   These will be described in the following two sections.

   B.1 System Crash with Loss of State

      TCP's quiet time of one MSL upon system startup handles the
      loss of connection state in a system crash/restart. For an
      explanation, see for example "When to Keep Quiet" in the TCP
      protocol specification [Postel81]. The MSL that is required
      here does not depend upon the transfer speed.
      The current TCP MSL of 2 minutes seems acceptable as an
      operational compromise, as many host systems take this long to
      boot after a crash.

      However, the timestamp option may be used to ease the MSL
      requirements (or to provide additional security against data
      corruption). If timestamps are being used and if the timestamp
      clock can be guaranteed to be monotonic over a system
      crash/restart, i.e., if the first value of the sender's
      timestamp clock after a crash/restart can be guaranteed to be
      greater than the last value before the restart, then a quiet
      time will be unnecessary.

      To dispense totally with the quiet time would require that the
      host clock be synchronized to a time source that is stable over
      the crash/restart period, with an accuracy of one timestamp
      clock tick or better. We can back off from this strict
      requirement to take advantage of approximate clock
      synchronization. Suppose that the clock is always
      re-synchronized to within N timestamp clock ticks and that
      booting (extended with a quiet time, if necessary) takes more
      than N ticks. This will guarantee monotonicity of the
      timestamps, which can then be used to reject old duplicates
      even without an enforced MSL.

   B.2 Closing and Reopening a Connection

      When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT
      state ties up the socket pair for 4 minutes (see Section 3.5 of
      [Postel81]). Applications built upon TCP that close one
      connection and open a new one (e.g., an FTP data transfer
      connection using Stream mode) must choose a new socket pair
      each time. The TIME-WAIT delay serves two different purposes:

      (a) Implement the full-duplex reliable close handshake of TCP.

          The proper time to delay the final close step is not really
          related to the MSL; it depends instead upon the RTO for the
          FIN segments and therefore upon the RTT of the path.
          (It could be argued that the side that is sending a FIN
          knows what degree of reliability it needs, and therefore it
          should be able to determine the length of the TIME-WAIT
          delay for the FIN's recipient. This could be accomplished
          with an appropriate TCP option in FIN segments.)

          Although there is no formal upper-bound on RTT, common
          network engineering practice makes an RTT greater than 1
          minute very unlikely. Thus, the 4 minute delay in TIME-WAIT
          state works satisfactorily to provide a reliable
          full-duplex TCP close. Note again that this is independent
          of MSL enforcement and network speed.

          The TIME-WAIT state could cause an indirect performance
          problem if an application needed to repeatedly close one
          connection and open another at a very high frequency, since
          the number of available TCP ports on a host is less than
          2**16. However, high network speeds are not the major
          contributor to this problem; the RTT is the limiting factor
          in how quickly connections can be opened and closed.
          Therefore, this problem will be no worse at high transfer
          speeds.

      (b) Allow old duplicate segments to expire.

          To replace this function of TIME-WAIT state, a mechanism
          would have to operate across connections. PAWS is defined
          strictly within a single connection; the last timestamp,
          TS.Recent, is kept in the connection control block, and
          discarded when a connection is closed.

          An additional mechanism could be added to the TCP, a
          per-host cache of the last timestamp received from any
          connection. This value could then be used in the PAWS
          mechanism to reject old duplicate segments from earlier
          incarnations of the connection, if the timestamp clock can
          be guaranteed to have ticked at least once since the old
          connection was open.
          This would require that the TIME-WAIT delay plus the RTT
          together must be at least one tick of the sender's
          timestamp clock. Such an extension is not part of the
          proposal of this RFC.

          Note that this is a variant on the mechanism proposed by
          Garlick, Rom, and Postel [Garlick77], which required each
          host to maintain connection records containing the highest
          sequence numbers on every connection. Using timestamps
          instead, it is only necessary to keep one quantity per
          remote host, regardless of the number of simultaneous
          connections to that host.

APPENDIX C: CHANGES FROM RFC-1072, RFC-1185, RFC-1323

   The protocol extensions defined in RFC-1323 differ in several
   important ways from those defined in RFC-1072 and RFC-1185.

   (a) SACK has been split off into a separate document, RFC 2018
       [Mathis96].

   (b) The detailed rules for sending timestamp replies (see Section
       3.4) differ in important ways. The earlier rules could result
       in an under-estimate of the RTT in certain cases (packets
       dropped or out of order).

   (c) The same value TS.Recent is now shared by the two distinct
       mechanisms RTTM and PAWS. This simplification became possible
       because of change (b).

   (d) An ambiguity in RFC-1185 was resolved in favor of putting
       timestamps on ACK as well as data segments. This supports the
       symmetry of the underlying TCP protocol.

   (e) The echo and echo reply options of RFC-1072 were combined into
       a single Timestamps option, to reflect the symmetry and to
       simplify processing.

   (f) The problem of outdated timestamps on long-idle connections,
       discussed in Section 4.2.2, was realized and resolved.

   (g) RFC-1185 recommended that header prediction take precedence
       over the timestamp check.
       Based upon some scepticism about the probabilistic arguments
       given in Section 4.2.4, it was decided to recommend that the
       timestamp check be performed first.

   (h) The spec was modified so that the extended options will be
       sent on segments only when they are received in the
       corresponding segments. This provides the most conservative
       possible conditions for interoperation with implementations
       without the extensions.

   In addition to these substantive changes, the present RFC attempts
   to specify the algorithms unambiguously by presenting
   modifications to the Event Processing rules of RFC-793; see
   Appendix F.

   There are additional changes in this document from RFC-1323. These
   changes are:

   (a) The description of which TSecr values can be used to update
       the measured RTT has been clarified. Specifically, with
       Timestamps, the Karn algorithm [Karn87] is disabled. The Karn
       algorithm disables all RTT measurements during retransmission,
       since it is ambiguous whether the ACK is for the original
       packet, or the retransmitted packet. With Timestamps, that
       ambiguity is removed since the TSecr in the ACK will contain
       the TSval from whichever data packet made it to the
       destination.

   (b) In RFC-1323, section 3.4, step (2) of the algorithm to control
       which timestamp is echoed was incorrect in two regards:

       (1) It failed to update TS.Recent for a retransmitted segment
           that resulted from a lost ACK.

       (2) It failed if SEG.LEN = 0.

       In the new algorithm, the case of SEG.TSval = TS.Recent is
       included for consistency with the PAWS test.

   (c) One correction was made to the Event Processing Summary in
       Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
       fill in the SEG.WND value, not SND.WND.

   (d) A new pseudo-code summary has been added in Appendix E.

   (e) Appendix A has been expanded with information about the TCP
       MSS option and the TCP Urgent Pointer.

   (f) It is now recommended that Timestamps options be included in
       RST packets if the incoming packet contained a Timestamps
       option.

   (g) RST packets are explicitly excluded from PAWS processing.

   (h) Snd.TSoffset and Snd.TSclock variables have been added.
       Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
       allows the starting points for timestamps to be randomized on
       a per-connection basis. Setting Snd.TSoffset to zero yields
       the same results as RFC-1323.

APPENDIX D: SUMMARY OF NOTATION

   The following notation has been used in this document.

   Options

       WSopt:       TCP Window Scale Option
       TSopt:       TCP Timestamps Option

   Option Fields

       shift.cnt:   Window scale byte in WSopt.
       TSval:       32-bit Timestamp Value field in TSopt.
       TSecr:       32-bit Timestamp Reply field in TSopt.

   Option Fields in Current Segment

       SEG.TSval:   TSval field from TSopt in current segment.
       SEG.TSecr:   TSecr field from TSopt in current segment.
       SEG.WSopt:   8-bit value in WSopt

   Clock Values

       my.TSclock:        System-wide local source of 32-bit
                          timestamp values
       my.TSclock.rate:   Period of my.TSclock (1 ms to 1 sec).
       Snd.TSoffset:      An offset for randomizing Snd.TSclock
       Snd.TSclock:       my.TSclock + Snd.TSoffset

   Per-Connection State Variables

       TS.Recent:         Latest received Timestamp
       Last.ACK.sent:     Last ACK field sent

       Snd.TS.OK:         1-bit flag
       Snd.WS.OK:         1-bit flag

       Rcv.Wind.Scale:    Receive window scale power
       Snd.Wind.Scale:    Send window scale power

       Start.Time:        Snd.TSclock value when segment being
                          timed was sent (used by pre-1323 code).

   Procedure

       Update_SRTT( m )   Procedure to update the smoothed RTT and
                          RTT variance estimates, using the rules of
                          [Jacobson88a], given m, a new RTT
                          measurement.
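
   As an illustration of what an Update_SRTT( m ) procedure might
   look like, the following Python sketch applies the commonly used
   form of the [Jacobson88a] estimator, with a gain of 1/8 on the
   mean and 1/4 on the deviation. The class name RttEstimator, its
   field names, the rto() helper, and the seeding of the first sample
   (srtt = m, rttvar = m/2) are assumptions made for this sketch; they
   are not defined by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RttEstimator:
    """Smoothed RTT state. Names are hypothetical, not from the draft."""
    srtt: Optional[float] = None   # smoothed RTT, in seconds
    rttvar: float = 0.0            # smoothed mean deviation, in seconds

    def update_srtt(self, m: float) -> None:
        """Fold one RTT measurement m (seconds) into the estimates."""
        if self.srtt is None:
            # Assumed seeding for the very first sample.
            self.srtt = m
            self.rttvar = m / 2
            return
        err = m - self.srtt
        self.srtt += err / 8                          # gain 1/8 on the mean
        self.rttvar += (abs(err) - self.rttvar) / 4   # gain 1/4 on deviation

    def rto(self) -> float:
        """Retransmission timeout derived from the current estimates."""
        if self.srtt is None:
            raise ValueError("no RTT measurement has been taken yet")
        return self.srtt + 4 * self.rttvar


est = RttEstimator()
est.update_srtt(1.0)   # first sample seeds the estimator
est.update_srtt(2.0)   # later samples are smoothed in
```

   With timestamps, every acceptable ACK can supply a value of m, so
   a procedure of this kind may be invoked far more often than once
   per window.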

APPENDIX E: PSEUDO-CODE SUMMARY

   Create new TCB => {
       Rcv.wind.scale =
           MIN( 14, MAX( 0, floor(log2(receive buffer space)) - 15 ) );
       Snd.wind.scale = 0;
       Last.ACK.sent = 0;
       Snd.TS.OK = Snd.WS.OK = FALSE;
       Snd.TSoffset = random 32 bit value
   }

   Send initial {SYN} segment => {

       SEG.WND = MIN( RCV.WND, 65535 );
       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
       Include in segment: WSopt = Rcv.wind.scale;
   }

   Send {SYN, ACK} segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
       if (Snd.WS.OK) then
           Include in segment: WSopt = Rcv.wind.scale;
   }

   Receive {SYN} or {SYN,ACK} segment => {

       if (Segment contains TSopt) then {
           TS.Recent = SEG.TSval;
           Snd.TS.OK = TRUE;
           if (is {SYN,ACK} segment) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate ) ;
       }

       if (Segment contains WSopt) then {
           Snd.wind.scale = SEG.WSopt;
           Snd.WS.OK = TRUE;
           if (the ACK bit is not set, and Rcv.wind.scale has not
               been initialized by the user) then
               Rcv.wind.scale = Snd.wind.scale;
       }
       else
           Rcv.wind.scale = Snd.wind.scale = 0;

   }

   Send non-SYN segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
   }

   Receive non-SYN segment in (state >= ESTABLISHED) => {

       Window = (SEG.WND << Snd.wind.scale);
           /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
            * in rest of processing.
            */

       if (Segment contains TSopt) then {
           if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
               if (Snd.TS.OK AND (NOT RST) ) then {
                   /* Timestamp too old =>
                    * segment is unacceptable.
                    */
                   Send ACK segment;
                   Discard segment and return;
               }
           }
           else {
               if (SEG.SEQ =< Last.ACK.sent) then
                   TS.Recent = SEG.TSval;
           }
       }

       if (SEG.ACK > SND.UNA) then {
           /* (At least part of) first segment in
            * retransmission queue has been ACKd
            */
           if (Segment contains TSopt) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
           else
               Update_SRTT(            /* for compatibility */
                   (Snd.TSclock - Start.Time)/my.TSclock.rate);
       }
   }

APPENDIX F: EVENT PROCESSING SUMMARY

   Event Processing

   OPEN Call

      ...
      An initial send sequence number (ISS) is selected. Send a SYN
      segment of the form:

      ...

   SEND Call

      CLOSED STATE (i.e., TCB does not exist)

         ...

      LISTEN STATE

         If the foreign socket is specified, then change the
         connection from passive to active, select an ISS. Send a SYN
         segment containing the options: and
         . Set SND.UNA to ISS, SND.NXT to ISS+1.
         Enter SYN-SENT state. ...

      SYN-SENT STATE
      SYN-RECEIVED STATE

         ...

      ESTABLISHED STATE
      CLOSE-WAIT STATE

         Segmentize the buffer and send it with a piggybacked
         acknowledgment (acknowledgment value = RCV.NXT). ...

         If the urgent flag is set ...

         If the Snd.TS.OK flag is set, then include the TCP
         Timestamps option in each data segment.

         Scale the receive window for transmission in the segment
         header:

              SEG.WND = (RCV.WND >> Rcv.Wind.Scale).

   SEGMENT ARRIVES

      ...

      If the state is LISTEN then

         first check for an RST

            ...

         second check for an ACK

            ...

         third check for a SYN

            if the SYN bit is set, check the security. If the ...

            ...

            If the SEG.PRC is less than the TCB.PRC then continue.

            Check for a Window Scale option (WSopt); if one is found,
            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag
            on.
            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
            zero and clear Snd.WS.OK flag.

            Check for a TSopt option; if one is found, save SEG.TSval
            in the variable TS.Recent and turn on the Snd.TS.OK bit.

            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
            other control or text should be queued for processing
            later. ISS should be selected and a SYN segment sent of
            the form:

            If the Snd.WS.OK bit is on, include a WSopt option
            in this segment. If the Snd.TS.OK bit is
            on, include a TSopt in this
            segment. Last.ACK.sent is set to RCV.NXT.

            SND.NXT is set to ISS+1 and SND.UNA to ISS. The
            connection state should be changed to SYN-RECEIVED. Note
            that any other incoming control or data (combined with
            SYN) will be processed in the SYN-RECEIVED state, but
            processing of SYN and ACK should not be repeated. If the
            listen was not fully specified (i.e., the foreign socket
            was not fully specified), then the unspecified fields
            should be filled in now.

         fourth other text or control

            ...

      If the state is SYN-SENT then

         first check the ACK bit

            ...

         fourth check the SYN bit

            ...

            If the SYN bit is on and the security/compartment and
            precedence are acceptable then, RCV.NXT is set to
            SEG.SEQ+1, IRS is set to SEG.SEQ, and any segments on the
            retransmission queue which are thereby acknowledged
            should be removed.

            Check for a Window Scale option (WSopt); if one is found,
            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

            Check for a TSopt option; if one is found, save SEG.TSval
            in variable TS.Recent and turn on the Snd.TS.OK bit in
            the connection control block. If the ACK bit is set, use
            Snd.TSclock - SEG.TSecr as the initial RTT estimate.
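
   The initial RTT estimate described above can be sketched in Python
   as follows. Because TSval and TSecr are 32-bit fields, the
   subtraction is done modulo 2**32 so that a timestamp clock that
   has wrapped still yields a small positive difference. The function
   names and the tick_seconds parameter (the period of the timestamp
   clock, in seconds) are assumptions made for this sketch, not names
   defined by this document.

```python
MOD32 = 1 << 32   # timestamps are 32-bit unsigned values


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Snd.TSclock = my.TSclock + Snd.TSoffset, reduced to 32 bits."""
    return (my_tsclock + snd_tsoffset) % MOD32


def rtt_sample(snd_tsclock_now: int, seg_tsecr: int,
               tick_seconds: float) -> float:
    """RTT implied by an echoed timestamp, in seconds.

    snd_tsclock_now is the current Snd.TSclock and seg_tsecr is the
    TSecr value carried by the arriving segment.
    """
    ticks = (snd_tsclock_now - seg_tsecr) % MOD32   # wrap-safe difference
    return ticks * tick_seconds


# A per-connection clock that wrapped between sending and the echo.
sent = snd_tsclock(0xFFFFFFF0, 0x0B)   # 0xFFFFFFFB
now = snd_tsclock(0xFFFFFFF0, 0x15)    # wraps around to 5
sample = rtt_sample(now, sent, 0.5)    # 10 ticks of 0.5 s each
```

   A value obtained this way would then be handed to Update_SRTT as
   the measurement m.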

            If SND.UNA > ISS (our SYN has been ACKed), change the
            connection state to ESTABLISHED, form an ACK segment:

            and send it. If the Snd.TS.OK bit is on, include a TSopt
            option in this ACK segment.
            Last.ACK.sent is set to RCV.NXT.

            Data or controls which were queued for transmission may
            be included. If there are other controls or text in the
            segment then continue processing at the sixth step below
            where the URG bit is checked, otherwise return.

            Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

            and send it. If the Snd.TS.OK bit is on, include a TSopt
            option in this segment. If
            the Snd.WS.OK bit is on, include a WSopt option
            in this segment. Last.ACK.sent is set to
            RCV.NXT.

            If there are other controls or text in the segment, queue
            them for processing after the ESTABLISHED state has been
            reached, return.

         fifth, if neither of the SYN or RST bits is set then drop
         the segment and return.

      Otherwise,

      First, check sequence number

         SYN-RECEIVED STATE
         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE
         CLOSE-WAIT STATE
         CLOSING STATE
         LAST-ACK STATE
         TIME-WAIT STATE

            Segments are processed in sequence. Initial tests on
            arrival are used to discard old duplicates, but further
            processing is done in SEG.SEQ order. If a segment's
            contents straddle the boundary between old and new, only
            the new parts should be processed.

            Rescale the received window field:

                 TrueWindow = SEG.WND << Snd.Wind.Scale,

            and use "TrueWindow" in place of SEG.WND in the following
            steps.

            Check whether the segment contains a Timestamps option
            and bit Snd.TS.OK is on. If so:

               If SEG.TSval < TS.Recent and the RST bit is off, then
               test whether connection has been idle less than 24
               days; if all are true, then the segment is not
               acceptable; follow steps below for an unacceptable
               segment.
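
   The timestamp test just described can be sketched in Python as
   follows. The comparison SEG.TSval < TS.Recent is taken here to be
   a comparison of 32-bit values in modular arithmetic, which is an
   assumption of this sketch, as are the function names and the
   idle_seconds parameter. The caller is assumed to have already
   established that the segment carries a Timestamps option.

```python
MOD32 = 1 << 32
HALF32 = 1 << 31
IDLE_LIMIT_SECONDS = 24 * 24 * 60 * 60   # 24 days


def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is older than b, allowing for wrap."""
    return a != b and ((b - a) % MOD32) < HALF32


def paws_rejects(seg_tsval: int, ts_recent: int, rst: bool,
                 snd_ts_ok: bool, idle_seconds: float) -> bool:
    """True if a segment carrying a TSopt is unacceptable under PAWS."""
    if not snd_ts_ok:
        return False      # timestamps were not negotiated
    if rst:
        return False      # RST segments are excluded from the test
    if idle_seconds >= IDLE_LIMIT_SECONDS:
        return False      # TS.Recent is too old to be trusted
    return ts_before(seg_tsval, ts_recent)


# An old duplicate on an active connection is rejected ...
old_dup = paws_rejects(5, 10, rst=False, snd_ts_ok=True, idle_seconds=1.0)
# ... but the same segment is let through after a very long idle period.
long_idle = paws_rejects(5, 10, rst=False, snd_ts_ok=True,
                         idle_seconds=25 * 24 * 60 * 60)
```

   When paws_rejects() returns True, the segment would be handled by
   the steps for an unacceptable segment: an ACK is sent and the
   segment is dropped.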

               If SEG.SEQ is equal to Last.ACK.sent, then save
               SEG.TSval in variable TS.Recent.

            There are four cases for the acceptability test for an
            incoming segment:

            ...

            If an incoming segment is not acceptable, an
            acknowledgment should be sent in reply (unless the RST
            bit is set, if so drop the segment and return):

            Last.ACK.sent is set to SEG.ACK of the acknowledgment. If
            the Snd.TS.OK bit is on, include the Timestamps option
            in this ACK segment. Set
            Last.ACK.sent to SEG.ACK and send the ACK segment. After
            sending the acknowledgment, drop the unacceptable segment
            and return.

      ...

      fifth check the ACK field.

         if the ACK bit is off drop the segment and return.

         if the ACK bit is on

            ...

            ESTABLISHED STATE

               If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <-
               SEG.ACK. Also compute a new estimate of round-trip
               time. If Snd.TS.OK bit is on, use Snd.TSclock -
               SEG.TSecr; otherwise use the elapsed time since the
               first segment in the retransmission queue was sent.
               Any segments on the retransmission queue which are
               thereby entirely acknowledged...

      ...

      Seventh, process the segment text.

         ESTABLISHED STATE
         FIN-WAIT-1 STATE
         FIN-WAIT-2 STATE

            ...

            Send an acknowledgment of the form:

            If the Snd.TS.OK bit is on, include Timestamps option
            in this ACK segment. Set
            Last.ACK.sent to SEG.ACK of the acknowledgment, and send
            it. This acknowledgment should be piggy-backed on a
            segment being transmitted if possible without incurring
            undue delay.

      ...

APPENDIX G: Timestamps Edge Cases

      While the rules laid out for when to calculate RTTM produce
      the correct results most of the time, there are some edge
      cases where an incorrect RTTM can be calculated. All of these
      situations involve the loss of packets.
      It is felt that these scenarios are rare, and that if they
      should happen, they will cause a single RTTM measurement to be
      inflated, which mitigates its effects on RTO calculations.

      [Martin03] cites two similar cases when the returning ACK is
      lost, and before the retransmission timer fires, another
      returning packet arrives, which ACKs the data. In this case,
      the RTTM calculated will be inflated:

         clock
         tc=1 ------------------->

         tc=2 (lost) <----
              (RTTM would have been 1)

              (receive window opens, window update is sent)
         tc=5 <----
              (RTTM is calculated at 4)

      One thing to note about this situation is that it is somewhat
      bounded by RTO + RTT, limiting how far off the RTTM calculation
      will be. While more complex scenarios can be constructed that
      produce larger inflations (e.g., retransmissions are lost),
      those scenarios involve multiple packet losses, and the
      connection will have other more serious operational problems
      than using an inflated RTTM in the RTO calculation.

   -------------

Security Considerations

   Security issues are not discussed in this memo.

Authors' Addresses

   David Borman
   Wind River Systems
   Mendota Heights, MN 55120

   Phone: (651) 454-3052
   Email: david.borman@cray.com

   Bob Braden
   University of Southern California
   Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292

   Phone: (310) 448-9173
   EMail: Braden@ISI.EDU

   Van Jacobson
   Packet Design
   2465 Latham Street
   Mountain View, CA 94040

   EMail: van@packetdesign.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.