idnits 2.17.1 draft-ietf-tcpm-1323bis-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** The document seems to lack a License Notice according IETF Trust Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009 Section 6.b -- however, there's a paragraph with a matching beginning. Boilerplate error? (You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Feb 2009 rather than one of the newer Notices. See https://trustee.ietf.org/license-info/.) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 278 has weird spacing: '...its/sec byt...' == Line 1397 has weird spacing: '... TSval times...' == Line 1399 has weird spacing: '... TSecr times...' -- The document seems to contain a disclaimer for pre-RFC5378 work, and may have content which was first submitted before 10 November 2008. The disclaimer is necessary when there are original authors that you have been unable to contact, or if some do not wish to grant the BCP78 rights to the IETF Trust. If you are able to get all authors (current and original) to grant those rights, you can and should remove the disclaimer; otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (March 4, 2009) is 5532 days in the past. 
Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '1' on line 308 ** Obsolete normative reference: RFC 793 (ref. 'Postel81') (Obsoleted by RFC 9293) -- Obsolete informational reference (is this intentional?): RFC 2581 (ref. 'Allman99') (Obsoleted by RFC 5681) -- Obsolete informational reference (is this intentional?): RFC 3517 (ref. 'Blanton03') (Obsoleted by RFC 6675) -- Obsolete informational reference (is this intentional?): RFC 1072 (ref. 'Jacobson88b') (Obsoleted by RFC 1323, RFC 2018, RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 1185 (ref. 'Jacobson90b') (Obsoleted by RFC 1323) -- Obsolete informational reference (is this intentional?): RFC 1323 (ref. 'Jacobson92d') (Obsoleted by RFC 7323) -- Duplicate reference: RFC1323, mentioned in 'Martin03', was also mentioned in 'Jacobson92d'. -- Obsolete informational reference (is this intentional?): RFC 1323 (ref. 'Martin03') (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 1110 (ref. 'McKenzie89') (Obsoleted by RFC 6247) -- Obsolete informational reference (is this intentional?): RFC 896 (ref. 'Nagle84') (Obsoleted by RFC 7805) Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 13 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Network Working Group 3 Internet-Draft D. Borman 4 Obsoletes: 1323 Wind River Systems 5 Intended Status: Standards Track R. 
Braden 6 File: draft-ietf-tcpm-1323bis-01.txt ISI 7 V. Jacobson 8 Packet Design 9 March 4, 2009 11 TCP Extensions for High Performance 13 Status of This Memo 15 This Internet-Draft is submitted to IETF in full conformance with the 16 provisions of BCP 78 and BCP 79. 18 This document may contain material from IETF Documents or IETF 19 Contributions published or made publicly available before November 20 10, 2008. The person(s) controlling the copyright in some of this 21 material may not have granted the IETF Trust the right to allow 22 modifications of such material outside the IETF Standards Process. 23 Without obtaining an adequate license from the person(s) controlling 24 the copyright in such materials, this document may not be modified 25 outside the IETF Standards Process, and derivative works of it may 26 not be created outside the IETF Standards Process, except to format 27 it for publication as an RFC or to translate it into languages other 28 than English. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF), its areas, and its working groups. Note that 32 other groups may also distribute working documents as Internet- 33 Drafts. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 The list of current Internet-Drafts can be accessed at 41 http://www.ietf.org/ietf/1id-abstracts.txt. 43 The list of Internet-Draft Shadow Directories can be accessed at 44 http://www.ietf.org/shadow.html. 46 This Internet-Draft will expire on September 4, 2009. 48 Copyright 50 Copyright (c) 2009 IETF Trust and the persons identified as the 51 document authors. All rights reserved. 
53 This document is subject to BCP 78 and the IETF Trust's Legal 54 Provisions Relating to IETF Documents in effect on the date of 55 publication of this document (http://trustee.ietf.org/license-info). 56 Please review these documents carefully, as they describe your rights 57 and restrictions with respect to this document. 59 Abstract 61 This memo presents a set of TCP extensions to improve performance 62 over large bandwidth*delay product paths and to provide reliable 63 operation over very high-speed paths. It defines TCP options for 64 scaled windows and timestamps, which are designed to provide 65 compatible interworking with TCP's that do not implement the 66 extensions. The timestamps are used for two distinct mechanisms: 67 RTTM (Round Trip Time Measurement) and PAWS (Protection Against 68 Wrapped Sequences). Selective acknowledgments are not included in 69 this memo. 71 This memo updates and obsoletes RFC 1323. 73 TABLE OF CONTENTS 75 1. Introduction 2 76 2. TCP Window Scale Option 9 77 3. RTTM -- Round-Trip Time Measurement 12 78 4. PAWS -- Protection Against Wrapped Sequence Numbers 18 79 5. Conclusions and Acknowledgments 26 80 6. Security Considerations 27 81 7. IANA Considerations 27 82 8. References 27 83 APPENDIX A: Implementation Suggestions 30 84 APPENDIX B: Duplicates from Earlier Connection Incarnations 31 85 APPENDIX C: Changes from RFC 1072, RFC 1185, RFC 1323 34 86 APPENDIX D: Summary of Notation 36 87 APPENDIX E: Pseudo-code Summary 37 88 APPENDIX F: Event Processing 40 89 APPENDIX G: Timestamps Edge Cases 46 90 Authors' Addresses 47 92 1. INTRODUCTION 94 The TCP protocol [Postel81] was designed to operate reliably over 95 almost any transmission medium regardless of transmission rate, 96 delay, corruption, duplication, or reordering of segments. 97 Production TCP implementations currently adapt to transfer rates in 98 the range of 100 bps to 10**10 bps and round-trip delays in the range 99 1 ms to 100 seconds. 
Work on TCP performance has shown that TCP 100 without the extensions described in this memo can work well over a 101 variety of Internet paths, ranging from 800 Mbit/sec I/O channels to 102 300 bit/sec dial-up modems [Jacobson88a].

104 Over the years, advances in networking technology have resulted in 105 ever-higher transmission speeds, and the fastest paths are well 106 beyond the domain for which TCP was originally engineered. This memo 107 defines a set of modest extensions to TCP to extend the domain of its 108 application to match this increasing network capability. It is an 109 update to and obsoletes RFC 1323 [Jacobson92d], which in turn is 110 based upon and obsoletes RFC 1072 [Jacobson88b] and RFC 1185 111 [Jacobson90b].

113 There is no one-line answer to the question: "How fast can TCP go?". 114 There are two separate kinds of issues, performance and reliability, 115 and each depends upon different parameters. We discuss each in turn.

117 1.1 TCP Performance

119 TCP performance depends not upon the transfer rate itself, but 120 rather upon the product of the transfer rate and the round-trip 121 delay. This "bandwidth*delay product" measures the amount of data 122 that would "fill the pipe"; it is the buffer space required at 123 sender and receiver to obtain maximum throughput on the TCP 124 connection over the path, i.e., the amount of unacknowledged data 125 that TCP must handle in order to keep the pipeline full. TCP 126 performance problems arise when the bandwidth*delay product is 127 large. We refer to an Internet path operating in this region as a 128 "long, fat pipe", and a network containing this path as an "LFN" 129 (pronounced "elephan(t)").

131 High-capacity packet satellite channels are LFN's. For example, a 132 DS1-speed satellite channel has a bandwidth*delay product of 10**6 133 bits or more; this corresponds to 100 outstanding TCP segments of 134 1200 bytes each. 
Terrestrial fiber-optical paths will also fall 135 into the LFN class; for example, a cross-country delay of 30 ms at 136 a DS3 bandwidth (45Mbps) also exceeds 10**6 bits.

138 There are three fundamental performance problems with the current 139 TCP over LFN paths:

141 (1) Window Size Limit

143 The TCP header uses a 16-bit field to report the receive 144 window size to the sender. Therefore, the largest window 145 that can be used is 2**16 = 65K bytes.

147 To circumvent this problem, Section 2 of this memo defines a 148 new TCP option, "Window Scale", to allow windows larger than 149 2**16. This option defines an implicit scale factor, which 150 is used to multiply the window size value found in a TCP 151 header to obtain the true window size.

153 (2) Recovery from Losses

155 Packet losses in an LFN can have a catastrophic effect on 156 throughput. In the past, properly-operating TCP 157 implementations would cause the data pipeline to drain with 158 every packet loss, and require a slow-start action to 159 recover. The Fast Retransmit and Fast Recovery algorithms 160 [Jacobson90c] [Allman99] were introduced, and their combined 161 effect was to recover from one packet loss per window, 162 without draining the pipeline. However, more than one packet 163 loss per window typically resulted in a retransmission 164 timeout and the resulting pipeline drain and slow start.

166 Expanding the window size to match the capacity of an LFN 167 results in a corresponding increase of the probability of 168 more than one packet per window being dropped. This could 169 have a devastating effect upon the throughput of TCP over an 170 LFN. In addition, since the publication of RFC 1323, 171 congestion control mechanisms based upon some form of random 172 dropping have been introduced into gateways, and randomly 173 spaced packet drops have become common; this increases the 174 probability of dropping more than one packet per window. 
176 To generalize the Fast Retransmit/Fast Recovery mechanism to 177 handle multiple packets dropped per window, selective 178 acknowledgments are required. Unlike the normal cumulative 179 acknowledgments of TCP, selective acknowledgments give the 180 sender a complete picture of which segments are queued at the 181 receiver and which have not yet arrived.

183 Since the publication of RFC 1323, selective acknowledgments 184 have become important in the LFN regime. RFC 1072 defined a 185 new TCP "SACK" option to send a selective acknowledgment, but 186 at the time that RFC 1323 was published, important technical 187 issues still had to be worked out concerning both the format 188 and semantics of the SACK option, so it was split off from 189 RFC 1323. SACK has now been published as a separate 190 document, RFC 2018 [Mathis96]. Additional information about 191 SACK can be found in RFC 2883, "An Extension to the Selective 192 Acknowledgement (SACK) option for TCP" [Floyd00] and RFC 193 3517, "A Conservative Selective Acknowledgment (SACK)-based 194 Loss Recovery Algorithm for TCP" [Blanton03].

196 (3) Round-Trip Measurement

198 TCP implements reliable data delivery by retransmitting 199 segments that are not acknowledged within some retransmission 200 timeout (RTO) interval. Accurate dynamic determination of an 201 appropriate RTO is essential to TCP performance. RTO is 202 determined by estimating the mean and variance of the 203 measured round-trip time (RTT), i.e., the time interval 204 between sending a segment and receiving an acknowledgment for 205 it [Jacobson88a].

207 Section 3 introduces a new TCP option, "Timestamps", and then 208 defines a mechanism using this option that allows nearly 209 every segment, including retransmissions, to be timed at 210 negligible computational cost. We use the mnemonic RTTM 211 (Round Trip Time Measurement) for this mechanism, to 212 distinguish it from other uses of the Timestamps option. 
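As a rough illustration of the bandwidth*delay arithmetic in Section 1.1, the following Python sketch computes the product for the DS3 example (45 Mbps, 30 ms) and compares it with the largest window that the unscaled 16-bit Window field can express. The function name and constants are illustrative only, and treating the 30 ms figure as the round-trip delay is an assumption.

```python
# Sketch only: names are illustrative and not taken from the memo.
MAX_UNSCALED_WINDOW = 2**16 - 1  # bytes; largest value of the 16-bit Window field


def bdp_bits(bandwidth_bps, rtt_ms):
    """Bandwidth*delay product in bits, using integer arithmetic."""
    return bandwidth_bps * rtt_ms // 1000


# DS3 example from the text: 45 Mbps with an assumed 30 ms round-trip delay.
ds3_bits = bdp_bits(45_000_000, 30)
ds3_bytes = ds3_bits // 8

# The pipe holds more data than an unscaled window can cover.
needs_window_scaling = ds3_bytes > MAX_UNSCALED_WINDOW
```

Under these assumptions the pipe holds 1,350,000 bits (168,750 bytes), which is more than twice what a 16-bit window can keep in flight.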
214 1.2 TCP Reliability

216 Now we turn from performance to reliability. High transfer rate 217 enters TCP performance through the bandwidth*delay product. 218 However, high transfer rate alone can threaten TCP reliability by 219 violating the assumptions behind the TCP mechanism for duplicate 220 detection and sequencing.

222 An especially serious kind of error may result from an accidental 223 reuse of TCP sequence numbers in data segments. Suppose that an 224 "old duplicate segment", e.g., a duplicate data segment that was 225 delayed in Internet queues, is delivered to the receiver at the 226 wrong moment, so that its sequence numbers fall somewhere within 227 the current window. There would be no checksum failure to warn of 228 the error, and the result could be an undetected corruption of the 229 data. Reception of an old duplicate ACK segment at the 230 transmitter could be only slightly less serious: it is likely to 231 lock up the connection so that no further progress can be made, 232 forcing an RST on the connection.

234 TCP reliability depends upon the existence of a bound on the 235 lifetime of a segment: the "Maximum Segment Lifetime" or MSL. An 236 MSL is generally required by any reliable transport protocol, 237 since every sequence number field must be finite, and therefore 238 any sequence number may eventually be reused. In the Internet 239 protocol suite, the MSL bound is enforced by an IP-layer 240 mechanism, the "Time-to-Live" or TTL field.

242 Duplication of sequence numbers might happen in either of two 243 ways:

245 (1) Sequence number wrap-around on the current connection

247 A TCP sequence number contains 32 bits. At a high enough 248 transfer rate, the 32-bit sequence space may be "wrapped" 249 (cycled) within the time that a segment is delayed in queues. 
251 (2) Earlier incarnation of the connection

253 Suppose that a connection terminates, either by a proper 254 close sequence or due to a host crash, and the same 255 connection (i.e., using the same pair of sockets) is 256 immediately reopened. A delayed segment from the terminated 257 connection could fall within the current window for the new 258 incarnation and be accepted as valid.

260 Duplicates from earlier incarnations, Case (2), are avoided by 261 enforcing the current fixed MSL of the TCP spec, as explained in 262 Section 5.3 and Appendix B. However, case (1), avoiding the 263 reuse of sequence numbers within the same connection, requires an 264 MSL bound that depends upon the transfer rate, and at high enough 265 rates, a new mechanism is required.

267 More specifically, if the maximum effective bandwidth at which TCP 268 is able to transmit over a particular path is B bytes per second, 269 then the following constraint must be satisfied for error-free 270 operation:

272        2**31 / B  > MSL (secs)                                [1]

274 The following table shows the value for Twrap = 2**31/B in 275 seconds, for some important values of the bandwidth B:

277      Network       B*8          B          Twrap
278                  bits/sec    bytes/sec      secs
279      _______     _______     ______        ______

281      Dialup      56kbps      7KBps         3*10**5 (~3.6 days)

283      DS1         1.5Mbps     190KBps       10**4 (~3 hours)

285      10mbit
286      Ethernet    10Mbps      1.25MBps      1700 (~30 mins)

288      DS3         45Mbps      5.6MBps       380

290      100mbit
291      Ethernet    100Mbps     12.5MBps      170

293      Gigabit
294      Ethernet    1Gbps       125MBps       17

296      10GigE      10Gbps      1.25GBps      1.7

298 It is clear that wrap-around of the sequence space is not a 299 problem for 56kbps packet switching or even 10Mbps Ethernets. On 300 the other hand, at DS3 and 100mbit speeds, Twrap is comparable to 301 the 2 minute MSL assumed by the TCP specification [Postel81]. 302 Moving towards and beyond gigabit speeds, Twrap becomes too small 303 for reliable enforcement by the Internet TTL mechanism. 
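The Twrap column and constraint [1] can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic: the function names are hypothetical, and the 120-second default is the MSL that the text attributes to the TCP specification.

```python
# Sketch only: function names are hypothetical.
def twrap_seconds(bytes_per_sec):
    """Time for a sender at B bytes/sec to consume 2**31 sequence numbers."""
    return 2**31 / bytes_per_sec


def satisfies_constraint_1(bytes_per_sec, msl_seconds=120):
    """Constraint [1]: 2**31 / B > MSL."""
    return twrap_seconds(bytes_per_sec) > msl_seconds


# Rates from the table, expressed in bytes per second.
ethernet_10mbit = 1_250_000
gigabit = 125_000_000
```

With these inputs, 10 Mbps Ethernet gives a Twrap of roughly 1700 seconds and satisfies [1], while Gigabit Ethernet gives roughly 17 seconds and does not.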
305 The 16-bit window field of TCP limits the effective bandwidth B to 306 2**16/RTT, where RTT is the round-trip time in seconds 307 [McKenzie89]. If the RTT is large enough, this limits B to a 308 value that meets the constraint [1] for a large MSL value. For 309 example, consider a transcontinental backbone with an RTT of 60ms 310 (set by the laws of physics). With the bandwidth*delay product 311 limited to 64KB by the TCP window size, B is then limited to 312 1.1MBps, no matter how high the theoretical transfer rate of the 313 path. This corresponds to cycling the sequence number space in 314 Twrap= 2000 secs, which is safe in today's Internet. 316 It is important to understand that the culprit is not the larger 317 window but rather the high bandwidth. For example, consider a 318 (very large) FDDI LAN with a diameter of 10km. Using the speed of 319 light, we can compute the RTT across the ring as 320 (2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth 321 product is then 833 bytes. A TCP connection across this LAN using 322 a window of only 833 bytes will run at the full 100mbps and can 323 wrap the sequence space in about 3 minutes, very close to the MSL 324 of TCP. Thus, high speed alone can cause a reliability problem 325 with sequence number wrap-around, even without extended windows. 327 Watson's Delta-T protocol [Watson81] includes network-layer 328 mechanisms for precise enforcement of an MSL. In contrast, the IP 329 mechanism for MSL enforcement is loosely defined and even more 330 loosely implemented in the Internet. Therefore, it is unwise to 331 depend upon active enforcement of MSL for TCP connections, and it 332 is unrealistic to imagine setting MSL's smaller than the current 333 values (e.g., 120 seconds specified for TCP). 335 A possible fix for the problem of cycling the sequence space would 336 be to increase the size of the TCP sequence number field. 
For 337 example, the sequence number field (and also the acknowledgment 338 field) could be expanded to 64 bits. This could be done either by 339 changing the TCP header or by means of an additional option.

341 Section 4 presents a different mechanism, which we call PAWS 342 (Protect Against Wrapped Sequence numbers), to extend TCP 343 reliability to transfer rates well beyond the foreseeable upper 344 limit of network bandwidths. PAWS uses the TCP Timestamps option 345 defined in Section 3 to protect against old duplicates from the 346 same connection.

348 1.3 Using TCP options

350 The extensions defined in this memo all use new TCP options. We 351 must address two possible issues concerning the use of TCP 352 options: (1) compatibility and (2) overhead.

354 We must pay careful attention to compatibility, i.e., to 355 interoperation with existing implementations. The only TCP option 356 defined previously, MSS, may appear only on a SYN segment. Every 357 implementation should (and we expect that most will) ignore 358 unknown options on SYN segments. When RFC 1323 was published, 359 there was concern that some buggy TCP implementation might be 360 crashed by the first appearance of an option on a non-SYN segment. 361 However, bugs like that can lead to denial-of-service (DoS) attacks against a TCP, so 362 it is now expected that most TCP implementations will properly 363 handle unknown options on non-SYN segments. But it is still 364 prudent to be conservative in what you send, and avoiding buggy 365 TCP implementations is not the only reason for negotiating TCP 366 options on SYN segments. Therefore, for each of the extensions 367 defined below, TCP options will be sent on non-SYN segments only 368 after an exchange of options on the SYN segments has indicated 369 that both sides understand the extension. Furthermore, an 370 extension option will be sent in a <SYN,ACK> segment only if the 371 corresponding option was received in the initial <SYN> segment. 
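The sending rule in the preceding paragraph can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function name, the segment labels, and the two boolean flags are all hypothetical, and the SYN,ACK case follows the reading that an option is echoed there only when the initial SYN carried it.

```python
# Sketch only: names and segment labels are hypothetical.
def may_send_extension(segment, sent_on_syn, received_on_syn):
    """Decide whether an extension option may be placed in a segment.

    segment         -- "SYN" (initial), "SYN,ACK", or "OTHER" (any non-SYN)
    sent_on_syn     -- this side put the option in its own SYN or SYN,ACK
    received_on_syn -- the peer's SYN or SYN,ACK carried the option
    """
    if segment == "SYN":
        # The initial SYN carries the offer itself.
        return True
    if segment == "SYN,ACK":
        # Answer only if the peer offered the option first.
        return received_on_syn
    # Non-SYN segments: both sides must have exchanged the option.
    return sent_on_syn and received_on_syn
```

For example, a host whose peer never offered the option gets `False` for every segment after its own initial SYN.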
373 A question may be raised about the bandwidth and processing 374 overhead for TCP options. Those options that occur on SYN 375 segments are not likely to cause a performance concern. Opening a 376 TCP connection requires execution of significant special-case 377 code, and the processing of options is unlikely to increase that 378 cost significantly. 380 On the other hand, a Timestamps option may appear in any data or 381 ACK segment, adding 12 bytes to the 20-byte TCP header. We 382 believe that the bandwidth saved by reducing unnecessary 383 retransmissions will more than pay for the extra header bandwidth. 385 There is also an issue about the processing overhead for parsing 386 the variable byte-aligned format of options, particularly with a 387 RISC-architecture CPU. Appendix A contains a recommended layout 388 of the options in TCP headers to achieve reasonable data field 389 alignment. In the spirit of Header Prediction, a TCP can quickly 390 test for this layout and if it is verified then use a fast path. 391 Hosts that use this canonical layout will effectively use the 392 options as a set of fixed-format fields appended to the TCP 393 header. However, to retain the philosophical and protocol 394 framework of TCP options, a TCP must be prepared to parse an 395 arbitrary options field, albeit with less efficiency. 397 Finally, we observe that most of the mechanisms defined in this 398 memo are important for LFN's and/or very high-speed networks. For 399 low-speed networks, it might be a performance optimization to NOT 400 use these mechanisms. A TCP vendor concerned about optimal 401 performance over low-speed paths might consider turning these 402 extensions off for low-speed paths, or allow a user or 403 installation manager to disable them. 405 2. 
TCP WINDOW SCALE OPTION 407 2.1 Introduction 409 The window scale extension expands the definition of the TCP 410 window to 32 bits and then uses a scale factor to carry this 411 32-bit value in the 16-bit Window field of the TCP header (SEG.WND 412 in RFC 793). The scale factor is carried in a new TCP option, 413 Window Scale. This option is sent only in a SYN segment (a 414 segment with the SYN bit on), hence the window scale is fixed in 415 each direction when a connection is opened. (Another design 416 choice would be to specify the window scale in every TCP segment. 417 It would be incorrect to send a window scale option only when the 418 scale factor changed, since a TCP option in an acknowledgement 419 segment will not be delivered reliably (unless the ACK happens to 420 be piggy-backed on data in the other direction). Fixing the scale 421 when the connection is opened has the advantage of lower overhead 422 but the disadvantage that the scale factor cannot be changed 423 during the connection.) 425 The maximum receive window, and therefore the scale factor, is 426 determined by the maximum receive buffer space. In a typical 427 modern implementation, this maximum buffer space is set by default 428 but can be overridden by a user program before a TCP connection is 429 opened. This determines the scale factor, and therefore no new 430 user interface is needed for window scaling. 432 2.2 Window Scale Option 434 The three-byte Window Scale option may be sent in a SYN segment by 435 a TCP. It has two purposes: (1) indicate that the TCP is prepared 436 to do both send and receive window scaling, and (2) communicate a 437 scale factor to be applied to its receive window. Thus, a TCP 438 that is prepared to scale windows should send the option, even if 439 its own scale factor is 1. The scale factor is limited to a power 440 of two and encoded logarithmically, so it may be implemented by 441 binary shift operations. 
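To illustrate how the maximum receive buffer can determine the scale factor, the following Python sketch picks the smallest shift count whose scaled 16-bit field still covers a given buffer size. The function name is hypothetical; the upper bound of 14 is the limit this memo gives in Section 2.3.

```python
# Sketch only: choose_shift_cnt is a hypothetical helper.
MAX_SHIFT = 14         # upper bound on shift.cnt given in Section 2.3
MAX_FIELD = 2**16 - 1  # largest value of the 16-bit Window field


def choose_shift_cnt(rcv_buffer_bytes):
    """Smallest shift such that (MAX_FIELD << shift) covers the buffer."""
    shift = 0
    while shift < MAX_SHIFT and (MAX_FIELD << shift) < rcv_buffer_bytes:
        shift += 1
    return shift


def scale_factor(shift_cnt):
    """The scale factor is a power of two, encoded as its logarithm."""
    return 1 << shift_cnt
```

A buffer that already fits in 16 bits yields a shift count of 0, that is, a scale factor of 1, which is the case where the option is still worth sending as an offer.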
443 TCP Window Scale Option (WSopt):

445 Kind: 3

447 Length: 3 bytes

449        +---------+---------+---------+
450        | Kind=3  |Length=3 |shift.cnt|
451        +---------+---------+---------+

453 This option is an offer, not a promise; both sides must send 454 Window Scale options in their SYN segments to enable window 455 scaling in either direction. If window scaling is enabled, 456 then the TCP that sent this option will right-shift its true 457 receive-window values by 'shift.cnt' bits for transmission in 458 SEG.WND. The value 'shift.cnt' may be zero (offering to scale, 459 while applying a scale factor of 1 to the receive window).

461 This option may be sent in an initial <SYN> segment (i.e., a 462 segment with the SYN bit on and the ACK bit off). It may also 463 be sent in a <SYN,ACK> segment, but only if a Window Scale 464 option was received in the initial <SYN> segment. A Window 465 Scale option in a segment without a SYN bit should be ignored.

467 The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment 468 itself is never scaled.

470 2.3 Using the Window Scale Option

472 A model implementation of window scaling is as follows, using the 473 notation of RFC 793 [Postel81]:

475 * All windows are treated as 32-bit quantities for storage in 476 the connection control block and for local calculations. 477 This includes the send-window (SND.WND) and the receive- 478 window (RCV.WND) values, as well as the congestion window.

480 * The connection state is augmented by two window shift counts, 481 Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the 482 incoming and outgoing window fields, respectively.

484 * If a TCP receives a <SYN> segment containing a Window Scale 485 option, it sends its own Window Scale option in the 486 <SYN,ACK> segment.

488 * The Window Scale option is sent with shift.cnt = R, where R 489 is the value that the TCP would like to use for its receive 490 window. 
492 * Upon receiving a SYN segment with a Window Scale option 493 containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and 494 sets Rcv.Wind.Scale to R; otherwise, it sets both 495 Snd.Wind.Scale and Rcv.Wind.Scale to zero.

497 * The window field (SEG.WND) in the header of every incoming 498 segment, with the exception of SYN segments, is left-shifted 499 by Snd.Wind.Scale bits before updating SND.WND:

501        SND.WND = SEG.WND << Snd.Wind.Scale

503 (assuming the other conditions of RFC 793 are met, and using 504 the "C" notation "<<" for left-shift).

506 * The window field (SEG.WND) of every outgoing segment, with 507 the exception of SYN segments, is right-shifted by 508 Rcv.Wind.Scale bits:

510        SEG.WND = RCV.WND >> Rcv.Wind.Scale.

512 TCP determines if a data segment is "old" or "new" by testing 513 whether its sequence number is within 2**31 bytes of the left edge 514 of the window, and if it is not, discarding the data as "old". To 515 ensure that new data is never mistakenly considered old and vice- 516 versa, the left edge of the sender's window has to be at most 517 2**31 away from the right edge of the receiver's window. 518 Similarly with the sender's right edge and receiver's left edge. 519 Since the right and left edges of either the sender's or 520 receiver's window differ by the window size, and since the sender 521 and receiver windows can be out of phase by at most the window 522 size, the above constraints imply that 2 * the max window size 523 must be less than 2**31, or

524        max window < 2**30

526 Since the max window is 2**S (where S is the scaling shift count) 527 times at most 2**16 - 1 (the maximum unscaled window), the maximum 528 window is guaranteed to be < 2**30 if S <= 14. Thus, the shift 529 count must be limited to 14 (which allows windows of 2**30 = 1 530 Gbyte). If a Window Scale option is received with a shift.cnt 531 value exceeding 14, the TCP should log the error but use 14 532 instead of the specified value. 
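The shift operations and the limit of 14 described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: all function names are hypothetical, and clamping the outgoing field to 16 bits is a defensive choice made here that the text does not prescribe.

```python
# Sketch only: all names are hypothetical.
MAX_SHIFT = 14
MAX_FIELD = 0xFFFF  # the Window field is 16 bits wide


def wsopt_bytes(shift_cnt):
    """Three-byte Window Scale option: Kind=3, Length=3, shift.cnt."""
    return bytes([3, 3, shift_cnt])


def accept_shift_cnt(received):
    """A received shift.cnt above 14 is used as 14."""
    return min(received, MAX_SHIFT)


def snd_wnd_from_segment(seg_wnd, snd_wind_scale):
    """Incoming non-SYN segment: SND.WND = SEG.WND << Snd.Wind.Scale."""
    return seg_wnd << snd_wind_scale


def seg_wnd_for_send(rcv_wnd, rcv_wind_scale):
    """Outgoing non-SYN segment: SEG.WND = RCV.WND >> Rcv.Wind.Scale."""
    # Assumption: clamp so the value always fits the 16-bit field.
    return min(rcv_wnd >> rcv_wind_scale, MAX_FIELD)
```

Note that the right shift discards the low-order bits: with a shift count of 5, a receive window of 1,000,001 bytes is advertised as 31250 and recovered by the peer as 1,000,000 bytes.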
534 The scale factor applies only to the Window field as transmitted 535 in the TCP header; each TCP using extended windows will maintain 536 the window values locally as 32-bit numbers. For example, the 537 "congestion window" computed by Slow Start and Congestion 538 Avoidance is not affected by the scale factor, so window scaling 539 will not introduce quantization into the congestion window.

541 When a non-zero scale factor is in use, there are instances when a 542 retracted window can be offered [Mathis08]. The end of the window 543 will be on a boundary based on the granularity of the scale factor 544 being used. If the sequence number is then updated by a number of 545 bytes smaller than that granularity, the TCP will have to either 546 advertise a new window that is beyond what it previously advertised 547 (and perhaps beyond the buffer), or will have to advertise a 548 smaller window, which will cause the TCP window to shrink. 549 Implementations should ensure that they handle a shrinking window, 550 as specified in section 4.2.2.16 of RFC 1122 [Braden89].

552 3. RTTM: ROUND-TRIP TIME MEASUREMENT

554 3.1 Introduction

556 Accurate and current RTT estimates are necessary to adapt to 557 changing traffic conditions and to avoid an instability known as 558 "congestion collapse" [Nagle84] in a busy network. However, 559 accurate measurement of RTT may be difficult both in theory and in 560 implementation.

562 Many TCP implementations base their RTT measurements upon a sample 563 of one packet per window or less. While this yields an adequate 564 approximation to the RTT for small windows, it results in an 565 unacceptably poor RTT estimate for an LFN. If we look at RTT 566 estimation as a signal processing problem (which it is), a data 567 signal at some frequency, the packet rate, is being sampled at a 568 lower frequency, the window rate. 
This lower sampling frequency 569 violates Nyquist's criterion and may therefore introduce "aliasing" 570 artifacts into the estimated RTT [Hamming77].

572 A good RTT estimator with a conservative retransmission timeout 573 calculation can tolerate aliasing when the sampling frequency is 574 "close" to the data frequency. For example, with a window of 8 575 packets, the sample rate is 1/8 the data frequency -- less than an 576 order of magnitude different. However, when the window is tens or 577 hundreds of packets, the RTT estimator may be seriously in error, 578 resulting in spurious retransmissions.

580 If there are dropped packets, the problem becomes worse. Zhang 581 [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is 582 not possible to accumulate reliable RTT estimates if retransmitted 583 segments are included in the estimate. Since a full window of 584 data will have been transmitted prior to a retransmission, all of 585 the segments in that window will have to be ACKed before the next 586 RTT sample can be taken. This means at least an additional 587 window's worth of time between RTT measurements and, as the error 588 rate approaches one per window of data (e.g., 10**-6 errors per 589 bit for the Wideband satellite network), it becomes effectively 590 impossible to obtain a valid RTT measurement.

592 A solution to these problems, which actually simplifies the sender 593 substantially, is as follows: using TCP options, the sender places 594 a timestamp in each data segment, and the receiver reflects these 595 timestamps back in ACK segments. Then a single subtract gives the 596 sender an accurate RTT measurement for every ACK segment (which 597 will correspond to every other data segment, with a sensible 598 receiver). We call this the RTTM (Round-Trip Time Measurement) 599 mechanism.

601 It is vitally important to use the RTTM mechanism with big 602 windows; otherwise, the door is opened to some dangerous 603 instabilities due to aliasing. 
Furthermore, the option is 604 probably useful for all TCP's, since it simplifies the sender. 606 3.2 TCP Timestamps Option 608 TCP is a symmetric protocol, allowing data to be sent at any time 609 in either direction, and therefore timestamp echoing may occur in 610 either direction. For simplicity and symmetry, we specify that 611 timestamps always be sent and echoed in both directions. For 612 efficiency, we combine the timestamp and timestamp reply fields 613 into a single TCP Timestamps Option. 615 TCP Timestamps Option (TSopt): 617 Kind: 8 619 Length: 10 bytes
621 +-------+-------+---------------------+---------------------+
622 |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
623 +-------+-------+---------------------+---------------------+
624     1       1              4                     4
626 The Timestamps option carries two four-byte timestamp fields. 627 The Timestamp Value field (TSval) contains the current value of 628 the timestamp clock of the TCP sending the option. 630 The Timestamp Echo Reply field (TSecr) is valid if the ACK bit 631 is set in the TCP header; if it is valid, it echoes a timestamp 632 value that was sent by the remote TCP in the TSval field of a 633 Timestamps option. When TSecr is not valid, its value must be 634 zero. The TSecr value will generally be from the most recent 635 Timestamps option that was received; however, there are 636 exceptions that are explained below. 638 A TCP may send the Timestamps option (TSopt) in an initial 639 <SYN> segment (i.e., a segment containing a SYN bit and no ACK 640 bit), and may send a TSopt in other segments only if it 641 received a TSopt in the initial <SYN> or <SYN,ACK> segment for 642 the connection. Once a TSopt has been sent or received in a 643 non-<SYN> segment, it must be sent in all segments. Once a 644 TSopt has been received in a non-<SYN> segment, then any 645 successive segment that is received without the RST bit and 646 without a TSopt may be dropped without further processing, and an 647 ACK of the current SND.UNA generated.
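The ten-byte option layout above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names encode_tsopt and decode_tsopt are hypothetical, and the fields are packed in network (big-endian) byte order, as is usual for TCP header fields.

```python
import struct

TSOPT_KIND = 8    # Kind value from the layout above
TSOPT_LEN = 10    # total option length in bytes


def encode_tsopt(tsval: int, tsecr: int) -> bytes:
    """Pack Kind, Length, TSval and TSecr (hypothetical helper)."""
    # "!" selects network byte order with no padding: 1 + 1 + 4 + 4 bytes.
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LEN,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def decode_tsopt(option: bytes):
    """Return (TSval, TSecr), or raise ValueError for a malformed option."""
    if len(option) != TSOPT_LEN:
        raise ValueError("TSopt must be exactly 10 bytes")
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TSOPT_KIND or length != TSOPT_LEN:
        raise ValueError("not a Timestamps option")
    return tsval, tsecr
```

For example, decode_tsopt(encode_tsopt(65, 131)) returns the pair (65, 131). A segment with the ACK bit clear would carry a TSecr of zero, as required above.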
649 In the case of crossing SYN packets where one SYN contains a 650 TSopt and the other doesn't, both sides should put a TSopt in 651 the <SYN,ACK> segment. 653 3.3 The RTTM Mechanism 655 RTTM places a Timestamps option in every segment, with a TSval 656 that is obtained from a (virtual) "timestamp clock". Values of 657 this clock must be at least approximately proportional to 658 real time, in order to measure actual RTT. 660 These TSval values are echoed in TSecr values in the reverse 661 direction. The difference between a received TSecr value and the 662 current timestamp clock value provides an RTT measurement. 664 When timestamps are used, every segment that is received will 665 contain a TSecr value; however, these values cannot all be used to 666 update the measured RTT. The following example illustrates why. 667 It shows a one-way data flow with segments arriving in sequence 668 without loss. Here A, B, C... represent data blocks occupying 669 successive blocks of sequence numbers, and ACK(A),... represent 670 the corresponding cumulative acknowledgments. The two timestamp 671 fields of the Timestamps option are shown symbolically as 672 <TSval=x,TSecr=y>. Each TSecr field contains the value most recently 673 received in a TSval field.
675 TCP A                                     TCP B
677                <A,TSval=1,TSecr=120> ------>
679     <---- <ACK(A),TSval=127,TSecr=1>
681                <B,TSval=5,TSecr=127> ------>
683     <---- <ACK(B),TSval=131,TSecr=5>
685     . . . . . . . . . . . . . . . . . . . . . .
687                <C,TSval=65,TSecr=131> ------>
689     <---- <ACK(C),TSval=191,TSecr=65>
691     (etc)
693 The dotted line marks a pause (60 time units long) in which A had 694 nothing to send. Note that this pause inflates the RTT which B 695 could infer from receiving TSecr=131 in data segment C. Thus, in 696 one-way data flows, RTTM in the reverse direction measures a value 697 that is inflated by gaps in sending data.
However, the following 698 rule prevents a resulting inflation of the measured RTT: 700 RTTM Rule: A TSecr value received in a segment is used to 701 update the averaged RTT measurement only if the segment 702 acknowledges some new data, i.e., only if it advances the 703 left edge of the send window. 705 Since TCP B is not sending data, the data segment C does not 706 acknowledge any new data when it arrives at B. Thus, the inflated 707 RTTM measurement is not used to update B's RTTM measurement. 709 Implementors should note that with Timestamps multiple RTTMs can 710 be taken per RTT. Many RTO estimators have a weighting factor 711 based on an implicit assumption that at most one RTTM will be 712 obtained per RTT. When using multiple RTTMs per RTT to update the 713 RTO estimator, the weighting factor needs to be decreased to take 714 into account the more frequent RTTMs. For example, an 715 implementation could choose to just use one sample per RTT to 716 update the RTO estimator, or vary the gain based on the 717 congestion window, or take an average of all the RTTM measurements 718 received over one RTT, and then use that value to update the RTO 719 estimator. This document does not prescribe any particular method 720 for modifying the RTO estimator; the important point is that the 721 implementation should do something more than just feeding 722 additional RTTM samples from one RTT into the RTO estimator. 724 3.4 Which Timestamp to Echo 725 If more than one Timestamps option is received before a reply 726 segment is sent, the TCP must choose only one of the TSvals to 727 echo, ignoring the others. To minimize the state kept in the 728 receiver (i.e., the number of unprocessed TSvals), the receiver 729 should be required to retain at most one timestamp in the 730 connection control block. 732 There are three situations to consider: 734 (A) Delayed ACKs.
736 Many TCP's acknowledge only every Kth segment out of a group 737 of segments arriving within a short time interval; this 738 policy is known generally as "delayed ACKs". The data-sender 739 TCP must measure the effective RTT, including the additional 740 time due to delayed ACKs, or else it will retransmit 741 unnecessarily. Thus, when delayed ACKs are in use, the 742 receiver should reply with the TSval field from the earliest 743 unacknowledged segment. 745 (B) A hole in the sequence space (segment(s) have been lost). 747 The sender will continue sending until the window is filled, 748 and the receiver may be generating ACKs as these out-of-order 749 segments arrive (e.g., to aid "fast retransmit"). 751 The lost segment is probably a sign of congestion, and in 752 that situation the sender should be conservative about 753 retransmission. Furthermore, it is better to overestimate 754 than underestimate the RTT. An ACK for an out-of-order 755 segment should therefore contain the timestamp from the most 756 recent segment that advanced the window. 758 The same situation occurs if segments are re-ordered by the 759 network. 761 (C) A filled hole in the sequence space. 763 The segment that fills the hole represents the most recent 764 measurement of the network characteristics. On the other 765 hand, an RTT computed from an earlier segment would probably 766 include the sender's retransmit time-out, badly biasing the 767 sender's average RTT estimate. Thus, the timestamp from the 768 latest segment (which filled the hole) must be echoed. 770 An algorithm that covers all three cases is described in the 771 following rules for Timestamps option processing on a synchronized 772 connection: 774 (1) The connection state is augmented with two 32-bit slots: 776 TS.Recent holds a timestamp to be echoed in TSecr whenever a 777 segment is sent, and Last.ACK.sent holds the ACK field from 778 the last segment sent. 
Last.ACK.sent will equal RCV.NXT 779 except when ACKs have been delayed. 781 (2) If: 783 SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent 785 then SEG.TSval is copied to TS.Recent; otherwise, it is 786 ignored. 788 (3) When a TSopt is sent, its TSecr field is set to the current 789 TS.Recent value. 791 The following examples illustrate these rules. Here A, B, C... 792 represent data segments occupying successive blocks of sequence 793 numbers, and ACK(A),... represent the corresponding 794 acknowledgment segments. Note that ACK(A) has the same sequence 795 number as B. We show only one direction of timestamp echoing, for 796 clarity. 798 o Packets arrive in sequence, and some of the ACKs are delayed. 800 By Case (A), the timestamp from the oldest unacknowledged 801 segment is echoed.
803                                         TS.Recent
804         <A, TSval=1> ------------------->
805                                             1
806         <B, TSval=2> ------------------->
807                                             1
808         <C, TSval=3> ------------------->
809                                             1
810                  <---- <ACK(C), TSecr=1>
811         (etc)
813 o Packets arrive out of order, and every packet is 814 acknowledged. 816 By Case (B), the timestamp from the last segment that 817 advanced the left window edge is echoed until the missing 818 segment arrives; the timestamp of that segment is then echoed according to Case (C). The 819 same sequence would occur if segments B and D were lost and 820 retransmitted.
822                                         TS.Recent
823         <A, TSval=1> ------------------->
824                                             1
825                  <---- <ACK(A), TSecr=1>
826                                             1
827         <C, TSval=3> ------------------->
828                                             1
829                  <---- <ACK(A), TSecr=1>
830                                             1
831         <B, TSval=2> ------------------->
832                                             2
833                  <---- <ACK(C), TSecr=2>
834                                             2
835         <E, TSval=5> ------------------->
836                                             2
837                  <---- <ACK(C), TSecr=2>
838                                             2
839         <D, TSval=4> ------------------->
840                                             4
841                  <---- <ACK(E), TSecr=4>
842         (etc)
844 4. PAWS: PROTECTION AGAINST WRAPPED SEQUENCE NUMBERS 846 4.1 Introduction 848 Section 4.2 describes a simple mechanism to reject old duplicate 849 segments that might corrupt an open TCP connection; we call this 850 mechanism PAWS (Protection Against Wrapped Sequence numbers). 851 PAWS operates within a single TCP connection, using state that is 852 saved in the connection control block.
Section 4.3 and Appendix C 853 discuss the implications of the PAWS mechanism for avoiding old 854 duplicates from previous incarnations of the same connection. 856 4.2 The PAWS Mechanism 858 PAWS uses the same TCP Timestamps option as the RTTM mechanism 859 described earlier, and assumes that every received TCP segment 860 (including data and ACK segments) contains a timestamp SEG.TSval 861 whose values are monotonically non-decreasing in time. The basic 862 idea is that a segment can be discarded as an old duplicate if it 863 is received with a timestamp SEG.TSval less than some timestamp 864 recently received on this connection. 866 In both the PAWS and the RTTM mechanism, the "timestamps" are 867 32-bit unsigned integers in a modular 32-bit space. Thus, "less 868 than" is defined the same way it is for TCP sequence numbers, and 869 the same implementation techniques apply. If s and t are 870 timestamp values, s < t if 0 < (t - s) < 2**31, computed in 871 unsigned 32-bit arithmetic. 873 The choice of incoming timestamps to be saved for this comparison 874 must guarantee a value that is monotonically increasing. For 875 example, we might save the timestamp from the segment that last 876 advanced the left edge of the receive window, i.e., the most 877 recent in-sequence segment. Instead, we choose the value 878 TS.Recent introduced in Section 3.4 for the RTTM mechanism, since 879 using a common value for both PAWS and RTTM simplifies the 880 implementation of both. As Section 3.4 explained, TS.Recent 881 differs from the timestamp from the last in-sequence segment only 882 in the case of delayed ACKs, and therefore by less than one 883 window. Either choice will therefore protect against sequence 884 number wrap-around. 886 RTTM was specified in a symmetrical manner, so that TSval 887 timestamps are carried in both data and ACK segments and are 888 echoed in TSecr fields carried in returning ACK or data segments. 
889 PAWS submits all incoming segments to the same test, and therefore 890 protects against duplicate ACK segments as well as data segments. 891 (An alternative non-symmetric algorithm would protect against old 892 duplicate ACKs: the sender of data would reject incoming ACK 893 segments whose TSecr values were less than the TSecr saved from 894 the last segment whose ACK field advanced the left edge of the 895 send window. This algorithm was deemed to lack economy of 896 mechanism and symmetry.) 898 TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to 899 initialize PAWS. PAWS protects against old duplicate non-SYN 900 segments, and duplicate SYN segments received while there is a 901 synchronized connection. Duplicate {SYN} and {SYN,ACK} segments 902 received when there is no connection will be discarded by the 903 normal 3-way handshake and sequence number checks of TCP. 905 RFC 1323 recommended that RST segments NOT carry timestamps, and 906 that they be acceptable regardless of their timestamp. At that 907 time, the thinking was that old duplicate RST segments should be 908 exceedingly unlikely, and their cleanup function should take 909 precedence over timestamps. More recently, discussions about 910 various blind attacks on TCP connections have raised the 911 suggestion that if the Timestamps option is present, SEG.TSecr 912 could be used to provide stricter acceptance tests for RST 913 packets. While this is still under discussion, to enable research into 914 this area it is now recommended that, 915 if the packet causing a RST to be generated contained a 916 Timestamps option, the RST also contain a Timestamps option. 917 In the RST segment, SEG.TSecr should be set to SEG.TSval from the 918 incoming packet and SEG.TSval should be set to zero. If a RST is 919 being generated because of a user abort, and Snd.TS.OK is set, 920 then a Timestamps option should be included in the RST.
When a 921 RST packet is received, it must not be subjected to PAWS checks, 922 and information from the Timestamps option must not be used to 923 update connection state information. SEG.TSecr may be used to 924 provide stricter RST acceptance checks. 926 4.2.1 Basic PAWS Algorithm 928 The PAWS algorithm requires the following processing to be 929 performed on all incoming segments for a synchronized 930 connection: 932 R1) If there is a Timestamps option in the arriving segment, 933 SEG.TSval < TS.Recent, TS.Recent is valid (see later 934 discussion) and the RST bit is not set, then treat the 935 arriving segment as not acceptable: 937 Send an acknowledgement in reply as specified in RFC 938 793 page 69 and drop the segment. 940 Note: it is necessary to send an ACK segment in order 941 to retain TCP's mechanisms for detecting and 942 recovering from half-open connections. For example, 943 see Figure 10 of RFC 793. 945 R2) If the segment is outside the window, reject it (normal 946 TCP processing). 948 R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent 949 (see Section 3.4), then record its timestamp in TS.Recent. 951 R4) If an arriving segment is in-sequence (i.e., at the left 952 window edge), then accept it normally. 954 R5) Otherwise, treat the segment as a normal in-window, 955 out-of-sequence TCP segment (e.g., queue it for later delivery 956 to the user). 958 Steps R2, R4, and R5 are the normal TCP processing steps 959 specified by RFC 793. 961 It is important to note that the timestamp is checked only when 962 a segment first arrives at the receiver, regardless of whether 963 it is in-sequence or it must be queued for later delivery. 965 Consider the following example. 967 Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has 968 been sent, where the letter indicates the sequence number 969 and the digit represents the timestamp. Suppose also that 970 segment B.1 has been lost.
The timestamp in TS.Recent is 971 1 (from A.1), so C.1, ..., Z.1 are considered acceptable 972 and are queued. When B is retransmitted as segment B.2 973 (using the latest timestamp), it fills the hole and causes 974 all the segments through Z to be acknowledged and passed 975 to the user. The timestamps of the queued segments are 976 *not* inspected again at this time, since they have 977 already been accepted. When B.2 is accepted, TS.Recent is 978 set to 2. 980 This rule allows reasonable performance under loss. A full 981 window of data is in transit at all times, and after a loss a 982 full window less one packet will show up out-of-sequence to be 983 queued at the receiver (e.g., up to ~2**30 bytes of data); the 984 timestamp option must not result in discarding this data. 986 In certain unlikely circumstances, the algorithm of rules R1-R5 987 could lead to discarding some segments unnecessarily, as shown 988 in the following example: 990 Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have 991 been sent in sequence and that segment B.1 has been lost. 992 Furthermore, suppose delivery of some of C.1, ... Z.1 is 993 delayed until AFTER the retransmission B.2 arrives at the 994 receiver. These delayed segments will be discarded 995 unnecessarily when they do arrive, since their timestamps 996 are now out of date. 998 This case is very unlikely to occur. If the retransmission was 999 triggered by a timeout, some of the segments C.1, ... Z.1 must 1000 have been delayed longer than the RTO time. This is presumably 1001 an unlikely event, or there would be many spurious timeouts and 1002 retransmissions. If B's retransmission was triggered by the 1003 "fast retransmit" algorithm, i.e., by duplicate ACKs, then the 1004 queued segments that caused these ACKs must have been received 1005 already.
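To make the two examples above concrete, rules R1 and R3 and the modular "less than" of Section 4.2 might be sketched in Python as follows. This is a minimal illustration, not part of the specification: the names PawsState, passes_r1, apply_r3, ts_lt and seq_leq are hypothetical, and the window checks of R2, R4 and R5 are deliberately left out.

```python
MOD32 = 1 << 32
HALF32 = 1 << 31


def ts_lt(s: int, t: int) -> bool:
    """Modular 'less than': s < t if 0 < (t - s) < 2**31 in 32-bit arithmetic."""
    return 0 < ((t - s) % MOD32) < HALF32


def seq_leq(a: int, b: int) -> bool:
    """a <= b in the 32-bit sequence space (same modular technique)."""
    return ((b - a) % MOD32) < HALF32


class PawsState:
    """Hypothetical per-connection state: TS.Recent and Last.ACK.sent."""

    def __init__(self, ts_recent: int, last_ack_sent: int) -> None:
        self.ts_recent = ts_recent
        self.ts_recent_valid = True   # would be cleared by an idle check
        self.last_ack_sent = last_ack_sent

    def passes_r1(self, seg_tsval: int, rst: bool = False) -> bool:
        """R1: a False result means 'send an ACK and drop the segment'."""
        if rst or not self.ts_recent_valid:
            return True
        return not ts_lt(seg_tsval, self.ts_recent)

    def apply_r3(self, seg_seq: int, seg_tsval: int) -> None:
        """R3: record the timestamp if SEG.SEQ <= Last.ACK.sent."""
        if seq_leq(seg_seq, self.last_ack_sent):
            self.ts_recent = seg_tsval
            self.ts_recent_valid = True
```

With TS.Recent = 1 and Last.ACK.sent pointing at B, a queued segment C.1 passes R1 and leaves TS.Recent unchanged. The retransmission B.2 passes R1 and sets TS.Recent to 2, after which a late segment stamped 1 fails R1, which is the unnecessary discard described above.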
1007 Even if a segment were delayed past the RTO, the Fast 1008 Retransmit mechanism [Jacobson90c] will cause the delayed 1009 packets to be retransmitted at the same time as B.2, avoiding 1010 an extra RTT and therefore causing a very small performance 1011 penalty. 1013 We know of no case with a significant probability of occurrence 1014 in which timestamps will cause performance degradation by 1015 unnecessarily discarding segments. 1017 4.2.2 Timestamp Clock 1019 It is important to understand that the PAWS algorithm does not 1020 require clock synchronization between sender and receiver. The 1021 sender's timestamp clock is used to stamp the segments, and the 1022 sender uses the echoed timestamp to measure RTT's. However, 1023 the receiver treats the timestamp as simply a monotonically 1024 increasing serial number, without any necessary connection to 1025 its clock. From the receiver's viewpoint, the timestamp is 1026 acting as a logical extension of the high-order bits of the 1027 sequence number. 1029 The receiver algorithm does place some requirements on the 1030 frequency of the timestamp clock. 1032 (a) The timestamp clock must not be "too slow". 1034 It must tick at least once for each 2**31 bytes sent. In 1035 fact, in order to be useful to the sender for round trip 1036 timing, the clock should tick at least once per window's 1037 worth of data, and even with the window extension defined 1038 in Section 2.2, 2**31 bytes must be at least two windows. 1040 To make this more quantitative, any clock faster than 1 1041 tick/sec will reject old duplicate segments for link 1042 speeds of ~8 Gbps. A 1ms timestamp clock will work at 1043 link speeds up to 8 Tbps (8*10**12) bps! 1045 (b) The timestamp clock must not be "too fast". 1047 Its recycling time must be greater than MSL seconds. 1048 Since the clock (timestamp) is 32 bits and the worst-case 1049 MSL is 255 seconds, the maximum acceptable clock frequency 1050 is one tick every 59 ns. 
1052 However, it is desirable to establish a much longer 1053 recycle period, in order to handle outdated timestamps on 1054 idle connections (see Section 4.2.3), and to relax the MSL 1055 requirement for preventing sequence number wrap-around. 1056 With a 1 ms timestamp clock, the 32-bit timestamp will 1057 wrap its sign bit in 24.8 days. Thus, it will reject old 1058 duplicates on the same connection if MSL is 24.8 days or 1059 less. This appears to be a very safe figure; an MSL of 1060 24.8 days or longer can probably be assumed by the gateway 1061 system without requiring precise MSL enforcement by the 1062 TTL value in the IP layer. 1064 Based upon these considerations, we choose a timestamp clock 1065 frequency in the range 1 ms to 1 sec per tick. This range also 1066 matches the requirements of the RTTM mechanism, which does not 1067 need much more resolution than the granularity of the 1068 retransmit timer, e.g., tens or hundreds of milliseconds. 1070 The PAWS mechanism also puts a strong monotonicity requirement 1071 on the sender's timestamp clock. The method of implementation 1072 of the timestamp clock to meet this requirement depends upon 1073 the system hardware and software. 1075 * Some hosts have a hardware clock that is guaranteed to be 1076 monotonic between hardware resets. 1078 * A clock interrupt may be used to simply increment a binary 1079 integer by 1 periodically. 1081 * The timestamp clock may be derived from a system clock 1082 that is subject to being abruptly changed, by adding a 1083 variable offset value. This offset is initialized to 1084 zero. When a new timestamp clock value is needed, the 1085 offset can be adjusted as necessary to make the new value 1086 equal to or larger than the previous value (which was 1087 saved for this purpose). 
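The third approach above, a variable offset applied to a system clock that may be stepped, could look roughly like the following Python sketch. The class name TimestampClock and the read_system_ticks callable are assumptions made for illustration; a real implementation would depend on the host's clock interface. The helper sign_bit_wrap_days simply reproduces the 24.8-day figure for a 1 ms tick.

```python
class TimestampClock:
    """Monotonic 32-bit timestamp clock built on a possibly non-monotonic source."""

    def __init__(self, read_system_ticks) -> None:
        self._read = read_system_ticks   # hypothetical callable returning integer ticks
        self._offset = 0                 # the variable offset, initialized to zero
        self._previous = None            # previous value, saved for comparison

    def now(self) -> int:
        raw = self._read()
        value = raw + self._offset
        if self._previous is not None and value < self._previous:
            # The system clock stepped backwards: adjust the offset so that
            # the new value equals the previous one.
            self._offset = self._previous - raw
            value = self._previous
        self._previous = value
        return value & 0xFFFFFFFF        # TSval is a 32-bit field


def sign_bit_wrap_days(tick_seconds: float) -> float:
    """Days until a 32-bit timestamp wraps its sign bit (2**31 ticks)."""
    return (1 << 31) * tick_seconds / 86400.0
```

If the source reports 100, 105, 50 and 60 ticks in turn, now() yields 100, 105, 105 and 115, so the stamped values never decrease even though the underlying clock was set back.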
1089 4.2.3 Outdated Timestamps 1091 If a connection remains idle long enough for the timestamp 1092 clock of the other TCP to wrap its sign bit, then the value 1093 saved in TS.Recent will become too old; as a result, the PAWS 1094 mechanism will cause all subsequent segments to be rejected, 1095 freezing the connection (until the timestamp clock wraps its 1096 sign bit again). 1098 With the chosen range of timestamp clock frequencies (1 sec to 1099 1 ms), the time to wrap the sign bit will be between 24.8 days 1100 and 24800 days. A TCP connection that is idle for more than 24 1101 days and then comes to life is exceedingly unusual. However, 1102 it is undesirable in principle to place any limitation on TCP 1103 connection lifetimes. 1105 We therefore require that an implementation of PAWS include a 1106 mechanism to "invalidate" the TS.Recent value when a connection 1107 is idle for more than 24 days. (An alternative solution to the 1108 problem of outdated timestamps would be to send keep-alive 1109 segments at a very low rate, but still more often than the 1110 wrap-around time for timestamps, e.g., once a day. This would 1111 impose negligible overhead. However, the TCP specification has 1112 never included keep-alives, so the solution based upon 1113 invalidation was chosen.) 1114 Note that a TCP does not know the frequency, and therefore, the 1115 wraparound time, of the other TCP, so it must assume the worst. 1116 The validity of TS.Recent needs to be checked only if the basic 1117 PAWS timestamp check fails, i.e., only if SEG.TSval < 1118 TS.Recent. If TS.Recent is found to be invalid, then the 1119 segment is accepted, regardless of the failure of the timestamp 1120 check, and rule R3 updates TS.Recent with the TSval from the 1121 new segment. 1123 To detect how long the connection has been idle, the TCP may 1124 update a clock or timestamp value associated with the 1125 connection whenever TS.Recent is updated, for example. 
The 1126 details will be implementation-dependent. 1128 4.2.4 Header Prediction 1130 "Header prediction" [Jacobson90a] is a high-performance 1131 transport protocol implementation technique that is most 1132 important for high-speed links. This technique optimizes the 1133 code for the most common case, receiving a segment correctly 1134 and in order. Using header prediction, the receiver asks the 1135 question, "Is this segment the next in sequence?" This 1136 question can be answered in fewer machine instructions than the 1137 question, "Is this segment within the window?" 1139 Adding header prediction to our timestamp procedure leads to 1140 the following recommended sequence for processing an arriving 1141 TCP segment: 1143 H1) Check timestamp (same as step R1 above) 1145 H2) Do header prediction: if segment is next in sequence and 1146 if there are no special conditions requiring additional 1147 processing, accept the segment, record its timestamp, and 1148 skip H3. 1150 H3) Process the segment normally, as specified in RFC 793. 1151 This includes dropping segments that are outside the 1152 window and possibly sending acknowledgments, and queueing 1153 in-window, out-of-sequence segments. 1155 Another possibility would be to interchange steps H1 and H2, 1156 i.e., to perform the header prediction step H2 FIRST, and 1157 perform H1 and H3 only when header prediction fails. This 1158 could be a performance improvement, since the timestamp check 1159 in step H1 is very unlikely to fail, and it requires unsigned 1160 modulo arithmetic, a relatively expensive operation. To 1161 perform this check on every single segment is contrary to the 1162 philosophy of header prediction. We believe that this change 1163 might produce a measurable reduction in CPU time for TCP 1164 protocol processing on high-speed networks. 
1166 However, putting H2 first would create a hazard: a segment from 1167 2**32 bytes in the past might arrive at exactly the wrong time 1168 and be accepted mistakenly by the header-prediction step. The 1169 following reasoning has been introduced [Jacobson90b] to show 1170 that the probability of this failure is negligible. 1172 If all segments are equally likely to show up as old 1173 duplicates, then the probability of an old duplicate 1174 exactly matching the left window edge is the maximum 1175 segment size (MSS) divided by the size of the sequence 1176 space. This ratio must be less than 2**-16, since MSS 1177 must be < 2**16; for example, it will be (2**12)/(2**32) = 1178 2**-20 for an FDDI link. However, the older a segment is, 1179 the less likely it is to be retained in the Internet, and 1180 under any reasonable model of segment lifetime the 1181 probability of an old duplicate exactly at the left window 1182 edge must be much smaller than 2**-16. 1184 The 16 bit TCP checksum also allows a basic unreliability 1185 of one part in 2**16. A protocol mechanism whose 1186 reliability exceeds the reliability of the TCP checksum 1187 should be considered "good enough", i.e., it won't 1188 contribute significantly to the overall error rate. We 1189 therefore believe we can ignore the problem of an old 1190 duplicate being accepted by doing header prediction before 1191 checking the timestamp. 1193 However, this probabilistic argument is not universally 1194 accepted, and the consensus at present is that the performance 1195 gain does not justify the hazard in the general case. It is 1196 therefore recommended that H2 follow H1. 1198 4.2.5 IP Fragmentation 1200 At high data rates, the protection against old packets provided 1201 by PAWS can be circumvented by errors in IP fragment reassembly 1202 [Heffner07]. The only way to protect against incorrect IP 1203 fragment reassembly is to not allow the packets to be 1204 fragmented. 
This is done by setting the Don't Fragment (DF) 1205 bit in the IP header. Setting the DF bit implies the use of 1206 Path MTU Discovery as described in RFC 1191 [Mogul90]; thus, any 1207 TCP implementation that implements PAWS must also implement 1208 Path MTU Discovery. 1210 4.3. Duplicates from Earlier Incarnations of Connection 1211 The PAWS mechanism protects against errors due to sequence number 1212 wrap-around on high-speed connections. Segments from an earlier 1213 incarnation of the same connection are also a potential cause of 1214 old duplicate errors. In both cases, the TCP mechanisms to 1215 prevent such errors depend upon the enforcement of a maximum 1216 segment lifetime (MSL) by the Internet (IP) layer (see the Appendix of 1217 RFC 1185 for a detailed discussion). Unlike the case of sequence 1218 space wrap-around, the MSL required to prevent old duplicate 1219 errors from earlier incarnations does not depend upon the transfer 1220 rate. If the IP layer enforces the recommended 2 minute MSL of 1221 TCP, and if the TCP rules are followed, TCP connections will be 1222 safe from earlier incarnations, no matter how high the network 1223 speed. Thus, the PAWS mechanism is not required for this case. 1225 We may still ask whether the PAWS mechanism can provide additional 1226 security against old duplicates from earlier connections, allowing 1227 us to relax the enforcement of MSL by the IP layer. Appendix B 1228 explores this question, showing that further assumptions and/or 1229 mechanisms are required, beyond those of PAWS. This is not part 1230 of the current extension. 1232 5. CONCLUSIONS AND ACKNOWLEDGMENTS 1234 This memo presented a set of extensions to TCP to provide efficient 1235 operation over large-bandwidth*delay-product paths and reliable 1236 operation over very high-speed paths. These extensions are designed 1237 to provide compatible interworking with TCP's that do not implement 1238 the extensions.
1240 These mechanisms are implemented using new TCP options for scaled 1241 windows and timestamps. The timestamps are used for two distinct 1242 mechanisms: RTTM (Round-Trip Time Measurement) and PAWS (Protection 1243 Against Wrapped Sequence numbers). 1245 The Window Scale option was originally suggested by Mike St. Johns of 1246 USAF/DCA. The present form of the option was suggested by Mike 1247 Karels of UC Berkeley in response to a more cumbersome scheme defined 1248 by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism 1249 description in RFC 1185. 1251 Finally, much of this work originated as the result of discussions 1252 within the End-to-End Task Force on the theoretical limitations of 1253 transport protocols in general and TCP in particular. Task force 1254 members and others on the end2end-interest list have made valuable 1255 contributions by pointing out flaws in the algorithms and the 1256 documentation. Continued discussion and development since the 1257 publication of RFC 1323 originally occurred in the IETF TCP Large 1258 Windows Working Group, later on in the End-to-End Task Force, and 1259 most recently in the IETF TCP Maintenance Working Group. The authors 1260 are grateful for all these contributions. 1262 6. SECURITY CONSIDERATIONS 1264 The TCP sequence space is a fixed size, and as the window becomes 1265 larger it becomes easier for an attacker to generate forged packets 1266 that can fall within the TCP window and be accepted as valid 1267 packets. While use of Timestamps and PAWS can help to mitigate this, 1268 when using PAWS, if an attacker is able to forge a packet that is 1269 acceptable to the TCP connection, a timestamp that is in the future 1270 would cause valid packets to be dropped due to PAWS checks. Hence, 1271 implementors should take care to not open the TCP window drastically 1272 beyond the requirements of the connection.
1274 Middle boxes and options: If a middle box removes TCP options from the 1275 SYN, such as TSopt, a high speed connection that needs PAWS would not 1276 have that protection. In this situation, an implementor could 1277 provide a mechanism for the application to determine whether or not 1278 PAWS is in use on the connection, and choose to terminate the 1279 connection if that protection doesn't exist. 1281 Mechanisms to protect the TCP header from modification should also 1282 protect the TCP options. 1284 Expanding the TCP window beyond 64K for IPv6 allows Jumbograms 1285 [Borman99] to be used when the local network supports packets larger 1286 than 64K. When larger TCP packets are used, the TCP checksum becomes 1287 weaker. 1289 7. IANA CONSIDERATIONS 1291 This document has no actions for IANA. 1293 8. REFERENCES 1295 Normative References 1297 [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery", RFC 1298 1191, November 1990. 1300 [Postel81] Postel, J., "Transmission Control Protocol - DARPA 1301 Internet Program Protocol Specification", RFC 793, DARPA, 1302 September 1981. 1304 Informative References 1306 [Allman99] Allman, M., Paxson, V., Stevens, W., "TCP Congestion 1307 Control", RFC 2581, NASA Glenn/Sterling Software, ACIRI / ICSI, 1308 April 1999. 1310 [Borman99] Borman, D., Deering, S., and Hinden, R., "IPv6 1311 Jumbograms", RFC 2675, August 1999. 1313 [Braden89] Braden, R., editor, "Requirements for Internet Hosts -- 1314 Communication Layers", RFC 1122, October 1989. 1316 [Floyd00] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M., "An 1317 Extension to the Selective Acknowledgement (SACK) Option for TCP", 1318 RFC 2883, July 2000. 1320 [Blanton03] Blanton, E., Allman, M., Fall, K., Wang, L., "A 1321 Conservative Selective Acknowledgment (SACK)-based Loss Recovery 1322 Algorithm for TCP", RFC 3517, April 2003. 1324 [Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in 1325 Reliable Host-to-Host Protocols", Proc.
      Second Berkeley Workshop
      on Distributed Data Management and Computer Networks, May 1977.

      [Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4,
      Prentice Hall, Englewood Cliffs, N.J., 1977.

      [Heffner07] Heffner, J., Mathis, M., and Chandler, B., "IPv4
      Reassembly Errors at High Data Rates", RFC 4963, PSC, July 2007.

      [Jacobson88a] Jacobson, V., "Congestion Avoidance and Control",
      SIGCOMM '88, Stanford, CA., August 1988.

      [Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for
      Long-Delay Paths", RFC 1072, LBL and USC/Information Sciences
      Institute, October 1988.

      [Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM
      Computer Communication Review, April 1990.

      [Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP
      Extension for High-Speed Paths", RFC 1185, LBL and USC/Information
      Sciences Institute, October 1990.

      [Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance
      algorithm", Message to end2end-interest mailing list, April 1990.

      [Jacobson92d] Jacobson, V., Braden, R., and Borman, D., "TCP
      Extensions for High Performance", RFC 1323, LBL, USC/Information
      Sciences Institute and Cray Research, May 1992.

      [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
      Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
      Scottsdale, Arizona, March 1986.

      [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times
      in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
      August 1987.

      [Martin03] Martin, D., "[Tsvwg] RFC 1323.bis", Message to tsvwg
      mailing list, September 30, 2003.

      [Mathis96] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A.,
      "TCP Selective Acknowledgment Options", RFC 2018, October 1996.

      [Mathis08] Mathis, M., "[tcpm] Example of 1323 window retraction
      problem Per my comments at the microphone at TCPM...", Message to
      the tcpm mailing list, March 2008.

      [McKenzie89] McKenzie, A., "A Problem with the TCP Big Window
      Option", RFC 1110, BBN STC, August 1989.

      [Nagle84] Nagle, J., "Congestion Control in IP/TCP
      Internetworks", RFC 896, FACC, January 1984.

      [Watson81] Watson, R., "Timer-based Mechanisms in Reliable
      Transport Protocol Connection Management", Computer Networks, Vol.
      5, 1981.

      [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc.
      SIGCOMM '86, Stowe, Vt., August 1986.

APPENDIX A: IMPLEMENTATION SUGGESTIONS

   TCP Option Layout

      The following layout is recommended for sending options on
      non-SYN segments, to achieve maximum feasible alignment on
      32-bit and 64-bit machines.

                   +--------+--------+--------+--------+
                   |   NOP  |  NOP   |  TSopt |   10   |
                   +--------+--------+--------+--------+
                   |          TSval timestamp          |
                   +--------+--------+--------+--------+
                   |          TSecr timestamp          |
                   +--------+--------+--------+--------+

   Interaction with the TCP Urgent Pointer

      The TCP Urgent pointer, like the TCP window, is a 16-bit value.
      Some of the original discussion for the TCP Window Scale option
      included proposals to increase the Urgent pointer to 32 bits.
      As it turns out, this is unnecessary. There are two
      observations that should be made:

      (1) With IP Version 4, the largest amount of TCP data that can
          be sent in a single packet is 65495 bytes (64K - 1 - size
          of fixed IP and TCP headers).

      (2) Updates to the urgent pointer while the user is in "urgent
          mode" are invisible to the user.

      This means that if the Urgent Pointer points beyond the end of
      the TCP data in the current packet, then the user will remain in
      urgent mode until the next TCP packet arrives.
      That packet will
      update the urgent pointer to a new offset, and the user will
      never have left urgent mode.

      Thus, to properly implement the Urgent Pointer, the sending TCP
      only has to check for overflow of the 16-bit Urgent Pointer
      field before filling it in. If it does overflow, then a value
      of 65535 should be inserted into the Urgent Pointer.

      The same technique applies to IP Version 6, except in the case
      of IPv6 Jumbograms. When IPv6 Jumbograms are supported, RFC
      2675 [Borman99] requires additional steps for dealing with the
      Urgent Pointer; these are described in section 5.2 of RFC 2675.

APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS

   There are two cases to be considered: (1) a system crashing (and
   losing connection state) and restarting, and (2) the same connection
   being closed and reopened without a loss of host state. These will
   be described in the following two sections.

   B.1 System Crash with Loss of State

      TCP's quiet time of one MSL upon system startup handles the loss
      of connection state in a system crash/restart. For an
      explanation, see for example "When to Keep Quiet" in the TCP
      protocol specification [Postel81]. The MSL that is required here
      does not depend upon the transfer speed. The current TCP MSL of 2
      minutes seems acceptable as an operational compromise, as many
      host systems take this long to boot after a crash.

      However, the timestamp option may be used to ease the MSL
      requirements (or to provide additional security against data
      corruption). If timestamps are being used and if the timestamp
      clock can be guaranteed to be monotonic over a system
      crash/restart, i.e., if the first value of the sender's timestamp
      clock after a crash/restart can be guaranteed to be greater than
      the last value before the restart, then a quiet time will be
      unnecessary.
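
      "Greater than" in the condition above has to be evaluated on a
      32-bit clock that wraps. A minimal sketch of such a comparison
      follows, assuming the signed modulo-2**32 interpretation that
      PAWS uses for timestamps (s is older than t if 0 < (t - s) <
      2**31). The names ts_newer and restart_is_safe are hypothetical.

```python
MOD = 2 ** 32    # timestamps are 32-bit values
HALF = 2 ** 31   # half the timestamp space


def ts_newer(a: int, b: int) -> bool:
    """Return True if 32-bit timestamp a is strictly newer than b,
    i.e. 0 < (a - b) mod 2**32 < 2**31."""
    diff = (a - b) % MOD
    return 0 < diff < HALF


def restart_is_safe(last_before_crash: int, first_after_restart: int) -> bool:
    """The quiet time may be skipped only if the timestamp clock moved
    forward across the crash/restart (the condition stated above)."""
    return ts_newer(first_after_restart, last_before_crash)
```

      For example, restart_is_safe(1000, 1001) holds, while
      restart_is_safe(1000, 10) does not, because the clock appears to
      have gone backwards.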

      To dispense totally with the quiet time would require that the
      host clock be synchronized to a time source that is stable over
      the crash/restart period, with an accuracy of one timestamp clock
      tick or better. We can back off from this strict requirement to
      take advantage of approximate clock synchronization. Suppose that
      the clock is always re-synchronized to within N timestamp clock
      ticks and that booting (extended with a quiet time, if necessary)
      takes more than N ticks. This will guarantee monotonicity of the
      timestamps, which can then be used to reject old duplicates even
      without an enforced MSL.

   B.2 Closing and Reopening a Connection

      When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT
      state ties up the socket pair for 4 minutes (see Section 3.5 of
      [Postel81]). Applications built upon TCP that close one connection
      and open a new one (e.g., an FTP data transfer connection using
      Stream mode) must choose a new socket pair each time. The
      TIME-WAIT delay serves two different purposes:

      (a) Implement the full-duplex reliable close handshake of TCP.

          The proper time to delay the final close step is not really
          related to the MSL; it depends instead upon the RTO for the
          FIN segments and therefore upon the RTT of the path. (It
          could be argued that the side that is sending a FIN knows
          what degree of reliability it needs, and therefore it should
          be able to determine the length of the TIME-WAIT delay for
          the FIN's recipient. This could be accomplished with an
          appropriate TCP option in FIN segments.)

          Although there is no formal upper bound on RTT, common
          network engineering practice makes an RTT greater than 1
          minute very unlikely. Thus, the 4 minute delay in TIME-WAIT
          state works satisfactorily to provide a reliable full-duplex
          TCP close.
          Note again that this is independent of MSL
          enforcement and network speed.

          The TIME-WAIT state could cause an indirect performance
          problem if an application needed to repeatedly close one
          connection and open another at a very high frequency, since
          the number of available TCP ports on a host is less than
          2**16. However, high network speeds are not the major
          contributor to this problem; the RTT is the limiting factor
          in how quickly connections can be opened and closed.
          Therefore, this problem will be no worse at high transfer
          speeds.

      (b) Allow old duplicate segments to expire.

          To replace this function of TIME-WAIT state, a mechanism
          would have to operate across connections. PAWS is defined
          strictly within a single connection; the last timestamp
          (TS.Recent) is kept in the connection control block, and
          discarded when a connection is closed.

          An additional mechanism could be added to TCP: a per-host
          cache of the last timestamp received from any connection.
          This value could then be used in the PAWS mechanism to reject
          old duplicate segments from earlier incarnations of the
          connection, if the timestamp clock can be guaranteed to have
          ticked at least once since the old connection was open. This
          would require that the TIME-WAIT delay plus the RTT together
          must be at least one tick of the sender's timestamp clock.
          Such an extension is not part of the proposal of this RFC.

          Note that this is a variant on the mechanism proposed by
          Garlick, Rom, and Postel [Garlick77], which required each
          host to maintain connection records containing the highest
          sequence numbers on every connection. Using timestamps
          instead, it is only necessary to keep one quantity per remote
          host, regardless of the number of simultaneous connections to
          that host.
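
   The per-host cache discussed in (b) is not specified by this
   document. Purely as an illustration of the idea of keeping one
   quantity per remote host, it could be sketched as below. The class
   HostTimestampCache, its methods, and the use of a host string as the
   key are all hypothetical; the comparison assumes the same modulo-2**32
   ordering that PAWS uses.

```python
from typing import Dict, Optional

MOD = 2 ** 32
HALF = 2 ** 31


def ts_older(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is strictly older than b (mod 2**32)."""
    return 0 < (b - a) % MOD < HALF


class HostTimestampCache:
    """Hypothetical cache holding one timestamp per remote host."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def record(self, remote_host: str, tsval: int) -> None:
        """Remember the newest timestamp seen from remote_host."""
        prev: Optional[int] = self._last.get(remote_host)
        if prev is None or ts_older(prev, tsval):
            self._last[remote_host] = tsval

    def is_old_duplicate(self, remote_host: str, seg_tsval: int) -> bool:
        """Would a segment carrying seg_tsval be rejected as a duplicate
        from an earlier incarnation of a connection to remote_host?"""
        prev = self._last.get(remote_host)
        return prev is not None and ts_older(seg_tsval, prev)
```

   A segment whose timestamp equals the cached value is not rejected in
   this sketch, which is why the text requires the sender's clock to
   have ticked at least once since the old connection was open.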

APPENDIX C: CHANGES FROM RFC 1072, RFC 1185, RFC 1323

   The protocol extensions defined in RFC 1323 differ in several
   important ways from those defined in RFC 1072 and RFC 1185.

   (a) SACK has been split off into a separate document, RFC 2018
       [Mathis96].

   (b) The detailed rules for sending timestamp replies (see Section
       3.4) differ in important ways. The earlier rules could result
       in an under-estimate of the RTT in certain cases (packets
       dropped or out of order).

   (c) The same value TS.Recent is now shared by the two distinct
       mechanisms RTTM and PAWS. This simplification became possible
       because of change (b).

   (d) An ambiguity in RFC 1185 was resolved in favor of putting
       timestamps on ACK as well as data segments. This supports the
       symmetry of the underlying TCP protocol.

   (e) The echo and echo reply options of RFC 1072 were combined into a
       single Timestamps option, to reflect the symmetry and to
       simplify processing.

   (f) The problem of outdated timestamps on long-idle connections,
       discussed in Section 4.2.2, was realized and resolved.

   (g) RFC 1185 recommended that header prediction take precedence over
       the timestamp check. Based upon some skepticism about the
       probabilistic arguments given in Section 4.2.4, it was decided
       to recommend that the timestamp check be performed first.

   (h) The spec was modified so that the extended options will be sent
       on SYN,ACK segments only when they are received in the
       corresponding SYN segments. This provides the most
       conservative possible conditions for interoperation with
       implementations without the extensions.

   In addition to these substantive changes, the present RFC attempts to
   specify the algorithms unambiguously by presenting modifications to
   the Event Processing rules of RFC 793; see Appendix F.

   There are additional changes in this document from RFC 1323.
   These
   changes are:

   (a) The description of which TSecr values can be used to update the
       measured RTT has been clarified. Specifically, with Timestamps,
       the Karn algorithm [Karn87] is disabled. The Karn algorithm
       disables all RTT measurements during retransmission, since it is
       ambiguous whether the ACK is for the original packet or the
       retransmitted packet. With Timestamps, that ambiguity is
       removed since the TSecr in the ACK will contain the TSval from
       whichever data packet made it to the destination.

   (b) In RFC 1323, section 3.4, step (2) of the algorithm to control
       which timestamp is echoed was incorrect in two regards:

       (1) It failed to update TSrecent for a retransmitted segment
           that resulted from a lost ACK.

       (2) It failed if SEG.LEN = 0.

       In the new algorithm, the case of SEG.TSval = TSrecent is
       included for consistency with the PAWS test.

   (c) One correction was made to the Event Processing Summary in
       Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
       fill in the SEG.WND value, not SND.WND.

   (d) A new pseudo-code summary has been added in Appendix E.

   (e) Appendix A has been expanded with information about the TCP MSS
       option and the TCP Urgent Pointer.

   (f) It is now recommended that Timestamps options be included in RST
       packets if the incoming packet contained a Timestamps option.

   (g) RST packets are explicitly excluded from PAWS processing.

   (h) Snd.TSoffset and Snd.TSclock variables have been added.
       Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
       allows the starting points for timestamps to be randomized on a
       per-connection basis. Setting Snd.TSoffset to zero yields the
       same results as RFC 1323.

APPENDIX D: SUMMARY OF NOTATION

   The following notation has been used in this document.

   Options

      WSopt:           TCP Window Scale Option
      TSopt:           TCP Timestamps Option

   Option Fields

      shift.cnt:       Window scale byte in WSopt.
      TSval:           32-bit Timestamp Value field in TSopt.
      TSecr:           32-bit Timestamp Reply field in TSopt.

   Option Fields in Current Segment

      SEG.TSval:       TSval field from TSopt in current segment.
      SEG.TSecr:       TSecr field from TSopt in current segment.
      SEG.WSopt:       8-bit value in WSopt

   Clock Values

      my.TSclock:      System wide source of 32-bit timestamp values
      my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec).
      Snd.TSoffset:    An offset for randomizing Snd.TSclock
      Snd.TSclock:     my.TSclock + Snd.TSoffset

   Per-Connection State Variables

      TS.Recent:       Latest received Timestamp
      Last.ACK.sent:   Last ACK field sent

      Snd.TS.OK:       1-bit flag
      Snd.WS.OK:       1-bit flag

      Rcv.Wind.Scale:  Receive window scale power
      Snd.Wind.Scale:  Send window scale power

      Start.Time:      Snd.TSclock value when segment being
                       timed was sent (used by pre-1323 code).

   Procedure

      Update_SRTT( m ) Procedure to update the smoothed RTT and RTT
                       variance estimates, using the rules of
                       [Jacobson88a], given m, a new RTT measurement.
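
   Update_SRTT is only named here, not defined. A minimal sketch of one
   common reading of the [Jacobson88a] rules is given below, with a gain
   of 1/8 for the smoothed RTT and 1/4 for the mean deviation. The
   first-sample initialisation, the use of floating-point seconds, the
   helper rtt_from_echo and its tick_seconds parameter are assumptions
   made for the example and are not taken from this document.

```python
from dataclasses import dataclass
from typing import Optional

MOD = 2 ** 32  # timestamp values are 32 bits wide


def rtt_from_echo(snd_tsclock: int, seg_tsecr: int, tick_seconds: float) -> float:
    """Convert (Snd.TSclock - SEG.TSecr) to seconds, allowing for wrap.
    tick_seconds is the assumed period of the timestamp clock."""
    return ((snd_tsclock - seg_tsecr) % MOD) * tick_seconds


@dataclass
class RttEstimator:
    srtt: Optional[float] = None  # smoothed RTT in seconds
    rttvar: float = 0.0           # smoothed mean deviation in seconds

    def update_srtt(self, m: float) -> None:
        """Fold one new RTT measurement m (seconds) into the estimates."""
        if self.srtt is None:
            # Assumed initialisation for the very first sample.
            self.srtt = m
            self.rttvar = m / 2
            return
        err = m - self.srtt
        self.srtt += err / 8                          # gain 1/8
        self.rttvar += (abs(err) - self.rttvar) / 4   # gain 1/4
```

   A caller would feed each measurement obtained from an echoed
   timestamp into update_srtt, for example
   est.update_srtt(rtt_from_echo(now, tsecr, 0.001)) for a clock with a
   1 ms period.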

APPENDIX E: PSEUDO-CODE SUMMARY

   Create new TCB => {
       Rcv.wind.scale =
           MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
       Snd.wind.scale = 0;
       Last.ACK.sent = 0;
       Snd.TS.OK = Snd.WS.OK = FALSE;
       Snd.TSoffset = random 32 bit value;
   }

   Send initial {SYN} segment => {

       SEG.WND = MIN( RCV.WND, 65535 );
       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
       Include in segment: WSopt = Rcv.wind.scale;
   }

   Send {SYN, ACK} segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
       if (Snd.WS.OK) then
           Include in segment: WSopt = Rcv.wind.scale;
   }

   Receive {SYN} or {SYN,ACK} segment => {

       if (Segment contains TSopt) then {
           TS.Recent = SEG.TSval;
           Snd.TS.OK = TRUE;
           if (is {SYN,ACK} segment) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
       }

       if (Segment contains WSopt) then {
           Snd.wind.scale = SEG.WSopt;
           Snd.WS.OK = TRUE;
           if (the ACK bit is not set, and Rcv.wind.scale has not been
               initialized by the user) then
               Rcv.wind.scale = Snd.wind.scale;
       }
       else
           Rcv.wind.scale = Snd.wind.scale = 0;
   }

   Send non-SYN segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
   }

   Receive non-SYN segment in (state >= ESTABLISHED) => {

       Window = (SEG.WND << Snd.wind.scale);
           /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
            * in rest of processing.
            */

       if (Segment contains TSopt) then {
           if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
               if (Snd.TS.OK AND (NOT RST) ) then {
                   /* Timestamp too old =>
                    * segment is unacceptable.
                    */
                   Send ACK segment;
                   Discard segment and return;
               }
           }
           else {
               if (SEG.SEQ =< Last.ACK.sent) then
                   TS.Recent = SEG.TSval;
           }
       }

       if (SEG.ACK > SND.UNA) then {
           /* (At least part of) first segment in
            * retransmission queue has been ACKd
            */
           if (Segment contains TSopt) then
               Update_SRTT(
                   (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
           else
               Update_SRTT( /* for compatibility */
                   (Snd.TSclock - Start.Time)/my.TSclock.rate);
       }
   }

APPENDIX F: EVENT PROCESSING SUMMARY

   Event Processing

   OPEN Call

     ...
     An initial send sequence number (ISS) is selected. Send a SYN
     segment of the form:



     ...

   SEND Call

     CLOSED STATE (i.e., TCB does not exist)

       ...

     LISTEN STATE

       If the foreign socket is specified, then change the connection
       from passive to active, select an ISS. Send a SYN segment
       containing the options: and
       . Set SND.UNA to ISS, SND.NXT to ISS+1.
       Enter SYN-SENT state. ...

     SYN-SENT STATE
     SYN-RECEIVED STATE

       ...

     ESTABLISHED STATE
     CLOSE-WAIT STATE

       Segmentize the buffer and send it with a piggybacked
       acknowledgment (acknowledgment value = RCV.NXT). ...

       If the urgent flag is set ...

       If the Snd.TS.OK flag is set, then include the TCP Timestamps
       option in each data segment.

       Scale the receive window for transmission in the segment header:

           SEG.WND = (RCV.WND >> Rcv.Wind.Scale).

   SEGMENT ARRIVES

     ...

     If the state is LISTEN then

       first check for an RST

         ...

       second check for an ACK

         ...

       third check for a SYN

         if the SYN bit is set, check the security. If the ...

         ...

         If the SEG.PRC is less than the TCB.PRC then continue.

         Check for a Window Scale option (WSopt); if one is found, save
         SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
         Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to zero
         and clear Snd.WS.OK flag.

         Check for a TSopt option; if one is found, save SEG.TSval in the
         variable TS.Recent and turn on the Snd.TS.OK bit.

         Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other
         control or text should be queued for processing later. ISS
         should be selected and a SYN segment sent of the form:



         If the Snd.WS.OK bit is on, include a WSopt option
         in this segment. If the Snd.TS.OK bit is
         on, include a TSopt in this
         segment. Last.ACK.sent is set to RCV.NXT.

         SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection
         state should be changed to SYN-RECEIVED. Note that any other
         incoming control or data (combined with SYN) will be processed
         in the SYN-RECEIVED state, but processing of SYN and ACK should
         not be repeated. If the listen was not fully specified (i.e.,
         the foreign socket was not fully specified), then the
         unspecified fields should be filled in now.

       fourth other text or control

         ...

     If the state is SYN-SENT then

       first check the ACK bit

         ...

       fourth check the SYN bit

         ...

         If the SYN bit is on and the security/compartment and precedence
         are acceptable then, RCV.NXT is set to SEG.SEQ+1, IRS is set to
         SEG.SEQ, and any segments on the retransmission queue
         which are thereby acknowledged should be removed.

         Check for a Window Scale option (WSopt); if one is found, save
         SEG.WSopt in Snd.Wind.Scale; otherwise, set both Snd.Wind.Scale
         and Rcv.Wind.Scale to zero.

         Check for a TSopt option; if one is found, save SEG.TSval in
         variable TS.Recent and turn on the Snd.TS.OK bit in the
         connection control block. If the ACK bit is set, use
         Snd.TSclock - SEG.TSecr as the initial RTT estimate.

         If SND.UNA > ISS (our SYN has been ACKed), change the connection
         state to ESTABLISHED, form an ACK segment:



         and send it. If the Snd.TS.OK bit is on, include a TSopt
         option in this ACK segment.
         Last.ACK.sent is set to RCV.NXT.

         Data or controls which were queued for transmission may be
         included. If there are other controls or text in the segment
         then continue processing at the sixth step below where the URG
         bit is checked, otherwise return.

         Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:



         and send it. If the Snd.TS.OK bit is on, include a TSopt
         option in this segment. If
         the Snd.WS.OK bit is on, include a WSopt option
         in this segment. Last.ACK.sent is set to
         RCV.NXT.

         If there are other controls or text in the segment, queue them
         for processing after the ESTABLISHED state has been reached,
         return.

       fifth, if neither of the SYN or RST bits is set then drop the
       segment and return.

     Otherwise,

     First, check sequence number

       SYN-RECEIVED STATE
       ESTABLISHED STATE
       FIN-WAIT-1 STATE
       FIN-WAIT-2 STATE
       CLOSE-WAIT STATE
       CLOSING STATE
       LAST-ACK STATE
       TIME-WAIT STATE

         Segments are processed in sequence. Initial tests on arrival
         are used to discard old duplicates, but further processing is
         done in SEG.SEQ order. If a segment's contents straddle the
         boundary between old and new, only the new parts should be
         processed.

         Rescale the received window field:

             TrueWindow = SEG.WND << Snd.Wind.Scale,

         and use "TrueWindow" in place of SEG.WND in the following steps.

         Check whether the segment contains a Timestamps option and bit
         Snd.TS.OK is on. If so:

           If SEG.TSval < TS.Recent and the RST bit is off, then test
           whether connection has been idle less than 24 days; if all are
           true, then the segment is not acceptable; follow steps below
           for an unacceptable segment.

           If SEG.SEQ is less than or equal to Last.ACK.sent, then save
           SEG.TSval in variable TS.Recent.

         There are four cases for the acceptability test for an incoming
         segment:

         ...

         If an incoming segment is not acceptable, an acknowledgment
         should be sent in reply (unless the RST bit is set, if so drop
         the segment and return):



         Last.ACK.sent is set to SEG.ACK of the acknowledgment. If the
         Snd.TS.OK bit is on, include the Timestamps option
         in this ACK segment. Set
         Last.ACK.sent to SEG.ACK and send the ACK segment. After
         sending the acknowledgment, drop the unacceptable segment and
         return.

     ...

     fifth check the ACK field.

       if the ACK bit is off drop the segment and return.

       if the ACK bit is on

         ...

         ESTABLISHED STATE

           If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <- SEG.ACK.
           Also compute a new estimate of round-trip time. If Snd.TS.OK
           bit is on, use Snd.TSclock - SEG.TSecr; otherwise use the
           elapsed time since the first segment in the retransmission
           queue was sent. Any segments on the retransmission queue
           which are thereby entirely acknowledged...

     ...

     Seventh, process the segment text.

       ESTABLISHED STATE
       FIN-WAIT-1 STATE
       FIN-WAIT-2 STATE

         ...

         Send an acknowledgment of the form:



         If the Snd.TS.OK bit is on, include Timestamps option
         in this ACK segment. Set
         Last.ACK.sent to SEG.ACK of the acknowledgment, and send it.
         This acknowledgment should be piggy-backed on a segment being
         transmitted if possible without incurring undue delay.

     ...

APPENDIX G: Timestamps Edge Cases

   While the rules laid out for when to calculate RTTM produce the
   correct results most of the time, there are some edge cases where an
   incorrect RTTM can be calculated. All of these situations involve
   the loss of packets.
   It is felt that these scenarios are rare, and
   that if they should happen, they will inflate only a single RTTM
   measurement, which limits the effect on RTO calculations.

   [Martin03] cites two similar cases when the returning ACK is lost,
   and before the retransmission timer fires, another returning packet
   arrives, which ACKs the data. In this case, the RTTM calculated will
   be inflated:

      clock
       tc=1   ------------------->

       tc=2   (lost) <----
              (RTTM would have been 1)

              (receive window opens, window update is sent)
       tc=5   <----
              (RTTM is calculated at 4)

   One thing to note about this situation is that it is somewhat bounded
   by RTO + RTT, limiting how far off the RTTM calculation will be.
   While more complex scenarios can be constructed that produce larger
   inflations (e.g., retransmissions are lost), those scenarios involve
   multiple packet losses, and the connection will have other more
   serious operational problems than using an inflated RTTM in the RTO
   calculation.

Authors' Addresses

   David Borman
   Wind River Systems
   Mendota Heights, MN 55120

   Phone: (651) 454-3052
   EMail: david.borman@windriver.com

   Bob Braden
   University of Southern California
   Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292

   Phone: (310) 448-9173
   EMail: Braden@ISI.EDU

   Van Jacobson
   Packet Design
   2465 Latham Street
   Mountain View, CA 94040

   EMail: van@packetdesign.com