idnits 2.17.1

draft-ietf-tcpm-1323bis-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 278 has weird spacing: '...its/sec byt...'

  == Line 1397 has weird spacing: '... TSval times...'

  == Line 1399 has weird spacing: '... TSecr times...'

  -- The document seems to contain a disclaimer for pre-RFC5378 work, and may
     have content which was first submitted before 10 November 2008.  The
     disclaimer is necessary when there are original authors that you have
     been unable to contact, or if some do not wish to grant the BCP78 rights
     to the IETF Trust.  If you are able to get all authors (current and
     original) to grant those rights, you can and should remove the
     disclaimer; otherwise, the disclaimer is needed and you can ignore this
     comment.  (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (March 4, 2009) is 5532 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 308

  ** Obsolete normative reference: RFC  793 (ref. 'Postel81') (Obsoleted by
     RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 2581 (ref.
     'Allman99') (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 3517 (ref.
     'Blanton03') (Obsoleted by RFC 6675)

  -- Obsolete informational reference (is this intentional?): RFC 1072 (ref.
     'Jacobson88b') (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185 (ref.
     'Jacobson90b') (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323 (ref.
     'Jacobson92d') (Obsoleted by RFC 7323)

  -- Duplicate reference: RFC1323, mentioned in 'Martin03', was also
     mentioned in 'Jacobson92d'.

  -- Obsolete informational reference (is this intentional?): RFC 1323 (ref.
     'Martin03') (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1110 (ref.
     'McKenzie89') (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC  896 (ref.
     'Nagle84') (Obsoleted by RFC 7805)


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 13 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                              Network Working Group
3	Internet-Draft                                                 D. Borman
4	Obsoletes: 1323                                       Wind River Systems
5	Intended Status: Standards Track                               R. Braden
6	File: draft-ietf-tcpm-1323bis-01.txt                                 ISI
7	                                                             V. Jacobson
8	                                                           Packet Design
9	                                                           March 4, 2009

11	                  TCP Extensions for High Performance

13	Status of This Memo

15	   This Internet-Draft is submitted to IETF in full conformance with the
16	   provisions of BCP 78 and BCP 79.

18	   This document may contain material from IETF Documents or IETF
19	   Contributions published or made publicly available before November
20	   10, 2008. The person(s) controlling the copyright in some of this
21	   material may not have granted the IETF Trust the right to allow
22	   modifications of such material outside the IETF Standards Process.
23	   Without obtaining an adequate license from the person(s) controlling
24	   the copyright in such materials, this document may not be modified
25	   outside the IETF Standards Process, and derivative works of it may
26	   not be created outside the IETF Standards Process, except to format
27	   it for publication as an RFC or to translate it into languages other
28	   than English.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF), its areas, and its working groups.  Note that
32	   other groups may also distribute working documents as Internet-
33	   Drafts.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   The list of current Internet-Drafts can be accessed at
41	   http://www.ietf.org/ietf/1id-abstracts.txt.

43	   The list of Internet-Draft Shadow Directories can be accessed at
44	   http://www.ietf.org/shadow.html.

46	   This Internet-Draft will expire on September 4, 2009.

48	Copyright

50	   Copyright (c) 2009 IETF Trust and the persons identified as the
51	   document authors.  All rights reserved.

53	   This document is subject to BCP 78 and the IETF Trust's Legal
54	   Provisions Relating to IETF Documents in effect on the date of
55	   publication of this document (http://trustee.ietf.org/license-info).
56	   Please review these documents carefully, as they describe your rights
57	   and restrictions with respect to this document.

59	Abstract

61	   This memo presents a set of TCP extensions to improve performance
62	   over large bandwidth*delay product paths and to provide reliable
63	   operation over very high-speed paths.  It defines TCP options for
64	   scaled windows and timestamps, which are designed to provide
65	   compatible interworking with TCP's that do not implement the
66	   extensions.  The timestamps are used for two distinct mechanisms:
67	   RTTM (Round Trip Time Measurement) and PAWS (Protection Against
68	   Wrapped Sequences).  Selective acknowledgments are not included in
69	   this memo.

71	   This memo updates and obsoletes RFC 1323.

73	TABLE OF CONTENTS

75	   1.  Introduction                                                    2
76	   2.  TCP Window Scale Option                                         9
77	   3.  RTTM -- Round-Trip Time Measurement                            12
78	   4.  PAWS -- Protection Against Wrapped Sequence Numbers            18
79	   5.  Conclusions and Acknowledgments                                26
80	   6.  Security Considerations                                        27
81	   7.  IANA Considerations                                            27
82	   8.  References                                                     27
83	   APPENDIX A: Implementation Suggestions                             30
84	   APPENDIX B: Duplicates from Earlier Connection Incarnations        31
85	   APPENDIX C: Changes from RFC 1072, RFC 1185, RFC 1323              34
86	   APPENDIX D: Summary of Notation                                    36
87	   APPENDIX E: Pseudo-code Summary                                    37
88	   APPENDIX F: Event Processing                                       40
89	   APPENDIX G: Timestamps Edge Cases                                  46
90	   Authors' Addresses                                                 47

92	1. INTRODUCTION

94	   The TCP protocol [Postel81] was designed to operate reliably over
95	   almost any transmission medium regardless of transmission rate,
96	   delay, corruption, duplication, or reordering of segments.
97	   Production TCP implementations currently adapt to transfer rates in
98	   the range of 100 bps to 10**10 bps and round-trip delays in the range
99	   1 ms to 100 seconds.  Work on TCP performance has shown that TCP
100	   without the extensions described in this memo can work well over a
101	   variety of Internet paths, ranging from 800 Mbit/sec I/O channels to
102	   300 bit/sec dial-up modems [Jacobson88a].

104	   Over the years, advances in networking technology have resulted in
105	   ever-higher transmission speeds, and the fastest paths are well
106	   beyond the domain for which TCP was originally engineered.  This memo
107	   defines a set of modest extensions to TCP to extend the domain of its
108	   application to match this increasing network capability.  It is an
109	   update to and obsoletes RFC 1323 [Jacobson92d], which in turn is
110	   based upon and obsoletes RFC 1072 [Jacobson88b] and RFC 1185
111	   [Jacobson90b].

113	   There is no one-line answer to the question: "How fast can TCP go?".
114	   There are two separate kinds of issues, performance and reliability,
115	   and each depends upon different parameters.  We discuss each in turn.

117	   1.1  TCP Performance

119	      TCP performance depends not upon the transfer rate itself, but
120	      rather upon the product of the transfer rate and the round-trip
121	      delay.  This "bandwidth*delay product" measures the amount of data
122	      that would "fill the pipe"; it is the buffer space required at
123	      sender and receiver to obtain maximum throughput on the TCP
124	      connection over the path, i.e., the amount of unacknowledged data
125	      that TCP must handle in order to keep the pipeline full.  TCP
126	      performance problems arise when the bandwidth*delay product is
127	      large.  We refer to an Internet path operating in this region as a
128	      "long, fat pipe", and a network containing this path as an "LFN"
129	      (pronounced "elephan(t)").

131	      High-capacity packet satellite channels are LFN's.  For example, a
132	      DS1-speed satellite channel has a bandwidth*delay product of 10**6
133	      bits or more; this corresponds to 100 outstanding TCP segments of
134	      1200 bytes each.  Terrestrial fiber-optical paths will also fall
135	      into the LFN class; for example, a cross-country delay of 30 ms at
136	      a DS3 bandwidth (45Mbps) also exceeds 10**6 bits.
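The two LFN examples above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the 30 ms cross-country figure is the delay to use in the product.

```python
def bdp_bits(bandwidth_bps: float, delay_seconds: float) -> float:
    """Bandwidth*delay product in bits: the data needed to 'fill the pipe'."""
    return bandwidth_bps * delay_seconds


def segments_outstanding(bdp_in_bits: float, segment_bytes: int) -> float:
    """Number of segments of the given size that make up one BDP."""
    return bdp_in_bits / 8 / segment_bytes


# DS3 (45 Mbps) with a 30 ms delay, as in the text above.
ds3_bdp = bdp_bits(45e6, 0.030)

# A product of 10**6 bits expressed in 1200-byte segments.
ds1_segments = segments_outstanding(10**6, 1200)

print(f"DS3 BDP: {ds3_bdp:.3g} bits")
print(f"10**6 bits is about {ds1_segments:.0f} segments of 1200 bytes")
```

Both results agree with the figures quoted in the text: the DS3 path exceeds 10**6 bits, and 10**6 bits is roughly 100 segments of 1200 bytes.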

138	      There are three fundamental performance problems with the current
139	      TCP over LFN paths:

141	      (1)  Window Size Limit

143	           The TCP header uses a 16 bit field to report the receive
144	           window size to the sender.  Therefore, the largest window
145	           that can be used is 2**16 = 65,536 bytes (64 KB).

147	           To circumvent this problem, Section 2 of this memo defines a
148	           new TCP option, "Window Scale", to allow windows larger than
149	           2**16.  This option defines an implicit scale factor, which
150	           is used to multiply the window size value found in a TCP
151	           header to obtain the true window size.

153	      (2)  Recovery from Losses

155	           Packet losses in an LFN can have a catastrophic effect on
156	           throughput.  In the past, properly-operating TCP
157	           implementations would cause the data pipeline to drain with
158	           every packet loss, and require a slow-start action to
159	           recover.  The Fast Retransmit and Fast Recovery algorithms
160	           [Jacobson90c] [Allman99] were introduced, and their combined
161	           effect was to recover from one packet loss per window,
162	           without draining the pipeline.  However, more than one packet
163	           loss per window typically resulted in a retransmission
164	           timeout and the resulting pipeline drain and slow start.

166	           Expanding the window size to match the capacity of an LFN
167	           results in a corresponding increase of the probability of
168	           more than one packet per window being dropped.  This could
169	           have a devastating effect upon the throughput of TCP over an
170	           LFN.  In addition, since the publication of RFC 1323,
171	           congestion control mechanisms based upon some form of random
172	           dropping have been introduced into gateways, and randomly
173	           spaced packet drops have become common; this increases the
174	           probability of dropping more than one packet per window.

176	           To generalize the Fast Retransmit/Fast Recovery mechanism to
177	           handle multiple packets dropped per window, selective
178	           acknowledgments are required.  Unlike the normal cumulative
179	           acknowledgments of TCP, selective acknowledgments give the
180	           sender a complete picture of which segments are queued at the
181	           receiver and which have not yet arrived.

183	           Since the publication of RFC 1323, selective acknowledgments
184	           have become important in the LFN regime.  RFC 1072 defined a
185	           new TCP "SACK" option to send a selective acknowledgment, but
186	           at the time that RFC 1323 was published, important technical
187	           issues still had to be worked out concerning both the format
188	           and semantics of the SACK option, so it was split off from
189	           RFC 1323.  SACK has now been published as a separate
190	           document, RFC 2018 [Mathis96].  Additional information about
191	           SACK can be found in RFC 2883, "An Extension to the Selective
192	           Acknowledgement (SACK) option for TCP" [Floyd00] and RFC
193	           3517, "A Conservative Selective Acknowledgment (SACK)-based
194	           Loss Recovery Algorithm for TCP" [Blanton03].

196	      (3)  Round-Trip Measurement

198	           TCP implements reliable data delivery by retransmitting
199	           segments that are not acknowledged within some retransmission
200	           timeout (RTO) interval.  Accurate dynamic determination of an
201	           appropriate RTO is essential to TCP performance.  RTO is
202	           determined by estimating the mean and variance of the
203	           measured round-trip time (RTT), i.e., the time interval
204	           between sending a segment and receiving an acknowledgment for
205	           it [Jacobson88a].

207	           Section 3 introduces a new TCP option, "Timestamps", and then
208	           defines a mechanism using this option that allows nearly
209	           every segment, including retransmissions, to be timed at
210	           negligible computational cost.  We use the mnemonic RTTM
211	           (Round Trip Time Measurement) for this mechanism, to
212	           distinguish it from other uses of the Timestamps option.

214	   1.2 TCP Reliability

216	      Now we turn from performance to reliability.  High transfer rate
217	      enters TCP performance through the bandwidth*delay product.
218	      However, high transfer rate alone can threaten TCP reliability by
219	      violating the assumptions behind the TCP mechanism for duplicate
220	      detection and sequencing.

222	      An especially serious kind of error may result from an accidental
223	      reuse of TCP sequence numbers in data segments.  Suppose that an
224	      "old duplicate segment", e.g., a duplicate data segment that was
225	      delayed in Internet queues, is delivered to the receiver at the
226	      wrong moment, so that its sequence numbers fall somewhere within
227	      the current window.  There would be no checksum failure to warn of
228	      the error, and the result could be an undetected corruption of the
229	      data.  Reception of an old duplicate ACK segment at the
230	      transmitter could be only slightly less serious: it is likely to
231	      lock up the connection so that no further progress can be made,
232	      forcing an RST on the connection.

234	      TCP reliability depends upon the existence of a bound on the
235	      lifetime of a segment: the "Maximum Segment Lifetime" or MSL.  An
236	      MSL is generally required by any reliable transport protocol,
237	      since every sequence number field must be finite, and therefore
238	      any sequence number may eventually be reused.  In the Internet
239	      protocol suite, the MSL bound is enforced by an IP-layer
240	      mechanism, the "Time-to-Live" or TTL field.

242	      Duplication of sequence numbers might happen in either of two
243	      ways:

245	      (1)  Sequence number wrap-around on the current connection

247	           A TCP sequence number contains 32 bits.  At a high enough
248	           transfer rate, the 32-bit sequence space may be "wrapped"
249	           (cycled) within the time that a segment is delayed in queues.

251	      (2)  Earlier incarnation of the connection

253	           Suppose that a connection terminates, either by a proper
254	           close sequence or due to a host crash, and the same
255	           connection (i.e., using the same pair of sockets) is
256	           immediately reopened.  A delayed segment from the terminated
257	           connection could fall within the current window for the new
258	           incarnation and be accepted as valid.

260	      Duplicates from earlier incarnations, case (2), are avoided by
261	      enforcing the current fixed MSL of the TCP spec, as explained in
262	      Section 4.3 and Appendix B.  However, case (1), avoiding the
263	      reuse of sequence numbers within the same connection, requires an
264	      MSL bound that depends upon the transfer rate, and at high enough
265	      rates, a new mechanism is required.

267	      More specifically, if the maximum effective bandwidth at which TCP
268	      is able to transmit over a particular path is B bytes per second,
269	      then the following constraint must be satisfied for error-free
270	      operation:

272	          2**31 / B  > MSL (secs)                     [1]

274	      The following table shows the value for Twrap = 2**31/B in
275	      seconds, for some important values of the bandwidth B:

277	           Network       B*8          B         Twrap
278	                      bits/sec    bytes/sec     secs
279	           _______     _______     ______       ______

281	           Dialup        56kbps       7KBps    3*10**5 (~3.6 days)

283	           DS1          1.5Mbps     190KBps    10**4 (~3 hours)

285	           10mbit
286	           Ethernet      10Mbps    1.25MBps    1700 (~30 mins)

288	           DS3           45Mbps     5.6MBps    380

290	           100mbit
291	           Ethernet     100Mbps    12.5MBps    170

293	           Gigabit
294	           Ethernet       1Gbps     125MBps    17

296	           10GigE        10Gbps    1.25GBps    1.7

298	      It is clear that wrap-around of the sequence space is not a
299	      problem for 56kbps packet switching or even 10Mbps Ethernets.  On
300	      the other hand, at DS3 and 100mbit speeds, Twrap is comparable to
301	      the 2 minute MSL assumed by the TCP specification [Postel81].
302	      Moving towards and beyond gigabit speeds, Twrap becomes too small
303	      for reliable enforcement by the Internet TTL mechanism.
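Constraint [1] and the Twrap column of the table can be reproduced as follows. This is a minimal Python sketch with hypothetical function names; the default MSL of 120 seconds is the 2 minute value that the text attributes to the TCP specification.

```python
HALF_SEQUENCE_SPACE = 2**31  # bytes; numerator of constraint [1]


def twrap_seconds(bytes_per_second: float) -> float:
    """Twrap = 2**31 / B, where B is the effective bandwidth in bytes/sec."""
    return HALF_SEQUENCE_SPACE / bytes_per_second


def satisfies_constraint(bytes_per_second: float, msl_seconds: float = 120) -> bool:
    """True when 2**31 / B > MSL, i.e. constraint [1] holds."""
    return twrap_seconds(bytes_per_second) > msl_seconds


# B values taken from the table above (bytes per second).
for name, rate in [("10Mbps Ethernet", 1.25e6),
                   ("DS3", 5.6e6),
                   ("Gigabit Ethernet", 125e6)]:
    print(name, round(twrap_seconds(rate)), satisfies_constraint(rate))
```

Note that a path can satisfy the constraint only marginally: at DS3 speed Twrap is a few hundred seconds, which is the same order of magnitude as the MSL.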

305	      The 16-bit window field of TCP limits the effective bandwidth B to
306	      2**16/RTT, where RTT is the round-trip time in seconds
307	      [McKenzie89].  If the RTT is large enough, this limits B to a
308	      value that meets the constraint [1] for a large MSL value.  For
309	      example, consider a transcontinental backbone with an RTT of 60ms
310	      (set by the laws of physics).  With the bandwidth*delay product
311	      limited to 64KB by the TCP window size, B is then limited to
312	      1.1MBps, no matter how high the theoretical transfer rate of the
313	      path.  This corresponds to cycling the sequence number space in
314	      Twrap= 2000 secs, which is safe in today's Internet.
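The transcontinental example can be worked through numerically. The sketch below is illustrative; the function name is hypothetical and the inputs are the 2**16-byte window and 60 ms RTT used in the paragraph above.

```python
def window_limited_rate(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bytes/sec) when at most one window
    of data can be outstanding per round trip."""
    return window_bytes / rtt_seconds


rate = window_limited_rate(2**16, 0.060)   # roughly 1.1 MBps
twrap = 2**31 / rate                       # time to cycle 2**31 bytes

print(f"B is limited to about {rate / 1e6:.2f} MBps")
print(f"Twrap is about {twrap:.0f} seconds")
```

The computed Twrap is a little under the rounded 2000 seconds quoted in the text, and far above any plausible MSL.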

316	      It is important to understand that the culprit is not the larger
317	      window but rather the high bandwidth.  For example, consider a
318	      (very large) FDDI LAN with a diameter of 10km.  Using the speed of
319	      light, we can compute the RTT across the ring as
320	      (2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth
321	      product is then 833 bytes.  A TCP connection across this LAN using
322	      a window of only 833 bytes will run at the full 100mbps and can
323	      wrap the sequence space in about 3 minutes, very close to the MSL
324	      of TCP.  Thus, high speed alone can cause a reliability problem
325	      with sequence number wrap-around, even without extended windows.

327	      Watson's Delta-T protocol [Watson81] includes network-layer
328	      mechanisms for precise enforcement of an MSL.  In contrast, the IP
329	      mechanism for MSL enforcement is loosely defined and even more
330	      loosely implemented in the Internet.  Therefore, it is unwise to
331	      depend upon active enforcement of MSL for TCP connections, and it
332	      is unrealistic to imagine setting MSL's smaller than the current
333	      values (e.g., 120 seconds specified for TCP).

335	      A possible fix for the problem of cycling the sequence space would
336	      be to increase the size of the TCP sequence number field.  For
337	      example, the sequence number field (and also the acknowledgment
338	      field) could be expanded to 64 bits.  This could be done either by
339	      changing the TCP header or by means of an additional option.

341	      Section 4 presents a different mechanism, which we call PAWS
342	      (Protect Against Wrapped Sequence numbers), to extend TCP
343	      reliability to transfer rates well beyond the foreseeable upper
344	      limit of network bandwidths.  PAWS uses the TCP Timestamps option
345	      defined in Section 3 to protect against old duplicates from the
346	      same connection.

348	   1.3 Using TCP options

350	      The extensions defined in this memo all use new TCP options.  We
351	      must address two possible issues concerning the use of TCP
352	      options: (1) compatibility and (2) overhead.

354	      We must pay careful attention to compatibility, i.e., to
355	      interoperation with existing implementations.  The only TCP option
356	      defined previously, MSS, may appear only on a SYN segment.  Every
357	      implementation should (and we expect that most will) ignore
358	      unknown options on SYN segments.  When RFC 1323 was published,
359	      there was concern that some buggy TCP implementation might be
360	      crashed by the first appearance of an option on a non-SYN segment.
361	      However, bugs like that can lead to DoS attacks against a TCP, so
362	      it is now expected that most TCP implementations will properly
363	      handle unknown options on non-SYN segments.  But it is still
364	      prudent to be conservative in what you send, and avoiding buggy
365	      TCP implementations is not the only reason for negotiating TCP
366	      options on SYN segments.  Therefore, for each of the extensions
367	      defined below, TCP options will be sent on non-SYN segments only
368	      after an exchange of options on the SYN segments has indicated
369	      that both sides understand the extension.  Furthermore, an
370	      extension option will be sent in a <SYN,ACK> segment only if the
371	      corresponding option was received in the initial <SYN> segment.

373	      A question may be raised about the bandwidth and processing
374	      overhead for TCP options.  Those options that occur on SYN
375	      segments are not likely to cause a performance concern.  Opening a
376	      TCP connection requires execution of significant special-case
377	      code, and the processing of options is unlikely to increase that
378	      cost significantly.

380	      On the other hand, a Timestamps option may appear in any data or
381	      ACK segment, adding 12 bytes to the 20-byte TCP header.  We
382	      believe that the bandwidth saved by reducing unnecessary
383	      retransmissions will more than pay for the extra header bandwidth.

385	      There is also an issue about the processing overhead for parsing
386	      the variable byte-aligned format of options, particularly with a
387	      RISC-architecture CPU.  Appendix A contains a recommended layout
388	      of the options in TCP headers to achieve reasonable data field
389	      alignment.  In the spirit of Header Prediction, a TCP can quickly
390	      test for this layout and if it is verified then use a fast path.
391	      Hosts that use this canonical layout will effectively use the
392	      options as a set of fixed-format fields appended to the TCP
393	      header.  However, to retain the philosophical and protocol
394	      framework of TCP options, a TCP must be prepared to parse an
395	      arbitrary options field, albeit with less efficiency.
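As an illustration of the general (slow path) case, the following Python sketch walks an arbitrary options field. It is not taken from this memo: the function name is hypothetical, and it relies only on the RFC 793 option encoding, in which kind 0 (End of Option List) and kind 1 (No-Operation) are single bytes and every other option carries a length byte that counts the kind and length bytes themselves.

```python
KIND_EOL = 0  # End of Option List (single byte)
KIND_NOP = 1  # No-Operation, used for padding (single byte)


def parse_tcp_options(data: bytes) -> list:
    """Return a list of (kind, value_bytes) pairs from a TCP options field.

    Unknown kinds are returned like any other option so that the caller
    can simply ignore them.
    """
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == KIND_EOL:
            break                      # nothing meaningful follows
        if kind == KIND_NOP:
            i += 1                     # padding byte, no length field
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option: missing length byte")
        length = data[i + 1]
        if length < 2 or i + length > len(data):
            raise ValueError("malformed option length")
        options.append((kind, data[i + 2:i + length]))
        i += length
    return options


# MSS (kind 2), one NOP of padding, then Window Scale (kind 3) with shift 7.
print(parse_tcp_options(bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7])))
```

A fast path in the spirit of Header Prediction would compare the options bytes against the expected fixed layout first and fall back to a loop like this one only when the comparison fails.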

397	      Finally, we observe that most of the mechanisms defined in this
398	      memo are important for LFN's and/or very high-speed networks.  For
399	      low-speed networks, it might be a performance optimization to NOT
400	      use these mechanisms.  A TCP vendor concerned about optimal
401	      performance over low-speed paths might consider turning these
402	      extensions off for low-speed paths, or allow a user or
403	      installation manager to disable them.

405	2. TCP WINDOW SCALE OPTION

407	   2.1  Introduction

409	      The window scale extension expands the definition of the TCP
410	      window to 32 bits and then uses a scale factor to carry this
411	      32-bit value in the 16-bit Window field of the TCP header (SEG.WND
412	      in RFC 793).  The scale factor is carried in a new TCP option,
413	      Window Scale.  This option is sent only in a SYN segment (a
414	      segment with the SYN bit on), hence the window scale is fixed in
415	      each direction when a connection is opened.  (Another design
416	      choice would be to specify the window scale in every TCP segment.
417	      It would be incorrect to send a window scale option only when the
418	      scale factor changed, since a TCP option in an acknowledgement
419	      segment will not be delivered reliably (unless the ACK happens to
420	      be piggy-backed on data in the other direction).  Fixing the scale
421	      when the connection is opened has the advantage of lower overhead
422	      but the disadvantage that the scale factor cannot be changed
423	      during the connection.)

425	      The maximum receive window, and therefore the scale factor, is
426	      determined by the maximum receive buffer space.  In a typical
427	      modern implementation, this maximum buffer space is set by default
428	      but can be overridden by a user program before a TCP connection is
429	      opened.  This determines the scale factor, and therefore no new
430	      user interface is needed for window scaling.
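One way to derive the scale factor from the receive buffer size is to pick the smallest shift count that lets the 16-bit Window field cover the whole buffer. The Python sketch below shows that policy. It is an assumption made for illustration, not a rule stated in this memo; the function name is hypothetical, and the upper limit of 14 is the one given in Section 2.3.

```python
MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value of the 16-bit Window field
MAX_SHIFT_CNT = 14               # upper limit from Section 2.3


def shift_for_buffer(buffer_bytes: int) -> int:
    """Smallest shift count whose scaled maximum window covers the buffer.

    Buffers too large to be covered even at the maximum shift simply
    get the maximum shift.
    """
    shift = 0
    while shift < MAX_SHIFT_CNT and (MAX_UNSCALED_WINDOW << shift) < buffer_bytes:
        shift += 1
    return shift


for size in (32 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    print(size, shift_for_buffer(size))
```

Choosing the smallest workable shift keeps the granularity of the advertised window as fine as possible.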

432	   2.2  Window Scale Option

434	      The three-byte Window Scale option may be sent in a SYN segment by
435	      a TCP.  It has two purposes: (1) indicate that the TCP is prepared
436	      to do both send and receive window scaling, and (2) communicate a
437	      scale factor to be applied to its receive window.  Thus, a TCP
438	      that is prepared to scale windows should send the option, even if
439	      its own scale factor is 1.  The scale factor is limited to a power
440	      of two and encoded logarithmically, so it may be implemented by
441	      binary shift operations.

443	      TCP Window Scale Option (WSopt):

445	         Kind: 3

447	         Length: 3 bytes

449	                +---------+---------+---------+
450	                | Kind=3  |Length=3 |shift.cnt|
451	                +---------+---------+---------+

453	         This option is an offer, not a promise; both sides must send
454	         Window Scale options in their SYN segments to enable window
455	         scaling in either direction.  If window scaling is enabled,
456	         then the TCP that sent this option will right-shift its true
457	         receive-window values by 'shift.cnt' bits for transmission in
458	         SEG.WND.  The value 'shift.cnt' may be zero (offering to scale,
459	         while applying a scale factor of 1 to the receive window).

461	         This option may be sent in an initial <SYN> segment (i.e., a
462	         segment with the SYN bit on and the ACK bit off).  It may also
463	         be sent in a <SYN,ACK> segment, but only if a Window Scale
464	         option was received in the initial <SYN> segment.  A Window
465	         Scale option in a segment without a SYN bit should be ignored.

467	         The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment
468	         itself is never scaled.
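The three-byte layout shown above can be encoded and decoded as in the following Python sketch. The function names are hypothetical. The decoder returns the raw shift.cnt and leaves any limit checking to the caller.

```python
WSOPT_KIND = 3
WSOPT_LENGTH = 3


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the three-byte Window Scale option: Kind, Length, shift.cnt."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return bytes([WSOPT_KIND, WSOPT_LENGTH, shift_cnt])


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a Window Scale option, validating Kind and Length."""
    if len(option) != WSOPT_LENGTH:
        raise ValueError("Window Scale option must be exactly 3 bytes")
    if option[0] != WSOPT_KIND or option[1] != WSOPT_LENGTH:
        raise ValueError("not a Window Scale option")
    return option[2]


wire = encode_wsopt(7)
print(wire.hex(), decode_wsopt(wire))
```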

470	   2.3  Using the Window Scale Option

472	      A model implementation of window scaling is as follows, using the
473	      notation of RFC 793 [Postel81]:

475	      *    All windows are treated as 32-bit quantities for storage in
476	           the connection control block and for local calculations.
477	           This includes the send-window (SND.WND) and the receive-
478	           window (RCV.WND) values, as well as the congestion window.

480	      *    The connection state is augmented by two window shift counts,
481	           Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the
482	           incoming and outgoing window fields, respectively.

484	      *    If a TCP receives a <SYN> segment containing a Window Scale
485	           option, it sends its own Window Scale option in the <SYN,ACK>
486	           segment.

488	      *    The Window Scale option is sent with shift.cnt = R, where R
489	           is the value that the TCP would like to use for its receive
490	           window.

492	      *    Upon receiving a SYN segment with a Window Scale option
493	           containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and
494	           sets Rcv.Wind.Scale to R; otherwise, it sets both
495	           Snd.Wind.Scale and Rcv.Wind.Scale to zero.

497	      *    The window field (SEG.WND) in the header of every incoming
498	           segment, with the exception of SYN segments, is left-shifted
499	           by Snd.Wind.Scale bits before updating SND.WND:

501	              SND.WND = SEG.WND << Snd.Wind.Scale

503	           (assuming the other conditions of RFC 793 are met, and using
504	           the "C" notation "<<" for left-shift).

506	      *    The window field (SEG.WND) of every outgoing segment, with
507	           the exception of SYN segments, is right-shifted by
508	           Rcv.Wind.Scale bits:

510	              SEG.WND = RCV.WND >> Rcv.Wind.Scale.
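The model implementation above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the class and method names are hypothetical, the sketch assumes the local TCP always offers its own Window Scale option, and clamping the outgoing field to 16 bits is a defensive addition that the list above does not mention.

```python
class WindowScaleState:
    """Per-connection window scaling state, using the names from the text."""

    def __init__(self, requested_shift: int, rcv_wnd: int = 0):
        self.requested_shift = requested_shift  # R, sent in our own option
        self.snd_wind_scale = 0                 # Snd.Wind.Scale
        self.rcv_wind_scale = 0                 # Rcv.Wind.Scale
        self.snd_wnd = 0                        # SND.WND, kept as a wide integer
        self.rcv_wnd = rcv_wnd                  # RCV.WND, kept as a wide integer

    def on_syn_received(self, peer_shift):
        """peer_shift is S from the peer's option, or None if it sent none."""
        if peer_shift is None:
            # Scaling needs both sides to offer it; otherwise both are zero.
            self.snd_wind_scale = 0
            self.rcv_wind_scale = 0
        else:
            self.snd_wind_scale = peer_shift
            self.rcv_wind_scale = self.requested_shift

    def on_segment_received(self, seg_wnd: int, is_syn: bool = False):
        """SND.WND = SEG.WND << Snd.Wind.Scale, except on SYN segments."""
        shift = 0 if is_syn else self.snd_wind_scale
        self.snd_wnd = seg_wnd << shift

    def outgoing_seg_wnd(self, is_syn: bool = False) -> int:
        """SEG.WND = RCV.WND >> Rcv.Wind.Scale, except on SYN segments."""
        shift = 0 if is_syn else self.rcv_wind_scale
        # Assumption: clamp so the value always fits the 16-bit field.
        return min(self.rcv_wnd >> shift, 0xFFFF)


state = WindowScaleState(requested_shift=7, rcv_wnd=1 << 20)
state.on_syn_received(peer_shift=3)
state.on_segment_received(seg_wnd=1000)
print(state.snd_wnd, state.outgoing_seg_wnd())
```

Keeping SND.WND and RCV.WND unscaled in the connection state means that only the two points where the header field is read or written need to know about the shift counts.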

512	      TCP determines if a data segment is "old" or "new" by testing
513	      whether its sequence number is within 2**31 bytes of the left edge
514	      of the window, and if it is not, discarding the data as "old".  To
515	      ensure that new data is never mistakenly considered old and vice-
516	      versa, the left edge of the sender's window has to be at most
517	      2**31 away from the right edge of the receiver's window.
518	      Similarly with the sender's right edge and receiver's left edge.
519	      Since the right and left edges of either the sender's or
520	      receiver's window differ by the window size, and since the sender
521	      and receiver windows can be out of phase by at most the window
522	      size, the above constraints imply that 2 * the max window size
523	      must be less than 2**31, or
524	           max window < 2**30

526	      Since the max window is 2**S (where S is the scaling shift count)
527	      times at most 2**16 - 1 (the maximum unscaled window), the maximum
528	      window is guaranteed to be < 2**30 if S <= 14.  Thus, the shift
529	      count must be limited to 14 (which allows windows of 2**30 = 1
530	      Gbyte).  If a Window Scale option is received with a shift.cnt
531	      value exceeding 14, the TCP should log the error but use 14
532	      instead of the specified value.
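The bound on the shift count can be checked with a few lines of C.  The sketch below is illustrative only; the function names and the `clamped` out-parameter (which lets the caller log the error) are assumptions of the example.

```c
#include <stdint.h>

#define MAX_SHIFT_CNT 14u /* limit stated in the text */

/* Return the shift count to use for a received shift.cnt.  *clamped is
 * set to 1 when the received value exceeded 14, so that the caller can
 * log the error as the text recommends. */
uint8_t accept_shift_cnt(uint8_t shift_cnt, int *clamped)
{
    *clamped = shift_cnt > MAX_SHIFT_CNT;
    return (uint8_t)(*clamped ? MAX_SHIFT_CNT : shift_cnt);
}

/* Largest window expressible with shift count s: (2**16 - 1) * 2**s. */
uint32_t max_window(uint8_t s)
{
    return (uint32_t)0xFFFFu << s;
}
```

For S = 14 the largest window is 65535 * 2**14 = 1073725440, which is below 2**30 = 1073741824; for S = 15 it would be 2147450880, which violates the bound.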

534	      The scale factor applies only to the Window field as transmitted
535	      in the TCP header; each TCP using extended windows will maintain
536	      the window values locally as 32-bit numbers.  For example, the
537	      "congestion window" computed by Slow Start and Congestion
538	      Avoidance is not affected by the scale factor, so window scaling
539	      will not introduce quantization into the congestion window.

541	      When a non-zero scale factor is in use, there are instances when a
542	      retracted window can be offered [Mathis08].  The end of the window
543	      will be on a boundary based on the granularity of the scale factor
544	      being used.  If the sequence number is then updated by a number of
545	      bytes smaller than that granularity, the TCP will have to either
546	      advertise a new window that is beyond what it previously advertised
547	      (and perhaps beyond the buffer), or will have to advertise a
548	      smaller window, which will cause the TCP window to shrink.
549	      Implementations should ensure that they handle a shrinking window,
550	      as specified in section 4.2.2.16 of RFC 1122 [Braden89].
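The following sketch shows, under assumed numbers, how scale-factor granularity forces the choice described above.  The function name and the `round_up` flag are hypothetical; the function computes the right window edge that the peer derives from an advertisement.

```c
#include <stdint.h>

/* Right edge the peer computes from an advertisement: RCV.NXT plus the
 * window field shifted back up.  round_up selects between rounding the
 * field up (edge may pass the buffer) and down (edge may retract). */
uint32_t advertised_right_edge(uint32_t rcv_nxt, uint32_t rcv_wnd,
                               uint8_t scale, int round_up)
{
    uint32_t field = rcv_wnd >> scale;
    if (round_up && (field << scale) != rcv_wnd)
        field++;
    return rcv_nxt + (field << scale);
}
```

With a shift count of 4 (granularity 16 bytes), RCV.NXT = 1000 and a 160-byte window give a right edge of 1160.  If 5 bytes then arrive and stay buffered, RCV.NXT = 1005 and the window is 155 bytes: rounding down advertises an edge of 1149 (a retracted window), while rounding up advertises 1165 (beyond what was previously offered).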

552	3.  RTTM: ROUND-TRIP TIME MEASUREMENT

554	   3.1  Introduction

556	      Accurate and current RTT estimates are necessary to adapt to
557	      changing traffic conditions and to avoid an instability known as
558	      "congestion collapse" [Nagle84] in a busy network.  However,
559	      accurate measurement of RTT may be difficult both in theory and in
560	      implementation.

562	      Many TCP implementations base their RTT measurements upon a sample
563	      of one packet per window or less.  While this yields an adequate
564	      approximation to the RTT for small windows, it results in an
565	      unacceptably poor RTT estimate for an LFN.  If we look at RTT
566	      estimation as a signal processing problem (which it is), a data
567	      signal at some frequency, the packet rate, is being sampled at a
568	      lower frequency, the window rate.  This lower sampling frequency
569	      violates Nyquist's criterion and may therefore introduce "aliasing"
570	      artifacts into the estimated RTT [Hamming77].

572	      A good RTT estimator with a conservative retransmission timeout
573	      calculation can tolerate aliasing when the sampling frequency is
574	      "close" to the data frequency.   For example, with a window of 8
575	      packets, the sample rate is 1/8 the data frequency -- less than an
576	      order of magnitude different.  However, when the window is tens or
577	      hundreds of packets, the RTT estimator may be seriously in error,
578	      resulting in spurious retransmissions.

580	      If there are dropped packets, the problem becomes worse.  Zhang
581	      [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is
582	      not possible to accumulate reliable RTT estimates if retransmitted
583	      segments are included in the estimate.  Since a full window of
584	      data will have been transmitted prior to a retransmission, all of
585	      the segments in that window will have to be ACKed before the next
586	      RTT sample can be taken.  This means at least an additional
587	      window's worth of time between RTT measurements and, as the error
588	      rate approaches one per window of data (e.g., 10**-6 errors per
589	      bit for the Wideband satellite network), it becomes effectively
590	      impossible to obtain a valid RTT measurement.

592	      A solution to these problems, which actually simplifies the sender
593	      substantially, is as follows: using TCP options, the sender places
594	      a timestamp in each data segment, and the receiver reflects these
595	      timestamps back in ACK segments.  Then a single subtract gives the
596	      sender an accurate RTT measurement for every ACK segment (which
597	      will correspond to every other data segment, with a sensible
598	      receiver).  We call this the RTTM (Round-Trip Time Measurement)
599	      mechanism.

601	      It is vitally important to use the RTTM mechanism with big
602	      windows; otherwise, the door is opened to some dangerous
603	      instabilities due to aliasing.  Furthermore, the option is
604	      probably useful for all TCP's, since it simplifies the sender.

606	   3.2  TCP Timestamps Option

608	      TCP is a symmetric protocol, allowing data to be sent at any time
609	      in either direction, and therefore timestamp echoing may occur in
610	      either direction.  For simplicity and symmetry, we specify that
611	      timestamps always be sent and echoed in both directions.  For
612	      efficiency, we combine the timestamp and timestamp reply fields
613	      into a single TCP Timestamps Option.

615	      TCP Timestamps Option (TSopt):

617	         Kind: 8

619	         Length: 10 bytes

621	          +-------+-------+---------------------+---------------------+
622	          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
623	          +-------+-------+---------------------+---------------------+
624	              1       1              4                     4

626	         The Timestamps option carries two four-byte timestamp fields.
627	         The Timestamp Value field (TSval) contains the current value of
628	         the timestamp clock of the TCP sending the option.

630	         The Timestamp Echo Reply field (TSecr) is valid if the ACK bit
631	         is set in the TCP header; if it is valid, it echoes a timestamp
632	         value that was sent by the remote TCP in the TSval field of a
633	         Timestamps option.  When TSecr is not valid, its value must be
634	         zero.  The TSecr value will generally be from the most recent
635	         Timestamp option that was received; however, there are
636	         exceptions that are explained below.

638	         A TCP may send the Timestamps option (TSopt) in an initial
639	         <SYN> segment (i.e., a segment containing a SYN bit and no ACK
640	         bit), and may send a TSopt in other segments only if it
641	         received a TSopt in the initial <SYN> or <SYN,ACK> segment for
642	         the connection.  Once a TSopt has been sent or received in a
643	         non <SYN> segment, it must be sent in all segments.  Once a
644	         TSopt has been received in a non <SYN> segment, then any
645	         successive segment that is received without the RST bit and
646	         without a TSopt may be dropped without further processing, and an
647	         ACK of the current SND.UNA generated.

649	         In the case of crossing SYN packets where one SYN contains a
650	         TSopt and the other doesn't, both sides should put a TSopt in
651	         the <SYN,ACK> segment.

653	   3.3 The RTTM Mechanism

655	      RTTM places a Timestamps option in every segment, with a TSval
656	      that is obtained from a (virtual) "timestamp clock".  Values of
657	      this clock must be at least approximately proportional to
658	      real time, in order to measure actual RTT.

660	      These TSval values are echoed in TSecr values in the reverse
661	      direction.  The difference between a received TSecr value and the
662	      current timestamp clock value provides an RTT measurement.

664	      When timestamps are used, every segment that is received will
665	      contain a TSecr value; however, these values cannot all be used to
666	      update the measured RTT.  The following example illustrates why.
667	      It shows a one-way data flow with segments arriving in sequence
668	      without loss.  Here A, B, C... represent data blocks occupying
669	      successive blocks of sequence numbers, and ACK(A),...  represent
670	      the corresponding cumulative acknowledgments.  The two timestamp
671	      fields of the Timestamps option are shown symbolically as
672	      <TSval=x,TSecr=y>.  Each TSecr field contains the value most recently
673	      received in a TSval field.

675	         TCP  A                                          TCP B

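677	                         <A,TSval=1,TSecr=120> ------>

679	             <---- <ACK(A),TSval=127,TSecr=1>

681	                         <B,TSval=5,TSecr=127> ------>

683	             <---- <ACK(B),TSval=131,TSecr=5>

685	             . . . . . . . . . . . . . . . . . . . . . .

687	                         <C,TSval=65,TSecr=131> ------>

689	             <---- <ACK(C),TSval=191,TSecr=65>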

691	                        (etc)

693	      The dotted line marks a pause (60 time units long) in which A had
694	      nothing to send.  Note that this pause inflates the RTT which B
695	      could infer from receiving TSecr=131 in data segment C.  Thus, in
696	      one-way data flows, RTTM in the reverse direction measures a value
697	      that is inflated by gaps in sending data.  However, the following
698	      rule prevents a resulting inflation of the measured RTT:

700	           RTTM Rule: A TSecr value received in a segment is used to
701	           update the averaged RTT measurement only if the segment
702	           acknowledges some new data, i.e., only if it advances the
703	           left edge of the send window.

705	      Since TCP B is not sending data, the data segment C does not
706	      acknowledge any new data when it arrives at B.  Thus, the inflated
707	      RTTM measurement is not used to update B's RTTM measurement.
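A minimal C sketch of the RTTM Rule follows.  The struct, the function name, and the omission of the usual check that the ACK does not exceed SND.NXT are assumptions made to keep the example short.

```c
#include <stdint.h>

/* Hypothetical sender state: only the left edge of the send window. */
struct rttm_state {
    uint32_t snd_una; /* SND.UNA */
};

/* Apply the RTTM Rule to an arriving segment.  Returns 1 and stores the
 * sample (in timestamp-clock ticks) in *rtt only when SEG.ACK advances
 * the left edge of the send window; otherwise returns 0 and leaves
 * *rtt untouched.  'now' is the local timestamp clock. */
int rttm_sample(struct rttm_state *s, uint32_t seg_ack,
                uint32_t seg_tsecr, uint32_t now, uint32_t *rtt)
{
    if ((int32_t)(seg_ack - s->snd_una) <= 0)
        return 0;              /* no new data acknowledged */
    s->snd_una = seg_ack;
    *rtt = now - seg_tsecr;    /* the "single subtract" */
    return 1;
}
```

A segment that carries an old TSecr but acknowledges nothing new, such as data segment C arriving at TCP B in the example above, therefore produces no sample.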

709	      Implementors should note that with Timestamps multiple RTTMs can
710	      be taken per RTT.  Many RTO estimators have a weighting factor
711	      based on an implicit assumption that at most one RTTM will be
712	      gotten per RTT.  When using multiple RTTMs per RTT to update the
713	      RTO estimator, the weighting factor needs to be decreased to take
714	      into account the more frequent RTTMs.  For example, an
715	      implementation could choose to just use one sample per RTT to
716	      update the RTO estimator, or vary the gain based on the
717	      congestion window, or take an average of all the RTTM measurements
718	      received over one RTT, and then use that value to update the RTO
719	      estimator.  This document does not prescribe any particular method
720	      for modifying the RTO estimator; the important point is that the
721	      implementation should do something more than just feeding
722	      additional RTTM samples from one RTT into the RTO estimator.
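One of the options mentioned above, averaging all RTTM samples received over one RTT, might be sketched as follows.  This is only one possible approach, and the struct and function names are hypothetical; how the caller decides that one RTT has elapsed is left out.

```c
#include <stdint.h>

/* Accumulator for the RTTM samples collected during one RTT. */
struct rtt_batch {
    uint64_t sum;
    uint32_t count;
};

/* Record one RTTM sample (in timestamp-clock ticks). */
void rtt_batch_add(struct rtt_batch *b, uint32_t sample)
{
    b->sum += sample;
    b->count++;
}

/* Called once per RTT.  If any samples were collected, stores their
 * mean in *avg, resets the accumulator, and returns 1; the caller then
 * feeds *avg to its RTO estimator as a single measurement. */
int rtt_batch_flush(struct rtt_batch *b, uint32_t *avg)
{
    if (b->count == 0)
        return 0;
    *avg = (uint32_t)(b->sum / b->count);
    b->sum = 0;
    b->count = 0;
    return 1;
}
```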

724	   3.4  Which Timestamp to Echo
725	      If more than one Timestamps option is received before a reply
726	      segment is sent, the TCP must choose only one of the TSvals to
727	      echo, ignoring the others.  To minimize the state kept in the
728	      receiver (i.e., the number of unprocessed TSvals), the receiver
729	      should be required to retain at most one timestamp in the
730	      connection control block.

732	      There are three situations to consider:

734	      (A)  Delayed ACKs.

736	           Many TCP's acknowledge only every Kth segment out of a group
737	           of segments arriving within a short time interval; this
738	           policy is known generally as "delayed ACKs".  The data-sender
739	           TCP must measure the effective RTT, including the additional
740	           time due to delayed ACKs, or else it will retransmit
741	           unnecessarily.  Thus, when delayed ACKs are in use, the
742	           receiver should reply with the TSval field from the earliest
743	           unacknowledged segment.

745	      (B)  A hole in the sequence space (segment(s) have been lost).

747	           The sender will continue sending until the window is filled,
748	           and the receiver may be generating ACKs as these out-of-order
749	           segments arrive (e.g., to aid "fast retransmit").

751	           The lost segment is probably a sign of congestion, and in
752	           that situation the sender should be conservative about
753	           retransmission.  Furthermore, it is better to overestimate
754	           than underestimate the RTT.  An ACK for an out-of-order
755	           segment should therefore contain the timestamp from the most
756	           recent segment that advanced the window.

758	           The same situation occurs if segments are re-ordered by the
759	           network.

761	      (C)  A filled hole in the sequence space.

763	           The segment that fills the hole represents the most recent
764	           measurement of the network characteristics.  On the other
765	           hand, an RTT computed from an earlier segment would probably
766	           include the sender's retransmit time-out, badly biasing the
767	           sender's average RTT estimate.  Thus, the timestamp from the
768	           latest segment (which filled the hole) must be echoed.

770	      An algorithm that covers all three cases is described in the
771	      following rules for Timestamps option processing on a synchronized
772	      connection:

774	      (1)  The connection state is augmented with two 32-bit slots:

776	           TS.Recent holds a timestamp to be echoed in TSecr whenever a
777	           segment is sent, and Last.ACK.sent holds the ACK field from
778	           the last segment sent.  Last.ACK.sent will equal RCV.NXT
779	           except when ACKs have been delayed.

781	      (2)  If:

783	              SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

785	           then SEG.TSval is copied to TS.Recent; otherwise, it is
786	           ignored.

788	      (3)  When a TSopt is sent, its TSecr field is set to the current
789	           TS.Recent value.
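Rules (1) through (3) can be sketched in C as shown below.  The struct and function names are assumptions of this example, and the comparisons use modular 32-bit arithmetic.

```c
#include <stdint.h>

/* Rule (1): two 32-bit slots added to the connection state. */
struct ts_state {
    uint32_t ts_recent;     /* TS.Recent */
    uint32_t last_ack_sent; /* Last.ACK.sent */
};

/* Rule (2): copy SEG.TSval to TS.Recent if SEG.TSval >= TS.Recent and
 * SEG.SEQ <= Last.ACK.sent; otherwise ignore it. */
void ts_on_receive(struct ts_state *s, uint32_t seg_seq, uint32_t seg_tsval)
{
    if ((int32_t)(seg_tsval - s->ts_recent) >= 0 &&
        (int32_t)(seg_seq - s->last_ack_sent) <= 0)
        s->ts_recent = seg_tsval;
}

/* Rule (3): when a segment is sent, TSecr is the current TS.Recent.
 * The ACK field of the outgoing segment is remembered for rule (2). */
uint32_t ts_on_send(struct ts_state *s, uint32_t ack_field)
{
    s->last_ack_sent = ack_field;
    return s->ts_recent;
}
```

As a usage example, assume 100-byte segments A through E with sequence numbers 0, 100, ..., 400 and TSval 1 through 5, arriving in the order A, C, B, E, D with every packet acknowledged.  TS.Recent then takes the values 1, 1, 2, 2, 4, so the hole-filling segments B and D are the ones whose timestamps get echoed.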

791	      The following examples illustrate these rules.  Here A, B, C...
792	      represent data segments occupying successive blocks of sequence
793	      numbers, and ACK(A),...  represent the corresponding
794	      acknowledgment segments.  Note that ACK(A) has the same sequence
795	      number as B.  We show only one direction of timestamp echoing, for
796	      clarity.

798	      o    Packets arrive in sequence, and some of the ACKs are delayed.

800	           By Case (A), the timestamp from the oldest unacknowledged
801	           segment is echoed.

803	                                                      TS.Recent
804	                     <A, TSval=1> ------------------->
805	                                                          1
806	                     <B, TSval=2> ------------------->
807	                                                          1
808	                     <C, TSval=3> ------------------->
809	                                                          1
810	                             <---- <ACK(C), TSecr=1>
811	                    (etc)

813	      o    Packets arrive out of order, and every packet is
814	           acknowledged.

816	           By Case (B), the timestamp from the last segment that
817	           advanced the left window edge is echoed, until the missing
818	           segment arrives; it is echoed according to Case (C).  The
819	           same sequence would occur if segments B and D were lost and
820	           retransmitted.

822	                                                      TS.Recent
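823	                     <A, TSval=1> ------------------->
824	                                                          1
825	                             <---- <ACK(A), TSecr=1>
826	                                                          1
827	                     <C, TSval=3> ------------------->
828	                                                          1
829	                             <---- <ACK(A), TSecr=1>
830	                                                          1
831	                     <B, TSval=2> ------------------->
832	                                                          2
833	                             <---- <ACK(C), TSecr=2>
834	                                                          2
835	                     <E, TSval=5> ------------------->
836	                                                          2
837	                             <---- <ACK(C), TSecr=2>
838	                                                          2
839	                     <D, TSval=4> ------------------->
840	                                                          4
841	                             <---- <ACK(E), TSecr=4>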
842	                    (etc)

844	4.  PAWS: PROTECTION AGAINST WRAPPED SEQUENCE NUMBERS

846	   4.1  Introduction

848	      Section 4.2 describes a simple mechanism to reject old duplicate
849	      segments that might corrupt an open TCP connection; we call this
850	      mechanism PAWS (Protection Against Wrapped Sequence numbers).
851	      PAWS operates within a single TCP connection, using state that is
852	      saved in the connection control block.  Section 4.3 and Appendix C
853	      discuss the implications of the PAWS mechanism for avoiding old
854	      duplicates from previous incarnations of the same connection.

856	   4.2  The PAWS Mechanism

858	      PAWS uses the same TCP Timestamps option as the RTTM mechanism
859	      described earlier, and assumes that every received TCP segment
860	      (including data and ACK segments) contains a timestamp SEG.TSval
861	      whose values are monotonically non-decreasing in time.  The basic
862	      idea is that a segment can be discarded as an old duplicate if it
863	      is received with a timestamp SEG.TSval less than some timestamp
864	      recently received on this connection.

866	      In both the PAWS and the RTTM mechanism, the "timestamps" are
867	      32-bit unsigned integers in a modular 32-bit space.  Thus, "less
868	      than" is defined the same way it is for TCP sequence numbers, and
869	      the same implementation techniques apply.  If s and t are
870	      timestamp values, s < t if 0 < (t - s) < 2**31, computed in
871	      unsigned 32-bit arithmetic.
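The modular "less than" defined above translates directly into C.  The function name `ts_lt` is an assumption of this sketch.

```c
#include <stdint.h>

/* s < t in modular 32-bit space: true iff 0 < (t - s) < 2**31, with
 * the subtraction carried out in unsigned 32-bit arithmetic. */
int ts_lt(uint32_t s, uint32_t t)
{
    uint32_t d = t - s;
    return d != 0 && d < 0x80000000u;
}
```

For example, `ts_lt(0xFFFFFFF0, 5)` is true because the difference wraps to 21, whereas `ts_lt(5, 5)` and `ts_lt(0, 0x80000000)` are both false.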

873	      The choice of incoming timestamps to be saved for this comparison
874	      must guarantee a value that is monotonically increasing.  For
875	      example, we might save the timestamp from the segment that last
876	      advanced the left edge of the receive window, i.e., the most
877	      recent in-sequence segment.  Instead, we choose the value
878	      TS.Recent introduced in Section 3.4 for the RTTM mechanism, since
879	      using a common value for both PAWS and RTTM simplifies the
880	      implementation of both.  As Section 3.4 explained, TS.Recent
881	      differs from the timestamp from the last in-sequence segment only
882	      in the case of delayed ACKs, and therefore by less than one
883	      window.  Either choice will therefore protect against sequence
884	      number wrap-around.

886	      RTTM was specified in a symmetrical manner, so that TSval
887	      timestamps are carried in both data and ACK segments and are
888	      echoed in TSecr fields carried in returning ACK or data segments.
889	      PAWS submits all incoming segments to the same test, and therefore
890	      protects against duplicate ACK segments as well as data segments.
891	      (An alternative non-symmetric algorithm would protect against old
892	      duplicate ACKs: the sender of data would reject incoming ACK
893	      segments whose TSecr values were less than the TSecr saved from
894	      the last segment whose ACK field advanced the left edge of the
895	      send window.  This algorithm was deemed to lack economy of
896	      mechanism and symmetry.)

898	      TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to
899	      initialize PAWS.  PAWS protects against old duplicate non-SYN
900	      segments, and duplicate SYN segments received while there is a
901	      synchronized connection.  Duplicate {SYN} and {SYN,ACK} segments
902	      received when there is no connection will be discarded by the
903	      normal 3-way handshake and sequence number checks of TCP.

905	      RFC 1323 recommended that RST segments NOT carry timestamps, and
906	      that they be acceptable regardless of their timestamp.  At that
907	      time, the thinking was that old duplicate RST segments should be
908	      exceedingly unlikely, and their cleanup function should take
909	      precedence over timestamps.  More recently, discussion about
910	      various blind attacks on TCP connections has raised the
911	      suggestion that if the Timestamps option is present, SEG.TSecr
912	      could be used to provide stricter acceptance tests for RST
913	      packets.  While this is still under discussion, to enable
914	      research into this area it is now recommended that a RST also
915	      contain a Timestamps option if the packet that caused the RST
916	      to be generated contained one.
917	      In the RST segment, SEG.TSecr should be set to SEG.TSval from the
918	      incoming packet and SEG.TSval should be set to zero.  If a RST is
919	      being generated because of a user abort, and Snd.TS.OK is set,
920	      then a Timestamps option should be included in the RST.  When a
921	      RST packet is received, it must not be subjected to PAWS checks,
922	      and information from the Timestamps option must not be used to
923	      update connection state information.  SEG.TSecr may be used to
924	      provide stricter RST acceptance checks.

926	      4.2.1  Basic PAWS Algorithm

928	         The PAWS algorithm requires the following processing to be
929	         performed on all incoming segments for a synchronized
930	         connection:

932	         R1)  If there is a Timestamps option in the arriving segment,
933	              SEG.TSval < TS.Recent, TS.Recent is valid (see later
934	              discussion) and the RST bit is not set, then treat the
935	              arriving segment as not acceptable:

937	                   Send an acknowledgement in reply as specified in RFC
938	                   793 page 69 and drop the segment.

940	                   Note: it is necessary to send an ACK segment in order
941	                   to retain TCP's mechanisms for detecting and
942	                   recovering from half-open connections.  For example,
943	                   see Figure 10 of RFC 793.

945	         R2)  If the segment is outside the window, reject it (normal
946	              TCP processing)

948	         R3)  If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent
949	              (see Section 3.4), then record its timestamp in TS.Recent.

951	         R4)  If an arriving segment is in-sequence (i.e., at the left
952	              window edge), then accept it normally.

954	         R5)  Otherwise, treat the segment as a normal in-window, out-
955	              of-sequence TCP segment (e.g., queue it for later delivery
956	              to the user).

958	         Steps R2, R4, and R5 are the normal TCP processing steps
959	         specified by RFC 793.
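The order of checks R1 through R5 can be sketched in C as follows.  All type and function names are hypothetical, the window test in R2 is simplified to the first sequence number of the segment, and updating RCV.NXT and Last.ACK.sent is left to the caller.

```c
#include <stdint.h>

enum paws_verdict {
    PAWS_DROP_AND_ACK,  /* R1: old duplicate, send ACK and drop */
    PAWS_OUT_OF_WINDOW, /* R2: normal out-of-window rejection */
    PAWS_IN_SEQUENCE,   /* R4: accept normally */
    PAWS_QUEUE          /* R5: in-window, out-of-sequence */
};

struct paws_tcb {
    uint32_t ts_recent;       /* TS.Recent */
    int      ts_recent_valid; /* see the discussion of outdated timestamps */
    uint32_t last_ack_sent;   /* Last.ACK.sent */
    uint32_t rcv_nxt;         /* RCV.NXT */
    uint32_t rcv_wnd;         /* RCV.WND */
};

struct paws_seg {
    uint32_t seq;    /* SEG.SEQ */
    uint32_t tsval;  /* SEG.TSval, meaningful only if has_ts */
    int      has_ts; /* Timestamps option present */
    int      rst;    /* RST bit set */
};

/* Modular comparison: s < t iff 0 < (t - s) < 2**31. */
static int ts_lt(uint32_t s, uint32_t t)
{
    uint32_t d = t - s;
    return d != 0 && d < 0x80000000u;
}

enum paws_verdict paws_process(struct paws_tcb *c, const struct paws_seg *seg)
{
    /* R1: timestamp older than TS.Recent */
    if (seg->has_ts && !seg->rst && c->ts_recent_valid &&
        ts_lt(seg->tsval, c->ts_recent))
        return PAWS_DROP_AND_ACK;

    /* R2: simplified window test (an assumption of this sketch) */
    if (seg->seq - c->rcv_nxt >= c->rcv_wnd)
        return PAWS_OUT_OF_WINDOW;

    /* R3: record the timestamp if SEG.SEQ <= Last.ACK.sent */
    if (seg->has_ts && !seg->rst &&
        (int32_t)(seg->seq - c->last_ack_sent) <= 0) {
        c->ts_recent = seg->tsval;
        c->ts_recent_valid = 1;
    }

    /* R4 and R5 */
    return seg->seq == c->rcv_nxt ? PAWS_IN_SEQUENCE : PAWS_QUEUE;
}
```

Note that the timestamp is examined only in R1 and R3, at arrival time; segments that were queued under R5 are not checked again when a later segment fills the hole.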

961	         It is important to note that the timestamp is checked only when
962	         a segment first arrives at the receiver, regardless of whether
963	         it is in-sequence or it must be queued for later delivery.

965	         Consider the following example.

967	              Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has
968	              been sent, where the letter indicates the sequence number
969	              and the digit represents the timestamp.  Suppose also that
970	              segment B.1 has been lost.  The timestamp in TS.Recent is
971	              1 (from A.1), so C.1, ..., Z.1 are considered acceptable
972	              and are queued.  When B is retransmitted as segment B.2
973	              (using the latest timestamp), it fills the hole and causes
974	              all the segments through Z to be acknowledged and passed
975	              to the user.  The timestamps of the queued segments are
976	              *not* inspected again at this time, since they have
977	              already been accepted.  When B.2 is accepted, TS.Recent is
978	              set to 2.

980	         This rule allows reasonable performance under loss.  A full
981	         window of data is in transit at all times, and after a loss a
982	         full window less one packet will show up out-of-sequence to be
983	         queued at the receiver (e.g., up to ~2**30 bytes of data); the
984	         timestamp option must not result in discarding this data.

986	         In certain unlikely circumstances, the algorithm of rules R1-R5
987	         could lead to discarding some segments unnecessarily, as shown
988	         in the following example:

990	              Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have
991	              been sent in sequence and that segment B.1 has been lost.
992	              Furthermore, suppose delivery of some of C.1, ... Z.1 is
993	              delayed until AFTER the retransmission B.2 arrives at the
994	              receiver.  These delayed segments will be discarded
995	              unnecessarily when they do arrive, since their timestamps
996	              are now out of date.

998	         This case is very unlikely to occur.  If the retransmission was
999	         triggered by a timeout, some of the segments C.1, ... Z.1 must
1000	         have been delayed longer than the RTO time.  This is presumably
1001	         an unlikely event, or there would be many spurious timeouts and
1002	         retransmissions.  If B's retransmission was triggered by the
1003	         "fast retransmit" algorithm, i.e., by duplicate ACKs, then the
1004	         queued segments that caused these ACKs must have been received
1005	         already.

1007	         Even if a segment were delayed past the RTO, the Fast
1008	         Retransmit mechanism [Jacobson90c] will cause the delayed
1009	         packets to be retransmitted at the same time as B.2, avoiding
1010	         an extra RTT and therefore causing a very small performance
1011	         penalty.

1013	         We know of no case with a significant probability of occurrence
1014	         in which timestamps will cause performance degradation by
1015	         unnecessarily discarding segments.

1017	      4.2.2  Timestamp Clock

1019	         It is important to understand that the PAWS algorithm does not
1020	         require clock synchronization between sender and receiver.  The
1021	         sender's timestamp clock is used to stamp the segments, and the
1022	         sender uses the echoed timestamp to measure RTT's.  However,
1023	         the receiver treats the timestamp as simply a monotonically
1024	         increasing serial number, without any necessary connection to
1025	         its clock.  From the receiver's viewpoint, the timestamp is
1026	         acting as a logical extension of the high-order bits of the
1027	         sequence number.

1029	         The receiver algorithm does place some requirements on the
1030	         frequency of the timestamp clock.

1032	         (a)  The timestamp clock must not be "too slow".

1034	              It must tick at least once for each 2**31 bytes sent.  In
1035	              fact, in order to be useful to the sender for round trip
1036	              timing, the clock should tick at least once per window's
1037	              worth of data, and even with the window extension defined
1038	              in Section 2.2, 2**31 bytes must be at least two windows.

1040	              To make this more quantitative, any clock faster than 1
1041	              tick/sec will reject old duplicate segments for link
1042	              speeds of ~8 Gbps.  A 1ms timestamp clock will work at
1043	              link speeds up to 8 Tbps (8*10**12 bps)!

1045	         (b)  The timestamp clock must not be "too fast".

1047	              Its recycling time must be greater than MSL seconds.
1048	              Since the clock (timestamp) is 32 bits and the worst-case
1049	              MSL is 255 seconds, the maximum acceptable clock frequency
1050	              is one tick every 59 ns.

1052	              However, it is desirable to establish a much longer
1053	              recycle period, in order to handle outdated timestamps on
1054	              idle connections (see Section 4.2.3), and to relax the MSL
1055	              requirement for preventing sequence number wrap-around.
1056	              With a 1 ms timestamp clock, the 32-bit timestamp will
1057	              wrap its sign bit in 24.8 days.  Thus, it will reject old
1058	              duplicates on the same connection if MSL is 24.8 days or
1059	              less.  This appears to be a very safe figure; an MSL of
1060	              24.8 days or longer can probably be assumed by the gateway
1061	              system without requiring precise MSL enforcement by the
1062	              TTL value in the IP layer.

1064	         Based upon these considerations, we choose a timestamp clock
1065	         frequency in the range 1 ms to 1 sec per tick.  This range also
1066	         matches the requirements of the RTTM mechanism, which does not
1067	         need much more resolution than the granularity of the
1068	         retransmit timer, e.g., tens or hundreds of milliseconds.
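The clock-rate figures quoted in (b) can be reproduced with the short calculation below.  The function names are assumptions of this sketch.

```c
/* Seconds until a 32-bit timestamp clock wraps its sign bit, i.e.,
 * 2**31 ticks of the given period. */
double sign_wrap_seconds(double tick_seconds)
{
    return 2147483648.0 * tick_seconds;
}

/* Tick period (in seconds) at which the full 32-bit recycling time
 * equals the given MSL; the clock must tick no faster than this. */
double min_tick_seconds(double msl_seconds)
{
    return msl_seconds / 4294967296.0;
}
```

For an MSL of 255 seconds the limit is 255 / 2**32, roughly 59 ns per tick.  A 1 ms clock wraps its sign bit after 2**31 ms, roughly 24.8 days, and a 1 sec clock after roughly 24800 days.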

1070	         The PAWS mechanism also puts a strong monotonicity requirement
1071	         on the sender's timestamp clock.  The method of implementation
1072	         of the timestamp clock to meet this requirement depends upon
1073	         the system hardware and software.

1075	         *    Some hosts have a hardware clock that is guaranteed to be
1076	              monotonic between hardware resets.

1078	         *    A clock interrupt may be used to simply increment a binary
1079	              integer by 1 periodically.

1081	         *    The timestamp clock may be derived from a system clock
1082	              that is subject to being abruptly changed, by adding a
1083	              variable offset value.  This offset is initialized to
1084	              zero.  When a new timestamp clock value is needed, the
1085	              offset can be adjusted as necessary to make the new value
1086	              equal to or larger than the previous value (which was
1087	              saved for this purpose).
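The variable-offset technique in the last item above might look like the following.  This is a sketch under assumptions: the struct, the function name, and the idea that the caller supplies the system clock reading already converted to timestamp ticks are all hypothetical.

```c
#include <stdint.h>

/* State for a timestamp clock derived from an adjustable system clock. */
struct ts_clock {
    uint32_t offset;  /* variable offset, initialized to zero */
    uint32_t last;    /* previous timestamp clock value */
    int      started; /* 0 until the first value has been produced */
};

/* Return a timestamp clock value that never runs backwards.  If the
 * system clock was set back, the offset grows so that the new value
 * equals the previous one. */
uint32_t ts_clock_read(struct ts_clock *c, uint32_t sys_ticks)
{
    uint32_t v = sys_ticks + c->offset;
    if (c->started && (int32_t)(v - c->last) < 0) {
        c->offset += c->last - v;
        v = c->last;
    }
    c->last = v;
    c->started = 1;
    return v;
}
```

If the system clock reads 1000, then 1005, and is then set back to 200, the function returns 1000, 1005, and 1005; a later system reading of 210 yields 1015.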

1089	      4.2.3  Outdated Timestamps

1091	         If a connection remains idle long enough for the timestamp
1092	         clock of the other TCP to wrap its sign bit, then the value
1093	         saved in TS.Recent will become too old; as a result, the PAWS
1094	         mechanism will cause all subsequent segments to be rejected,
1095	         freezing the connection (until the timestamp clock wraps its
1096	         sign bit again).

1098	         With the chosen range of timestamp clock frequencies (1 sec to
1099	         1 ms), the time to wrap the sign bit will be between 24.8 days
1100	         and 24800 days.  A TCP connection that is idle for more than 24
1101	         days and then comes to life is exceedingly unusual.  However,
1102	         it is undesirable in principle to place any limitation on TCP
1103	         connection lifetimes.

1105	         We therefore require that an implementation of PAWS include a
1106	         mechanism to "invalidate" the TS.Recent value when a connection
1107	         is idle for more than 24 days.  (An alternative solution to the
1108	         problem of outdated timestamps would be to send keep-alive
1109	         segments at a very low rate, but still more often than the
1110	         wrap-around time for timestamps, e.g., once a day.  This would
1111	         impose negligible overhead.  However, the TCP specification has
1112	         never included keep-alives, so the solution based upon
1113	         invalidation was chosen.)
1114	         Note that a TCP does not know the frequency, and therefore, the
1115	         wraparound time, of the other TCP, so it must assume the worst.
1116	         The validity of TS.Recent needs to be checked only if the basic
1117	         PAWS timestamp check fails, i.e., only if SEG.TSval <
1118	         TS.Recent.  If TS.Recent is found to be invalid, then the
1119	         segment is accepted, regardless of the failure of the timestamp
1120	         check, and rule R3 updates TS.Recent with the TSval from the
1121	         new segment.

1123	         To detect how long the connection has been idle, the TCP may
1124	         update a clock or timestamp value associated with the
1125	         connection whenever TS.Recent is updated, for example.  The
1126	         details will be implementation-dependent.
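	         One possible shape for this check is sketched below in Python.
	         It assumes the implementation records a local time (in
	         seconds) whenever TS.Recent is updated; the function and
	         parameter names are hypothetical.  Updating TS.Recent under
	         rule R3 after an accepted segment is left to the caller.

```python
IDLE_LIMIT_SECONDS = 24 * 86_400   # the 24-day bound from the text

def ts_before(a, b):
    """True if 32-bit timestamp a is older than b, modulo 2**32."""
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000

def paws_accepts(seg_tsval, ts_recent, now, ts_recent_updated_at):
    """Return True if the segment passes PAWS, allowing for a stale TS.Recent."""
    if not ts_before(seg_tsval, ts_recent):
        return True                     # basic timestamp check passed
    # The basic check failed, so only now is TS.Recent's validity examined.
    idle = now - ts_recent_updated_at
    return idle > IDLE_LIMIT_SECONDS    # TS.Recent invalid: accept anyway
```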

1128	      4.2.4  Header Prediction

1130	         "Header prediction" [Jacobson90a] is a high-performance
1131	         transport protocol implementation technique that is most
1132	         important for high-speed links.  This technique optimizes the
1133	         code for the most common case, receiving a segment correctly
1134	         and in order.  Using header prediction, the receiver asks the
1135	         question, "Is this segment the next in sequence?"  This
1136	         question can be answered in fewer machine instructions than the
1137	         question, "Is this segment within the window?"

1139	         Adding header prediction to our timestamp procedure leads to
1140	         the following recommended sequence for processing an arriving
1141	         TCP segment:

1143	         H1)  Check timestamp (same as step R1 above)

1145	         H2)  Do header prediction: if segment is next in sequence and
1146	              if there are no special conditions requiring additional
1147	              processing, accept the segment, record its timestamp, and
1148	              skip H3.

1150	         H3)  Process the segment normally, as specified in RFC 793.
1151	              This includes dropping segments that are outside the
1152	              window and possibly sending acknowledgments, and queueing
1153	              in-window, out-of-sequence segments.
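	         The ordering of H1 through H3 can be sketched as follows.
	         This is an illustration only: the segment and TCB are plain
	         dictionaries, and timestamp_ok and normal_processing are
	         hypothetical callables standing in for rule R1 and for
	         RFC 793 processing.  The acknowledgment that accompanies a
	         rejected segment is omitted.

```python
def process_segment(seg, tcb, timestamp_ok, normal_processing):
    """Apply H1 (timestamp check), then H2 (header prediction), then H3."""
    # H1: timestamp check, same as step R1.
    if not timestamp_ok(seg, tcb):
        return "rejected"
    # H2: header prediction. Next in sequence, no special conditions.
    if seg["seq"] == tcb["rcv_nxt"] and not seg["special"]:
        tcb["ts_recent"] = seg["tsval"]     # record its timestamp
        tcb["rcv_nxt"] = (tcb["rcv_nxt"] + seg["len"]) & 0xFFFFFFFF
        return "fast path"                  # H3 is skipped
    # H3: everything else is processed normally.
    return normal_processing(seg, tcb)
```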

1155	         Another possibility would be to interchange steps H1 and H2,
1156	         i.e., to perform the header prediction step H2 FIRST, and
1157	         perform H1 and H3 only when header prediction fails.  This
1158	         could be a performance improvement, since the timestamp check
1159	         in step H1 is very unlikely to fail, and it requires unsigned
1160	         modulo arithmetic, a relatively expensive operation.  To
1161	         perform this check on every single segment is contrary to the
1162	         philosophy of header prediction.  We believe that this change
1163	         might produce a measurable reduction in CPU time for TCP
1164	         protocol processing on high-speed networks.

1166	         However, putting H2 first would create a hazard: a segment from
1167	         2**32 bytes in the past might arrive at exactly the wrong time
1168	         and be accepted mistakenly by the header-prediction step.  The
1169	         following reasoning has been introduced [Jacobson90b] to show
1170	         that the probability of this failure is negligible.

1172	              If all segments are equally likely to show up as old
1173	              duplicates, then the probability of an old duplicate
1174	              exactly matching the left window edge is the maximum
1175	              segment size (MSS) divided by the size of the sequence
1176	              space.  This ratio must be less than 2**-16, since MSS
1177	              must be < 2**16; for example, it will be (2**12)/(2**32) =
1178	              2**-20 for an FDDI link.  However, the older a segment is,
1179	              the less likely it is to be retained in the Internet, and
1180	              under any reasonable model of segment lifetime the
1181	              probability of an old duplicate exactly at the left window
1182	              edge must be much smaller than 2**-16.

1184	              The 16 bit TCP checksum also allows a basic unreliability
1185	              of one part in 2**16.  A protocol mechanism whose
1186	              reliability exceeds the reliability of the TCP checksum
1187	              should be considered "good enough", i.e., it won't
1188	              contribute significantly to the overall error rate.  We
1189	              therefore believe we can ignore the problem of an old
1190	              duplicate being accepted by doing header prediction before
1191	              checking the timestamp.

1193	         However, this probabilistic argument is not universally
1194	         accepted, and the consensus at present is that the performance
1195	         gain does not justify the hazard in the general case.  It is
1196	         therefore recommended that H2 follow H1.
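	         For reference, the ratio used in the quoted argument can be
	         checked with exact arithmetic.  The Python sketch below only
	         reproduces the MSS / sequence-space figures; it does not model
	         segment lifetime.  The function name is illustrative.

```python
from fractions import Fraction

SEQUENCE_SPACE = 2 ** 32

def left_edge_match_probability(mss):
    """MSS divided by the size of the sequence space, as an exact fraction."""
    return Fraction(mss, SEQUENCE_SPACE)

fddi = left_edge_match_probability(2 ** 12)        # the FDDI example
bound = left_edge_match_probability(2 ** 16 - 1)   # largest MSS below 2**16
```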

1198	      4.2.5  IP Fragmentation

1200	         At high data rates, the protection against old packets provided
1201	         by PAWS can be circumvented by errors in IP fragment reassembly
1202	         [Heffner07].  The only way to protect against incorrect IP
1203	         fragment reassembly is to not allow the packets to be
1204	         fragmented.  This is done by setting the Don't Fragment (DF)
1205	         bit in the IP header.  Setting the DF bit implies the use of
1206	         Path MTU Discovery as described in RFC 1191 [Mogul90]; thus any
1207	         TCP implementation that implements PAWS must also implement
1208	         Path MTU Discovery.

1210	   4.3.  Duplicates from Earlier Incarnations of Connection

1211	      The PAWS mechanism protects against errors due to sequence number
1212	      wrap-around on high-speed connections.  Segments from an earlier
1213	      incarnation of the same connection are also a potential cause of
1214	      old duplicate errors.  In both cases, the TCP mechanisms to
1215	      prevent such errors depend upon the enforcement of a maximum
1216	      segment lifetime (MSL) by the Internet (IP) layer (see Appendix of
1217	      RFC 1185 for a detailed discussion).  Unlike the case of sequence
1218	      space wrap-around, the MSL required to prevent old duplicate
1219	      errors from earlier incarnations does not depend upon the transfer
1220	      rate.  If the IP layer enforces the recommended 2 minute MSL of
1221	      TCP, and if the TCP rules are followed, TCP connections will be
1222	      safe from earlier incarnations, no matter how high the network
1223	      speed.  Thus, the PAWS mechanism is not required for this case.

1225	      We may still ask whether the PAWS mechanism can provide additional
1226	      security against old duplicates from earlier connections, allowing
1227	      us to relax the enforcement of MSL by the IP layer.  Appendix B
1228	      explores this question, showing that further assumptions and/or
1229	      mechanisms are required, beyond those of PAWS.  This is not part
1230	      of the current extension.

1232	5.  CONCLUSIONS AND ACKNOWLEDGMENTS

1234	   This memo presented a set of extensions to TCP to provide efficient
1235	   operation over large-bandwidth*delay-product paths and reliable
1236	   operation over very high-speed paths.  These extensions are designed
1237	   to provide compatible interworking with TCP's that do not implement
1238	   the extensions.

1240	   These mechanisms are implemented using new TCP options for scaled
1241	   windows and timestamps.  The timestamps are used for two distinct
1242	   mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect
1243	   Against Wrapped Sequences).

1245	   The Window Scale option was originally suggested by Mike St. Johns of
1246	   USAF/DCA.  The present form of the option was suggested by Mike
1247	   Karels of UC Berkeley in response to a more cumbersome scheme defined
1248	   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
1249	   description in RFC 1185.

1251	   Finally, much of this work originated as the result of discussions
1252	   within the End-to-End Task Force on the theoretical limitations of
1253	   transport protocols in general and TCP in particular.  Task force
1254	   members and others on the end2end-interest list have made valuable
1255	   contributions by pointing out flaws in the algorithms and the
1256	   documentation.  Continued discussion and development since the
1257	   publication of RFC 1323 originally occurred in the IETF TCP Large
1258	   Windows Working Group, later on in the End-to-End Task Force, and
1259	   most recently in the IETF TCP Maintenance Working Group.  The authors
1260	   are grateful for all these contributions.

1262	6.  SECURITY CONSIDERATIONS

1264	   The TCP sequence space is a fixed size, and as the window becomes
1265	   larger it becomes easier for an attacker to generate forged packets
1266	   that can fall within the TCP window, and be accepted as valid
1267	   packets.  Timestamps and PAWS can help to mitigate this.  However,
1268	   with PAWS in use, an attacker who forges a packet that is acceptable
1269	   to the TCP connection and that carries a timestamp in the future
1270	   would cause subsequent valid packets to be dropped by the PAWS
1271	   checks.  Hence, implementors should take care to not open the TCP
1272	   window drastically beyond the requirements of the connection.

1274	   Middle boxes and options: If a middle box removes TCP options, such
1275	   as TSopt, from the SYN, a high-speed connection that needs PAWS would
1276	   not have that protection.  In this situation, an implementor could
1277	   provide a mechanism for the application to determine whether or not
1278	   PAWS is in use on the connection, and choose to terminate the
1279	   connection if that protection doesn't exist.

1281	   Mechanisms to protect the TCP header from modification should also
1282	   protect the TCP options.

1284	   Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
1285	   [Borman99] to be used when the local network supports packets larger
1286	   than 64K.  When larger TCP packets are used, the TCP checksum becomes
1287	   weaker.

1289	7.  IANA CONSIDERATIONS

1291	   This document has no actions for IANA.

1293	8.  REFERENCES

1295	   Normative References

1297	      [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery", RFC
1298	      1191, November 1990.

1300	      [Postel81]  Postel, J., "Transmission Control Protocol - DARPA
1301	      Internet Program Protocol Specification", RFC 793, DARPA,
1302	      September 1981.

1304	   Informative References

1306	      [Allman99] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
1307	      Control", RFC 2581, NASA Glenn/Sterling Software, ACIRI / ICSI,
1308	      April 1999.

1310	      [Borman99] Borman, D., Deering, S., and Hinden, R., "IPv6
1311	      Jumbograms", RFC 2675, August 1999.

1313	      [Braden89] Braden, R., editor, "Requirements for Internet Hosts --
1314	      Communication Layers", RFC 1122, October 1989.

1316	      [Floyd00] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M., "An
1317	      Extension to the Selective Acknowledgement (SACK) Option for TCP",
1318	      RFC 2883, July 2000.

1320	      [Blanton03] Blanton, E., Allman, M., Fall, K., Wang, L., "A
1321	      Conservative Selective Acknowledgment (SACK)-based Loss Recovery
1322	      Algorithm for TCP", RFC 3517, April 2003.

1324	      [Garlick77]  Garlick, L., R. Rom, and J. Postel, "Issues in
1325	      Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop
1326	      on Distributed Data Management and Computer Networks, May 1977.

1328	      [Hamming77]  Hamming, R., "Digital Filters", ISBN 0-13-212571-4,
1329	      Prentice Hall, Englewood Cliffs, N.J., 1977.

1331	      [Heffner07] Heffner, J., Mathis, M., and Chandler, B., "IPv4
1332	      Reassembly Errors at High Data Rates", RFC 4963, PSC, July 2007.

1334	      [Jacobson88a] Jacobson, V., "Congestion Avoidance and Control",
1335	      SIGCOMM '88, Stanford, CA., August 1988.

1337	      [Jacobson88b]  Jacobson, V., and R. Braden, "TCP Extensions for
1338	      Long-Delay Paths", RFC 1072, LBL and USC/Information Sciences
1339	      Institute, October 1988.

1341	      [Jacobson90a]  Jacobson, V., "4BSD Header Prediction", ACM
1342	      Computer Communication Review, April 1990.

1344	      [Jacobson90b]  Jacobson, V., Braden, R., and Zhang, L., "TCP
1345	      Extension for High-Speed Paths", RFC 1185, LBL and USC/Information
1346	      Sciences Institute, October 1990.

1348	      [Jacobson90c]  Jacobson, V., "Modified TCP congestion avoidance
1349	      algorithm", Message to end2end-interest mailing list, April 1990.

1351	      [Jacobson92d]  Jacobson, V., Braden, R., and Borman, D., "TCP
1352	      Extensions for High Performance", RFC 1323, LBL, USC/Information
1353	      Sciences Institute and Cray Research, May 1992.

1355	      [Jain86]  Jain, R., "Divergence of Timeout Algorithms for Packet
1356	      Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
1357	      Scottsdale, Arizona, March 1986.

1359	      [Karn87]  Karn, P. and C. Partridge, "Estimating Round-Trip Times
1360	      in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
1361	      August 1987.

1363	      [Martin03]  Martin, D., "[Tsvwg] RFC 1323.bis", Message to tsvwg
1364	      mailing list, September 30, 2003.

1366	      [Mathis96] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A.,
1367	      "TCP Selective Acknowledgment Options", RFC 2018, October, 1996.

1369	      [Mathis08] Mathis, M., "[tcpm] Example of 1323 window retraction
1370	      problem", Message to the tcpm mailing list, March 2008.

1373	      [McKenzie89]  McKenzie, A., "A Problem with the TCP Big Window
1374	      Option", RFC 1110, BBN STC, August 1989.

1376	      [Nagle84]  Nagle, J., "Congestion Control in IP/TCP
1377	      Internetworks", RFC 896, FACC, January 1984.

1379	      [Watson81]  Watson, R., "Timer-based Mechanisms in Reliable
1380	      Transport Protocol Connection Management", Computer Networks, Vol.
1381	      5, 1981.

1383	      [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc.
1384	      SIGCOMM '86, Stowe, Vt., August 1986.

1386	APPENDIX A:  IMPLEMENTATION SUGGESTIONS

1388	   TCP Option Layout

1390	        The following layouts are recommended for sending options on
1391	        non-SYN segments, to achieve maximum feasible alignment on
1392	        32-bit and 64-bit machines.

1394	            +--------+--------+--------+--------+
1395	            |   NOP  |  NOP   |  TSopt |   10   |
1396	            +--------+--------+--------+--------+
1397	            |           TSval  timestamp        |
1398	            +--------+--------+--------+--------+
1399	            |           TSecr  timestamp        |
1400	            +--------+--------+--------+--------+
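	        The layout above can be produced with Python's struct module,
	        as in the following sketch.  The option kind values (1 for NOP,
	        8 for Timestamps) come from the option definitions rather than
	        from this appendix, and the function name is hypothetical.

```python
import struct

TCPOPT_NOP = 1           # No-Operation option kind
TCPOPT_TIMESTAMP = 8     # Timestamps option kind
TCPOLEN_TIMESTAMP = 10   # length byte shown in the layout

def pack_aligned_tsopt(tsval, tsecr):
    """Build the 12-byte NOP, NOP, TSopt layout in network byte order."""
    return struct.pack("!BBBBII",
                       TCPOPT_NOP, TCPOPT_NOP,
                       TCPOPT_TIMESTAMP, TCPOLEN_TIMESTAMP,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)
```

	        The two leading NOPs place TSval and TSecr on 4-byte
	        boundaries within the option area.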

1402	   Interaction with the TCP Urgent Pointer

1404	        The TCP Urgent pointer, like the TCP window, is a 16 bit value.
1405	        Some of the original discussion for the TCP Window Scale option
1406	        included proposals to increase the Urgent pointer to 32 bits.
1407	        As it turns out, this is unnecessary.  There are two
1408	        observations that should be made:

1410	        (1)  With IP Version 4, the largest amount of TCP data that can
1411	             be sent in a single packet is 65495 bytes (64K - 1 - size
1412	             of fixed IP and TCP headers).

1414	        (2)  Updates to the urgent pointer while the user is in "urgent
1415	             mode" are invisible to the user.

1417	        This means that if the Urgent Pointer points beyond the end of
1418	        the TCP data in the current packet, then the user will remain in
1419	        urgent mode until the next TCP packet arrives.  That packet will
1420	        update the urgent pointer to a new offset, and the user will
1421	        never have left urgent mode.

1423	        Thus, to properly implement the Urgent Pointer, the sending TCP
1424	        only has to check for overflow of the 16 bit Urgent Pointer
1425	        field before filling it in.  If it does overflow, then a value
1426	        of 65535 should be inserted into the Urgent Pointer.
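	        A minimal sketch of that overflow check, with a hypothetical
	        function name and the urgent offset given relative to the
	        segment's sequence number:

```python
URG_MAX = 0xFFFF   # largest value the 16-bit Urgent Pointer field can hold

def urgent_pointer_field(urgent_offset):
    """Value to place in the Urgent Pointer field for the given offset."""
    return min(urgent_offset, URG_MAX)
```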

1428	        The same technique applies to IP Version 6, except in the case
1429	        of IPv6 Jumbograms.  When IPv6 Jumbograms are supported, RFC
1430	        2675 [Borman99] requires additional steps for dealing with the
1431	        Urgent Pointer; these are described in Section 5.2 of RFC 2675.

1433	APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS

1435	   There are two cases to be considered:  (1) a system crashing (and
1436	   losing connection state) and restarting, and (2) the same connection
1437	   being closed and reopened without a loss of host state.  These will
1438	   be described in the following two sections.

1440	   B.1  System Crash with Loss of State

1442	      TCP's quiet time of one MSL upon system startup handles the loss
1443	      of connection state in a system crash/restart.  For an
1444	      explanation, see for example "When to Keep Quiet" in the TCP
1445	      protocol specification [Postel81].  The MSL that is required here
1446	      does not depend upon the transfer speed.  The current TCP MSL of 2
1447	      minutes seems acceptable as an operational compromise, as many
1448	      host systems take this long to boot after a crash.

1450	      However, the timestamp option may be used to ease the MSL
1451	      requirements (or to provide additional security against data
1452	      corruption).  If timestamps are being used and if the timestamp
1453	      clock can be guaranteed to be monotonic over a system
1454	      crash/restart, i.e., if the first value of the sender's timestamp
1455	      clock after a crash/restart can be guaranteed to be greater than
1456	      the last value before the restart, then a quiet time will be
1457	      unnecessary.

1459	      To dispense totally with the quiet time would require that the
1460	      host clock be synchronized to a time source that is stable over
1461	      the crash/restart period, with an accuracy of one timestamp clock
1462	      tick or better.  We can back off from this strict requirement to
1463	      take advantage of approximate clock synchronization.  Suppose that
1464	      the clock is always re-synchronized to within N timestamp clock
1465	      ticks and that booting (extended with a quiet time, if necessary)
1466	      takes more than N ticks.  This will guarantee monotonicity of the
1467	      timestamps, which can then be used to reject old duplicates even
1468	      without an enforced MSL.

1470	   B.2  Closing and Reopening a Connection

1472	      When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT
1473	      state ties up the socket pair for 4 minutes (see Section 3.5 of
1474	      [Postel81]).  Applications built upon TCP that close one connection
1475	      and open a new one (e.g., an FTP data transfer connection using
1476	      Stream mode) must choose a new socket pair each time.  The TIME-
1477	      WAIT delay serves two different purposes:

1479	      (a)  Implement the full-duplex reliable close handshake of TCP.

1481	           The proper time to delay the final close step is not really
1482	           related to the MSL; it depends instead upon the RTO for the
1483	           FIN segments and therefore upon the RTT of the path.  (It
1484	           could be argued that the side that is sending a FIN knows
1485	           what degree of reliability it needs, and therefore it should
1486	           be able to determine the length of the TIME-WAIT delay for
1487	           the FIN's recipient.  This could be accomplished with an
1488	           appropriate TCP option in FIN segments.)

1490	           Although there is no formal upper-bound on RTT, common
1491	           network engineering practice makes an RTT greater than 1
1492	           minute very unlikely.  Thus, the 4 minute delay in TIME-WAIT
1493	           state works satisfactorily to provide a reliable full-duplex
1494	           TCP close.  Note again that this is independent of MSL
1495	           enforcement and network speed.

1497	           The TIME-WAIT state could cause an indirect performance
1498	           problem if an application needed to repeatedly close one
1499	           connection and open another at a very high frequency, since
1500	           the number of available TCP ports on a host is less than
1501	           2**16.  However, high network speeds are not the major
1502	           contributor to this problem; the RTT is the limiting factor
1503	           in how quickly connections can be opened and closed.
1504	           Therefore, this problem will be no worse at high transfer
1505	           speeds.

1507	      (b)  Allow old duplicate segments to expire.

1509	           To replace this function of TIME-WAIT state, a mechanism
1510	           would have to operate across connections.  PAWS is defined
1511	           strictly within a single connection; the last timestamp
1512	           (TS.Recent) is kept in the connection control block, and
1513	           discarded when a connection is closed.

1515	           An additional mechanism could be added to the TCP, a per-host
1516	           cache of the last timestamp received from any connection.
1517	           This value could then be used in the PAWS mechanism to reject
1518	           old duplicate segments from earlier incarnations of the
1519	           connection, if the timestamp clock can be guaranteed to have
1520	           ticked at least once since the old connection was open.  This
1521	           would require that the TIME-WAIT delay plus the RTT together
1522	           must be at least one tick of the sender's timestamp clock.
1523	           Such an extension is not part of the proposal of this RFC.

1525	           Note that this is a variant on the mechanism proposed by
1526	           Garlick, Rom, and Postel [Garlick77], which required each
1527	           host to maintain connection records containing the highest
1528	           sequence numbers on every connection.  Using timestamps
1529	           instead, it is only necessary to keep one quantity per remote
1530	           host, regardless of the number of simultaneous connections to
1531	           that host.

1533	APPENDIX C: CHANGES FROM RFC 1072, RFC 1185, RFC 1323

1535	   The protocol extensions defined in RFC 1323 differ in several
1536	   important ways from those defined in RFC 1072 and RFC 1185.

1538	   (a)  SACK has been split off into a separate document, RFC 2018
1539	        [Mathis96].

1541	   (b)  The detailed rules for sending timestamp replies (see Section
1542	        3.4) differ in important ways.  The earlier rules could result
1543	        in an under-estimate of the RTT in certain cases (packets
1544	        dropped or out of order).

1546	   (c)  The same value TS.Recent is now shared by the two distinct
1547	        mechanisms RTTM and PAWS.  This simplification became possible
1548	        because of change (b).

1550	   (d)  An ambiguity in RFC 1185 was resolved in favor of putting
1551	        timestamps on ACK as well as data segments.  This supports the
1552	        symmetry of the underlying TCP protocol.

1554	   (e)  The echo and echo reply options of RFC 1072 were combined into a
1555	        single Timestamps option, to reflect the symmetry and to
1556	        simplify processing.

1558	   (f)  The problem of outdated timestamps on long-idle connections,
1559	        discussed in Section 4.2.3, was realized and resolved.

1561	   (g)  RFC 1185 recommended that header prediction take precedence over
1562	        the timestamp check.  Based upon some skepticism about the
1563	        probabilistic arguments given in Section 4.2.4, it was decided
1564	        to recommend that the timestamp check be performed first.

1566	   (h)  The spec was modified so that the extended options will be sent
1567	        on <SYN,ACK> segments only when they are received in the
1568	        corresponding <SYN> segments.  This provides the most
1569	        conservative possible conditions for interoperation with
1570	        implementations without the extensions.

1572	   In addition to these substantive changes, the present RFC attempts to
1573	   specify the algorithms unambiguously by presenting modifications to
1574	   the Event Processing rules of RFC 793; see Appendix F.

1576	   There are additional changes in this document from RFC 1323.  These
1577	   changes are:

1579	   (a)  The description of which TSecr values can be used to update the
1580	        measured RTT has been clarified.  Specifically, with Timestamps,
1581	        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
1582	        disables all RTT measurements during retransmission, since it is
1583	        ambiguous whether the ACK is for the original packet, or the
1584	        retransmitted packet.  With Timestamps, that ambiguity is
1585	        removed since the TSecr in the ACK will contain the TSval from
1586	        whichever data packet made it to the destination.

1588	   (b)  In RFC 1323, section 3.4, step (2) of the algorithm to control
1589	        which timestamp is echoed was incorrect in two regards:

1591	        (1)  It failed to update TSrecent for a retransmitted segment
1592	             that resulted from a lost ACK.

1594	        (2)  It failed if SEG.LEN = 0.

1596	        In the new algorithm, the case of SEG.TSval = TSrecent is
1597	        included for consistency with the PAWS test.

1599	   (c)  One correction was made to the Event Processing Summary in
1600	        Appendix F.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
1601	        fill in the SEG.WND value, not SND.WND.

1603	   (d)  New pseudo-code summary has been added in Appendix E.

1605	   (e)  Appendix A has been expanded with information about the TCP MSS
1606	        option and the TCP Urgent Pointer.

1608	   (f)  It is now recommended that Timestamps options be included in RST
1609	        packets if the incoming packet contained a Timestamps option.

1611	   (g)  RST packets are explicitly excluded from PAWS processing.

1613	   (h)  Snd.TSoffset and Snd.TSclock variables have been added.
1614	        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
1615	        allows the starting points for timestamps to be randomized on a
1616	        per-connection basis.  Setting Snd.TSoffset to zero yields the
1617	        same results as RFC 1323.
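	   The relationship between these variables can be sketched as
	   follows.  The function names are hypothetical, and the use of
	   Python's general-purpose random module is an assumption of the
	   sketch, not a recommendation about the source of randomness.

```python
import random

MASK32 = 0xFFFFFFFF

def new_ts_offset(rng=random):
    """Pick a per-connection Snd.TSoffset: a random 32-bit value."""
    return rng.getrandbits(32)

def snd_tsclock(my_tsclock, snd_tsoffset):
    """Snd.TSclock = my.TSclock + Snd.TSoffset, modulo 2**32."""
    return (my_tsclock + snd_tsoffset) & MASK32
```

	   With snd_tsoffset set to zero, snd_tsclock returns my_tsclock
	   unchanged, matching the RFC 1323 behavior.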

1619	APPENDIX D: SUMMARY OF NOTATION

1621	   The following notation has been used in this document.

1623	   Options

1625	       WSopt:           TCP Window Scale Option
1626	       TSopt:           TCP Timestamps Option

1628	   Option Fields

1630	       shift.cnt:       Window scale byte in WSopt.
1631	       TSval:           32-bit Timestamp Value field in TSopt.
1632	       TSecr:           32-bit Timestamp Reply field in TSopt.

1634	   Option Fields in Current Segment

1636	       SEG.TSval:       TSval field from TSopt in current segment.
1637	       SEG.TSecr:       TSecr field from TSopt in current segment.
1638	       SEG.WSopt:       8-bit value in WSopt

1640	   Clock Values

1642	       my.TSclock:      System wide source of 32-bit timestamp values
1643	       my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec).
1644	       Snd.TSoffset:    An offset for randomizing Snd.TSclock
1645	       Snd.TSclock:     my.TSclock + Snd.TSoffset

1647	   Per-Connection State Variables

1649	       TS.Recent:       Latest received Timestamp
1650	       Last.ACK.sent:   Last ACK field sent

1652	       Snd.TS.OK:       1-bit flag
1653	       Snd.WS.OK:       1-bit flag

1655	       Rcv.Wind.Scale:  Receive window scale power
1656	       Snd.Wind.Scale:  Send window scale power

1658	       Start.Time:      Snd.TSclock value when segment being
1659	                        timed was sent (used by pre-1323 code).

1661	   Procedure

1663	       Update_SRTT( m ) Procedure to update the smoothed RTT and RTT
1664	                        variance estimates, using the rules of
1665	                        [Jacobson88a], given m, a new RTT measurement.

1667	APPENDIX E: PSEUDO-CODE SUMMARY

1669	   Create new TCB => {
1670	       Rcv.wind.scale =
1671	             MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
1672	       Snd.wind.scale = 0;
1673	       Last.ACK.sent = 0;
1674	       Snd.TS.OK = Snd.WS.OK = FALSE;
1675	       Snd.TSoffset = random 32-bit value;
1676	   }
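	   The Rcv.wind.scale expression above can be evaluated without
	   floating point, as in this Python sketch.  The function name is
	   illustrative, and the buffer space is assumed to be a
	   non-negative integer number of bytes.

```python
def rcv_wind_scale(receive_buffer_space):
    """MIN(14, MAX(0, floor(log2(receive buffer space)) - 15))."""
    # For n >= 1, n.bit_length() - 1 equals floor(log2(n)).
    floor_log2 = receive_buffer_space.bit_length() - 1
    return min(14, max(0, floor_log2 - 15))
```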

1678	   Send initial {SYN} segment => {

1680	       SEG.WND = MIN( RCV.WND, 65535 );
1681	       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
1682	       Include in segment: WSopt = Rcv.wind.scale;
1683	   }

1685	   Send {SYN, ACK} segment => {

1687	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1688	       SEG.WND = MIN( RCV.WND, 65535 );
1689	       if (Snd.TS.OK) then
1690	             Include in segment:
1691	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1692	       if (Snd.WS.OK) then
1693	             Include in segment: WSopt = Rcv.wind.scale;
1694	   }

1696	   Receive {SYN} or {SYN,ACK} segment => {

1698	       if (Segment contains TSopt) then {
1699	             TS.Recent = SEG.TSval;
1700	             Snd.TS.OK = TRUE;
1701	             if (is {SYN,ACK} segment) then
1702	                   Update_SRTT(
1703	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1704	       }

1706	       if (Segment contains WSopt) then {
1707	             Snd.wind.scale = SEG.WSopt;
1708	             Snd.WS.OK = TRUE;
1709	             if (the ACK bit is not set, and Rcv.wind.scale has not been
1710	               initialized by the user) then
1711	                   Rcv.wind.scale = Snd.wind.scale;
1712	       }
1713	       else
1714	             Rcv.wind.scale = Snd.wind.scale = 0;
1715	   }

1717	   Send non-SYN segment => {

1719	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1720	       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
1721	       if (Snd.TS.OK) then
1722	             Include in segment:
1723	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1724	   }

1726	   Receive non-SYN segment in (state >= ESTABLISHED) => {

1728	       Window = (SEG.WND << Snd.wind.scale);
1729	             /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
1730	              * in rest of processing.
1731	              */

1733	       if (Segment contains TSopt) then {
1734	             if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
1735	                   if (Snd.TS.OK AND (NOT RST) ) then {
1736	                               /* Timestamp too old =>
1737	                                *    segment is unacceptable.
1738	                                */
1739	                         Send ACK segment;
1740	                         Discard segment and return;
1741	                   }
1742	             }
1743	             else {
1744	                   if (SEG.SEQ =< Last.ACK.sent) then
1745	                               TS.Recent = SEG.TSval;
1746	             }
1747	       }

1749	       if (SEG.ACK > SND.UNA) then {
1750	                    /* (At least part of) first segment in
1751	                     * retransmission queue has been ACKd
1752	                     */
1753	             if (Segment contains TSopt) then
1754	                   Update_SRTT(
1755	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1756	             else
1757	                   Update_SRTT( /* for compatibility */
1758	                          (Snd.TSclock - Start.Time)/my.TSclock.rate);
1759	       }
1760	   }
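
   The Update_SRTT() argument in the pseudo-code above can be sketched
   as follows.  This is a minimal illustration, not part of the
   specification: the function name and the reading of my.TSclock.rate
   as "ticks per second" are assumptions made for the example.  The
   subtraction is done modulo 2**32 because timestamp values are 32-bit
   quantities that wrap around.

```python
TS_MODULUS = 2 ** 32  # timestamp values are 32-bit unsigned and wrap around


def rtt_sample_seconds(snd_tsclock, seg_tsecr, ticks_per_second):
    """Return (Snd.TSclock - SEG.TSecr) / my.TSclock.rate, in seconds.

    ticks_per_second is an assumed interpretation of my.TSclock.rate.
    """
    elapsed_ticks = (snd_tsclock - seg_tsecr) % TS_MODULUS
    return elapsed_ticks / ticks_per_second


# Ordinary case: 250 ticks elapsed on a 1000 tick/s clock.
print(rtt_sample_seconds(10_250, 10_000, 1000))

# Wrapped case: the timestamp clock rolled over before the echo came back.
print(rtt_sample_seconds(100, TS_MODULUS - 150, 1000))
```

   Both calls yield the same sample, since only the difference between
   the two clock readings matters.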

1762	APPENDIX F: EVENT PROCESSING SUMMARY

1764	Event Processing

1766	  OPEN Call

1768	     ...
1769	    An initial send sequence number (ISS) is selected.  Send a SYN
1770	    segment of the form:

1772	        <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale>

1774	      ...

1776	  SEND Call

1778	    CLOSED STATE (i.e., TCB does not exist)

1780	      ...

1782	    LISTEN STATE

1784	      If the foreign socket is specified, then change the connection
1785	      from passive to active and select an ISS.  Send a SYN segment
1786	      containing the options <TSval=Snd.TSclock> and
1787	      <WSopt=Rcv.Wind.Scale>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
1788	      Enter SYN-SENT state. ...

1790	    SYN-SENT STATE
1791	    SYN-RECEIVED STATE

1793	     ...

1795	    ESTABLISHED STATE
1796	    CLOSE-WAIT STATE

1798	      Segmentize the buffer and send it with a piggybacked
1799	      acknowledgment (acknowledgment value = RCV.NXT).  ...

1801	      If the urgent flag is set ...

1803	      If the Snd.TS.OK flag is set, then include the TCP Timestamps
1804	      option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data segment.

1806	      Scale the receive window for transmission in the segment header:

1808	            SEG.WND = (RCV.WND >> Rcv.Wind.Scale).

1810	  SEGMENT ARRIVES

1812	     ...

1814	    If the state is LISTEN then

1816	      first check for an RST

1818	       ...

1820	      second check for an ACK

1822	       ...

1824	      third check for a SYN

1826	        if the SYN bit is set, check the security.  If the ...

1828	         ...

1830	        If the SEG.PRC is less than the TCB.PRC then continue.

1832	        Check for a Window Scale option (WSopt); if one is found, save
1833	        SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
1834	        Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to zero
1835	        and clear Snd.WS.OK flag.

1837	        Check for a TSopt option; if one is found, save SEG.TSval in the
1838	        variable TS.Recent and turn on the Snd.TS.OK bit.

1840	        Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other
1841	        control or text should be queued for processing later.  ISS
1842	        should be selected and a SYN segment sent of the form:

1844	          <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1846	        If the Snd.WS.OK bit is on, include a WSopt option
1847	        <WSopt=Rcv.Wind.Scale> in this segment.  If the Snd.TS.OK bit is
1848	        on, include a TSopt <TSval=Snd.TSclock,TSecr=TS.Recent> in this
1849	        segment.  Last.ACK.sent is set to RCV.NXT.

1851	        SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
1852	        state should be changed to SYN-RECEIVED.  Note that any other
1853	        incoming control or data (combined with SYN) will be processed
1854	        in the SYN-RECEIVED state, but processing of SYN and ACK should
1855	        not be repeated.  If the listen was not fully specified (i.e.,
1856	        the foreign socket was not fully specified), then the
1857	        unspecified fields should be filled in now.
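
   The option handling in the "third check for a SYN" step can be
   sketched as below.  This is an illustrative sketch only: the
   ConnState class, the function names, and the dictionary used to
   describe the reply are hypothetical, while the field names mirror
   the variables used in this appendix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnState:
    """Hypothetical subset of the TCB holding the variables named above."""
    snd_wind_scale: int = 0
    rcv_wind_scale: int = 0
    snd_ws_ok: bool = False
    snd_ts_ok: bool = False
    ts_recent: int = 0


def process_syn_options(tcb, ws_opt: Optional[int], ts_val: Optional[int]):
    """Apply the WSopt and TSopt rules to a SYN received in LISTEN."""
    if ws_opt is not None:
        tcb.snd_wind_scale = ws_opt      # save SEG.WSopt
        tcb.snd_ws_ok = True
    else:
        tcb.snd_wind_scale = 0           # no scaling in either direction
        tcb.rcv_wind_scale = 0
        tcb.snd_ws_ok = False
    if ts_val is not None:
        tcb.ts_recent = ts_val           # save SEG.TSval
        tcb.snd_ts_ok = True
    return tcb


def synack_options(tcb, snd_tsclock):
    """Return the options to place in the SYN,ACK reply."""
    options = {}
    if tcb.snd_ws_ok:
        options["WSopt"] = tcb.rcv_wind_scale
    if tcb.snd_ts_ok:
        options["TSopt"] = (snd_tsclock, tcb.ts_recent)  # (TSval, TSecr)
    return options


tcb = process_syn_options(ConnState(rcv_wind_scale=7), ws_opt=3, ts_val=1234)
print(synack_options(tcb, snd_tsclock=99))
```

   A SYN carrying neither option leaves both flags off, so the reply
   carries no options and both scale factors are zero.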

1859	      fourth other text or control
1860	       ...

1862	    If the state is SYN-SENT then

1864	      first check the ACK bit

1866	       ...

1868	      fourth check the SYN bit

1870	         ...

1872	        If the SYN bit is on and the security/compartment and precedence
1873	        are acceptable, then RCV.NXT is set to SEG.SEQ+1, IRS is set to
1874	        SEG.SEQ, and any segments on the retransmission queue
1875	        which are thereby acknowledged should be removed.

1877	        Check for a Window Scale option (WSopt); if one is found, save
1878	        SEG.WSopt in Snd.Wind.Scale; otherwise, set both Snd.Wind.Scale
1879	        and Rcv.Wind.Scale to zero.

1881	        Check for a TSopt option; if one is found, save SEG.TSval in
1882	        variable TS.Recent and turn on the Snd.TS.OK bit in the
1883	        connection control block.  If the ACK bit is set, use
1884	        Snd.TSclock - SEG.TSecr as the initial RTT estimate.

1886	        If SND.UNA > ISS (our SYN has been ACKed), change the connection
1887	        state to ESTABLISHED, form an ACK segment:

1889	            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1891	        and send it.  If the Snd.TS.OK bit is on, include a TSopt
1892	        option <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.
1893	        Last.ACK.sent is set to RCV.NXT.

1895	        Data or controls which were queued for transmission may be
1896	        included.  If there are other controls or text in the segment
1897	        then continue processing at the sixth step below where the URG
1898	        bit is checked, otherwise return.

1900	        Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

1902	            <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1904	        and send it.  If the Snd.TS.OK bit is on, include a TSopt
1905	        option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.  If
1906	        the Snd.WS.OK bit is on, include a WSopt option
1907	        <WSopt=Rcv.Wind.Scale> in this segment.  Last.ACK.sent is set to
1908	        RCV.NXT.

1910	        If there are other controls or text in the segment, queue them
1911	        for processing after the ESTABLISHED state has been reached,
1912	        return.

1914	      fifth, if neither of the SYN or RST bits is set then drop the
1915	      segment and return.

1917	    Otherwise,

1919	    First, check sequence number

1921	      SYN-RECEIVED STATE
1922	      ESTABLISHED STATE
1923	      FIN-WAIT-1 STATE
1924	      FIN-WAIT-2 STATE
1925	      CLOSE-WAIT STATE
1926	      CLOSING STATE
1927	      LAST-ACK STATE
1928	      TIME-WAIT STATE

1930	        Segments are processed in sequence.  Initial tests on arrival
1931	        are used to discard old duplicates, but further processing is
1932	        done in SEG.SEQ order.  If a segment's contents straddle the
1933	        boundary between old and new, only the new parts should be
1934	        processed.

1936	        Rescale the received window field:

1938	            TrueWindow = SEG.WND << Snd.Wind.Scale,

1940	        and use "TrueWindow" in place of SEG.WND in the following steps.
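
        The rescaling step, together with its send-side counterpart
        from the SEND call, can be sketched as follows.  The function
        names are hypothetical; the clamp to 65535 follows the
        pseudo-code summary in Appendix E.

```python
MAX_WINDOW_FIELD = 65535  # largest value the 16-bit SEG.WND field can hold


def scale_for_header(rcv_wnd, rcv_wind_scale):
    """Send side: SEG.WND = MIN(RCV.WND >> Rcv.Wind.Scale, 65535)."""
    return min(rcv_wnd >> rcv_wind_scale, MAX_WINDOW_FIELD)


def true_window(seg_wnd, snd_wind_scale):
    """Receive side: TrueWindow = SEG.WND << Snd.Wind.Scale."""
    return seg_wnd << snd_wind_scale


advertised = scale_for_header(1_000_000, 4)
print(advertised, true_window(advertised, 4))

# The right shift discards the low-order bits, so a window that is not
# a multiple of 2**scale is rounded down when it is reconstructed.
print(true_window(scale_for_header(1_000_007, 4), 4))
```

        Both peers must apply the same shift count to a given
        direction, which is why the value is fixed by the WSopt
        exchange on the SYN segments.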

1942	        Check whether the segment contains a Timestamps option and bit
1943	        Snd.TS.OK is on.  If so:

1945	          If SEG.TSval < TS.Recent and the RST bit is off, then test
1946	          whether connection has been idle less than 24 days; if all are
1947	          true, then the segment is not acceptable; follow steps below
1948	          for an unacceptable segment.

1950	          If SEG.SEQ is equal to Last.ACK.sent, then save SEG.TSval in
1951	          variable TS.Recent.
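
        The timestamp rejection test can be sketched as below.  This is
        a minimal illustration under stated assumptions: the function
        names and the idle time expressed in seconds are hypothetical.
        The comparison uses 32-bit modular arithmetic, in which s < t
        when 0 < (t - s) < 2**31.

```python
TS_MODULUS = 2 ** 32
HALF_RANGE = 2 ** 31
IDLE_LIMIT_SECONDS = 24 * 24 * 60 * 60  # 24 days


def ts_before(s, t):
    """True if s < t in 32-bit modular arithmetic: 0 < (t - s) < 2**31."""
    return 0 < ((t - s) % TS_MODULUS) < HALF_RANGE


def timestamp_rejects(seg_tsval, ts_recent, rst, idle_seconds):
    """True if the segment is unacceptable because its timestamp is old."""
    return (ts_before(seg_tsval, ts_recent)
            and not rst
            and idle_seconds < IDLE_LIMIT_SECONDS)


print(timestamp_rejects(5, 10, rst=False, idle_seconds=60))    # old timestamp
print(timestamp_rejects(10, 10, rst=False, idle_seconds=60))   # equal is fine
print(timestamp_rejects(5, 10, rst=True, idle_seconds=60))     # RST is exempt
```

        Because the comparison is modular, a TSval that has wrapped
        past zero still compares as newer than a TS.Recent recorded
        just before the wrap.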

1953	        There are four cases for the acceptability test for an incoming
1954	        segment:

1956	         ...

1958	        If an incoming segment is not acceptable, an acknowledgment
1959	        should be sent in reply (unless the RST bit is set, if so drop
1960	        the segment and return):

1962	          <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1964	        If the Snd.TS.OK bit is on, include the Timestamps option
1965	        <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.  Set
1966	        Last.ACK.sent to SEG.ACK of the acknowledgment and send the ACK
1967	        segment.  After sending the acknowledgment, drop the
1968	        unacceptable segment and return.

1971	         ...

1973	    fifth check the ACK field.

1975	      if the ACK bit is off drop the segment and return.

1977	      if the ACK bit is on

1979	       ...

1981	        ESTABLISHED STATE

1983	          If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <- SEG.ACK.
1984	          Also compute a new estimate of round-trip time.  If Snd.TS.OK
1985	          bit is on, use Snd.TSclock - SEG.TSecr; otherwise use the
1986	          elapsed time since the first segment in the retransmission
1987	          queue was sent.  Any segments on the retransmission queue
1988	          which are thereby entirely acknowledged...

1990	           ...

1992	    Seventh, process the segment text.

1994	      ESTABLISHED STATE
1995	      FIN-WAIT-1 STATE
1996	      FIN-WAIT-2 STATE

1998	         ...

2000	        Send an acknowledgment of the form:

2002	          <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

2004	        If the Snd.TS.OK bit is on, include the Timestamps option
2005	        <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.  Set
2006	        Last.ACK.sent to SEG.ACK of the acknowledgment, and send it.
2007	        This acknowledgment should be piggy-backed on a segment being
2008	        transmitted if possible without incurring undue delay.

2010	         ...

2012	APPENDIX G: Timestamps Edge Cases

2014	   While the rules laid out for when to calculate RTTM produce the
2015	   correct results most of the time, there are some edge cases where an
2016	   incorrect RTTM can be calculated.  All of these situations involve
2017	   the loss of packets.  These scenarios are expected to be rare, and
2018	   when one does occur it inflates only a single RTTM measurement,
2019	   which limits its effect on RTO calculations.

2021	   [Martin03] cites two similar cases in which the returning ACK is
2022	   lost and, before the retransmission timer fires, another returning
2023	   packet arrives that ACKs the data.  In this case, the RTTM
2024	   calculated will be inflated:

2026	      clock
2027	        tc=1   <A, TSval=1> ------------------->

2029	        tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
2030	            (RTTM would have been 1)

2032	               (receive window opens, window update is sent)
2033	        tc=5        <---- <ACK(A), TSecr=1, win=m>
2034	               (RTTM is calculated at 4)

2036	   One thing to note about this situation is that it is somewhat bounded
2037	   by RTO + RTT, limiting how far off the RTTM calculation will be.
2038	   While more complex scenarios can be constructed that produce larger
2039	   inflations (e.g., retransmissions are lost), those scenarios involve
2040	   multiple packet losses, and the connection will have other more
2041	   serious operational problems than using an inflated RTTM in the RTO
2042	   calculation.
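
   The arithmetic behind the inflated measurement can be worked through
   in a few lines.  This is only a sketch of the scenario above: the
   function name is hypothetical, and it assumes the data segment sent
   at tc=1 carried TSval=1, which both returning segments echo.

```python
def rttm(arrival_clock, tsecr):
    """RTTM sample: sender's clock at arrival minus the echoed timestamp."""
    return arrival_clock - tsecr


echoed = 1  # assumed TSval of the data segment sent at tc=1

print(rttm(2, echoed))  # 1: what the lost ACK would have measured
print(rttm(5, echoed))  # 4: what the later window update measures
```

   The window update echoes the same timestamp as the lost ACK, so the
   sample grows by exactly the delay between the two returning
   segments.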

2044	Authors' Addresses

2046	   David Borman
2047	   Wind River Systems
2048	   Mendota Heights, MN 55120

2050	   Phone: (651) 454-3052
2051	   Email: david.borman@windriver.com

2053	   Bob Braden
2054	   University of Southern California
2055	   Information Sciences Institute
2056	   4676 Admiralty Way
2057	   Marina del Rey, CA 90292

2059	   Phone: (310) 448-9173
2060	   EMail: Braden@ISI.EDU

2062	   Van Jacobson
2063	   Packet Design
2064	   2465 Latham Street
2065	   Mountain View, CA 94040

2067	   EMail: van@packetdesign.com