idnits 2.17.1 

draft-ietf-tcpm-1323bis-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The abstract seems to indicate that this document obsoletes RFC1323, but
     the header doesn't have an 'Obsoletes:' line to match this.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (April 16, 2013) is 4028 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFCxxxx' is mentioned on line 1863, but not defined

  == Unused Reference: 'Mathis08' is defined on line 1231, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1110' is defined on line 1243, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2018' is defined on line 1258, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2581' is defined on line 1261, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 1267, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5681' is defined on line 1280, but no explicit
     reference was found in the text

  == Unused Reference: 'Watson81' is defined on line 1291, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1110
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185
     (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 6691
     (Obsoleted by RFC 9293)


     Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	TCP Maintenance (TCPM)                                         D. Borman
3	Internet-Draft                                       Quantum Corporation
4	Intended status: Standards Track                               B. Braden
5	Expires: October 18, 2013                         University of Southern
6	                                                              California
7	                                                             V. Jacobson
8	                                                           Packet Design
9	                                                   R. Scheffenegger, Ed.
10	                                                            NetApp, Inc.
11	                                                          April 16, 2013

13	                  TCP Extensions for High Performance
14	                       draft-ietf-tcpm-1323bis-10

16	Abstract

18	   This document specifies a set of TCP extensions to improve
19	   performance over paths with a large bandwidth * delay product and to
20	   provide reliable operation over very high-speed paths.  It defines
21	   TCP options for scaled windows and timestamps.  The timestamps are
22	   used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
23	   and PAWS (Protection Against Wrapped Sequences).

25	   This document updates and obsoletes RFC 1323.

27	Status of this Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on October 18, 2013.

44	Copyright Notice

46	   Copyright (c) 2013 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	     1.1.  TCP Performance
63	     1.2.  TCP Reliability
64	     1.3.  Using TCP options
65	     1.4.  Terminology
66	   2.  TCP Window Scale Option
67	     2.1.  Introduction
68	     2.2.  Window Scale Option
69	     2.3.  Using the Window Scale Option
70	     2.4.  Addressing Window Retraction
71	   3.  RTTM -- Round-Trip Time Measurement
72	     3.1.  Introduction
73	     3.2.  TCP Timestamp Option
74	     3.3.  The RTTM Mechanism
75	     3.4.  Which Timestamp to Echo
76	   4.  PAWS -- Protection Against Wrapped Sequence Numbers
77	     4.1.  Introduction
78	     4.2.  The PAWS Mechanism
79	     4.3.  Basic PAWS Algorithm
80	     4.4.  Timestamp Clock
81	     4.5.  Outdated Timestamps
82	     4.6.  Header Prediction
83	     4.7.  IP Fragmentation
84	     4.8.  Duplicates from Earlier Incarnations of Connection
85	   5.  Conclusions and Acknowledgements
86	   6.  Security Considerations
87	   7.  IANA Considerations
88	   8.  References
89	     8.1.  Normative References
90	     8.2.  Informative References
91	   Appendix A.  Implementation Suggestions
92	   Appendix B.  Duplicates from Earlier Connection Incarnations
93	     B.1.  System Crash with Loss of State
94	     B.2.  Closing and Reopening a Connection
95	   Appendix C.  Summary of Notation
96	   Appendix D.  Event Processing Summary
97	   Appendix E.  Timestamps Edge Cases
98	   Appendix F.  Window Retraction Example
99	   Appendix G.  Changes from RFC 1323
100	   Authors' Addresses

102	1.  Introduction

104	   The TCP protocol [RFC0793] was designed to operate reliably over
105	   almost any transmission medium regardless of transmission rate,
106	   delay, corruption, duplication, or reordering of segments.  Over the
107	   years, advances in networking technology have resulted in ever-higher
108	   transmission speeds, and the fastest paths are well beyond the domain
109	   for which TCP was originally engineered.

111	   This document defines a set of modest extensions to TCP to extend the
112	   domain of its application to match the increasing network capability.
113	   It is an update to and obsoletes [RFC1323], which in turn is based
114	   upon and obsoletes [RFC1072] and [RFC1185].

116	   Changes between [RFC1323] and this document are detailed in
117	   Appendix G.

119	   For brevity, the full discussions of the merits and history behind
120	   the TCP options defined within this document have been omitted.
121	   [RFC1323] should be consulted for reference.  It is recommended that
122	   a modern TCP stack implement and make use of the extensions
123	   described in this document.

125	1.1.  TCP Performance

127	   TCP performance problems arise when the bandwidth * delay product is
128	   large.  A network having such paths is referred to as a "long, fat
129	   network" (LFN).
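
   As a rough illustration of "large", the following sketch computes the
   bandwidth * delay product for a path and compares it with the largest
   window an unscaled 16-bit Window field can express.  The function name
   and the example path (1 Gbit/s, 100 ms round-trip time) are
   assumptions chosen for illustration, not values from this document.

```python
MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value of the 16-bit Window field


def bandwidth_delay_product_bytes(bits_per_second: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep the path full (hypothetical helper)."""
    # bits/s * ms -> divide by 1000 for seconds and by 8 for bytes.
    return bits_per_second * rtt_ms // (8 * 1000)


# Hypothetical path: 1 Gbit/s with a 100 ms round-trip time.
bdp = bandwidth_delay_product_bytes(1_000_000_000, 100)

# How many unscaled windows would be needed to fill this path.
windows_needed = bdp / MAX_UNSCALED_WINDOW
```

   On such a path a sender limited to an unscaled window can keep only a
   small fraction of the pipe full per round trip.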

131	   There are three fundamental performance problems with basic TCP over
132	   LFN paths:

134	   (1)  Window Size Limit

136	        The TCP header uses a 16 bit field to report the receive window
137	        size to the sender.  Therefore, the largest window that can be
138	        used is 2^16 = 64 KiB.

140	        To circumvent this problem, Section 2 of this memo defines a TCP
141	        option, "Window Scale", to allow windows larger than 2^16.  This
142	        option defines an implicit scale factor, which is used to
143	        multiply the window size value found in a TCP header to obtain
144	        the true window size.

146	   (2)  Recovery from Losses

148	        Packet losses in an LFN can have a catastrophic effect on
149	        throughput.

151	        To generalize the Fast Retransmit/Fast Recovery mechanism to
152	        handle multiple packets dropped per window, selective
153	        acknowledgments are required.  Unlike the normal cumulative
154	        acknowledgments of TCP, selective acknowledgments give the
155	        sender a complete picture of which segments are queued at the
156	        receiver and which have not yet arrived.

158	        Selective acknowledgments are specified in a separate document,
159	        "A Conservative Selective Acknowledgment (SACK)-based Loss
160	        Recovery Algorithm for TCP" [RFC6675], and not further discussed
161	        in this document.

163	   (3)  Round-Trip Measurement

165	        TCP implements reliable data delivery by retransmitting segments
166	        that are not acknowledged within some retransmission timeout
167	        (RTO) interval.  Accurate dynamic determination of an
168	        appropriate RTO is essential to TCP performance.  RTO is
169	        determined by estimating the mean and variance of the measured
170	        round-trip time (RTT), i.e., the time interval between sending a
171	        segment and receiving an acknowledgment for it [Jacobson88a].

173	        Section 3.2 defines a TCP option, "Timestamp", and then
174	        specifies a mechanism using this option that allows nearly every
175	        segment, including retransmissions, to be timed at negligible
176	        computational cost.  We use the mnemonic RTTM (Round Trip Time
177	        Measurement) for this mechanism, to distinguish it from other
178	        uses of the Timestamp Option.

180	1.2.  TCP Reliability

182	   An especially serious kind of error may result from an accidental
183	   reuse of TCP sequence numbers in data segments.  TCP reliability
184	   depends upon the existence of a bound on the lifetime of a segment:
185	   the "Maximum Segment Lifetime" or MSL.

187	   Duplication of sequence numbers might happen in either of two ways:

189	   (1)  Sequence number wrap-around on the current connection

191	        A TCP sequence number contains 32 bits.  At a high enough
192	        transfer rate, the 32-bit sequence space may be "wrapped"
193	        (cycled) within the time that a segment is delayed in queues.

195	   (2)  Earlier incarnation of the connection

197	        Suppose that a connection terminates, either by a proper close
198	        sequence or due to a host crash, and the same connection (i.e.,
199	        using the same pair of port numbers) is immediately reopened.  A
200	        delayed segment from the terminated connection could fall within
201	        the current window for the new incarnation and be accepted as
202	        valid.

204	   Duplicates from earlier incarnations, case (2), are avoided by
205	   enforcing the current fixed MSL of the TCP specification, as
206	   explained in Section 4.8 and Appendix B.  However, case (1), avoiding
207	   the reuse of sequence numbers within the same connection, requires an
208	   upper bound on MSL that depends upon the transfer rate, and at high
209	   enough rates, a dedicated mechanism is required.
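
   The dependence on transfer rate in case (1) can be made concrete with
   a small calculation.  The sketch below assumes a sender that transmits
   continuously at the given rate and computes how long it takes to cycle
   through the full 2^32-byte sequence space; the function name and the
   sample rates are illustrative assumptions.

```python
SEQ_SPACE = 2**32  # one sequence number per byte, 32-bit field


def seconds_to_wrap(bits_per_second: int) -> float:
    """Time to send 2^32 bytes at a constant rate (hypothetical helper)."""
    bytes_per_second = bits_per_second / 8
    return SEQ_SPACE / bytes_per_second


# Assumed sample rates: 10 Mbit/s, 1 Gbit/s, 10 Gbit/s.
wrap_times = {rate: seconds_to_wrap(rate) for rate in (10**7, 10**9, 10**10)}
```

   Once the wrap time approaches the time a segment can be delayed in the
   network, a segment from the previous cycle can no longer be told apart
   by its sequence number alone.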

211	   A possible fix for the problem of cycling the sequence space would be
212	   to increase the size of the TCP sequence number field.  For example,
213	   the sequence number field (and also the acknowledgment field) could
214	   be expanded to 64 bits.  This could be done either by changing the
215	   TCP header or by means of an additional option.

217	   Section 4 presents a different mechanism, which we call PAWS
218	   (Protection Against Wrapped Sequence numbers), to extend TCP
219	   reliability to transfer rates well beyond the foreseeable upper limit
220	   of network bandwidths.  PAWS uses the TCP timestamp option defined in
221	   Section 3.2 to protect against old duplicates from the same
222	   connection.

224	1.3.  Using TCP options

226	   The extensions defined in this document all use TCP options.

228	   When [RFC1323] was published, there was concern that some buggy TCP
229	   implementation might be crashed by the first appearance of an option
230	   on a non-<SYN> segment.  However, bugs like that can lead to DoS
231	   attacks against a TCP, so it is now expected that most TCP
232	   implementations will properly handle unknown options on non-<SYN>
233	   segments.  But it is still prudent to be conservative in what you
234	   send, and avoiding buggy TCP implementations is not the only reason
235	   for negotiating TCP options on <SYN> segments.

237	   The window scale option negotiates fundamental parameters of the TCP
238	   session.  Therefore, it is only sent during the initial handshake.
239	   Furthermore, the window scale option will be sent in a <SYN,ACK>
240	   segment only if the corresponding option was received in the initial
241	   <SYN> segment.

243	   The timestamp option may appear in any data or <ACK> segment, adding
244	   12 bytes to the 20-byte TCP header.  We recognize there is a trade-
245	   off between the bandwidth saved by reducing unnecessary
246	   retransmission timeouts, and the extra header bandwidth used by this
247	   option.  It is required that this TCP option will be sent on non-
248	   <SYN> segments only after an exchange of options on the <SYN>
249	   segments has indicated that both sides understand this extension.

251	   Appendix A contains a recommended layout of the options in TCP
252	   headers to achieve reasonable data field alignment.

254	   Finally, we observe that most of the mechanisms defined in this
255	   document are important for LFNs and/or very high-speed networks.
256	   For low-speed networks, it might be a performance optimization to NOT
257	   use these mechanisms.  A TCP vendor concerned about optimal
258	   performance over low-speed paths might consider turning these
259	   extensions off for low-speed paths, or allow a user or installation
260	   manager to disable them.

262	1.4.  Terminology

264	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
265	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
266	   document are to be interpreted as described in [RFC2119].

268	   In this document, these words will appear with that interpretation
269	   only when in UPPER CASE.  Lower case uses of these words are not to
270	   be interpreted as carrying [RFC2119] significance.

272	2.  TCP Window Scale Option

274	2.1.  Introduction

276	   The window scale extension expands the definition of the TCP window
277	   to 32 bits and then uses a scale factor to carry this 32-bit value in
278	   the 16-bit Window field of the TCP header (SEG.WND in RFC 793).  The
279	   scale factor is carried in a TCP option, Window Scale.  This option
280	   is sent only in a <SYN> segment (a segment with the SYN bit on),
281	   hence the window scale is fixed in each direction when a connection
282	   is opened.

284	   The maximum receive window, and therefore the scale factor, is
285	   determined by the maximum receive buffer space.  In a typical modern
286	   implementation, this maximum buffer space is set by default but can
287	   be overridden by a user program before a TCP connection is opened.
288	   This determines the scale factor, and therefore no new user interface
289	   is needed for window scaling.

291	2.2.  Window Scale Option

293	   The three-byte Window Scale option MAY be sent in a <SYN> segment by
294	   a TCP.  It has two purposes: (1) indicate that the TCP is prepared to
295	   do both send and receive window scaling, and (2) communicate a scale
296	   factor to be applied to its receive window.  Thus, a TCP that is
297	   prepared to scale windows SHOULD send the option, even if its own
298	   scale factor is 1.  The scale factor is limited to a power of two and
299	   encoded logarithmically, so it may be implemented by binary shift
300	   operations.

302	   TCP Window Scale Option (WSopt):

304	   Kind: 3

306	   Length: 3 bytes

308	          +---------+---------+---------+
309	          | Kind=3  |Length=3 |shift.cnt|
310	          +---------+---------+---------+
311	               1         1         1
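
   A minimal sketch of the three-byte layout shown above, written in
   Python for illustration.  The function names are hypothetical; only
   the Kind and Length values and the one-byte shift.cnt field come from
   the option definition.

```python
WSOPT_KIND = 3
WSOPT_LEN = 3


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the Window Scale option: Kind, Length, shift.cnt (one byte each)."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return bytes([WSOPT_KIND, WSOPT_LEN, shift_cnt])


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a well-formed Window Scale option."""
    if len(option) != WSOPT_LEN or option[0] != WSOPT_KIND or option[1] != WSOPT_LEN:
        raise ValueError("not a well-formed Window Scale option")
    return option[2]
```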

313	   This option is an offer, not a promise; both sides MUST send Window
314	   Scale options in their <SYN> segments to enable window scaling in
315	   either direction.  If window scaling is enabled, then the TCP that
316	   sent this option will right-shift its true receive-window values by
317	   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
318	   MAY be zero (offering to scale, while applying a scale factor of 1 to
319	   the receive window).

321	   This option MAY be sent in an initial <SYN> segment (i.e., a segment
322	   with the SYN bit on and the ACK bit off).  It MAY also be sent in a
323	   <SYN,ACK> segment, but only if a Window Scale option was received in
324	   the initial <SYN> segment.  A Window Scale option in a segment
325	   without a SYN bit SHOULD be ignored.

327	   The window field in a segment where the SYN bit is set (i.e., a <SYN>
328	   or <SYN,ACK>) is never scaled.

330	2.3.  Using the Window Scale Option

332	   A model implementation of window scaling is as follows, using the
333	   notation of [RFC0793]:

335	   o  All windows are treated as 32-bit quantities for storage in the
336	      connection control block and for local calculations.  This
337	      includes the send-window (SND.WND) and the receive-window
338	      (RCV.WND) values, as well as the congestion window.

340	   o  The connection state is augmented by two window shift counts,
341	      Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
342	      and outgoing window fields, respectively.

344	   o  If a TCP receives a <SYN> segment containing a Window Scale
345	      option, it sends its own Window Scale option in the <SYN,ACK>
346	      segment.

348	   o  The Window Scale option is sent with shift.cnt = R, where R is the
349	      value that the TCP would like to use for its receive window.

351	   o  Upon receiving a <SYN> segment with a Window Scale option
352	      containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
353	      Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
354	      Rcv.Wind.Scale to zero.

356	   o  The window field (SEG.WND) in the header of every incoming
357	      segment, with the exception of <SYN> segments, is left-shifted by
358	      Snd.Wind.Scale bits before updating SND.WND:

360	                    SND.WND = SEG.WND << Snd.Wind.Scale

362	      (assuming the other conditions of [RFC0793] are met, and using the
363	      "C" notation "<<" for left-shift).

365	   o  The window field (SEG.WND) of every outgoing segment, with the
366	      exception of <SYN> segments, is right-shifted by Rcv.Wind.Scale
367	      bits:

369	                    SEG.WND = RCV.WND >> Rcv.Wind.Scale
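
   The model above can be sketched as follows.  This is an illustrative
   outline under stated assumptions, not a complete TCP: the class and
   function names are hypothetical, the passive-open case is the only one
   shown, and clamping the outgoing value to 16 bits is an added
   safeguard that the list above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowScaleState:
    snd_wind_scale: int = 0  # Snd.Wind.Scale: applied to incoming SEG.WND
    rcv_wind_scale: int = 0  # Rcv.Wind.Scale: applied to outgoing window field


def on_syn_received(desired_shift: int, peer_shift: Optional[int]) -> WindowScaleState:
    """peer_shift is shift.cnt = S from the peer's <SYN>, or None if absent."""
    if peer_shift is None:
        # No Window Scale option received: both shift counts are zero.
        return WindowScaleState(0, 0)
    return WindowScaleState(snd_wind_scale=peer_shift, rcv_wind_scale=desired_shift)


def update_snd_wnd(state: WindowScaleState, seg_wnd: int, is_syn: bool) -> int:
    """SND.WND from an incoming segment; <SYN> segments are never scaled."""
    return seg_wnd if is_syn else seg_wnd << state.snd_wind_scale


def outgoing_window_field(state: WindowScaleState, rcv_wnd: int, is_syn: bool) -> int:
    """16-bit Window field for an outgoing segment (clamping is an assumption)."""
    shifted = rcv_wnd if is_syn else rcv_wnd >> state.rcv_wind_scale
    return min(shifted, 0xFFFF)


# Usage: we want R = 7, the peer's <SYN> carried S = 5.
state = on_syn_received(desired_shift=7, peer_shift=5)
```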

371	   TCP determines if a data segment is "old" or "new" by testing whether
372	   its sequence number is within 2^31 bytes of the left edge of the
373	   window, and if it is not, discarding the data as "old".  To ensure
374	   that new data is never mistakenly considered old and vice versa, the
375	   left edge of the sender's window has to be at most 2^31 away from the
376	   right edge of the receiver's window.  Similarly with the sender's
377	   right edge and receiver's left edge.  Since the right and left edges
378	   of either the sender's or receiver's window differ by the window
379	   size, and since the sender and receiver windows can be out of phase
380	   by at most the window size, the above constraints imply that two
381	   times the maximum window size must be less than 2^31, or

383	                             max window < 2^30

385	   Since the max window is 2^S (where S is the scaling shift count)
386	   times at most 2^16 - 1 (the maximum unscaled window), the maximum
387	   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
388	   MUST be limited to 14 (which allows windows of 2^30 = 1 GiB).  If a
389	   Window Scale option is received with a shift.cnt value exceeding 14,
390	   the TCP SHOULD log the error but use 14 instead of the specified
391	   value.
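
   The bound can be checked numerically.  The sketch below is
   illustrative only: the helper names are hypothetical, and the use of
   Python's logging module stands in for whatever error reporting an
   implementation provides.

```python
import logging

MAX_SHIFT = 14
MAX_UNSCALED_WINDOW = 2**16 - 1


def max_window(shift: int) -> int:
    """Largest window expressible with the given shift count."""
    return MAX_UNSCALED_WINDOW << shift


def effective_shift(received_shift_cnt: int) -> int:
    """Use 14 in place of any larger received shift.cnt, logging the error."""
    if received_shift_cnt > MAX_SHIFT:
        logging.warning("shift.cnt %d exceeds %d; using %d",
                        received_shift_cnt, MAX_SHIFT, MAX_SHIFT)
        return MAX_SHIFT
    return received_shift_cnt
```

   With S = 14 the largest window is (2^16 - 1) * 2^14, which stays just
   below 2^30; with S = 15 the bound would be exceeded.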

393	   The scale factor applies only to the Window field as transmitted in
394	   the TCP header; each TCP using extended windows will maintain the
395	   window values locally as 32-bit numbers.  For example, the
396	   "congestion window" computed by Slow Start and Congestion Avoidance
397	   is not affected by the scale factor, so window scaling will not
398	   introduce quantization into the congestion window.

400	2.4.  Addressing Window Retraction

402	   When a non-zero scale factor is in use, there are instances when a
403	   retracted window can be offered; see Appendix F for a detailed
404	   example.  The end of the window will be on a boundary based on the
405	   granularity of the scale factor being used.  If the sequence number
406	   is then updated by a number of bytes smaller than that granularity,
407	   the TCP will have to either advertise a new window that is beyond
408	   what it previously advertised (and perhaps beyond the buffer), or
409	   will have to advertise a smaller window, which will cause the TCP
410	   window to shrink.  Implementations MUST ensure that they handle a
411	   shrinking window, as specified in section 4.2.2.16 of [RFC1122].
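
   The effect can be illustrated with a small calculation.  The numbers
   below are assumptions picked for the example (shift count 8, so a
   granularity of 256 bytes), sequence number wrap-around is ignored, and
   the helper name is hypothetical.

```python
def advertised_right_edge(rcv_nxt: int, rcv_wnd: int, shift: int,
                          round_up: bool = False) -> int:
    """Right window edge as the peer sees it after the window is scaled."""
    field = rcv_wnd >> shift              # value carried in the Window field
    if round_up and (field << shift) != rcv_wnd:
        field += 1                        # advertise more than is available
    return rcv_nxt + (field << shift)     # what the peer reconstructs


# Initially 1024 bytes are free at RCV.NXT = 1000: right edge 2024.
before = advertised_right_edge(rcv_nxt=1000, rcv_wnd=1024, shift=8)

# 100 bytes arrive and are not yet consumed: 924 bytes free at RCV.NXT = 1100.
rounded_down = advertised_right_edge(rcv_nxt=1100, rcv_wnd=924, shift=8)
rounded_up = advertised_right_edge(rcv_nxt=1100, rcv_wnd=924, shift=8,
                                   round_up=True)
```

   Rounding down moves the advertised right edge back from 2024 to 1868
   (a retracted window); rounding up moves it to 2124, beyond what was
   previously advertised.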

413	   For the receiver, this implies that:

415	   1)  The receiver MUST honor, as in-window, any segment that would
416	       have been in-window for any <ACK> sent by the receiver.

418	   2)  When window scaling is in effect, the receiver SHOULD track the
419	       actual maximum window sequence number (which is likely to be
420	       greater than the window announced by the most recent <ACK>, if
421	       more than one segment has arrived since the application consumed
422	       any data in the receive buffer).

424	   On the sender side:

426	   3)  The initial transmission MUST be within the window announced by
427	       the most recent <ACK>.

429	   4)  On first retransmission, or if the sequence number is out-of-
430	       window by less than (2^Rcv.Wind.Scale), then do normal
431	       retransmission(s) without regard to the receiver window, as long
432	       as the original segment was in window when it was sent.

434	   5)  Subsequent retransmissions MAY only be sent if they are within
435	       the window announced by the most recent <ACK>.

437	3.  RTTM -- Round-Trip Time Measurement

439	3.1.  Introduction

441	   Accurate and current RTT estimates are necessary to adapt to changing
442	   traffic conditions and to avoid an instability known as "congestion
443	   collapse" [RFC0896] in a busy network.  However, accurate measurement
444	   of RTT may be difficult both in theory and in implementation.

446	   Many TCP implementations base their RTT measurements upon a sample of
447	   one segment per window or less.  While this yields an adequate
448	   approximation to the RTT for small windows, it results in an
449	   unacceptably poor RTT estimate for an LFN.  If we look at RTT
450	   estimation as a signal processing problem (which it is), a data
451	   signal at some frequency, the packet rate, is being sampled at a
452	   lower frequency, the window rate.  This lower sampling frequency
453	   violates Nyquist's criterion and may therefore introduce "aliasing"
454	   artifacts into the estimated RTT [Hamming77].

456	   A good RTT estimator with a conservative retransmission timeout
457	   calculation can tolerate aliasing when the sampling frequency is
458	   "close" to the data frequency.  For example, with a window of 8
459	   segments, the sample rate is 1/8 the data frequency -- less than an
460	   order of magnitude different.  However, when the window is tens or
461	   hundreds of segments, the RTT estimator may be seriously in error,
462	   resulting in spurious retransmissions.

464	   If there are dropped segments, the problem becomes worse.  Zhang
465	   [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
466	   possible to accumulate reliable RTT estimates if retransmitted
467	   segments are included in the estimate.  Since a full window of data
468	   will have been transmitted prior to a retransmission, all of the
469	   segments in that window will have to be ACKed before the next RTT
470	   sample can be taken.  This means at least an additional window's
471	   worth of time between RTT measurements and, as the error rate
472	   approaches one per window of data (e.g., 10^-6 errors per bit for the
473	   Wideband satellite network), it becomes effectively impossible to
474	   obtain a valid RTT measurement.

476	   A solution to these problems, which actually simplifies the sender
477	   substantially, is as follows: using TCP options, the sender places a
478	   timestamp in each data segment, and the receiver reflects these
479	   timestamps back in <ACK> segments.  Then a single subtraction gives the
480	   sender an accurate RTT measurement for every <ACK> segment (which
481	   will correspond to every other data segment, with a sensible
482	   receiver).  We call this the RTTM (Round-Trip Time Measurement)
483	   mechanism.

485	   It is vitally important to use the RTTM mechanism with big windows;
486	   otherwise, the door is opened to some dangerous instabilities due to
487	   aliasing.  Furthermore, the option is probably useful for all TCPs,
488	   since it simplifies the sender.

490	3.2.  TCP Timestamp Option

492	   TCP is a symmetric protocol, allowing data to be sent at any time in
493	   either direction, and therefore timestamp echoing may occur in either
494	   direction.  For simplicity and symmetry, we specify that timestamps
495	   always be sent and echoed in both directions.  For efficiency, we
496	   combine the timestamp and timestamp reply fields into a single TCP
497	   Timestamp Option.

499	   TCP Timestamp Option (TSopt):

501	   Kind: 8

503	   Length: 10 bytes

505	          +-------+-------+---------------------+---------------------+
506	          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
507	          +-------+-------+---------------------+---------------------+
508	              1       1              4                     4

510	   The Timestamp Option carries two four-byte timestamp fields.  The
511	   Timestamp Value field (TSval) contains the current value of the
512	   timestamp clock of the TCP sending the option.

514	   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
515	   in the TCP header; if it is valid, it echoes a timestamp value that
516	   was sent by the remote TCP in the TSval field of a Timestamp option.
517	   When TSecr is not valid, its value MUST be zero.  However, a value of
518	   zero does not imply that TSecr is invalid.  The TSecr value will
519	   generally be from the most recent Timestamp Option that was received;
520	   however, there are exceptions that are explained below.

522	   A TCP MAY send the Timestamp option (TSopt) in an initial <SYN>
523	   segment (i.e., segment containing a SYN bit and no ACK bit), and MAY
524	   send a TSopt in other segments only if it received a TSopt in the
525	   initial <SYN> or <SYN,ACK> segment for the connection.

527	   Once TSopt has been successfully negotiated (sent and received)
528	   during the <SYN>, <SYN,ACK> exchange, TSopt MUST be sent in every
529	   non-<RST> segment for the duration of the connection.  If a non-<RST>
530	   segment is received without a TSopt, a TCP MAY drop the segment and
531	   send an <ACK> for the last in-sequence segment.  A TCP MUST NOT abort
532	   a TCP connection if a non-<RST> segment is received without a TSopt.

534	   If a TSopt is received on a connection where TSopt was not negotiated
535	   in the initial three-way handshake, the TSopt MUST be ignored and the
536	   packet processed normally.
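
   The receive-side rules in the preceding paragraphs can be summarized
   as a small decision function.  This is a sketch only: the names are
   hypothetical, and the "strict" flag models the choice left open by the
   MAY in the text.  Aborting the connection is deliberately not among
   the possible outcomes.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process normally"
    PROCESS_IGNORING_TSOPT = "ignore the TSopt, process normally"
    DROP_AND_ACK = "drop segment, ACK last in-sequence segment"


def classify_segment(ts_negotiated: bool, has_tsopt: bool, is_rst: bool,
                     strict: bool = True) -> Action:
    """Decide how to treat a received segment with respect to TSopt."""
    if not ts_negotiated:
        # TSopt was not negotiated in the handshake: any TSopt is ignored.
        return Action.PROCESS_IGNORING_TSOPT if has_tsopt else Action.PROCESS
    if has_tsopt or is_rst:
        # <RST> segments are exempt from the requirement to carry TSopt.
        return Action.PROCESS
    # Negotiated, non-<RST>, TSopt missing: a TCP MAY drop and ACK.
    return Action.DROP_AND_ACK if strict else Action.PROCESS
```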

538	   In the case of crossing <SYN> segments where one <SYN> contains a
539	   TSopt and the other doesn't, both sides MAY send a TSopt in the
540	   <SYN,ACK> segment.

542	   TSopt is required for the two mechanisms described in Sections 3.3
543	   and 4.2.  Other mechanisms also rely on the presence of the TSopt,
544	   e.g., [RFC3522].  If a TCP stops sending TSopt at any time during
545	   an established session, it interferes with these mechanisms.  This
546	   update to [RFC1323] makes explicit the previous assumption (see
547	   Section 4.2) that each TCP segment must carry a TSopt once it has
548	   been negotiated.
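
   The negotiation rules of this section might be captured roughly as
   follows.  This is a sketch under stated assumptions, not a
   definitive implementation: the class, the function, the returned
   strings, and the "strict" flag (which selects the optional
   drop-and-ACK behaviour) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TsNegotiation:
    sent_in_syn: bool       # we put a TSopt in our <SYN> or <SYN,ACK>
    received_in_syn: bool   # the peer's <SYN> or <SYN,ACK> had a TSopt

    @property
    def negotiated(self) -> bool:
        # TSopt is negotiated only if it was both sent and received.
        return self.sent_in_syn and self.received_in_syn

def classify_segment(neg: TsNegotiation, has_tsopt: bool,
                     is_rst: bool, strict: bool = True) -> str:
    """Decide how to treat an incoming segment on a synchronized connection."""
    if not neg.negotiated:
        # A stray TSopt is ignored; the segment is processed normally.
        return "process, ignoring any TSopt"
    if has_tsopt or is_rst:
        return "process"
    # Non-<RST> segment without TSopt: a TCP MAY drop it and send an
    # <ACK>, but it must never abort the connection for this reason.
    return "drop and ACK" if strict else "process"
```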

550	3.3.  The RTTM Mechanism

552	   RTTM places a Timestamp Option in every segment, with a TSval that is
553	   obtained from a (virtual) "timestamp clock".  Values of this clock
554	   MUST be at least approximately proportional to real time, in order to
555	   measure actual RTT.

557	   These TSval values are echoed in TSecr values in the reverse
558	   direction.  The difference between a received TSecr value and the
559	   current timestamp clock value provides an RTT measurement.

561	   When timestamps are used, every segment that is received will contain
562	   a TSecr value.  However, these values cannot all be used to update
563	   the measured RTT.  The following example illustrates why.  It shows a
564	   one-way data flow with segments arriving in sequence without loss.
565	   Here A, B, C... represent data blocks occupying successive blocks of
566	   sequence numbers, and ACK(A),... represent the corresponding
567	   cumulative acknowledgments.  The two timestamp fields of the
568	   Timestamp Option are shown symbolically as <TSval=x,TSecr=y>.  Each
569	   TSecr field contains the value most recently received in a TSval
570	   field.

572	              TCP  A                                     TCP B

574	                              <A,TSval=1,TSecr=120> ----->

576	                   <---- <ACK(A),TSval=127,TSecr=1>

578	                              <B,TSval=5,TSecr=127> ----->

580	                   <---- <ACK(B),TSval=131,TSecr=5>

582	                . . . . . . . . . . . . . . . . . . . . . .

584	                              <C,TSval=65,TSecr=131> ---->

586	                   <---- <ACK(C),TSval=191,TSecr=65>

588	                                  (etc.)

590	   The dotted line marks a pause (60 time units long) in which A had
591	   nothing to send.  Note that this pause inflates the RTT which B could
592	   infer from receiving TSecr=131 in data segment C. Thus, in one-way
593	   data flows, RTTM in the reverse direction measures a value that is
594	   inflated by gaps in sending data.  However, the following rule
595	   prevents a resulting inflation of the measured RTT:

597	   RTTM Rule: A TSecr value received in a segment MAY be used to update
598	              the averaged RTT measurement only if the segment advances
599	              the left edge of the send window, i.e., SND.UNA is
600	              increased.

602	   Since TCP B is not sending data, the data segment C does not
603	   acknowledge any new data when it arrives at B. Thus, the inflated
604	   RTTM measurement is not used to update B's RTTM measurement.
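
   A minimal sketch of the RTTM Rule is given here, assuming modular
   32-bit sequence numbers and timestamps.  The function names, the
   ACK numbers, and TCP A's clock value of 70 in the checks further
   down are made up for illustration; a real stack would also verify
   that the ACK does not exceed SND.NXT.

```python
MASK32 = 0xFFFFFFFF

def seq_gt(a: int, b: int) -> bool:
    """True if sequence number a is after b in modular 32-bit space."""
    return 0 < ((a - b) & MASK32) < 2**31

def rttm_sample(now: int, seg_ack: int, seg_tsecr: int, snd_una: int):
    """Return (rtt_sample_or_None, new_snd_una) for one arriving segment.

    A sample is produced only if the segment advances SND.UNA.
    """
    if seq_gt(seg_ack, snd_una):
        return (now - seg_tsecr) & MASK32, seg_ack
    return None, snd_una
```

   With these definitions, segment C arriving at TCP B (clock near
   191, TSecr=131) acknowledges nothing new and yields no sample,
   whereas an <ACK> that advances SND.UNA at TCP A does yield one.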

606	   Implementers should note that with timestamps multiple RTTMs can be
607	   taken per RTT.  Many RTO estimators have a weighting factor based on
608	   an implicit assumption that at most one RTTM will be sampled per RTT.
609	   When using multiple RTTMs per RTT to update the RTO estimator, the
610	   weighting factor needs to be decreased to take into account the more
611	   frequent RTTMs.  For example, an implementation could choose to just
612	   use one sample per RTT to update the RTO estimator, or vary the gain
613	   based on the congestion window, or take an average of all the RTT
614	   measurements received over one RTT, and then use that value to update
615	   the RTO estimator.  This document does not prescribe any particular
616	   method for modifying the RTO estimator.
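
   One of the options mentioned above, averaging all RTT measurements
   taken during one RTT before updating the estimator, could look
   like the following sketch.  The class name is hypothetical, and the
   smoothing constants (1/8 and 1/4) and update equations follow the
   commonly used RFC 6298 style estimator; that choice is an
   assumption of this example and is not prescribed by this document.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # gain for the smoothed RTT (assumed)
    BETA = 1 / 4    # gain for the RTT variance (assumed)

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self._pending = []          # samples collected in this RTT

    def add_sample(self, rtt: float) -> None:
        """Record one RTTM sample without touching the estimator."""
        self._pending.append(rtt)

    def end_of_round(self) -> None:
        """Once per RTT: fold the mean of the pending samples in."""
        if not self._pending:
            return
        r = sum(self._pending) / len(self._pending)
        self._pending.clear()
        if self.srtt is None:       # first measurement
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar \
                + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r

    def rto(self) -> float:
        """RTO without clock-granularity or minimum-RTO clamping."""
        return self.srtt + 4 * self.rttvar
```

   Feeding the estimator once per RTT keeps the effective gain the
   same as in a stack that takes a single sample per RTT, however
   many timestamps are echoed within that RTT.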

618	3.4.  Which Timestamp to Echo

620	   If more than one Timestamp Option is received before a reply segment
621	   is sent, the TCP must choose only one of the TSvals to echo, ignoring
622	   the others.  To minimize the state kept in the receiver (i.e., the
623	   number of unprocessed TSvals), the receiver should be required to
624	   retain at most one timestamp in the connection control block.

626	   There are three situations to consider:

628	   (A)  Delayed ACKs.

630	        Many TCPs acknowledge only every Kth segment out of a group of
631	        segments arriving within a short time interval; this policy is
632	        known generally as "delayed ACKs".  The data-sender TCP must
633	        measure the effective RTT, including the additional time due to
634	        delayed ACKs, or else it will retransmit unnecessarily.  Thus,
635	        when delayed ACKs are in use, the receiver SHOULD reply with the
636	        TSval field from the earliest unacknowledged segment.

638	   (B)  A hole in the sequence space (segment(s) have been lost).

640	        The sender will continue sending until the window is filled, and
641	        the receiver may be generating <ACK>s as these out-of-order
642	        segments arrive (e.g., to aid "fast retransmit").

644	        The lost segment is probably a sign of congestion, and in that
645	        situation the sender should be conservative about
646	        retransmission.  Furthermore, it is better to overestimate than
647	        underestimate the RTT.  An <ACK> for an out-of-order segment
648	        SHOULD therefore contain the timestamp from the most recent
649	        segment that advanced the window.

651	        The same situation occurs if segments are re-ordered by the
652	        network.

654	   (C)  A filled hole in the sequence space.

656	        The segment that fills the hole represents the most recent
657	        measurement of the network characteristics.  An RTT computed from
658	        an earlier segment would probably include the sender's
659	        retransmit time-out, badly biasing the sender's average RTT
660	        estimate.  Thus, the timestamp from the latest segment (which
661	        filled the hole) MUST be echoed.

663	   An algorithm that covers all three cases is described in the
664	   following rules for Timestamp Option processing on a synchronized
665	   connection:

667	   (1)  The connection state is augmented with two 32-bit slots:

669	        TS.Recent holds a timestamp to be echoed in TSecr whenever a
670	        segment is sent, and Last.ACK.sent holds the ACK field from the
671	        last segment sent.  Last.ACK.sent will equal RCV.NXT except when
672	        <ACK>s have been delayed.

674	   (2)  If:

676	            SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

678	        then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.

680	   (3)  When a TSopt is sent, its TSecr field is set to the current
681	        TS.Recent value.

683	   The following examples illustrate these rules.  Here A, B, C...
684	   represent data segments occupying successive blocks of sequence
685	   numbers, and ACK(A),... represent the corresponding acknowledgment
686	   segments.  Note that ACK(A) has the same sequence number as B. We
687	   show only one direction of timestamp echoing, for clarity.

689	   o  Segments arrive in sequence, and some of the <ACK>s are delayed.

691	      By case (A), the timestamp from the oldest unacknowledged segment
692	      is echoed.

694	                                                    TS.Recent
695	                  <A, TSval=1> ------------------->
696	                                                        1
697	                  <B, TSval=2> ------------------->
698	                                                        1
699	                  <C, TSval=3> ------------------->
700	                                                        1
701	                           <---- <ACK(C), TSecr=1>
702	                  (etc)

704	   o  Segments arrive out of order, and every segment is acknowledged.

706	      By case (B), the timestamp from the last segment that advanced the
707	      left window edge is echoed until the missing segment arrives; that
708	      segment's timestamp is then echoed, by case (C).  The same sequence
709	      would occur if segments B and D were lost and retransmitted.

711	                                                    TS.Recent
712	                  <A, TSval=1> ------------------->
713	                                                        1
714	                           <---- <ACK(A), TSecr=1>
715	                                                        1
716	                  <C, TSval=3> ------------------->
717	                                                        1
718	                           <---- <ACK(A), TSecr=1>
719	                                                        1
720	                  <B, TSval=2> ------------------->
721	                                                        2
722	                           <---- <ACK(C), TSecr=2>
723	                                                        2
724	                  <E, TSval=5> ------------------->
725	                                                        2
726	                           <---- <ACK(C), TSecr=2>
727	                                                        2
728	                  <D, TSval=4> ------------------->
729	                                                        4
730	                           <---- <ACK(E), TSecr=4>
731	                  (etc)
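
   Rules (1) through (3) and the two examples above can be replayed
   with a small simulation.  This is an illustrative sketch only: the
   class and function names are hypothetical, each segment is taken to
   occupy exactly one unit of sequence space (A=0, B=1, ...), and
   plain integer comparison stands in for the modular comparison a
   real implementation would need.

```python
class Receiver:
    def __init__(self):
        self.rcv_nxt = 0            # next expected sequence number
        self.last_ack_sent = 0      # Last.ACK.sent
        self.ts_recent = 0          # TS.Recent
        self.out_of_order = set()   # queued segments beyond a hole

    def segment(self, seq, tsval, send_ack=True):
        # Rule (2): update TS.Recent only for a sufficiently new
        # timestamp on a segment at or before Last.ACK.sent.
        if tsval >= self.ts_recent and seq <= self.last_ack_sent:
            self.ts_recent = tsval
        if seq == self.rcv_nxt:
            self.rcv_nxt += 1
            while self.rcv_nxt in self.out_of_order:
                self.out_of_order.remove(self.rcv_nxt)
                self.rcv_nxt += 1
        elif seq > self.rcv_nxt:
            self.out_of_order.add(seq)
        if not send_ack:            # delayed ACK: nothing sent yet
            return None
        self.last_ack_sent = self.rcv_nxt
        # Rule (3): TSecr carries the current TS.Recent.
        return self.rcv_nxt, self.ts_recent

def replay(segments):
    """Return the (ACK number, TSecr) pairs generated for the input."""
    rx = Receiver()
    acks = [rx.segment(*s) for s in segments]
    return [a for a in acks if a is not None]

A, B, C, D, E = range(5)
delayed = replay([(A, 1, False), (B, 2, False), (C, 3, True)])
reordered = replay([(A, 1), (C, 3), (B, 2), (E, 5), (D, 4)])
```

   Here an ACK number of 1 corresponds to ACK(A), 3 to ACK(C), and 5
   to ACK(E), so the echoed values match the two diagrams.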

733	4.  PAWS -- Protection Against Wrapped Sequence Numbers

735	4.1.  Introduction

737	   Section 4.2 describes a simple mechanism to reject old duplicate
738	   segments that might corrupt an open TCP connection; we call this
739	   mechanism PAWS (Protection Against Wrapped Sequence Numbers).  PAWS
740	   operates within a single TCP connection, using state that is saved in
741	   the connection control block.  Section 4.8 and Appendix B discuss the
742	   implications of the PAWS mechanism for avoiding old duplicates from
743	   previous incarnations of the same connection.

745	4.2.  The PAWS Mechanism

747	   PAWS uses the same TCP Timestamp Option as the RTTM mechanism
748	   described earlier, and assumes that every received TCP segment
749	   (including data and <ACK> segments) contains a timestamp SEG.TSval
750	   whose values are monotonically non-decreasing in time.  The basic
751	   idea is that a segment can be discarded as an old duplicate if it is
752	   received with a timestamp SEG.TSval less than some timestamp recently
753	   received on this connection.

755	   In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit
756	   unsigned integers in a modular 32-bit space.  Thus, "less than" is
757	   defined the same way it is for TCP sequence numbers, and the same
758	   implementation techniques apply.  If s and t are timestamp values,

760	                       s < t  if 0 < (t - s) < 2^31,

762	   computed in unsigned 32-bit arithmetic.
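
   The comparison can be written directly from this definition.  The
   following sketch uses a hypothetical function name and emulates
   unsigned 32-bit arithmetic by masking, since Python integers are
   unbounded.

```python
MASK32 = 0xFFFFFFFF

def ts_lt(s: int, t: int) -> bool:
    """True if timestamp s is 'less than' t in modular 32-bit space."""
    return 0 < ((t - s) & MASK32) < 2**31
```

   Note that a value just before the wrap point compares as older
   than a small value just after it, and that no value is less than
   itself.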

764	   The choice of incoming timestamps to be saved for this comparison
765	   MUST guarantee a value that is monotonically increasing.  For
766	   example, we might save the timestamp from the segment that last
767	   advanced the left edge of the receive window, i.e., the most recent
768	   in-sequence segment.  Instead, we choose the value TS.Recent
769	   introduced in Section 3.4 for the RTTM mechanism, since using a
770	   common value for both PAWS and RTTM simplifies the implementation of
771	   both.  As Section 3.4 explained, TS.Recent differs from the timestamp
772	   from the last in-sequence segment only in the case of delayed <ACK>s,
773	   and therefore by less than one window.  Either choice will therefore
774	   protect against sequence number wrap-around.

776	   RTTM was specified in a symmetrical manner, so that TSval timestamps
777	   are carried in both data and <ACK> segments and are echoed in TSecr
778	   fields carried in returning <ACK> or data segments.  PAWS submits all
779	   incoming segments to the same test, and therefore protects against
780	   duplicate <ACK> segments as well as data segments.  (An alternative
781	   non-symmetric algorithm would protect against old duplicate <ACK>s:
782	   the sender of data would reject incoming <ACK> segments whose TSecr
783	   values were less than the TSecr saved from the last segment whose ACK
784	   field advanced the left edge of the send window.  This algorithm was
785	   deemed to lack economy of mechanism and symmetry.)

787	   TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
788	   initialize PAWS.  PAWS protects against old duplicate non-<SYN>
789	   segments, and duplicate <SYN> segments received while there is a
790	   synchronized connection.  Duplicate <SYN> and <SYN,ACK> segments
791	   received when there is no connection will be discarded by the normal
792	   3-way handshake and sequence number checks of TCP.

794	   [RFC1323] recommended that <RST> segments NOT carry timestamps, and
795	   that they be acceptable regardless of their timestamp.  At that time,
796	   the thinking was that old duplicate <RST> segments should be
797	   exceedingly unlikely, and their cleanup function should take
798	   precedence over timestamps.  More recently, discussions about various
799	   blind attacks on TCP connections have raised the suggestion that if
800	   the timestamp option is present, SEG.TSecr could be used to provide
801	   stricter acceptance tests for <RST> segments.  While this is still
802	   under discussion, to enable research into this area it is now
803	   RECOMMENDED that a <RST> contain a timestamp option if the segment
804	   that caused the <RST> to be generated contained one.  In the <RST>
805	   segment, SEG.TSecr SHOULD be set to SEG.TSval from the incoming
806	   segment and SEG.TSval SHOULD be set to zero.  If a <RST> is being
807	   generated because of a user abort, and Snd.TS.OK is set, then a
808	   timestamp option SHOULD be included in the <RST>.  When a <RST>
809	   segment is received, it MUST NOT be subjected to PAWS checks, and
810	   information from the timestamp option MUST NOT be used to update
811	   connection state information.  SEG.TSecr MAY be used to provide
812	   stricter <RST> acceptance checks.

814	4.3.  Basic PAWS Algorithm

816	   The PAWS algorithm REQUIRES the following processing to be performed
817	   on all incoming segments for a synchronized connection.  Also, PAWS
818	   processing MUST take precedence over the regular TCP acceptability
819	   check (Section 3.3 in [RFC0793]), which is performed after
820	   verification of the received timestamp option:

822	   R1)  If there is a Timestamp Option in the arriving segment,
823	        SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion)
824	        and the RST bit is not set, then treat the arriving segment as
825	        not acceptable:

827	           Send an acknowledgement in reply as specified in [RFC0793]
828	           page 69 and drop the segment.

830	           Note: it is necessary to send an <ACK> segment in order to
831	           retain TCP's mechanisms for detecting and recovering from
832	           half-open connections.  For example, see Figure 10 of
833	           [RFC0793].

835	   R2)  If the segment is outside the window, reject it (normal TCP
836	        processing)

838	   R3)  If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
839	        Section 3.4), then record its timestamp in TS.Recent.

841	   R4)  If an arriving segment is in-sequence (i.e., at the left window
842	        edge), then accept it normally.

844	   R5)  Otherwise, treat the segment as a normal in-window, out-of-
845	        sequence TCP segment (e.g., queue it for later delivery to the
846	        user).

848	   Steps R2, R4, and R5 are the normal TCP processing steps specified by
849	   [RFC0793].

851	   It is important to note that the timestamp MUST be checked only when
852	   a segment first arrives at the receiver, regardless of whether it is
853	   in-sequence or it must be queued for later delivery.

855	   Consider the following example.

857	      Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been
858	      sent, where the letter indicates the sequence number and the digit
859	      represents the timestamp.  Suppose also that segment B.1 has been
860	      lost.  The timestamp in TS.Recent is 1 (from A.1), so C.1, ...,
861	      Z.1 are considered acceptable and are queued.  When B is
862	      retransmitted as segment B.2 (using the latest timestamp), it
863	      fills the hole and causes all the segments through Z to be
864	      acknowledged and passed to the user.  The timestamps of the queued
865	      segments are *not* inspected again at this time, since they have
866	      already been accepted.  When B.2 is accepted, TS.Recent is set to
867	      2.
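
   Steps R1 through R5 and the example above can be sketched as
   follows.  This is a simplified illustration, not a definitive
   implementation: the names are hypothetical, sequence numbers are
   plain integers with one unit per segment (A=0, B=1, ...), every
   segment is acknowledged immediately, the window test is reduced to
   a range check, and <RST> segments (which are exempt from R1) are
   not modelled.

```python
MASK32 = 0xFFFFFFFF

def ts_lt(s, t):
    """Modular 32-bit 'less than' for timestamps."""
    return 0 < ((t - s) & MASK32) < 2**31

class PawsReceiver:
    def __init__(self, rcv_nxt=0, rcv_wnd=26, ts_recent=0):
        self.rcv_nxt, self.rcv_wnd = rcv_nxt, rcv_wnd
        self.ts_recent, self.ts_recent_valid = ts_recent, True
        self.last_ack_sent = rcv_nxt
        self.queued = {}                      # seq -> length

    def receive(self, seq, tsval, length=1):
        # R1: old timestamp -> acknowledge and drop.
        if self.ts_recent_valid and ts_lt(tsval, self.ts_recent):
            return "R1"
        # R2: outside the window -> reject (normal TCP processing).
        if not self.rcv_nxt <= seq < self.rcv_nxt + self.rcv_wnd:
            return "R2"
        # R3: record the timestamp if SEG.SEQ <= Last.ACK.sent.
        if seq <= self.last_ack_sent:
            self.ts_recent = tsval
        # R4: in-sequence segment; also release queued segments
        # without inspecting their timestamps again.
        if seq == self.rcv_nxt:
            self.rcv_nxt += length
            while self.rcv_nxt in self.queued:
                self.rcv_nxt += self.queued.pop(self.rcv_nxt)
            self.last_ack_sent = self.rcv_nxt
            return "R4"
        # R5: in-window but out of sequence -> queue it.
        self.queued[seq] = length
        return "R5"

A, B, C, D, E = range(5)
rx = PawsReceiver()
steps = [rx.receive(A, 1), rx.receive(C, 1), rx.receive(D, 1),
         rx.receive(B, 2)]                    # B.1 lost, B.2 fills hole
late = rx.receive(E, 1)                       # stamped 1, arrives late
```

   After B.2 is accepted, TS.Recent is 2 and the queued segments C
   and D have been delivered; a segment stamped 1 that arrives only
   afterwards fails the R1 test.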

869	   This rule allows reasonable performance under loss.  A full window of
870	   data is in transit at all times, and after a loss a full window less
871	   one segment will show up out-of-sequence to be queued at the receiver
872	   (e.g., up to ~2^30 bytes of data); the timestamp option must not
873	   result in discarding this data.

875	   In certain unlikely circumstances, the algorithm of rules R1-R5 could
876	   lead to discarding some segments unnecessarily, as shown in the
877	   following example:

879	      Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
880	      sent in sequence and that segment B.1 has been lost.  Furthermore,
881	      suppose delivery of some of C.1, ..., Z.1 is delayed until *after*
882	      the retransmission B.2 arrives at the receiver.  These delayed
883	      segments will be discarded unnecessarily when they do arrive,
884	      since their timestamps are now out of date.

886	   This case is very unlikely to occur.  If the retransmission was
887	   triggered by a timeout, some of the segments C.1, ..., Z.1 must have
888	   been delayed longer than the RTO time.  This is presumably an
889	   unlikely event, or there would be many spurious timeouts and
890	   retransmissions.  If B's retransmission was triggered by the "fast
891	   retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued
892	   segments that caused these <ACK>s must have been received already.

894	   Even if a segment were delayed past the RTO, the Fast Retransmit
895	   mechanism [Jacobson90c] will cause the delayed segments to be
896	   retransmitted at the same time as B.2, avoiding an extra RTT and
897	   therefore causing a very small performance penalty.

899	   We know of no case with a significant probability of occurrence in
900	   which timestamps will cause performance degradation by unnecessarily
901	   discarding segments.

903	4.4.  Timestamp Clock

905	   It is important to understand that the PAWS algorithm does not
906	   require clock synchronization between sender and receiver.  The
907	   sender's timestamp clock is used to stamp the segments, and the
908	   sender uses the echoed timestamp to measure RTTs.  However, the
909	   receiver treats the timestamp as simply a monotonically increasing
910	   serial number, without any necessary connection to its clock.  From
911	   the receiver's viewpoint, the timestamp is acting as a logical
912	   extension of the high-order bits of the sequence number.

914	   The receiver algorithm does place some requirements on the frequency
915	   of the timestamp clock.

917	   (a)  The timestamp clock must not be "too slow".

919	        It MUST tick at least once for each 2^31 bytes sent.  In fact,
920	        in order to be useful to the sender for round trip timing, the
921	        clock SHOULD tick at least once per window's worth of data, and
922	        even with the window extension defined in Section 2.2, 2^31
923	        bytes must be at least two windows.

925	        To make this more quantitative, any clock faster than 1 tick/sec
926	        will reject old duplicate segments for link speeds of ~8 Gbps.
927	        A 1 ms timestamp clock will work at link speeds up to 8 Tbps
928	        (8*10^12 bps)!

930	   (b)  The timestamp clock must not be "too fast".

932	        The recycling time of the timestamp clock MUST be greater than
933	        MSL seconds.  Since the clock (timestamp) is 32 bits and the
934	        worst-case MSL is 255 seconds, the maximum acceptable clock
935	        frequency is one tick every 59 ns.

937	        However, it is desirable to establish a much longer recycle
938	        period, in order to handle outdated timestamps on idle
939	        connections (see Section 4.5), and to relax the MSL requirement
940	        for preventing sequence number wrap-around.  With a 1 ms
941	        timestamp clock, the 32-bit timestamp will wrap its sign bit in
942	        24.8 days.  Thus, it will reject old duplicates on the same
943	        connection if MSL is 24.8 days or less.  This appears to be a
944	        very safe figure; an MSL of 24.8 days or longer can probably be
945	        assumed in the Internet without requiring precise MSL
946	        enforcement.
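
   The figures quoted in (a) and (b) can be checked with a few lines
   of arithmetic.  The function names below are hypothetical.  The
   default of 2^30 bytes per tick is an assumption of this sketch,
   chosen because it reproduces the approximate link speeds quoted in
   (a); the hard limit stated there is 2^31 bytes per tick.

```python
SECONDS_PER_DAY = 86_400

def sign_bit_wrap_days(tick_seconds: float) -> float:
    """Days until a 32-bit timestamp wraps its sign bit (2^31 ticks)."""
    return 2**31 * tick_seconds / SECONDS_PER_DAY

def fastest_tick_ns(msl_seconds: float) -> float:
    """Tick period (ns) at which all 2^32 values recycle in one MSL."""
    return msl_seconds / 2**32 * 1e9

def max_link_bps(tick_seconds: float, bytes_per_tick: int = 2**30) -> float:
    """Link speed at which bytes_per_tick bytes are sent per tick."""
    return bytes_per_tick * 8 / tick_seconds

wrap_1ms = sign_bit_wrap_days(0.001)    # about 24.8 days
wrap_1s = sign_bit_wrap_days(1.0)       # about 24800 days
tick_255 = fastest_tick_ns(255)         # about 59 ns
speed_1s = max_link_bps(1.0)            # about 8 Gbps
speed_1ms = max_link_bps(0.001)         # about 8 Tbps
```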

948	   Based upon these considerations, we choose a timestamp clock
949	   frequency in the range 1 ms to 1 sec per tick.  This range also
950	   matches the requirements of the RTTM mechanism, which does not need
951	   much more resolution than the granularity of the retransmit timer,
952	   e.g., tens or hundreds of milliseconds.

954	   The PAWS mechanism also puts a strong monotonicity requirement on the
955	   sender's timestamp clock.  The method of implementation of the
956	   timestamp clock to meet this requirement depends upon the system
957	   hardware and software.

959	   o  Some hosts have a hardware clock that is guaranteed to be
960	      monotonic between hardware resets.

962	   o  A clock interrupt may be used to simply increment a binary integer
963	      by 1 periodically.

965	   o  The timestamp clock may be derived from a system clock that is
966	      subject to being abruptly changed, by adding a variable offset
967	      value.  This offset is initialized to zero.  When a new timestamp
968	      clock value is needed, the offset can be adjusted as necessary to
969	      make the new value equal to or larger than the previous value
970	      (which was saved for this purpose).
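
   The last approach, a variable offset on top of a system clock that
   may jump, might be implemented along these lines.  The class name
   and the injected system_clock callable are hypothetical; the sketch
   assumes the system clock already counts in timestamp ticks.

```python
from typing import Callable, Optional

class TimestampClock:
    def __init__(self, system_clock: Callable[[], int]):
        self._system_clock = system_clock
        self._offset = 0                       # initialized to zero
        self._previous: Optional[int] = None   # last value handed out

    def now(self) -> int:
        value = self._system_clock() + self._offset
        if self._previous is not None and value < self._previous:
            # The system clock was set back: grow the offset so the
            # new value equals the previous one.
            self._offset += self._previous - value
            value = self._previous
        self._previous = value
        return value & 0xFFFFFFFF              # 32-bit TSval

readings = iter([100, 105, 50, 60])            # clock set back by 55
clock = TimestampClock(lambda: next(readings))
values = [clock.now() for _ in range(4)]
```

   In this run the clock is set back from 105 to 50; the sketch
   repeats 105 once and then continues to advance with the adjusted
   offset.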

972	4.5.  Outdated Timestamps

974	   If a connection remains idle long enough for the timestamp clock of
975	   the other TCP to wrap its sign bit, then the value saved in TS.Recent
976	   will become too old; as a result, the PAWS mechanism will cause all
977	   subsequent segments to be rejected, freezing the connection (until
978	   the timestamp clock wraps its sign bit again).

980	   With the chosen range of timestamp clock frequencies (1 sec to 1 ms),
981	   the time to wrap the sign bit will be between 24.8 days and 24800
982	   days.  A TCP connection that is idle for more than 24 days and then
983	   comes to life is exceedingly unusual.  However, it is undesirable in
984	   principle to place any limitation on TCP connection lifetimes.

986	   We therefore require that an implementation of PAWS include a
987	   mechanism to "invalidate" the TS.Recent value when a connection is
988	   idle for more than 24 days.  (An alternative solution to the problem
989	   of outdated timestamps would be to send keep-alive segments at a very
990	   low rate, but still more often than the wrap-around time for
991	   timestamps, e.g., once a day.  This would impose negligible overhead.
992	   However, the TCP specification has never included keep-alives, so the
993	   solution based upon invalidation was chosen.)

995	   Note that a TCP does not know the frequency, and therefore, the
996	   wraparound time, of the other TCP, so it must assume the worst.  The
997	   validity of TS.Recent needs to be checked only if the basic PAWS
998	   timestamp check fails, i.e., only if SEG.TSval < TS.Recent.  If
999	   TS.Recent is found to be invalid, then the segment is accepted,
1000	   regardless of the failure of the timestamp check, and rule R3 updates
1001	   TS.Recent with the TSval from the new segment.

1003	   To detect how long the connection has been idle, the TCP MAY update a
1004	   clock or timestamp value associated with the connection whenever
1005	   TS.Recent is updated, for example.  The details will be
1006	   implementation-dependent.
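
   A sketch of the invalidation logic follows.  It assumes the
   implementation records, in local seconds, how long ago TS.Recent
   was last updated; the names and the use of seconds are choices
   made for this example only.

```python
MASK32 = 0xFFFFFFFF
PAWS_IDLE_LIMIT = 24 * 86_400      # 24 days, in seconds

def ts_lt(s: int, t: int) -> bool:
    """Modular 32-bit 'less than' for timestamps."""
    return 0 < ((t - s) & MASK32) < 2**31

def paws_rejects(seg_tsval: int, ts_recent: int,
                 ts_recent_age: float) -> bool:
    """True if the segment should be dropped by the PAWS check."""
    if not ts_lt(seg_tsval, ts_recent):
        return False               # basic check passed
    if ts_recent_age > PAWS_IDLE_LIMIT:
        # TS.Recent is too old to be trusted: accept the segment,
        # and let rule R3 refresh TS.Recent from it.
        return False
    return True
```

   The age is consulted only when the basic timestamp check fails, so
   the common case costs a single comparison.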

1008	4.6.  Header Prediction

1010	   "Header prediction" [Jacobson90a] is a high-performance transport
1011	   protocol implementation technique that is most important for high-
1012	   speed links.  This technique optimizes the code for the most common
1013	   case, receiving a segment correctly and in order.  Using header
1014	   prediction, the receiver asks the question, "Is this segment the next
1015	   in sequence?"  This question can be answered in fewer machine
1016	   instructions than the question, "Is this segment within the window?"

1018	   Adding header prediction to our timestamp procedure leads to the
1019	   following recommended sequence for processing an arriving TCP
1020	   segment:

1022	   H1)  Check timestamp (same as step R1 above)

1024	   H2)  Do header prediction: if segment is next in sequence and if
1025	        there are no special conditions requiring additional processing,
1026	        accept the segment, record its timestamp, and skip H3.

1028	   H3)  Process the segment normally, as specified in [RFC0793].  This
1029	        includes dropping segments that are outside the window and
1030	        possibly sending acknowledgments, and queuing in-window, out-of-
1031	        sequence segments.
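
   The ordering of steps H1 through H3 is sketched here.  The
   function name, its parameters, and the returned labels are
   hypothetical; needs_slow_path stands in for the "special
   conditions requiring additional processing" of step H2.

```python
MASK32 = 0xFFFFFFFF

def ts_lt(s: int, t: int) -> bool:
    """Modular 32-bit 'less than' for timestamps."""
    return 0 < ((t - s) & MASK32) < 2**31

def process_segment(seg_seq: int, seg_tsval: int, rcv_nxt: int,
                    ts_recent: int, needs_slow_path: bool = False) -> str:
    # H1: timestamp check (same test as step R1).
    if ts_lt(seg_tsval, ts_recent):
        return "H1: drop old duplicate"
    # H2: header prediction for the next in-sequence segment.
    if seg_seq == rcv_nxt and not needs_slow_path:
        return "H2: fast path"
    # H3: normal processing (window check, queuing, ACKs).
    return "H3: normal processing"
```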

1033	   Another possibility would be to interchange steps H1 and H2, i.e., to
1034	   perform the header prediction step H2 *first*, and perform H1 and H3
1035	   only when header prediction fails.  This could be a performance
1036	   improvement, since the timestamp check in step H1 is very unlikely to
1037	   fail, and it requires unsigned modulo arithmetic.  To perform this
1038	   check on every single segment is contrary to the philosophy of header
1039	   prediction.  We believe that this change might produce a measurable
1040	   reduction in CPU time for TCP protocol processing on high-speed
1041	   networks.

1043	   However, putting H2 first would create a hazard: a segment from 2^32
1044	   bytes in the past might arrive at exactly the wrong time and be
1045	   accepted mistakenly by the header-prediction step.  The following
1046	   reasoning has been introduced in [RFC1185] to show that the
1047	   probability of this failure is negligible.

1049	      If all segments are equally likely to show up as old duplicates,
1050	      then the probability of an old duplicate exactly matching the left
1051	      window edge is the maximum segment size (MSS) divided by the size
1052	      of the sequence space.  This ratio must be less than 2^-16, since
1053	      MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20
1054	      for a 100 Mbit/s link.  However, the older a segment is, the less
1055	      likely it is to be retained in the Internet, and under any
1056	      reasonable model of segment lifetime the probability of an old
1057	      duplicate exactly at the left window edge must be much smaller
1058	      than 2^-16.

1060	      The 16 bit TCP checksum also allows a basic unreliability of one
1061	      part in 2^16.  A protocol mechanism whose reliability exceeds the
1062	      reliability of the TCP checksum should be considered "good
1063	      enough", i.e., it won't contribute significantly to the overall
1064	      error rate.  We therefore believe we can ignore the problem of an
1065	      old duplicate being accepted by doing header prediction before
1066	      checking the timestamp.

1068	   However, this probabilistic argument is not universally accepted, and
1069	   the consensus at present is that the performance gain does not
1070	   justify the hazard in the general case.  It is therefore recommended
1071	   that H2 follow H1.

1073	4.7.  IP Fragmentation

1075	   At high data rates, the protection against old segments provided by
1076	   PAWS can be circumvented by errors in IP fragment reassembly (see
1077	   [RFC4963]).  The only way to protect against incorrect IP fragment
1078	   reassembly is to not allow the segments to be fragmented.  This is
1079	   done by setting the Don't Fragment (DF) bit in the IP header.
1080	   Setting the DF bit implies the use of Path MTU Discovery as described
1081	   in [RFC1191], [RFC1981], and [RFC4821]; thus, any TCP implementation
1082	   that implements PAWS MUST also implement Path MTU Discovery.

1084	4.8.  Duplicates from Earlier Incarnations of Connection

1086	   The PAWS mechanism protects against errors due to sequence number
1087	   wrap-around on high-speed connections.  Segments from an earlier
1088	   incarnation of the same connection are also a potential cause of old
1089	   duplicate errors.  In both cases, the TCP mechanisms to prevent such
1090	   errors depend upon the enforcement of a maximum segment lifetime
1091	   (MSL) by the Internet (IP) layer (see the Appendix of [RFC1185] for a
1092	   detailed discussion).  Unlike the case of sequence space wrap-around,
1093	   the MSL required to prevent old duplicate errors from earlier
1094	   incarnations does not depend upon the transfer rate.  If the IP layer
1095	   enforces the recommended 2 minute MSL of TCP, and if the TCP rules
1096	   are followed, TCP connections will be safe from earlier incarnations,
1097	   no matter how high the network speed.  Thus, the PAWS mechanism is
1098	   not required for this case.

1100	   We may still ask whether the PAWS mechanism can provide additional
1101	   security against old duplicates from earlier connections, allowing us
1102	   to relax the enforcement of MSL by the IP layer.  Appendix B explores
1103	   this question, showing that further assumptions and/or mechanisms are
1104	   required, beyond those of PAWS.  This is not part of the current
1105	   extension.

1107	5.  Conclusions and Acknowledgements

1109	   This memo presented a set of extensions to TCP to provide efficient
1110	   operation over large bandwidth * delay product paths and reliable
1111	   operation over very high-speed paths.  These extensions are designed
1112	   to provide compatible interworking with TCP stacks that do not
1113	   implement the extensions.

1115	   These mechanisms are implemented using TCP options for scaled windows
1116	   and timestamps.  The timestamps are used for two distinct mechanisms:
1117	   RTTM (Round Trip Time Measurement) and PAWS (Protection Against
1118	   Wrapped Sequence Numbers).

1120	   The Window Scale option was originally suggested by Mike St. Johns of
1121	   USAF/DCA.  The present form of the option was suggested by Mike
1122	   Karels of UC Berkeley in response to a more cumbersome scheme defined
1123	   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
1124	   description in [RFC1185].

1126	   Finally, much of this work originated as the result of discussions
1127	   within the End-to-End Task Force on the theoretical limitations of
1128	   transport protocols in general and TCP in particular.  Task force
1129	   members and others on the end2end-interest list have made valuable
1130	   contributions by pointing out flaws in the algorithms and the
1131	   documentation.  Continued discussion and development since the
1132	   publication of [RFC1323] originally occurred in the IETF TCP Large
1133	   Windows Working Group, later on in the End-to-End Task Force, and
1134	   most recently in the IETF TCP Maintenance Working Group.  The authors
1135	   are grateful for all these contributions.

1137	6.  Security Considerations

1139	   The TCP sequence space is a fixed size, and as the window becomes
1140	   larger it becomes easier for an attacker to generate forged packets
1141	   that fall within the TCP window and are accepted as valid
1142	   segments.  Timestamps and PAWS can help to mitigate this.  However,
1143	   if an attacker forges a segment that is acceptable to the TCP
1144	   connection and carries a timestamp in the future, the PAWS checks
1145	   would then cause valid segments to be dropped.
1146	   Hence, implementers should take care to not open the TCP window
1147	   drastically beyond the requirements of the connection.

1149	   Middle boxes and options: If a middle box removes TCP options from
1150	   the <SYN> segment, such as TSopt, a high speed connection that needs
1151	   PAWS would not have that protection.  In this situation, an
1152	   implementer could provide a mechanism for the application to
1153	   determine whether or not PAWS is in use on the connection, and choose
1154	   to terminate the connection if that protection doesn't exist.

1156	   Mechanisms to protect the TCP header from modification should also
1157	   protect the TCP options.

1159	   A naive implementation that derives the timestamp clock value
1160	   directly from a system uptime clock may unintentionally leak this
1161	   information to an attacker.  This does not directly compromise any of
1162	   the mechanisms described in this document.  However, this may be
1163	   valuable information to a potential attacker.  An implementer should
1164	   evaluate the potential impact and mitigate this accordingly (e.g., by
1165	   using a random offset for the timestamp clock on each connection, or
1166	   using an external, real-time derived timestamp clock source).
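
   As a minimal sketch of the first mitigation (a random offset per
   connection), the following Python fragment derives the value placed
   in TSval from a system-wide clock plus an offset.  The helper names
   and the use of the "secrets" module are assumptions made for
   illustration only; this document does not prescribe them.

```python
import secrets

TS_MODULUS = 2**32  # TSval is a 32-bit field


def new_ts_offset() -> int:
    """Pick a per-connection Snd.TSoffset (hypothetical helper)."""
    return secrets.randbelow(TS_MODULUS)


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Snd.TSclock = my.TSclock + Snd.TSoffset, wrapped to 32 bits."""
    return (my_tsclock + snd_tsoffset) % TS_MODULUS
```

   With an offset of zero, the result equals the system-wide clock
   value, so the uptime would again be visible to the peer.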

1168	   Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
1169	   [RFC2675] to be used when the local network supports packets larger
1170	   than 64 KiB.  When larger TCP segments are used, the TCP checksum
1171	   becomes weaker.

1173	7.  IANA Considerations

1175	   This document has no actions for IANA.

1177	8.  References

1179	8.1.  Normative References

1181	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1182	              RFC 793, September 1981.

1184	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1185	              November 1990.

1187	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1188	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1190	8.2.  Informative References

1192	   [Garlick77]
1193	              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
1194	              Host-to-Host Protocols", Proc. Second Berkeley Workshop on
1195	              Distributed Data Management and Computer Networks,
1196	              May 1977, <http://www.rfc-editor.org/ien/ien12.txt>.

1198	   [Hamming77]
1199	              Hamming, R., "Digital Filters", Prentice Hall, Englewood
1200	              Cliffs, N.J. ISBN 0-13-212571-4, 1977.

1202	   [Jacobson88a]
1203	              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
1204	              '88, Stanford,  CA., August 1988,
1205	              <http://ee.lbl.gov/papers/congavoid.pdf>.

1207	   [Jacobson90a]
1208	              Jacobson, V., "4BSD Header Prediction", ACM Computer
1209	              Communication Review, April 1990.

1211	   [Jacobson90c]
1212	              Jacobson, V., "Modified TCP congestion avoidance
1213	              algorithm", Message to the end2end-interest mailing list,
1214	              April 1990,
1215	              <ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail>.

1217	   [Jain86]   Jain, R., "Divergence of Timeout Algorithms for Packet
1218	              Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
1219	              Comm., Scottsdale, Arizona, March 1986,
1220	              <http://arxiv.org/ftp/cs/papers/9809/9809097.pdf>.

1222	   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
1223	              Reliable Transport Protocols", Proc. SIGCOMM '87,
1224	              August 1987.

1226	   [Martin03]
1227	              Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
1228	              mailing list, September 2003, <http://www.ietf.org/
1229	              mail-archive/web/tsvwg/current/msg04435.html>.

1231	   [Mathis08]
1232	              Mathis, M., "[tcpm] Example of 1323 window retraction
1233	              problem", Message to the tcpm mailing list, March 2008,
1234	              <http://www.ietf.org/mail-archive/web/tcpm/current/
1235	              msg03564.html>.

1237	   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
1238	              RFC 896, January 1984.

1240	   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
1241	              paths", RFC 1072, October 1988.

1243	   [RFC1110]  McKenzie, A., "Problem with the TCP big window option",
1244	              RFC 1110, August 1989.

1246	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1247	              Communication Layers", STD 3, RFC 1122, October 1989.

1249	   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
1250	              High-Speed Paths", RFC 1185, October 1990.

1252	   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
1253	              for High Performance", RFC 1323, May 1992.

1255	   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
1256	              for IP version 6", RFC 1981, August 1996.

1258	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
1259	              Selective Acknowledgment Options", RFC 2018, October 1996.

1261	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
1262	              Control", RFC 2581, April 1999.

1264	   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
1265	              RFC 2675, August 1999.

1267	   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
1268	              Extension to the Selective Acknowledgement (SACK) Option
1269	              for TCP", RFC 2883, July 2000.

1271	   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
1272	              for TCP", RFC 3522, April 2003.

1274	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
1275	              Discovery", RFC 4821, March 2007.

1277	   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
1278	              Errors at High Data Rates", RFC 4963, July 2007.

1280	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1281	              Control", RFC 5681, September 2009.

1283	   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
1284	              and Y. Nishida, "A Conservative Loss Recovery Algorithm
1285	              Based on Selective Acknowledgment (SACK) for TCP",
1286	              RFC 6675, August 2012.

1288	   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
1289	              RFC 6691, July 2012.

1291	   [Watson81]
1292	              Watson, R., "Timer-based Mechanisms in Reliable Transport
1293	              Protocol Connection Management", Computer Networks, Vol.
1294	              5, 1981.

1296	   [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
1297	              '86, Stowe, VT, August 1986.

1299	Appendix A.  Implementation Suggestions

1301	   TCP Option Layout

1303	      The following layout is recommended for sending options on non-
1304	      <SYN> segments, to achieve maximum feasible alignment of 32-bit
1305	      and 64-bit machines.

1307	                   +--------+--------+--------+--------+
1308	                   |   NOP  |  NOP   |  TSopt |   10   |
1309	                   +--------+--------+--------+--------+
1310	                   |          TSval timestamp          |
1311	                   +--------+--------+--------+--------+
1312	                   |          TSecr timestamp          |
1313	                   +--------+--------+--------+--------+
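
      For illustration, the 12-byte layout above could be built as in
      the following Python sketch.  The option kind values (NOP = 1,
      Timestamps = 8) are the assigned TCP option numbers; the function
      name is hypothetical.

```python
import struct

TCPOPT_NOP = 1          # No-Operation option kind
TCPOPT_TIMESTAMP = 8    # Timestamps option kind
TCPOLEN_TIMESTAMP = 10  # kind + length + TSval + TSecr


def build_aligned_tsopt(tsval: int, tsecr: int) -> bytes:
    """Return NOP, NOP, TSopt, 10, TSval, TSecr in network byte order."""
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP,
        TCPOPT_NOP,
        TCPOPT_TIMESTAMP,
        TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF,
        tsecr & 0xFFFFFFFF,
    )
```

      The two leading NOPs place TSval and TSecr on 32-bit boundaries
      within the options area.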

1315	   Interaction with the TCP Urgent Pointer

1317	      The TCP Urgent pointer, like the TCP window, is a 16-bit value.
1318	      Some of the original discussion for the TCP Window Scale option
1319	      included proposals to increase the Urgent pointer to 32 bits.  As
1320	      it turns out, this is unnecessary.  There are two observations
1321	      that should be made:

1323	      (1)  With IP Version 4, the largest amount of TCP data that can be
1324	           sent in a single packet is 65495 bytes (64 KiB - 1 - size of
1325	           fixed IP and TCP headers).

1327	      (2)  Updates to the urgent pointer while the user is in "urgent
1328	           mode" are invisible to the user.

1330	      This means that if the Urgent Pointer points beyond the end of the
1331	      TCP data in the current segment, then the user will remain in
1332	      urgent mode until the next TCP segment arrives.  That segment will
1333	      update the urgent pointer to a new offset, and the user will never
1334	      have left urgent mode.

1336	      Thus, to properly implement the Urgent Pointer, the sending TCP
1337	      only has to check for overflow of the 16-bit Urgent Pointer field
1338	      before filling it in.  If it does overflow, then a value of 65535
1339	      should be inserted into the Urgent Pointer.

1341	      The same technique applies to IP Version 6, except in the case of
1342	      IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
1343	      requires additional steps for dealing with the Urgent Pointer;
1344	      these are described in Section 5.2 of [RFC2675].
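
      The overflow check described above for the non-Jumbogram case can
      be sketched as follows.  The function name is hypothetical, and
      the Jumbogram handling of [RFC2675] is deliberately not covered.

```python
URG_PTR_MAX = 0xFFFF  # largest value the 16-bit field can hold


def urgent_pointer_field(urgent_offset: int) -> int:
    """Value to place in the Urgent Pointer field of a segment.

    urgent_offset is the distance from SEG.SEQ to the urgent data; if
    it does not fit in 16 bits, 65535 is sent instead.
    """
    if urgent_offset < 0:
        raise ValueError("urgent offset must be non-negative")
    return min(urgent_offset, URG_PTR_MAX)
```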

1346	Appendix B.  Duplicates from Earlier Connection Incarnations

1348	   There are two cases to be considered: (1) a system crashing (and
1349	   losing connection state) and restarting, and (2) the same connection
1350	   being closed and reopened without a loss of host state.  These will
1351	   be described in the following two sections.

1353	B.1.  System Crash with Loss of State

1355	   TCP's quiet time of one MSL upon system startup handles the loss of
1356	   connection state in a system crash/restart.  For an explanation, see
1357	   for example "When to Keep Quiet" in the TCP protocol specification
1358	   [RFC0793].  The MSL that is required here does not depend upon the
1359	   transfer speed.  The current TCP MSL of 2 minutes seemed acceptable
1360	   as an operational compromise, when many host systems used to take
1361	   this long to boot after a crash.  Current host systems can boot
1362	   considerably faster.

1364	   The timestamp option may be used to ease the MSL requirements (or to
1365	   provide additional security against data corruption).  If timestamps
1366	   are being used and if the timestamp clock can be guaranteed to be
1367	   monotonic over a system crash/restart, i.e., if the first value of
1368	   the sender's timestamp clock after a crash/restart can be guaranteed
1369	   to be greater than the last value before the restart, then a quiet
1370	   time is unnecessary.

1372	   To dispense totally with the quiet time would require that the host
1373	   clock be synchronized to a time source that is stable over the crash/
1374	   restart period, with an accuracy of one timestamp clock tick or
1375	   better.  We can back off from this strict requirement to take
1376	   advantage of approximate clock synchronization.  Suppose that the
1377	   clock is always re-synchronized to within N timestamp clock ticks and
1378	   that booting (extended with a quiet time, if necessary) takes more
1379	   than N ticks.  This will guarantee monotonicity of the timestamps,
1380	   which can then be used to reject old duplicates even without an
1381	   enforced MSL.

1383	B.2.  Closing and Reopening a Connection

1385	   When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
1386	   ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
1387	   Applications built upon TCP that close one connection and open a new
1388	   one (e.g., an FTP data transfer connection using Stream mode) must
1389	   choose a new socket pair each time.  The TIME-WAIT delay serves two
1390	   different purposes:

1392	   (a)  Implement the full-duplex reliable close handshake of TCP.

1394	        The proper time to delay the final close step is not really
1395	        related to the MSL; it depends instead upon the RTO for the FIN
1396	        segments and therefore upon the RTT of the path.  (It could be
1397	        argued that the side that is sending a FIN knows what degree of
1398	        reliability it needs, and therefore it should be able to
1399	        determine the length of the TIME-WAIT delay for the FIN's
1400	        recipient.  This could be accomplished with an appropriate TCP
1401	        option in FIN segments.)

1403	        Although there is no formal upper-bound on RTT, common network
1404	        engineering practice makes an RTT greater than 1 minute very
1405	        unlikely.  Thus, the 4 minute delay in TIME-WAIT state works
1406	        satisfactorily to provide a reliable full-duplex TCP close.
1407	        Note again that this is independent of MSL enforcement and
1408	        network speed.

1410	        The TIME-WAIT state could cause an indirect performance problem
1411	        if an application needed to repeatedly close one connection and
1412	        open another at a very high frequency, since the number of
1413	        available TCP ports on a host is less than 2^16.  However, high
1414	        network speeds are not the major contributor to this problem;
1415	        the RTT is the limiting factor in how quickly connections can be
1416	        opened and closed.  Therefore, this problem will be no worse at
1417	        high transfer speeds.

1419	   (b)  Allow old duplicate segments to expire.

1421	        To replace this function of TIME-WAIT state, a mechanism would
1422	        have to operate across connections.  PAWS is defined strictly
1423	        within a single connection; the last timestamp (TS.Recent) is
1424	        kept in the connection control block, and discarded when a
1425	        connection is closed.

1427	        An additional mechanism could be added to the TCP: a per-host
1428	        cache of the last timestamp received from any connection.  This
1429	        value could then be used in the PAWS mechanism to reject old
1430	        duplicate segments from earlier incarnations of the connection,
1431	        if the timestamp clock can be guaranteed to have ticked at least
1432	        once since the old connection was open.  This would require that
1433	        the TIME-WAIT delay plus the RTT together must be at least one
1434	        tick of the sender's timestamp clock.  Such an extension is not
1435	        part of the proposal of this RFC.

1437	        Note that this is a variant on the mechanism proposed by
1438	        Garlick, Rom, and Postel [Garlick77], which required each host
1439	        to maintain connection records containing the highest sequence
1440	        numbers on every connection.  Using timestamps instead, it is
1441	        only necessary to keep one quantity per remote host, regardless
1442	        of the number of simultaneous connections to that host.
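
        Purely as an illustration of the idea, and not as part of this
        specification, such a per-host cache might look like the
        following Python sketch.  The class and method names are
        hypothetical, and the sketch assumes that timestamps are
        compared modulo 2**32.

```python
def ts_lt(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is older than b (modulo 2**32)."""
    return 0 < ((b - a) & 0xFFFFFFFF) < 2**31


class PerHostTimestampCache:
    """Keeps one quantity per remote host: the newest TSval seen."""

    def __init__(self) -> None:
        self._last = {}

    def note(self, host: str, tsval: int) -> None:
        """Record tsval if it is newer than the cached value."""
        prev = self._last.get(host)
        if prev is None or ts_lt(prev, tsval):
            self._last[host] = tsval

    def is_old_duplicate(self, host: str, tsval: int) -> bool:
        """True if tsval is older than the cached value for host."""
        prev = self._last.get(host)
        return prev is not None and ts_lt(tsval, prev)
```

        A segment whose timestamp equals the cached value is not
        rejected here, which is why the timestamp clock has to tick at
        least once between connection incarnations.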

1444	Appendix C.  Summary of Notation

1446	   The following notation has been used in this document.

1448	   Options

1450	      WSopt:            TCP Window Scale Option
1451	      TSopt:            TCP Timestamp Option

1453	   Option Fields

1455	      shift.cnt:        Window scale byte in WSopt
1456	      TSval:            32-bit Timestamp Value field in TSopt
1457	      TSecr:            32-bit Timestamp Reply field in TSopt

1459	   Option Fields in Current Segment

1461	      SEG.TSval:        TSval field from TSopt in current segment
1462	      SEG.TSecr:        TSecr field from TSopt in current segment
1463	      SEG.WSopt:        8-bit value in WSopt

1465	   Clock Values

1467	      my.TSclock:       System wide source of 32-bit timestamp values
1468	      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
1469	      Snd.TSoffset:     An offset for randomizing Snd.TSclock
1470	      Snd.TSclock:      my.TSclock + Snd.TSoffset

1472	   Per-Connection State Variables

1474	      TS.Recent:        Latest received Timestamp
1475	      Last.ACK.sent:    Last ACK field sent
1476	      Snd.TS.OK:        1-bit flag
1477	      Snd.WS.OK:        1-bit flag
1478	      Rcv.Wind.Scale:   Receive window scale power
1479	      Snd.Wind.Scale:   Send window scale power
1480	      Start.Time:       Snd.TSclock value when segment being timed was
1481	                        sent (used by pre-1323 code).

1483	   Procedure

1485	      Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
1486	                        variance estimates, using the rules of
1487	                        [Jacobson88a], given m, a new RTT measurement

1489	Appendix D.  Event Processing Summary

1491	   OPEN Call

1493	      ...

1495	      An initial send sequence number (ISS) is selected.  Send a <SYN>
1496	      segment of the form:

1498	        <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale>

1500	      ...

1502	   SEND Call

1504	      CLOSED STATE (i.e., TCB does not exist)

1506	         ...

1508	      LISTEN STATE

1510	         If the foreign socket is specified, then change the connection
1511	         from passive to active, select an ISS.  Send a <SYN> segment
1512	         containing the options: <TSval=Snd.TSclock> and
1513	         <WSopt=Rcv.Wind.Scale>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
1514	         Enter SYN-SENT state. ...

1516	      SYN-SENT STATE
1517	      SYN-RECEIVED STATE

1519	         ...

1521	      ESTABLISHED STATE
1522	      CLOSE-WAIT STATE

1524	         Segmentize the buffer and send it with a piggybacked
1525	         acknowledgment (acknowledgment value = RCV.NXT). ...

1527	         If the urgent flag is set ...

1529	         If the Snd.TS.OK flag is set, then include the TCP Timestamp
1530	         Option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
1531	         segment.

1533	         Scale the receive window for transmission in the segment
1534	         header:

1536	                   SEG.WND = (RCV.WND >> Rcv.Wind.Scale).
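
         A small Python sketch of this scaling, together with the
         matching left shift that the receiver of the segment applies
         (see "Rescale the received window field" under SEGMENT
         ARRIVES).  The function names are hypothetical, and clamping
         the result to 16 bits is an assumption of this sketch.

```python
def seg_wnd_to_send(rcv_wnd: int, rcv_wind_scale: int) -> int:
    """SEG.WND = RCV.WND >> Rcv.Wind.Scale, limited to the 16-bit field."""
    return min(rcv_wnd >> rcv_wind_scale, 0xFFFF)


def true_window(seg_wnd: int, snd_wind_scale: int) -> int:
    """TrueWindow = SEG.WND << Snd.Wind.Scale on the receiving side."""
    return seg_wnd << snd_wind_scale
```

         The right shift discards the low-order bits, so a 1300-byte
         window sent with a scale of 7 is seen by the peer as 1280
         bytes.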

1538	   SEGMENT ARRIVES

1540	      ...

1542	      If the state is LISTEN then

1544	         first check for an RST

1546	            ...

1548	         second check for an ACK

1550	            ...

1552	         third check for a SYN

1554	            if the SYN bit is set, check the security.  If the ...

1556	               ...

1558	            if the SEG.PRC is less than the TCB.PRC then continue.

1560	            Check for a Window Scale option (WSopt); if one is found,
1561	            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
1562	            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
1563	            zero and clear Snd.WS.OK flag.

1565	            Check for a TSopt option; if one is found, save SEG.TSval in
1566	            the variable TS.Recent and turn on the Snd.TS.OK bit.

1568	            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
1569	            other control or text should be queued for processing later.
1570	            ISS should be selected and a <SYN> segment sent of the form:

1572	                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1574	            If the Snd.WS.OK bit is on, include a WSopt option
1575	            <WSopt=Rcv.Wind.Scale> in this segment.  If the Snd.TS.OK
1576	            bit is on, include a TSopt <TSval=Snd.TSclock,
1577	            TSecr=TS.Recent> in this segment.  Last.ACK.sent is set to
1578	            RCV.NXT.

1580	            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
1581	            state should be changed to SYN-RECEIVED.  Note that any
1582	            other incoming control or data (combined with SYN) will be
1583	            processed in the SYN-RECEIVED state, but processing of SYN
1584	            and ACK should not be repeated.  If the listen was not fully
1585	            specified (i.e., the foreign socket was not fully
1586	            specified), then the unspecified fields should be filled in
1587	            now.
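
            The option handling in this step can be sketched as
            follows.  The "Tcb" class and the function name are
            hypothetical; an option that is absent from the <SYN> is
            represented here as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tcb:
    rcv_wind_scale: int         # Rcv.Wind.Scale we would like to use
    snd_wind_scale: int = 0     # Snd.Wind.Scale
    snd_ws_ok: bool = False     # Snd.WS.OK
    snd_ts_ok: bool = False     # Snd.TS.OK
    ts_recent: int = 0          # TS.Recent


def on_syn_in_listen(tcb: Tcb, seg_wsopt: Optional[int],
                     seg_tsval: Optional[int]) -> Tcb:
    """Record the options carried by a <SYN> received in LISTEN."""
    if seg_wsopt is not None:
        tcb.snd_wind_scale = seg_wsopt
        tcb.snd_ws_ok = True
    else:
        # No WSopt from the peer: scaling is off in both directions.
        tcb.snd_wind_scale = 0
        tcb.rcv_wind_scale = 0
        tcb.snd_ws_ok = False
    if seg_tsval is not None:
        tcb.ts_recent = seg_tsval
        tcb.snd_ts_ok = True
    return tcb
```

            The two flags then decide which options are included in the
            <SYN,ACK> that is sent in reply.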

1589	         fourth other text or control

1591	            ...

1593	      If the state is SYN-SENT then

1595	         first check the ACK bit

1597	            ...

1599	         ...

1601	         fourth check the SYN bit
1602	            ...

1604	            If the SYN bit is on and the security/compartment and
1605	            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
1606	            IRS is set to SEG.SEQ, and any acknowledgements on the
1607	            retransmission queue which are thereby acknowledged should
1608	            be removed.

1610	            Check for a Window Scale option (WSopt); if it is found,
1611	            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
1612	            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

1614	            Check for a TSopt option; if one is found, save SEG.TSval in
1615	            variable TS.Recent and turn on the Snd.TS.OK bit in the
1616	            connection control block.  If the ACK bit is set, use
1617	            Snd.TSclock - SEG.TSecr as the initial RTT estimate.

1619	            If SND.UNA > ISS (our <SYN> has been ACKed), change the
1620	            connection state to ESTABLISHED, form an <ACK> segment:

1622	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1624	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1625	            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK>
1626	            segment.  Last.ACK.sent is set to RCV.NXT.

1628	            Data or controls which were queued for transmission may be
1629	            included.  If there are other controls or text in the
1630	            segment then continue processing at the sixth step below
1631	            where the URG bit is checked, otherwise return.

1633	            Otherwise enter SYN-RECEIVED, form a <SYN,ACK> segment:

1635	                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1637	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1638	            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
1639	            If the Snd.WS.OK bit is on, include a WSopt option
1640	            <WSopt=Rcv.Wind.Scale> in this segment.  Last.ACK.sent is
1641	            set to RCV.NXT.

1643	            If there are other controls or text in the segment, queue
1644	            them for processing after the ESTABLISHED state has been
1645	            reached, return.

1647	         fifth, if neither of the SYN or RST bits is set then drop the
1648	         segment and return.

1650	      Otherwise,

1652	      First, check sequence number

1654	         SYN-RECEIVED STATE
1655	         ESTABLISHED STATE
1656	         FIN-WAIT-1 STATE
1657	         FIN-WAIT-2 STATE
1658	         CLOSE-WAIT STATE
1659	         CLOSING STATE
1660	         LAST-ACK STATE
1661	         TIME-WAIT STATE

1663	            Segments are processed in sequence.  Initial tests on
1664	            arrival are used to discard old duplicates, but further
1665	            processing is done in SEG.SEQ order.  If a segment's
1666	            contents straddle the boundary between old and new, only the
1667	            new parts should be processed.

1669	            Rescale the received window field:

1671	                  TrueWindow = SEG.WND << Snd.Wind.Scale,

1673	            and use "TrueWindow" in place of SEG.WND in the following
1674	            steps.

1676	            Check whether the segment contains a Timestamp Option and
1677	            bit Snd.TS.OK is on.  If so:

1679	               If SEG.TSval < TS.Recent and the RST bit is off, then
1680	               test whether connection has been idle less than 24 days;
1681	               if all are true, then the segment is not acceptable;
1682	               follow steps below for an unacceptable segment.

1684	               If SEG.SEQ is less than or equal to Last.ACK.sent, then
1685	               save SEG.TSval in variable TS.Recent.

1687	            There are four cases for the acceptability test for an
1688	            incoming segment:

1690	               ...

1692	            If an incoming segment is not acceptable, an acknowledgment
1693	            should be sent in reply (unless the RST bit is set, if so
1694	            drop the segment and return):

1696	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1698	            Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
1699	            the Snd.TS.OK bit is on, include the Timestamp Option
1700	            <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
1701	            Set Last.ACK.sent to SEG.ACK and send the <ACK> segment.
1702	            After sending the acknowledgment, drop the unacceptable
1703	            segment and return.
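
            The timestamp test and the TS.Recent update in this step
            can be sketched as follows.  The function names are
            hypothetical, and the sketch assumes that both timestamps
            and sequence numbers are compared modulo 2**32.

```python
PAWS_IDLE_LIMIT = 24 * 24 * 3600  # 24 days, in seconds


def ts_lt(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is older than b (modulo 2**32)."""
    return 0 < ((b - a) & 0xFFFFFFFF) < 2**31


def seq_leq(a: int, b: int) -> bool:
    """True if sequence number a <= b (modulo 2**32)."""
    return ((b - a) & 0xFFFFFFFF) < 2**31


def paws_rejects(seg_tsval: int, ts_recent: int, rst: bool,
                 idle_seconds: float) -> bool:
    """True if the segment is not acceptable because of its timestamp."""
    return (ts_lt(seg_tsval, ts_recent)
            and not rst
            and idle_seconds < PAWS_IDLE_LIMIT)


def update_ts_recent(ts_recent: int, seg_tsval: int,
                     seg_seq: int, last_ack_sent: int) -> int:
    """Save SEG.TSval in TS.Recent if SEG.SEQ <= Last.ACK.sent."""
    return seg_tsval if seq_leq(seg_seq, last_ack_sent) else ts_recent
```

            The update is meant to run only for segments that passed
            the timestamp test.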

1705	      ...

1707	      fifth check the ACK field.

1709	         if the ACK bit is off drop the segment and return.

1711	         if the ACK bit is on

1713	            ...

1715	            ESTABLISHED STATE

1717	               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
1718	               SEG.ACK.  Also compute a new estimate of round-trip time.
1719	               If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
1720	               otherwise use the elapsed time since the first segment in
1721	               the retransmission queue was sent.  Any segments on the
1722	               retransmission queue which are thereby entirely
1723	               acknowledged...
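
               A minimal sketch of the timestamp-based measurement in
               this step.  The function names and the tick period
               parameter are hypothetical; arithmetic is modulo 2**32
               so that a wrapped timestamp clock still yields the
               right difference.

```python
MASK32 = 0xFFFFFFFF


def ack_is_new(snd_una: int, seg_ack: int, snd_nxt: int) -> bool:
    """True if SND.UNA < SEG.ACK <= SND.NXT (modulo 2**32)."""
    acked = (seg_ack - snd_una) & MASK32
    outstanding = (snd_nxt - snd_una) & MASK32
    return 0 < acked <= outstanding


def rtt_sample_ms(snd_tsclock: int, seg_tsecr: int,
                  tick_period_ms: int) -> int:
    """RTT measurement m = Snd.TSclock - SEG.TSecr, in milliseconds."""
    ticks = (snd_tsclock - seg_tsecr) & MASK32
    return ticks * tick_period_ms
```

               The resulting value is what would be passed to
               Update_SRTT(m) when ack_is_new() holds and the Snd.TS.OK
               bit is on.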

1725	      ...

1727	      Seventh, process the segment text.

1729	         ESTABLISHED STATE
1730	         FIN-WAIT-1 STATE
1731	         FIN-WAIT-2 STATE

1733	            ...

1735	            Send an acknowledgment of the form:

1737	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1739	            If the Snd.TS.OK bit is on, include Timestamp Option
1740	            <TSval=Snd.TSclock,TSecr=TS.Recent> in this <ACK> segment.
1741	            Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
1742	            it.  This acknowledgment should be piggy-backed on a segment
1743	            being transmitted if possible without incurring undue delay.

1745	            ...

1747	Appendix E.  Timestamps Edge Cases

1749	   While the rules laid out for when to calculate RTTM produce the
1750	   correct results most of the time, there are some edge cases where an
1751	   incorrect RTTM can be calculated.  All of these situations involve
1752	   the loss of segments.  It is felt that these scenarios are rare, and
1753	   that if they should happen, they will cause a single RTTM measurement
1754	   to be inflated, which mitigates its effects on RTO calculations.

1756	   [Martin03] cites two similar cases when the returning <ACK> is lost,
1757	   and before the retransmission timer fires, another returning <ACK>
1758	   segment arrives, which acknowledges the data.  In this case, the RTTM
1759	   calculated will be inflated:

1761	           clock
1762	             tc=1   <A, TSval=1> ------------------->

1764	             tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
1765	                 (RTTM would have been 1)

1767	                    (receive window opens, window update is sent)
1768	             tc=5        <---- <ACK(A), TSecr=1, win=m>
1769	                    (RTTM is calculated at 4)

1771	   One thing to note about this situation is that it is somewhat bounded
1772	   by RTO + RTT, limiting how far off the RTTM calculation will be.
1773	   While more complex scenarios can be constructed that produce larger
1774	   inflations (e.g., retransmissions are lost), those scenarios involve
1775	   multiple segment losses, and the connection will have other more
1776	   serious operational problems than using an inflated RTTM in the RTO
1777	   calculation.

1779	Appendix F.  Window Retraction Example

1781	   Consider an established TCP connection with WSCALE=7 (128 byte
1782	   receiver window quantization), that is running with very small
1783	   windows because the receiver is bottlenecked and both ends are doing
1784	   small reads and writes.

1786	   Consider the ACKs coming back:

1788	   SEG.ACK  SEG.WIN computed SND.WIN   receiver's actual window
1789	   1000     2       1256               1300

1790	   The sender writes 40 bytes and receiver ACKs:

1792	   1040     2       1296               1300

1794	   The sender writes 5 additional bytes and the receiver has a problem.
1795	   Two choices:

1797	   1045     2       1301               1300   - BEYOND BUFFER

1799	   1045     1       1173               1300   - RETRACTED WINDOW

1801	   This problem is completely general and can in principle happen any
1802	   time the sender does a write which is smaller than the window scale
1803	   quantum.

1805	   In most stacks it is at least partially obscured when the window size
1806	   is larger than some small number of segments because the stacks
1807	   prefer to announce windows that are integral numbers of segments
1808	   (rounded up to the next window quantum).  This plus silly window
1809	   suppression tends to cause less frequent, larger window updates.  If
1810	   the window was rounded down to a segment size there is more
1811	   opportunity to advance it ("beyond buffer" case above) rather than
1812	   retracting it.
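
   The numbers in the table above can be reproduced with the following
   Python sketch.  It reads the "computed SND.WIN" column as the right
   window edge SEG.ACK + (SEG.WIN << WSCALE) and the last column as the
   receiver's actual right edge; the function names are hypothetical.

```python
WSCALE = 7  # 128-byte window quantum


def computed_right_edge(seg_ack: int, seg_win: int,
                        wscale: int = WSCALE) -> int:
    """Right window edge as computed by the sender from an ACK."""
    return seg_ack + (seg_win << wscale)


def advertise_choices(seg_ack: int, actual_edge: int,
                      wscale: int = WSCALE) -> tuple:
    """SEG.WIN values obtained by rounding down and by rounding up."""
    quantum = 1 << wscale
    remaining = actual_edge - seg_ack
    round_down = remaining >> wscale
    round_up = -(-remaining // quantum)
    return round_down, round_up
```

   For SEG.ACK = 1045 and an actual edge of 1300, rounding down gives
   SEG.WIN = 1 (edge 1173, retracted from 1296) and rounding up gives
   SEG.WIN = 2 (edge 1301, one byte beyond the buffer).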

1814	Appendix G.  Changes from RFC 1323

1816	   Several important updates and clarifications to the specification in
1817	   RFC 1323 are made in this document.  The technical changes are
1818	   summarized below:

1820	   (a)  Section 2.4 was added describing the unavoidable window
1821	        retraction issue, and explicitly describing the mitigation steps
1822	        necessary.

1824	   (b)  In Section 3.2 the wording of how timestamp option negotiation is
1825	        to be performed was updated with RFC2119 wording.  Further, a
1826	        number of paragraphs were added to clarify the expected behavior
1827	        with a compliant implementation using TSopt, as RFC1323 left
1828	        room for interpretation - e.g. potential late enablement of
1829	        TSopt.

1831	   (c)  The description of which TSecr values can be used to update the
1832	        measured RTT has been clarified.  Specifically, with timestamps,
1833	        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
1834	        disables all RTT measurements during retransmission, since it is
1835	        ambiguous whether the <ACK> is for the original segment, or the
1836	        retransmitted segment.  With timestamps, that ambiguity is
1837	        removed since the TSecr in the <ACK> will contain the TSval from
1838	        whichever data segment made it to the destination.

1840	   (d)  RTTM update processing explicitly excludes segments not updating
1841	        SND.UNA.  The original text could be interpreted to allow taking
1842	        RTT samples when SACK acknowledges some new, non-continuous
1843	        data.

1845	   (e)  In RFC1323, section 3.4, step (2) of the algorithm to control
1846	        which timestamp is echoed was incorrect in two regards:

1848	        (1)  It failed to update TS.recent for a retransmitted segment
1849	             that resulted from a lost <ACK>.

1851	        (2)  It failed if SEG.LEN = 0.

1853	        In the new algorithm, the case of SEG.TSval >= TS.recent is
1854	        included for consistency with the PAWS test.

1856	   (f)  It is now recommended that Timestamp Options be included in
1857	        <RST> segments if the incoming segment contained a Timestamp
1858	        Option.

1860	   (g)  <RST> segments are explicitly excluded from PAWS processing.

1862	   (h)  Added text to clarify the precedence between regular TCP
1863	        [RFC0793] and timestamp/PAWS [RFCxxxx] processing.  Discussion
1864	        about combined acceptability checks is ongoing.

1866	   (i)  Snd.TSoffset and Snd.TSclock variables have been added.
1867	        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
1868	        allows the starting points for timestamp values to be randomized
1869	        on a per-connection basis.  Setting Snd.TSoffset to zero yields
1870	        the same results as [RFC1323].

1872	   (j)  Appendix A has been expanded with information about the TCP
1873	        Urgent Pointer.  An earlier revision contained text around the
1874	        TCP MSS option, which was split off into [RFC6691].

1876	   (k)  One correction was made to the Event Processing Summary in
1877	        Appendix D.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
1878	        fill in the SEG.WND value, not SND.WND.
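
   As an illustration only, items (d), (e) and (i) above can be sketched
   in a few lines of Python.  This is a minimal sketch under stated
   assumptions, not normative text: the function names are hypothetical,
   the SEG.SEQ <= Last.ACK.sent condition is assumed from the usual
   statement of the timestamp echo rule rather than quoted from this
   appendix, and all comparisons are assumed to use 32-bit modular
   arithmetic.

```python
MOD = 1 << 32  # timestamps and sequence numbers are 32-bit, wrapping values


def mod_ge(a, b):
    """Return True if a >= b under 32-bit modular (wrap-around) comparison."""
    return ((a - b) % MOD) < (1 << 31)


def snd_tsclock(my_tsclock, snd_tsoffset):
    """Item (i): Snd.TSclock is my.TSclock plus Snd.TSoffset, wrapped to 32 bits.

    A Snd.TSoffset of zero reproduces the unrandomized behavior.
    """
    return (my_tsclock + snd_tsoffset) % MOD


def update_ts_recent(ts_recent, last_ack_sent, seg_tsval, seg_seq):
    """Item (e): return the value of TS.recent after a segment arrives.

    Assumed rule: SEG.TSval is copied when SEG.TSval >= TS.recent and
    SEG.SEQ <= Last.ACK.sent. No check on SEG.LEN is made, so zero-length
    segments are handled too.
    """
    if mod_ge(seg_tsval, ts_recent) and mod_ge(last_ack_sent, seg_seq):
        return seg_tsval
    return ts_recent


def rtt_sample(now, seg_tsecr, seg_ack, snd_una):
    """Item (d): return an RTT sample in timestamp clock ticks, or None.

    A sample is taken only when the ACK advances SND.UNA; 'now' is assumed
    to be the sender's current Snd.TSclock value.
    """
    advances = seg_ack != snd_una and mod_ge(seg_ack, snd_una)
    if not advances:
        return None
    return (now - seg_tsecr) % MOD
```

   For example, under these assumptions an <ACK> with SEG.ACK equal to
   SND.UNA (such as one that only carries new SACK information) yields no
   RTT sample, whereas an <ACK> that advances SND.UNA yields the
   difference between the current timestamp clock and SEG.TSecr.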

1880	   Editorial changes to the document that don't impact the
1881	   implementation or function of the mechanisms described in this
1882	   document include:

1884	   (a)  Removed much of the discussion in Section 1 to streamline the
1885	        document.  However, detailed examples and discussions in
1886	        Section 2, Section 3 and Section 4 are kept as guidelines for
1887	        implementers.

1889	   (b)  Removed references to "new" options, as the options were
1890	        introduced in [RFC1323] already.  Changed the text in
1891	        Section 1.3 to specifically address TS and WS options.

1893	   (c)  Section 1.4 was added for RFC2119 wording.  Normative text was
1894	        updated with the appropriate phrases.

1896	   (d)  Added < > brackets to mark specific types of segments, and
1897	        replaced most occurrences of "packet" with "segment", where TCP
1898	        segments are referred to.

1900	   (e)  Removed the list of changes between RFC 1323 and prior versions.
1901	        These changes are mentioned in Appendix C of RFC 1323.

1903	   (f)  Moved Appendix "Changes" to the end of the appendices for easier
1904	        lookup.  In addition, the entries were split into a technical
1905	        and an editorial part, and sorted to roughly correspond with the
1906	        sections in the text where they apply.

1908	Authors' Addresses

1910	   David Borman
1911	   Quantum Corporation
1912	   Mendota Heights  MN 55120
1913	   USA

1915	   Email: david.borman@quantum.com

1917	   Bob Braden
1918	   University of Southern California
1919	   4676 Admiralty Way
1920	   Marina del Rey  CA 90292
1921	   USA

1923	   Email: braden@isi.edu
1924	   Van Jacobson
1925	   Packet Design
1926	   2465 Latham Street
1927	   Mountain View  CA 94040
1928	   USA

1930	   Email: van@packetdesign.com

1932	   Richard Scheffenegger (editor)
1933	   NetApp, Inc.
1934	   Am Euro Platz 2
1935	   Vienna,   1120
1936	   Austria

1938	   Email: rs@netapp.com