idnits 2.17.1

draft-ietf-tcpm-1323bis-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The abstract seems to indicate that this document obsoletes RFC1323, but
     the header doesn't have an 'Obsoletes:' line to match this.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 25, 2013) is 4071 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC1110' is defined on line 1216, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2581' is defined on line 1234, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2883' is defined on line 1240, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5681' is defined on line 1250, but no explicit
     reference was found in the text

  == Unused Reference: 'Watson81' is defined on line 1261, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1110
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185
     (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 6691
     (Obsoleted by RFC 9293)


     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	TCP Maintenance (TCPM)                                         D. Borman
3	Internet-Draft                                       Quantum Corporation
4	Intended status: Standards Track                               B. Braden
5	Expires: August 29, 2013                          University of Southern
6	                                                              California
7	                                                             V. Jacobson
8	                                                           Packet Design
9	                                                   R. Scheffenegger, Ed.
10	                                                            NetApp, Inc.
11	                                                       February 25, 2013

13	                  TCP Extensions for High Performance
14	                       draft-ietf-tcpm-1323bis-06

16	Abstract

18	   This document specifies a set of TCP extensions to improve
19	   performance over paths with a large bandwidth*delay product and to
20	   provide reliable operation over very high-speed paths.  It defines
21	   TCP options for scaled windows and timestamps.  The timestamps are
22	   used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
23	   and PAWS (Protection Against Wrapped Sequences).

25	   This document updates and obsoletes RFC 1323.

27	Status of this Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on August 29, 2013.

44	Copyright Notice

46	   Copyright (c) 2013 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	     1.1.  TCP Performance
63	     1.2.  TCP Reliability
64	     1.3.  Using TCP options
65	   2.  Terminology
66	   3.  TCP Window Scale Option
67	     3.1.  Introduction
68	     3.2.  Window Scale Option
69	     3.3.  Using the Window Scale Option
70	     3.4.  Addressing Window Retraction
71	   4.  RTTM -- Round-Trip Time Measurement
72	     4.1.  Introduction
73	     4.2.  TCP Timestamps Option
74	     4.3.  The RTTM Mechanism
75	     4.4.  Which Timestamp to Echo
76	   5.  PAWS -- Protection Against Wrapped Sequence Numbers
77	     5.1.  Introduction
78	     5.2.  The PAWS Mechanism
79	       5.2.1.  Basic PAWS Algorithm
80	       5.2.2.  Timestamp Clock
81	       5.2.3.  Outdated Timestamps
82	       5.2.4.  Header Prediction
83	       5.2.5.  IP Fragmentation
84	     5.3.  Duplicates from Earlier Incarnations of Connection
85	   6.  Conclusions and Acknowledgements
86	   7.  Security Considerations
87	   8.  IANA Considerations
88	   9.  References
89	     9.1.  Normative References
90	     9.2.  Informative References
91	   Appendix A.  Implementation Suggestions
92	   Appendix B.  Duplicates from Earlier Connection Incarnations
93	     B.1.  System Crash with Loss of State
94	     B.2.  Closing and Reopening a Connection
95	   Appendix C.  Summary of Notation
96	   Appendix D.  Pseudo-code Summary
97	   Appendix E.  Event Processing Summary
98	   Appendix F.  Timestamps Edge Cases
99	   Appendix G.  Changes from RFC 1072, RFC 1185, and RFC 1323
100	   Authors' Addresses

102	1.  Introduction

104	   The TCP protocol [RFC0793] was designed to operate reliably over
105	   almost any transmission medium regardless of transmission rate,
106	   delay, corruption, duplication, or reordering of segments.  Over the
107	   years, advances in networking technology have resulted in ever-higher
108	   transmission speeds, and the fastest paths are well beyond the domain
109	   for which TCP was originally engineered.

111	   This document defines a set of modest extensions to TCP to extend the
112	   domain of its application to match the increasing network capability.
113	   It is an update to and obsoletes [RFC1323], which in turn is based
114	   upon and obsoletes [RFC1072] and [RFC1185].

116	   For brevity, the full discussions of the merits and history behind
117	   the TCP options defined within this document have been omitted.
118	   [RFC1323] should be consulted for reference.  A modern TCP
119	   implementation SHOULD implement and make use of the extensions
120	   described in this document.

122	1.1.  TCP Performance

124	   TCP performance problems arise when the bandwidth*delay product is
125	   large.  A network having such paths is referred to as a "long, fat
126	   network" (LFN).

128	   There are three fundamental performance problems with the current TCP
129	   over LFN paths:

131	   (1)  Window Size Limit

133	        The TCP header uses a 16-bit field to report the receive window
134	        size to the sender.  Therefore, the largest window that can be
135	        used is 2^16 - 1 = 65,535 bytes (just under 64 KiB).

137	        To circumvent this problem, Section 3 of this memo defines a new
138	        TCP option, "Window Scale", to allow windows larger than 2^16.
139	        This option defines an implicit scale factor, which is used to
140	        multiply the window size value found in a TCP header to obtain
141	        the true window size.

143	   (2)  Recovery from Losses

145	        Packet losses in an LFN can have a catastrophic effect on
146	        throughput.

148	        To generalize the Fast Retransmit/Fast Recovery mechanism to
149	        handle multiple packets dropped per window, selective
150	        acknowledgments are required.  Unlike the normal cumulative
151	        acknowledgments of TCP, selective acknowledgments give the
152	        sender a complete picture of which segments are queued at the
153	        receiver and which have not yet arrived.

155	        Selective acknowledgments are specified in a separate document,
156	        "A Conservative Selective Acknowledgment (SACK)-based Loss
157	        Recovery Algorithm for TCP" [RFC6675], and not further discussed
158	        in this document.

160	   (3)  Round-Trip Measurement

162	        TCP implements reliable data delivery by retransmitting segments
163	        that are not acknowledged within some retransmission timeout
164	        (RTO) interval.  Accurate dynamic determination of an
165	        appropriate RTO is essential to TCP performance.  RTO is
166	        determined by estimating the mean and variance of the measured
167	        round-trip time (RTT), i.e., the time interval between sending a
168	        segment and receiving an acknowledgment for it [Jacobson88a].

170	        Section 4.2 introduces a new TCP option, "Timestamps", and then
171	        defines a mechanism using this option that allows nearly every
172	        segment, including retransmissions, to be timed at negligible
173	        computational cost.  We use the mnemonic RTTM (Round Trip Time
174	        Measurement) for this mechanism, to distinguish it from other
175	        uses of the Timestamps option.
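
   As a rough illustration of the window size limit in item (1), a
   sender can have at most one window of data in flight per round trip,
   so throughput is bounded by window/RTT.  The sketch below is not
   part of this specification; the function names and the example path
   (1 Gbit/s, 100 ms RTT) are assumptions chosen for illustration.

```python
# Largest value the 16-bit Window field can carry, in bytes.
MAX_UNSCALED_WINDOW = 2**16 - 1


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the path full."""
    return bandwidth_bps * rtt_seconds / 8


# Assumed example path: 1 Gbit/s with a 100 ms round-trip time.
limit_bps = max_throughput_bps(MAX_UNSCALED_WINDOW, 0.100)
bdp_bytes = bandwidth_delay_product_bytes(1e9, 0.100)
```

   Under these assumptions the unscaled window caps throughput at
   roughly 5.2 Mbit/s, while filling the path would need about 12.5
   million bytes in flight.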

177	1.2.  TCP Reliability

179	   An especially serious kind of error may result from an accidental
180	   reuse of TCP sequence numbers in data segments.  TCP reliability
181	   depends upon the existence of a bound on the lifetime of a segment:
182	   the "Maximum Segment Lifetime" or MSL.

184	   Duplication of sequence numbers might happen in either of two ways:

186	   (1)  Sequence number wrap-around on the current connection

188	        A TCP sequence number contains 32 bits.  At a high enough
189	        transfer rate, the 32-bit sequence space may be "wrapped"
190	        (cycled) within the time that a segment is delayed in queues.

192	   (2)  Earlier incarnation of the connection

194	        Suppose that a connection terminates, either by a proper close
195	        sequence or due to a host crash, and the same connection (i.e.,
196	        using the same pair of port numbers) is immediately reopened.  A
197	        delayed segment from the terminated connection could fall within
198	        the current window for the new incarnation and be accepted as
199	        valid.

201	   Duplicates from earlier incarnations, case (2), are avoided by
202	   enforcing the current fixed MSL of the TCP spec, as explained in
203	   Section 5.3 and Appendix B.  However, case (1), avoiding the reuse of
204	   sequence numbers within the same connection, requires an MSL bound
205	   that depends upon the transfer rate, and at high enough rates, a new
206	   mechanism is required.
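
   To make the rate dependence in case (1) concrete, the following
   sketch computes how long it takes to cycle the full 32-bit sequence
   space at a given transfer rate and compares that with an MSL.  The
   helper names are hypothetical, and the 120-second MSL is an assumed
   value used only for the comparison.

```python
SEQ_SPACE_BYTES = 2**32  # one full cycle of the 32-bit sequence space


def wrap_time_seconds(rate_bps: float) -> float:
    """Time needed to send 2**32 bytes at rate_bps (bits per second)."""
    return SEQ_SPACE_BYTES * 8 / rate_bps


def wrap_possible_within(rate_bps: float, msl_seconds: float) -> bool:
    """True if the sequence space can cycle while an old segment may still live."""
    return wrap_time_seconds(rate_bps) < msl_seconds


ASSUMED_MSL_SECONDS = 120  # assumption for illustration only

fast = wrap_possible_within(1e9, ASSUMED_MSL_SECONDS)   # 1 Gbit/s
slow = wrap_possible_within(10e6, ASSUMED_MSL_SECONDS)  # 10 Mbit/s
```

   At 1 Gbit/s the sequence space cycles in about 34 seconds, well
   inside the assumed MSL; at 10 Mbit/s it takes close to an hour.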

208	   A possible fix for the problem of cycling the sequence space would be
209	   to increase the size of the TCP sequence number field.  For example,
210	   the sequence number field (and also the acknowledgment field) could
211	   be expanded to 64 bits.  This could be done either by changing the
212	   TCP header or by means of an additional option.

214	   Section 5 presents a different mechanism, which we call PAWS
215	   (Protection Against Wrapped Sequence numbers), to extend TCP
216	   reliability to transfer rates well beyond the foreseeable upper limit
217	   of network bandwidths.  PAWS uses the TCP Timestamps option defined
218	   in Section 4.2 to protect against old duplicates from the same
219	   connection.

221	1.3.  Using TCP options

223	   The extensions defined in this document all use new TCP options.

225	   When RFC 1323 was published, there was concern that some buggy TCP
226	   implementation might be crashed by the first appearance of an option
227	   on a non-SYN segment.  However, bugs like that can lead to DoS
228	   attacks against a TCP, so it is now expected that most TCP
229	   implementations will properly handle unknown options on non-SYN
230	   segments.  But it is still prudent to be conservative in what you
231	   send, and avoiding buggy TCP implementations is not the only reason
232	   for negotiating TCP options on SYN segments.  Therefore, for each of
233	   the extensions defined below, it is recommended that TCP options be
234	   sent on non-SYN segments only after an exchange of options on the
235	   SYN segments has indicated that both sides understand the extension.
236	   Furthermore, an extension option will be sent in a <SYN,ACK> segment
237	   only if the corresponding option was received in the initial <SYN>
238	   segment.

240	   The timestamps option may appear in any data or ACK segment, adding
241	   12 bytes to the 20-byte TCP header.  We believe that the bandwidth
242	   saved by reducing unnecessary retransmission timeouts will more than
243	   pay for the extra header bandwidth.

245	   Appendix A contains a recommended layout of the options in TCP
246	   headers to achieve reasonable data field alignment.

248	   Finally, we observe that most of the mechanisms defined in this memo
249	   are important for LFNs and/or very high-speed networks.  For low-
250	   speed networks, it might be a performance optimization to NOT use
251	   these mechanisms.  A TCP vendor concerned about optimal performance
252	   over low-speed paths might consider turning these extensions off for
253	   low-speed paths, or allow a user or installation manager to disable
254	   them.

256	2.  Terminology

258	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
259	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
260	   document are to be interpreted as described in [RFC2119].

262	3.  TCP Window Scale Option

264	3.1.  Introduction

266	   The window scale extension expands the definition of the TCP window
267	   to 32 bits and then uses a scale factor to carry this 32-bit value in
268	   the 16-bit Window field of the TCP header (SEG.WND in RFC 793).  The
269	   scale factor is carried in a new TCP option, Window Scale.  This
270	   option is sent only in a SYN segment (a segment with the SYN bit on),
271	   hence the window scale is fixed in each direction when a connection
272	   is opened.

274	   The maximum receive window, and therefore the scale factor, is
275	   determined by the maximum receive buffer space.  In a typical modern
276	   implementation, this maximum buffer space is set by default but can
277	   be overridden by a user program before a TCP connection is opened.
278	   This determines the scale factor, and therefore no new user interface
279	   is needed for window scaling.
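
   One way to derive the scale factor from the receive buffer size is
   to pick the smallest shift count that lets the buffer size fit in
   the 16-bit Window field.  This is only a sketch: the function name
   is hypothetical, and the cap of 14 is the shift count limit stated
   in Section 3.3.

```python
MAX_SHIFT = 14                   # upper limit on shift.cnt (Section 3.3)
MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value of the 16-bit Window field


def choose_shift_count(rcv_buffer_bytes: int) -> int:
    """Smallest shift count for which the scaled buffer size fits in 16 bits."""
    shift = 0
    while shift < MAX_SHIFT and (rcv_buffer_bytes >> shift) > MAX_UNSCALED_WINDOW:
        shift += 1
    return shift
```

   For example, a 1 MiB buffer yields a shift count of 5, and any
   buffer of 65,535 bytes or less yields 0.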

281	3.2.  Window Scale Option

283	   The three-byte Window Scale option MAY be sent in a SYN segment by a
284	   TCP.  It has two purposes: (1) indicate that the TCP is prepared to
285	   do both send and receive window scaling, and (2) communicate a scale
286	   factor to be applied to its receive window.  Thus, a TCP that is
287	   prepared to scale windows SHOULD send the option, even if its own
288	   scale factor is 1.  The scale factor is limited to a power of two and
289	   encoded logarithmically, so it may be implemented by binary shift
290	   operations.

292	   TCP Window Scale Option (WSopt):

294	   Kind: 3

296	   Length: 3 bytes

298	          +---------+---------+---------+
299	          | Kind=3  |Length=3 |shift.cnt|
300	          +---------+---------+---------+
301	               1         1         1

303	   This option is an offer, not a promise; both sides MUST send Window
304	   Scale options in their SYN segments to enable window scaling in
305	   either direction.  If window scaling is enabled, then the TCP that
306	   sent this option will right-shift its true receive-window values by
307	   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
308	   MAY be zero (offering to scale, while applying a scale factor of 1 to
309	   the receive window).

311	   This option MAY be sent in an initial <SYN> segment (i.e., a segment
312	   with the SYN bit on and the ACK bit off).  It MAY also be sent in a
313	   <SYN,ACK> segment, but only if a Window Scale option was received in
314	   the initial <SYN> segment.  A Window Scale option in a segment
315	   without a SYN bit SHOULD be ignored.

317	   The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself
318	   is never scaled.
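
   The three-byte layout above can be encoded and decoded as in the
   following sketch.  The function names are hypothetical; the parser
   returns None for an option that arrives on a segment without the SYN
   bit, reflecting the rule that such an option SHOULD be ignored.

```python
import struct
from typing import Optional

WSOPT_KIND = 3
WSOPT_LEN = 3


def build_wsopt(shift_cnt: int) -> bytes:
    """Encode a Window Scale option: Kind, Length, shift.cnt (one byte each)."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LEN, shift_cnt)


def parse_wsopt(option: bytes, syn_bit: bool) -> Optional[int]:
    """Return shift.cnt, or None if the option is malformed or must be ignored."""
    if not syn_bit:
        return None  # Window Scale is only meaningful on SYN segments
    if len(option) != WSOPT_LEN:
        return None
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LEN:
        return None
    return shift_cnt
```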

320	3.3.  Using the Window Scale Option

322	   A model implementation of window scaling is as follows, using the
323	   notation of [RFC0793]:

325	   o  All windows are treated as 32-bit quantities for storage in the
326	      connection control block and for local calculations.  This
327	      includes the send-window (SND.WND) and the receive-window
328	      (RCV.WND) values, as well as the congestion window.

330	   o  The connection state is augmented by two window shift counts,
331	      Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
332	      and outgoing window fields, respectively.

334	   o  If a TCP receives a <SYN> segment containing a Window Scale
335	      option, it sends its own Window Scale option in the <SYN,ACK>
336	      segment.

338	   o  The Window Scale option is sent with shift.cnt = R, where R is the
339	      value that the TCP would like to use for its receive window.

341	   o  Upon receiving a SYN segment with a Window Scale option containing
342	      shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
343	      Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
344	      Rcv.Wind.Scale to zero.

346	   o  The window field (SEG.WND) in the header of every incoming
347	      segment, with the exception of SYN segments, is left-shifted by
348	      Snd.Wind.Scale bits before updating SND.WND:

350	                    SND.WND = SEG.WND << Snd.Wind.Scale

352	      (assuming the other conditions of [RFC0793] are met, and using the
353	      "C" notation "<<" for left-shift).

355	   o  The window field (SEG.WND) of every outgoing segment, with the
356	      exception of SYN segments, is right-shifted by Rcv.Wind.Scale
357	      bits:

359	                    SEG.WND = RCV.WND >> Rcv.Wind.Scale
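
   The model above might be sketched as follows.  The class and method
   names are hypothetical, the sketch keeps only the window-related
   state, and it leaves out both the shift count limit and the other
   RFC 793 conditions for updating SND.WND.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowScaleState:
    snd_wind_scale: int = 0  # Snd.Wind.Scale: applied to incoming SEG.WND
    rcv_wind_scale: int = 0  # Rcv.Wind.Scale: applied to outgoing SEG.WND
    snd_wnd: int = 0         # SND.WND, held as a 32-bit quantity
    rcv_wnd: int = 0         # RCV.WND, held as a 32-bit quantity

    def on_syn(self, peer_shift: Optional[int], own_shift: int) -> None:
        """peer_shift is S from the peer's option (None if absent); own_shift is R."""
        if peer_shift is None:
            self.snd_wind_scale = 0
            self.rcv_wind_scale = 0
        else:
            self.snd_wind_scale = peer_shift
            self.rcv_wind_scale = own_shift

    def on_segment(self, seg_wnd: int, syn: bool = False) -> None:
        """SND.WND = SEG.WND << Snd.Wind.Scale; SYN segments are never scaled."""
        shift = 0 if syn else self.snd_wind_scale
        self.snd_wnd = seg_wnd << shift

    def outgoing_seg_wnd(self, syn: bool = False) -> int:
        """SEG.WND = RCV.WND >> Rcv.Wind.Scale, limited to the 16-bit field."""
        shift = 0 if syn else self.rcv_wind_scale
        return min(self.rcv_wnd >> shift, 0xFFFF)
```

   With S = 7 and R = 5, an incoming SEG.WND of 512 gives SND.WND =
   65536, and a 1 MiB receive window is sent as SEG.WND = 32768.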

361	   TCP determines if a data segment is "old" or "new" by testing whether
362	   its sequence number is within 2^31 bytes of the left edge of the
363	   window, and if it is not, discarding the data as "old".  To ensure
364	   that new data is never mistakenly considered old and vice versa, the
365	   left edge of the sender's window has to be at most 2^31 away from the
366	   right edge of the receiver's window.  Similarly with the sender's
367	   right edge and receiver's left edge.  Since the right and left edges
368	   of either the sender's or receiver's window differ by the window
369	   size, and since the sender and receiver windows can be out of phase
370	   by at most the window size, the above constraints imply that two
371	   times the max window size must be less than 2^31, or

373	                             max window < 2^30

375	   Since the max window is 2^S (where S is the scaling shift count)
376	   times at most 2^16 - 1 (the maximum unscaled window), the maximum
377	   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
378	   MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte).  If a
379	   Window Scale option is received with a shift.cnt value exceeding 14,
380	   the TCP SHOULD log the error but use 14 instead of the specified
381	   value.
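
   The bound can be checked numerically, and the handling of an
   oversized shift.cnt can be sketched as below.  The function names
   are hypothetical, and the use of the standard logging module is just
   one possible way to "log the error".

```python
import logging

MAX_SHIFT = 14


def max_window(shift: int) -> int:
    """Largest window expressible with the given shift count."""
    return (2**16 - 1) << shift


def effective_shift(received_shift_cnt: int) -> int:
    """Use the received shift.cnt, substituting 14 if it exceeds the limit."""
    if received_shift_cnt > MAX_SHIFT:
        logging.warning("Window Scale shift.cnt %d exceeds %d; using %d",
                        received_shift_cnt, MAX_SHIFT, MAX_SHIFT)
        return MAX_SHIFT
    return received_shift_cnt
```

   Here max_window(14) is 1,073,725,440, which is below 2^30, while
   max_window(15) would exceed it.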

383	   The scale factor applies only to the Window field as transmitted in
384	   the TCP header; each TCP using extended windows will maintain the
385	   window values locally as 32-bit numbers.  For example, the
386	   "congestion window" computed by Slow Start and Congestion Avoidance
387	   is not affected by the scale factor, so window scaling will not
388	   introduce quantization into the congestion window.

390	3.4.  Addressing Window Retraction

392	   When a non-zero scale factor is in use, there are instances when a
393	   retracted window can be offered [Mathis08].  The end of the window
394	   will be on a boundary based on the granularity of the scale factor
395	   being used.  If the sequence number is then updated by a number of
396	   bytes smaller than that granularity, the TCP will have to either
397	   advertise a new window that is beyond what it previously advertised
398	   (and perhaps beyond the buffer), or will have to advertise a smaller
399	   window, which will cause the TCP window to shrink.  Implementations
400	   MUST ensure that they handle a shrinking window, as specified in
401	   section 4.2.2.16 of [RFC1122].

403	   For the receiver, this implies that:

405	   1)  The receiver MUST honor, as in-window, any segment that would
406	       have been in-window for any ACK sent by the receiver.

408	   2)  When window scaling is in effect, the receiver SHOULD track the
409	       actual maximum window sequence number (which is likely to be
410	       greater than the window announced by the most recent ACK, if more
411	       than one segment has arrived since the application consumed any
412	       data in the receive buffer).

414	   On the sender side:

416	   3)  The initial transmission MUST honor window on most recent ACK.

418	   4)  On first retransmission, or if the sequence number is out-of-
419	       window by less than (2^Rcv.Wind.Scale) then do normal
420	       retransmission(s) without regard to receiver window as long as
421	       the original segment was in window when it was sent.

423	   5)  On subsequent retransmissions, treat such ACKs as zero window
424	       probes.
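
   The retraction described at the start of this section, and the
   receiver-side tracking in item 2, can be illustrated with small
   numbers.  This is a sketch under simplifying assumptions: names are
   hypothetical, sequence numbers are treated as plain integers without
   32-bit wrap-around, and the receive buffer has a fixed right edge.

```python
def advertised_right_edge(rcv_nxt: int, buffer_right_edge: int, shift: int) -> int:
    """Right edge implied by the scaled Window field in an ACK."""
    true_window = buffer_right_edge - rcv_nxt
    seg_wnd = true_window >> shift         # value carried in SEG.WND
    return rcv_nxt + (seg_wnd << shift)    # what the sender reconstructs


class RightEdgeTracker:
    """Remembers the highest right edge ever announced by the receiver."""

    def __init__(self) -> None:
        self.max_right_edge = 0

    def note_ack_sent(self, right_edge: int) -> None:
        self.max_right_edge = max(self.max_right_edge, right_edge)

    def in_window(self, rcv_nxt: int, seq: int) -> bool:
        """Accept anything that was in-window for any ACK sent so far."""
        return rcv_nxt <= seq < self.max_right_edge
```

   With a shift count of 2 and a buffer ending at sequence number 100,
   the announced right edge is 100 when RCV.NXT is 0 but only 97 once
   RCV.NXT advances to 1, because 99 is not a multiple of the
   granularity.  A tracker fed both ACKs still treats sequence number
   98 as in-window.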

426	4.  RTTM -- Round-Trip Time Measurement

428	4.1.  Introduction

430	   Accurate and current RTT estimates are necessary to adapt to changing
431	   traffic conditions and to avoid an instability known as "congestion
432	   collapse" [RFC0896] in a busy network.  However, accurate measurement
433	   of RTT may be difficult both in theory and in implementation.

435	   Many TCP implementations base their RTT measurements upon a sample of
436	   one packet per window or less.  While this yields an adequate
437	   approximation to the RTT for small windows, it results in an
438	   unacceptably poor RTT estimate for an LFN.  If we look at RTT
439	   estimation as a signal processing problem (which it is), a data
440	   signal at some frequency, the packet rate, is being sampled at a
441	   lower frequency, the window rate.  This lower sampling frequency
442	   violates Nyquist's criterion and may therefore introduce "aliasing"
443	   artifacts into the estimated RTT [Hamming77].

445	   A good RTT estimator with a conservative retransmission timeout
446	   calculation can tolerate aliasing when the sampling frequency is
447	   "close" to the data frequency.  For example, with a window of 8
448	   packets, the sample rate is 1/8 the data frequency -- less than an
449	   order of magnitude different.  However, when the window is tens or
450	   hundreds of packets, the RTT estimator may be seriously in error,
451	   resulting in spurious retransmissions.

453	   If there are dropped packets, the problem becomes worse.  Zhang
454	   [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
455	   possible to accumulate reliable RTT estimates if retransmitted
456	   segments are included in the estimate.  Since a full window of data
457	   will have been transmitted prior to a retransmission, all of the
458	   segments in that window will have to be ACKed before the next RTT
459	   sample can be taken.  This means at least an additional window's
460	   worth of time between RTT measurements and, as the error rate
461	   approaches one per window of data (e.g., 10^-6 errors per bit for the
462	   Wideband satellite network), it becomes effectively impossible to
463	   obtain a valid RTT measurement.

465	   A solution to these problems, which actually simplifies the sender
466	   substantially, is as follows: using TCP options, the sender places a
467	   timestamp in each data segment, and the receiver reflects these
468	   timestamps back in ACK segments.  Then a single subtract gives the
469	   sender an accurate RTT measurement for every ACK segment (which will
470	   correspond to every other data segment, with a sensible receiver).
471	   We call this the RTTM (Round-Trip Time Measurement) mechanism.

473	   It is vitally important to use the RTTM mechanism with big windows;
474	   otherwise, the door is opened to some dangerous instabilities due to
475	   aliasing.  Furthermore, the option is probably useful for all TCPs,
476	   since it simplifies the sender.

478	4.2.  TCP Timestamps Option

480	   TCP is a symmetric protocol, allowing data to be sent at any time in
481	   either direction, and therefore timestamp echoing may occur in either
482	   direction.  For simplicity and symmetry, we specify that timestamps
483	   always be sent and echoed in both directions.  For efficiency, we
484	   combine the timestamp and timestamp reply fields into a single TCP
485	   Timestamps Option.

487	   TCP Timestamps Option (TSopt):

489	   Kind: 8

491	   Length: 10 bytes

493	          +-------+-------+---------------------+---------------------+
494	          |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
495	          +-------+-------+---------------------+---------------------+
496	              1       1              4                     4

498	   The Timestamps option carries two four-byte timestamp fields.  The
499	   Timestamp Value field (TSval) contains the current value of the
500	   timestamp clock of the TCP sending the option.

502	   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
503	   in the TCP header; if it is valid, it echoes a timestamp value that
504	   was sent by the remote TCP in the TSval field of a Timestamp option.
505	   When TSecr is not valid, its value MUST be zero.  However, a value of
506	   zero does not imply TSecr being invalid.  The TSecr value will
507	   generally be from the most recent Timestamps Option that was
508	   received; however, there are exceptions that are explained below.
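
   The ten-byte layout and the TSecr validity rule can be sketched as
   follows.  The function names are hypothetical; the parser reports
   TSecr as None when the ACK bit is off, since a zero on the wire does
   not by itself tell the two cases apart.

```python
import struct
from typing import Optional, Tuple

TSOPT_KIND = 8
TSOPT_LEN = 10


def build_tsopt(tsval: int, tsecr: int, ack_bit: bool) -> bytes:
    """Encode Kind, Length, TSval, TSecr in network byte order."""
    if not ack_bit:
        tsecr = 0  # TSecr MUST be zero when it is not valid
    return struct.pack("!BBII", TSOPT_KIND, TSOPT_LEN,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def parse_tsopt(option: bytes, ack_bit: bool) -> Optional[Tuple[int, Optional[int]]]:
    """Return (TSval, TSecr); TSecr is None when the ACK bit is not set."""
    if len(option) != TSOPT_LEN:
        return None
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TSOPT_KIND or length != TSOPT_LEN:
        return None
    return tsval, (tsecr if ack_bit else None)
```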

510	   A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
511	   segment (i.e., a segment containing a SYN bit and no ACK bit).  Once
512	   a TSopt has been sent or received in a non <SYN> segment, it MUST be
513	   sent in all segments.  Once a TSopt has been received in a non <SYN>
514	   segment, then any successive segment that is received without the RST
515	   bit and without a TSopt MAY be dropped without further processing,
516	   and an ACK of the current SND.UNA generated.

518	   In the case of crossing SYN packets where one SYN contains a TSopt
519	   and the other doesn't, both sides SHOULD put a TSopt in the <SYN,ACK>
520	   segment.

522	4.3.  The RTTM Mechanism

524	   RTTM places a Timestamps option in every segment, with a TSval that
525	   is obtained from a (virtual) "timestamp clock".  Values of this clock
526	   MUST be at least approximately proportional to real time, in order to
527	   measure actual RTT.

529	   These TSval values are echoed in TSecr values in the reverse
530	   direction.  The difference between a received TSecr value and the
531	   current timestamp clock value provides an RTT measurement.
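
   Because both values are 32-bit quantities, the subtraction is
   naturally done modulo 2^32.  The sketch below assumes a hypothetical
   helper name and a hypothetical tick duration parameter; the result
   is in timestamp clock ticks until converted.

```python
def rtt_sample_ticks(ts_clock_now: int, tsecr: int) -> int:
    """Elapsed timestamp clock ticks, computed modulo 2**32."""
    return (ts_clock_now - tsecr) & 0xFFFFFFFF


def ticks_to_seconds(ticks: int, tick_seconds: float) -> float:
    """Convert ticks to seconds; tick_seconds is an assumed clock period."""
    return ticks * tick_seconds
```

   The masking keeps the result correct when the timestamp clock wraps
   between sending a segment and receiving its echo.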

533	   When timestamps are used, every segment that is received will contain
534	   a TSecr value.  However, these values cannot all be used to update
535	   the measured RTT.  The following example illustrates why.  It shows a
536	   one-way data flow with segments arriving in sequence without loss.
537	   Here A, B, C... represent data blocks occupying successive blocks of
538	   sequence numbers, and ACK(A),... represent the corresponding
539	   cumulative acknowledgments.  The two timestamp fields of the
540	   Timestamps option are shown symbolically as <TSval=x,TSecr=y>.  Each
541	   TSecr field contains the value most recently received in a TSval
542	   field.

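544	              TCP  A                                     TCP B

546	                             <A,TSval=1,TSecr=120> ----->

548	                   <---- <ACK(A),TSval=127,TSecr=1>

550	                             <B,TSval=5,TSecr=127> ----->

552	                   <---- <ACK(B),TSval=131,TSecr=5>

554	                . . . . . . . . . . . . . . . . . . . . . .

556	                             <C,TSval=65,TSecr=131> ---->

558	                   <---- <ACK(C),TSval=191,TSecr=65>

560	                                  (etc.)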

562	   The dotted line marks a pause (60 time units long) in which A had
563	   nothing to send.  Note that this pause inflates the RTT which B could
564	   infer from receiving TSecr=131 in data segment C. Thus, in one-way
565	   data flows, RTTM in the reverse direction measures a value that is
566	   inflated by gaps in sending data.  However, the following rule
567	   prevents a resulting inflation of the measured RTT:

569	      RTTM Rule: A TSecr value received in a segment is used to update
570	      the averaged RTT measurement only if

572	      a)  the segment acknowledges some new data, i.e., only if it
573	          advances the left edge of the send window, and

575	      b)  the segment does not indicate any loss or reordering, i.e.,
576	          it contains no SACK options.

578	   Since TCP B is not sending data, the data segment C does not
579	   acknowledge any new data when it arrives at B. Thus, the inflated
580	   RTTM measurement is not used to update B's RTTM measurement.
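
The RTTM Rule can be sketched as a small filter in front of the RTT
estimator.  This is a minimal illustration, not a definitive
implementation: the function name and parameters are hypothetical, and
condition (b) is read here as "the segment carries no SACK option",
which is an interpretation of the rule.

```python
MOD32 = 1 << 32
HALF = 1 << 31


def rtt_sample(now, tsecr, snd_una, seg_ack, has_sack):
    """Return an RTT sample in timestamp-clock ticks, or None when the
    RTTM Rule says the segment must not update the RTT estimate.

    All parameter names are illustrative assumptions:
      now      - current value of the local timestamp clock
      tsecr    - TSecr value carried by the arriving segment
      snd_una  - left edge of the send window before processing
      seg_ack  - ACK field of the arriving segment
      has_sack - True if the segment carries a SACK option
    """
    # (a) The segment must acknowledge new data, i.e. advance SND.UNA.
    advances = 0 < ((seg_ack - snd_una) % MOD32) < HALF
    # (b) The segment must not indicate loss or reordering.
    if not advances or has_sack:
        return None
    # Clock difference computed in unsigned 32-bit arithmetic.
    return (now - tsecr) % MOD32
```

In the example above, data segment C arrives at B carrying TSecr=131
but does not advance B's send window, so the sketch returns None and
the inflated value is never fed to B's estimator.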

582	   Implementers should note that with Timestamps multiple RTTMs can be
583	   taken per RTT.  Many RTO estimators have a weighting factor based on
584	   an implicit assumption that at most one RTTM will be sampled per RTT.
585	   When using multiple RTTMs per RTT to update the RTO estimator, the
586	   weighting factor needs to be decreased to take into account the more
587	   frequent RTTMs.  For example, an implementation could choose to just
588	   use one sample per RTT to update the RTO estimator, or vary the gain
589	   based on the congestion window, or take an average of all the RTTM
590	   measurements received over one RTT, and then use that value to update
591	   the RTO estimator.  This document does not prescribe any particular
592	   method for modifying the RTO estimator.
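
One of the options mentioned above, averaging all samples received over
one RTT before updating the estimator, could look roughly like the
following sketch.  The class and method names are hypothetical.  The
gains of 1/8 and 1/4 and the "SRTT + 4 * RTTVAR" form are those of the
widely used standard estimator and are an assumption here, since this
document does not prescribe any particular method; clock granularity
and minimum RTO bounds are deliberately left out.

```python
class RtoEstimator:
    """Illustrative RTO estimator fed with at most one value per RTT."""

    ALPHA = 1 / 8  # gain for the smoothed RTT (assumed, not mandated)
    BETA = 1 / 4   # gain for the RTT variance (assumed, not mandated)

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.pending = []  # RTTM samples collected during the current RTT

    def add_sample(self, rtt):
        """Record one RTTM sample without touching the estimator."""
        self.pending.append(rtt)

    def end_of_rtt(self):
        """Fold the samples of the last RTT into one update; return RTO."""
        if self.pending:
            r = sum(self.pending) / len(self.pending)
            self.pending.clear()
            if self.srtt is None:
                # First measurement initializes both state variables.
                self.srtt = r
                self.rttvar = r / 2
            else:
                self.rttvar = ((1 - self.BETA) * self.rttvar
                               + self.BETA * abs(self.srtt - r))
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        return self.rto()

    def rto(self):
        """Current RTO, or None before the first measurement."""
        if self.srtt is None:
            return None
        return self.srtt + 4 * self.rttvar
```

Because the estimator is updated once per RTT regardless of how many
segments were timed, its gains keep their usual meaning.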

594	4.4.  Which Timestamp to Echo

596	   If more than one Timestamps option is received before a reply segment
597	   is sent, the TCP must choose only one of the TSvals to echo, ignoring
598	   the others.  To minimize the state kept in the receiver (i.e., the
599	   number of unprocessed TSvals), the receiver should be required to
600	   retain at most one timestamp in the connection control block.

602	   There are three situations to consider:

604	   (A)  Delayed ACKs.

606	        Many TCP's acknowledge only every Kth segment out of a group of
607	        segments arriving within a short time interval; this policy is
608	        known generally as "delayed ACKs".  The data-sender TCP must
609	        measure the effective RTT, including the additional time due to
610	        delayed ACKs, or else it will retransmit unnecessarily.  Thus,
611	        when delayed ACKs are in use, the receiver SHOULD reply with the
612	        TSval field from the earliest unacknowledged segment.

614	   (B)  A hole in the sequence space (segment(s) have been lost).

616	        The sender will continue sending until the window is filled, and
617	        the receiver may be generating ACKs as these out-of-order
618	        segments arrive (e.g., to aid "fast retransmit").

620	        The lost segment is probably a sign of congestion, and in that
621	        situation the sender should be conservative about
622	        retransmission.  Furthermore, it is better to overestimate than
623	        underestimate the RTT.  An ACK for an out-of-order segment
624	        SHOULD therefore contain the timestamp from the most recent
625	        segment that advanced the window.

627	        The same situation occurs if segments are re-ordered by the
628	        network.

630	   (C)  A filled hole in the sequence space.

632	        The segment that fills the hole represents the most recent
633	        measurement of the network characteristics.  An RTT computed from
634	        an earlier segment would probably include the sender's
635	        retransmit time-out, badly biasing the sender's average RTT
636	        estimate.  Thus, the timestamp from the latest segment (which
637	        filled the hole) MUST be echoed.

639	   An algorithm that covers all three cases is described in the
640	   following rules for Timestamps option processing on a synchronized
641	   connection:

643	   (1)  The connection state is augmented with two 32-bit slots:

645	        TS.Recent holds a timestamp to be echoed in TSecr whenever a
646	        segment is sent, and Last.ACK.sent holds the ACK field from the
647	        last segment sent.  Last.ACK.sent will equal RCV.NXT except when
648	        ACKs have been delayed.

650	   (2)  If:

652	            SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

654	        then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.

656	   (3)  When a TSopt is sent, its TSecr field is set to the current
657	        TS.Recent value.
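
Rules (1) to (3) can be sketched as follows.  This is a simplified
model under stated assumptions: the class and method names are
hypothetical, TS.Recent starts at zero rather than being initialized
from the connection handshake, and segment reassembly is reduced to
the minimum needed to move RCV.NXT.

```python
MOD32 = 1 << 32
HALF = 1 << 31


def mod_leq(a, b):
    """a <= b in modular 32-bit space (sequence numbers or timestamps)."""
    return ((b - a) % MOD32) < HALF


class Receiver:
    def __init__(self, rcv_nxt=0):
        # Rule (1): two extra 32-bit slots in the connection state.
        self.ts_recent = 0             # assumed initial value
        self.last_ack_sent = rcv_nxt
        self.rcv_nxt = rcv_nxt
        self.queued = {}               # seq -> length, out-of-order data

    def receive(self, seq, length, tsval):
        """Process one data segment and return the resulting TS.Recent."""
        # Rule (2): SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent.
        if mod_leq(self.ts_recent, tsval) and mod_leq(seq, self.last_ack_sent):
            self.ts_recent = tsval
        if seq == self.rcv_nxt:
            # In-sequence data: advance RCV.NXT over it and over any
            # queued segments that are now contiguous.
            self.rcv_nxt = (seq + length) % MOD32
            while self.rcv_nxt in self.queued:
                step = self.queued.pop(self.rcv_nxt)
                self.rcv_nxt = (self.rcv_nxt + step) % MOD32
        elif not mod_leq(seq, self.rcv_nxt):
            # Beyond the left window edge: hold for later.
            self.queued[seq] = length
        return self.ts_recent

    def send_ack(self):
        """Return (ACK field, TSecr) for an outgoing segment."""
        # Rule (3): TSecr is the current TS.Recent value.
        self.last_ack_sent = self.rcv_nxt
        return self.rcv_nxt, self.ts_recent
```

With one-byte segments A through E at sequence numbers 0 through 4,
delivering A, C, B, E, D and acknowledging each arrival yields
TS.Recent values of 1, 1, 2, 2, 4 in this model.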

659	   The following examples illustrate these rules.  Here A, B, C...
660	   represent data segments occupying successive blocks of sequence
661	   numbers, and ACK(A),... represent the corresponding acknowledgment
662	   segments.  Note that ACK(A) has the same sequence number as B. We
663	   show only one direction of timestamp echoing, for clarity.

665	   o  Packets arrive in sequence, and some of the ACKs are delayed.

667	      By case (A), the timestamp from the oldest unacknowledged segment
668	      is echoed.

670	                                                    TS.Recent
671	      <A, TSval=1> ------------------->
672	                                                        1
673	      <B, TSval=2> ------------------->
674	                                                        1
675	      <C, TSval=3> ------------------->
676	                                                        1
677	                           <---- <ACK(C), TSecr=1>
678	                  (etc)

680	   o  Packets arrive out of order, and every packet is acknowledged.

682	      By case (B), the timestamp from the last segment that advanced the
683	      left window edge is echoed, until the missing segment arrives; it
684	      is echoed according to Case (C).  The same sequence would occur if
685	      segments B and D were lost and retransmitted.

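687	                                                    TS.Recent
688	      <A, TSval=1> ------------------->
689	                                                        1
690	                           <---- <ACK(A), TSecr=1>
691	                                                        1
692	      <C, TSval=3> ------------------->
693	                                                        1
694	                           <---- <ACK(A), TSecr=1>
695	                                                        1
696	      <B, TSval=2> ------------------->
697	                                                        2
698	                           <---- <ACK(C), TSecr=2>
699	                                                        2
700	      <E, TSval=5> ------------------->
701	                                                        2
702	                           <---- <ACK(C), TSecr=2>
703	                                                        2
704	      <D, TSval=4> ------------------->
705	                                                        4
706	                           <---- <ACK(E), TSecr=4>
707	                  (etc)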

709	5.  PAWS -- Protection Against Wrapped Sequence Numbers

711	5.1.  Introduction

713	   Section 5.2 describes a simple mechanism to reject old duplicate
714	   segments that might corrupt an open TCP connection; we call this
715	   mechanism PAWS (Protection Against Wrapped Sequence numbers).  PAWS
716	   operates within a single TCP connection, using state that is saved in
717	   the connection control block.  Section 5.3 and Appendix G discuss the
718	   implications of the PAWS mechanism for avoiding old duplicates from
719	   previous incarnations of the same connection.

721	5.2.  The PAWS Mechanism

723	   PAWS uses the same TCP Timestamps option as the RTTM mechanism
724	   described earlier, and assumes that every received TCP segment
725	   (including data and ACK segments) contains a timestamp SEG.TSval
726	   whose values are monotonically non-decreasing in time.  The basic
727	   idea is that a segment can be discarded as an old duplicate if it is
728	   received with a timestamp SEG.TSval less than some timestamp recently
729	   received on this connection.

731	   In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit
732	   unsigned integers in a modular 32-bit space.  Thus, "less than" is
733	   defined the same way it is for TCP sequence numbers, and the same
734	   implementation techniques apply.  If s and t are timestamp values,

736	                       s < t  if 0 < (t - s) < 2^31,

738	   computed in unsigned 32-bit arithmetic.
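
A minimal sketch of this comparison follows.  The function name is
hypothetical; Python integers are unbounded, so the unsigned 32-bit
arithmetic is emulated with a mask.

```python
def ts_lt(s, t):
    """Return True if timestamp s is "less than" t in modular 32-bit space.

    Implements: s < t  if 0 < (t - s) < 2^31, with (t - s) computed in
    unsigned 32-bit arithmetic.
    """
    return 0 < ((t - s) & 0xFFFFFFFF) < (1 << 31)
```

Note that two values exactly 2^31 apart are not ordered in either
direction under this definition.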

740	   The choice of incoming timestamps to be saved for this comparison
741	   MUST guarantee a value that is monotonically increasing.  For
742	   example, we might save the timestamp from the segment that last
743	   advanced the left edge of the receive window, i.e., the most recent
744	   in-sequence segment.  Instead, we choose the value TS.Recent
745	   introduced in Section 4.4 for the RTTM mechanism, since using a
746	   common value for both PAWS and RTTM simplifies the implementation of
747	   both.  As Section 4.4 explained, TS.Recent differs from the timestamp
748	   from the last in-sequence segment only in the case of delayed ACKs,
749	   and therefore by less than one window.  Either choice will therefore
750	   protect against sequence number wrap-around.

752	   RTTM was specified in a symmetrical manner, so that TSval timestamps
753	   are carried in both data and ACK segments and are echoed in TSecr
754	   fields carried in returning ACK or data segments.  PAWS submits all
755	   incoming segments to the same test, and therefore protects against
756	   duplicate ACK segments as well as data segments.  (An alternative
757	   non-symmetric algorithm would protect against old duplicate ACKs: the
758	   sender of data would reject incoming ACK segments whose TSecr values
759	   were less than the TSecr saved from the last segment whose ACK field
760	   advanced the left edge of the send window.  This algorithm was deemed
761	   to lack economy of mechanism and symmetry.)

763	   TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
764	   initialize PAWS.  PAWS protects against old duplicate non-SYN
765	   segments, and duplicate SYN segments received while there is a
766	   synchronized connection.  Duplicate <SYN> and <SYN,ACK> segments
767	   received when there is no connection will be discarded by the normal
768	   3-way handshake and sequence number checks of TCP.

770	   [RFC1323] recommended that RST segments NOT carry timestamps, and
771	   that they be acceptable regardless of their timestamp.  At that time,
772	   the thinking was that old duplicate RST segments should be
773	   exceedingly unlikely, and their cleanup function should take
774	   precedence over timestamps.  More recently, discussions about various
775	   blind attacks on TCP connections have raised the suggestion that if
776	   the Timestamps option is present, SEG.TSecr could be used to provide
777	   stricter acceptance tests for RST packets.  While this is still under
778	   discussion, to enable research into this area it is now RECOMMENDED
779	   that a RST generated in response to a packet that contained a
780	   Timestamps option also contain a Timestamps option.  In the RST
781	   segment, SEG.TSecr SHOULD be set to
782	   SEG.TSval from the incoming packet and SEG.TSval SHOULD be set to
783	   zero.  If a RST is being generated because of a user abort, and
784	   Snd.TS.OK is set, then a Timestamps option SHOULD be included in the
785	   RST.  When a RST packet is received, it MUST NOT be subjected to PAWS
786	   checks, and information from the Timestamps option MUST NOT be used
787	   to update connection state information.  SEG.TSecr MAY be used to
788	   provide stricter RST acceptance checks.

790	5.2.1.  Basic PAWS Algorithm

792	   The PAWS algorithm requires the following processing to be performed
793	   on all incoming segments for a synchronized connection:

795	   R1)  If there is a Timestamps option in the arriving segment,
796	        SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion)
797	        and the RST bit is not set, then treat the arriving segment as
798	        not acceptable:

800	           Send an acknowledgement in reply as specified in [RFC0793]
801	           page 69 and drop the segment.

803	           Note: it is necessary to send an ACK segment in order to
804	           retain TCP's mechanisms for detecting and recovering from
805	           half-open connections.  For example, see Figure 10 of
806	           [RFC0793].

808	   R2)  If the segment is outside the window, reject it (normal TCP
809	        processing)

811	   R3)  If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
812	        Section 4.4), then record its timestamp in TS.Recent.

814	   R4)  If an arriving segment is in-sequence (i.e., at the left window
815	        edge), then accept it normally.

817	   R5)  Otherwise, treat the segment as a normal in-window, out-of-
818	        sequence TCP segment (e.g., queue it for later delivery to the
819	        user).

821	   Steps R2, R4, and R5 are the normal TCP processing steps specified by
822	   [RFC0793].
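
Steps R1 to R5 can be sketched as below.  This is an illustration
under several assumptions, not a definitive implementation: all names
are hypothetical, the window test in R2 looks only at the first byte
of the segment, every accepted in-sequence segment is assumed to be
acknowledged immediately, and RST segments are simply handed to
separate processing because they are not subjected to PAWS checks.

```python
MOD32 = 1 << 32
HALF = 1 << 31


def mod_lt(a, b):
    """a < b in modular 32-bit space."""
    return 0 < ((b - a) % MOD32) < HALF


def mod_leq(a, b):
    """a <= b in modular 32-bit space."""
    return ((b - a) % MOD32) < HALF


class PawsReceiver:
    def __init__(self, rcv_nxt, rcv_wnd, ts_recent):
        self.rcv_nxt = rcv_nxt
        self.rcv_wnd = rcv_wnd
        self.last_ack_sent = rcv_nxt
        self.ts_recent = ts_recent
        self.ts_recent_valid = True
        self.queued = {}  # seq -> length of in-window, out-of-sequence data

    def process(self, seq, length, tsval=None, rst=False):
        """Classify one arriving segment; returns an action string."""
        if rst:
            return "handle-rst"  # no PAWS check, no timestamp state update
        # R1: old timestamp and a valid TS.Recent: ACK and drop.
        if (tsval is not None and self.ts_recent_valid
                and mod_lt(tsval, self.ts_recent)):
            return "ack-and-drop"
        # R2: reject segments outside the window (simplified test).
        if (seq - self.rcv_nxt) % MOD32 >= self.rcv_wnd:
            return "out-of-window"
        # R3: record the timestamp if SEG.SEQ <= Last.ACK.sent.
        if tsval is not None and mod_leq(seq, self.last_ack_sent):
            self.ts_recent = tsval
            self.ts_recent_valid = True
        # R4: in-sequence segment, accept and drain contiguous queued data.
        if seq == self.rcv_nxt:
            self.rcv_nxt = (seq + length) % MOD32
            while self.rcv_nxt in self.queued:
                step = self.queued.pop(self.rcv_nxt)
                self.rcv_nxt = (self.rcv_nxt + step) % MOD32
            self.last_ack_sent = self.rcv_nxt  # assumed immediate ACK
            return "accept"
        # R5: in-window but out of sequence, queue without re-checking later.
        self.queued[seq] = length
        return "queue"
```

The timestamp is examined only in R1, at arrival time; segments that
were queued under R5 are not checked again when the hole is filled.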

824	   It is important to note that the timestamp is checked only when a
825	   segment first arrives at the receiver, regardless of whether it is
826	   in-sequence or it must be queued for later delivery.

828	   Consider the following example.

830	      Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been
831	      sent, where the letter indicates the sequence number and the digit
832	      represents the timestamp.  Suppose also that segment B.1 has been
833	      lost.  The timestamp in TS.Recent is 1 (from A.1), so C.1, ...,
834	      Z.1 are considered acceptable and are queued.  When B is
835	      retransmitted as segment B.2 (using the latest timestamp), it
836	      fills the hole and causes all the segments through Z to be
837	      acknowledged and passed to the user.  The timestamps of the queued
838	      segments are *not* inspected again at this time, since they have
839	      already been accepted.  When B.2 is accepted, TS.Recent is set to
840	      2.

842	   This rule allows reasonable performance under loss.  A full window of
843	   data is in transit at all times, and after a loss a full window less
844	   one packet will show up out-of-sequence to be queued at the receiver
845	   (e.g., up to ~2^30 bytes of data); the timestamp option must not
846	   result in discarding this data.

848	   In certain unlikely circumstances, the algorithm of rules R1-R5 could
849	   lead to discarding some segments unnecessarily, as shown in the
850	   following example:

852	      Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
853	      sent in sequence and that segment B.1 has been lost.  Furthermore,
854	      suppose delivery of some of C.1, ...  Z.1 is delayed until AFTER
855	      the retransmission B.2 arrives at the receiver.  These delayed
856	      segments will be discarded unnecessarily when they do arrive,
857	      since their timestamps are now out of date.

859	   This case is very unlikely to occur.  If the retransmission was
860	   triggered by a timeout, some of the segments C.1, ...  Z.1 must have
861	   been delayed longer than the RTO time.  This is presumably an
862	   unlikely event, or there would be many spurious timeouts and
863	   retransmissions.  If B's retransmission was triggered by the "fast
864	   retransmit" algorithm, i.e., by duplicate ACKs, then the queued
865	   segments that caused these ACKs must have been received already.

867	   Even if a segment were delayed past the RTO, the Fast Retransmit
868	   mechanism [Jacobson90c] will cause the delayed packets to be
869	   retransmitted at the same time as B.2, avoiding an extra RTT and
870	   therefore causing a very small performance penalty.

872	   We know of no case with a significant probability of occurrence in
873	   which timestamps will cause performance degradation by unnecessarily
874	   discarding segments.

876	5.2.2.  Timestamp Clock

878	   It is important to understand that the PAWS algorithm does not
879	   require clock synchronization between sender and receiver.  The
880	   sender's timestamp clock is used to stamp the segments, and the
881	   sender uses the echoed timestamp to measure RTTs.  However, the
882	   receiver treats the timestamp as simply a monotonically increasing
883	   serial number, without any necessary connection to its clock.  From
884	   the receiver's viewpoint, the timestamp is acting as a logical
885	   extension of the high-order bits of the sequence number.

887	   The receiver algorithm does place some requirements on the frequency
888	   of the timestamp clock.

890	   (a)  The timestamp clock must not be "too slow".

892	        It MUST tick at least once for each 2^31 bytes sent.  In fact,
893	        in order to be useful to the sender for round trip timing, the
894	        clock SHOULD tick at least once per window's worth of data, and
895	        even with the window extension defined in Section 3.2, 2^31
896	        bytes must be at least two windows.

898	        To make this more quantitative, any clock faster than 1 tick/sec
899	        will reject old duplicate segments for link speeds of ~8 Gbps.

901	        A 1 ms timestamp clock will work at link speeds up to 8 Tbps
902	        (8*10^12 bps)!

904	   (b)  The timestamp clock must not be "too fast".

906	        The recycling time of the timestamp clock MUST be greater than
907	        MSL seconds.  Since the clock (timestamp) is 32 bits and the
908	        worst-case MSL is 255 seconds, the maximum acceptable clock
909	        frequency is one tick every 59 ns.

911	        However, it is desirable to establish a much longer recycle
912	        period, in order to handle outdated timestamps on idle
913	        connections (see Section 5.2.3), and to relax the MSL
914	        requirement for preventing sequence number wrap-around.  With a
915	        1 ms timestamp clock, the 32-bit timestamp will wrap its sign
916	        bit in 24.8 days.  Thus, it will reject old duplicates on the
917	        same connection if MSL is 24.8 days or less.  This appears to be
918	        a very safe figure; an MSL of 24.8 days or longer can probably
919	        be assumed in the Internet without requiring precise MSL
920	        enforcement.
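
The two figures in (b) follow from simple arithmetic, reproduced in
the sketch below.  The constant and function names are hypothetical;
the inputs (a 32-bit timestamp, a worst-case MSL of 255 seconds, and
a sign-bit wrap after 2^31 ticks) are taken from the text.

```python
MSL_WORST_CASE_S = 255  # worst-case MSL in seconds, as stated above

# Fastest acceptable clock: 2^32 ticks must take longer than the MSL.
min_tick_ns = MSL_WORST_CASE_S / 2**32 * 1e9


def sign_wrap_days(tick_seconds):
    """Days until a 32-bit timestamp wraps its sign bit (2^31 ticks)."""
    return 2**31 * tick_seconds / 86400


print(f"minimum tick period: {min_tick_ns:.1f} ns")
print(f"sign-bit wrap at 1 ms/tick: {sign_wrap_days(1e-3):.2f} days")
print(f"sign-bit wrap at 1 s/tick: {sign_wrap_days(1.0):.0f} days")
```

The results are about 59 ns per tick and just under 25 days for a
1 ms clock, in line with the values quoted above.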

922	   Based upon these considerations, we choose a timestamp clock
923	   frequency in the range 1 ms to 1 sec per tick.  This range also
924	   matches the requirements of the RTTM mechanism, which does not need
925	   much more resolution than the granularity of the retransmit timer,
926	   e.g., tens or hundreds of milliseconds.

928	   The PAWS mechanism also puts a strong monotonicity requirement on the
929	   sender's timestamp clock.  The method of implementation of the
930	   timestamp clock to meet this requirement depends upon the system
931	   hardware and software.

933	   o  Some hosts have a hardware clock that is guaranteed to be
934	      monotonic between hardware resets.

936	   o  A clock interrupt may be used to simply increment a binary integer
937	      by 1 periodically.

939	   o  The timestamp clock may be derived from a system clock that is
940	      subject to being abruptly changed, by adding a variable offset
941	      value.  This offset is initialized to zero.  When a new timestamp
942	      clock value is needed, the offset can be adjusted as necessary to
943	      make the new value equal to or larger than the previous value
944	      (which was saved for this purpose).
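
The variable-offset technique from the last bullet can be sketched as
follows.  The class name and the `read_system_ticks` callable are
hypothetical; the callable stands for whatever system clock the host
provides, already scaled to timestamp ticks.

```python
class MonotonicTimestampClock:
    """Derive a non-decreasing timestamp clock from an adjustable clock."""

    def __init__(self, read_system_ticks):
        self._read = read_system_ticks
        self._offset = 0    # initialized to zero, as described above
        self._last = None   # previous value, saved for comparison

    def now(self):
        value = self._read() + self._offset
        if self._last is not None and value < self._last:
            # The system clock was set back: grow the offset so the new
            # value equals the previous one instead of going backwards.
            self._offset += self._last - value
            value = self._last
        self._last = value
        # Only the low 32 bits are carried in the Timestamps option.
        return value & 0xFFFFFFFF
```

If the underlying clock is later set forward again, this sketch keeps
the accumulated offset, so the derived clock simply continues from
its adjusted position.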

946	5.2.3.  Outdated Timestamps

948	   If a connection remains idle long enough for the timestamp clock of
949	   the other TCP to wrap its sign bit, then the value saved in TS.Recent
950	   will become too old; as a result, the PAWS mechanism will cause all
951	   subsequent segments to be rejected, freezing the connection (until
952	   the timestamp clock wraps its sign bit again).

954	   With the chosen range of timestamp clock frequencies (1 ms to 1 sec),
955	   the time to wrap the sign bit will be between 24.8 days and 24800
956	   days.  A TCP connection that is idle for more than 24 days and then
957	   comes to life is exceedingly unusual.  However, it is undesirable in
958	   principle to place any limitation on TCP connection lifetimes.

960	   We therefore require that an implementation of PAWS include a
961	   mechanism to "invalidate" the TS.Recent value when a connection is
962	   idle for more than 24 days.  (An alternative solution to the problem
963	   of outdated timestamps would be to send keep-alive segments at a very
964	   low rate, but still more often than the wrap-around time for
965	   timestamps, e.g., once a day.  This would impose negligible overhead.
966	   However, the TCP specification has never included keep-alives, so the
967	   solution based upon invalidation was chosen.)

969	   Note that a TCP does not know the frequency, and therefore, the
970	   wraparound time, of the other TCP, so it must assume the worst.  The
971	   validity of TS.Recent needs to be checked only if the basic PAWS
972	   timestamp check fails, i.e., only if SEG.TSval < TS.Recent.  If
973	   TS.Recent is found to be invalid, then the segment is accepted,
974	   regardless of the failure of the timestamp check, and rule R3 updates
975	   TS.Recent with the TSval from the new segment.

977	   To detect how long the connection has been idle, the TCP MAY update a
978	   clock or timestamp value associated with the connection whenever
979	   TS.Recent is updated, for example.  The details will be
980	   implementation-dependent.
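
One possible way to track the idle time, given purely as a sketch:
the local time at which TS.Recent was last updated is stored next to
it, and validity is consulted only when the basic timestamp check
fails.  All names are hypothetical, and `now_s` is assumed to come
from a local clock in seconds that never runs backwards.

```python
TS_RECENT_MAX_IDLE_S = 24 * 24 * 60 * 60  # 24 days, in seconds


class TsRecent:
    def __init__(self, value, now_s):
        self.value = value
        self.updated_at = now_s

    def update(self, tsval, now_s):
        """Store a new TS.Recent and remember when it was stored."""
        self.value = tsval
        self.updated_at = now_s

    def is_valid(self, now_s):
        """TS.Recent is invalid once it is more than 24 days old."""
        return now_s - self.updated_at <= TS_RECENT_MAX_IDLE_S


def paws_rejects(seg_tsval, ts_recent, now_s):
    """True if the segment fails the timestamp check of rule R1."""
    older = 0 < ((ts_recent.value - seg_tsval) & 0xFFFFFFFF) < (1 << 31)
    # The age is examined only when SEG.TSval < TS.Recent.
    return older and ts_recent.is_valid(now_s)
```

A segment that fails the comparison against an invalidated TS.Recent
is therefore accepted, and its TSval can then be stored through the
normal update path.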

982	5.2.4.  Header Prediction

984	   "Header prediction" [Jacobson90a] is a high-performance transport
985	   protocol implementation technique that is most important for high-
986	   speed links.  This technique optimizes the code for the most common
987	   case, receiving a segment correctly and in order.  Using header
988	   prediction, the receiver asks the question, "Is this segment the next
989	   in sequence?"  This question can be answered in fewer machine
990	   instructions than the question, "Is this segment within the window?"

992	   Adding header prediction to our timestamp procedure leads to the
993	   following recommended sequence for processing an arriving TCP
994	   segment:

996	   H1)  Check timestamp (same as step R1 above)

998	   H2)  Do header prediction: if segment is next in sequence and if
999	        there are no special conditions requiring additional processing,
1000	        accept the segment, record its timestamp, and skip H3.

1002	   H3)  Process the segment normally, as specified in RFC 793.  This
1003	        includes dropping segments that are outside the window and
1004	        possibly sending acknowledgments, and queuing in-window, out-of-
1005	        sequence segments.

1007	   Another possibility would be to interchange steps H1 and H2, i.e., to
1008	   perform the header prediction step H2 FIRST, and perform H1 and H3
1009	   only when header prediction fails.  This could be a performance
1010	   improvement, since the timestamp check in step H1 is very unlikely to
1011	   fail, and it requires unsigned modulo arithmetic.  To perform this
1012	   check on every single segment is contrary to the philosophy of header
1013	   prediction.  We believe that this change might produce a measurable
1014	   reduction in CPU time for TCP protocol processing on high-speed
1015	   networks.

1017	   However, putting H2 first would create a hazard: a segment from 2^32
1018	   bytes in the past might arrive at exactly the wrong time and be
1019	   accepted mistakenly by the header-prediction step.  The following
1020	   reasoning has been introduced in [RFC1185] to show that the
1021	   probability of this failure is negligible.

1023	      If all segments are equally likely to show up as old duplicates,
1024	      then the probability of an old duplicate exactly matching the left
1025	      window edge is the maximum segment size (MSS) divided by the size
1026	      of the sequence space.  This ratio must be less than 2^-16, since
1027	      MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20
1028	      for an FDDI link.  However, the older a segment is, the less likely
1029	      it is to be retained in the Internet, and under any reasonable
1030	      model of segment lifetime the probability of an old duplicate
1031	      exactly at the left window edge must be much smaller than 2^-16.

1033	      The 16 bit TCP checksum also allows a basic unreliability of one
1034	      part in 2^16.  A protocol mechanism whose reliability exceeds the
1035	      reliability of the TCP checksum should be considered "good
1036	      enough", i.e., it won't contribute significantly to the overall
1037	      error rate.  We therefore believe we can ignore the problem of an
1038	      old duplicate being accepted by doing header prediction before
1039	      checking the timestamp.

1041	   However, this probabilistic argument is not universally accepted, and
1042	   the consensus at present is that the performance gain does not
1043	   justify the hazard in the general case.  It is therefore recommended
1044	   that H2 follow H1.

1046	5.2.5.  IP Fragmentation

1048	   At high data rates, the protection against old packets provided by
1049	   PAWS can be circumvented by errors in IP fragment reassembly (see
1050	   [RFC4963]).  The only way to protect against incorrect IP fragment
1051	   reassembly is to not allow the packets to be fragmented.  This is
1052	   done by setting the Don't Fragment (DF) bit in the IP header.
1053	   Setting the DF bit implies the use of Path MTU Discovery as described
1054	   in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation
1055	   that implements PAWS MUST also implement Path MTU Discovery.

1057	5.3.  Duplicates from Earlier Incarnations of Connection

1059	   The PAWS mechanism protects against errors due to sequence number
1060	   wrap-around on high-speed connections.  Segments from an earlier
1061	   incarnation of the same connection are also a potential cause of old
1062	   duplicate errors.  In both cases, the TCP mechanisms to prevent such
1063	   errors depend upon the enforcement of a maximum segment lifetime
1064	   (MSL) by the Internet (IP) layer (see the Appendix of [RFC1185] for a
1065	   detailed discussion).  Unlike the case of sequence space wrap-around,
1066	   the MSL required to prevent old duplicate errors from earlier
1067	   incarnations does not depend upon the transfer rate.  If the IP layer
1068	   enforces the recommended 2 minute MSL of TCP, and if the TCP rules
1069	   are followed, TCP connections will be safe from earlier incarnations,
1070	   no matter how high the network speed.  Thus, the PAWS mechanism is
1071	   not required for this case.

1073	   We may still ask whether the PAWS mechanism can provide additional
1074	   security against old duplicates from earlier connections, allowing us
1075	   to relax the enforcement of MSL by the IP layer.  Appendix G explores
1076	   this question, showing that further assumptions and/or mechanisms are
1077	   required, beyond those of PAWS.  This is not part of the current
1078	   extension.

1080	6.  Conclusions and Acknowledgements

1082	   This memo presented a set of extensions to TCP to provide efficient
1083	   operation over large-bandwidth*delay-product paths and reliable
1084	   operation over very high-speed paths.  These extensions are designed
1085	   to provide compatible interworking with TCP's that do not implement
1086	   the extensions.

1088	   These mechanisms are implemented using new TCP options for scaled
1089	   windows and timestamps.  The timestamps are used for two distinct
1090	   mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protection
1091	   Against Wrapped Sequences).

1093	   The Window Scale option was originally suggested by Mike St. Johns of
1094	   USAF/DCA.  The present form of the option was suggested by Mike
1095	   Karels of UC Berkeley in response to a more cumbersome scheme defined
1096	   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
1097	   description in [RFC1185].

1099	   Finally, much of this work originated as the result of discussions
1100	   within the End-to-End Task Force on the theoretical limitations of
1101	   transport protocols in general and TCP in particular.  Task force
1102	   members and others on the end2end-interest list have made valuable
1103	   contributions by pointing out flaws in the algorithms and the
1104	   documentation.  Continued discussion and development since the
1105	   publication of [RFC1323] originally occurred in the IETF TCP Large
1106	   Windows Working Group, later on in the End-to-End Task Force, and
1107	   most recently in the IETF TCP Maintenance Working Group.  The authors
1108	   are grateful for all these contributions.

1110	7.  Security Considerations

1112	   The TCP sequence space is a fixed size, and as the window becomes
1113	   larger it becomes easier for an attacker to generate forged packets
1114	   that can fall within the TCP window, and be accepted as valid
1115	   packets.  While use of Timestamps and PAWS can help to mitigate this,
1116	   PAWS adds a risk of its own: if an attacker forges a packet that is
1117	   acceptable to the TCP connection and carries a timestamp in the
1118	   future, valid packets will be dropped by the PAWS checks.  Hence,
1119	   implementers should take care to not open the TCP window drastically
1120	   beyond the requirements of the connection.

1122	   Middle boxes and options: If a middle box removes TCP options from
1123	   the SYN, such as TSopt, a high speed connection that needs PAWS would
1124	   not have that protection.  In this situation, an implementer could
1125	   provide a mechanism for the application to determine whether or not
1126	   PAWS is in use on the connection, and choose to terminate the
1127	   connection if that protection doesn't exist.

1129	   Mechanisms to protect the TCP header from modification should also
1130	   protect the TCP options.

1132	   A naive implementation that derives the timestamp clock value
1133	   directly from a system uptime clock may unintentionally leak this
1134	   information to an attacker.  This does not directly compromise any of
1135	   the mechanisms described in this document.  However, this may be
1136	   valuable information to a potential attacker.  An implementer should
1137	   evaluate the potential impact and mitigate this accordingly (e.g., by
1138	   using a random offset for the timestamp clock on each connection, or
1139	   using an external, real-time derived timestamp clock source).
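
The random-offset mitigation might be sketched as below.  The class
name and the `read_ticks` callable are hypothetical; `read_ticks`
stands for the host's tick source, for example one derived from
system uptime.

```python
import secrets


class ConnectionTimestampClock:
    """Per-connection timestamp clock hidden behind a random offset."""

    def __init__(self, read_ticks):
        self._read = read_ticks
        # Chosen once per connection so that TSval does not expose the
        # raw value of the underlying clock.
        self._offset = secrets.randbits(32)

    def tsval(self):
        """Timestamp to place in TSval, in modular 32-bit space."""
        return (self._read() + self._offset) & 0xFFFFFFFF
```

Since the offset is constant for the lifetime of a connection,
differences between successive TSval values are unchanged, so RTT
measurement and the PAWS comparison on that connection behave as
they would without the offset.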

1141	   Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
1142	   [RFC2675] to be used when the local network supports packets larger
1143	   than 64K. When larger TCP packets are used, the TCP checksum becomes
1144	   weaker.

1146	8.  IANA Considerations

1148	   This document has no actions for IANA.

1150	9.  References

1152	9.1.  Normative References

1154	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1155	              RFC 793, September 1981.

1157	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1158	              November 1990.

1160	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1161	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1163	9.2.  Informative References

1165	   [Garlick77]
1166	              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
1167	              Host-to-Host Protocols", Proc. Second Berkeley Workshop on
1168	              Distributed Data Management and Computer Networks,
1169	              May 1977.

1171	   [Hamming77]
1172	              Hamming, R., "Digital Filters", Prentice Hall, Englewood
1173	              Cliffs, N.J. ISBN 0-13-212571-4, 1977.

1175	   [Jacobson88a]
1176	              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
1177	              '88, Stanford, CA, August 1988.

1180	   [Jacobson90a]
1181	              Jacobson, V., "4BSD Header Prediction", ACM Computer
1182	              Communication Review, April 1990.

1184	   [Jacobson90c]
1185	              Jacobson, V., "Modified TCP congestion avoidance
1186	              algorithm", Message to the end2end-interest mailing list,
1187	              April 1990.

1190	   [Jain86]   Jain, R., "Divergence of Timeout Algorithms for Packet
1191	              Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
1192	              Comm., Scottsdale, Arizona, March 1986.

1195	   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
1196	              Reliable Transport Protocols", Proc. SIGCOMM '87,
1197	              August 1987.

1199	   [Martin03]
1200	              Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
1201	              mailing list, September 2003.

1204	   [Mathis08]
1205	              Mathis, M., "[tcpm] Example of 1323 window retraction
1206	              problem", Message to the tcpm mailing list, March 2008.

1210	   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
1211	              RFC 896, January 1984.

1213	   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
1214	              paths", RFC 1072, October 1988.

1216	   [RFC1110]  McKenzie, A., "Problem with the TCP big window option",
1217	              RFC 1110, August 1989.

1219	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1220	              Communication Layers", STD 3, RFC 1122, October 1989.

1222	   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
1223	              High-Speed Paths", RFC 1185, October 1990.

1225	   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
1226	              for High Performance", RFC 1323, May 1992.

1228	   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
1229	              for IP version 6", RFC 1981, August 1996.

1231	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
1232	              Selective Acknowledgment Options", RFC 2018, October 1996.

1234	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
1235	              Control", RFC 2581, April 1999.

1237	   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
1238	              RFC 2675, August 1999.

1240	   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
1241	              Extension to the Selective Acknowledgement (SACK) Option
1242	              for TCP", RFC 2883, July 2000.

1244	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
1245	              Discovery", RFC 4821, March 2007.

1247	   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
1248	              Errors at High Data Rates", RFC 4963, July 2007.

1250	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1251	              Control", RFC 5681, September 2009.

1253	   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
1254	              and Y. Nishida, "A Conservative Loss Recovery Algorithm
1255	              Based on Selective Acknowledgment (SACK) for TCP",
1256	              RFC 6675, August 2012.

1258	   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
1259	              RFC 6691, July 2012.

1261	   [Watson81]
1262	              Watson, R., "Timer-based Mechanisms in Reliable Transport
1263	              Protocol Connection Management", Computer Networks, Vol.
1264	              5, 1981.

1266	   [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
1267	              '86, Stowe, VT, August 1986.

1269	Appendix A.  Implementation Suggestions

1271	   TCP Option Layout

1273	      The following layouts are recommended for sending options on non-
1274	      SYN segments, to achieve maximum feasible alignment of 32-bit and
1275	      64-bit machines.

1277	                   +--------+--------+--------+--------+
1278	                   |   NOP  |  NOP   |  TSopt |   10   |
1279	                   +--------+--------+--------+--------+
1280	                   |          TSval timestamp          |
1281	                   +--------+--------+--------+--------+
1282	                   |          TSecr timestamp          |
1283	                   +--------+--------+--------+--------+
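
      A minimal sketch of building this 12-byte layout is shown below.
      The option kind values (NOP = 1, Timestamps = 8) are the assigned
      TCP option numbers; the constant and function names are
      illustrative only.

```python
import struct

TCPOPT_NOP = 1           # No-Operation option kind
TCPOPT_TIMESTAMP = 8     # Timestamps option kind (TSopt)
TCPOLEN_TIMESTAMP = 10   # TSopt length: kind + length + TSval + TSecr


def pack_tsopt(tsval: int, tsecr: int) -> bytes:
    """Return NOP, NOP, TSopt, 10, TSval, TSecr in network byte order."""
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP,
        TCPOPT_NOP,
        TCPOPT_TIMESTAMP,
        TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF,
        tsecr & 0xFFFFFFFF,
    )
```

      The two leading NOPs pad the 10-byte option to 12 bytes, so that
      TSval and TSecr each start on a 4-byte boundary within the options
      area.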

1285	   Interaction with the TCP Urgent Pointer

1287	      The TCP Urgent pointer, like the TCP window, is a 16 bit value.
1288	      Some of the original discussion for the TCP Window Scale option
1289	      included proposals to increase the Urgent pointer to 32 bits.  As
1290	      it turns out, this is unnecessary.  There are two observations
1291	      that should be made:

1293	      (1)  With IP Version 4, the largest amount of TCP data that can be
1294	           sent in a single packet is 65495 bytes (64K - 1 - size of
1295	           fixed IP and TCP headers).

1297	      (2)  Updates to the urgent pointer while the user is in "urgent
1298	           mode" are invisible to the user.

1300	      This means that if the Urgent Pointer points beyond the end of the
1301	      TCP data in the current packet, then the user will remain in
1302	      urgent mode until the next TCP packet arrives.  That packet will
1303	      update the urgent pointer to a new offset, and the user will never
1304	      have left urgent mode.

1306	      Thus, to properly implement the Urgent Pointer, the sending TCP
1307	      only has to check for overflow of the 16 bit Urgent Pointer field
1308	      before filling it in.  If it does overflow, then a value of 65535
1309	      should be inserted into the Urgent Pointer.
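
      The overflow check amounts to a clamp, sketched below together
      with the arithmetic behind the 65495-byte figure.  The names are
      hypothetical, and the 20-byte header sizes assume IPv4 and TCP
      headers without options.

```python
MAX_URGENT_POINTER = 0xFFFF      # largest value of the 16-bit field
IPV4_MAX_PACKET = 65535          # 64K - 1
FIXED_IPV4_HEADER = 20           # bytes, no IP options (assumption)
FIXED_TCP_HEADER = 20            # bytes, no TCP options (assumption)

# Largest amount of TCP data in a single IPv4 packet.
MAX_IPV4_TCP_DATA = IPV4_MAX_PACKET - FIXED_IPV4_HEADER - FIXED_TCP_HEADER


def urgent_pointer_field(urgent_offset: int) -> int:
    """Value to place in the Urgent Pointer field for a given offset."""
    return min(urgent_offset, MAX_URGENT_POINTER)
```

      Because a single IPv4 packet carries at most 65495 bytes of data,
      a clamped value of 65535 always points beyond the data in the
      current packet, which keeps the receiver in urgent mode.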

1311	      The same technique applies to IP Version 6, except in the case of
1312	      IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
1313	      requires additional steps for dealing with the Urgent Pointer;
1314	      these are described in Section 5.2 of [RFC2675].

1316	Appendix B.  Duplicates from Earlier Connection Incarnations

1318	   There are two cases to be considered: (1) a system crashing (and
1319	   losing connection state) and restarting, and (2) the same connection
1320	   being closed and reopened without a loss of host state.  These will
1321	   be described in the following two sections.

1323	B.1.  System Crash with Loss of State

1325	   TCP's quiet time of one MSL upon system startup handles the loss of
1326	   connection state in a system crash/restart.  For an explanation, see
1327	   for example "When to Keep Quiet" in the TCP protocol specification
1328	   [RFC0793].  The MSL that is required here does not depend upon the
1329	   transfer speed.  The current TCP MSL of 2 minutes seemed acceptable
1330	   as an operational compromise, when many host systems used to take
1331	   this long to boot after a crash.  Current host systems can boot
1332	   considerably faster.

1334	   The timestamp option may be used to ease the MSL requirements (or to
1335	   provide additional security against data corruption).  If timestamps
1336	   are being used and if the timestamp clock can be guaranteed to be
1337	   monotonic over a system crash/restart, i.e., if the first value of
1338	   the sender's timestamp clock after a crash/restart can be guaranteed
1339	   to be greater than the last value before the restart, then a quiet
1340	   time is unnecessary.

1342	   To dispense totally with the quiet time would require that the host
1343	   clock be synchronized to a time source that is stable over the crash/
1344	   restart period, with an accuracy of one timestamp clock tick or
1345	   better.  We can back off from this strict requirement to take
1346	   advantage of approximate clock synchronization.  Suppose that the
1347	   clock is always re-synchronized to within N timestamp clock ticks and
1348	   that booting (extended with a quiet time, if necessary) takes more
1349	   than N ticks.  This will guarantee monotonicity of the timestamps,
1350	   which can then be used to reject old duplicates even without an
1351	   enforced MSL.

1353	B.2.  Closing and Reopening a Connection

1355	   When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
1356	   ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
1357	   Applications built upon TCP that close one connection and open a new
1358	   one (e.g., an FTP data transfer connection using Stream mode) must
1359	   choose a new socket pair each time.  The TIME-WAIT delay serves two
1360	   different purposes:

1362	   (a)  Implement the full-duplex reliable close handshake of TCP.

1364	        The proper time to delay the final close step is not really
1365	        related to the MSL; it depends instead upon the RTO for the FIN
1366	        segments and therefore upon the RTT of the path.  (It could be
1367	        argued that the side that is sending a FIN knows what degree of
1368	        reliability it needs, and therefore it should be able to
1369	        determine the length of the TIME-WAIT delay for the FIN's
1370	        recipient.  This could be accomplished with an appropriate TCP
1371	        option in FIN segments.)

1373	        Although there is no formal upper-bound on RTT, common network
1374	        engineering practice makes an RTT greater than 1 minute very
1375	        unlikely.  Thus, the 4 minute delay in TIME-WAIT state works
1376	        satisfactorily to provide a reliable full-duplex TCP close.
1377	        Note again that this is independent of MSL enforcement and
1378	        network speed.

1380	        The TIME-WAIT state could cause an indirect performance problem
1381	        if an application needed to repeatedly close one connection and
1382	        open another at a very high frequency, since the number of
1383	        available TCP ports on a host is less than 2^16.  However, high
1384	        network speeds are not the major contributor to this problem;
1385	        the RTT is the limiting factor in how quickly connections can be
1386	        opened and closed.  Therefore, this problem will be no worse at
1387	        high transfer speeds.

1389	   (b)  Allow old duplicate segments to expire.

1391	        To replace this function of TIME-WAIT state, a mechanism would
1392	        have to operate across connections.  PAWS is defined strictly
1393	        within a single connection; the last timestamp (TS.Recent) is
1394	        kept in the connection control block, and discarded when a
1395	        connection is closed.

1397	        An additional mechanism could be added to the TCP: a per-host
1398	        cache of the last timestamp received from any connection.  This
1399	        value could then be used in the PAWS mechanism to reject old
1400	        duplicate segments from earlier incarnations of the connection,
1401	        if the timestamp clock can be guaranteed to have ticked at least
1402	        once since the old connection was open.  This would require that
1403	        the TIME-WAIT delay plus the RTT together must be at least one
1404	        tick of the sender's timestamp clock.  Such an extension is not
1405	        part of the proposal of this RFC.

1407	        Note that this is a variant on the mechanism proposed by
1408	        Garlick, Rom, and Postel [Garlick77], which required each host
1409	        to maintain connection records containing the highest sequence
1410	        numbers on every connection.  Using timestamps instead, it is
1411	        only necessary to keep one quantity per remote host, regardless
1412	        of the number of simultaneous connections to that host.

1414	Appendix C.  Summary of Notation

1416	   The following notation has been used in this document.

1418	   Options

1420	      WSopt:            TCP Window Scale Option
1421	      TSopt:            TCP Timestamps Option

1423	   Option Fields

1425	      shift.cnt:        Window scale byte in WSopt
1426	      TSval:            32-bit Timestamp Value field in TSopt
1427	      TSecr:            32-bit Timestamp Reply field in TSopt

1429	   Option Fields in Current Segment

1431	      SEG.TSval:        TSval field from TSopt in current segment
1432	      SEG.TSecr:        TSecr field from TSopt in current segment
1433	      SEG.WSopt:        8-bit value in WSopt

1435	   Clock Values

1437	      my.TSclock:       System wide source of 32-bit timestamp values
1438	      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
1439	      Snd.TSoffset:     An offset for randomizing Snd.TSclock
1440	      Snd.TSclock:      my.TSclock + Snd.TSoffset

1442	   Per-Connection State Variables

1444	      TS.Recent:        Latest received Timestamp
1445	      Last.ACK.sent:    Last ACK field sent
1446	      Snd.TS.OK:        1-bit flag
1447	      Snd.WS.OK:        1-bit flag
1448	      Rcv.Wind.Scale:   Receive window scale power
1449	      Snd.Wind.Scale:   Send window scale power
1450	      Start.Time:       Snd.TSclock value when segment being timed was
1451	                        sent (used by pre-1323 code).

1453	   Procedure

1455	      Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
1456	                        variance estimates, using the rules of
1457	                        [Jacobson88a], given m, a new RTT measurement
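
   For orientation, one possible shape of Update_SRTT(m) is sketched
   below.  It uses the commonly cited gains of 1/8 for the smoothed RTT
   and 1/4 for the variance, with the initialization later written down
   in RFC 6298; these constants, the class name and the omission of
   clock granularity and minimum-RTO handling are assumptions of the
   sketch rather than requirements of this document.

```python
class RttEstimator:
    """Smoothed RTT and RTT variance in the style of [Jacobson88a]."""

    def __init__(self) -> None:
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # smoothed mean deviation

    def update_srtt(self, m: float) -> None:
        """Fold a new RTT measurement m into the estimates."""
        if self.srtt is None:
            # First measurement initializes both estimates.
            self.srtt = m
            self.rttvar = m / 2
        else:
            # Update the variance first, using the old smoothed RTT.
            self.rttvar += (abs(self.srtt - m) - self.rttvar) / 4
            self.srtt += (m - self.srtt) / 8

    def rto(self) -> float:
        """Retransmission timeout derived from the estimates."""
        return self.srtt + 4 * self.rttvar
```

   The measurement m is in whatever unit the caller uses consistently,
   for example seconds.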

1459	Appendix D.  Pseudo-code Summary

1461	   Create new TCB => {
1462	       Rcv.wind.scale =
1463	             MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );

1465	       Snd.wind.scale = 0;
1466	       Last.ACK.sent = 0;
1467	       Snd.TS.OK = Snd.WS.OK = FALSE;
1468	       Snd.TSoffset = random 32 bit value
1469	   }
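
   The Rcv.wind.scale expression above can be checked with a short
   Python sketch.  The function name is hypothetical; floor(log2(n)) is
   computed as n.bit_length() - 1, which is exact for positive integers.

```python
MAX_WSCALE = 14  # largest shift count allowed for the Window Scale option


def rcv_wind_scale(receive_buffer_space: int) -> int:
    """MIN(14, MAX(0, floor(log2(receive buffer space)) - 15))."""
    if receive_buffer_space < 1:
        return 0  # log2 is undefined here; no scaling (sketch assumption)
    floor_log2 = receive_buffer_space.bit_length() - 1
    return min(MAX_WSCALE, max(0, floor_log2 - 15))
```

   A buffer of 65535 bytes or less yields a shift of 0, a 1 MiB buffer
   yields 5, and very large buffers are capped at 14.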

1471	   Send initial <SYN> segment => {
1472	       SEG.WND = MIN( RCV.WND, 65535 );
1473	       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
1474	       Include in segment: WSopt = Rcv.wind.scale;
1475	   }

1477	   Send <SYN,ACK> segment => {
1478	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1479	       SEG.WND = MIN( RCV.WND, 65535 );
1480	       if (Snd.TS.OK) then
1481	             Include in segment:
1482	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1483	       if (Snd.WS.OK) then
1484	             Include in segment: WSopt = Rcv.wind.scale;
1485	   }

1487	   Receive <SYN> or <SYN,ACK> segment => {
1488	       if (Segment contains TSopt) then {
1489	             TS.Recent = SEG.TSval;
1490	             Snd.TS.OK = TRUE;
1491	             if (is <SYN,ACK> segment) then
1492	                   Update_SRTT(
1493	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1494	       }
1495	       if (Segment contains WSopt) then {
1496	             Snd.wind.scale = SEG.WSopt;
1497	             Snd.WS.OK = TRUE;
1498	             if (the ACK bit is not set, and Rcv.wind.scale has not been
1499	               initialized by the user) then
1500	                   Rcv.wind.scale = Snd.wind.scale;
1501	       }
1502	       else
1503	             Rcv.wind.scale = Snd.wind.scale = 0;
1504	   }

1506	   Send non-SYN segment => {
1507	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1508	       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
1509	       if (Snd.TS.OK) then
1510	             Include in segment:
1511	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1512	   }
1513	   Receive non-SYN segment in (state >= ESTABLISHED) => {
1514	       Window = (SEG.WND << Snd.wind.scale);
1515	             /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
1516	              * in rest of processing.
1517	              */
1518	       if (Segment contains TSopt) then {
1519	             if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
1520	                   if (Snd.TS.OK AND (NOT RST) ) then {
1521	                               /* Timestamp too old =>
1522	                                *    segment is unacceptable.
1523	                                */
1524	                         Send ACK segment;
1525	                         Discard segment and return;
1526	                   }
1527	             }
1528	             else {
1529	                   if (SEG.SEQ <= Last.ACK.sent) then
1530	                               TS.Recent = SEG.TSval;
1531	             }
1532	       }
1533	       if (SEG.ACK > SND.UNA) then {
1534	                    /* (At least part of) first segment in
1535	                     * retransmission queue has been ACKed
1536	                     */
1537	             if (Segment contains TSopt) then
1538	                   Update_SRTT(
1539	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1540	             else
1541	                   Update_SRTT( /* for compatibility */
1542	                          (Snd.TSclock - Start.Time)/my.TSclock.rate);
1543	       }
1544	   }
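
   The timestamp test in the receive path above can be sketched in
   Python as follows.  Timestamps and sequence numbers are compared in
   32-bit modular arithmetic.  The function names, the tuple return
   value and the caller-supplied idle flag are assumptions of this
   sketch; how an implementation tracks idle time is not shown.

```python
MASK32 = 0xFFFFFFFF
HALF32 = 0x80000000


def ts_before(t: int, s: int) -> bool:
    """True if timestamp t is older than s: 0 < (s - t) < 2**31 mod 2**32."""
    return 0 < ((s - t) & MASK32) < HALF32


def seq_leq(a: int, b: int) -> bool:
    """True if sequence number a <= b in modular arithmetic."""
    return ((b - a) & MASK32) < HALF32


def paws_check(seg_tsval, seg_seq, is_rst, ts_recent, last_ack_sent,
               snd_ts_ok, idle_less_than_24_days=True):
    """Return (acceptable, new_ts_recent) for a segment carrying TSopt."""
    if ts_before(seg_tsval, ts_recent) and idle_less_than_24_days:
        if snd_ts_ok and not is_rst:
            # Timestamp too old: caller sends an ACK and drops the segment.
            return False, ts_recent
        return True, ts_recent
    if seq_leq(seg_seq, last_ack_sent):
        # Segment covers Last.ACK.sent: remember its timestamp.
        return True, seg_tsval
    return True, ts_recent
```

   A segment with an old timestamp is still accepted when it carries
   RST, matching the exclusion of RST segments from the check.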

1546	Appendix E.  Event Processing Summary

1548	   OPEN Call

1550	      ...

1552	      An initial send sequence number (ISS) is selected.  Send a SYN
1553	      segment of the form:

1555	        

1557	      ...

1559	   SEND Call
1560	      CLOSED STATE (i.e., TCB does not exist)

1562	         ...

1564	      LISTEN STATE

1566	         If the foreign socket is specified, then change the connection
1567	         from passive to active, select an ISS.  Send a SYN segment
1568	         containing the options:  and
1569	         .  Set SND.UNA to ISS, SND.NXT to ISS+1.
1570	         Enter SYN-SENT state. ...

1572	      SYN-SENT STATE
1573	      SYN-RECEIVED STATE

1575	         ...

1577	      ESTABLISHED STATE
1578	      CLOSE-WAIT STATE

1580	         Segmentize the buffer and send it with a piggybacked
1581	         acknowledgment (acknowledgment value = RCV.NXT). ...

1583	         If the urgent flag is set ...

1585	         If the Snd.TS.OK flag is set, then include the TCP Timestamps
1586	         option  in each data
1587	         segment.

1589	         Scale the receive window for transmission in the segment
1590	         header:

1592	                   SEG.WND = (RCV.WND >> Rcv.Wind.Scale).
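
         As a sketch of the two directions of window scaling, the
         sender of a segment shifts its receive window right and the
         receiver of that segment shifts the 16-bit field left again.
         The function names are hypothetical; the clamp to 65535
         follows the pseudo-code in Appendix D.

```python
def scale_for_header(rcv_wnd: int, rcv_wind_scale: int) -> int:
    """SEG.WND as placed in the header of a non-SYN segment."""
    return min(rcv_wnd >> rcv_wind_scale, 0xFFFF)


def true_window(seg_wnd: int, snd_wind_scale: int) -> int:
    """Window recovered by the peer from the 16-bit SEG.WND field."""
    return seg_wnd << snd_wind_scale
```

         The right shift discards the low-order bits, so the recovered
         window is the original value rounded down to a multiple of
         2^shift.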

1594	   SEGMENT ARRIVES

1596	      ...

1598	      If the state is LISTEN then

1600	         first check for an RST

1602	            ...

1604	         second check for an ACK

1606	            ...

1608	         third check for a SYN

1610	            if the SYN bit is set, check the security.  If the ...

1612	               ...

1614	            if the SEG.PRC is less than the TCB.PRC then continue.

1616	            Check for a Window Scale option (WSopt); if one is found,
1617	            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
1618	            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
1619	            zero and clear Snd.WS.OK flag.

1621	            Check for a TSopt option; if one is found, save SEG.TSval in
1622	            the variable TS.Recent and turn on the Snd.TS.OK bit.

1624	            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
1625	            other control or text should be queued for processing later.
1626	            ISS should be selected and a SYN segment sent of the form:

1628	                    

1630	            If the Snd.WS.OK bit is on, include a WSopt option
1631	             in this segment.  If the Snd.TS.OK
1632	            bit is on, include a TSopt
1633	             in this segment.
1634	            Last.ACK.sent is set to RCV.NXT.

1636	            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
1637	            state should be changed to SYN-RECEIVED.  Note that any
1638	            other incoming control or data (combined with SYN) will be
1639	            processed in the SYN-RECEIVED state, but processing of SYN
1640	            and ACK should not be repeated.  If the listen was not fully
1641	            specified (i.e., the foreign socket was not fully
1642	            specified), then the unspecified fields should be filled in
1643	            now.

1645	         fourth other text or control

1647	            ...

1649	      If the state is SYN-SENT then

1651	         first check the ACK bit

1653	            ...

1655	         ...

1657	         fourth check the SYN bit

1659	            ...

1661	            If the SYN bit is on and the security/compartment and
1662	            precedence are acceptable then, RCV.NXT is set to SEG.SEQ+1,
1663	            IRS is set to SEG.SEQ, and any acknowledgements on the
1664	            retransmission queue which are thereby acknowledged should
1665	            be removed.

1667	            Check for a Window Scale option (WSopt); if it is found,
1668	            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
1669	            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

1671	            Check for a TSopt option; if one is found, save SEG.TSval in
1672	            variable TS.Recent and turn on the Snd.TS.OK bit in the
1673	            connection control block.  If the ACK bit is set, use
1674	            Snd.TSclock - SEG.TSecr as the initial RTT estimate.

1676	            If SND.UNA > ISS (our SYN has been ACKed), change the
1677	            connection state to ESTABLISHED, form an ACK segment:

1679	                    

1681	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1682	            option  in this ACK
1683	            segment.  Last.ACK.sent is set to RCV.NXT.

1685	            Data or controls which were queued for transmission may be
1686	            included.  If there are other controls or text in the
1687	            segment then continue processing at the sixth step below
1688	            where the URG bit is checked, otherwise return.

1690	            Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

1692	                    

1694	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1695	            option  in this segment.
1696	            If the Snd.WS.OK bit is on, include a WSopt option
1697	             in this segment.  Last.ACK.sent is
1698	            set to RCV.NXT.

1700	            If there are other controls or text in the segment, queue
1701	            them for processing after the ESTABLISHED state has been
1702	            reached, return.

1704	         fifth, if neither of the SYN or RST bits is set then drop the
1705	         segment and return.

1707	      Otherwise,

1709	      First, check sequence number

1711	         SYN-RECEIVED STATE
1712	         ESTABLISHED STATE
1713	         FIN-WAIT-1 STATE
1714	         FIN-WAIT-2 STATE
1715	         CLOSE-WAIT STATE
1716	         CLOSING STATE
1717	         LAST-ACK STATE
1718	         TIME-WAIT STATE

1720	            Segments are processed in sequence.  Initial tests on
1721	            arrival are used to discard old duplicates, but further
1722	            processing is done in SEG.SEQ order.  If a segment's
1723	            contents straddle the boundary between old and new, only the
1724	            new parts should be processed.

1726	            Rescale the received window field:

1728	                  TrueWindow = SEG.WND << Snd.Wind.Scale,

1730	            and use "TrueWindow" in place of SEG.WND in the following
1731	            steps.

1733	            Check whether the segment contains a Timestamps option and
1734	            bit Snd.TS.OK is on.  If so:

1736	               If SEG.TSval < TS.Recent and the RST bit is off, then
1737	               test whether connection has been idle less than 24 days;
1738	               if all are true, then the segment is not acceptable;
1739	               follow steps below for an unacceptable segment.

1741	               If SEG.SEQ is less than or equal to Last.ACK.sent, then
1742	               save SEG.TSval in variable TS.Recent.

1744	            There are four cases for the acceptability test for an
1745	            incoming segment:

1747	               ...

1749	            If an incoming segment is not acceptable, an acknowledgment
1750	            should be sent in reply (unless the RST bit is set, if so
1751	            drop the segment and return):

1753	                    

1755	            Last.ACK.sent is set to SEG.ACK of the acknowledgment.  If
1756	            the Snd.TS.OK bit is on, include the Timestamps option
1757	             in this ACK segment.
1758	            Set Last.ACK.sent to SEG.ACK and send the ACK segment.
1759	            After sending the acknowledgment, drop the unacceptable
1760	            segment and return.

1762	      ...

1764	      fifth check the ACK field.

1766	         if the ACK bit is off drop the segment and return.

1768	         if the ACK bit is on

1770	            ...

1772	            ESTABLISHED STATE

1774	               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
1775	               SEG.ACK.  Also compute a new estimate of round-trip time.
1776	               If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
1777	               otherwise use the elapsed time since the first segment in
1778	               the retransmission queue was sent.  Any segments on the
1779	               retransmission queue which are thereby entirely
1780	               acknowledged...

1782	      ...

1784	      Seventh, process the segment text.

1786	         ESTABLISHED STATE
1787	         FIN-WAIT-1 STATE
1788	         FIN-WAIT-2 STATE

1790	            ...

1792	            Send an acknowledgment of the form:

1794	                    

1796	            If the Snd.TS.OK bit is on, include Timestamps option
1797	             in this ACK segment.
1798	            Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
1799	            it.  This acknowledgment should be piggy-backed on a segment
1800	            being transmitted if possible without incurring undue delay.

1802	            ...

1804	Appendix F.  Timestamps Edge Cases

1806	   While the rules laid out for when to calculate RTTM produce the
1807	   correct results most of the time, there are some edge cases where an
1808	   incorrect RTTM can be calculated.  All of these situations involve
1809	   the loss of packets.  It is felt that these scenarios are rare, and
1810	   that if they should happen, they will cause a single RTTM measurement
1811	   to be inflated, which mitigates its effects on RTO calculations.

1813	   [Martin03] cites two similar cases when the returning ACK is lost,
1814	   and before the retransmission timer fires, another returning packet
1815	   arrives, which ACKs the data.  In this case, the RTTM calculated will
1816	   be inflated:

1818	           clock
1819	             tc=1    ------------------->

1821	             tc=2   (lost) <---- 
1822	                 (RTTM would have been 1)

1824	                    (receive window opens, window update is sent)
1825	             tc=5        <---- 
1826	                    (RTTM is calculated at 4)

1828	   One thing to note about this situation is that it is somewhat bounded
1829	   by RTO + RTT, limiting how far off the RTTM calculation will be.
1830	   While more complex scenarios can be constructed that produce larger
1831	   inflations (e.g., retransmissions are lost), those scenarios involve
1832	   multiple packet losses, and the connection will have other more
1833	   serious operational problems than using an inflated RTTM in the RTO
1834	   calculation.
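
   The inflation in the example above follows directly from how the
   measurement is formed.  In the sketch below, the function name is
   hypothetical and the result is expressed in timestamp clock ticks.
   Both ACKs in the example are assumed to echo the timestamp sent at
   tc=1.

```python
MASK32 = 0xFFFFFFFF


def rttm_ticks(snd_tsclock: int, seg_tsecr: int) -> int:
    """RTT measurement in ticks: Snd.TSclock - SEG.TSecr, modulo 2**32."""
    return (snd_tsclock - seg_tsecr) & MASK32


# The ACK lost at tc=2 would have produced a measurement of 1 tick.
rttm_if_ack_arrived = rttm_ticks(2, 1)

# The window update arriving at tc=5 echoes the same timestamp and
# produces a measurement of 4 ticks instead.
rttm_from_window_update = rttm_ticks(5, 1)
```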

1836	Appendix G.  Changes from RFC 1072, RFC 1185, and RFC 1323

1838	   The protocol extensions defined in RFC 1323 differ in
1839	   several important ways from those defined in RFC 1072 and RFC 1185.

1841	   (a)  SACK has been split off into a separate document, [RFC2018].

1843	   (b)  The detailed rules for sending timestamp replies (see
1844	        Section 4.4) differ in important ways.  The earlier rules could
1845	        result in an under-estimate of the RTT in certain cases (packets
1846	        dropped or out of order).

1848	   (c)  The same value TS.Recent is now shared by the two distinct
1849	        mechanisms RTTM and PAWS.  This simplification became possible
1850	        because of change (b).

1852	   (d)  An ambiguity in RFC 1185 was resolved in favor of putting
1853	        timestamps on ACK as well as data segments.  This supports the
1854	        symmetry of the underlying TCP protocol.

1856	   (e)  The echo and echo reply options of RFC 1072 were combined into a
1857	        single Timestamps option, to reflect the symmetry and to
1858	        simplify processing.

1860	   (f)  The problem of outdated timestamps on long-idle connections,
1861	        discussed in Section 5.2.2, was realized and resolved.

1863	   (g)  RFC 1185 recommended that header prediction take precedence over
1864	        the timestamp check.  Based upon some skepticism about the
1865	        probabilistic arguments given in Section 5.2.4, it was decided
1866	        to recommend that the timestamp check be performed first.

1868	   (h)  The spec was modified so that the extended options will be sent
1869	        on <SYN,ACK> segments only when they are received in the
1870	        corresponding <SYN> segments.  This provides the most
1871	        conservative possible conditions for interoperation with
1872	        implementations without the extensions.

1874	   In addition to these substantive changes, the present RFC attempts to
1875	   specify the algorithms unambiguously by presenting modifications to
1876	   the Event Processing rules of RFC 793; see Appendix E.

1878	   There are additional changes in this document from RFC 1323.  These
1879	   changes are:

1881	   (a)  The description of which TSecr values can be used to update the
1882	        measured RTT has been clarified.  Specifically, with Timestamps,
1883	        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
1884	        disables all RTT measurements during retransmission, since it is
1885	        ambiguous whether the ACK is for the original packet, or the
1886	        retransmitted packet.  With Timestamps, that ambiguity is
1887	        removed since the TSecr in the ACK will contain the TSval from
1888	        whichever data packet made it to the destination.

1890	   (b)  In RFC1323, section 3.4, step (2) of the algorithm to control
1891	        which timestamp is echoed was incorrect in two regards:

1893	        (1)  It failed to update TS.recent for a retransmitted segment
1894	             that resulted from a lost ACK.

1896	        (2)  It failed if SEG.LEN = 0.

1898	        In the new algorithm, the case of SEG.TSval >= TS.recent is
1899	        included for consistency with the PAWS test.

1901	   (c)  One correction was made to the Event Processing Summary in
1902	        Appendix E.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
1903	        fill in the SEG.WND value, not SND.WND.

1905	   (d)  A new pseudo-code summary has been added in Appendix D.

1907	   (e)  Appendix A has been expanded with information about the TCP
1908	        Urgent Pointer.  An earlier revision contained text around the
1909	        TCP MSS option, which was split off into [RFC6691].

1911	   (f)  It is now recommended that Timestamps options be included in RST
1912	        packets if the incoming packet contained a Timestamps option.

1914	   (g)  RST packets are explicitly excluded from PAWS processing.

1916	   (h)  Snd.TSoffset and Snd.TSclock variables have been added.
1917	        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
1918	        allows the starting points for timestamps to be randomized on a
1919	        per-connection basis.  Setting Snd.TSoffset to zero yields the
1920	        same results as [RFC1323].

1922	   (i)  RTTM update processing explicitly excludes packets containing
1923	        SACK options.  This addresses inflation of the RTT during
1924	        episodes of packet loss in both directions.

1926	   (j)  In Section 4.2 the if-clause allowing sending of timestamps only
1927	        when received in a <SYN> or <SYN,ACK> was removed, to allow for
1928	        late timestamp negotiation.

1930	   (k)  Section 3.4 was added describing the unavoidable window
1931	        retraction issue, and explicitly describing the mitigation steps
1932	        necessary.

1934	   (l)  Section 2 was added for RFC2119 wording.  Normative text was
1935	        updated with the appropriate phrases.

1937	   (m)  Removed much of the discussion in Section 1 to streamline the
1938	        document.  However, detailed examples and discussions in
1939	        Section 3, Section 4 and Section 5 are kept as guidelines for
1940	        implementers.

1942	   (n)  Moved Appendix "Changes" to the end of the appendices for easier
1943	        lookup.

1945	Authors' Addresses

1947	   David Borman
1948	   Quantum Corporation
1949	   Mendota Heights  MN 55120
1950	   USA

1952	   Email: david.borman@quantum.com

1954	   Bob Braden
1955	   University of Southern California
1956	   4676 Admiralty Way
1957	   Marina del Rey  CA 90292
1958	   USA

1960	   Email: braden@isi.edu

1962	   Van Jacobson
1963	   Packet Design
1964	   2465 Latham Street
1965	   Mountain View  CA 94040
1966	   USA

1968	   Email: van@packetdesign.com

1970	   Richard Scheffenegger (editor)
1971	   NetApp, Inc.
1972	   Am Euro Platz 2
1973	   Vienna,   1120
1974	   Austria

1976	   Email: rs@netapp.com