idnits 2.17.1

draft-ietf-tcpm-1323bis-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 546: '...1) The receiver MUST honor, as in-win...'
     RFC 2119 keyword, line 549: '... effect, the receiver SHOULD track the...'
     RFC 2119 keyword, line 557: '...ial transmission MUST honor window on ...'

  -- The abstract seems to indicate that this document obsoletes RFC1323, but
     the header doesn't have an 'Obsoletes:' line to match this.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1406 has weird spacing: '... TSval times...'

  == Line 1408 has weird spacing: '... TSecr times...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 11, 2012) is 4300 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 300

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1110
     (Obsoleted by RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 1185
     (Obsoleted by RFC 1323)

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 1981
     (Obsoleted by RFC 8201)

  -- Obsolete informational reference (is this intentional?): RFC 2581
     (Obsoleted by RFC 5681)

  -- Obsolete informational reference (is this intentional?): RFC 3517
     (Obsoleted by RFC 6675)


     Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 13 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	TCP Maintenance (TCPM)                                         D. Borman
3	Internet-Draft                                       Quantum Corporation
4	Intended status: Standards Track                               B. Braden
5	Expires: January 12, 2013                         University of Southern
6	                                                              California
7	                                                             V. Jacobson
8	                                                           Packet Design
9	                                                   R. Scheffenegger, Ed.
10	                                                            NetApp, Inc.
11	                                                           July 11, 2012

13	                  TCP Extensions for High Performance
14	                       draft-ietf-tcpm-1323bis-03

16	Abstract

18	   This memo presents a set of TCP extensions to improve performance
19	   over large bandwidth*delay product paths and to provide reliable
20	   operation over very high-speed paths.  It defines TCP options for
21	   scaled windows and timestamps, which are designed to provide
22	   compatible interworking with TCP's that do not implement the
23	   extensions.  The timestamps are used for two distinct mechanisms:
24	   RTTM (Round Trip Time Measurement) and PAWS (Protection Against
25	   Wrapped Sequences).  Selective acknowledgments are not included in
26	   this memo.

28	   This memo updates and obsoletes RFC 1323.

30	Status of this Memo

32	   This Internet-Draft is submitted in full conformance with the
33	   provisions of BCP 78 and BCP 79.

35	   Internet-Drafts are working documents of the Internet Engineering
36	   Task Force (IETF).  Note that other groups may also distribute
37	   working documents as Internet-Drafts.  The list of current Internet-
38	   Drafts is at http://datatracker.ietf.org/drafts/current/.

40	   Internet-Drafts are draft documents valid for a maximum of six months
41	   and may be updated, replaced, or obsoleted by other documents at any
42	   time.  It is inappropriate to use Internet-Drafts as reference
43	   material or to cite them other than as "work in progress."

45	   This Internet-Draft will expire on January 12, 2013.

47	Copyright Notice
48	   Copyright (c) 2012 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (http://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the Simplified BSD License.

61	Table of Contents

63	   1.  Introduction
64	     1.1.  TCP Performance
65	     1.2.  TCP Reliability
66	     1.3.  Using TCP options
67	   2.  TCP Window Scale Option
68	     2.1.  Introduction
69	     2.2.  Window Scale Option
70	     2.3.  Using the Window Scale Option
71	     2.4.  Addressing Window Retraction
72	   3.  RTTM -- Round-Trip Time Measurement
73	     3.1.  Introduction
74	     3.2.  TCP Timestamps Option
75	     3.3.  The RTTM Mechanism
76	     3.4.  Which Timestamp to Echo
77	   4.  PAWS -- Protection Against Wrapped Sequence Numbers
78	     4.1.  Introduction
79	     4.2.  The PAWS Mechanism
80	       4.2.1.  Basic PAWS Algorithm
81	       4.2.2.  Timestamp Clock
82	       4.2.3.  Outdated Timestamps
83	       4.2.4.  Header Prediction
84	       4.2.5.  IP Fragmentation
85	     4.3.  Duplicates from Earlier Incarnations of Connection
86	   5.  Conclusions and Acknowledgements
87	   6.  Security Considerations
88	   7.  IANA Considerations
89	   8.  References
90	     8.1.  Normative References
91	     8.2.  Informative References
92	   Appendix A.  Implementation Suggestions
93	   Appendix B.  Duplicates from Earlier Connection Incarnations
94	     B.1.  System Crash with Loss of State
95	     B.2.  Closing and Reopening a Connection
96	   Appendix C.  Changes from RFC 1072, RFC 1185, and RFC 1323
97	   Appendix D.  Summary of Notation
98	   Appendix E.  Pseudo-code Summary
99	   Appendix F.  Event Processing Summary
100	   Appendix G.  Timestamps Edge Cases
101	   Authors' Addresses

103	1.  Introduction

105	   The TCP protocol [RFC0793] was designed to operate reliably over
106	   almost any transmission medium regardless of transmission rate,
107	   delay, corruption, duplication, or reordering of segments.
108	   Production TCP implementations currently adapt to transfer rates in
109	   the range of 100 bps to 10^10 bps and round-trip delays in the range
110	   1 ms to 100 seconds.  Work on TCP performance has shown that TCP
111	   without the extensions described in this memo can work well over a
112	   variety of Internet paths, ranging from 800 Mbit/sec I/O channels to
113	   300 bit/sec dial-up modems.

115	   Over the years, advances in networking technology have resulted in
116	   ever-higher transmission speeds, and the fastest paths are well
117	   beyond the domain for which TCP was originally engineered.  This memo
118	   defines a set of modest extensions to TCP to extend the domain of its
119	   application to match this increasing network capability.  It is an
120	   update to and obsoletes [RFC1323], which in turn is based upon and
121	   obsoletes [RFC1072] and [RFC1185].

123	   There is no one-line answer to the question: "How fast can TCP go?".
124	   There are two separate kinds of issues, performance and reliability,
125	   and each depends upon different parameters.  We discuss each in turn.

127	1.1.  TCP Performance

129	   TCP performance depends not upon the transfer rate itself, but rather
130	   upon the product of the transfer rate and the round-trip delay.  This
131	   "bandwidth*delay product" measures the amount of data that would
132	   "fill the pipe"; it is the buffer space required at sender and
133	   receiver to obtain maximum throughput on the TCP connection over the
134	   path, i.e., the amount of unacknowledged data that TCP must handle in
135	   order to keep the pipeline full.  TCP performance problems arise when
136	   the bandwidth*delay product is large.  We refer to an Internet path
137	   operating in this region as a "long, fat pipe", and a network
138	   containing this path as an "LFN" (pronounced "elephan(t)").

140	   High-capacity packet satellite channels are LFN's.  For example, a
141	   DS1-speed satellite channel has a bandwidth*delay product of 10^6
142	   bits or more; this corresponds to 100 outstanding TCP segments of
143	   1200 bytes each.  Terrestrial fiber-optical paths will also fall into
144	   the LFN class; for example, a cross-country delay of 30 ms at a DS3
145	   bandwidth (45Mbps) also exceeds 10^6 bits.
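
   The bandwidth*delay arithmetic above can be checked with a short
   sketch.  The function names are hypothetical and chosen only for
   illustration; the 45 Mbps rate, 30 ms delay, 10^6-bit product, and
   1200-byte segment size are the figures quoted in the text.

```python
def bdp_bits(rate_bps: float, delay_s: float) -> float:
    """Bandwidth*delay product in bits: the data needed to fill the pipe."""
    return rate_bps * delay_s


def segments_in_flight(bdp: float, segment_bytes: int) -> float:
    """Number of segments of the given size that a product of `bdp` bits holds."""
    return bdp / (segment_bytes * 8)


# DS3 (45 Mbps) with a 30 ms cross-country delay, as in the text.
ds3_bdp = bdp_bits(45e6, 0.030)
print("DS3 product exceeds 10^6 bits:", ds3_bdp > 1e6)

# A 10^6-bit product expressed in 1200-byte segments (roughly 100).
print("segments outstanding:", round(segments_in_flight(1e6, 1200)))
```

   Both results agree with the figures in the paragraph above: the DS3
   path exceeds 10^6 bits, and 10^6 bits is on the order of 100
   segments of 1200 bytes.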

147	   There are three fundamental performance problems with the current TCP
148	   over LFN paths:

150	   (1)  Window Size Limit

152	        The TCP header uses a 16 bit field to report the receive window
153	        size to the sender.  Therefore, the largest window that can be
154	        used is 2^16 = 65K bytes.

156	        To circumvent this problem, Section 2 of this memo defines a new
157	        TCP option, "Window Scale", to allow windows larger than 2^16.
158	        This option defines an implicit scale factor, which is used to
159	        multiply the window size value found in a TCP header to obtain
160	        the true window size.

162	   (2)  Recovery from Losses

164	        Packet losses in an LFN can have a catastrophic effect on
165	        throughput.  In the past, properly-operating TCP implementations
166	        would cause the data pipeline to drain with every packet loss,
167	        and require a slow-start action to recover.  The Fast Retransmit
168	        and Fast Recovery algorithms [Jacobson90c], [RFC2581] and
169	        [RFC5681] were introduced, and their combined effect was to
170	        recover from one packet loss per window, without draining the
171	        pipeline.  However, more than one packet loss per window
172	        typically resulted in a retransmission timeout and the resulting
173	        pipeline drain and slow start.

175	        Expanding the window size to match the capacity of an LFN
176	        results in a corresponding increase of the probability of more
177	        than one packet per window being dropped.  This could have a
178	        devastating effect upon the throughput of TCP over an LFN.  In
179	        addition, since the publication of RFC 1323, congestion control
180	        mechanisms based upon some form of random dropping have been
181	        introduced into gateways, and randomly spaced packet drops have
182	        become common; this increases the probability of dropping more
183	        than one packet per window.

185	        To generalize the Fast Retransmit/Fast Recovery mechanism to
186	        handle multiple packets dropped per window, selective
187	        acknowledgments are required.  Unlike the normal cumulative
188	        acknowledgments of TCP, selective acknowledgments give the
189	        sender a complete picture of which segments are queued at the
190	        receiver and which have not yet arrived.

192	        Since the publication of RFC 1323 [RFC1323], selective
193	        acknowledgments (SACK) have become important in the LFN regime.
194	        SACK has been published as "TCP Selective Acknowledgment
195	        Options" [RFC2018].  Additional information about SACK can be
196	        found in "An Extension to the Selective Acknowledgement (SACK)
197	        option for TCP" [RFC2883] and "A Conservative Selective
198	        Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
199	        [RFC3517].

201	   (3)  Round-Trip Measurement

203	        TCP implements reliable data delivery by retransmitting segments
204	        that are not acknowledged within some retransmission timeout
205	        (RTO) interval.  Accurate dynamic determination of an
206	        appropriate RTO is essential to TCP performance.  RTO is
207	        determined by estimating the mean and variance of the measured
208	        round-trip time (RTT), i.e., the time interval between sending a
209	        segment and receiving an acknowledgment for it [Jacobson88a].

211	        Section 3.2 introduces a new TCP option, "Timestamps", and then
212	        defines a mechanism using this option that allows nearly every
213	        segment, including retransmissions, to be timed at negligible
214	        computational cost.  We use the mnemonic RTTM (Round Trip Time
215	        Measurement) for this mechanism, to distinguish it from other
216	        uses of the Timestamps option.

218	1.2.  TCP Reliability

220	   Now we turn from performance to reliability.  High transfer rate
221	   enters TCP performance through the bandwidth*delay product.  However,
222	   high transfer rate alone can threaten TCP reliability by violating
223	   the assumptions behind the TCP mechanism for duplicate detection and
224	   sequencing.

226	   An especially serious kind of error may result from an accidental
227	   reuse of TCP sequence numbers in data segments.  Suppose that an "old
228	   duplicate segment", e.g., a duplicate data segment that was delayed
229	   in Internet queues, is delivered to the receiver at the wrong moment,
230	   so that its sequence numbers fall somewhere within the current
231	   window.  There would be no checksum failure to warn of the error, and
232	   the result could be an undetected corruption of the data.  Reception
233	   of an old duplicate ACK segment at the transmitter could be only
234	   slightly less serious: it is likely to lock up the connection so that
235	   no further progress can be made, forcing an RST on the connection.

237	   TCP reliability depends upon the existence of a bound on the lifetime
238	   of a segment: the "Maximum Segment Lifetime" or MSL.  An MSL is
239	   generally required by any reliable transport protocol, since every
240	   sequence number field must be finite, and therefore any sequence
241	   number may eventually be reused.  In the Internet protocol suite, the
242	   MSL bound is loosely enforced by an IP-layer mechanism, the "Time-to-
243	   Live" (TTL) field, or "Hop Limit" field.

245	   Duplication of sequence numbers might happen in either of two ways:

247	   (1)  Sequence number wrap-around on the current connection

249	        A TCP sequence number contains 32 bits.  At a high enough
250	        transfer rate, the 32-bit sequence space may be "wrapped"
251	        (cycled) within the time that a segment is delayed in queues.

253	   (2)  Earlier incarnation of the connection

255	        Suppose that a connection terminates, either by a proper close
256	        sequence or due to a host crash, and the same connection (i.e.,
257	        using the same pair of port numbers) is immediately reopened.  A
258	        delayed segment from the terminated connection could fall within
259	        the current window for the new incarnation and be accepted as
260	        valid.

262	   Duplicates from earlier incarnations, case (2), are avoided by
263	   enforcing the current fixed MSL of the TCP spec, as explained in
264	   Section 4.3 and Appendix B.  However, case (1), avoiding the reuse of
265	   sequence numbers within the same connection, requires an MSL bound
266	   that depends upon the transfer rate, and at high enough rates, a new
267	   mechanism is required.

269	   More specifically, if the maximum effective bandwidth at which TCP is
270	   able to transmit over a particular path is B bytes per second, then
271	   the following constraint must be satisfied for error-free operation:

273	               2^31 / B  > MSL (secs)                     [1]

275	   The following table shows the value for Twrap = 2^31/B in seconds,
276	   for some important values of the bandwidth B:

278	    +------------------+----------+-------------+--------------------+
279	    |      Network     | bits/sec | B bytes/sec |     Twrap secs     |
280	    +------------------+----------+-------------+--------------------+
281	    |      Dialup      |   56kbps |      7kBps  | 3*10^5 (~3.6 days) |
282	    |        DS1       |  1.5Mbps |    190kBps  |   10^4 (~3 hours)  |
283	    |  10MBit Ethernet |   10Mbps |   1.25MBps  |  1700 (~0.5 hours) |
284	    |        DS3       |   45Mbps |    5.6MBps  |         380        |
285	    | 100MBit Ethernet |  100Mbps |   12.5MBps  |         170        |
286	    | Gigabit Ethernet |    1Gbps |    125MBps  |         17         |
287	    |  10Gig Ethernet  |   10Gbps |   1.25GBps  |         1.7        |
288	    +------------------+----------+-------------+--------------------+
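
   The table can be reproduced with a few lines of arithmetic.  This is
   a sketch only: the function names are hypothetical, the B values are
   copied from the table, and the 120-second MSL is the value this memo
   attributes to the TCP specification.

```python
MSL_SECONDS = 120  # 2-minute MSL assumed by the TCP specification


def twrap_seconds(bytes_per_sec: float) -> float:
    """Twrap = 2^31 / B, the time to send 2^31 bytes at B bytes/sec."""
    return 2**31 / bytes_per_sec


def satisfies_constraint_1(bytes_per_sec: float, msl: float = MSL_SECONDS) -> bool:
    """Constraint [1]: 2^31 / B > MSL."""
    return twrap_seconds(bytes_per_sec) > msl


# B in bytes/sec, taken from the table above.
TABLE = {
    "Dialup": 7e3,
    "DS1": 190e3,
    "10MBit Ethernet": 1.25e6,
    "DS3": 5.6e6,
    "100MBit Ethernet": 12.5e6,
    "Gigabit Ethernet": 125e6,
    "10Gig Ethernet": 1.25e9,
}

for name, b in TABLE.items():
    print(f"{name:17s} Twrap = {twrap_seconds(b):10.1f} s  "
          f"[1] holds: {satisfies_constraint_1(b)}")
```

   With a 120-second MSL, the two gigabit rows fail constraint [1],
   while the DS3 and 100Mbit rows satisfy it only by a small factor.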

290	   It is clear that wrap-around of the sequence space is not a problem
291	   for 56kbps packet switching or even 10Mbps Ethernets.  On the other
292	   hand, at DS3 and 100Mbit speeds, Twrap is comparable to the 2-minute
293	   MSL assumed by the TCP specification [RFC0793].  Moving towards and
294	   beyond gigabit speeds, Twrap becomes too small for reliable
295	   enforcement by the Internet TTL mechanism.

297	   The 16-bit window field of TCP limits the effective bandwidth B to
298	   2^16/RTT, where RTT is the round-trip time in seconds [RFC1110].  If
299	   the RTT is large enough, this limits B to a value that meets the
300	   constraint [1] for a large MSL value.  For example, consider a
301	   transcontinental backbone with an RTT of 60ms (set by the laws of
302	   physics).  With the bandwidth*delay product limited to 64KB by the
303	   TCP window size, B is then limited to 1.1MBps, no matter how high the
304	   theoretical transfer rate of the path.  This corresponds to cycling
305	   the sequence number space in Twrap = 2000 secs, which is safe in
306	   today's Internet.
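
   The numbers in this example follow directly from B = 2^16/RTT.  The
   sketch below uses hypothetical function names; the 60 ms RTT is the
   value given in the text.

```python
MAX_UNSCALED_WINDOW = 2**16  # bytes, the bound used in the text


def window_limited_bandwidth(rtt_s: float, window_bytes: int = MAX_UNSCALED_WINDOW) -> float:
    """Largest effective bandwidth (bytes/sec) one window per RTT can carry."""
    return window_bytes / rtt_s


b = window_limited_bandwidth(0.060)  # transcontinental RTT of 60 ms
twrap = 2**31 / b                    # time to cycle half the sequence space

print(f"B is about {b / 1e6:.2f} MBps")
print(f"Twrap is about {twrap:.0f} secs")
```

   The result is roughly 1.1 MBps and a Twrap close to 2000 seconds,
   matching the figures above.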

308	   It is important to understand that the culprit is not the larger
309	   window but rather the high bandwidth.  For example, consider a (very
310	   large) FDDI LAN with a diameter of 10km.  Using the speed of light,
311	   we can compute the RTT across the ring as (2*10^4)/(3*10^8) = 67
312	   microseconds, and the delay*bandwidth product is then 833 bytes.  A
313	   TCP connection across this LAN using a window of only 833 bytes will
314	   run at the full 100Mbps and can wrap the sequence space in about 3
315	   minutes, very close to the MSL of TCP.  Thus, high speed alone can
316	   cause a reliability problem with sequence number wrap-around, even
317	   without extended windows.
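
   The FDDI figures can be verified in the same way.  This sketch
   assumes, as the text does, propagation at 3*10^8 m/s and a 100 Mbps
   ring; the variable and function names are illustrative only.

```python
SPEED_OF_LIGHT = 3e8   # metres/sec, the value used in the text
LINK_RATE_BPS = 100e6  # FDDI rate in bits/sec


def round_trip_time(diameter_m: float) -> float:
    """RTT across a ring of the given diameter: out and back at light speed."""
    return 2 * diameter_m / SPEED_OF_LIGHT


rtt = round_trip_time(10_000)             # 10 km diameter
bdp_bytes = LINK_RATE_BPS * rtt / 8       # delay*bandwidth product in bytes
twrap = 2**31 / (LINK_RATE_BPS / 8)       # seconds to send 2^31 bytes

print(f"RTT   = {rtt * 1e6:.0f} microseconds")
print(f"BDP   = {bdp_bytes:.0f} bytes")
print(f"Twrap = {twrap / 60:.1f} minutes")
```

   A window of well under 1 Kbyte is therefore enough to run the ring
   at full rate and to wrap the sequence space in about 3 minutes.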

319	   Watson's Delta-T protocol [Watson81] includes network-layer
320	   mechanisms for precise enforcement of an MSL.  In contrast, the IP
321	   mechanism for MSL enforcement is loosely defined and even more
322	   loosely implemented in the Internet.  Therefore, it is unwise to
323	   depend upon active enforcement of MSL for TCP connections, and it is
324	   unrealistic to imagine setting MSL's smaller than the current values
325	   (e.g., 120 seconds specified for TCP).

327	   A possible fix for the problem of cycling the sequence space would be
328	   to increase the size of the TCP sequence number field.  For example,
329	   the sequence number field (and also the acknowledgment field) could
330	   be expanded to 64 bits.  This could be done either by changing the
331	   TCP header or by means of an additional option.

333	   Section 4 presents a different mechanism, which we call PAWS
334	   (Protection Against Wrapped Sequence numbers), to extend TCP
335	   reliability to transfer rates well beyond the foreseeable upper limit
336	   of network bandwidths.  PAWS uses the TCP Timestamps option defined
337	   in Section 3.2 to protect against old duplicates from the same
338	   connection.

340	1.3.  Using TCP options

342	   The extensions defined in this memo all use new TCP options.  We must
343	   address two possible issues concerning the use of TCP options: (1)
344	   compatibility and (2) overhead.

346	   We must pay careful attention to compatibility, i.e., to
347	   interoperation with existing implementations.  The only TCP option
348	   defined previously, MSS, may appear only on a SYN segment.  Every
349	   implementation should (and we expect that most will) ignore unknown
350	   options on SYN segments.  When RFC 1323 was published, there was
351	   concern that some buggy TCP implementation might be crashed by the
352	   first appearance of an option on a non-SYN segment.  However, bugs
353	   like that can lead to DoS attacks against a TCP, so it is now
354	   expected that most TCP implementations will properly handle unknown
355	   options on non-SYN segments.  But it is still prudent to be
356	   conservative in what you send, and avoiding buggy TCP implementations
357	   is not the only reason for negotiating TCP options on SYN segments.
358	   Therefore, for each of the extensions defined below, TCP options will
359	   be sent on non-SYN segments only after an exchange of options on the
360	   SYN segments has indicated that both sides understand the extension.
361	   Furthermore, an extension option will be sent in a <SYN,ACK> segment
362	   only if the corresponding option was received in the initial <SYN>
363	   segment.

365	   A question may be raised about the bandwidth and processing overhead
366	   for TCP options.  Those options that occur on SYN segments are not
367	   likely to cause a performance concern.  Opening a TCP connection
368	   requires execution of significant special-case code, and the
369	   processing of options is unlikely to increase that cost
370	   significantly.

372	   On the other hand, a Timestamps option may appear in any data or ACK
373	   segment, adding 12 bytes to the 20-byte TCP header.  We believe that
374	   the bandwidth saved by reducing unnecessary retransmissions will more
375	   than pay for the extra header bandwidth.

377	   There is also an issue about the processing overhead for parsing the
378	   variable byte-aligned format of options, particularly with a RISC-
379	   architecture CPU.  Appendix A contains a recommended layout of the
380	   options in TCP headers to achieve reasonable data field alignment.
381	   In the spirit of Header Prediction, a TCP can quickly test for this
382	   layout and if it is verified then use a fast path.  Hosts that use
383	   this canonical layout will effectively use the options as a set of
384	   fixed-format fields appended to the TCP header.  However, to retain
385	   the philosophical and protocol framework of TCP options, a TCP must
386	   be prepared to parse an arbitrary options field, albeit with less
387	   efficiency.
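
   A general-purpose (slow path) parser for an arbitrary options field
   might look like the following sketch.  It is not the layout test of
   Appendix A.  It assumes only the base TCP option encoding: kinds 0
   (End of Option List) and 1 (No-Operation) are single bytes, and every
   other option is a kind byte, a length byte that counts both header
   bytes, and then the option data.  The function name is hypothetical.

```python
from typing import Iterator, Tuple

KIND_EOL = 0  # End of Option List, single byte
KIND_NOP = 1  # No-Operation, single byte, used for padding/alignment


def parse_options(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (kind, value) for each option; skip NOPs and stop at EOL."""
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == KIND_EOL:
            return
        if kind == KIND_NOP:
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]  # counts the kind and length bytes too
        if length < 2 or i + length > len(data):
            raise ValueError("bad option length")
        yield kind, data[i + 2 : i + length]
        i += length


# NOP, Window Scale (kind 3, shift.cnt 7), MSS (kind 2, value 1460), EOL.
example = bytes([1, 3, 3, 7, 2, 4, 0x05, 0xB4, 0])
for kind, value in parse_options(example):
    print(kind, value.hex())
```

   Unknown kinds are returned like any other option, so a caller can
   simply ignore those it does not understand.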

389	   Finally, we observe that most of the mechanisms defined in this memo
390	   are important for LFN's and/or very high-speed networks.  For low-
391	   speed networks, it might be a performance optimization to NOT use
392	   these mechanisms.  A TCP vendor concerned about optimal performance
393	   over low-speed paths might consider turning these extensions off for
394	   low-speed paths, or allow a user or installation manager to disable
395	   them.

397	2.  TCP Window Scale Option

399	2.1.  Introduction

401	   The window scale extension expands the definition of the TCP window
402	   to 32 bits and then uses a scale factor to carry this 32-bit value in
403	   the 16-bit Window field of the TCP header (SEG.WND in RFC 793).  The
404	   scale factor is carried in a new TCP option, Window Scale.  This
405	   option is sent only in a SYN segment (a segment with the SYN bit on),
406	   hence the window scale is fixed in each direction when a connection
407	   is opened.  (Another design choice would be to specify the window
408	   scale in every TCP segment.  It would be incorrect to send a window
409	   scale option only when the scale factor changed, since a TCP option
410	   in an acknowledgement segment will not be delivered reliably (unless
411	   the ACK happens to be piggy-backed on data in the other direction).
412	   Fixing the scale when the connection is opened has the advantage of
413	   lower overhead but the disadvantage that the scale factor cannot be
414	   changed during the connection.)

416	   The maximum receive window, and therefore the scale factor, is
417	   determined by the maximum receive buffer space.  In a typical modern
418	   implementation, this maximum buffer space is set by default but can
419	   be overridden by a user program before a TCP connection is opened.
420	   This determines the scale factor, and therefore no new user interface
421	   is needed for window scaling.

423	2.2.  Window Scale Option

425	   The three-byte Window Scale option may be sent in a SYN segment by a
426	   TCP.  It has two purposes: (1) indicate that the TCP is prepared to
427	   do both send and receive window scaling, and (2) communicate a scale
428	   factor to be applied to its receive window.  Thus, a TCP that is
429	   prepared to scale windows should send the option, even if its own
430	   scale factor is 1.  The scale factor is limited to a power of two and
431	   encoded logarithmically, so it may be implemented by binary shift
432	   operations.

434	   TCP Window Scale Option (WSopt):

436	                   Kind: 3

438	                   Length: 3 bytes

440	                          +---------+---------+---------+
441	                          | Kind=3  |Length=3 |shift.cnt|
442	                          +---------+---------+---------+

444	   This option is an offer, not a promise; both sides must send Window
445	   Scale options in their SYN segments to enable window scaling in
446	   either direction.  If window scaling is enabled, then the TCP that
447	   sent this option will right-shift its true receive-window values by
448	   'shift.cnt' bits for transmission in SEG.WND.  The value 'shift.cnt'
449	   may be zero (offering to scale, while applying a scale factor of 1 to
450	   the receive window).
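
   The three-byte encoding shown above can be produced and consumed as
   in the following sketch.  The helper names are hypothetical; the
   Kind and Length values are those given in the option definition.

```python
import struct

WSOPT_KIND = 3
WSOPT_LENGTH = 3


def encode_wsopt(shift_cnt: int) -> bytes:
    """Build the Window Scale option: Kind=3, Length=3, shift.cnt."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return struct.pack("!BBB", WSOPT_KIND, WSOPT_LENGTH, shift_cnt)


def decode_wsopt(option: bytes) -> int:
    """Return shift.cnt from a three-byte Window Scale option."""
    kind, length, shift_cnt = struct.unpack("!BBB", option)
    if kind != WSOPT_KIND or length != WSOPT_LENGTH:
        raise ValueError("not a Window Scale option")
    return shift_cnt


wire = encode_wsopt(0)  # offer to scale, with a scale factor of 1
print(wire.hex(), decode_wsopt(wire))
```

   A shift.cnt of zero still produces a complete option, which is how a
   TCP offers scaling while leaving its own receive window unscaled.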

452	   This option may be sent in an initial <SYN> segment (i.e., a segment
453	   with the SYN bit on and the ACK bit off).  It may also be sent in a
454	   <SYN,ACK> segment, but only if a Window Scale option was received in
455	   the initial <SYN> segment.  A Window Scale option in a segment
456	   without a SYN bit should be ignored.

458	   The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself
459	   is never scaled.

461	2.3.  Using the Window Scale Option

463	   A model implementation of window scaling is as follows, using the
464	   notation of [RFC0793]:

466	   o  All windows are treated as 32-bit quantities for storage in the
467	      connection control block and for local calculations.  This
468	      includes the send-window (SND.WND) and the receive-window
469	      (RCV.WND) values, as well as the congestion window.

471	   o  The connection state is augmented by two window shift counts,
472	      Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
473	      and outgoing window fields, respectively.

475	   o  If a TCP receives a <SYN> segment containing a Window Scale
476	      option, it sends its own Window Scale option in the <SYN,ACK>
477	      segment.

479	   o  The Window Scale option is sent with shift.cnt = R, where R is the
480	      value that the TCP would like to use for its receive window.

482	   o  Upon receiving a SYN segment with a Window Scale option containing
483	      shift.cnt = S, a TCP sets Snd.Wind.Scale to S and sets
484	      Rcv.Wind.Scale to R; otherwise, it sets both Snd.Wind.Scale and
485	      Rcv.Wind.Scale to zero.

487	   o  The window field (SEG.WND) in the header of every incoming
488	      segment, with the exception of SYN segments, is left-shifted by
489	      Snd.Wind.Scale bits before updating SND.WND:

491	                    SND.WND = SEG.WND << Snd.Wind.Scale

493	      (assuming the other conditions of RFC 793 are met, and using the
494	      "C" notation "<<" for left-shift).

496	   o  The window field (SEG.WND) of every outgoing segment, with the
497	      exception of SYN segments, is right-shifted by Rcv.Wind.Scale
498	      bits:

500	                    SEG.WND = RCV.WND >> Rcv.Wind.Scale
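
   The model above can be sketched as a small state holder.  The class
   and method names are hypothetical; only Snd.Wind.Scale,
   Rcv.Wind.Scale, R, and S come from the text.  The sketch covers the
   window arithmetic only and omits the other RFC 793 conditions for
   updating SND.WND.

```python
from typing import Optional


class WindowScaleState:
    """Per-connection window shift counts, following the model above."""

    def __init__(self, desired_shift: int):
        self.desired_shift = desired_shift  # R, sent in our own SYN
        self.snd_wind_scale = 0             # Snd.Wind.Scale
        self.rcv_wind_scale = 0             # Rcv.Wind.Scale

    def on_syn_received(self, peer_shift: Optional[int]) -> None:
        """peer_shift is S from the peer's SYN, or None if the option was absent."""
        if peer_shift is None:
            self.snd_wind_scale = 0
            self.rcv_wind_scale = 0
        else:
            self.snd_wind_scale = peer_shift
            self.rcv_wind_scale = self.desired_shift

    def incoming_window(self, seg_wnd: int) -> int:
        """SND.WND from the window field of a received non-SYN segment."""
        return seg_wnd << self.snd_wind_scale

    def outgoing_window(self, rcv_wnd: int) -> int:
        """Window field for an outgoing non-SYN segment.

        Assumes rcv_wnd is small enough to fit in 16 bits after shifting.
        """
        return rcv_wnd >> self.rcv_wind_scale


state = WindowScaleState(desired_shift=7)
state.on_syn_received(peer_shift=2)
print(state.incoming_window(0xFFFF))     # peer's 16-bit field scaled up by 2^2
print(state.outgoing_window(1_000_000))  # our 32-bit window scaled down by 2^7
```

   If the peer's SYN carries no Window Scale option, both shift counts
   stay at zero and the window fields are used unscaled in both
   directions.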

502	   TCP determines if a data segment is "old" or "new" by testing whether
503	   its sequence number is within 2^31 bytes of the left edge of the
504	   window, and if it is not, discarding the data as "old".  To ensure
505	   that new data is never mistakenly considered old and vice versa, the
506	   left edge of the sender's window has to be at most 2^31 away from the
507	   right edge of the receiver's window.  Similarly with the sender's
508	   right edge and receiver's left edge.  Since the right and left edges
509	   of either the sender's or receiver's window differ by the window
510	   size, and since the sender and receiver windows can be out of phase
511	   by at most the window size, the above constraints imply that 2 * the
512	   max window size must be less than 2^31, or

514	                             max window < 2^30

516	   Since the max window is 2^S (where S is the scaling shift count)
517	   times at most 2^16 - 1 (the maximum unscaled window), the maximum
518	   window is guaranteed to be < 2^30 if S <= 14.  Thus, the shift count
519	   must be limited to 14 (which allows windows of 2^30 = 1 Gbyte).  If a
520	   Window Scale option is received with a shift.cnt value exceeding 14,
521	   the TCP should log the error but use 14 instead of the specified
522	   value.
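
The limit on the shift count can be sketched in C as follows.  The
function names and the use of stderr for the log message are
assumptions made for this illustration; the text above only requires
that the error be logged and that 14 be used instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define WS_MAX_SHIFT 14

/* Returns the shift count to use for a received shift.cnt value. */
uint8_t ws_accept_shift(uint8_t shift_cnt)
{
    if (shift_cnt > WS_MAX_SHIFT) {
        /* Log the error, then fall back to the largest legal value. */
        fprintf(stderr, "window scale shift.cnt %u exceeds %d; using %d\n",
                (unsigned)shift_cnt, WS_MAX_SHIFT, WS_MAX_SHIFT);
        return WS_MAX_SHIFT;
    }
    return shift_cnt;
}

/* Largest window expressible with a given shift: (2^16 - 1) * 2^S. */
uint32_t ws_max_window(uint8_t shift)
{
    assert(shift <= WS_MAX_SHIFT);
    return (uint32_t)0xFFFF << shift;
}
```

For S = 14 the largest window is 0x3FFFC000 bytes, which is below
2^30 as the derivation above requires.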

524	   The scale factor applies only to the Window field as transmitted in
525	   the TCP header; each TCP using extended windows will maintain the
526	   window values locally as 32-bit numbers.  For example, the
527	   "congestion window" computed by Slow Start and Congestion Avoidance
528	   is not affected by the scale factor, so window scaling will not
529	   introduce quantization into the congestion window.

531	2.4.  Addressing Window Retraction

533	   When a non-zero scale factor is in use, there are instances when a
534	   retracted window can be offered [Mathis08].  The end of the window
535	   will be on a boundary based on the granularity of the scale factor
536	   being used.  If the sequence number is then updated by a number of
537	   bytes smaller than that granularity, the TCP will have to either
538	   advertise a new window that is beyond what it previously advertised
539	   (and perhaps beyond the buffer), or will have to advertise a smaller
540	   window, which will cause the TCP window to shrink.  Implementations
541	   should ensure that they handle a shrinking window, as specified in
542	   section 4.2.2.16 of [RFC1122].

544	   For the receiver, this implies that:

546	   1)  The receiver MUST honor, as in-window, any segment that would
547	       have been in-window for any ACK sent by the receiver.

549	   2)  When window scaling is in effect, the receiver SHOULD track the
550	       actual maximum window sequence number (which is likely to be
551	       greater than the window announced by the most recent ACK, if more
552	       than one segment has arrived since the application consumed any
553	       data in the receive buffer).

555	   On the sender side:

557	   3)  The initial transmission MUST honor the window on the most
	       recent ACK.

559	   4)  On first retransmission, or if the sequence number is out-of-
560	       window by less than (2^Rcv.Wind.Scale) then do normal
561	       retransmission(s) without regard to receiver window as long as
562	       the original segment was in window when it was sent.

564	   5)  On subsequent retransmissions, treat such ACKs as zero window
565	       probes.
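
The following C sketch illustrates how a retracted window can arise
and how a receiver might track the largest right edge it has ever
advertised (item 2 above).  The function names are hypothetical, and
the sketch assumes the window field is computed by a plain right
shift, so that it rounds down to the granularity of the scale factor.

```c
#include <assert.h>
#include <stdint.h>

/* Right edge implied by an ACK: RCV.NXT plus the window field expanded
 * back to bytes.  The field is RCV.WND >> scale, so it rounds down. */
uint32_t advertised_right_edge(uint32_t rcv_nxt, uint32_t rcv_wnd,
                               uint8_t scale)
{
    assert(scale <= 14);
    uint32_t field = rcv_wnd >> scale;
    if (field > 0xFFFF)
        field = 0xFFFF;
    return rcv_nxt + (field << scale);
}

/* Keep the largest right edge advertised so far, compared in modular
 * sequence space (assumes two's-complement conversion to int32_t). */
uint32_t track_max_edge(uint32_t max_edge, uint32_t new_edge)
{
    return (int32_t)(new_edge - max_edge) > 0 ? new_edge : max_edge;
}
```

With a shift count of 7 (granularity 128 bytes), RCV.NXT = 0 and
RCV.WND = 1280 advertise a right edge of 1280.  After 100 bytes
arrive and none are consumed, RCV.NXT = 100 and RCV.WND = 1180; the
field rounds down to 9, so the new ACK advertises a right edge of
1252, a retraction of 28 bytes.  Tracking the maximum keeps 1280 as
the edge against which arriving segments are judged.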

567	3.  RTTM -- Round-Trip Time Measurement

569	3.1.  Introduction

571	   Accurate and current RTT estimates are necessary to adapt to changing
572	   traffic conditions and to avoid an instability known as "congestion
573	   collapse" [RFC0896] in a busy network.  However, accurate measurement
574	   of RTT may be difficult both in theory and in implementation.

576	   Many TCP implementations base their RTT measurements upon a sample of
577	   one packet per window or less.  While this yields an adequate
578	   approximation to the RTT for small windows, it results in an
579	   unacceptably poor RTT estimate for an LFN.  If we look at RTT
580	   estimation as a signal processing problem (which it is), a data
581	   signal at some frequency, the packet rate, is being sampled at a
582	   lower frequency, the window rate.  This lower sampling frequency
583	   violates Nyquist's criterion and may therefore introduce "aliasing"
584	   artifacts into the estimated RTT [Hamming77].

586	   A good RTT estimator with a conservative retransmission timeout
587	   calculation can tolerate aliasing when the sampling frequency is
588	   "close" to the data frequency.  For example, with a window of 8
589	   packets, the sample rate is 1/8 the data frequency -- less than an
590	   order of magnitude different.  However, when the window is tens or
591	   hundreds of packets, the RTT estimator may be seriously in error,
592	   resulting in spurious retransmissions.

594	   If there are dropped packets, the problem becomes worse.  Zhang
595	   [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is not
596	   possible to accumulate reliable RTT estimates if retransmitted
597	   segments are included in the estimate.  Since a full window of data
598	   will have been transmitted prior to a retransmission, all of the
599	   segments in that window will have to be ACKed before the next RTT
600	   sample can be taken.  This means at least an additional window's
601	   worth of time between RTT measurements and, as the error rate
602	   approaches one per window of data (e.g., 10^-6 errors per bit for the
603	   Wideband satellite network), it becomes effectively impossible to
604	   obtain a valid RTT measurement.

606	   A solution to these problems, which actually simplifies the sender
607	   substantially, is as follows: using TCP options, the sender places a
608	   timestamp in each data segment, and the receiver reflects these
609	   timestamps back in ACK segments.  Then a single subtract gives the
610	   sender an accurate RTT measurement for every ACK segment (which will
611	   correspond to every other data segment, with a sensible receiver).
612	   We call this the RTTM (Round-Trip Time Measurement) mechanism.

614	   It is vitally important to use the RTTM mechanism with big windows;
615	   otherwise, the door is opened to some dangerous instabilities due to
616	   aliasing.  Furthermore, the option is probably useful for all TCP's,
617	   since it simplifies the sender.

619	3.2.  TCP Timestamps Option

621	   TCP is a symmetric protocol, allowing data to be sent at any time in
622	   either direction, and therefore timestamp echoing may occur in either
623	   direction.  For simplicity and symmetry, we specify that timestamps
624	   always be sent and echoed in both directions.  For efficiency, we
625	   combine the timestamp and timestamp reply fields into a single TCP
626	   Timestamps Option.

628	   TCP Timestamps Option (TSopt):

630	       Kind: 8

632	       Length: 10 bytes

634	        +-------+-------+---------------------+---------------------+
635	        |Kind=8 |  10   |   TS Value (TSval)  |TS Echo Reply (TSecr)|
636	        +-------+-------+---------------------+---------------------+
637	            1       1              4                     4

639	   The Timestamps option carries two four-byte timestamp fields.  The
640	   Timestamp Value field (TSval) contains the current value of the
641	   timestamp clock of the TCP sending the option.

643	   The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
644	   in the TCP header; if it is valid, it echoes a timestamp value that
645	   was sent by the remote TCP in the TSval field of a Timestamps option.
646	   When TSecr is not valid, its value must be zero.  However, a value of
647	   zero does not imply TSecr being invalid.  The TSecr value will
648	   generally be from the most recent Timestamp option that was received;
649	   however, there are exceptions that are explained below.

651	   A TCP may send the Timestamps option (TSopt) in an initial <SYN>
652	   segment (i.e., a segment containing a SYN bit and no ACK bit).  Once
653	   a TSopt has been sent or received in a non-<SYN> segment, it must be
654	   sent in all segments.  Once a TSopt has been received in a non-<SYN>
655	   segment, then any successive segment that is received without the RST
656	   bit and without a TSopt may be dropped without further processing,
657	   and an ACK of the current SND.UNA generated.

659	   In the case of crossing SYN packets where one SYN contains a TSopt
660	   and the other doesn't, both sides should put a TSopt in the <SYN,ACK>
661	   segment.

663	3.3.  The RTTM Mechanism

665	   RTTM places a Timestamps option in every segment, with a TSval that
666	   is obtained from a (virtual) "timestamp clock".  Values of this clock
667	   must be at least approximately proportional to real time, in
668	   order to measure actual RTT.

670	   These TSval values are echoed in TSecr values in the reverse
671	   direction.  The difference between a received TSecr value and the
672	   current timestamp clock value provides an RTT measurement.

674	   When timestamps are used, every segment that is received will contain
675	   a TSecr value; however, these values cannot all be used to update the
676	   measured RTT.  The following example illustrates why.  It shows a
677	   one-way data flow with segments arriving in sequence without loss.
678	   Here A, B, C... represent data blocks occupying successive blocks of
679	   sequence numbers, and ACK(A),... represent the corresponding
680	   cumulative acknowledgments.  The two timestamp fields of the
681	   Timestamps option are shown symbolically as <TSval=x,TSecr=y>.  Each
682	   TSecr field contains the value most recently received in a TSval
683	   field.

685	           TCP  A                                          TCP B

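687	                     <A,TSval=1,TSecr=120> ------>

689	               <---- <ACK(A),TSval=127,TSecr=1>

691	                     <B,TSval=5,TSecr=127> ------>

693	               <---- <ACK(B),TSval=131,TSecr=5>

695	             . . . . . . . . . . . . . . . . . . . . . .

697	                     <C,TSval=65,TSecr=131> ------>

699	               <---- <ACK(C),TSval=191,TSecr=65>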

701	                               (etc)

703	   The dotted line marks a pause (60 time units long) in which A had
704	   nothing to send.  Note that this pause inflates the RTT which B could
705	   infer from receiving TSecr=131 in data segment C. Thus, in one-way
706	   data flows, RTTM in the reverse direction measures a value that is
707	   inflated by gaps in sending data.  However, the following rule
708	   prevents a resulting inflation of the measured RTT:

710	      RTTM Rule: A TSecr value received in a segment is used to update
711	      the averaged RTT measurement only if

713	      a)  the segment acknowledges some new data, i.e., only if it
714	          advances the left edge of the send window, and

716	      b)  the segment does not indicate any loss or reordering, i.e.,
717	          it contains no SACK options

719	   Since TCP B is not sending data, the data segment C does not
720	   acknowledge any new data when it arrives at B. Thus, the inflated
721	   RTTM measurement is not used to update B's RTTM measurement.
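
The RTTM rule can be sketched in C as follows.  The function name and
parameters are hypothetical, and the sketch assumes that a segment
carrying SACK options is one that signals loss or reordering.  It
also assumes the caller has already verified that SEG.ACK does not
exceed SND.NXT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true and stores an RTT sample, in timestamp clock ticks, when
 * the RTTM rule allows the received TSecr to be used. */
bool rttm_sample(uint32_t now, uint32_t seg_tsecr,
                 uint32_t seg_ack, uint32_t snd_una,
                 bool has_sack, uint32_t *rtt_ticks)
{
    assert(rtt_ticks != NULL);

    /* (a) The ACK must advance the left edge of the send window.  The
     * comparison is modular (two's-complement conversion assumed). */
    bool advances = (int32_t)(seg_ack - snd_una) > 0;

    /* (b) The segment must not indicate loss or reordering. */
    if (!advances || has_sack)
        return false;

    /* Unsigned subtraction gives the right answer across clock wrap. */
    *rtt_ticks = now - seg_tsecr;
    return true;
}
```

In the example above, data segment C arrives at B with TSecr=131 but
acknowledges no new data, so the function returns false and the
inflated value never reaches B's estimator.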

723	   Implementors should note that with Timestamps multiple RTTMs can be
724	   taken per RTT.  Many RTO estimators have a weighting factor based on
725	   an implicit assumption that at most one RTTM will be gotten per RTT.
726	   When using multiple RTTMs per RTT to update the RTO estimator, the
727	   weighting factor needs to be decreased to take into account the more
728	   frequent RTTMs.  For example, an implementation could choose to just
729	   use one sample per RTT to update the RTO estimator, or vary the gain
730	   based on the congestion window, or take an average of all the RTTM
731	   measurements received over one RTT, and then use that value to update
732	   the RTO estimator.  This document does not prescribe any particular
733	   method for modifying the RTO estimator; the important point is that
734	   the implementation should do something more than just feeding
735	   additional RTTM samples from one RTT into the RTO estimator.

737	3.4.  Which Timestamp to Echo

739	   If more than one Timestamps option is received before a reply segment
740	   is sent, the TCP must choose only one of the TSvals to echo, ignoring
741	   the others.  To minimize the state kept in the receiver (i.e., the
742	   number of unprocessed TSvals), the receiver should be required to
743	   retain at most one timestamp in the connection control block.

745	   There are three situations to consider:

747	   (A)  Delayed ACKs.

749	        Many TCP's acknowledge only every Kth segment out of a group of
750	        segments arriving within a short time interval; this policy is
751	        known generally as "delayed ACKs".  The data-sender TCP must
752	        measure the effective RTT, including the additional time due to
753	        delayed ACKs, or else it will retransmit unnecessarily.  Thus,
754	        when delayed ACKs are in use, the receiver should reply with the
755	        TSval field from the earliest unacknowledged segment.

757	   (B)  A hole in the sequence space (segment(s) have been lost).

759	        The sender will continue sending until the window is filled, and
760	        the receiver may be generating ACKs as these out-of-order
761	        segments arrive (e.g., to aid "fast retransmit").

763	        The lost segment is probably a sign of congestion, and in that
764	        situation the sender should be conservative about
765	        retransmission.  Furthermore, it is better to overestimate than
766	        underestimate the RTT.  An ACK for an out-of-order segment
767	        should therefore contain the timestamp from the most recent
768	        segment that advanced the window.

770	        The same situation occurs if segments are re-ordered by the
771	        network.

773	   (C)  A filled hole in the sequence space.

775	        The segment that fills the hole represents the most recent
776	        measurement of the network characteristics.  On the other hand,
777	        an RTT computed from an earlier segment would probably include
778	        the sender's retransmit time-out, badly biasing the sender's
779	        average RTT estimate.  Thus, the timestamp from the latest
780	        segment (which filled the hole) must be echoed.

782	   An algorithm that covers all three cases is described in the
783	   following rules for Timestamps option processing on a synchronized
784	   connection:

786	   (1)  The connection state is augmented with two 32-bit slots:

788	        TS.Recent holds a timestamp to be echoed in TSecr whenever a
789	        segment is sent, and Last.ACK.sent holds the ACK field from the
790	        last segment sent.  Last.ACK.sent will equal RCV.NXT except when
791	        ACKs have been delayed.

793	   (2)  If:

795	            SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

797	        then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.

799	   (3)  When a TSopt is sent, its TSecr field is set to the current
800	        TS.Recent value.
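
Rules (1) through (3) can be sketched in C as follows.  The struct and
function names are hypothetical; the comparisons are done in modular
32-bit arithmetic, which assumes two's-complement conversion to
int32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rule (1): two 32-bit slots added to the connection state. */
struct ts_state {
    uint32_t ts_recent;     /* TS.Recent */
    uint32_t last_ack_sent; /* Last.ACK.sent */
};

/* a <= b in modular 32-bit space. */
static bool mod_leq(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) >= 0;
}

/* Rule (2): decide whether an arriving TSval replaces TS.Recent. */
void ts_on_segment(struct ts_state *c, uint32_t seg_tsval, uint32_t seg_seq)
{
    assert(c != NULL);
    if (mod_leq(c->ts_recent, seg_tsval) &&
        mod_leq(seg_seq, c->last_ack_sent))
        c->ts_recent = seg_tsval;
}

/* Rule (3): the TSecr value for any TSopt that is sent. */
uint32_t ts_echo(const struct ts_state *c)
{
    assert(c != NULL);
    return c->ts_recent;
}
```

The caller is expected to store the ACK field of each segment it
sends in last_ack_sent.  A segment that arrives beyond Last.ACK.sent,
as happens after a hole or while ACKs are delayed, leaves TS.Recent
unchanged.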

802	   The following examples illustrate these rules.  Here A, B, C...
803	   represent data segments occupying successive blocks of sequence
804	   numbers, and ACK(A),... represent the corresponding acknowledgment
805	   segments.  Note that ACK(A) has the same sequence number as B. We
806	   show only one direction of timestamp echoing, for clarity.

808	   o  Packets arrive in sequence, and some of the ACKs are delayed.

810	      By Case (A), the timestamp from the oldest unacknowledged segment
811	      is echoed.

813	                                                    TS.Recent
814	      <A, TSval=1> ------------------->
815	                                                        1
816	      <B, TSval=2> ------------------->
817	                                                        1
818	      <C, TSval=3> ------------------->
819	                                                        1
820	                           <---- <ACK(C), TSecr=1>
821	                  (etc)

823	   o  Packets arrive out of order, and every packet is acknowledged.

825	      By Case (B), the timestamp from the last segment that advanced the
826	      left window edge is echoed, until the missing segment arrives; it
827	      is echoed according to Case (C).  The same sequence would occur if
828	      segments B and D were lost and retransmitted.

830	                                                    TS.Recent
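831	      <A, TSval=1> ------------------->
832	                                                        1
833	                           <---- <ACK(A), TSecr=1>
834	                                                        1
835	      <C, TSval=3> ------------------->
836	                                                        1
837	                           <---- <ACK(A), TSecr=1>
838	                                                        1
839	      <B, TSval=2> ------------------->
840	                                                        2
841	                           <---- <ACK(C), TSecr=2>
842	                                                        2
843	      <E, TSval=5> ------------------->
844	                                                        2
845	                           <---- <ACK(C), TSecr=2>
846	                                                        2
847	      <D, TSval=4> ------------------->
848	                                                        4
849	                           <---- <ACK(E), TSecr=4>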
850	                  (etc)

852	4.  PAWS -- Protection Against Wrapped Sequence Numbers

854	4.1.  Introduction

856	   Section 4.2 describes a simple mechanism to reject old duplicate
857	   segments that might corrupt an open TCP connection; we call this
858	   mechanism PAWS (Protection Against Wrapped Sequence numbers).  PAWS
859	   operates within a single TCP connection, using state that is saved in
860	   the connection control block.  Section 4.3 and Appendix C discuss the
861	   implications of the PAWS mechanism for avoiding old duplicates from
862	   previous incarnations of the same connection.

864	4.2.  The PAWS Mechanism

866	   PAWS uses the same TCP Timestamps option as the RTTM mechanism
867	   described earlier, and assumes that every received TCP segment
868	   (including data and ACK segments) contains a timestamp SEG.TSval
869	   whose values are monotonically non-decreasing in time.  The basic
870	   idea is that a segment can be discarded as an old duplicate if it is
871	   received with a timestamp SEG.TSval less than some timestamp recently
872	   received on this connection.

874	   In both the PAWS and the RTTM mechanism, the "timestamps" are 32-bit
875	   unsigned integers in a modular 32-bit space.  Thus, "less than" is
876	   defined the same way it is for TCP sequence numbers, and the same
877	   implementation techniques apply.  If s and t are timestamp values,

879	                       s < t  if 0 < (t - s) < 2^31,

881	   computed in unsigned 32-bit arithmetic.
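
This comparison can be written directly in C.  The function name
ts_lt is hypothetical; the body follows the definition above.

```c
#include <stdbool.h>
#include <stdint.h>

/* s < t in modular 32-bit timestamp space: 0 < (t - s) < 2^31. */
bool ts_lt(uint32_t s, uint32_t t)
{
    uint32_t d = t - s; /* unsigned subtraction wraps modulo 2^32 */
    return d > 0 && d < 0x80000000u;
}
```

Under this definition, a timestamp taken shortly before the clock
wraps, such as 0xFFFFFFF0, is less than one taken shortly after, such
as 0x00000010.  Two values exactly 2^31 apart are not ordered in
either direction.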

883	   The choice of incoming timestamps to be saved for this comparison
884	   must guarantee a value that is monotonically increasing.  For
885	   example, we might save the timestamp from the segment that last
886	   advanced the left edge of the receive window, i.e., the most recent
887	   in-sequence segment.  Instead, we choose the value TS.Recent
888	   introduced in Section 3.4 for the RTTM mechanism, since using a
889	   common value for both PAWS and RTTM simplifies the implementation of
890	   both.  As Section 3.4 explained, TS.Recent differs from the timestamp
891	   from the last in-sequence segment only in the case of delayed ACKs,
892	   and therefore by less than one window.  Either choice will therefore
893	   protect against sequence number wrap-around.

895	   RTTM was specified in a symmetrical manner, so that TSval timestamps
896	   are carried in both data and ACK segments and are echoed in TSecr
897	   fields carried in returning ACK or data segments.  PAWS submits all
898	   incoming segments to the same test, and therefore protects against
899	   duplicate ACK segments as well as data segments.  (An alternative
900	   non-symmetric algorithm would protect against old duplicate ACKs: the
901	   sender of data would reject incoming ACK segments whose TSecr values
902	   were less than the TSecr saved from the last segment whose ACK field
903	   advanced the left edge of the send window.  This algorithm was deemed
904	   to lack economy of mechanism and symmetry.)

906	   TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
907	   initialize PAWS.  PAWS protects against old duplicate non-SYN
908	   segments, and duplicate SYN segments received while there is a
909	   synchronized connection.  Duplicate <SYN> and <SYN,ACK> segments
910	   received when there is no connection will be discarded by the normal
911	   3-way handshake and sequence number checks of TCP.

913	   RFC 1323 recommended that RST segments NOT carry timestamps, and that
914	   they be acceptable regardless of their timestamp.  At that time, the
915	   thinking was that old duplicate RST segments should be exceedingly
916	   unlikely, and their cleanup function should take precedence over
917	   timestamps.  More recently, discussions about various blind attacks
918	   on TCP connections have raised the suggestion that if the Timestamps
919	   option is present, SEG.TSecr could be used to provide stricter
920	   acceptance tests for RST packets.  While this is still under
921	   discussion, to enable research into this area it is now
922	   recommended that a RST carry a Timestamps option whenever the
923	   packet that caused the RST to be generated contained one.  In the
924	   RST segment, SEG.TSecr should be set to SEG.TSval
925	   from the incoming packet and SEG.TSval should be set to zero.  If a
926	   RST is being generated because of a user abort, and Snd.TS.OK is set,
927	   then a Timestamps option should be included in the RST.  When a RST
928	   packet is received, it must not be subjected to PAWS checks, and
929	   information from the Timestamps option must not be used to update
930	   connection state information.  SEG.TSecr may be used to provide
931	   stricter RST acceptance checks.

933	4.2.1.  Basic PAWS Algorithm

935	   The PAWS algorithm requires the following processing to be performed
936	   on all incoming segments for a synchronized connection:

938	   R1)  If there is a Timestamps option in the arriving segment,
939	        SEG.TSval < TS.Recent, TS.Recent is valid (see later discussion)
940	        and the RST bit is not set, then treat the arriving segment as
941	        not acceptable:

943	           Send an acknowledgement in reply as specified in RFC 793 page
944	           69 and drop the segment.

946	           Note: it is necessary to send an ACK segment in order to
947	           retain TCP's mechanisms for detecting and recovering from
948	           half-open connections.  For example, see Figure 10 of RFC
949	           793.

951	   R2)  If the segment is outside the window, reject it (normal TCP
952	        processing)

954	   R3)  If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
955	        Section 3.4), then record its timestamp in TS.Recent.

957	   R4)  If an arriving segment is in-sequence (i.e., at the left window
958	        edge), then accept it normally.

960	   R5)  Otherwise, treat the segment as a normal in-window, out-of-
961	        sequence TCP segment (e.g., queue it for later delivery to the
962	        user).

964	   Steps R2, R4, and R5 are the normal TCP processing steps specified by
965	   RFC 793.
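
The two steps that PAWS adds to normal processing, R1 and R3, can be
sketched in C as follows.  The struct, the function names, and the
ts_recent_valid flag are hypothetical; R2, R4, and R5 are left to the
surrounding TCP input code and are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct paws_state {
    uint32_t ts_recent;       /* TS.Recent */
    bool     ts_recent_valid; /* false until set, or once invalidated */
    uint32_t last_ack_sent;   /* Last.ACK.sent */
};

/* s < t in modular 32-bit timestamp space. */
static bool ts_lt(uint32_t s, uint32_t t)
{
    uint32_t d = t - s;
    return d > 0 && d < 0x80000000u;
}

/* R1: true means "not acceptable": send an ACK and drop the segment. */
bool paws_r1_reject(const struct paws_state *c, bool has_tsopt, bool rst,
                    uint32_t seg_tsval)
{
    assert(c != NULL);
    return has_tsopt && !rst && c->ts_recent_valid &&
           ts_lt(seg_tsval, c->ts_recent);
}

/* R3: called only for segments that passed R1 and the window check R2.
 * Records the timestamp if SEG.SEQ <= Last.ACK.sent (modular compare,
 * two's-complement conversion assumed). */
void paws_r3_record(struct paws_state *c, uint32_t seg_tsval,
                    uint32_t seg_seq)
{
    assert(c != NULL);
    if ((int32_t)(c->last_ack_sent - seg_seq) >= 0) {
        c->ts_recent = seg_tsval;
        c->ts_recent_valid = true;
    }
}
```

Because R1 runs once, when a segment first arrives, a segment that is
queued out of sequence is never re-examined when TS.Recent later
advances.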

967	   It is important to note that the timestamp is checked only when a
968	   segment first arrives at the receiver, regardless of whether it is
969	   in-sequence or it must be queued for later delivery.

971	   Consider the following example.

973	      Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been
974	      sent, where the letter indicates the sequence number and the digit
975	      represents the timestamp.  Suppose also that segment B.1 has been
976	      lost.  The timestamp in TS.Recent is 1 (from A.1), so C.1, ...,
977	      Z.1 are considered acceptable and are queued.  When B is
978	      retransmitted as segment B.2 (using the latest timestamp), it
979	      fills the hole and causes all the segments through Z to be
980	      acknowledged and passed to the user.  The timestamps of the queued
981	      segments are *not* inspected again at this time, since they have
982	      already been accepted.  When B.2 is accepted, TS.Recent is set to
983	      2.

985	   This rule allows reasonable performance under loss.  A full window of
986	   data is in transit at all times, and after a loss a full window less
987	   one packet will show up out-of-sequence to be queued at the receiver
988	   (e.g., up to ~2^30 bytes of data); the timestamp option must not
989	   result in discarding this data.

991	   In certain unlikely circumstances, the algorithm of rules R1-R5 could
992	   lead to discarding some segments unnecessarily, as shown in the
993	   following example:

995	      Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
996	      sent in sequence and that segment B.1 has been lost.  Furthermore,
997	      suppose delivery of some of C.1, ...  Z.1 is delayed until AFTER
998	      the retransmission B.2 arrives at the receiver.  These delayed
999	      segments will be discarded unnecessarily when they do arrive,
1000	      since their timestamps are now out of date.

1002	   This case is very unlikely to occur.  If the retransmission was
1003	   triggered by a timeout, some of the segments C.1, ...  Z.1 must have
1004	   been delayed longer than the RTO time.  This is presumably an
1005	   unlikely event, or there would be many spurious timeouts and
1006	   retransmissions.  If B's retransmission was triggered by the "fast
1007	   retransmit" algorithm, i.e., by duplicate ACKs, then the queued
1008	   segments that caused these ACKs must have been received already.

1010	   Even if a segment were delayed past the RTO, the Fast Retransmit
1011	   mechanism [Jacobson90c] will cause the delayed packets to be
1012	   retransmitted at the same time as B.2, avoiding an extra RTT and
1013	   therefore causing a very small performance penalty.

1015	   We know of no case with a significant probability of occurrence in
1016	   which timestamps will cause performance degradation by unnecessarily
1017	   discarding segments.

1019	4.2.2.  Timestamp Clock

1021	   It is important to understand that the PAWS algorithm does not
1022	   require clock synchronization between sender and receiver.  The
1023	   sender's timestamp clock is used to stamp the segments, and the
1024	   sender uses the echoed timestamp to measure RTTs.  However, the
1025	   receiver treats the timestamp as simply a monotonically increasing
1026	   serial number, without any necessary connection to its clock.  From
1027	   the receiver's viewpoint, the timestamp is acting as a logical
1028	   extension of the high-order bits of the sequence number.

1030	   The receiver algorithm does place some requirements on the frequency
1031	   of the timestamp clock.

1033	   (a)  The timestamp clock must not be "too slow".

1035	        It must tick at least once for each 2^31 bytes sent.  In fact,
1036	        in order to be useful to the sender for round trip timing, the
1037	        clock should tick at least once per window's worth of data, and
1038	        even with the window extension defined in Section 2.2, 2^31
1039	        bytes must be at least two windows.

1041	        To make this more quantitative, any clock faster than 1 tick/sec
1042	        will reject old duplicate segments for link speeds of ~8 Gbps.
1043	        A 1 ms timestamp clock will work at link speeds up to 8 Tbps
1044	        (8*10^12 bps)!

1046	   (b)  The timestamp clock must not be "too fast".

1048	        Its recycling time must be greater than MSL seconds.  Since the
1049	        clock (timestamp) is 32 bits and the worst-case MSL is 255
1050	        seconds, the maximum acceptable clock frequency is one tick
1051	        every 59 ns.

1053	        However, it is desirable to establish a much longer recycle
1054	        period, in order to handle outdated timestamps on idle
1055	        connections (see Section 4.2.3), and to relax the MSL
1056	        requirement for preventing sequence number wrap-around.  With a
1057	        1 ms timestamp clock, the 32-bit timestamp will wrap its sign
1058	        bit in 24.8 days.  Thus, it will reject old duplicates on the
1059	        same connection if MSL is 24.8 days or less.  This appears to be
1060	        a very safe figure; an MSL of 24.8 days or longer can probably
1061	        be assumed in the internet without requiring precise MSL
1062	        enforcement.

1064	   Based upon these considerations, we choose a timestamp clock
1065	   frequency in the range 1 ms to 1 sec per tick.  This range also
1066	   matches the requirements of the RTTM mechanism, which does not need
1067	   much more resolution than the granularity of the retransmit timer,
1068	   e.g., tens or hundreds of milliseconds.
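
The figures quoted in (b) can be checked with a few lines of C.  The
function names are hypothetical; the arithmetic is simply 2^31 ticks
for the sign-bit wrap and 2^32 ticks for a full recycle.

```c
#include <assert.h>

/* Days until a 32-bit timestamp wraps its sign bit (2^31 ticks). */
double ts_sign_wrap_days(double tick_seconds)
{
    assert(tick_seconds > 0.0);
    return 2147483648.0 * tick_seconds / 86400.0;
}

/* Shortest tick, in seconds, for which the full 2^32-tick recycle time
 * is still no less than msl_seconds. */
double ts_min_tick_seconds(double msl_seconds)
{
    assert(msl_seconds > 0.0);
    return msl_seconds / 4294967296.0;
}
```

A 1 ms tick gives a sign-bit wrap time of about 24.86 days and a 1 sec
tick gives about 24,855 days, which the text rounds to 24.8 and 24800.
An MSL of 255 seconds gives a minimum tick of roughly 59.4 ns.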

1070	   The PAWS mechanism also puts a strong monotonicity requirement on the
1071	   sender's timestamp clock.  The method of implementation of the
1072	   timestamp clock to meet this requirement depends upon the system
1073	   hardware and software.

1075	   o  Some hosts have a hardware clock that is guaranteed to be
1076	      monotonic between hardware resets.

1078	   o  A clock interrupt may be used to simply increment a binary integer
1079	      by 1 periodically.

1081	   o  The timestamp clock may be derived from a system clock that is
1082	      subject to being abruptly changed, by adding a variable offset
1083	      value.  This offset is initialized to zero.  When a new timestamp
1084	      clock value is needed, the offset can be adjusted as necessary to
1085	      make the new value equal to or larger than the previous value
1086	      (which was saved for this purpose).
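
The offset technique in the last item can be sketched in C as follows.
The struct and function names are hypothetical, and sys_ticks stands
for whatever system clock reading the host provides, already
converted to timestamp clock ticks.  The sketch treats any apparent
backward step of less than 2^31 ticks as a clock change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ts_clock {
    uint32_t offset;    /* variable offset, initialized to zero */
    uint32_t last;      /* previous timestamp clock value */
    bool     have_last; /* false until the first reading */
};

/* Returns a timestamp clock value that never goes backwards, even if
 * the underlying system clock is abruptly set back. */
uint32_t ts_clock_read(struct ts_clock *k, uint32_t sys_ticks)
{
    assert(k != NULL);
    uint32_t v = sys_ticks + k->offset;

    /* Modular "v < last" (two's-complement conversion assumed). */
    if (k->have_last && (int32_t)(v - k->last) < 0) {
        /* Grow the offset so the new value equals the previous one. */
        k->offset += k->last - v;
        v = k->last;
    }
    k->last = v;
    k->have_last = true;
    return v;
}
```

If the system clock reads 1010 and is then set back to 500, the next
reading still returns 1010 and the offset becomes 510; a later system
reading of 505 returns 1015.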

1088	4.2.3.  Outdated Timestamps

1090	   If a connection remains idle long enough for the timestamp clock of
1091	   the other TCP to wrap its sign bit, then the value saved in TS.Recent
1092	   will become too old; as a result, the PAWS mechanism will cause all
1093	   subsequent segments to be rejected, freezing the connection (until
1094	   the timestamp clock wraps its sign bit again).

1096	   With the chosen range of timestamp clock frequencies (1 sec to 1 ms),
1097	   the time to wrap the sign bit will be between 24.8 days and 24800
1098	   days.  A TCP connection that is idle for more than 24 days and then
1099	   comes to life is exceedingly unusual.  However, it is undesirable in
1100	   principle to place any limitation on TCP connection lifetimes.

1102	   We therefore require that an implementation of PAWS include a
1103	   mechanism to "invalidate" the TS.Recent value when a connection is
1104	   idle for more than 24 days.  (An alternative solution to the problem
1105	   of outdated timestamps would be to send keep-alive segments at a very
1106	   low rate, but still more often than the wrap-around time for
1107	   timestamps, e.g., once a day.  This would impose negligible overhead.
1108	   However, the TCP specification has never included keep-alives, so the
1109	   solution based upon invalidation was chosen.)

1111	   Note that a TCP does not know the frequency, and therefore, the
1112	   wraparound time, of the other TCP, so it must assume the worst.  The
1113	   validity of TS.Recent needs to be checked only if the basic PAWS
1114	   timestamp check fails, i.e., only if SEG.TSval < TS.Recent.  If
1115	   TS.Recent is found to be invalid, then the segment is accepted,
1116	   regardless of the failure of the timestamp check, and rule R3 updates
1117	   TS.Recent with the TSval from the new segment.

1119	   To detect how long the connection has been idle, the TCP may update a
1120	   clock or timestamp value associated with the connection whenever
1121	   TS.Recent is updated, for example.  The details will be
1122	   implementation-dependent.
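
One way to implement the invalidation is sketched below in C.  The
struct, the function names, and the choice of a 64-bit local clock in
seconds are assumptions made for this illustration; only the 24-day
limit and the order of the checks come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAWS_IDLE_SECS ((uint64_t)24 * 24 * 60 * 60) /* 24 days */

struct paws_idle {
    uint32_t ts_recent;     /* TS.Recent */
    uint64_t ts_recent_age; /* local time, in seconds, of last update */
};

/* s < t in modular 32-bit timestamp space. */
static bool ts_lt(uint32_t s, uint32_t t)
{
    uint32_t d = t - s;
    return d > 0 && d < 0x80000000u;
}

/* True if the segment should be rejected by PAWS.  The validity of
 * TS.Recent is consulted only when the basic timestamp check fails. */
bool paws_reject(const struct paws_idle *c, uint32_t seg_tsval,
                 uint64_t now_secs)
{
    assert(c != NULL);
    if (!ts_lt(seg_tsval, c->ts_recent))
        return false; /* basic check passed */
    /* TS.Recent older than 24 days is invalid: accept the segment. */
    return now_secs - c->ts_recent_age <= PAWS_IDLE_SECS;
}
```

The caller is expected to set ts_recent_age to the current local time
whenever rule R3 updates TS.Recent.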

1124	4.2.4.  Header Prediction

1126	   "Header prediction" [Jacobson90a] is a high-performance transport
1127	   protocol implementation technique that is most important for high-
1128	   speed links.  This technique optimizes the code for the most common
1129	   case, receiving a segment correctly and in order.  Using header
1130	   prediction, the receiver asks the question, "Is this segment the next
1131	   in sequence?"  This question can be answered in fewer machine
1132	   instructions than the question, "Is this segment within the window?"

1134	   Adding header prediction to our timestamp procedure leads to the
1135	   following recommended sequence for processing an arriving TCP
1136	   segment:

1138	   H1)  Check timestamp (same as step R1 above)

1140	   H2)  Do header prediction: if segment is next in sequence and if
1141	        there are no special conditions requiring additional processing,
1142	        accept the segment, record its timestamp, and skip H3.

1144	   H3)  Process the segment normally, as specified in RFC 793.  This
1145	        includes dropping segments that are outside the window and
1146	        possibly sending acknowledgments, and queueing in-window, out-
1147	        of-sequence segments.
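
	   The ordering of steps H1 through H3 can be sketched as follows.
	   This is a simplified illustration, not an implementation: it
	   assumes every segment carries a Timestamps option, omits the
	   idle-connection test on TS.Recent, and reduces each outcome to
	   a label.  The function and parameter names are hypothetical.

```python
def classify_segment(seg_seq, seg_tsval, rcv_nxt, ts_recent, special=False):
    """Return which step of H1-H3 disposes of an arriving segment."""
    # H1: timestamp check, using modular 32-bit comparison.
    if 0 < (ts_recent - seg_tsval) % 2 ** 32 < 2 ** 31:
        return "unacceptable"
    # H2: header prediction; next in sequence and nothing special pending.
    if seg_seq == rcv_nxt and not special:
        return "fast path"
    # H3: normal RFC 793 processing (window checks, queueing, ACKs).
    return "normal"
```

	   For example, classify_segment(1000, 50, 1000, 40) takes the fast
	   path, while the same segment with a timestamp of 30 is rejected
	   in step H1 before header prediction is attempted.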

1149	   Another possibility would be to interchange steps H1 and H2, i.e., to
1150	   perform the header prediction step H2 FIRST, and perform H1 and H3
1151	   only when header prediction fails.  This could be a performance
1152	   improvement, since the timestamp check in step H1 is very unlikely to
1153	   fail, and it requires unsigned modulo arithmetic.  To perform this
1154	   check on every single segment is contrary to the philosophy of header
1155	   prediction.  We believe that this change might produce a measurable
1156	   reduction in CPU time for TCP protocol processing on high-speed
1157	   networks.

1159	   However, putting H2 first would create a hazard: a segment from 2^32
1160	   bytes in the past might arrive at exactly the wrong time and be
1161	   accepted mistakenly by the header-prediction step.  The following
1162	   reasoning has been introduced in [RFC1185] to show that the
1163	   probability of this failure is negligible.

1165	      If all segments are equally likely to show up as old duplicates,
1166	      then the probability of an old duplicate exactly matching the left
1167	      window edge is the maximum segment size (MSS) divided by the size
1168	      of the sequence space.  This ratio must be less than 2^-16, since
1169	      MSS must be < 2^16; for example, it will be (2^12)/(2^32) = 2^-20
1170	      for a FDDI link.  However, the older a segment is, the less likely
1171	      it is to be retained in the Internet, and under any reasonable
1172	      model of segment lifetime the probability of an old duplicate
1173	      exactly at the left window edge must be much smaller than 2^-16.

1175	      The 16 bit TCP checksum also allows a basic unreliability of one
1176	      part in 2^16.  A protocol mechanism whose reliability exceeds the
1177	      reliability of the TCP checksum should be considered "good
1178	      enough", i.e., it won't contribute significantly to the overall
1179	      error rate.  We therefore believe we can ignore the problem of an
1180	      old duplicate being accepted by doing header prediction before
1181	      checking the timestamp.
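
	   The arithmetic in the quoted argument can be reproduced with a
	   short Python sketch.  It uses exact fractions to avoid rounding;
	   the function name is illustrative only.

```python
from fractions import Fraction

SEQUENCE_SPACE = 2 ** 32


def left_edge_match_probability(mss):
    """Upper bound from the quoted argument: MSS / size of sequence space."""
    return Fraction(mss, SEQUENCE_SPACE)


fddi = left_edge_match_probability(2 ** 12)           # equals 2**-20
largest = left_edge_match_probability(2 ** 16 - 1)    # still below 2**-16
```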

1183	   However, this probabilistic argument is not universally accepted, and
1184	   the consensus at present is that the performance gain does not
1185	   justify the hazard in the general case.  It is therefore recommended
1186	   that H2 follow H1.

1188	4.2.5.  IP Fragmentation

1190	   At high data rates, the protection against old packets provided by
1191	   PAWS can be circumvented by errors in IP fragment reassembly (see
1193	   [RFC4963]).  The only way to protect against incorrect IP fragment
1194	   reassembly is to not allow the packets to be fragmented.  This is
1195	   done by setting the Don't Fragment (DF) bit in the IP header.
1196	   Setting the DF bit implies the use of Path MTU Discovery as described
1197	   in [RFC1191], [RFC1981], and [RFC4821]; thus, any TCP implementation
1198	   that implements PAWS must also implement Path MTU Discovery.

1200	4.3.  Duplicates from Earlier Incarnations of Connection

1202	   The PAWS mechanism protects against errors due to sequence number
1203	   wrap-around on high-speed connections.  Segments from an earlier
1204	   incarnation of the same connection are also a potential cause of old
1205	   duplicate errors.  In both cases, the TCP mechanisms to prevent such
1206	   errors depend upon the enforcement of a maximum segment lifetime
1207	   (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a
1208	   detailed discussion).  Unlike the case of sequence space wrap-around,
1209	   the MSL required to prevent old duplicate errors from earlier
1210	   incarnations does not depend upon the transfer rate.  If the IP layer
1211	   enforces the recommended 2 minute MSL of TCP, and if the TCP rules
1212	   are followed, TCP connections will be safe from earlier incarnations,
1213	   no matter how high the network speed.  Thus, the PAWS mechanism is
1214	   not required for this case.

1216	   We may still ask whether the PAWS mechanism can provide additional
1217	   security against old duplicates from earlier connections, allowing us
1218	   to relax the enforcement of MSL by the IP layer.  Appendix B explores
1219	   this question, showing that further assumptions and/or mechanisms are
1220	   required, beyond those of PAWS.  This is not part of the current
1221	   extension.

1223	5.  Conclusions and Acknowledgements

1225	   This memo presented a set of extensions to TCP to provide efficient
1226	   operation over large-bandwidth*delay-product paths and reliable
1227	   operation over very high-speed paths.  These extensions are designed
1228	   to provide compatible interworking with TCP's that do not implement
1229	   the extensions.

1231	   These mechanisms are implemented using new TCP options for scaled
1232	   windows and timestamps.  The timestamps are used for two distinct
1233	   mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protection
1234	   Against Wrapped Sequences).

1236	   The Window Scale option was originally suggested by Mike St. Johns of
1237	   USAF/DCA.  The present form of the option was suggested by Mike
1238	   Karels of UC Berkeley in response to a more cumbersome scheme defined
1239	   by Van Jacobson.  Lixia Zhang helped formulate the PAWS mechanism
1240	   description in RFC 1185.

1242	   Finally, much of this work originated as the result of discussions
1243	   within the End-to-End Task Force on the theoretical limitations of
1244	   transport protocols in general and TCP in particular.  Task force
1245	   members and others on the end2end-interest list have made valuable
1246	   contributions by pointing out flaws in the algorithms and the
1247	   documentation.  Continued discussion and development since the
1248	   publication of RFC 1323 originally occurred in the IETF TCP Large
1249	   Windows Working Group, later on in the End-to-End Task Force, and
1250	   most recently in the IETF TCP Maintenance Working Group.  The authors
1251	   are grateful for all these contributions.

1253	6.  Security Considerations

1255	   The TCP sequence space is a fixed size, and as the window becomes
1256	   larger it becomes easier for an attacker to generate forged packets
1257	   that can fall within the TCP window, and be accepted as valid
1258	   packets.  Timestamps and PAWS can help to mitigate this.  However,
1259	   with PAWS in use, if an attacker is able to forge a packet that is
1260	   acceptable to the TCP connection and that carries a timestamp in the
1261	   future, subsequent valid packets would be dropped by the PAWS
1262	   check.  Hence, implementors should take care not to open the TCP
1263	   window drastically beyond the requirements of the connection.

1265	   Middle boxes and options: If a middle box removes TCP options from
1266	   the SYN, such as TSopt, a high speed connection that needs PAWS would
1267	   not have that protection.  In this situation, an implementor could
1268	   provide a mechanism for the application to determine whether or not
1269	   PAWS is in use on the connection, and choose to terminate the
1270	   connection if that protection doesn't exist.

1272	   Mechanisms to protect the TCP header from modification should also
1273	   protect the TCP options.

1275	   Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
1276	   [RFC2675] to be used when the local network supports packets larger
1277	   than 64K. When larger TCP packets are used, the TCP checksum becomes
1278	   weaker.

1280	7.  IANA Considerations

1282	   This document has no actions for IANA.

1284	8.  References
1285	8.1.  Normative References

1287	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1288	              RFC 793, September 1981.

1290	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
1291	              November 1990.

1293	8.2.  Informative References

1295	   [Garlick77]
1296	              Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
1297	              Host-to-Host Protocols", Proc. Second Berkeley Workshop on
1298	              Distributed Data Management and Computer Networks,
1299	              May 1977.

1301	   [Hamming77]
1302	              Hamming, R., "Digital Filters", Prentice Hall, Englewood
1303	              Cliffs, N.J. ISBN 0-13-212571-4, 1977.

1305	   [Jacobson88a]
1306	              Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
1307	              '88, Stanford, CA, August 1988.

1310	   [Jacobson90a]
1311	              Jacobson, V., "4BSD Header Prediction", ACM Computer
1312	              Communication Review, April 1990.

1314	   [Jacobson90c]
1315	              Jacobson, V., "Modified TCP congestion avoidance
1316	              algorithm", Message to the end2end-interest mailing list,
1317	              April 1990.

1320	   [Jain86]   Jain, R., "Divergence of Timeout Algorithms for Packet
1321	              Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
1322	              Comm., Scottsdale, Arizona, March 1986.

1325	   [Karn87]   Karn, P. and C. Partridge, "Estimating Round-Trip Times in
1326	              Reliable Transport Protocols", Proc. SIGCOMM '87,
1327	              August 1987.

1329	   [Martin03]
1330	              Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
1331	              mailing list, September 2003.

1334	   [Mathis08]
1335	              Mathis, M., "[tcpm] Example of 1323 window retraction
1336	              problem", Message to the tcpm mailing list, March 2008.

1340	   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
1341	              RFC 896, January 1984.

1343	   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
1344	              paths", RFC 1072, October 1988.

1346	   [RFC1110]  McKenzie, A., "Problem with the TCP big window option",
1347	              RFC 1110, August 1989.

1349	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
1350	              Communication Layers", STD 3, RFC 1122, October 1989.

1352	   [RFC1185]  Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
1353	              High-Speed Paths", RFC 1185, October 1990.

1355	   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
1356	              for High Performance", RFC 1323, May 1992.

1358	   [RFC1981]  McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
1359	              for IP version 6", RFC 1981, August 1996.

1361	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
1362	              Selective Acknowledgment Options", RFC 2018, October 1996.

1364	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
1365	              Control", RFC 2581, April 1999.

1367	   [RFC2675]  Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
1368	              RFC 2675, August 1999.

1370	   [RFC2883]  Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
1371	              Extension to the Selective Acknowledgement (SACK) Option
1372	              for TCP", RFC 2883, July 2000.

1374	   [RFC3517]  Blanton, E., Allman, M., Fall, K., and L. Wang, "A
1375	              Conservative Selective Acknowledgment (SACK)-based Loss
1376	              Recovery Algorithm for TCP", RFC 3517, April 2003.

1378	   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
1379	              Discovery", RFC 4821, March 2007.

1381	   [RFC4963]  Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
1382	              Errors at High Data Rates", RFC 4963, July 2007.

1384	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
1385	              Control", RFC 5681, September 2009.

1387	   [Watson81]
1388	              Watson, R., "Timer-based Mechanisms in Reliable Transport
1389	              Protocol Connection Management", Computer Networks, Vol.
1390	              5, 1981.

1392	   [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
1393	              '86, Stowe, VT, August 1986.

1395	Appendix A.  Implementation Suggestions

1397	   TCP Option Layout

1399	      The following layout is recommended for sending options on non-SYN
1400	      segments, to achieve maximum feasible alignment on 32-bit and
1401	      64-bit machines.

1403	                   +--------+--------+--------+--------+
1404	                   |   NOP  |  NOP   |  TSopt |   10   |
1405	                   +--------+--------+--------+--------+
1406	                   |           TSval  timestamp        |
1407	                   +--------+--------+--------+--------+
1408	                   |           TSecr  timestamp        |
1409	                   +--------+--------+--------+--------+
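
	      A minimal Python sketch of building this 12-byte layout is
	      given below.  It assumes the standard option kind values
	      (1 for NOP, 8 for Timestamps); the function name is
	      illustrative only.

```python
import struct

TCPOPT_NOP = 1          # No-Operation option kind
TCPOPT_TIMESTAMP = 8    # Timestamps option kind
TCPOLEN_TIMESTAMP = 10  # Timestamps option length in bytes


def pack_tsopt(tsval, tsecr):
    """Return NOP, NOP, TSopt, 10, TSval, TSecr in network byte order."""
    return struct.pack(
        "!BBBBII",
        TCPOPT_NOP,
        TCPOPT_NOP,
        TCPOPT_TIMESTAMP,
        TCPOLEN_TIMESTAMP,
        tsval & 0xFFFFFFFF,
        tsecr & 0xFFFFFFFF,
    )
```

	      The two leading NOPs place TSval and TSecr on 4-byte
	      boundaries within the option block.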

1411	   Interaction with the TCP Urgent Pointer

1413	      The TCP Urgent pointer, like the TCP window, is a 16 bit value.
1414	      Some of the original discussion for the TCP Window Scale option
1415	      included proposals to increase the Urgent pointer to 32 bits.  As
1416	      it turns out, this is unnecessary.  There are two observations
1417	      that should be made:

1419	      (1)  With IP Version 4, the largest amount of TCP data that can be
1420	           sent in a single packet is 65495 bytes (64K - 1, minus the
1421	           size of the fixed IP and TCP headers).

1423	      (2)  Updates to the urgent pointer while the user is in "urgent
1424	           mode" are invisible to the user.

1426	      This means that if the Urgent Pointer points beyond the end of the
1427	      TCP data in the current packet, then the user will remain in
1428	      urgent mode until the next TCP packet arrives.  That packet will
1429	      update the urgent pointer to a new offset, and the user will never
1430	      have left urgent mode.

1432	      Thus, to properly implement the Urgent Pointer, the sending TCP
1433	      only has to check for overflow of the 16 bit Urgent Pointer field
1434	      before filling it in.  If it does overflow, then a value of 65535
1435	      should be inserted into the Urgent Pointer.
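
	      The overflow check amounts to a simple clamp, sketched here
	      in Python.  The function name and the urgent_offset
	      parameter (the offset the sender would like to place in the
	      field) are illustrative only.

```python
URGENT_POINTER_MAX = 0xFFFF  # largest value the 16-bit field can hold


def urgent_pointer_field(urgent_offset):
    """Value to place in the 16-bit Urgent Pointer field."""
    return min(urgent_offset, URGENT_POINTER_MAX)
```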

1437	      The same technique applies to IP Version 6, except in the case of
1438	      IPv6 Jumbograms.  When IPv6 Jumbograms are supported, [RFC2675]
1439	      requires additional steps for dealing with the Urgent Pointer;
1440	      these are described in section 5.2 of [RFC2675].

1442	Appendix B.  Duplicates from Earlier Connection Incarnations

1444	   There are two cases to be considered: (1) a system crashing (and
1445	   losing connection state) and restarting, and (2) the same connection
1446	   being closed and reopened without a loss of host state.  These will
1447	   be described in the following two sections.

1449	B.1.  System Crash with Loss of State

1451	   TCP's quiet time of one MSL upon system startup handles the loss of
1452	   connection state in a system crash/restart.  For an explanation, see
1453	   for example "When to Keep Quiet" in the TCP protocol specification
1454	   [RFC0793].  The MSL that is required here does not depend upon the
1455	   transfer speed.  The current TCP MSL of 2 minutes seems acceptable as
1456	   an operational compromise, as many host systems take this long to
1457	   boot after a crash.

1459	   However, the timestamp option may be used to ease the MSL
1460	   requirements (or to provide additional security against data
1461	   corruption).  If timestamps are being used and if the timestamp clock
1462	   can be guaranteed to be monotonic over a system crash/restart, i.e.,
1463	   if the first value of the sender's timestamp clock after a crash/
1464	   restart can be guaranteed to be greater than the last value before
1465	   the restart, then a quiet time will be unnecessary.

1467	   To dispense totally with the quiet time would require that the host
1468	   clock be synchronized to a time source that is stable over the crash/
1469	   restart period, with an accuracy of one timestamp clock tick or
1470	   better.  We can back off from this strict requirement to take
1471	   advantage of approximate clock synchronization.  Suppose that the
1472	   clock is always re-synchronized to within N timestamp clock ticks and
1473	   that booting (extended with a quiet time, if necessary) takes more
1474	   than N ticks.  This will guarantee monotonicity of the timestamps,
1475	   which can then be used to reject old duplicates even without an
1476	   enforced MSL.

1478	B.2.  Closing and Reopening a Connection

1480	   When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
1481	   ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
1482	   Applications built upon TCP that close one connection and open a new
1483	   one (e.g., an FTP data transfer connection using Stream mode) must
1484	   choose a new socket pair each time.  The TIME-WAIT delay serves two
1485	   different purposes:

1487	   (a)  Implement the full-duplex reliable close handshake of TCP.

1489	        The proper time to delay the final close step is not really
1490	        related to the MSL; it depends instead upon the RTO for the FIN
1491	        segments and therefore upon the RTT of the path.  (It could be
1492	        argued that the side that is sending a FIN knows what degree of
1493	        reliability it needs, and therefore it should be able to
1494	        determine the length of the TIME-WAIT delay for the FIN's
1495	        recipient.  This could be accomplished with an appropriate TCP
1496	        option in FIN segments.)

1498	        Although there is no formal upper-bound on RTT, common network
1499	        engineering practice makes an RTT greater than 1 minute very
1500	        unlikely.  Thus, the 4 minute delay in TIME-WAIT state works
1501	        satisfactorily to provide a reliable full-duplex TCP close.
1502	        Note again that this is independent of MSL enforcement and
1503	        network speed.

1505	        The TIME-WAIT state could cause an indirect performance problem
1506	        if an application needed to repeatedly close one connection and
1507	        open another at a very high frequency, since the number of
1508	        available TCP ports on a host is less than 2^16.  However, high
1509	        network speeds are not the major contributor to this problem;
1510	        the RTT is the limiting factor in how quickly connections can be
1511	        opened and closed.  Therefore, this problem will be no worse at
1512	        high transfer speeds.

1514	   (b)  Allow old duplicate segments to expire.

1516	        To replace this function of TIME-WAIT state, a mechanism would
1517	        have to operate across connections.  PAWS is defined strictly
1518	        within a single connection; the last timestamp (TS.Recent) is
1519	        kept in the connection control block, and discarded when a
1520	        connection is closed.

1522	        An additional mechanism could be added to the TCP: a per-host
1523	        cache of the last timestamp received from any connection.  This
1524	        value could then be used in the PAWS mechanism to reject old
1525	        duplicate segments from earlier incarnations of the connection,
1526	        if the timestamp clock can be guaranteed to have ticked at least
1527	        once since the old connection was open.  This would require that
1528	        the TIME-WAIT delay plus the RTT together be at least one
1529	        tick of the sender's timestamp clock.  Such an extension is not
1530	        part of the proposal of this RFC.

1532	        Note that this is a variant on the mechanism proposed by
1533	        Garlick, Rom, and Postel [Garlick77], which required each host
1534	        to maintain connection records containing the highest sequence
1535	        numbers on every connection.  Using timestamps instead, it is
1536	        only necessary to keep one quantity per remote host, regardless
1537	        of the number of simultaneous connections to that host.

1539	Appendix C.  Changes from RFC 1072, RFC 1185, and RFC 1323

1541	   The protocol extensions defined in RFC 1323 differ in
1542	   several important ways from those defined in RFC 1072 and RFC 1185.

1544	   (a)  SACK has been split off into a separate document, [RFC2018].

1546	   (b)  The detailed rules for sending timestamp replies (see
1547	        Section 3.4) differ in important ways.  The earlier rules could
1548	        result in an under-estimate of the RTT in certain cases (packets
1549	        dropped or out of order).

1551	   (c)  The same value TS.Recent is now shared by the two distinct
1552	        mechanisms RTTM and PAWS.  This simplification became possible
1553	        because of change (b).

1555	   (d)  An ambiguity in RFC 1185 was resolved in favor of putting
1556	        timestamps on ACK as well as data segments.  This supports the
1557	        symmetry of the underlying TCP protocol.

1559	   (e)  The echo and echo reply options of RFC 1072 were combined into a
1560	        single Timestamps option, to reflect the symmetry and to
1561	        simplify processing.

1563	   (f)  The problem of outdated timestamps on long-idle connections,
1564	        discussed in Section 4.2.2, was realized and resolved.

1566	   (g)  RFC 1185 recommended that header prediction take precedence over
1567	        the timestamp check.  Based upon some skepticism about the
1568	        probabilistic arguments given in Section 4.2.4, it was decided
1569	        to recommend that the timestamp check be performed first.

1571	   (h)  The spec was modified so that the extended options will be sent
1572	        on <SYN,ACK> segments only when they are received in the
1573	        corresponding <SYN> segments.  This provides the most
1574	        conservative possible conditions for interoperation with
1575	        implementations without the extensions.

1577	   In addition to these substantive changes, the present RFC attempts to
1578	   specify the algorithms unambiguously by presenting modifications to
1579	   the Event Processing rules of RFC 793; see Appendix F.

1581	   There are additional changes in this document from RFC 1323.  These
1582	   changes are:

1584	   (a)  The description of which TSecr values can be used to update the
1585	        measured RTT has been clarified.  Specifically, with Timestamps,
1586	        the Karn algorithm [Karn87] is disabled.  The Karn algorithm
1587	        disables all RTT measurements during retransmission, since it is
1588	        ambiguous whether the ACK is for the original packet, or the
1589	        retransmitted packet.  With Timestamps, that ambiguity is
1590	        removed since the TSecr in the ACK will contain the TSval from
1591	        whichever data packet made it to the destination.

1593	   (b)  In RFC1323, section 3.4, step (2) of the algorithm to control
1594	        which timestamp is echoed was incorrect in two regards:

1596	        (1)  It failed to update TS.Recent for a retransmitted segment
1597	             that resulted from a lost ACK.

1599	        (2)  It failed if SEG.LEN = 0.

1601	        In the new algorithm, the case of SEG.TSval >= TS.Recent is
1602	        included for consistency with the PAWS test.

1604	   (c)  One correction was made to the Event Processing Summary in
1605	        Appendix F.  In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
1606	        fill in the SEG.WND value, not SND.WND.

1608	   (d)  A new pseudo-code summary has been added in Appendix E.

1610	   (e)  Appendix A has been expanded with information about the TCP MSS
1611	        option and the TCP Urgent Pointer.

1613	   (f)  It is now recommended that Timestamps options be included in RST
1614	        packets if the incoming packet contained a Timestamps option.

1616	   (g)  RST packets are explicitly excluded from PAWS processing.

1618	   (h)  Snd.TSoffset and Snd.TSclock variables have been added.
1619	        Snd.TSclock is the sum of my.TSclock and Snd.TSoffset.  This
1620	        allows the starting points for timestamps to be randomized on a
1621	        per-connection basis.  Setting Snd.TSoffset to zero yields the
1622	        same results as [RFC1323].

1624	   (i)  RTTM update processing explicitly excludes packets containing
1625	        SACK options.  This addresses inflation of the RTT during
1626	        episodes of packet loss in both directions.

1628	   (j)  In Section 3.2 the if-clause allowing sending of timestamps only
1629	        when received in a <SYN> or <SYN,ACK> was removed, to allow for
1630	        late timestamp negotiation.

1632	   (k)  Section 2.4 was added describing the unavoidable window
1633	        retraction issue, and explicitly describing the mitigation steps
1634	        necessary.
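
	   Change (h) in the list above can be illustrated with a short
	   Python sketch.  It assumes that the sum wraps modulo 2^32, since
	   timestamp values are 32 bits wide, and it uses Python's secrets
	   module as one possible source of the random offset.  The
	   function names are illustrative only.

```python
import secrets

TS_MODULUS = 2 ** 32


def new_snd_tsoffset():
    """Pick a random 32-bit per-connection offset (one possible source)."""
    return secrets.randbits(32)


def snd_tsclock(my_tsclock, snd_tsoffset):
    """Snd.TSclock = my.TSclock + Snd.TSoffset, assumed to wrap at 32 bits."""
    return (my_tsclock + snd_tsoffset) % TS_MODULUS
```

	   With an offset of zero, snd_tsclock() returns my.TSclock
	   unchanged, matching the behavior of [RFC1323].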

1636	Appendix D.  Summary of Notation

1638	   The following notation has been used in this document.

1640	   Options

1642	      WSopt:            TCP Window Scale Option
1643	      TSopt:            TCP Timestamps Option

1645	   Option Fields

1647	      shift.cnt:        Window scale byte in WSopt
1648	      TSval:            32-bit Timestamp Value field in TSopt
1649	      TSecr:            32-bit Timestamp Reply field in TSopt

1651	   Option Fields in Current Segment

1653	      SEG.TSval:        TSval field from TSopt in current segment
1654	      SEG.TSecr:        TSecr field from TSopt in current segment
1655	      SEG.WSopt:        8-bit value in WSopt

1657	   Clock Values
1658	      my.TSclock:       System wide source of 32-bit timestamp values
1659	      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec)
1660	      Snd.TSoffset:     An offset for randomizing Snd.TSclock
1661	      Snd.TSclock:      my.TSclock + Snd.TSoffset

1663	   Per-Connection State Variables

1665	      TS.Recent:        Latest received Timestamp
1666	      Last.ACK.sent:    Last ACK field sent
1667	      Snd.TS.OK:        1-bit flag
1668	      Snd.WS.OK:        1-bit flag
1669	      Rcv.Wind.Scale:   Receive window scale power
1670	      Snd.Wind.Scale:   Send window scale power
1671	      Start.Time:       Snd.TSclock value when segment being timed was
1672	                        sent (used by pre-1323 code).

1674	   Procedure

1676	      Update_SRTT(m)    Procedure to update the smoothed RTT and RTT
1677	                        variance estimates, using the rules of
1678	                        [Jacobson88a], given m, a new RTT measurement

1680	Appendix E.  Pseudo-code Summary

1682	   Create new TCB => {
1683	       Rcv.wind.scale =
1684	             MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
1685	       Snd.wind.scale = 0;
1686	       Last.ACK.sent = 0;
1687	       Snd.TS.OK = Snd.WS.OK = FALSE;
1688	       Snd.TSoffset = random 32 bit value
1689	   }

1691	   Send initial <SYN> segment => {
1692	       SEG.WND = MIN( RCV.WND, 65535 );
1693	       Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
1694	       Include in segment: WSopt = Rcv.wind.scale;
1695	   }

1697	   Send <SYN,ACK> segment => {
1698	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1699	       SEG.WND = MIN( RCV.WND, 65535 );
1700	       if (Snd.TS.OK) then
1701	             Include in segment:
1702	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1703	       if (Snd.WS.OK) then
1704	             Include in segment: WSopt = Rcv.wind.scale;
1705	   }

1707	   Receive <SYN> or <SYN,ACK> segment => {
1708	       if (Segment contains TSopt) then {
1709	             TS.Recent = SEG.TSval;
1710	             Snd.TS.OK = TRUE;
1711	             if (is <SYN,ACK> segment) then
1712	                   Update_SRTT(
1713	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1714	       }
1715	       if (Segment contains WSopt) then {
1716	             Snd.wind.scale = SEG.WSopt;
1717	             Snd.WS.OK = TRUE;
1718	             if (the ACK bit is not set, and Rcv.wind.scale has not been
1719	               initialized by the user) then
1720	                   Rcv.wind.scale = Snd.wind.scale;
1721	       }
1722	       else
1723	             Rcv.wind.scale = Snd.wind.scale = 0;
1724	   }

1726	   Send non-SYN segment => {
1727	       SEG.ACK = Last.ACK.sent = RCV.NXT;
1728	       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
1729	       if (Snd.TS.OK) then
1730	             Include in segment:
1731	                   TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
1732	   }

1734	   Receive non-SYN segment in (state >= ESTABLISHED) => {
1735	       Window = (SEG.WND << Snd.wind.scale);
1736	             /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
1737	              * in rest of processing.
1738	              */
1739	       if (Segment contains TSopt) then {
1740	             if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
1741	                   if (Snd.TS.OK AND (NOT RST) ) then {
1742	                               /* Timestamp too old =>
1743	                                *    segment is unacceptable.
1744	                                */
1745	                         Send ACK segment;
1746	                         Discard segment and return;
1747	                   }
1748	             }
1749	             else {
1750	                   if (SEG.SEQ <= Last.ACK.sent) then
1751	                               TS.Recent = SEG.TSval;

1753	             }
1754	       }
1755	       if (SEG.ACK > SND.UNA) then {
1756	                    /* (At least part of) first segment in
1757	                     * retransmission queue has been ACKd
1758	                     */
1759	             if (Segment contains TSopt) then
1760	                   Update_SRTT(
1761	                          (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
1762	             else
1763	                   Update_SRTT( /* for compatibility */
1764	                          (Snd.TSclock - Start.Time)/my.TSclock.rate);
1765	       }
1766	   }
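
	   The RTT measurement in the pseudo-code above can be sketched in
	   Python as follows.  Two assumptions are made here: the
	   subtraction is taken modulo 2^32 so that it survives timestamp
	   wrap-around, and the divisor is treated as clock ticks per
	   second.  The latter is an interpretation of my.TSclock.rate
	   chosen for this sketch, since the pseudo-code divides by it
	   while Appendix D describes it as a period.

```python
TS_MODULUS = 2 ** 32


def rtt_measurement_seconds(snd_tsclock, seg_tsecr, ticks_per_second):
    """RTT sample from an echoed timestamp, in seconds.

    ticks_per_second is an assumed reading of my.TSclock.rate.
    """
    elapsed_ticks = (snd_tsclock - seg_tsecr) % TS_MODULUS
    return elapsed_ticks / ticks_per_second
```

	   For a 1 ms clock, rtt_measurement_seconds(1500, 1000, 1000)
	   gives a 0.5 second sample to pass to Update_SRTT().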

1768	Appendix F.  Event Processing Summary

1770	   OPEN Call

1772	      ...

1774	      An initial send sequence number (ISS) is selected.  Send a SYN
1775	      segment of the form:

1777	        <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale>

1779	      ...

1781	   SEND Call

1783	      CLOSED STATE (i.e., TCB does not exist)

1785	         ...

1787	      LISTEN STATE

1789	         If the foreign socket is specified, then change the connection
1790	         from passive to active, select an ISS.  Send a SYN segment
1791	         containing the options: <TSval=Snd.TSclock> and
1792	         <WSopt=Rcv.Wind.Scale>.  Set SND.UNA to ISS, SND.NXT to ISS+1.
1793	         Enter SYN-SENT state. ...

1795	      SYN-SENT STATE
1796	      SYN-RECEIVED STATE

1798	         ...

1800	      ESTABLISHED STATE
1801	      CLOSE-WAIT STATE

1803	         Segmentize the buffer and send it with a piggybacked
1804	         acknowledgment (acknowledgment value = RCV.NXT). ...

1806	         If the urgent flag is set ...

1808	         If the Snd.TS.OK flag is set, then include the TCP Timestamps
1809	         option <TSval=Snd.TSclock,TSecr=TS.Recent> in each data
1810	         segment.

1812	         Scale the receive window for transmission in the segment
1813	         header:

1815	                   SEG.WND = (RCV.WND >> Rcv.Wind.Scale).
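The send-side scaling step can be sketched as follows. This is a minimal Python illustration, not normative text: the function name is hypothetical, and the clamp to the 16-bit header field and the check against the maximum shift count of 14 are assumptions added for safety in this example.

```python
MAX_SHIFT = 14          # largest shift count allowed for the WSopt
MAX_WINDOW_FIELD = 0xFFFF  # SEG.WND is a 16-bit header field


def scaled_window_field(rcv_wnd, rcv_wind_scale):
    """Return SEG.WND = RCV.WND >> Rcv.Wind.Scale for the header."""
    if not 0 <= rcv_wind_scale <= MAX_SHIFT:
        raise ValueError("shift count out of range")
    # Clamp in case RCV.WND is larger than the scale factor can express.
    return min(rcv_wnd >> rcv_wind_scale, MAX_WINDOW_FIELD)


# A 1,000,000-byte receive window advertised with a shift of 7.
field = scaled_window_field(1_000_000, 7)

# What the peer reconstructs: the low 7 bits are lost by the shift.
reconstructed = field << 7
```

The right shift discards the low-order bits, so the peer sees the window rounded down to a multiple of 2**Rcv.Wind.Scale.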

1817	   SEGMENT ARRIVES

1819	      ...

1821	      If the state is LISTEN then

1823	         first check for an RST

1825	            ...

1827	         second check for an ACK

1829	            ...

1831	         third check for a SYN

1833	            if the SYN bit is set, check the security.  If the ...

1835	               ...

1837	            if the SEG.PRC is less than the TCB.PRC then continue.

1839	            Check for a Window Scale option (WSopt); if one is found,
1840	            save SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
1841	            Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to
1842	            zero and clear Snd.WS.OK flag.

1844	            Check for a TSopt option; if one is found, save SEG.TSval in
1845	            the variable TS.Recent and turn on the Snd.TS.OK bit.

1847	            Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
1848	            other control or text should be queued for processing later.
1849	            ISS should be selected and a SYN segment sent of the form:

1851	                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1853	            If the Snd.WS.OK bit is on, include a WSopt option
1854	            <WSopt=Rcv.Wind.Scale> in this segment.  If the Snd.TS.OK
1855	            bit is on, include a TSopt
1856	            <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
1857	            Last.ACK.sent is set to RCV.NXT.

1859	            SND.NXT is set to ISS+1 and SND.UNA to ISS.  The connection
1860	            state should be changed to SYN-RECEIVED.  Note that any
1861	            other incoming control or data (combined with SYN) will be
1862	            processed in the SYN-RECEIVED state, but processing of SYN
1863	            and ACK should not be repeated.  If the listen was not fully
1864	            specified (i.e., the foreign socket was not fully
1865	            specified), then the unspecified fields should be filled in
1866	            now.
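The option checks performed on a SYN in the LISTEN state can be sketched in Python as follows. The Tcb class, its field names, and the default shift of 7 are hypothetical stand-ins for the connection control block; only the assignments mirror the text above.

```python
from dataclasses import dataclass


@dataclass
class Tcb:
    """Hypothetical subset of the connection control block."""
    rcv_wind_scale: int = 7   # shift we would like to offer (arbitrary)
    snd_wind_scale: int = 0
    snd_ws_ok: bool = False
    snd_ts_ok: bool = False
    ts_recent: int = 0


def process_syn_options(tcb, wsopt=None, tsval=None):
    """Apply the LISTEN-state option checks to an incoming SYN.

    wsopt and tsval are None when the SYN did not carry that option.
    """
    if wsopt is not None:
        tcb.snd_wind_scale = wsopt
        tcb.snd_ws_ok = True
    else:
        # No WSopt from the peer: scaling is off in both directions.
        tcb.snd_wind_scale = 0
        tcb.rcv_wind_scale = 0
        tcb.snd_ws_ok = False
    if tsval is not None:
        tcb.ts_recent = tsval
        tcb.snd_ts_ok = True
    return tcb


both = process_syn_options(Tcb(), wsopt=3, tsval=123456)
neither = process_syn_options(Tcb())
```

Note that the absence of a Window Scale option clears the local shift as well, since scaling only takes effect when both ends send the option.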

1868	         fourth other text or control

1870	            ...

1872	      If the state is SYN-SENT then

1874	         first check the ACK bit

1876	            ...

1878	         ...

1880	         fourth check the SYN bit

1882	            ...

1884	            If the SYN bit is on and the security/compartment and
1885	            precedence are acceptable, then RCV.NXT is set to SEG.SEQ+1,
1886	            IRS is set to SEG.SEQ, and any segments on the
1887	            retransmission queue which are thereby acknowledged should
1888	            be removed.

1890	            Check for a Window Scale option (WSopt); if it is found,
1891	            save SEG.WSopt in Snd.Wind.Scale; otherwise, set both
1892	            Snd.Wind.Scale and Rcv.Wind.Scale to zero.

1894	            Check for a TSopt option; if one is found, save SEG.TSval in
1895	            variable TS.Recent and turn on the Snd.TS.OK bit in the
1896	            connection control block.  If the ACK bit is set, use
1897	            Snd.TSclock - SEG.TSecr as the initial RTT estimate.

1899	            If SND.UNA > ISS (our SYN has been ACKed), change the
1900	            connection state to ESTABLISHED, form an ACK segment:

1902	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1904	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1905	            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK
1906	            segment.  Last.ACK.sent is set to RCV.NXT.

1908	            Data or controls which were queued for transmission may be
1909	            included.  If there are other controls or text in the
1910	            segment then continue processing at the sixth step below
1911	            where the URG bit is checked, otherwise return.

1913	            Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:

1915	                    <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>

1917	            and send it.  If the Snd.TS.OK bit is on, include a TSopt
1918	            option <TSval=Snd.TSclock,TSecr=TS.Recent> in this segment.
1919	            If the Snd.WS.OK bit is on, include a WSopt option
1920	            <WSopt=Rcv.Wind.Scale> in this segment.  Last.ACK.sent is
1921	            set to RCV.NXT.

1923	            If there are other controls or text in the segment, queue
1924	            them for processing after the ESTABLISHED state has been
1925	            reached, return.

1927	         fifth, if neither of the SYN or RST bits is set then drop the
1928	         segment and return.

1930	      Otherwise,

1932	      First, check sequence number

1934	         SYN-RECEIVED STATE
1935	         ESTABLISHED STATE
1936	         FIN-WAIT-1 STATE
1937	         FIN-WAIT-2 STATE
1938	         CLOSE-WAIT STATE
1939	         CLOSING STATE
1940	         LAST-ACK STATE
1941	         TIME-WAIT STATE
1942	            Segments are processed in sequence.  Initial tests on
1943	            arrival are used to discard old duplicates, but further
1944	            processing is done in SEG.SEQ order.  If a segment's
1945	            contents straddle the boundary between old and new, only the
1946	            new parts should be processed.

1948	            Rescale the received window field:

1950	                  TrueWindow = SEG.WND << Snd.Wind.Scale,

1952	            and use "TrueWindow" in place of SEG.WND in the following
1953	            steps.
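The receive-side rescaling can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; the range check on the 16-bit field is an assumption added for this example.

```python
def true_window(seg_wnd, snd_wind_scale):
    """Return TrueWindow = SEG.WND << Snd.Wind.Scale."""
    if not 0 <= seg_wnd <= 0xFFFF:
        raise ValueError("SEG.WND is a 16-bit field")
    return seg_wnd << snd_wind_scale


# Without scaling, the largest expressible window is 65535 bytes.
unscaled = true_window(0xFFFF, 0)

# With the maximum shift of 14, the same field value expresses a
# window just under 2**30 bytes.
scaled = true_window(0xFFFF, 14)
```

Every later window comparison in the segment processing would then use the returned value rather than the raw header field.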

1955	            Check whether the segment contains a Timestamps option and
1956	            bit Snd.TS.OK is on.  If so:

1958	               If SEG.TSval < TS.Recent and the RST bit is off, then
1959	               test whether the connection has been idle less than 24
1960	               days; if all are true, then the segment is not acceptable;
1961	               follow steps below for an unacceptable segment.

1963	               If SEG.SEQ is less than or equal to Last.ACK.sent, then
1964	               save SEG.TSval in variable TS.Recent.
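The two timestamp steps above can be sketched in Python as follows. The function names and the tuple return value are hypothetical. The comparisons are done in 32-bit modular arithmetic (s < t if 0 < (t - s) < 2**31), which is an assumption of this sketch about how "<" and "<=" are evaluated for timestamps and sequence numbers.

```python
MODULUS = 1 << 32
HALF_RANGE = 1 << 31
MAX_IDLE_SECONDS = 24 * 24 * 60 * 60  # 24 days


def ts_less_than(s, t):
    """True if 32-bit timestamp s is older than t (modular compare)."""
    return 0 < (t - s) % MODULUS < HALF_RANGE


def seq_leq(a, b):
    """True if sequence number a is at or before b (modular compare)."""
    return (b - a) % MODULUS < HALF_RANGE


def timestamp_check(seg_tsval, seg_seq, rst, ts_recent,
                    last_ack_sent, idle_seconds):
    """Return (acceptable, new_ts_recent) for an arriving segment."""
    if (ts_less_than(seg_tsval, ts_recent) and not rst
            and idle_seconds < MAX_IDLE_SECONDS):
        return False, ts_recent          # treat as unacceptable segment
    if seq_leq(seg_seq, last_ack_sent):
        return True, seg_tsval           # save SEG.TSval in TS.Recent
    return True, ts_recent


old_dup = timestamp_check(90, 1000, False, 100, 1000, 5)
fresh = timestamp_check(110, 1000, False, 100, 1000, 5)
after_wrap = timestamp_check(5, 1000, False, MODULUS - 10, 1000, 5)
long_idle = timestamp_check(90, 1000, False, 100, 1000, 25 * 86400)
```

The last case shows why the idle test matters: after more than 24 days of silence, a smaller timestamp no longer proves that the segment is an old duplicate.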

1966	            There are four cases for the acceptability test for an
1967	            incoming segment:

1969	               ...

1971	            If an incoming segment is not acceptable, an acknowledgment
1972	            should be sent in reply (unless the RST bit is set; if so,
1973	            drop the segment and return):

1975	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

1977	            If the Snd.TS.OK bit is on, include the Timestamps option
1978	            <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.
1979	            Set Last.ACK.sent to SEG.ACK of the acknowledgment and
1980	            send the ACK segment.
1981	            After sending the acknowledgment, drop the unacceptable
1982	            segment and return.

1984	      ...

1986	      fifth check the ACK field.

1988	         if the ACK bit is off drop the segment and return.

1990	         if the ACK bit is on

1992	            ...

1994	            ESTABLISHED STATE

1996	               If SND.UNA < SEG.ACK <= SND.NXT then, set SND.UNA <-
1997	               SEG.ACK.  Also compute a new estimate of round-trip time.
1998	               If Snd.TS.OK bit is on, use Snd.TSclock - SEG.TSecr;
1999	               otherwise use the elapsed time since the first segment in
2000	               the retransmission queue was sent.  Any segments on the
2001	               retransmission queue which are thereby entirely
2002	               acknowledged...

2004	      ...

2006	      Seventh, process the segment text.

2008	         ESTABLISHED STATE
2009	         FIN-WAIT-1 STATE
2010	         FIN-WAIT-2 STATE

2012	            ...

2014	            Send an acknowledgment of the form:

2016	                    <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

2018	            If the Snd.TS.OK bit is on, include the Timestamps option
2019	            <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.
2020	            Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
2021	            it.  This acknowledgment should be piggy-backed on a segment
2022	            being transmitted if possible without incurring undue delay.

2024	            ...

2026	Appendix G.  Timestamps Edge Cases

2028	   While the rules for when to calculate RTTM produce the correct
2029	   result most of the time, there are some edge cases in which an
2030	   incorrect RTTM can be calculated.  All of these situations involve
2031	   the loss of packets.  These scenarios are believed to be rare, and
2032	   when one does occur it inflates only a single RTTM measurement,
2033	   which limits its effect on the RTO calculation.

2035	   [Martin03] cites two similar cases in which the returning ACK is
2036	   lost and, before the retransmission timer fires, another returning
2037	   packet arrives that ACKs the data.  In this case, the RTTM
2038	   calculated will be inflated:

2040	           clock
2041	             tc=1   <A, TSval=1> ------------------->

2043	             tc=2   (lost) <---- <ACK(A), TSecr=1, win=n>
2044	                 (RTTM would have been 1)

2046	                    (receive window opens, window update is sent)
2047	             tc=5        <---- <ACK(A), TSecr=1, win=m>
2048	                    (RTTM is calculated at 4)

2050	   Note that the inflation in this situation is bounded by roughly
2051	   RTO + RTT, which limits how far off the RTTM calculation can be.
2052	   While more complex scenarios can be constructed that produce larger
2053	   inflations (e.g., retransmissions are lost), those scenarios involve
2054	   multiple packet losses, and the connection will then have more
2055	   serious operational problems than an inflated RTTM in the RTO
2056	   calculation.
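The arithmetic of the example above, and the limited effect of one inflated sample, can be illustrated with a short Python sketch. The smoothing step uses the standard SRTT/RTTVAR update with alpha = 1/8 and beta = 1/4 from RFC 6298, which is not defined in this appendix; the starting values SRTT = 1 and RTTVAR = 0.5 are arbitrary assumptions chosen for the illustration.

```python
def rttm(ack_arrival_tick, echoed_tsval):
    """RTT measurement in clock ticks: arrival time minus echoed value."""
    return ack_arrival_tick - echoed_tsval


def smoothed_update(srtt, rttvar, sample, alpha=1 / 8, beta=1 / 4):
    """One SRTT/RTTVAR update step in the style of RFC 6298."""
    rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
    srtt = (1 - alpha) * srtt + alpha * sample
    return srtt, rttvar


# Segment sent at tc=1; the first ACK (lost) would have arrived at tc=2.
expected = rttm(2, 1)

# The window update arrives at tc=5 and still echoes the value 1.
inflated = rttm(5, 1)

# Effect of the single inflated sample on hypothetical prior estimates.
srtt, rttvar = smoothed_update(1.0, 0.5, inflated)
```

Only one eighth of the inflated sample enters SRTT, and the error pushes the RTO upward, so the result is a more conservative timer rather than a spurious retransmission.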

2058	Authors' Addresses

2060	   David Borman
2061	   Quantum Corporation
2062	   Mendota Heights  MN 55120
2063	   USA

2065	   Email: david.borman@quantum.com

2067	   Bob Braden
2068	   University of Southern California
2069	   4676 Admiralty Way
2070	   Marina del Rey  CA 90292
2071	   USA

2073	   Email: braden@isi.edu

2074	   Van Jacobson
2075	   Packet Design
2076	   2465 Latham Street
2077	   Mountain View  CA 94040
2078	   USA

2080	   Email: van@packetdesign.com

2082	   Richard Scheffenegger (editor)
2083	   NetApp, Inc.
2084	   Am Euro Platz 2
2085	   Vienna  1120
2086	   Austria

2088	   Email: rs@netapp.com