idnits 2.17.1

draft-minshall-nagle-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 469 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There is 1 instance of too long lines in the document, the longest one
     being 4 characters in excess of 72.

  ** The abstract seems to contain references ([RFC977], [RFC793], [RFC959],
     [RFC1122], [RFC896], [RFC2068], [RFC854]), which it shouldn't.  Please
     replace those with straight textual mentions of the documents in question.

  == There are 5 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 17, 1999) is 9078 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC  977 (Obsoleted by RFC 3977)

  ** Obsolete normative reference: RFC  896 (Obsoleted by RFC 7805)

  ** Obsolete normative reference: RFC 2068 (Obsoleted by RFC 2616)


     Summary: 10 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Internet Engineering Task Force                            Greg Minshall
INTERNET-DRAFT                                             Siara Systems
draft-minshall-nagle-01.txt                                    June 17, 1999

             A Proposed Modification to Nagle's Algorithm

Status of This Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.

   This draft proposes a modification to Nagle's algorithm (as
   specified in RFC896) to allow TCP, under certain conditions, to
   send a small-sized packet immediately after one or more maximum
   segment sized packets.

Abstract

   The Nagle algorithm is one of the primary mechanisms which protect
   the internet from poorly designed and/or poorly implemented
   applications.  However, for a certain class of applications
   (notably, request-response protocols) the Nagle algorithm interacts
   poorly with delayed acknowledgements, degrading the performance of
   these applications.

   This draft is NOT suggesting that these applications should disable
   the Nagle algorithm.

   This draft suggests a fairly small and simple modification to the
   Nagle algorithm which preserves the Nagle algorithm as a means of
   protecting the internet while at the same time giving better
   performance to a wider class of applications.

Introduction to the Nagle algorithm

   The Nagle algorithm [RFC896] protects the internet from
   applications (most notably Telnet [RFC854], at the time the
   algorithm was developed) which tend to dribble small amounts of
   data to TCP.  Without the Nagle algorithm, TCP would transmit a
   packet, with a small amount of data, in response to each of the
   application's writes to TCP.  With the Nagle algorithm, a first
   small packet will be transmitted, then subsequent writes from the
   application will be buffered at the sending TCP until either i)
   enough application data has accumulated to enable TCP to transmit a
   maximum sized packet, or ii) the initial small packet is
   acknowledged by the receiving TCP.  This limits the number of small
   packets to one per round trip time.

   While the current Nagle algorithm does a very good job of
   protecting the internet from such applications, there are other
   applications, such as request-response protocols (with HTTP
   [RFC2068] being a topical example) in which the current Nagle
   algorithm interacts with TCP's ``delayed ACK'' policy [RFC1122]
   to produce non-optimal results.

Delayed ACKs

   A receiving TCP tries to avoid acknowledging every received data
   packet in the hope of ``piggy-backing'' the acknowledgement on a
   data packet flowing in the reverse direction or combining the
   acknowledgement with a window update flowing in the reverse
   direction.  This process, known as ``delayed ACKing'' [RFC1122],
   typically causes an ACK to be generated for every other received
   (full-sized) data packet.  In the case of an ``isolated'' TCP
   packet (i.e., where a second TCP packet is not going to arrive
   anytime soon), the delayed ACK policy causes the acknowledgement
   for the data in the isolated packet to be delayed by up to 200
   milliseconds after the receipt of the isolated packet (the actual
   maximum time the acknowledgement can be delayed is 500ms [RFC1122],
   but most systems implement a maximum of 200ms, and we shall assume
   that number in this document).  The way delayed ACKs are
   implemented in some systems causes the delayed ACK to be generated
   anytime between 0ms and 200ms; in this case, the average amount of
   time before the delayed ACK is generated is 100ms.

The interaction of delayed ACKs and Nagle

   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet.  As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.  On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

   When the sending TCP receives the delayed ACK, it can then transmit
   its second packet.

   In a request-response protocol, this second packet will complete
   either a request or a response, which then enables a succeeding
   response or request.

   Note two (related) bad results of the interaction of delayed ACKs
   and the Nagle algorithm in this case: the request-response time may
   be increased by up to 400ms (if both the request and the response
   are delayed); and, consequently, the number of transactions per
   second is substantially reduced.
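
   As a rough worked example of the second point, the sketch below
   bounds the rate of strictly sequential transactions on one
   connection.  The function name and the 10ms round-trip time used
   in the note that follows are assumptions chosen for illustration;
   only the 200ms stall per direction comes from the discussion above.

```c
/*
 * Upper bound on back-to-back transactions per second on one
 * connection, when each transaction costs one round trip plus
 * stall_ms of delayed-ACK waiting.  Times are in milliseconds.
 */
double
transactions_per_second(double rtt_ms, double stall_ms)
{
        return 1000.0 / (rtt_ms + stall_ms);
}
```

   With a 10ms round trip and no stall, the bound is 100 transactions
   per second.  With both the request and the response stalled for
   200ms (stall_ms of 400), it drops to fewer than 3 per second.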

A proposed modification to the Nagle algorithm

   In the following discussion we make use of the following variables
   defined in the TCP RFC [RFC793] and in the host requirements RFC
   [RFC1122]: ``snd.nxt'' is a TCP variable which names the next byte
   of data to be transmitted; ``snd.una'' is a TCP variable which
   names the next byte of data to be acknowledged (if snd.nxt equals
   snd.una, then all previous packets have been acknowledged);
   Eff.snd.MSS is the largest TCP payload (user data) that can be
   transmitted in one packet.

   The current Nagle algorithm does not require any other state to be
   kept by TCP on a system.

   The proposed modification to the Nagle algorithm does,
   unfortunately, require one new state variable to be kept by TCP:
   ``snd.sml'' is a TCP variable which names the last byte of data in
   the most recently transmitted small packet.

   The current Nagle algorithm can be described as follows:

        "If a TCP has less than a full-sized packet to transmit,
        and if any previous packet has not yet been acknowledged,
        do not transmit a packet."

   and in pseudo-code:

        if ((packet.size < Eff.snd.MSS) && (snd.nxt > snd.una)) {
                do not send the packet;
        }
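
   The test above can be sketched in C as follows.  This is only an
   illustration: the function names are assumptions, and the
   wraparound-tolerant sequence comparison is a common implementation
   idiom rather than something this draft specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sequence-space "a > b", tolerant of 32-bit wraparound. */
static bool
seq_gt(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) > 0;
}

/*
 * Current Nagle test: returns true when the packet must be held
 * back, i.e. it is smaller than Eff.snd.MSS and some previously
 * transmitted data is still unacknowledged.
 */
bool
nagle_holds(uint32_t packet_size, uint32_t eff_snd_mss,
            uint32_t snd_nxt, uint32_t snd_una)
{
        return packet_size < eff_snd_mss && seq_gt(snd_nxt, snd_una);
}
```

   For instance, with an Eff.snd.MSS of 1460 and one full-sized packet
   outstanding (snd.una of 0, snd.nxt of 1460), a trailing 100-byte
   packet is held, which is exactly the case that collides with
   delayed ACKs.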

   The proposed Nagle algorithm modifies this as follows:

        "If a TCP has less than a full-sized packet to transmit,
        and if any previously transmitted less than full-sized
        packet has not yet been acknowledged, do not transmit
        a packet."

   and in pseudo-code:

        if (packet.size < Eff.snd.MSS) {
                if (snd.sml > snd.una) {
                        do not send the packet;
                } else {
                        snd.sml = snd.nxt+packet.size;
                        send the packet;
                }
        }

   In other words, when running Nagle, only look at the recent
   transmission (and acknowledgement) of small packets (rather than
   all packets, as in the current Nagle).

   (In writing the above, I am aware that TCP acknowledges bytes, not
   packets.  However, expressing the algorithm in terms of packets
   seems to make the explanation a bit clearer.)
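
   A minimal C sketch of the modified test follows.  The struct
   layout and function names are assumptions made for illustration,
   and snd.sml is assumed to start out equal to snd.una on a new
   connection; "sending" is modelled simply by advancing snd.nxt.

```c
#include <stdbool.h>
#include <stdint.h>

struct tcb {
        uint32_t snd_una;       /* next byte to be acknowledged */
        uint32_t snd_nxt;       /* next byte to be transmitted */
        uint32_t snd_sml;       /* end of most recent small packet */
        uint32_t eff_snd_mss;   /* largest payload per packet */
};

/* Sequence-space "a > b", tolerant of 32-bit wraparound. */
static bool
seq_gt(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) > 0;
}

/*
 * Modified Nagle test.  Returns true (and advances snd_nxt) if the
 * packet may be sent now, false if it must be held back.
 */
bool
modified_nagle_send(struct tcb *tp, uint32_t packet_size)
{
        if (packet_size < tp->eff_snd_mss) {
                if (seq_gt(tp->snd_sml, tp->snd_una))
                        return false;   /* a small packet is unacked */
                tp->snd_sml = tp->snd_nxt + packet_size;
        }
        tp->snd_nxt += packet_size;     /* "send the packet" */
        return true;
}
```

   With an Eff.snd.MSS of 1460, a 2190-byte write goes out as one
   full-sized packet followed immediately by a 730-byte packet.  A
   further small packet is then held until the acknowledgement covers
   byte 2190.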

Implementing Nagle at Send

   The above description of the current Nagle algorithm and of the
   proposed modification assumes that the Nagle algorithm is being
   implemented just as TCP is about to hand a packet to IP to be
   transmitted, i.e., the algorithm is looking at the sizes of the
   packets it transmits.

   In reality, many TCPs essentially implement Nagle at the interface
   where applications present data to TCP to be transmitted (i.e., in
   the call to ``SEND'', as defined in section 3.8 of the TCP
   specification [RFC793]).  The motivation for this is to not
   penalize applications that provide data to TCP in large chunks
   (ideally a multiple of Eff.snd.MSS).

   This allows a single application send to be broken into zero or
   more full-sized packets, possibly followed by one small packet,
   without forcing any delay on the trailing small packet.  For
   example, one implementation with which the author is familiar first
   captures the boolean ``snd.nxt > snd.una'' in a temporary variable
   (``busy''):

        busy = (snd.nxt > snd.una);

   then goes into a loop transmitting packets out of the data which
   has been presented to TCP by the application; the loop contains the
   following code to implement the current Nagle algorithm:

        if ((packet.size < Eff.snd.MSS) && busy) {
                do not send the packet;
        }

   Since ``busy'' is a constant in the loop transmitting packets, a
   trailing small packet will be transmitted (after zero or more large
   packets transmitted by the same call to send) if the connection had
   no outstanding data at the time the application presented data to
   TCP for transmission (assuming the TCP window allows this).

   To implement the modified Nagle algorithm in such a system, we
   replace snd.sml with two variables: ``snd.sml.add'' is a TCP
   variable which names the last byte presented to TCP by the
   application with a ``small'' send (i.e., the application called
   SEND with fewer than Eff.snd.MSS bytes of data); and
   ``snd.sml.snt'' is a TCP variable which names the highest value of
   snd.sml.add which has, in fact, been transmitted.  The send routine
   contains the following code:

        if (byte.count < Eff.snd.MSS) {
                snd.sml.add = snd.una + snd.bytes.queued;
        }

   (where ``snd.bytes.queued'' is the number of bytes queued for
   transmission, and has already been updated with ``byte.count'', the
   number of bytes being presented to TCP in this call to SEND).

   The loop that transmits packets contains the following code:

        if (packet.size < Eff.snd.MSS) {
                if (snd.sml.snt > snd.una) {
                        do not send the packet;
                } else {
                        if ((snd.nxt + packet.size) <= snd.sml.add) {
                                snd.sml.snt = snd.sml.add;
                        }
                        send the packet;
                }
        }

   (In most implementations, the most deeply nested ``if'' statement
   above is unnecessary, as a small-sized packet will contain all the
   data available to be transmitted, and so will include, or be
   beyond, snd.sml.add.  In this case, the modified Nagle algorithm
   adds one test, one addition, and one assignment in the send
   routine, and one assignment in the output routine.)
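
   The two fragments above can be combined into the following C
   sketch.  It is an illustration under stated assumptions, not a
   description of any particular stack: the struct and function names
   are hypothetical, the send window is ignored, and transmission is
   modelled by advancing snd.nxt.

```c
#include <stdbool.h>
#include <stdint.h>

struct tcb {
        uint32_t snd_una;          /* next byte to be acknowledged */
        uint32_t snd_nxt;          /* next byte to be transmitted */
        uint32_t snd_bytes_queued; /* bytes queued, from snd_una on */
        uint32_t snd_sml_add;      /* end of last small SEND */
        uint32_t snd_sml_snt;      /* highest snd_sml_add transmitted */
        uint32_t eff_snd_mss;      /* largest payload per packet */
};

/* Sequence-space "a > b", tolerant of 32-bit wraparound. */
static bool
seq_gt(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) > 0;
}

/* SEND routine: the application presents byte_count bytes. */
void
tcp_send_hook(struct tcb *tp, uint32_t byte_count)
{
        tp->snd_bytes_queued += byte_count;
        if (byte_count < tp->eff_snd_mss)
                tp->snd_sml_add = tp->snd_una + tp->snd_bytes_queued;
}

/* Output loop: returns the number of packets transmitted. */
int
tcp_output(struct tcb *tp)
{
        uint32_t end = tp->snd_una + tp->snd_bytes_queued;
        int sent = 0;

        while (seq_gt(end, tp->snd_nxt)) {
                uint32_t size = end - tp->snd_nxt;

                if (size > tp->eff_snd_mss)
                        size = tp->eff_snd_mss;
                if (size < tp->eff_snd_mss) {
                        if (seq_gt(tp->snd_sml_snt, tp->snd_una))
                                break;  /* do not send the packet */
                        /* (snd.nxt + packet.size) <= snd.sml.add */
                        if (!seq_gt(tp->snd_nxt + size, tp->snd_sml_add))
                                tp->snd_sml_snt = tp->snd_sml_add;
                }
                tp->snd_nxt += size;    /* send the packet */
                sent++;
        }
        return sent;
}
```

   In this sketch a single 2190-byte SEND on an idle connection (with
   an Eff.snd.MSS of 1460) produces two packets with no delay on the
   trailing one, and a following 100-byte SEND is also transmitted.
   A second small SEND is then held until the first small one has
   been acknowledged.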

A Failure Mode

   If an application sends a large amount of data, followed by a small
   amount of data, followed by a large amount of data, the current
   Nagle algorithm would perform better than the proposed
   modification.  The current Nagle algorithm would send at most one
   small packet (possibly the last packet): it holds back the middle
   (small) amount of data, which gives the application time to present
   the following large amount of data.  The modified Nagle algorithm
   would send as many as two small packets (the middle packet, plus
   possibly a last packet).

A separate, but desirable, system facility

   In addition to the Nagle algorithm (or the modification proposed by
   this draft), it would be desirable for a system providing TCP
   service to applications to allow the application to set TCP into a
   mode in which the TCP would only transmit small packets at the
   explicit direction of the application.  For example, a system based
   on BSD might implement a socket option (using setsockopt(2))
   SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly
   overloading the semantics of an existing flag, such as MSG_EOF).

   In this scenario, an application would set a socket into
   SO_EXPLICITPUSH mode, then enter a mode of writing data to the
   socket and, at the last write, using send(2) with the MSG_EOF flag.
   The underlying TCP would recognize the MSG_EOF flag as an indicator
   to transmit the (possibly) small packet.

   Like the proposed modification to the Nagle algorithm, this is
   fairly simple to implement.

   If a system were to implement this interface, it would be important
   to NOT disable Nagle when using this interface.  In other words,
   when using this interface, the default mode for TCP would be to NOT
   transmit a small packet (even in the presence of MSG_EOF) if a
   previously transmitted small packet was as yet unacknowledged.

   Note, also, that implementing this interface does not eliminate the
   desirability of using the modified Nagle algorithm as the default
   for applications.  More sophisticated networking applications might
   well use the new interface, but naive applications will often be
   adequately served by the modified Nagle algorithm.

Application scenarios that will not be helped by this modification

   The proposed modification helps applications which do not need to
   transmit more than one small packet in a single round-trip time.
   This characterizes one-way file transfer applications (such as FTP
   [RFC959]) and request/response protocols (such as NNTP [RFC977] and
   HTTP [RFC2068] without pipelining).

   However, applications that need to transmit more than one small
   packet in a single round-trip time are not served by this
   modification.  An example of such an application is HTTP [RFC2068]
   using ``pipelining'', in which multiple requests (responses) are
   transmitted asynchronously.

   Applications needing to transmit more than one small packet in a
   single round-trip time will need other mechanisms to satisfy their
   requirements.  (One possible such mechanism would be to use more
   than one TCP connection.)

   If an application developer is considering disabling the Nagle
   algorithm, they should be very careful to ensure that their
   application will generally provide data to TCP in chunks larger
   than two full-sized segments (> 2*Eff.snd.MSS), and they should
   verify after their development that this is, in fact, true.  With
   Nagle disabled, many writes of small blocks of data can add
   significant load to the network, reducing the network's performance.

Acknowledgements

   Jim Gettys, Henrik Frystyk Nielsen, Jeff Mogul, and Yasushi Saito,
   as well as a message forwarded to the end2end-interest list by Sean
   Doran, have motivated my current interest in the Nagle algorithm.
   John Heidemann's work related to the Nagle algorithm has informed
   some of the thinking in this draft; discussions with John have also
   been helpful.  Members of the End-to-End Research Group (under
   the direction of Bob Braden) patiently listened to my discussion of
   the current state of the Nagle algorithm and to the modifications
   proposed in this document.

   Members of the TCP implementors mailing list have been very helpful
   in refining this proposal.  In particular, thanks to Rick Jones,
   Neal Cardwell, Vernon Schryver, Bernie Volz, Sam Manthorpe, Art
   Shelest, David Borman, Kacheong Poon, Jon Snader, Eric Hall, Joe
   Touch, and Alan Cox.

Security Considerations

   The Nagle algorithm does not have major security consequences.

   Implementation of this algorithm should not negatively impact
   the performance of the internet.  The negative impact of
   implementation of this algorithm should be significantly less
   than disabling the Nagle algorithm.

Appendix -- Sample application code

   The following code is provided to give application developers a
   model for buffering.  We assume a BSD-style sockets API.

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <netinet/tcp.h>

        /* SO_SNDBUF is set to SNDBUF_MULT * 2 * TCP_MAXSEG */
        #define SNDBUF_MULT 3

        /*
         * Given a connected socket (s), configure the socket
         * with good buffer size defaults, and return the
         * size the application should use for issuing
         * writes to the socket.
         *
         * Returns size to use for application buffering, or
         * zero (0) on error.
         */
        int
        getbufsize(int s)
        {
                int bufsize, parm;      /* these options take an int */
                socklen_t buflen;

                buflen = sizeof bufsize;
                if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
                                        &bufsize, &buflen) == -1) {
                        perror("getsockopt(...TCP_MAXSEG...)");
                        return 0;
                }

                /* Set socket transmit buffer */
                parm = 2*SNDBUF_MULT*bufsize;
                if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                                        &parm, sizeof parm) == -1) {
                        perror("setsockopt(SO_SNDBUF)");
                        return 0;
                }

                /* Now, set socket low water threshold */
                parm = 2*bufsize;
                if (setsockopt(s, SOL_SOCKET, SO_SNDLOWAT,
                                        &parm, sizeof parm) == -1) {
                        perror("setsockopt(...SO_SNDLOWAT...)");
                        return 0;
                }

                return 2*bufsize;
        }

        int
        main(int argc, char *argv[])
        {
                char *buffer = 0;
                int buflen;
                int sock = -1;  /* placeholder until connected below */

                /*
                 * ... allocate a socket (sock) and get it connected
                 * via either connect(2) or listen(2)/accept(2).
                 */

                buflen = getbufsize(sock);
                if (buflen == 0) {
                        fprintf(stderr, "aborting\n");
                        exit(1);
                }

                buffer = malloc(buflen);
                if (buffer == 0) {
                        fprintf(stderr,
                                "no room for buffer of size %d\n",
                                                            buflen);
                        exit(1);
                }

                /*
                 * ... loop generating ``buflen'' data in buffer
                 * and using send(2) to hand it to TCP.
                 * When there is no more data to send, call
                 * send(2) one last time with <= ``buflen''
                 * bytes.
                 */

                return 0;
        }
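
   The loop left as a comment in main() might look something like the
   sketch below.  The helper name send_in_chunks is hypothetical and
   not part of the sample above; ``chunk'' is assumed to be the value
   returned by getbufsize(), and partial writes and EINTR are handled
   in the usual way for send(2).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hand `len' bytes to TCP in writes of at most `chunk' bytes, so
 * that every write except possibly the last is a full `chunk'
 * (unless send(2) accepts less than was asked for).
 * Returns 0 on success, -1 on error (errno is set by send(2)).
 */
int
send_in_chunks(int sock, const char *data, size_t len, size_t chunk)
{
        if (chunk == 0)
                return -1;

        while (len > 0) {
                size_t want = len < chunk ? len : chunk;
                ssize_t n = send(sock, data, want, 0);

                if (n == -1) {
                        if (errno == EINTR)
                                continue;       /* interrupted; retry */
                        return -1;
                }
                data += n;              /* skip what TCP accepted */
                len -= (size_t)n;
        }
        return 0;
}
```

   An application would call send_in_chunks(sock, buffer, total,
   buflen) once its data is ready; only the final write is then
   smaller than ``buflen'' bytes.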

References

[RFC793]        Postel, J. (ed), "Transmission Control Protocol",
                        Sep-1981.
[RFC854]        Postel, J., J. Reynolds, "Telnet Protocol
                        Specification", May-1983.
[RFC896]        Nagle, J., "Congestion control in IP/TCP internetworks",
                        Jan-06-1984.
[RFC959]        Postel, J., J. Reynolds, "File Transfer Protocol
                        (FTP)", Oct-1985.
[RFC977]        Kantor, B., P. Lapsley, "Network News Transfer
                        Protocol", Feb-1986.
[RFC1122]       Braden, R. T., "Requirements for Internet hosts -
                        communication layers", Oct-01-1989.
[RFC2068]       Fielding, R., J. Gettys, J. Mogul, H. Frystyk,
                        T. Berners-Lee, "Hypertext Transfer Protocol
                        -- HTTP/1.1", Jan-1997.

Author's Address

   Greg Minshall
   Siara Systems
   300 Ferguson Drive, 2nd floor
   Mountain View, CA  94043
   USA