Internet Engineering Task Force Greg Minshall INTERNET-DRAFT Siara Systems draft-minshall-nagle-01.txt June 17, 1999 A Proposed Modification to Nagle's Algorithm Status of This Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This draft proposes a modification to Nagle's algorithm (as specified in RFC896) to allow TCP, under certain conditions, to send a small sized packet immediately after one or more maximum segment sized packet. Abstract The Nagle algorithm is one of the primary mechanisms which protects the internet from poorly designed and/or poorly implemented applications. However, for a certain class of applications (notably, request-response protocols) the Nagle algorithm interacts poorly with delayed acknowledgements to give these applications poorer performance. This draft is NOT suggesting that these applications should disable the Nagle algorithm. This draft suggests a fairly small and simple modification to the Nagle algorithm which preserves the Nagle algorithm as a means of protecting the internet while at the same time giving better performance to a wider class of applications. Introduction to the Nagle algorithm The Nagle algorithm [RFC896] protects the internet from applications (most notably Telnet [RFC854], at the time the algorithm was developed) which tend to dribble small amounts of data to TCP. Without the Nagle algorithm, TCP would transmit a packet, with a small amount of data, in response to each of the application's writes to TCP. With the Nagle algorithm, a first small packet will be transmitted, then subsequent writes from the application will be buffered at the sending TCP until either i) enough application data has accumulated to enable TCP to transmit a maximum sized packet, or ii) the initial small packet is acknowledged by the receiving TCP. This limits the number of small packets to one per round trip time. While the current Nagle algorithm does a very good job of protecting the internet from such applications, there are other applications, such as request-response protocols (with HTTP [RFC2068] being a topical example) in which the current Nagle algorithm interacts with TCP's ``delayed ACK'' policy [RFC1122] to produce non-optimal results. Delayed ACKs A receiving TCP tries to avoid acknowledging every received data packet in the hope of ``piggy-backing'' the acknowledgement on a data packet flowing in the reverse direction or combining the acknowledgement with a window update flowing in the reverse direction. This process, known as ``delayed ACKing'' [RFC1122], typically causes an ACK to be generated for every other received (full-sized) data packet. In the case of an ``isolated'' TCP packet (i.e., where a second TCP packet is not going to arrive anytime soon), the delayed ACK policy causes an acknowledgement for the data in the isolated packet to be delayed up to 200 milliseconds of the receipt of the isolated packet (the actual maximum time the acknowledgement can be delayed is 500ms [RFC1122], but most systems implement a maximum of 200ms, and we shall assume that number in this document). The way delayed ACKs are implemented in some systems causes the delayed ACK to be generated anytime between 0ms and 200ms; in this case, the average amount of time before the delayed ACK is generated is 100ms. The interaction of delayed ACKs and Nagle If a TCP has more application data to transmit than will fit in one packet, but less than two full-sized packets' worth of data, it will transmit the first packet. As a result of Nagle, it will not transmit the second packet until the first packet has been acknowledged. On the other hand, the receiving TCP will delay acknowledging the first packet until either i) a second packet arrives (which, in this case, won't arrive), or ii) approximately 100ms (and a maximum of 200ms) has elapsed. When the sending TCP receives the delayed ACK, it can then transmit its second packet. In a request-response protocol, this second packet will complete either a request or a response, which then enables a succeeding response or request. Note two (related) bad results of the interaction of delayed ACKs and the Nagle algorithm in this case: the request-response time may be increased by up to 400ms (if both the request and the response are delayed); and, consequently, the number of transactions per second is substantially reduced. A proposed modification to the Nagle algorithm In the following discussion we make use of the following variables defined in the TCP RFC [RFC793] and in the host requirements RFC [RFC1122]: ``snd.nxt'' is a TCP variable which names the next byte of data to be transmitted; ``snd.una'' is a TCP variable which names the next byte of data to be acknowledged (if snd.nxt equals snd.una, then all previous packets have been acknowledged); Eff.snd.MSS is the largest TCP payload (user data) that can be transmitted in one packet. The current Nagle algorithm does not require any other state to be kept by TCP on a system. The proposed modification to the Nagle algorithm does, unfortunately, require one new state variable to be kept by TCP: ``snd.sml'' is a TCP variable which names the last byte of data in the most recently transmitted small packet. The current Nagle algorithm can be described as follows: "If a TCP has less than a full-sized packet to transmit, and if any previous packet has not yet been acknowledged, do not transmit a packet." and in pseudo-code: if ((packet.size < Eff.snd.MSS) && (snd.nxt > snd.una)) { do not send the packet; } The proposed Nagle algorithm modifies this as follows: "If a TCP has less than a full-sized packet to transmit, and if any previously transmitted less than full-sized packet has not yet been acknowledged, do not transmit a packet." and in pseudo-code: if (packet.size < Eff.snd.MSS) { if (snd.sml > snd.una)) { do not send the packet; } else { snd.sml = snd.nxt+packet.size; send the packet; } } In other words, when running Nagle, only look at the recent transmission (and acknowledgement) of small packets (rather than all packets, as in the current Nagle). (In writing the above, I am aware that TCP acknowledges bytes, not packets. However, expressing the algorithm in terms of packets seems to make the explanation a bit clearer.) Implementing Nagle at Send The above description of the current Nagle algorithm and of the proposed modification assumes that the Nagle algorithm is being implemented just as TCP is about to hand a packet to IP to be transmitted, i.e., the algorithm is looking at the sizes of the packets it transmits. In reality, many TCPs essentially implement Nagle at the interface where applications present data to TCP to be transmitted (i.e., in the call to ``SEND'', as defined in section 3.8 of the TCP specification [RFC793]). The motivation for this is to not penalize applications that provide data to TCP in large chunks (ideally a multiple of Eff.snd.MSS). This allows a single application send to be broken into zero or more full-sized packets, possibly followed by one small packet, without forcing any delay on the trailing small packet. For example, one implementation with which the author is familiar first captures the boolean ``snd.nxt > snd.una'' in a temporary variable (``busy''): busy = (snd.nxt > snd.una); then goes into a loop transmitting packets out of the data which has been presented to TCP by the application; the loop contains the following code to implement the current Nagle algorithm: if ((packet.size < Eff.snd.MSS) && busy) { do not send the packet; } Since ``busy'' is a constant in the loop transmitting packets, a trailing small packet will be transmitted (after zero or more large packets transmitted by the same call to send) if the connection had no outstanding data at the time the application presented data to TCP for transmission (assuming the TCP window allows this). To implement the modified Nagle algorithm in such a system, we replace snd.sml with two variables: ``snd.sml.add'' is a TCP variable which names the last byte presented to TCP by the application with a ``small'' send (i.e., the application called SEND with fewer than Eff.snd.MSS bytes of data); and ``snd.sml.snt'' is a TCP variable which names the highest value of snd.sml.add which has, in fact, been transmitted. The send routine contains the following code: if (byte.count < Eff.snd.MSS) { snd.sml.add = snd.una + snd.bytes.queued; } (where ``snd.bytes.queued'' is the number of bytes queued for transmission, and has already been updated with ``byte.count'', the number of bytes being presented to TCP in this call to SEND). The loop that transmits packets contains the following code: if (packet.size < Eff.snd.MSS) { if (snd.sm.snt > snd.una) { do not send the packet; } else { if ((snd.nxt + packet.size) <= snd.sm.add) { snd.sm.snt = snd.sm.add; } send the packet; } } (In most implementations, the most deeply nested ``if'' statement above is unnecessary, as a small-sized packet will contain all the data available to be transmitted, and so will include, or be beyond, snd.sm.add. In this case, the modified Nagle algorithm adds one test, one addition, and one assignment in the send routine, and one assignment in the output routine.) A Failure Mode If an application sends a large amount of data, followed by a small amount of data, followed by a large amount of data, the current Nagle algorithm would perform better than the proposed modification. The current Nagle algorithm would send at most one small packet (possibly the last packet), delaying the middle (small) amount of data which would allow the application to send the following large amount of data; the modified Nagle algorithm would send as many as two small packets (the middle packet, plus possibly a last packet). A separate, but desirable, system facility In addition to the Nagle algorithm (or the modification proposed by this draft), it would be desirable for a system providing TCP service to applications to allow the application to set TCP into a mode in which the TCP would only transmit small packets at the explicit direction of the application. For example, a system based on BSD might implement a socket option (using setsockopt(2)) SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly overloading the semantics of an existing flag, such as MSG_EOF). In this scenario, an application would set a socket into SO_EXPLICITPUSH mode, then enter a mode of writing data to the socket and, at the last write, using send(2) with the MSG_EOF flag. The underlying TCP would recognize the MSG_EOF flag as an indicator to transmit the (possibly) small packet. Like the proposed modification to the Nagle algorithm, this is fairly simple to implement. If a system were to implement this interface, it would be important to NOT disable Nagle when using this interface. In other words, when using this interface, the default mode for TCP would be to NOT transmit a small packet (even in the presence of MSG_EOF) if a previously transmitted small packet was as yet unacknowledged. Note, also, that implementing this interface does not eliminate the desirability of using the modification of the Nagle as the default for applications. More sophisticated networking applications might well use the new interface, but naive applications will often be adequately served by the modified Nagle algorithm. Application scenarios that will not be helped by this modification The proposed modification helps applications which do not need to transmit more than one small packet in a single round-trip time. This characterizes one way file transfer applications (such as FTP [RFC959]) and request/response protocols (such as NNTP [RFC977] and HTTP [RFC2068] without pipelining). However, applications that need to transmit more than one small packet in a single round-trip time are not served by this modification. An example of such an application is HTTP [RFC2068] using ``pipelining'', in which multiple requests (responses) are transmitted asynchronously. Applications needing to transmit more than one small packet in a single round-trip time will need other mechanisms to satisfy their requirements. (One possible such mechanism would be to use more than one TCP connection.) If an application developer is considering disabling the Nagle algorithm, they should be very careful to ensure that their application will generally provide data to TCP in chunks larger than two full-sized segments (> 2*Eff.snd.MSS), and they should verify after their development that this is, in fact, true. With Nagle disabled, many writes of small blocks of data can add significant load to the network, reducing the network's performance. Acknowledgements Jim Gettys, Henrik Frystyk Nielsen, Jeff Mogul, and Yasushi Saito, as well as a message forwarded to the end2end-interest list by Sean Doran, have motivated my current interest in the Nagle algorithm. John Heidemann's work related to the Nagle algorithm has informed some of the thinking in this draft; discussions with John have also been helpful. Members of the End-to-End Research Group (under the direction of Bob Braden) patiently listened to my discussion of the current state of the Nagle algorithm and to the modifications proposed in this document. Members of the TCP implementors mailing list have been very helpful in refining this proposal. In particular, Rick Jones, Neal Cardwell, Vernon Schryver, Bernie Volz, Sam Manthorpe, Art Shelest, David Borman, Kacheong Poon, Jon Snader, Eric Hall, Joe Touch, and Alan Cox. Security Considerations The Nagle algorithm does not have major security consequences. Implementation of this algorithm should not negatively impact the performance of the internet. The negative impact of implementation of this algorithm should be significantly less than disabling the Nagle algorithm. Appendix -- Sample application code The following code is provided to give application developers a model for buffering. We assume a BSD-style sockets API. #include #include #include #include #include #include #define SNDBUF_MULT 3 /* * 2 * TCP_MAXSEG -> SO_SNDBUF */ /* * Given a connected socket (s), configure the socket * with good buffer size defaults, and return the * the size the application should use for issuing * writes to the socket. * * Returns size to use for application buffering, or * zero (0) on error. */ int getbufsize(int s) { unsigned long bufsize, parm; int buflen; buflen = sizeof bufsize; if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &bufsize, &buflen) == -1) { perror("getsockopt(...TCP_MAXSEG...)"); return 0; } /* Set socket transmit buffer */ parm = 2*SNDBUF_MULT*bufsize; if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &parm, sizeof parm) == -1) { perror("setsockopt(SO_SNDBUF)"); return 0; } /* Now, set socket low water threshhold */ parm = 2*bufsize; if (setsockopt(s, SOL_SOCKET, SO_SNDLOWAT, &parm, sizeof parm) == -1) { perror("setsockopt(...SO_SNDLOWAT...)"); return 0; } return 2*bufsize; } int main(int argc, char *argv[]) { char *buffer = 0; int buflen; int sock; /* * ... allocate a socket (sock) and get it connected * via either connect(2) or listen(2)/accept(2). */ buflen = getbufsize(sock); if (buflen == 0) { fprintf(stderr, "aborting\n"); exit(1); } buffer = malloc(buflen); if (buffer == 0) { fprintf(stderr, "no room for buffer of size %d\n", buflen); exit(1); } /* * ... loop generating ``buflen'' data in buffer * and using send(2) to hand it to TCP. * When there is no more data to send, call * send(2) one last time with <= ``buflen'' * bytes. */ return 0; } References [RFC793] Postel, J. (ed), "Transmission Control Protocol", Sep-1981. [RFC854] Postel, J., J. Reynolds, "Telnet Protocol Specification", May-1983. [RFC959] Postel, J., J. Reynolds, "File Transfer Protocol (FTP)", Oct-1985. [RFC977] Kantor, B., P. Lapsley, "Network News Transfer Protocol", Feb-1986. [RFC896] Nagle, J., "Congestion control in IP/TCP internetworks", Jan-06-1984. [RFC1122] Braden, R. T., "Requirements for Internet hosts - communication layers", Oct-01-1989. [RFC2068] Fielding, R., J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1". Author's Address Greg Minshall Siara Systems 300 Ferguson Drive, 2nd floor Mountain View, CA 94043 USA