Internet Engineering Task Force                           Greg Minshall
INTERNET-DRAFT                                            Siara Systems
draft-minshall-nagle-01.txt                               June 17, 1999

             A Proposed Modification to Nagle's Algorithm

Status of This Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This draft proposes a modification to Nagle's algorithm (as
   specified in RFC896) to allow TCP, under certain conditions, to
   send a small-sized packet immediately after one or more maximum
   segment sized packets.

Abstract

   The Nagle algorithm is one of the primary mechanisms which protects
   the internet from poorly designed and/or poorly implemented
   applications. However, for a certain class of applications
   (notably, request-response protocols) the Nagle algorithm interacts
   poorly with delayed acknowledgements to give these applications
   poorer performance.

   This draft is NOT suggesting that these applications should disable
   the Nagle algorithm.

   This draft suggests a fairly small and simple modification to the
   Nagle algorithm which preserves the Nagle algorithm as a means of
   protecting the internet while at the same time giving better
   performance to a wider class of applications.

Introduction to the Nagle algorithm

   The Nagle algorithm [RFC896] protects the internet from
   applications (most notably Telnet [RFC854], at the time the
   algorithm was developed) which tend to dribble small amounts of
   data to TCP. Without the Nagle algorithm, TCP would transmit a
   packet, with a small amount of data, in response to each of the
   application's writes to TCP. With the Nagle algorithm, a first
   small packet will be transmitted, then subsequent writes from the
   application will be buffered at the sending TCP until either i)
   enough application data has accumulated to enable TCP to transmit a
   maximum sized packet, or ii) the initial small packet is
   acknowledged by the receiving TCP.
   This limits the number of small
   packets to one per round trip time.

   While the current Nagle algorithm does a very good job of
   protecting the internet from such applications, there are other
   applications, such as request-response protocols (with HTTP
   [RFC2068] being a topical example) in which the current Nagle
   algorithm interacts with TCP's ``delayed ACK'' policy [RFC1122]
   to produce non-optimal results.

Delayed ACKs

   A receiving TCP tries to avoid acknowledging every received data
   packet in the hope of ``piggy-backing'' the acknowledgement on a
   data packet flowing in the reverse direction or combining the
   acknowledgement with a window update flowing in the reverse
   direction. This process, known as ``delayed ACKing'' [RFC1122],
   typically causes an ACK to be generated for every other received
   (full-sized) data packet. In the case of an ``isolated'' TCP
   packet (i.e., where a second TCP packet is not going to arrive
   anytime soon), the delayed ACK policy causes an acknowledgement for
   the data in the isolated packet to be delayed by up to 200
   milliseconds after the receipt of the isolated packet (the actual
   maximum time the acknowledgement can be delayed is 500ms [RFC1122],
   but most systems implement a maximum of 200ms, and we shall assume
   that number in this document). The way delayed ACKs are
   implemented in some systems causes the delayed ACK to be generated
   anytime between 0ms and 200ms; in this case, the average amount of
   time before the delayed ACK is generated is 100ms.

The interaction of delayed ACKs and Nagle

   If a TCP has more application data to transmit than will fit in one
   packet, but less than two full-sized packets' worth of data, it
   will transmit the first packet. As a result of Nagle, it will not
   transmit the second packet until the first packet has been
   acknowledged.
   On the other hand, the receiving TCP will delay
   acknowledging the first packet until either i) a second packet
   arrives (which, in this case, won't arrive), or ii) approximately
   100ms (and a maximum of 200ms) has elapsed.

   When the sending TCP receives the delayed ACK, it can then transmit
   its second packet.

   In a request-response protocol, this second packet will complete
   either a request or a response, which then enables a succeeding
   response or request.

   Note two (related) bad results of the interaction of delayed ACKs
   and the Nagle algorithm in this case: the request-response time may
   be increased by up to 400ms (if both the request and the response
   are delayed); and, consequently, the number of transactions per
   second is substantially reduced.

A proposed modification to the Nagle algorithm

   In the following discussion we make use of the following variables
   defined in the TCP RFC [RFC793] and in the host requirements RFC
   [RFC1122]: ``snd.nxt'' is a TCP variable which names the next byte
   of data to be transmitted; ``snd.una'' is a TCP variable which
   names the next byte of data to be acknowledged (if snd.nxt equals
   snd.una, then all previous packets have been acknowledged);
   Eff.snd.MSS is the largest TCP payload (user data) that can be
   transmitted in one packet.

   The current Nagle algorithm does not require any other state to be
   kept by TCP on a system.

   The proposed modification to the Nagle algorithm does,
   unfortunately, require one new state variable to be kept by TCP:
   ``snd.sml'' is a TCP variable which names the last byte of data in
   the most recently transmitted small packet.

   The current Nagle algorithm can be described as follows:

       "If a TCP has less than a full-sized packet to transmit,
       and if any previous packet has not yet been acknowledged,
       do not transmit a packet."

   and in pseudo-code:

       if ((packet.size < Eff.snd.MSS) && (snd.nxt > snd.una)) {
           do not send the packet;
       }

   The proposed Nagle algorithm modifies this as follows:

       "If a TCP has less than a full-sized packet to transmit,
       and if any previously transmitted less than full-sized
       packet has not yet been acknowledged, do not transmit
       a packet."

   and in pseudo-code:

       if (packet.size < Eff.snd.MSS) {
           if (snd.sml > snd.una) {
               do not send the packet;
           } else {
               snd.sml = snd.nxt+packet.size;
               send the packet;
           }
       }

   In other words, when running Nagle, only look at the recent
   transmission (and acknowledgement) of small packets (rather than
   all packets, as in the current Nagle).

   (In writing the above, I am aware that TCP acknowledges bytes, not
   packets. However, expressing the algorithm in terms of packets
   seems to make the explanation a bit clearer.)

Implementing Nagle at Send

   The above description of the current Nagle algorithm and of the
   proposed modification assumes that the Nagle algorithm is being
   implemented just as TCP is about to hand a packet to IP to be
   transmitted, i.e., the algorithm is looking at the sizes of the
   packets it transmits.

   In reality, many TCPs essentially implement Nagle at the interface
   where applications present data to TCP to be transmitted (i.e., in
   the call to ``SEND'', as defined in section 3.8 of the TCP
   specification [RFC793]). The motivation for this is to not
   penalize applications that provide data to TCP in large chunks
   (ideally a multiple of Eff.snd.MSS).

   This allows a single application send to be broken into zero or
   more full-sized packets, possibly followed by one small packet,
   without forcing any delay on the trailing small packet.
   For
   example, one implementation with which the author is familiar first
   captures the boolean ``snd.nxt > snd.una'' in a temporary variable
   (``busy''):

       busy = (snd.nxt > snd.una);

   then goes into a loop transmitting packets out of the data which
   has been presented to TCP by the application; the loop contains the
   following code to implement the current Nagle algorithm:

       if ((packet.size < Eff.snd.MSS) && busy) {
           do not send the packet;
       }

   Since ``busy'' is a constant in the loop transmitting packets, a
   trailing small packet will be transmitted (after zero or more large
   packets transmitted by the same call to send) if the connection had
   no outstanding data at the time the application presented data to
   TCP for transmission (assuming the TCP window allows this).

   To implement the modified Nagle algorithm in such a system, we
   replace snd.sml with two variables: ``snd.sml.add'' is a TCP
   variable which names the last byte presented to TCP by the
   application with a ``small'' send (i.e., the application called
   SEND with fewer than Eff.snd.MSS bytes of data); and
   ``snd.sml.snt'' is a TCP variable which names the highest value of
   snd.sml.add which has, in fact, been transmitted. The send routine
   contains the following code:

       if (byte.count < Eff.snd.MSS) {
           snd.sml.add = snd.una + snd.bytes.queued;
       }

   (where ``snd.bytes.queued'' is the number of bytes queued for
   transmission, and has already been updated with ``byte.count'', the
   number of bytes being presented to TCP in this call to SEND).
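
   The send-routine bookkeeping just described can be sketched in C
   roughly as follows. This is only a minimal model under stated
   assumptions, not code from any particular TCP: the structure
   ``sketch_tcb'' and the function ``sketch_send'' are hypothetical
   names introduced for illustration, sequence numbers are plain
   unsigned integers, and wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-connection state (illustrative names). */
struct sketch_tcb {
    uint32_t snd_una;          /* next byte to be acknowledged (snd.una) */
    uint32_t snd_bytes_queued; /* bytes queued from snd.una onward */
    uint32_t snd_sml_add;      /* end of the most recent "small" send */
};

/*
 * Model of the SEND entry point: queue byte_count bytes and, if this
 * send was smaller than eff_snd_mss, record where it ends, using the
 * same expression as the pseudo-code (snd.una + snd.bytes.queued).
 * Assumption: no sequence-number wraparound.
 */
void sketch_send(struct sketch_tcb *tcb, uint32_t byte_count,
                 uint32_t eff_snd_mss)
{
    assert(tcb != NULL && eff_snd_mss > 0);

    /* The queue length is updated before the small-send test. */
    tcb->snd_bytes_queued += byte_count;

    if (byte_count < eff_snd_mss) {
        tcb->snd_sml_add = tcb->snd_una + tcb->snd_bytes_queued;
    }
}
```

   In this sketch a large send leaves snd_sml_add untouched, so only
   sends shorter than one full-sized packet move the marker that the
   transmit path later compares against snd.una.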

   The loop that transmits packets contains the following code:

       if (packet.size < Eff.snd.MSS) {
           if (snd.sml.snt > snd.una) {
               do not send the packet;
           } else {
               if ((snd.nxt + packet.size) <= snd.sml.add) {
                   snd.sml.snt = snd.sml.add;
               }
               send the packet;
           }
       }

   (In most implementations, the most deeply nested ``if'' statement
   above is unnecessary, as a small-sized packet will contain all the
   data available to be transmitted, and so will include, or be
   beyond, snd.sml.add. In this case, the modified Nagle algorithm
   adds one test, one addition, and one assignment in the send
   routine, and one assignment in the output routine.)

A Failure Mode

   If an application sends a large amount of data, followed by a small
   amount of data, followed by a large amount of data, the current
   Nagle algorithm would perform better than the proposed
   modification. The current Nagle algorithm would send at most one
   small packet (possibly the last packet), delaying the middle
   (small) amount of data which would allow the application to send
   the following large amount of data; the modified Nagle algorithm
   would send as many as two small packets (the middle packet, plus
   possibly a last packet).

A separate, but desirable, system facility

   In addition to the Nagle algorithm (or the modification proposed by
   this draft), it would be desirable for a system providing TCP
   service to applications to allow the application to set TCP into a
   mode in which the TCP would only transmit small packets at the
   explicit direction of the application. For example, a system based
   on BSD might implement a socket option (using setsockopt(2))
   SO_EXPLICITPUSH, as well as a flag to sendto(2) (possibly
   overloading the semantics of an existing flag, such as MSG_EOF).
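
   The intended behaviour of such a mode can be illustrated with a
   small user-space model in C. This is a sketch only: it does not
   call any real socket option (SO_EXPLICITPUSH is the hypothetical
   name used above), the names ``push_model'' and ``push_model_write''
   are invented for illustration, and the model deliberately ignores
   acknowledgements and the TCP window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a connection in "explicit push" mode. */
struct push_model {
    size_t mss;        /* stands in for Eff.snd.MSS */
    size_t queued;     /* bytes accepted but not yet transmitted */
    size_t full_pkts;  /* full-sized packets emitted so far */
    size_t small_pkts; /* small packets emitted so far */
};

/*
 * Accept nbytes from the application. Full-sized packets are emitted
 * as soon as enough data is queued; a trailing small packet is
 * emitted only when the application sets push (the role played by
 * MSG_EOF in the text). Assumption: ACKs and windows are not modelled.
 */
void push_model_write(struct push_model *m, size_t nbytes, int push)
{
    assert(m != NULL && m->mss > 0);

    m->queued += nbytes;

    /* Drain every complete full-sized packet. */
    while (m->queued >= m->mss) {
        m->queued -= m->mss;
        m->full_pkts++;
    }

    /* A leftover small packet leaves only on explicit request. */
    if (push && m->queued > 0) {
        m->queued = 0;
        m->small_pkts++;
    }
}
```

   With a 1000-byte MSS in this model, writes of 300 and then 900
   bytes without the push flag yield one full-sized packet and leave
   200 bytes queued; a final 50-byte write with the flag set flushes
   the remaining 250 bytes as a single small packet.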

   In this scenario, an application would set a socket into
   SO_EXPLICITPUSH mode, then enter a mode of writing data to the
   socket and, at the last write, using send(2) with the MSG_EOF flag.
   The underlying TCP would recognize the MSG_EOF flag as an indicator
   to transmit the (possibly) small packet.

   Like the proposed modification to the Nagle algorithm, this is
   fairly simple to implement.

   If a system were to implement this interface, it would be important
   to NOT disable Nagle when using this interface. In other words,
   when using this interface, the default mode for TCP would be to NOT
   transmit a small packet (even in the presence of MSG_EOF) if a
   previously transmitted small packet was as yet unacknowledged.

   Note, also, that implementing this interface does not eliminate the
   desirability of using the modified Nagle algorithm as the default
   for applications. More sophisticated networking applications might
   well use the new interface, but naive applications will often be
   adequately served by the modified Nagle algorithm.

Application scenarios that will not be helped by this modification

   The proposed modification helps applications which do not need to
   transmit more than one small packet in a single round-trip time.
   This characterizes one-way file transfer applications (such as FTP
   [RFC959]) and request/response protocols (such as NNTP [RFC977] and
   HTTP [RFC2068] without pipelining).

   However, applications that need to transmit more than one small
   packet in a single round-trip time are not served by this
   modification. An example of such an application is HTTP [RFC2068]
   using ``pipelining'', in which multiple requests (responses) are
   transmitted asynchronously.

   Applications needing to transmit more than one small packet in a
   single round-trip time will need other mechanisms to satisfy their
   requirements.
   (One possible such mechanism would be to use more
   than one TCP connection.)

   If an application developer is considering disabling the Nagle
   algorithm, they should be very careful to ensure that their
   application will generally provide data to TCP in chunks larger
   than two full-sized segments (> 2*Eff.snd.MSS), and they should
   verify after their development that this is, in fact, true. With
   Nagle disabled, many writes of small blocks of data can add
   significant load to the network, reducing the network's performance.

Acknowledgements

   Jim Gettys, Henrik Frystyk Nielsen, Jeff Mogul, and Yasushi Saito,
   as well as a message forwarded to the end2end-interest list by Sean
   Doran, have motivated my current interest in the Nagle algorithm.
   John Heidemann's work related to the Nagle algorithm has informed
   some of the thinking in this draft; discussions with John have also
   been helpful. Members of the End-to-End Research Group (under
   the direction of Bob Braden) patiently listened to my discussion of
   the current state of the Nagle algorithm and to the modifications
   proposed in this document.

   Members of the TCP implementors mailing list
   have been very helpful in refining this
   proposal. In particular, Rick Jones, Neal Cardwell, Vernon
   Schryver, Bernie Volz, Sam Manthorpe, Art Shelest, David Borman,
   Kacheong Poon, Jon Snader, Eric Hall, Joe Touch, and Alan Cox.

Security Considerations

   The Nagle algorithm does not have major security consequences.

   Implementation of this algorithm should not negatively impact
   the performance of the internet. The negative impact of
   implementation of this algorithm should be significantly less
   than that of disabling the Nagle algorithm.

Appendix -- Sample application code

   The following code is provided to give application developers a
   model for buffering. We assume a BSD-style sockets API.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/types.h>
       #include <sys/socket.h>
       #include <netinet/in.h>
       #include <netinet/tcp.h>

       #define SNDBUF_MULT 3 /* * 2 * TCP_MAXSEG -> SO_SNDBUF */

       /*
        * Given a connected socket (s), configure the socket
        * with good buffer size defaults, and return the
        * size the application should use for issuing
        * writes to the socket.
        *
        * Returns size to use for application buffering, or
        * zero (0) on error.
        */
       int
       getbufsize(int s)
       {
           int bufsize, parm;   /* these options take an int value */
           socklen_t buflen;    /* option length, as the API expects */

           buflen = sizeof bufsize;
           if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
                          &bufsize, &buflen) == -1) {
               perror("getsockopt(...TCP_MAXSEG...)");
               return 0;
           }

           /* Set socket transmit buffer */
           parm = 2*SNDBUF_MULT*bufsize;
           if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                          &parm, sizeof parm) == -1) {
               perror("setsockopt(SO_SNDBUF)");
               return 0;
           }

           /* Now, set socket low water threshold */
           parm = 2*bufsize;
           if (setsockopt(s, SOL_SOCKET, SO_SNDLOWAT,
                          &parm, sizeof parm) == -1) {
               perror("setsockopt(...SO_SNDLOWAT...)");
               return 0;
           }

           return 2*bufsize;
       }

       int
       main(int argc, char *argv[])
       {
           char *buffer = 0;
           int buflen;
           int sock;

           /*
            * ... allocate a socket (sock) and get it connected
            * via either connect(2) or listen(2)/accept(2).
            */

           buflen = getbufsize(sock);
           if (buflen == 0) {
               fprintf(stderr, "aborting\n");
               exit(1);
           }

           buffer = malloc(buflen);
           if (buffer == 0) {
               fprintf(stderr,
                       "no room for buffer of size %d\n",
                       buflen);
               exit(1);
           }

           /*
            * ... loop generating ``buflen'' data in buffer
            * and using send(2) to hand it to TCP.
            * When there is no more data to send, call
            * send(2) one last time with <= ``buflen''
            * bytes.
            */

           return 0;
       }

References

   [RFC793]  Postel, J. (ed), "Transmission Control Protocol",
             Sep-1981.
   [RFC854]  Postel, J., J.
             Reynolds, "Telnet Protocol
             Specification", May-1983.
   [RFC959]  Postel, J., J. Reynolds, "File Transfer Protocol
             (FTP)", Oct-1985.
   [RFC977]  Kantor, B., P. Lapsley, "Network News Transfer
             Protocol", Feb-1986.
   [RFC896]  Nagle, J., "Congestion control in IP/TCP internetworks",
             Jan-06-1984.
   [RFC1122] Braden, R. T., "Requirements for Internet hosts -
             communication layers", Oct-01-1989.
   [RFC2068] Fielding, R., J. Gettys, J. Mogul, H. Frystyk,
             T. Berners-Lee, "Hypertext Transfer Protocol
             -- HTTP/1.1".

Author's Address

   Greg Minshall
   Siara Systems
   300 Ferguson Drive, 2nd floor
   Mountain View, CA 94043
   USA