idnits 2.17.1 draft-stenberg-httpbis-tcp-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (August 3, 2016) is 2816 days in the past. Is this intentional? Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '1' on line 471 -- Looks like a reference, but probably isn't: '2' on line 473 ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 7230 (Obsoleted by RFC 9110, RFC 9112) ** Obsolete normative reference: RFC 7540 (Obsoleted by RFC 9113) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 httpbis D. 
Stenberg 3 Internet-Draft Mozilla 4 Intended status: Best Current Practice August 3, 2016 5 Expires: February 4, 2017 7 TCP Tuning for HTTP 8 draft-stenberg-httpbis-tcp-02 10 Abstract 12 This document records current best practice for using all versions of 13 HTTP over TCP. 15 Status of This Memo 17 This Internet-Draft is submitted in full conformance with the 18 provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF). Note that other groups may also distribute 22 working documents as Internet-Drafts. The list of current Internet- 23 Drafts is at http://datatracker.ietf.org/drafts/current/. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 This Internet-Draft will expire on February 4, 2017. 32 Copyright Notice 34 Copyright (c) 2016 IETF Trust and the persons identified as the 35 document authors. All rights reserved. 37 This document is subject to BCP 78 and the IETF Trust's Legal 38 Provisions Relating to IETF Documents 39 (http://trustee.ietf.org/license-info) in effect on the date of 40 publication of this document. Please review these documents 41 carefully, as they describe your rights and restrictions with respect 42 to this document. Code Components extracted from this document must 43 include Simplified BSD License text as described in Section 4.e of 44 the Trust Legal Provisions and are provided without warranty as 45 described in the Simplified BSD License. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 50 1.1. Notational Conventions . . . . . . . . . . . . . . . . . 3 51 2. Socket planning . . . . . . . . . . . . . . . . . . . . . . . 3 52 2.1. Number of open files . . . . . . . . . . . . . . . . . . 3 53 2.2. 
Number of concurrent network messages . . . . . . . . . . 3 54 2.3. Number of incoming TCP SYNs allowed to backlog . . . . . 3 55 2.4. Use the whole port range for local ports . . . . . . . . 4 56 2.5. Lower the TCP FIN timeout . . . . . . . . . . . . . . . . 4 57 2.6. Reuse sockets in TIME_WAIT state . . . . . . . . . . . . 4 58 2.7. TCP socket buffer sizes . . . . . . . . . . . . . . . . . 4 59 2.8. TCP Window Scaling . . . . . . . . . . . . . . . . . . . 5 60 2.9. Timers and timeouts . . . . . . . . . . . . . . . . . . . 5 61 3. TCP handshake . . . . . . . . . . . . . . . . . . . . . . . . 6 62 3.1. TCP Fast Open . . . . . . . . . . . . . . . . . . . . . . 6 63 3.2. Initial Window . . . . . . . . . . . . . . . . . . . . . 6 64 3.3. TCP SYN flood handling . . . . . . . . . . . . . . . . . 6 65 4. TCP transfers . . . . . . . . . . . . . . . . . . . . . . . . 7 66 4.1. Packet scheduling and flow control . . . . . . . . . . . 7 67 4.2. Explicit Congestion Control . . . . . . . . . . . . . . . 7 68 4.3. Nagle's Algorithm . . . . . . . . . . . . . . . . . . . . 7 69 4.4. Delayed ACKs . . . . . . . . . . . . . . . . . . . . . . 7 70 4.5. Keep-alive . . . . . . . . . . . . . . . . . . . . . . . 8 71 5. Re-using connections . . . . . . . . . . . . . . . . . . . . 8 72 5.1. Slow Start after Idle . . . . . . . . . . . . . . . . . . 8 73 5.2. TCP-Bound Authentications . . . . . . . . . . . . . . . . 8 74 6. Closing connections . . . . . . . . . . . . . . . . . . . . . 9 75 6.1. Half-close . . . . . . . . . . . . . . . . . . . . . . . 9 76 6.2. Abort . . . . . . . . . . . . . . . . . . . . . . . . . . 9 77 6.3. Close Idle Connections . . . . . . . . . . . . . . . . . 9 78 6.4. Tail Loss Probes . . . . . . . . . . . . . . . . . . . . 9 79 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 80 8. Security Considerations . . . . . . . . . . . . . . . . . . . 9 81 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 82 9.1. Normative References . . . 
. . . . . . . . . . . . . . . 9 83 9.2. Informative References . . . . . . . . . . . . . . . . . 10 84 9.3. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 11 85 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 11 86 Appendix B. Operating System Settings for Linux . . . . . . . . 11 87 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 12 89 1. Introduction 91 HTTP version 1.1 [RFC7230] as well as HTTP version 2 [RFC7540] are 92 defined to use TCP [RFC0793], and their performance can depend 93 greatly upon how TCP is configured. This document records the best 94 current practice for using HTTP over TCP, with a focus on improving 95 end-user perceived performance. 97 These practices are generally applicable to HTTP/1 as well as HTTP/2, 98 although some may note particular impact or nuance regarding a 99 particular protocol version. 101 There are countless scenarios, roles and setups where HTTP is being 102 used, so there can be no single specific "Right Answer" to most TCP 103 questions. This document intends only to cover the most important 104 areas of concern and suggest possible actions. 106 1.1. Notational Conventions 108 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 109 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 110 document are to be interpreted as described in [RFC2119]. 112 2. Socket planning 114 Your HTTP server or intermediary may need configuration changes to 115 some system tunables and timeout periods to perform optimally. 116 Actual values will depend on how you are scaling the platform, 117 horizontally or vertically, and other connection semantics. Changing 118 system limits and altering thresholds will change the behavior of 119 your web service and its dependencies. These dependencies are 120 usually common to other services running on the same system, so good 121 planning and testing are advised.
123 This is a list of values to consider and some general advice on how 124 those values can be modified on Linux systems. 126 2.1. Number of open files 128 A modern HTTP server will serve a large number of TCP connections and 129 in most systems each open socket equals an open file. Make sure that 130 the open file limit isn't a bottleneck. 132 2.2. Number of concurrent network messages 134 Raise the number of packets allowed to get queued when a particular 135 interface receives packets faster than the kernel can process them. 137 2.3. Number of incoming TCP SYNs allowed to backlog 139 This is the number of new connection requests that are allowed to queue up in 140 the kernel. These can be connections that are in SYN RECEIVED or 141 ESTABLISHED states. Historically, operating systems used a single 142 backlog queue for both of these states. Newer implementations use two 143 separate queues: one for connections in SYN RECEIVED and one for 144 those in the ESTABLISHED state (better known as the accept queue). 146 2.4. Use the whole port range for local ports 148 To make sure the TCP stack can take full advantage of the entire set 149 of possible sockets, give it a larger range of local port numbers to 150 use. 152 2.5. Lower the TCP FIN timeout 154 High connection completion rates will consume ephemeral ports 155 quickly. Lower the time during which connections are in FIN-WAIT-2/ 156 TIME_WAIT states so that they can be purged faster and thus maintain 157 a maximal number of available sockets. The primitives for the 158 assignment of these values were described in [RFC0793]; however, 159 significantly lower values are commonly used. 161 2.6. Reuse sockets in TIME_WAIT state 163 When running backend servers on a managed, low latency network you 164 might allow the reuse of sockets in TIME_WAIT state for new 165 connections when a protocol complete termination has occurred. There 166 is no RFC that covers this behaviour. 168 2.7.
TCP socket buffer sizes 170 Systems meant to handle and serve a huge number of TCP connections at 171 high speeds can require significant amounts of memory for TCP socket 172 buffers to maintain performance. On some systems you can tell the 173 TCP stack what default buffer sizes to use and how much they are 174 allowed to dynamically grow and shrink. Window Scaling is typically 175 linked to socket buffer sizes. 177 The minimum and default values for socket buffers tend to require 178 less proactive amendment than the maximum value. When deriving 179 maximum values for use, you should consider the BDP (Bandwidth Delay 180 Product) of the target environment and client paths. Consider also 181 that 'read' and 'write' values do not need to be synchronised, as 182 the BDP for a load balancer or middle-box might be very different 183 when acting as a sender or receiver due to the network characteristics 184 in either context; for example, a cache with a fast, low latency network to an 185 origin, serving high latency clients. 187 Allowing needlessly high values beyond the expected limitations of 188 the platform will not improve performance, but can cause buffer 189 induced delays within the path or excessive retransmissions during 190 congestion events. Extensions such as ECN coupled with AQM can help 191 mitigate this undesirable behaviour [RFC7141]. 193 2.8. TCP Window Scaling 195 Window Scaling is provided as a function of the congestion control 196 algorithm used on a platform. Initial and maximal values can usually 197 be configured. 199 The window size used at connection startup is a calculated value 200 using the MSS discovered during the three-way handshake (3WHS) and the Initial Window (IW) 201 in both send and receive contexts (initcwnd and initrwnd). [RFC7323] 202 covers Window Scaling in greater detail. 204 You may have to increase the largest allowed window size from the 205 system default to increase the throughput for high latency clients.
206 Window scaling must be accommodated within the maximal values; 207 however, it is not uncommon to see the maximum definable value higher than 208 the scalable limit; these values can be statically defined within 209 socket parameters (SO_RCVBUF, SO_SNDBUF). 211 Changes to the size of the window or incomplete window usage are 212 common secondary symptoms of 'slow transfer rates' on a loss free 213 path. Locating the root cause of these symptoms, usually on the 214 client or server system, is an important step commonly overlooked in 215 favour of blaming the path. 217 2.9. Timers and timeouts 219 On a modern shared platform it can be common to plan for both long 220 and short lived connections on the same implementation. However, the 221 delivery of static assets and a 'web push' or 'long poll' service 222 provide very different quality of service promises. 224 Fail 'fast': TCP resources can be highly contended. For fault 225 tolerance reasons a server needs to be able to determine within a 226 reasonable time frame whether a connection is still active or 227 required. For example, if static assets typically return in 100s of 228 milliseconds and users 'switch off' after <10s, keeping timeouts of 229 >30s makes little sense; defining a 'quality of service' 230 appropriate to the target platform is encouraged. On a shared 231 platform with mixed session lifetimes, applications that require 232 longer render times have various options to ensure the underlying 233 service and upstream servers in the path can identify the session as 234 not failed: HTTP continuations, Redirects, 202s or sending data. 236 Clients and servers typically have many timeout options; a few 237 notable ones are: connect (client), time to request (server), time 238 to first byte (client), between bytes (server/client), total connection 239 time (server/client). Some implementations merge these values into a 240 single 'timeout' definition even when statistics are reported 241 individually.
All should be considered, as the defaults in many 242 implementations are highly undesirable; even infinite timeouts have 243 been observed. 245 3. TCP handshake 247 3.1. TCP Fast Open 249 TCP Fast Open (a.k.a. TFO, [RFC7413]) allows data to be sent on the 250 TCP handshake, thereby allowing a request to be sent without any 251 delay if a connection is not open. 253 TFO requires both client and server support, and additionally 254 requires application knowledge, because the data sent on the SYN 255 needs to be idempotent. Therefore, TFO can only be used on 256 idempotent, safe HTTP methods (e.g., GET and HEAD), or with 257 intervening negotiation (e.g., using TLS). It should be noted that 258 TFO requires a secret to be defined on the server to mitigate 259 security vulnerabilities it introduces. TFO therefore requires more 260 server side deployment planning than other enhancements. 262 Support for TFO is growing in client platforms, especially mobile, 263 due to the significant performance advantage it gives. 265 3.2. Initial Window 267 [RFC6928] proposes a new IW of 10*MSS, and is now fairly widely 268 deployed server-side. Many implementations allow you to tune both 269 initcwnd and initrwnd values. Some implementations allow these 270 values to be applied to specific routes, which allows a greater degree 271 of control over known paths. 273 There has been experimentation with larger initial windows in 274 combination with packet pacing; however, IW10 has been reported to 275 perform fairly well in both general and high volume use cases. 277 3.3. TCP SYN flood handling 279 TCP SYN Flood mitigations [RFC4987] are necessary and there will be 280 thresholds to tweak. 282 4. TCP transfers 284 4.1. Packet scheduling and flow control 286 TBD cubic codel pacing 288 4.2. Explicit Congestion Control 290 Apple deploying in iOS and OSX [1]. 292 4.3.
Nagle's Algorithm 294 Nagle's Algorithm [RFC0896] is the mechanism that makes the TCP stack 295 hold (small) outgoing packets for a short period of time so that it 296 can potentially merge that packet with the next outgoing one. It is 297 optimized for throughput at the expense of latency. 299 HTTP/2 in particular requires that the client can send a packet back 300 fast even during transfers that are perceived as single direction 301 transfers. Even small delays in those sends can cause a significant 302 performance loss. 304 HTTP/1.1 is also affected, especially when sending off a full request 305 in a single write() system call. 307 In POSIX systems you switch it off like this: 309 int one = 1; 310 setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); 312 4.4. Delayed ACKs 314 Delayed ACK [RFC1122] is a mechanism enabled in most TCP stacks that 315 causes the stack to delay sending acknowledgement packets in response 316 to data. The ACK is delayed up until a certain threshold, or until 317 the peer has some data to send, in which case the ACK will be sent 318 along with that data. Depending on the traffic flow and TCP stack 319 this delay can be as long as 500ms. 321 This interacts poorly with peers that have Nagle's Algorithm enabled. 322 Because Nagle's Algorithm delays sending until either one MSS of data 323 is provided _or_ until an ACK is received for all sent data, delaying 324 ACKs can force Nagle's Algorithm to buffer packets when it doesn't 325 need to (that is, when the other peer has already processed the 326 outstanding data). 328 Delayed ACKs can be useful in situations where it is reasonable to 329 assume that a data packet will almost immediately (within 500ms) 330 cause data to be sent in the other direction. In general in both 331 HTTP/1.1 and HTTP/2 this is unlikely: therefore, disabling Delayed 332 ACKs can provide an improvement in latency. 334 However, the TLS handshake is a clear exception to this case. 
For 335 the duration of the TLS handshake it is likely to be useful to keep 336 Delayed ACKs enabled. 338 Additionally, for low-latency servers that can guarantee responses to 339 requests within 500ms, on long-running connections (such as HTTP/2), 340 and when requests are small enough to fit within a small packet, 341 leaving delayed ACKs turned on may provide minor performance 342 benefits. 344 Effective use of switching off delayed ACKs requires extensive 345 profiling. 347 4.5. Keep-alive 349 TCP keep-alive is likely disabled - at least on mobile clients for 350 energy saving purposes. App-level keep-alive is then required for 351 long-lived requests to detect failed peers or connections reset by 352 stateful firewalls etc. 354 5. Re-using connections 356 5.1. Slow Start after Idle 358 Slow-start is one of the algorithms that TCP uses to control 359 congestion inside the network. It is also known as the exponential 360 growth phase. Each TCP connection will start off in slow-start but 361 will also go back to slow-start after a certain amount of idle time. 363 5.2. TCP-Bound Authentications 365 There are several HTTP authentication mechanisms in use today that 366 are used or can be used to authenticate a connection rather than a 367 single HTTP request. Two popular ones are NTLM and Negotiate. 369 If such an authentication has been negotiated on a TCP connection, 370 that connection can remain authenticated throughout the rest of its 371 lifetime. This discrepancy with how other HTTP authentications work 372 makes it important to handle these connections with care. 374 6. Closing connections 376 6.1. Half-close 378 The client or server is free to half-close after a request or 379 response has been completed; or when there is no pending stream in 380 HTTP/2. 
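As a minimal sketch, assuming a POSIX system: a half-close maps onto shutdown() with SHUT_WR, which ends the sending direction while leaving the receiving direction open. The helper names half_close() and half_close_demo() are illustrative only, and the demo uses a local socket pair as a stand-in for a real TCP connection.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Half-close: stop sending on fd but keep receiving.
 * Returns 0 on success, -1 on error (errno is set by shutdown()). */
static int half_close(int fd)
{
    return shutdown(fd, SHUT_WR);
}

/* Demo on a local socket pair (an assumption made so the example is
 * self-contained; a real server would use an accepted TCP socket).
 * Returns 0 if the peer saw end-of-file and could still reply. */
static int half_close_demo(void)
{
    int sv[2];
    char buf[16];
    ssize_t n;
    int ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Side 0 announces that it will send nothing more. */
    if (half_close(sv[0]) == 0 &&
        /* Side 1 observes end-of-file: read() returns 0. */
        read(sv[1], buf, sizeof(buf)) == 0 &&
        /* Side 1 can still send data in the other direction. */
        write(sv[1], "done", 4) == 4) {
        /* Side 0 can still receive after its half-close. */
        n = read(sv[0], buf, sizeof(buf));
        ok = (n == 4 && memcmp(buf, "done", 4) == 0);
    }

    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

A caller that has half-closed would normally keep reading until read() returns 0 and only then call close(), so that the peer's remaining data is not discarded.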
382 Half-closing is sometimes the only way for a server to make sure it 383 closes down connections cleanly so that it doesn't accept more 384 requests while still allowing clients to receive the ongoing 385 responses. 387 6.2. Abort 389 No client abort for HTTP/1.1 after the request body has been sent. 390 Delayed full close is expected following an error response to avoid 391 RST on the client. 393 6.3. Close Idle Connections 395 Keeping open connections around for subsequent connection reuse is 396 key for many HTTP clients' performance. The value of an existing 397 connection quickly degrades and after only a few minutes the chance 398 that a connection will successfully get reused by a web browser is 399 slim. 401 6.4. Tail Loss Probes 403 draft [2] 405 7. IANA Considerations 407 This document does not require action from IANA. 409 8. Security Considerations 411 TBD 413 9. References 415 9.1. Normative References 417 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 418 RFC 793, DOI 10.17487/RFC0793, September 1981, 419 . 421 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 422 Requirement Levels", BCP 14, RFC 2119, 423 DOI 10.17487/RFC2119, March 1997, 424 . 426 [RFC7230] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer 427 Protocol (HTTP/1.1): Message Syntax and Routing", 428 RFC 7230, DOI 10.17487/RFC7230, June 2014, 429 . 431 [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext 432 Transfer Protocol Version 2 (HTTP/2)", RFC 7540, 433 DOI 10.17487/RFC7540, May 2015, 434 . 436 9.2. Informative References 438 [RFC0896] Nagle, J., "Congestion Control in IP/TCP Internetworks", 439 RFC 896, DOI 10.17487/RFC0896, January 1984, 440 . 442 [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts - 443 Communication Layers", STD 3, RFC 1122, 444 DOI 10.17487/RFC1122, October 1989, 445 . 447 [RFC4987] Eddy, W., "TCP SYN Flooding Attacks and Common 448 Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, 449 . 
451 [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, 452 "Increasing TCP's Initial Window", RFC 6928, 453 DOI 10.17487/RFC6928, April 2013, 454 . 456 [RFC7141] Briscoe, B. and J. Manner, "Byte and Packet Congestion 457 Notification", BCP 41, RFC 7141, DOI 10.17487/RFC7141, 458 February 2014, . 460 [RFC7323] Borman, D., Braden, B., Jacobson, V., and R. 461 Scheffenegger, Ed., "TCP Extensions for High Performance", 462 RFC 7323, DOI 10.17487/RFC7323, September 2014, 463 . 465 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 466 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 467 . 469 9.3. URIs 471 [1] https://developer.apple.com/videos/wwdc/2015/?id=719 473 [2] http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01 475 Appendix A. Acknowledgments 477 This specification builds upon previous work and help from Mark 478 Nottingham and Craig Taylor. 480 Appendix B. Operating System Settings for Linux 482 Here are some sample operating system settings for the Linux 483 operating system, each with the section it refers to.
485 Section 2.1

487     fs.file-max =

489 Section 2.2

491     net.core.netdev_max_backlog =

493 Section 2.3

495     net.core.somaxconn =

497 Section 2.4

499     net.ipv4.ip_local_port_range = 1024 65535

501 Section 2.5

503     net.ipv4.tcp_fin_timeout =

505 Section 2.6

507     net.ipv4.tcp_tw_reuse = 1

509 Section 2.7

511     net.ipv4.tcp_wmem =

513 Section 2.7

515     net.ipv4.tcp_rmem =

516 Section 2.8

518     net.core.rmem_max =

520 Section 2.8

522     net.core.wmem_max =

524 Section 5.1

526     net.ipv4.tcp_slow_start_after_idle = 0

528 Section 4.3 Turning off Nagle's Algorithm:

530     int one = 1;
531     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

533 Section 4.4

535 On recent Linux kernels (since Linux 2.4.4), Delayed ACKs can be
536 disabled like this:

538     int one = 1;
539     setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));

541 Unlike disabling Nagle's Algorithm, disabling Delayed ACKs on Linux
542 is not a one-time operation: processing within the TCP stack can
543 cause Delayed ACKs to be re-enabled. As a result, to use
544 "TCP_QUICKACK" effectively requires setting and unsetting the socket
545 option during the life of the connection.

547 Author's Address

549 Daniel Stenberg
550 Mozilla

552 Email: daniel@haxx.se
553 URI: http://daniel.haxx.se