INTERNET-DRAFT                                                  T. Faber
Expires: February 10, 1998                                      J. Touch
draft-faber-time-wait-avoidance-00.txt                            W. Yue
                                                                 USC/ISI
                                                             August 1997

           Avoiding the TCP TIME_WAIT state at Busy Servers

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   To view the entire list of current Internet-Drafts, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Abstract

   This document describes the problems associated with the accumulation of TCP TIME_WAIT states at a network server, such as a web server, and details two methods for avoiding that accumulation.  Servers that have many TCP connections in TIME_WAIT state experience performance degradation and can collapse.  One solution is a TCP modification that causes clients to enter TIME_WAIT state rather than servers.  The other is an HTTP modification that allows the client to close the transport connection, maintaining the TIME_WAIT state at the client.  The goal of both approaches is to ensure that TIME_WAIT states accumulate at the less loaded endpoint.

   The document also presents initial performance data from reference implementations of these solutions, which suggest that the modifications improve HTTP connection rates at the server by as much as 50%, and allow servers to operate at small-transaction throughputs that they cannot sustain in their default configuration.

Introduction

   This draft describes the causes and effects of TIME_WAIT TCP protocol control block (TCB) accumulation at servers and proposes independent application and transport level modifications that remove that buildup.
   We present experimental results showing a 50% improvement in HTTP connection rates, as measured by WebSTONE[1], as well as evidence that modified servers function at higher loads than unmodified servers can.

TIME_WAIT state and its effects

   TCP includes a mechanism to ensure that packets associated with one connection that are delayed in the network are not accepted by later connections between the same hosts[2].  The mechanism is implemented by the TIME_WAIT state of the TCP protocol.  When an endpoint closes a TCP connection, it keeps state about that connection, usually a copy of the TCB, for twice the maximum segment lifetime (MSL).  A connection in this state is in TIME_WAIT, and the endpoint holding the TIME_WAIT TCB rejects any packets addressed to the TIME_WAIT connection from the other endpoint.

   Keeping this TIME_WAIT TCB at either of the hosts prevents a new connection with the same combination of source address, source port, destination address, destination port from being created.  Either endpoint being in TIME_WAIT prevents data transfer on the connection, so protocol correctness is unaffected by which host holds the TIME_WAIT TCB.  Our modifications center on ensuring that the TIME_WAIT TCB is on the less loaded endpoint.

   Heavily loaded servers potentially keep thousands of TIME_WAIT TCBs, which consume memory and can slow active connections.  In BSD-based TCP implementations, TCBs are kept in mbufs, the memory allocation unit of the networking subsystem.  There are a finite number of mbufs available in the system, and mbufs consumed by TCBs cannot be used for other purposes, e.g., to move data.  Certain systems on high speed networks run out of mbufs due to TIME_WAIT buildup under high connection load.  A SPARCStation 20/71 under SunOS 4.1.3 on a 640 Mb/sec Myrinet[3] cannot support more than 60 connections/sec.

   Incoming packets must be demultiplexed by finding the receiving connection in the host's TCB list.  This process can be slowed when the TCB list is full of TIME_WAIT TCBs.  In the simplest implementation, the TCB list is searched linearly to demultiplex the incoming packet to the appropriate connection, which can make TCB lookup a bottleneck.  The additional search overhead can cut throughput between two SunOS 4.1.3 SPARCStations on a Myrinet in half.

Other Proposed Solutions

   There are other solutions to the increased lookup overhead problem, e.g., storing all TIME_WAIT TCBs at the end of the list and using them as a search terminator as BSDI's BSD/OS does[4], or hashing TCBs rather than keeping them in a list[5].  These solutions do not address the loss of memory due to accumulation of TIME_WAIT states, so servers may still be unable to serve a high client load.  These approaches improve system response until the server collapses due to lack of free mbufs; our approach of removing the TIME_WAIT state from the server eliminates this cause of server collapse.

   Allocating more memory to system mbufs or reducing the amount of data cached per connection allows servers to function under a higher load before collapsing.  The servers' performance will continue to degrade.  Moving TIME_WAIT to clients removes this cause of system degradation and collapse without changing resource allocations.

   The costs of accumulating TIME_WAIT TCBs have become more apparent as HTTP becomes more prevalent.
   Under HTTP 1.1, servers terminate connections by closing the underlying TCP connection[6], which results in accumulation of TCBs at servers[7].

   HTTP 1.1 reduces the number of connections per transaction using persistent connections; however, with respect to TIME_WAIT buildup, the use of persistent connections[6] is similar to adding more memory to servers: servers can support a larger load before the effect becomes noticeable, but performance eventually degrades.  Servers supporting persistent connections can support more transactions per connection, and will benefit from our modifications by being able to support more connections.

Our Proposed Solutions

   Because the accumulation of TIME_WAIT TCBs is caused by the interaction between transport and application protocols, modifications can be made to either protocol to alleviate it.  Changing the transport protocol confers the benefits on more applications, but there may be more resistance to changing a protocol on which many applications depend.  Application level changes restrict the benefits (and drawbacks) to the application for which the solution is implemented.  Furthermore, application solutions are not always possible; for example, protocols that use the closing of a transport connection to indicate end-of-file are not good candidates for removing TIME_WAIT TCBs at the application layer.

   This document proposes distinct extensions to TCP and to HTTP that allow hosts to control which end of the connection remains in TIME_WAIT state.  A solution needs to be implemented at only one level, transport or application.  We describe and measure both to have a basis for comparison.  Preliminary experiments indicate that both systems reduce the memory usage of web servers due to TIME_WAIT states to negligible levels, with accompanying performance improvements.  The TCP modifications require only client side changes, and can be deployed incrementally.  The HTTP changes affect client and server, but are compatible with HTTP 1.1 behavior, and can also be incrementally deployed.

   The remainder of this document presents the two proposed solutions, compares them, discusses the results of initial experiments with the solutions, draws conclusions, and outlines future work.

Transport Level (TCP) Solution

   The TCP solution exchanges the TIME_WAIT state between the server and client.  We modify the client's TCP implementation so that, after it has completed a passive close of a transport connection, it sends an RST packet to the server and puts itself into a TIME_WAIT state.  The RST packet removes the TCB in TIME_WAIT state from the server; the explicit transition to a TIME_WAIT state in the client preserves correct TCP behavior.  If the client RST is lost, both server and client remain in TIME_WAIT state, which also ensures correct behavior and is equivalent to a simultaneous close in the current protocol.  If either host reboots during the RST exchange, the behavior is the same as if a host running unmodified TCP fails with connections in TIME_WAIT state: packets will not be erroneously accepted if the host recovers and refuses connections until a 2 MSL period has elapsed[2].
   More formally, the change to the TCP state machine replaces the arc from LAST_ACK to CLOSED with an arc from LAST_ACK to TIME_WAIT; an RST is sent when that arc is traversed.  These modifications need to be made only to clients.

   Hosts that act primarily as clients may be configured with the new behavior for all connections; hosts that serve as both client and server, for example proxies, may be configured to support both behaviors.  The implementation of both behaviors is straightforward, although it requires a more extensive modification of the TCP state machine.

   Allowing both behaviors on the same host requires splitting the LAST_ACK state into two states, one that represents the current behavior (LAST_ACK) and one that represents the modified behavior (LAST_ACK_SWAP).  These states may both be reported as LAST_ACK to monitoring tools.  The state machine determines which state to enter from CLOSE_WAIT based on whether the application issues a close or a close_swap.

   The current passive close path is:

       server                                  client
       -----------------------------------------------------------
       ESTABLISHED                             ESTABLISHED
       (get application close)
       goto FIN_WAIT_1
       send FIN          ---FIN--->
                                               goto CLOSE_WAIT
                         <---ACK---            send ACK
       goto FIN_WAIT_2
                                               (get application close)
                                               goto LAST_ACK
                         <---FIN---            send FIN
       goto TIME_WAIT
       send ACK          ---ACK--->
                                               goto CLOSED

   This solution adds this branch from CLOSE_WAIT on the client side:

       server                                  client
       -----------------------------------------------------------
       ESTABLISHED                             ESTABLISHED
       (get application close)
       goto FIN_WAIT_1
       send FIN          ---FIN--->
                                               goto CLOSE_WAIT
                         <---ACK---            send ACK
       goto FIN_WAIT_2
                                               (get application close_swap)
                                               goto LAST_ACK_SWAP
                         <---FIN---            send FIN
       goto TIME_WAIT
       send ACK          ---ACK--->
                                               goto TIME_WAIT
                         <---RST---            send RST
       goto CLOSED

   Strictly speaking, the transition of the client to TIME_WAIT is extraneous, because any host sending an RST is obligated not to allow a connection between the same pair of addresses and ports for a time of at least 2 MSL.

   Distinguishing between close and close_swap does not require changing the application interface.  For example, a per-connection flag can be added to change the default behavior, where the default behavior is chosen based on whether the host is primarily a client or a server.  Hosts that are primarily clients will follow the close_swap path unless overridden, and servers will follow the close path.

   Implementations of this system do not change the API at all if all connections from the same host have the same semantics; hosts which are primarily clients will see no change.  Only hosts that support both semantics will see a change to the API, and this will be an additional socket option or similar small change.

   The solution we propose is designed to interoperate with the existing TCP specification.  A cleaner implementation of our solution would be to change both endpoint implementations to negotiate which endpoint maintains the TIME_WAIT TCB.  However, this would require changing all TCP implementations, which ours does not.

   A SunOS 4.1.3 patch is available from the authors.
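   The following sketch illustrates, in C, one way the close/close_swap distinction and the LAST_ACK_SWAP transition described above might be expressed.  It is an illustrative sketch only, not the SunOS 4.1.3 patch: the state, function, and variable names are invented for the example, and segment construction, sequence number checks, and the 2 MSL timer are omitted.

      /*
       * Illustrative sketch (not the SunOS 4.1.3 patch): client-side
       * passive close with the LAST_ACK_SWAP extension.  Real TCP code
       * must also build segments, check sequence numbers, and run the
       * 2*MSL timer that eventually discards the TIME_WAIT TCB.
       */
      #include <stdio.h>

      enum tcp_state {
          ESTABLISHED, CLOSE_WAIT, LAST_ACK, LAST_ACK_SWAP,
          TIME_WAIT, CLOSED
      };

      /* Application closes a connection that has already received the
       * peer's FIN (CLOSE_WAIT).  "swap" selects the modified behavior;
       * a FIN is sent in either case.                                  */
      static enum tcp_state app_close(enum tcp_state s, int swap)
      {
          if (s != CLOSE_WAIT)
              return s;
          return swap ? LAST_ACK_SWAP : LAST_ACK;
      }

      /* The ACK of our FIN arrives while we wait in LAST_ACK*.         */
      static enum tcp_state ack_of_fin(enum tcp_state s, int *send_rst)
      {
          *send_rst = 0;
          switch (s) {
          case LAST_ACK:            /* unmodified behavior              */
              return CLOSED;
          case LAST_ACK_SWAP:       /* modified behavior                */
              *send_rst = 1;        /* RST removes the server's TIME_WAIT TCB */
              return TIME_WAIT;     /* client keeps the 2*MSL state     */
          default:
              return s;
          }
      }

      int main(void)
      {
          int rst;
          enum tcp_state s = CLOSE_WAIT;   /* peer's FIN already received */

          s = app_close(s, 1 /* close_swap */);
          s = ack_of_fin(s, &rst);
          printf("client state: %s, send RST: %s\n",
                 s == TIME_WAIT ? "TIME_WAIT" : "other",
                 rst ? "yes" : "no");
          return 0;
      }

   On a host that supports both behaviors, the swap flag would correspond to the per-connection default or the socket option discussed above.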
Application Level Solution for HTTP

   Protocols that use the state of the transport connection as signalling dictate which endpoint closes a connection, and therefore which incurs the cost of the TIME_WAIT TCB.  For example, early HTTP servers used the state of the transport connection as an implicit indicator of both transaction lifetime and request length.  The server closing the TCP connection indicated to the client that the whole response had arrived, and, because there were no persistent connections, that the HTTP exchange was over.  Because the server was using the close to mark the end of both the transaction and the exchange, it was required to initiate the close.

   HTTP 1.1 has sufficient framing to allow a modification that shifts TIME_WAIT TCBs to the clients[6].  Responses are self-delineating; all responses include the size of the response either in the headers or via the chunking mechanism.  When using persistent connections, which is the default behavior in HTTP/1.1, requests have fields which can be used to control the transport connection.  The server is no longer required by the protocol to close the transport connection.

   To control the distribution of TIME_WAIT TCBs from the application level, our HTTP modifications arrange for the client to close the TCP connection.  This requires the client to be able to detect the end of a response.  Under HTTP 1.1, this information is available to the client as a side effect of persistent connections.  We advocate a change in client behavior that requires clients to close the transport connection underlying an HTTP connection, and an extension of the request format that allows the client to notify the server that it is breaking the TCP connection.

   We propose adding a CLIENT_CLOSE request to HTTP that indicates that a client is ending the HTTP exchange by closing the underlying TCP connection.  A CLIENT_CLOSE request requires no reply.  It terminates a series of requests on a persistent connection, and indicates to the server that the client has closed the TCP connection.  A client will initiate an active close on the TCP connection immediately after sending the CLIENT_CLOSE request to the server.

   A CLIENT_CLOSE request differs from including a "Connection: close" header in a request because a request that includes "Connection: close" still requires a reply from the server, and the server will (passively) close the connection[6].  A CLIENT_CLOSE request indicates that the client has severed the TCP connection, and that the server should close its end without replying.

   Incorporating CLIENT_CLOSE into the transaction is a minor extension to the HTTP protocol.  Current HTTP clients conduct an HTTP transaction by opening the TCP connection, making a series of requests with a "Connection: close" line in the final request header, and collecting the responses.  The server closes the connection after sending the final byte of the final response.  Modified clients open a connection to the server, make a series of requests, collect the responses, and send a CLIENT_CLOSE request to the server after the end of the last response.  The client closes the connection immediately after sending the CLIENT_CLOSE.

   Modified clients are compatible with the HTTP 1.1 specification[6]; an illustrative sketch of the modified client behavior appears below.
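   As an illustration, the sketch below shows the client side of such an exchange in C.  The draft does not specify the on-the-wire form of CLIENT_CLOSE; the request line used here ("CLIENT_CLOSE / HTTP/1.1"), the host name, and the helper function are assumptions made for the example, and response parsing is reduced to a single read.

      /*
       * Sketch of a modified HTTP client: issue requests on a persistent
       * connection, then send CLIENT_CLOSE and actively close, keeping
       * the TIME_WAIT TCB at the client.  The CLIENT_CLOSE request line
       * below is an assumed syntax; the draft leaves it unspecified.
       */
      #include <string.h>
      #include <unistd.h>
      #include <netdb.h>
      #include <sys/types.h>
      #include <sys/socket.h>

      static int http_connect(const char *host, const char *port)
      {
          struct addrinfo hints, *res;
          int fd;

          memset(&hints, 0, sizeof(hints));
          hints.ai_socktype = SOCK_STREAM;
          if (getaddrinfo(host, port, &hints, &res) != 0)
              return -1;
          fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
          if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
              close(fd);
              fd = -1;
          }
          freeaddrinfo(res);
          return fd;
      }

      int main(void)
      {
          char buf[4096];
          const char *req =
              "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
          const char *client_close =          /* assumed request syntax */
              "CLIENT_CLOSE / HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
          int fd = http_connect("www.example.com", "80");

          if (fd < 0)
              return 1;
          write(fd, req, strlen(req));     /* one of a series of requests */
          read(fd, buf, sizeof(buf));      /* response is self-delineating */
          write(fd, client_close, strlen(client_close)); /* no reply expected */
          close(fd);                       /* client performs the active close */
          return 0;
      }

   To a server that does not implement the extension, this traffic looks like an ordinary persistent-connection exchange followed by an unknown method and an early close, which is the compatibility case discussed next.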
   A server that does not understand CLIENT_CLOSE will see a conventional HTTP exchange, followed by a request that it does not implement, and a closed connection when it tries to send an error response.  A conformant server must be able to handle the client closing the TCP connection at any point.  The client has gotten its data, closed the connection, and holds the TIME_WAIT TCB.

   Modifying servers to recognize CLIENT_CLOSE can make parts of their implementation easier.  Mogul et al. note that detecting closed connections can be difficult for servers[6].  CLIENT_CLOSE marks closing connections, which simplifies the server code that detects and closes connections that clients have intentionally closed.

   The CLIENT_CLOSE request has been implemented directly in the apache-1.24[8] server and test programs from the WebSTONE performance suite.  Patches are available from the authors.

Initial Implementation

   In this section we present experiments that demonstrate the problem and show our solutions' effectiveness.  The proposed solutions have been implemented under SunOS 4.1.3, and initial evaluations of their performance have been made using both custom benchmark programs and the WebSTONE benchmark.  The tests were run on hosts connected to the 640 Mb/sec Myrinet LAN.

   We performed two experiments.  The first experiment shows that TCB load degrades server performance and that our modifications reduce that degradation.  The second illustrates that both our TCP and HTTP solutions improve server performance under the WebSTONE benchmark, which simulates typical HTTP traffic.  The last experiment shows that our modifications enable a server to support HTTP loads that it cannot support in its default configuration.

   The first experiment was designed to determine whether TCB load reduces server performance and whether our modifications alleviate that degradation.  This experiment used four Sparc 20/71s across the Myrinet using a user-level data transfer program over TCP.  The throughput is the average of each of two client hosts doing a simultaneous bulk transfer to the server host.  We vary the number of TIME_WAIT TCBs at the server by adding dummy TIME_WAIT states.

   The experiment was:

      1. Two client machines establish connections to the server.

      2. The server is loaded with extraneous TIME_WAIT TCBs by a fourth
         host.

      3. The two bulk transport connections transfer data.  (Throughput
         timing begins when the data transfer begins, not when the
         connection is established.  TIME_WAIT TCBs may expire during
         the transfer.)

      4. Between runs, the server is allowed to idle and remove TCBs,
         to control conditions for all runs.

   Each result is the average of ten runs.

      Connections     Throughput (Mb/sec)     Throughput (Mb/sec)
      in TIME_WAIT    (Unmodified)            (with TCP Modification)
                      avg.      std. dev.     avg.      std. dev.
      -----------------------------------------------------------------
             0        66.8      3.3           66.8      3.3
           500        49.6      3.9           66.8      3.3
          1000        41.9      4.1           66.5      3.1
          1500        35.3      2.8           64.6      3.0
          2000        31.2      4.9           64.3      3.0
          2500        30.5      3.0           64.3      2.9

                 Table 1: Worst case throughput experiment

   The experimental procedure is designed to isolate a worst case at the server.  The client connections are established first to put them at the end of the list of TCBs in the server kernel, which will maximize the time needed to find them using SunOS's linear search.
   Two clients are used to neutralize the simple caching behavior in the SunOS kernel, which consists of keeping a single pointer to the most recently accessed TCB.  Two distinct client hosts are used to allow bursts from the two clients to interleave; two client programs on the same host send bursts in lock-step, which reduces the cost of the TCB list scans.

   The experiment shows that under worst case conditions, TCB load can reduce throughput by as much as 50%, and that our TCP modifications improve performance under those conditions.

   While it is useful that our modifications perform well in the worst case, it is important to assess the worth of the modifications under expected conditions.  The previous experiment constructed a worst case scenario; the following experiment uses WebSTONE to test our modifications under a more typical HTTP load.

   WebSTONE is a standard benchmark used to measure web server performance in terms of connection rate and per-connection throughput.  To measure server performance, several workstations make HTTP requests of a server and monitor the response time and throughput.  A central process collects and combines the information from the individual web clients.  The benchmark has been augmented to measure the amount of memory consumed by TCBs on the server machine.  We used WebSTONE version 2 for these experiments.

   WebSTONE models a heavy load that simulates HTTP traffic.  Two hosts run multiple web clients which continuously request files ranging from 9KB to 5MB from the server.  Each host runs 20 web clients.

   Results from a typical run are summarized below:

      System                Throughput     Connections     TCB Memory Use
      Type                  (Mb/sec)       per second      (Kbytes)
      --------------------------------------------------------------------
      Unmodified            20.97          49.09           722.7
      TCP Modification      26.40          62.02            24.1
      HTTP Modifications    31.73          74.70            24.4

               Table 2: WebSTONE benchmark with large fileset

   Both modifications show marked improvements in throughput and connection rate.  The TCP modifications increase connection rate by 25%, and the HTTP modifications increase connection rate by 50%.  We believe the TCP modification is less effective than the HTTP modification because it adds another packet exchange.  Packet traces are being used to confirm this.  [note: more will be included on this in later drafts]

   When more clients request smaller files, unmodified systems fail completely because they run out of memory; systems using our modifications can support much higher connection rates than unmodified systems.  The following table reports data from a typical WebSTONE run using 8 clients on 4 hosts connecting to a dedicated server.  All clients request only 500 byte files.

      System                Throughput     Connections     TCB Memory Use
      Type                  (Mb/sec)       per second      (Kbytes)
      --------------------------------------------------------------------
      Unmodified            fails          fails           fails
      TCP Modification      1.14           223.8           16.1
      HTTP Modifications    1.14           222.4           16.1

                Table 3: WebSTONE benchmark with small files

   The experiments support the hypothesis that the proposed solutions reduce the memory load on servers.  The custom benchmark shows that the system with a modified transport performs much better in the worst case, and that server bandwidth loss can be considerable.
   The WebSTONE benchmark shows that both systems reduce memory usage, and that this leads to performance gains.  Finally, modified systems are able to handle workloads that unmodified systems cannot.

   This is a challenging test environment because the TCB load of the server host is spread across only two client hosts rather than the hundreds that would share the load in a real system.  The clients suffer some performance degradation due to the accumulating TCBs, much as the server does in the unmodified system.

Comparison of Methods

   The primary contrast between the TCP solution and the HTTP solution is that they are implemented at different levels of the protocol hierarchy.  The TCP solution has the benefits and drawbacks of a transport level solution: it applies the fix transparently to all application protocols running over TCP, but may be difficult to adopt for the same reason.  A change to TCP affects many applications, and many resist changes to TCP to avoid unintended consequences.  The HTTP solution has the trade-offs of an application modification: only HTTP will exhibit the new behavior, and other applications will see no benefit.  If another protocol causes a TIME_WAIT state buildup, an HTTP fix will not prevent it.

   The performance of our TCP modification will also be limited by how efficiently hosts process RST packets.  Hosts that incur a high overhead in handling RSTs, or that delay processing them, will not perform as well.  This may be one reason that the TCP solution shows less improvement than the HTTP solution in the large fileset experiment above.  [note: this will be expanded upon]

   The meaning of the RST packet is also changed by our TCP solution.  An RST packet is intended to indicate an unusual condition or error in the connection.  We are proposing making it part of standard operating procedure.  The change in semantics of the RST packet is a result of maintaining compatibility with current TCP.  Some browsers are currently using RST in unintended ways as well.

   Ideally, the two TCP endpoints would negotiate during connection establishment which of them will hold the TIME_WAIT TCB, but this would require changing the TCP packet format to allow room for that negotiation, and further changes to the state machine.  We believe such a system is the best solution to the TIME_WAIT TCB accumulation problem, but recognize that such a large change to TCP would be difficult to get adopted.

   Adopting the HTTP solution is effective if HTTP connections are the source of TIME_WAIT loading; however, if another protocol begins loading servers with TIME_WAIT states, that protocol will have to be fixed as well.  Currently, we believe HTTP causes the bulk of TIME_WAIT loading, which is why we chose to implement our solution under HTTP; in the future other protocols may be the source.

   Not adopting a TCP fix means that future protocols should be designed to control TIME_WAIT loading, which will constrain their semantics.  Specifically, application protocols will not be able to use the state of the transport connection as implicit signalling; application layer protocols will be constrained to include framing and connection control information, or run the risk of loading servers with TIME_WAIT states.  For example, streaming real-time transmission systems may make use of such implicit signalling.
   Some existing protocols, such as FTP[9], make use of implicit signalling and cannot be retrofitted with TIME_WAIT controls.  As these protocols are currently used, they do not appear to be major sources of TIME_WAIT loading.  They could become important sources of TIME_WAIT load if such a protocol has a resurgence or is used in new ways, or if its smaller loading characteristics become significant after the HTTP load is reduced.  If that happens, a backward-compatible solution may not be possible.

   Both the TCP and the HTTP solutions are incrementally deployable and solve the problem at hand.  Which to deploy in the Internet depends on how the community weighs changing the semantics of the existing transport protocol against restricting the semantics of future application protocols.

Conclusions

   This document has discussed the problem of server memory load due to terminated connections remaining in TIME_WAIT state.  Servers can become so memory poor at high connection rates that they are unable to transfer data at all.  Even if servers can continue to function, their performance can suffer.

   Two solutions to the memory load problem have been presented, at the transport (TCP) level and at the application (HTTP) level.  Both solutions allow a client to take on its share of the server memory load.  The transport level solution adds a new state and operation to the TCP state machine that explicitly moves the TIME_WAIT state from the active close initiator to the passive closer.  The application level solution adds an access method to HTTP that allows a client to notify a server that it is actively closing the connection and will maintain the TIME_WAIT state.

   Both solutions will interoperate with existing systems, allowing for easy deployment.  Patches are available from the authors for both solutions: TCP modifications are available for SunOS 4.1.3 and HTTP modifications are available for apache-1.24.

   Although there are certainly other methods of dealing with TIME_WAIT state accumulation, the methods presented here have the benefits that they preserve current TCP behavior, are incrementally deployable, and are small, simple changes to existing systems.  Most other solutions, such as ending connections with an RST or moving TIME_WAIT TCBs to other internal queues at the server, either break transport behavior or do not address the memory load problem directly.

Security Considerations

   The practices advocated in this document do not seem to affect the security of either the HTTP or TCP protocols.

   The increased use and change in semantics of RST packets may cause false alarms in systems that monitor them.

Authors' Addresses

   Ted Faber
   Joseph Touch
   Wei Yue
   University of Southern California/Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292-6695
   USA
   Phone: +1 310 822 1511
   Fax:   +1 310 823 6714
   EMail: faber@isi.edu
          touch@isi.edu
          wyue@isi.edu

   This draft expires March 20, 1998.

References

   1. Gene Trent and Mark Sake, "WebSTONE: The First Generation in HTTP
      Server Benchmarking," white paper, Silicon Graphics International
      (February 1995), available electronically.

   2. Jon Postel, "Transmission Control Protocol," RFC-793/STD-7
      (September 1981).
   3. Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E.
      Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su
      (Myricom, Inc.), "Myrinet: A Gigabit-per-second Local Area
      Network," IEEE Micro, pp. 29-36, IEEE (February 1995).

   4. Mike Karels and David Borman, Personal Communication (July 1997).

   5. Paul E. McKenney and Ken F. Dove, "Efficient Demultiplexing of
      Incoming TCP Packets," Proceedings of SIGCOMM 1992, vol. 22,
      no. 4, pp. 269-279, Baltimore, MD (August 17-20, 1992).

   6. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee,
      "Hypertext Transfer Protocol -- HTTP/1.1," RFC-2068 (January
      1997).

   7. Robert G. Moskowitz, "Why in the World Is the Web So Slow,"
      Network Computing, pp. 22-24 (March 15, 1996).

   8. Roy T. Fielding and Gail Kaiser, "Collaborative Work: The Apache
      Server Project," IEEE Internet Computing, vol. 1, no. 4, pp.
      88-90, IEEE (July/August 1997), available electronically.

   9. J. Postel and J. K. Reynolds, "File Transfer Protocol," RFC-959
      (October 1985).