Internet Engineering Task Force                        Hari Balakrishnan
INTERNET DRAFT                                                   MIT LCS
Document: draft-ietf-ecm-cm-03.txt                     Srinivasan Seshan
                                                                     CMU
                                                          November, 2000
Expires: May 2001

                         The Congestion Manager

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC-2026 [Bradner96].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted
   by other documents at any time.  It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than
   as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document describes the Congestion Manager (CM), an end-system
   module that:

   (i) Enables an ensemble of multiple concurrent streams from a
   sender destined to the same receiver and sharing the same
   congestion properties to perform proper congestion avoidance and
   control, and

   (ii) Allows applications to easily adapt to network congestion.

   The framework described in this document integrates congestion
   management across all applications and transport protocols.  The
   CM maintains congestion parameters (available aggregate and
   per-stream bandwidth, per-receiver round-trip times, etc.) and
   exports an API that enables applications to learn about network
   characteristics, pass information to the CM, share congestion
   information with each other, and schedule data transmissions.
   This document focuses on applications and transport protocols with
   their own independent per-byte or per-packet sequence number
   information, and does not require modifications to the receiver
   protocol stack.  However, the receiving application must provide
   feedback to the sending application about received packets and
   losses, and the latter is expected to use the CM API to update CM
   state.  This document does not address networks with reservations
   or service differentiation.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in RFC-2119
   [Bradner97].

   STREAM
      A group of packets that all share the same source and
      destination IP address, IP type-of-service, transport protocol,
      and source and destination transport-layer port numbers.
   MACROFLOW
      A group of streams that all use the same congestion management
      and scheduling algorithms, and share congestion state
      information.  Currently, streams destined to different
      receivers belong to different macroflows.  Streams destined to
      the same receiver MAY belong to different macroflows.  Streams
      that experience identical congestion behavior in the Internet
      and use the same congestion control algorithm SHOULD belong to
      the same macroflow.

   APPLICATION
      Any software module that uses the CM.  This includes user-level
      applications such as Web servers or audio/video servers, as
      well as in-kernel protocols such as TCP [Postel81] that use the
      CM for congestion control.

   WELL-BEHAVED APPLICATION
      An application that only transmits when allowed by the CM and
      accurately accounts for all data that it has sent to the
      receiver by informing the CM using the CM API.

   PATH MAXIMUM TRANSMISSION UNIT (PMTU)
      The size of the largest packet that the sender can transmit
      without it being fragmented en route to the receiver.  It
      includes the sizes of all headers and data except the IP
      header.

   CONGESTION WINDOW (cwnd)
      A CM state variable that modulates the amount of outstanding
      data between sender and receiver.

   OUTSTANDING WINDOW (ownd)
      The number of bytes that have been transmitted by the source,
      but are not known to have been either received by the
      destination or lost in the network.

   INITIAL WINDOW (IW)
      The size of the sender's congestion window at the beginning of
      a macroflow.

   DATA TYPE SYNTAX
      We use "u64" for unsigned 64-bit, "u32" for unsigned 32-bit,
      "u16" for unsigned 16-bit, "u8" for unsigned 8-bit, "i64" for
      signed 64-bit, "i32" for signed 32-bit, and "i16" for signed
      16-bit quantities, and "float" for IEEE floating point values.
      The type "void" indicates that no return value is expected from
      a call.  Pointers are referred to using "*" syntax, following C
      language convention.

   We emphasize that all the API functions described in this
   document are "abstract" calls and that conformant CM
   implementations may differ in specific implementation details.
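   As an illustration, these abstract types might map onto C as
   follows.  This mapping is not part of the specification; the
   <stdint.h> names are simply one convenient, assumed binding.

      #include <stdint.h>

      typedef uint64_t u64;   /* unsigned 64-bit quantity */
      typedef uint32_t u32;   /* unsigned 32-bit quantity */
      typedef uint16_t u16;   /* unsigned 16-bit quantity */
      typedef uint8_t  u8;    /* unsigned 8-bit quantity  */
      typedef int64_t  i64;   /* signed 64-bit quantity   */
      typedef int32_t  i32;   /* signed 32-bit quantity   */
      typedef int16_t  i16;   /* signed 16-bit quantity   */
      /* "float" is the C single-precision IEEE type. */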
3. Introduction

   The CM is an end-system module that enables an ensemble of
   multiple concurrent streams to perform stable congestion avoidance
   and control, and allows applications to easily adapt their
   transmissions to prevailing network conditions.  It integrates
   congestion management across all applications and transport
   protocols.  It maintains congestion parameters (available
   aggregate and per-stream bandwidth, per-receiver round-trip times,
   etc.) and exports an API that enables applications to learn about
   network characteristics, pass information to the CM, share
   congestion information with each other, and schedule data
   transmissions.  All data transmissions MUST be done with the
   explicit consent of the CM via this API to ensure proper
   congestion behavior.

   This document focuses on applications and networks where the
   following conditions hold:

   1. Applications are well-behaved with their own independent
      per-byte or per-packet sequence number information, and use the
      CM API to update internal state in the CM.

   2. Networks are best-effort without service discrimination or
      reservations.  In particular, this document does not address
      situations where different streams between the same pair of
      hosts traverse paths with differing characteristics.

   The Congestion Manager framework can be extended to support
   applications that do not provide their own feedback and to
   differentially-served networks.  These extensions will be
   addressed in later documents.

   The CM is motivated by two main goals:

   (i) Enable efficient multiplexing.  Increasingly, the trend on the
   Internet is for unicast data senders (e.g., Web servers) to
   transmit heterogeneous types of data to receivers, ranging from
   unreliable real-time streaming content to reliable Web pages and
   applets.  As a result, many logically different streams share the
   same path between sender and receiver.  For the Internet to remain
   stable, each of these streams must incorporate control protocols
   that safely probe for spare bandwidth and react to congestion.
   Unfortunately, these concurrent streams typically compete with
   each other for network resources, rather than share them
   effectively.  Furthermore, they do not learn from each other about
   the state of the network.  Even if they each independently
   implement congestion control (e.g., a group of TCP connections
   each implementing the algorithms in [Jacobson88, Allman99]), the
   ensemble of streams tends to be more aggressive in the face of
   congestion than a single TCP connection implementing standard TCP
   congestion control and avoidance [Balakrishnan98].

   (ii) Enable application adaptation to congestion.  Increasingly
   popular real-time streaming applications run over UDP using their
   own user-level transport protocols for good application
   performance, but in most cases today do not adapt or react
   properly to network congestion.  By implementing a stable control
   algorithm and exposing an adaptation API, the CM enables easy
   application adaptation to congestion.  Applications adapt the data
   they transmit to the current network conditions.

   The CM framework builds on recent work on TCP control block
   sharing [Touch97], integrated TCP congestion control (TCP-Int)
   [Balakrishnan98], and TCP sessions [Padmanabhan98].  [Touch97]
   advocates the sharing of some of the state in the TCP control
   block to improve transient transport performance and describes
   sharing across an ensemble of TCP connections.  [Balakrishnan98],
   [Padmanabhan98], and [Eggert00] describe several experiments that
   quantify the benefits of sharing congestion state, including
   improved stability in the face of congestion and better loss
   recovery.  Integrating loss recovery across concurrent connections
   significantly improves performance because losses on one
   connection can be detected by noticing that later data sent on
   another connection has been received and acknowledged.  The CM
   framework extends these ideas in two significant ways: (i) it
   extends congestion management to non-TCP streams, which are
   becoming increasingly common and often do not implement proper
   congestion management, and (ii) it provides an API for
   applications to adapt their transmissions to current network
   conditions.  For an extended discussion of the motivation for the
   CM, its architecture, API, and algorithms, see [Balakrishnan99];
   for a description of an implementation and performance results,
   see [Andersen00].

   The resulting end-host protocol architecture at the sender is
   shown in Figure 1.
   The CM helps achieve network stability by implementing stable
   congestion avoidance and control algorithms that are
   "TCP-friendly" [Mahdavi98], based on algorithms described in
   [Allman99].  However, it does not attempt to enforce proper
   congestion behavior for all applications (though it does not
   preclude a policer on the host that performs this task).  Note
   that while the policer at the end-host can use the CM, the network
   has to be protected against compromises to the CM and the policer
   at the end hosts, a task that requires router machinery
   [Floyd99a].  We do not address this issue further in this
   document.

   |--------| |--------| |--------| |--------|      |--------------|
   |  HTTP  | |  FTP   | | RTP 1  | | RTP 2  |      |              |
   |--------| |--------| |--------| |--------|      |              |
       |          |         ^  |       ^  |         |              |
       |          |         |  |       |  |         |  Scheduler   |
       |          |         |  |       |  |  |---|  |              |
       |          |         |  |-------|--+->|   |->|              |
       |          |         |          |     |   |<-|              |
       v          v         v          v     |   |  |--------------|
   |--------| |--------| |-------------|     |   |         ^
   | TCP 1  | | TCP 2  | |    UDP 1    |     | A |         |
   |--------| |--------| |-------------|     |   |         |
      ^  |       ^  |          |             |   |  |--------------|
      |  |       |  |          |             | P |->|              |
      |  |       |  |          |             |   |  |              |
      |--|-------+--|----------|------------>|   |  |  Congestion  |
         |          |          |             | I |  |              |
         v          v          v             |   |  |  Controller  |
   |-----------------------------------|     |   |  |              |
   |                IP                 |---->|   |  |              |
   |-----------------------------------|     |   |  |--------------|
                                             |---|

                               Figure 1

   The key components of the CM framework are (i) the API, (ii) the
   congestion controller, and (iii) the scheduler.  The API is (in
   part) motivated by the requirements of application-level framing
   (ALF) [Clark90], and is described in Section 4.  The CM internals
   (Section 5) include a congestion controller (Section 5.1) and a
   scheduler to orchestrate data transmissions between concurrent
   streams in a macroflow (Section 5.2).  The congestion controller
   adjusts the aggregate transmission rate between sender and
   receiver based on its estimate of congestion in the network.  It
   obtains feedback about its past transmissions from applications
   themselves via the API.  The scheduler apportions available
   bandwidth amongst the different streams within each macroflow and
   notifies applications when they are permitted to send data.  This
   document focuses on well-behaved applications; a future one will
   describe the sender-receiver protocol and header formats that will
   handle applications that do not incorporate their own feedback to
   the CM.

4. CM API

   Using the CM API, streams can determine their share of the
   available bandwidth, request and have their data transmissions
   scheduled, inform the CM about successful transmissions, and be
   informed when the CM's estimate of path bandwidth changes.  Thus,
   the CM frees applications from having to maintain information
   about the state of congestion and available bandwidth along any
   path.

   The function prototypes below follow standard C language
   convention.  We emphasize that these API functions are abstract
   calls and conformant CM implementations may differ in specific
   details, as long as equivalent functionality is provided.

   When a new stream is created by an application, it passes some
   information to the CM via the cm_open(stream_info) API call.
   Currently, stream_info consists of the following information: (i)
   the source IP address, (ii) the source port, (iii) the destination
   IP address, (iv) the destination port, and (v) the IP protocol
   number.
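   As a concrete (but purely illustrative) C rendering of this call,
   a CM implementation might declare something like the following.
   The struct layout and field names are assumptions of this sketch,
   not part of the specification:

      /* Illustrative only: field names and layout are not
         mandated by this document. */
      struct cm_stream_info {
          u32 src_addr;     /* source IP address */
          u16 src_port;     /* source transport-layer port */
          u32 dst_addr;     /* destination IP address */
          u16 dst_port;     /* destination transport-layer port */
          u8  protocol;     /* IP protocol number */
      };

      struct cm_stream_info si;
      /* ... fill in the five fields for the new stream ... */
      i32 sid = cm_open(si);
      if (sid == -1) {
          /* open failed: this stream cannot use the CM */
      }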
4.1 State maintenance

   1. Open: All applications MUST call cm_open(stream_info) before
      using the CM API.  This returns a handle, cm_streamid, for the
      application to use for all further CM API invocations for that
      stream.  If the returned cm_streamid is -1, then the cm_open()
      failed and that stream cannot use the CM.

      All other calls to the CM for a stream use the cm_streamid
      returned from the cm_open() call.

   2. Close: When a stream terminates, the application SHOULD invoke
      cm_close(cm_streamid) to inform the CM about the termination of
      the stream.

   3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of
      the path between sender and receiver.  Internally, this
      information SHOULD be obtained via path MTU discovery
      [Mogul90].  It MAY be statically configured in the absence of
      such a mechanism.

4.2 Data transmission

   The CM accommodates two types of adaptive senders, enabling
   applications to dynamically adapt their content based on
   prevailing network conditions, and supporting ALF-based
   applications.

   1. Callback-based transmission.  The callback-based transmission
      API puts the stream in firm control of deciding what to
      transmit at each point in time.  To achieve this, the CM does
      not buffer any data; instead, it gives streams the opportunity
      to adapt to unexpected network changes at the last possible
      instant.  Thus, streams can "pull out" and repacketize data
      upon learning about any rate change, which is hard to do once
      the data has been buffered.  The CM MUST implement a
      cm_request(i32 cm_streamid) call for streams wishing to send
      data in this style.  After some time, depending on the rate,
      the CM MUST invoke a callback using cmapp_send(), which is a
      grant for the stream to send up to PMTU bytes.  The
      callback-style API is the recommended choice for ALF-based
      streams.  Note that cm_request() does not take the number of
      bytes or MTU-sized units as an argument; each call to
      cm_request() is an implicit request for sending up to PMTU
      bytes.  The CM MAY provide an alternate interface,
      cm_request(int k), whose cmapp_send() callback grants the right
      to send up to k PMTU-sized segments.  Section 4.3 discusses the
      time duration for which the transmission grant is valid, while
      Section 5.2 describes how these requests are scheduled and
      callbacks made.  (A sketch of this style appears after this
      list.)

   2. Synchronous-style.  The above callback-based API accommodates a
      class of ALF streams that are "asynchronous."  Asynchronous
      transmitters do not transmit based on a periodic clock, but do
      so triggered by asynchronous events like file reads or captured
      frames.  On the other hand, there are many streams that are
      "synchronous" transmitters, which transmit periodically based
      on their own internal timers (e.g., an audio sender that sends
      at a constant sampling rate).  While CM callbacks could be
      configured to periodically interrupt such transmitters, the
      transmit loop of such applications is less affected if they
      retain their original timer-based loop.  In addition, it
      complicates the CM API to have a stream express the periodicity
      and granularity of its callbacks.  Thus, the CM MUST export an
      API that allows such streams to be informed of changes in rates
      using the cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
      callback function, where newrate is the new rate in bits per
      second for this stream, srtt is the current smoothed round-trip
      time estimate in microseconds, and rttdev is the smoothed
      linear deviation in the round-trip time estimate, calculated
      using the same algorithm as in TCP [Paxson00].  The newrate
      value reports an instantaneous rate calculated, for example, by
      taking the ratio of cwnd and srtt, and dividing by the fraction
      of that ratio allocated to the stream.  In response, the stream
      MUST adapt its packet size or change its timer interval to
      conform to (i.e., not exceed) the allowed rate.  Of course, it
      may choose not to use all of this rate.  Note that the CM is
      not on the data path of the actual transmission.  (A sketch of
      a synchronous sender appears at the end of this section.)
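   To make the callback-based style concrete, the following sketch
   shows an asynchronous sender.  Only cm_open() (earlier),
   cm_request(), cm_mtu(), and the cmapp_send() callback come from
   the API; the queueing and transmit helpers are hypothetical, and
   the exact cmapp_send() signature (e.g., a grant-expiration
   argument, see Section 4.3) is implementation-dependent.

      static i32 sid;                 /* handle from cm_open() */

      void app_have_data(void)
      {
          /* Buffer the data internally, then ask the CM for a
             grant; each cm_request() implicitly requests permission
             to send up to PMTU bytes. */
          app_enqueue_data();         /* hypothetical */
          cm_request(sid);
      }

      /* Called by the CM when this stream may send up to PMTU
         bytes.  The stream decides only now what to packetize, so
         it can adapt to the latest rate information (ALF). */
      void cmapp_send(void)
      {
          u32 pmtu = cm_mtu(sid);
          app_packetize_and_send(pmtu);   /* hypothetical */
          if (app_more_data_pending())    /* hypothetical */
              cm_request(sid);
      }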
   To avoid unnecessary cmapp_update() callbacks that the application
   will only ignore, the CM MUST provide a cm_thresh(float
   rate_downthresh, float rate_upthresh, float rtt_downthresh, float
   rtt_upthresh) function that a stream can use at any stage in its
   execution.  In response, the CM SHOULD invoke the callback only
   when the rate decreases to less than (rate_downthresh * lastrate)
   or increases to more than (rate_upthresh * lastrate), where
   lastrate is the rate last notified to the stream, or when the
   round-trip time changes correspondingly by the requisite
   thresholds.  This information is used as a hint by the CM, in the
   sense that cmapp_update() can be called even if these conditions
   are not met.

   The CM MUST implement a cm_query(i32 cm_streamid, i64* rate, i32*
   srtt, i32* rttdev) call to allow an application to query the
   current CM state.  This sets the rate variable to the current rate
   estimate in bits per second, the srtt variable to the current
   smoothed round-trip time estimate in microseconds, and rttdev to
   the mean linear deviation.  If the CM does not have valid
   estimates for the macroflow, it fills in negative values for the
   rate, srtt, and rttdev.  (Signed types are used so that these
   negative "invalid" values can be represented.)

   Note that a stream can use more than one of the above transmission
   APIs at the same time.  In particular, the knowledge of
   sustainable rate is useful for asynchronous streams as well as
   synchronous ones; e.g., an asynchronous Web server disseminating
   images using TCP may use cmapp_send() to schedule its
   transmissions and cmapp_update() to decide whether to send a
   low-resolution or high-resolution image.  A TCP implementation
   using the CM is described in Section 6.1.1, where the benefit of
   the cm_request() callback API for TCP will become apparent.

   The reader will notice that the basic CM API does not provide an
   interface for buffered congestion-controlled transmissions.  This
   is intentional, since this transmission mode can be implemented
   using the callback-based primitive.  Section 6.1.2 describes how
   congestion-controlled UDP sockets may be implemented using the CM
   API.
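   A minimal sketch of a synchronous (timer-driven) sender using
   these calls is shown below.  The encoding-selection helper and the
   threshold values are assumptions of the example, not part of the
   API.

      static i32 sid;               /* handle from cm_open() */
      static u64 cur_rate;          /* bits/s last notified to us */

      void app_start(void)
      {
          i64 rate; i32 srtt, rttdev;

          /* Only wake us up for roughly factor-of-two changes. */
          cm_thresh(0.5, 2.0, 0.5, 2.0);

          cm_query(sid, &rate, &srtt, &rttdev);
          if (rate < 0) {
              /* no valid estimate yet: use application defaults */
          } else {
              cur_rate = (u64)rate;
              app_pick_encoding(cur_rate);   /* hypothetical */
          }
          /* ... start the application's own periodic send timer ... */
      }

      /* CM callback: conform to the new rate by adapting packet
         size, timer interval, or content encoding. */
      void cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
      {
          cur_rate = newrate;
          app_pick_encoding(cur_rate);       /* hypothetical */
      }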
4.3 Application notification

   When a stream receives feedback from receivers, it MUST use
   cm_update(i32 cm_streamid, u32 nrecd, u32 nlost, u8 lossmode, i32
   rtt) to inform the CM about events such as congestion losses,
   successful receptions, type of loss (timeout event, Explicit
   Congestion Notification [Ramakrishnan98], etc.), and round-trip
   time samples.  The nrecd parameter indicates how many bytes were
   successfully received by the receiver since the last cm_update()
   call, while the nlost parameter indicates how many bytes were lost
   during the same time period.  The rtt value indicates the
   round-trip time measured during the transmission of these bytes.
   The rtt value MUST be set to -1 if no valid round-trip sample was
   obtained by the application.  The lossmode parameter provides an
   indicator of how a loss was detected.  A value of CM_NO_FEEDBACK
   indicates that the application has received no feedback for all
   its outstanding data, and is reporting this to the CM.  For
   example, a TCP that has experienced a timeout would use this
   parameter to inform the CM of this.  A value of CM_LOSS_FEEDBACK
   indicates that the application has experienced some loss, which it
   believes to be due to congestion, but not all outstanding data has
   been lost.  For example, a TCP segment loss detected using
   duplicate (selective) acknowledgements or other data-driven
   techniques fits this category.  A value of CM_EXPLICIT_CONGESTION
   indicates that the receiver echoed an explicit congestion
   notification message.  Finally, a value of CM_NO_CONGESTION
   indicates that no congestion-related loss has occurred.  The
   lossmode parameter MUST be reported as a bit-vector where the bits
   correspond to CM_NO_FEEDBACK, CM_LOSS_FEEDBACK,
   CM_EXPLICIT_CONGESTION, and CM_NO_CONGESTION.  Note that over
   links (paths) that experience losses for reasons other than
   congestion, an application SHOULD inform the CM of losses, with
   the CM_NO_CONGESTION bit set.

   cm_notify(stream_info, u32 nsent) MUST be called when data is
   transmitted from the host (e.g., in the IP output routine) to
   inform the CM that nsent bytes were just transmitted on a given
   stream.  It takes a stream_info rather than a cm_streamid because
   the layer that transmits the packet (e.g., IP) may not know the
   latter (see Section 6.1.1).  This call allows the CM to update its
   estimate of the number of outstanding bytes for the macroflow and
   for the stream.

   A cmapp_send() grant from the CM to an application is valid only
   for an expiration time, equal to the larger of the round-trip time
   and an implementation-dependent threshold communicated as an
   argument to the cmapp_send() callback function.  The application
   MUST NOT send data based on this callback after this time has
   expired.  Furthermore, if the application decides not to send data
   after receiving this callback, it SHOULD call
   cm_notify(stream_info, 0) to allow the CM to permit other streams
   in the macroflow to transmit data.  The CM congestion controller
   MUST be robust to applications failing to invoke
   cm_notify(stream_info, 0) correctly, or to applications that crash
   or disappear after having made a cm_request() call.

4.4 Querying

   If applications wish to learn about per-stream available bandwidth
   and round-trip time, they can use the CM's cm_query(i32
   cm_streamid, i64* rate, i32* srtt, i32* rttdev) call, which fills
   in the desired quantities.  If the CM does not have valid
   estimates for the macroflow, it fills in negative values for the
   rate, srtt, and rttdev.
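   The following sketch shows one way a sender might translate
   application-level receiver reports into cm_update() calls.  The
   report structure is hypothetical; only cm_update() and the
   lossmode constants come from the API.

      /* Hypothetical receiver report, sent by the receiving
         application back to the sender. */
      struct app_report {
          u32 bytes_ok;     /* bytes received since last report */
          u32 bytes_lost;   /* bytes inferred lost in that period */
          i32 rtt_us;       /* RTT sample in microseconds, or -1 */
          u8  saw_ecn;      /* receiver echoed explicit congestion */
      };

      void app_process_report(i32 sid, const struct app_report *r)
      {
          u8 mode;

          if (r->saw_ecn)
              mode = CM_EXPLICIT_CONGESTION;
          else if (r->bytes_lost > 0)
              mode = CM_LOSS_FEEDBACK;
          else
              mode = CM_NO_CONGESTION;

          cm_update(sid, r->bytes_ok, r->bytes_lost, mode,
                    r->rtt_us);
      }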
4.5 Sharing granularity

   One of the decisions the CM needs to make is the granularity at
   which a macroflow is constructed, by deciding which streams belong
   to the same macroflow and share congestion information.  The API
   provides two functions that allow applications to decide which of
   their streams ought to belong to the same macroflow.

   cm_getmacroflow(i32 cm_streamid) returns a unique i32 macroflow
   identifier.  cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
   sets the macroflow of the stream cm_streamid to cm_macroflowid.
   If the cm_macroflowid that is passed to cm_setmacroflow() is -1,
   then a new macroflow is constructed and its identifier is returned
   to the caller.  Each call to cm_setmacroflow() overrides the
   previous macroflow association for the stream, should one exist.

   The default suggested aggregation method is to aggregate by
   destination IP address; i.e., all streams to the same destination
   address are aggregated to a single macroflow by default.  The
   cm_getmacroflow() and cm_setmacroflow() calls can then be used to
   change this as needed.  We do note that there are some cases where
   this default may not be optimal, even over best-effort networks.
   For example, when a group of receivers is behind a NAT device, the
   sender will see them all as one address.  If the hosts behind the
   NAT are in fact connected over different bottleneck links, some of
   those hosts could see worse performance than before.  It is
   possible to detect such hosts using delay and loss estimates,
   although the specific mechanisms for doing so are beyond the scope
   of this document.

   The objective of this interface is to set up the sharing of
   groups, not the sharing policy (e.g., the relative weights of
   streams in a macroflow).  The latter requires the scheduler to
   provide an interface to set sharing policy.  However, because we
   want to support many different schedulers (each of which may need
   different information to set policy), we do not specify a complete
   API to the scheduler (but see Section 5.2).  A later guideline
   document is expected to describe a few simple schedulers (e.g.,
   weighted round-robin, hierarchical scheduling) and the API they
   export to provide relative prioritization.
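   For example, a sender that knows two of its streams to the same
   receiver share a bottleneck might group them explicitly; the
   stream identifiers are assumed to come from earlier cm_open()
   calls:

      /* Create a fresh macroflow for stream sid_a, then place
         sid_b in the same macroflow so the two streams share
         congestion state. */
      i32 mf = cm_setmacroflow(-1, sid_a);  /* -1 => new macroflow */
      cm_setmacroflow(mf, sid_b);

      /* cm_getmacroflow(sid_b) would now return mf. */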
5. CM internals

   This section describes the internal components of the CM.  It
   includes a congestion controller and a scheduler, with
   well-defined, abstract interfaces exported by them.

5.1 Congestion controller

   Associated with each macroflow is a congestion control algorithm;
   the collection of all these algorithms comprises the congestion
   controller of the CM.  The control algorithm decides when and how
   much data can be transmitted by a macroflow.  It uses application
   notifications (Section 4.3) from concurrent streams on the same
   macroflow to build up information about the congestion state of
   the network path used by the macroflow.

   The congestion controller MUST implement a "TCP-friendly"
   [Mahdavi98] congestion control algorithm.  Several macroflows MAY
   (and indeed, often will) use the same congestion control
   algorithm, but each macroflow maintains state about the network
   used by its streams.

   The congestion control module MUST implement the following
   abstract interfaces.  We emphasize that these are not directly
   visible to applications; they are within the context of a
   macroflow, and are different from the CM API functions of
   Section 4.

   - void query(u64 *rate, u32 *srtt, u32 *rttdev): This function
     returns the estimated rate (in bits per second) and smoothed
     round-trip time (in microseconds) for the macroflow.

   - void notify(u32 nsent): This function MUST be used to notify
     the congestion control module whenever data is sent by an
     application.  The nsent parameter indicates the number of bytes
     just sent by the application.

   - void update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode): This
     function is called whenever any of the CM streams associated
     with a macroflow identifies that data has reached the receiver
     or has been lost en route.  The nrecd parameter indicates the
     number of bytes that have just arrived at the receiver.  The
     nsent parameter is the sum of the number of bytes just received
     and the number of bytes identified as lost en route.  The rtt
     parameter is the estimated round-trip time in microseconds
     during the transfer.  The lossmode parameter provides an
     indicator of how a loss was detected (Section 4.3).

   Although these interfaces are not visible to applications, the
   congestion controller MUST implement these abstract interfaces to
   provide for modular interoperability with different
   separately-developed schedulers.

   The congestion control module MUST also call the associated
   scheduler's schedule() function (Section 5.2) when it believes
   that the current congestion state allows an MTU-sized packet to
   be sent.

5.2 Scheduler

   While it is the responsibility of the congestion control module to
   determine when and how much data can be transmitted, it is the
   responsibility of a macroflow's scheduler module to determine
   which of the streams should get the opportunity to transmit data.

   The scheduler MUST implement the following interfaces:

   - void schedule(u32 num_bytes): When the congestion control
     module determines that data can be sent, the schedule() routine
     MUST be called with no more than the number of bytes that can
     be sent.  In turn, the scheduler MAY call the cmapp_send()
     function that CM applications must provide.

   - float query_share(i32 cm_streamid): This call returns the
     specified stream's share of the total bandwidth available to
     the macroflow.  This call, combined with the query() call of
     the congestion controller, provides the information needed to
     satisfy an application's cm_query() request.

   - void notify(i32 cm_streamid, u32 nsent): This interface is used
     to notify the scheduler module whenever data is sent by a CM
     application.  The nsent parameter indicates the number of bytes
     just sent by the application.

   The scheduler MAY implement many additional interfaces.  As
   experience with CM schedulers increases, future documents may
   make additions and/or changes to some parts of the scheduler API.
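   One way to realize this modular separation is for each macroflow
   to hold function-pointer tables for its two modules.  This
   internal plumbing is purely illustrative; the struct names are
   assumptions of the sketch.

      /* Abstract interfaces of Sections 5.1 and 5.2 as C tables. */
      struct cm_cc_ops {                  /* congestion controller */
          void (*query)(u64 *rate, u32 *srtt, u32 *rttdev);
          void (*notify)(u32 nsent);
          void (*update)(u32 nsent, u32 nrecd, u32 rtt,
                         u32 lossmode);
      };

      struct cm_sched_ops {               /* scheduler */
          void  (*schedule)(u32 num_bytes);
          float (*query_share)(i32 cm_streamid);
          void  (*notify)(i32 cm_streamid, u32 nsent);
      };

      struct cm_macroflow {
          const struct cm_cc_ops    *cc;
          const struct cm_sched_ops *sched;
          /* ... per-macroflow congestion and scheduling state ... */
      };

   A macroflow can thus mix and match separately-developed
   controllers and schedulers, as long as both sides honor these
   interfaces.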
6. Examples

6.1 Example applications

   The following describes the possible use of the CM API by an
   asynchronous application (an implementation of a TCP sender) and a
   synchronous application (an audio server).  More details of these
   applications and CM implementation optimizations for efficient
   operation are described in [Andersen00].  We emphasize that the
   protocols in this section are examples and suggestions for
   implementation, rather than requirements of any conformant
   implementation.

6.1.1 TCP

   A TCP implementation using the CM MUST use the cmapp_send()
   callback API.  TCP only identifies which data it should send upon
   the arrival of an acknowledgement or expiration of a timer.  As a
   result, it requires tight control over when and if new data or
   retransmissions are sent.

   When TCP either connects to or accepts a connection from another
   host, it performs a cm_open() call to associate the TCP connection
   with a cm_streamid.

   Once a connection is established, the CM is used to control the
   transmission of outgoing data.  The CM eliminates the need for
   tracking and reacting to congestion in TCP, because the CM and its
   transmission API ensure proper congestion behavior.  Loss recovery
   is still performed by TCP based on fast retransmissions and
   recovery as well as timeouts.  In addition, TCP is modified to
   maintain its own outstanding window (tcp_ownd) estimate.  Whenever
   data segments are sent from its cmapp_send() callback, TCP updates
   its tcp_ownd value.  The tcp_ownd variable is also updated after
   each cm_update() call.  TCP also maintains a count of the number
   of outstanding segments (pkt_cnt).  At any time, TCP can calculate
   the average packet size (avg_pkt_size) as tcp_ownd/pkt_cnt.  The
   avg_pkt_size is used by TCP to help estimate the amount of
   outstanding data.  Note that this is not needed if the SACK option
   is used on the connection, since this information is explicitly
   available.

   The TCP output routines are modified as follows:

   1. All congestion window (cwnd) checks are removed.

   2. When application data is available, the TCP output routines
      perform all non-congestion checks (Nagle algorithm,
      receiver-advertised window check, etc.).  If these checks pass,
      the output routine queues the data and calls cm_request() for
      the stream.

   3. If incoming data or timers result in a loss being detected,
      the retransmission is also placed in a queue and cm_request()
      is called for the stream.

   4. The cmapp_send() callback for TCP is set to an output routine.
      If any retransmission is enqueued, the routine outputs the
      retransmission.  Otherwise, the routine outputs as much new
      data as the TCP connection state allows.  However, cmapp_send()
      never sends more than a single segment per call.  This routine
      also arranges for the other output computations to be done,
      such as header and options computations.

   The IP output routine on the host calls cm_notify() when the
   packets are actually sent out.  Because it does not know which
   cm_streamid is responsible for the packet, cm_notify() takes the
   stream_info as argument (see Section 4 for what the stream_info
   should contain).  Because cm_notify() reports the IP payload size,
   TCP keeps track of the total header size and incorporates these
   updates.

   The TCP input routines are modified as follows:

   1. RTT estimation is done as normal using either timestamps or
      Karn's algorithm.  Any rtt estimate that is generated is passed
      to the CM via the cm_update() call.

   2. All cwnd and slow start threshold (ssthresh) updates are
      removed.

   3. Upon the arrival of an ack for new data, TCP computes the
      value of in_flight (the amount of data in flight) as
      snd_max - ack - 1 (i.e., the maximum sequence number sent minus
      the current ack, minus 1).  TCP then calls cm_update(streamid,
      tcp_ownd - in_flight, 0, CM_NO_CONGESTION, rtt).

   4. Upon the arrival of a duplicate acknowledgement, TCP must
      check its dupack count (dup_acks) to determine its action.  If
      dup_acks < 3, TCP does nothing.  If dup_acks == 3, TCP assumes
      that a packet was lost and that at least 3 packets arrived to
      generate these duplicate acks.  Therefore, it calls
      cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size,
      CM_LOSS_FEEDBACK, rtt).  The average packet size is used since
      the acknowledgements do not indicate exactly how much data has
      reached the other end.  Most TCP implementations interpret a
      duplicate ACK as an indication that a full MSS has reached its
      destination.  Once a new ACK is received, these TCP sender
      implementations may resynchronize with the TCP receiver.  The
      CM API does not provide a mechanism for TCP to pass information
      from this resynchronization.  Therefore, TCP can only infer the
      arrival of an avg_pkt_size amount of data from each duplicate
      ack.  TCP also enqueues a retransmission of the lost segment
      and calls cm_request().  If dup_acks > 3, TCP assumes that a
      packet has reached the other end and caused this ack to be
      sent.  As a result, it calls cm_update(streamid, avg_pkt_size,
      avg_pkt_size, CM_NO_CONGESTION, rtt).  (A sketch of this step
      appears after this list.)

   5. Upon the arrival of a partial acknowledgment (one that does
      not exceed the highest segment transmitted at the time the loss
      occurred, as defined in [Floyd99b]), TCP assumes that a packet
      was lost and that the retransmitted packet has reached the
      recipient.  Therefore, it calls cm_update(streamid, 2 *
      avg_pkt_size, avg_pkt_size, CM_NO_CONGESTION, rtt).
      CM_NO_CONGESTION is used since the loss period has already been
      reported.  TCP also enqueues a retransmission of the lost
      segment and calls cm_request().

   When the TCP retransmission timer expires, the sender identifies
   that a segment has been lost and calls cm_update(streamid,
   avg_pkt_size, 0, CM_NO_FEEDBACK, 0) to signify that no feedback
   has been received from the receiver and that one segment is sure
   to have "left the pipe."  TCP also enqueues a retransmission of
   the lost segment and calls cm_request().
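   The duplicate-acknowledgement rules of step 4 can be written as
   the following sketch.  The tcp_enqueue_rexmit() helper stands in
   for the implementation's retransmission queueing; the cm_update()
   arguments are exactly those prescribed above.

      /* Step 4: duplicate ACK processing in a CM-based TCP. */
      void tcp_dupack(i32 sid, u32 dup_acks, u32 avg_pkt_size,
                      i32 rtt)
      {
          if (dup_acks < 3)
              return;                 /* not yet a loss indication */

          if (dup_acks == 3) {
              /* Assume one packet lost and at least 3 arrived. */
              cm_update(sid, 4 * avg_pkt_size, 3 * avg_pkt_size,
                        CM_LOSS_FEEDBACK, rtt);
              tcp_enqueue_rexmit();   /* hypothetical */
              cm_request(sid);
          } else {
              /* Each further dup ack: one more packet arrived. */
              cm_update(sid, avg_pkt_size, avg_pkt_size,
                        CM_NO_CONGESTION, rtt);
          }
      }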
6.1.2 Congestion-controlled UDP

   Congestion-controlled UDP sockets are a useful CM application,
   which we describe in the context of Berkeley sockets [Stevens94].
   They provide the same functionality as standard Berkeley UDP
   sockets, but instead of immediately sending the data from the
   kernel packet queue to lower layers for transmission, the buffered
   socket implementation makes calls to the API exported by the CM
   inside the kernel and gets callbacks from the CM.  When a CM UDP
   socket is created, it is bound to a particular stream.  Later,
   when data is added to the packet queue, cm_request() is called on
   the stream associated with the socket.  When the CM schedules this
   stream for transmission, it calls udp_ccappsend() in the UDP
   module.  This function transmits one MTU's worth of data from the
   packet queue and schedules the transmission of any remaining
   packets.  The in-kernel implementation of the CM UDP API SHOULD
   NOT require any additional data copies and SHOULD support all
   standard UDP options.  Modifying existing applications to use
   congestion-controlled UDP requires the implementation of a new
   socket option on the socket.  To work correctly, the sender MUST
   obtain feedback about congestion.  This can be done in at least
   two ways: (i) the UDP receiver application can provide feedback to
   the sender application, which will inform the CM of network
   conditions using cm_update(); or (ii) the UDP receiver
   implementation can provide feedback to the sending UDP.  Note that
   this latter alternative requires changes to the receiver's network
   stack, and the sender UDP cannot assume that all receivers support
   this option without explicit negotiation.
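   A minimal sketch of the kernel-side flow is shown below.  The
   socket structure, packet queue helpers, and ip_output() glue are
   hypothetical; only cm_request() and the udp_ccappsend() name come
   from the description above.

      /* Hypothetical kernel state for a CM UDP socket. */
      struct cm_udp_sock {
          i32 sid;                /* stream bound at socket creation */
          struct pkt_queue q;     /* buffered, not yet granted */
      };

      /* Called when the application writes to the socket. */
      void cm_udp_output(struct cm_udp_sock *so, struct pkt *p)
      {
          pkt_enqueue(&so->q, p); /* buffer instead of sending */
          cm_request(so->sid);    /* ask the CM for a grant */
      }

      /* The socket's cmapp_send(): transmit one MTU's worth and,
         if more data remains, request another grant. */
      void udp_ccappsend(struct cm_udp_sock *so)
      {
          struct pkt *p = pkt_dequeue(&so->q);
          if (p != NULL)
              ip_output(p);       /* IP output calls cm_notify() */
          if (!pkt_queue_empty(&so->q))
              cm_request(so->sid);
      }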
6.1.3 Audio server

   A typical audio application often has access to the audio content
   in a multitude of data rates and qualities.  The objective of the
   application is then to deliver the highest possible quality of
   audio (typically the highest data rate) to its clients.  The
   selection of which version of audio to transmit should be based on
   the current congestion state of the network.  In addition, the
   source will want audio delivered to its users at a consistent
   sampling rate.  As a result, it must send data at a regular rate,
   minimizing transmission delays and reducing buffering before
   playback.  To meet these requirements, this application can use
   the synchronous sender API (Section 4.2).

   When the source first starts, it uses the cm_query() call to get
   an initial estimate of network bandwidth and delay.  If some other
   streams on that macroflow have already been active, then it gets a
   valid initial estimate; otherwise, it gets negative values, which
   it ignores.  It then chooses an encoding that does not exceed
   these estimates (or, in the case of an invalid estimate, uses
   application-specific initial values) and begins transmitting data.
   The application also implements the cmapp_update() callback.  When
   the CM determines that network characteristics have changed, it
   calls the application's cmapp_update() function and passes it a
   new rate and round-trip time estimate.  The application MUST
   change its choice of audio encoding to ensure that it does not
   exceed these new estimates.

   To use the CM, the application MUST incorporate feedback from the
   receiver.  In this example, it must periodically (typically once
   or twice per round-trip time) determine how many of its packets
   arrived at the receiver.  When the source gets this feedback, it
   MUST use cm_update() to inform the CM of this new information.
   This results in the CM updating ownd, and may result in the CM
   changing its estimates and calling cmapp_update() for the streams
   of the macroflow.

6.2 Example congestion control module

   To illustrate the responsibilities of a congestion control module,
   the following describes some of the actions of a simple TCP-like
   congestion control module that implements Additive Increase
   Multiplicative Decrease congestion control (AIMD_CC):

   - query(): AIMD_CC returns the current congestion window (cwnd)
     divided by the smoothed rtt (srtt) as its bandwidth estimate.
     It returns the smoothed rtt estimate as srtt.

   - notify(): AIMD_CC adds the number of bytes sent to its
     outstanding data window (ownd).

   - update(): AIMD_CC subtracts nsent from ownd.  If the value of
     rtt is non-zero, AIMD_CC updates srtt using the TCP srtt
     calculation.  If the update indicates that data has been lost,
     AIMD_CC sets cwnd to 1 MTU if the loss_mode is CM_NO_FEEDBACK,
     and to cwnd/2 (with a minimum of 1 MTU) if the loss_mode is
     CM_LOSS_FEEDBACK or CM_EXPLICIT_CONGESTION.  AIMD_CC also sets
     its internal ssthresh variable to cwnd/2.  If no loss has
     occurred, AIMD_CC mimics TCP slow start and linear growth
     modes.  It increments cwnd by nsent when cwnd < ssthresh
     (bounded by a maximum of ssthresh - cwnd) and by
     nsent * MTU/cwnd when cwnd >= ssthresh.  (See the sketch after
     this list.)

   - When cwnd or ownd are updated and indicate that at least one
     MTU may be transmitted, AIMD_CC calls the CM to schedule a
     transmission.
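   A sketch of AIMD_CC's update() under the rules above follows.
   The variable declarations and the update_srtt() helper are
   assumptions of the example, and MTU stands for the macroflow's
   PMTU.

      static u32 cwnd, ownd, ssthresh;   /* module state, in bytes */

      void aimd_update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode)
      {
          ownd -= nsent;
          if (rtt != 0)
              update_srtt(rtt);       /* TCP-style smoothing */

          if (lossmode & CM_NO_FEEDBACK) {
              ssthresh = cwnd / 2;
              cwnd = MTU;             /* as after a TCP timeout */
          } else if (lossmode & (CM_LOSS_FEEDBACK |
                                 CM_EXPLICIT_CONGESTION)) {
              ssthresh = cwnd / 2;
              cwnd = cwnd / 2;        /* multiplicative decrease */
              if (cwnd < MTU)
                  cwnd = MTU;
          } else {                    /* CM_NO_CONGESTION */
              if (cwnd < ssthresh) {  /* slow start */
                  u32 inc = nsent;
                  if (inc > ssthresh - cwnd)
                      inc = ssthresh - cwnd;
                  cwnd += inc;
              } else {                /* linear growth */
                  cwnd += nsent * MTU / cwnd;
              }
          }

          if (cwnd >= ownd + MTU)
              /* ask the scheduler to grant up to cwnd - ownd */
              schedule(cwnd - ownd);
      }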
6.3 Example scheduler module

   To clarify the responsibilities of a scheduler module, the
   following describes some of the actions of a simple round-robin
   scheduler module (RR_sched):

   - schedule(): RR_sched schedules as many streams as possible in
     round-robin fashion (see the sketch after this list).

   - query_share(): RR_sched returns 1/(number of streams in the
     macroflow).

   - notify(): RR_sched does nothing.  Round-robin scheduling is not
     affected by the amount of data sent.
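   A possible schedule() for RR_sched is sketched below; the stream
   list traversal and the helper that invokes a stream's
   cmapp_send() callback are hypothetical internals.

      /* Walk the macroflow's streams in round-robin order,
         granting one PMTU-sized transmission at a time. */
      void rr_schedule(u32 num_bytes)
      {
          while (num_bytes >= PMTU) {
              struct stream *s = next_pending_stream();  /* hypothetical */
              if (s == NULL)
                  break;             /* no stream has a request */
              invoke_cmapp_send(s);  /* grant: stream's callback */
              num_bytes -= PMTU;
          }
      }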
7. Security considerations

   The CM provides many of the same services that the congestion
   control in TCP provides.  As such, it is vulnerable to many of the
   same security problems.  For example, incorrect reports of losses
   and transmissions will give the CM an inaccurate picture of the
   network's congestion state.  By giving the CM a high estimate of
   congestion, an attacker can degrade the performance observed by
   applications.  The more dangerous form of attack is giving the CM
   a low estimate of congestion.  This would cause the CM to be
   overly aggressive and allow data to be sent much more quickly than
   sound congestion control policies would allow.  [Touch97]
   describes the security problems that arise with congestion
   information sharing in more detail.

8. References

   [Allman99] Allman, M. and Paxson, V., "TCP Congestion Control,"
      RFC-2581, April 1999.

   [Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S.,
      and Balakrishnan, H., "System Support for Bandwidth Management
      and Content Adaptation in Internet Applications," Proc. 4th
      Symp. on Operating Systems Design and Implementation, San
      Diego, CA, October 2000.  Available from
      http://nms.lcs.mit.edu/papers/cm-osdi2000.html

   [Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S.,
      Stemm, M., and Katz, R., "TCP Behavior of a Busy Web Server:
      Analysis and Improvements," Proc. IEEE INFOCOM, San Francisco,
      CA, March 1998.

   [Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An
      Integrated Congestion Management Architecture for Internet
      Hosts," Proc. ACM SIGCOMM, Cambridge, MA, September 1999.

   [Bradner96] Bradner, S., "The Internet Standards Process ---
      Revision 3," BCP 9, RFC-2026, October 1996.

   [Bradner97] Bradner, S., "Key words for use in RFCs to Indicate
      Requirement Levels," BCP 14, RFC-2119, March 1997.

   [Clark90] Clark, D. and Tennenhouse, D., "Architectural
      Considerations for a New Generation of Protocols," Proc. ACM
      SIGCOMM, Philadelphia, PA, September 1990.

   [Eggert00] Eggert, L., Heidemann, J., and Touch, J., "Effects of
      Ensemble TCP," ACM Computer Comm. Review, January 2000.

   [Floyd99a] Floyd, S. and Fall, K., "Promoting the Use of
      End-to-End Congestion Control in the Internet," IEEE/ACM
      Trans. on Networking, 7(4), August 1999, pp. 458-472.

   [Floyd99b] Floyd, S. and Henderson, T., "The NewReno Modification
      to TCP's Fast Recovery Algorithm," RFC-2582, April 1999.
      (Experimental.)

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control,"
      Proc. ACM SIGCOMM, Stanford, CA, August 1988.

   [Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly
      Website," http://www.psc.edu/networking/tcp_friendly.html

   [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery,"
      RFC-1191, November 1990.

   [Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of
      Web Data Transport," PhD thesis, Univ. of California,
      Berkeley, December 1998.

   [Paxson00] Paxson, V. and Allman, M., "Computing TCP's
      Retransmission Timer," Internet Draft
      draft-paxson-tcp-rto-01.txt, April 2000.  (Expires October
      2000.)

   [Postel81] Postel, J. (ed.), "Transmission Control Protocol,"
      RFC-793, September 1981.

   [Ramakrishnan98] Ramakrishnan, K. and Floyd, S., "A Proposal to
      Add Explicit Congestion Notification (ECN) to IP," RFC-2481,
      January 1999.  (Experimental.)

   [Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1.
      Addison-Wesley, Reading, MA, 1994.

   [Touch97] Touch, J., "TCP Control Block Interdependence,"
      RFC-2140, April 1997.  (Informational.)

9. Acknowledgments

   We thank David Andersen, Deepak Bansal, and Dorothy Curtis for
   their work on the CM design and implementation.  We thank Vern
   Paxson for his detailed comments and patience, and Sally Floyd,
   Mark Handley, and Steven McCanne for useful feedback on the CM
   architecture.

10. Authors' addresses

   Hari Balakrishnan
   Laboratory for Computer Science
   200 Technology Square
   Massachusetts Institute of Technology
   Cambridge, MA 02139
   Email: hari@lcs.mit.edu
   Web: http://nms.lcs.mit.edu/~hari/

   Srinivasan Seshan
   School of Computer Science
   Carnegie Mellon University
   5000 Forbes Ave.
   Pittsburgh, PA 15213
   Email: srini@cmu.edu
   Web: http://www.cs.cmu.edu/~srini/

Full Copyright Statement

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared,
   copied, published and distributed, in whole or in part, without
   restriction of any kind, provided that the above copyright notice
   and this paragraph are included on all such copies and derivative
   works.  However, this document itself may not be modified in any
   way, such as by removing the copyright notice or references to
   the Internet Society or other Internet organizations, except as
   needed for the purpose of developing Internet standards in which
   case the procedures for copyrights defined in the Internet
   Standards process must be followed, or as required to translate
   it into languages other than English.