1 Internet Engineering Task Force Hari Balakrishnan 2 INTERNET DRAFT MIT LCS 3 Document: draft-ietf-ecm-cm-01.txt Srinivasan Seshan 4 CMU 5 July, 2000 6 Expires: January 2001 8 The Congestion Manager 10 Status of this Memo 12 This document is an Internet-Draft and is in full conformance with 13 all provisions of Section 10 of RFC-2026 [Bradner96]. 15 Internet-Drafts are working documents of the Internet Engineering 16 Task Force (IETF), its areas, and its working groups. Note that 17 other groups may also distribute working documents as Internet- 18 Drafts. Internet-Drafts are draft documents valid for a maximum of 19 six months and may be updated, replaced, or obsoleted by other 20 documents at any time. It is inappropriate to use Internet-Drafts 21 as reference material or to cite them other than as "work in 22 progress." 23 The list of current Internet-Drafts can be accessed at 24 http://www.ietf.org/ietf/1id-abstracts.txt 25 The list of Internet-Draft Shadow Directories can be accessed at 26 http://www.ietf.org/shadow.html. 28 1. Abstract 30 This document describes the Congestion Manager (CM), an end-system 31 module that (i) enables an ensemble of multiple concurrent streams 32 from a sender destined to the same receiver and sharing the same 33 congestion properties to perform proper congestion avoidance and 34 control, and (ii) allows applications to easily adapt to network 35 congestion. This CM framework integrates congestion management 36 across all applications and transport protocols. The CM maintains 37 congestion parameters (available aggregate and per-stream bandwidth, 38 per-receiver round-trip times, etc.) and exports an API that 39 enables applications to learn about network characteristics, pass 40 information to the CM, share congestion information with each 41 other, and schedule data transmissions. This document focuses on 42 applications and transport protocols with their own independent 43 per-byte or per-packet sequence number information, and does not 44 require modifications to the receiver protocol stack. The 45 receiving application must provide feedback to the sending 46 application about received packets and losses, and the latter uses 47 the CM API to update CM state. This document does not address 48 networks with reservations or service discrimination. 50 2. Conventions used in this document: 51 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 52 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 53 this document are to be interpreted as described in RFC-2119 54 [Bradner97]. 56 STREAM 57 A group of packets that all share the same source and 58 destination IP address, IP type-of-service, transport 59 protocol, and source and destination transport port numbers. 61 FLOW 62 Identical to STREAM.
64 MACROFLOW 65 A group of streams that all use the same congestion management 66 and scheduling algorithms, and share congestion state 67 information. Currently, streams destined to different 68 receivers belong to different macroflows. Streams destined to 69 the same receiver MAY belong to different macroflows. Streams 70 that experience identical congestion behavior in the Internet 71 and use the same congestion control algorithm SHOULD belong to 72 the same macroflow. 74 APPLICATION 75 Any software module that uses the CM. This includes 76 user-level applications such as Web servers or audio/video 77 servers, as well as in-kernel protocols such as TCP [Postel81] 78 that use the CM for congestion control. 80 WELL-BEHAVED APPLICATION 81 An application that only transmits when allowed by the CM and 82 accurately accounts for all data that it has sent to the 83 receiver by informing the CM using the CM API. 85 PATH MAXIMUM TRANSMISSION UNIT (PMTU) 86 The size of the largest packet that the sender can transmit 87 without it being fragmented en route to the receiver. It 88 includes the sizes of all headers and data except the IP 89 header. 91 CONGESTION WINDOW (cwnd) 92 A CM state variable that modulates the amount of outstanding 93 data between sender and receiver. 95 OUTSTANDING WINDOW (ownd) 96 The number of bytes that have been transmitted by the source, 97 but not known to have been either received by the destination 98 or lost in the network. 100 INITIAL WINDOW (IW) 101 The size of the sender's congestion window at the beginning of 102 a macroflow. 104 DATA TYPE SYNTAX 105 We use "u64" for unsigned 64-bit, "u32" for unsigned 32-bit, 106 "u16" for unsigned 16-bit, "u8" for unsigned 8-bit, "i64" for 107 signed 64-bit, "i32" for signed 32-bit, and "i16" for signed 16-bit 108 quantities; "float" denotes IEEE floating point values. The type 109 "void" is used to indicate that no return value is expected from a 110 call. Pointers are referred to using "*" syntax, following C 111 language convention. 112 We emphasize that all the API functions described in this 113 document are "abstract" calls and that conformant CM 114 implementations may differ in specific implementation details.
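As a non-normative illustration, these abstract types might be realized in C as follows; the typedef names below are ours, and any equivalent definitions are acceptable.

    /* Illustrative mapping of the CM's abstract data types onto
       C99 fixed-width types (any equivalent definitions will do). */
    #include <stdint.h>

    typedef uint64_t u64;   /* unsigned 64-bit */
    typedef uint32_t u32;   /* unsigned 32-bit */
    typedef uint16_t u16;   /* unsigned 16-bit */
    typedef uint8_t  u8;    /* unsigned 8-bit  */
    typedef int64_t  i64;   /* signed 64-bit   */
    typedef int32_t  i32;   /* signed 32-bit   */
    typedef int16_t  i16;   /* signed 16-bit   */
    /* "float" is the native IEEE single-precision type; "void" and
       "*" retain their usual C meanings. */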
116 3. Introduction 118 The CM is an end-system module that enables an ensemble of multiple 119 concurrent streams to perform proper congestion avoidance and 120 control, and allows applications to easily adapt their 121 transmissions to prevailing network conditions. It integrates 122 congestion management across all applications and transport 123 protocols. It maintains congestion parameters (available aggregate 124 and per-stream bandwidth, per-receiver round-trip times, etc.) and 125 exports an API that enables applications to learn about network 126 characteristics, pass information to the CM, share congestion 127 information with each other, and schedule data transmissions. All 128 data transmissions MUST be done with the explicit consent of the CM 129 via this API to ensure proper congestion behavior. 131 This document focuses on applications and networks where the 132 following conditions hold: 134 1. Applications are well-behaved with their own independent 135 per-byte or per-packet sequence number information, and use the 136 CM API to update internal state in the CM. 138 2. Networks are best-effort without service discrimination or 139 reservations. In particular, this document does not address 140 situations where different streams between the same pair of hosts 141 traverse paths with differing characteristics. 143 The Congestion Manager framework can be extended to support 144 applications that do not provide their own feedback and to 145 differentially served networks. These extensions will be addressed 146 in later documents. 148 The CM is motivated by two main goals: 150 (i) Enable efficient multiplexing. Increasingly, the trend on the 151 Internet is for unicast data senders (e.g., Web servers) to 152 transmit heterogeneous types of data to receivers, ranging from 153 unreliable real-time streaming content to reliable Web pages and 154 applets. As a result, many logically different streams share the 155 same path between sender and receiver. For the Internet to remain 156 stable, each of these streams must incorporate control protocols 157 that safely probe for spare bandwidth and react to 158 congestion. Unfortunately, these concurrent streams typically compete 159 with each other for network resources, rather than share them 160 effectively. Furthermore, they do not learn from each other about 161 the state of the network. Even if they each independently implement 162 congestion control (e.g., a group of TCP connections each 163 implementing the algorithms in [Jacobson88, Allman99]), the 164 ensemble of streams tends to be more aggressive in the face of 165 congestion than a single TCP connection implementing standard TCP 166 congestion control and avoidance [Balakrishnan98]. 168 (ii) Enable application adaptation to congestion. Increasingly 169 popular real-time streaming applications run over UDP using their 170 own user-level transport protocols for good application 171 performance, but in most cases today do not adapt or react properly 172 to network congestion. By implementing a stable control algorithm 173 and exposing an adaptation API, the CM enables easy application 174 adaptation to congestion. Applications adapt the data they 175 transmit to the current network conditions. 177 The CM framework builds on recent work on TCP control block sharing 178 [Touch97], integrated TCP congestion control (TCP-Int) 179 [Balakrishnan98], and TCP sessions [Padmanabhan98]. [Touch97] 180 advocates the sharing of some of the state in the TCP control block 181 to improve transient transport performance and describes sharing 182 across an ensemble of TCP connections. [Balakrishnan98], 183 [Padmanabhan98], and [Eggert00] describe several experiments that 184 quantify the benefits of sharing congestion state, including 185 improved stability in the face of congestion and better loss 186 recovery. Integrating loss recovery across concurrent connections 187 significantly improves performance because losses on one connection 188 can be detected by noticing that later data sent on another 189 connection has been received and acknowledged. The CM framework 190 extends these ideas in two significant ways: (i) it extends 191 congestion management to non-TCP streams, which are becoming 192 increasingly common and often do not implement proper congestion 193 management, and (ii) it provides an API for applications to adapt 194 their transmissions to current network conditions. For an extended 195 discussion of the motivation for the CM, its architecture, API, 196 and algorithms, see [Balakrishnan99]; for a description of an 197 implementation and performance results, see [Andersen00]. 199 The resulting end-host protocol architecture at the sender is shown 200 in Figure 1.
The CM helps achieve network stability by 201 implementing stable congestion avoidance and control algorithms 202 that are "TCP-friendly" [Mahdavi98] based on algorithms described in 203 [Allman99]. However, it does not attempt to enforce proper 204 congestion behavior for all applications (but it does not preclude 205 a policer on the host that performs this task). Note that while 206 the policer at the end-host can use CM, the network has to be 207 protected against compromises to the CM and the policer at the end 208 hosts, a task that requires router machinery [Floyd99a]. We do not 209 address this issue further in this document. 211 |--------| |--------| |--------| |--------| |--------------| 212 | HTTP | | FTP | | RTP 1 | | RTP 2 | | | 213 |--------| |--------| |--------| |--------| | | 214 | | | ^ | ^ | | 215 | | | | | | | Scheduler | 216 | | | | | | |---| | | 217 | | | |-------|--+->| | | | 218 | | | | | |<--| | 219 v v v v | | |--------------| 220 |--------| |--------| |-------------| | | ^ 221 | TCP 1 | | TCP 2 | | UDP 1 | | A | | 222 |--------| |--------| |-------------| | | | 223 ^ | ^ | | | | |--------------| 224 | | | | | | P |-->| | 225 | | | | | | | | | 226 |---|------+---|--------------|------->| | | Congestion | 227 | | | | I | | | 228 v v v | | | Controller | 229 |-----------------------------------| | | | | 230 | IP |-->| | | | 231 |-----------------------------------| | | |--------------| 232 |---| 234 Figure 1 236 The key components of the CM framework are (i) the API, (ii) the 237 congestion controller, (iii) the scheduler. The API is (in part) 238 motivated by the ideas of application-level framing (ALF) [Clark90] 239 and is described in Section 4. The CM internals (Section 5) 240 include a congestion controller (Section 5.1) and a scheduler to 241 orchestrate data transmissions between concurrent streams in a 242 macroflow (Section 5.2). The congestion controller adjusts the 243 aggregate transmission rate between sender and receiver based on 244 its estimate of congestion in the network. It obtains feedback 245 about its past transmissions from applications themselves via the 246 API. The scheduler apportions available bandwidth amongst the 247 different streams within each macroflow and notifies applications 248 when they are permitted to send data. This document focuses on 249 well-behaved applications; a future one will describe the 250 sender-receiver protocol and header formats that will handle 251 applications that do not incorporate their own feedback to the CM. 253 4. CM API 255 Using the CM API, streams can determine their share of the available 256 bandwidth, request and have their data transmissions scheduled, 257 inform the CM about successful transmissions, and be informed when 258 the CM's estimate of path bandwidth changes. Thus, the CM frees 259 applications from having to maintain information about the state of 260 congestion and available bandwidth along any path. 262 The function prototypes below follow standard C language 263 convention. We emphasize that these API functions are abstract 264 calls and conformant CM implementations may differ in specific 265 details, as long as equivalent functionality is provided. 267 When a new stream is created by an application, it passes some 268 information to the CM via the cm_open(stream_info) API call. 
269 Currently, stream_info consists of the following information: (i) 270 the source IP address, (ii) the source port, (iii) the destination 271 IP address, (iv) the destination port, and (v) the IP protocol 272 number. 274 4.1 State maintenance 276 1. Open: All applications MUST call cm_open(stream_info) before 277 using the CM API. This returns a handle, cm_streamid, for the 278 application to use for all further CM API invocations for that 279 stream. If cm_streamid is -1, then the cm_open() failed and that 280 stream cannot use the CM. 282 All other calls to the CM for a stream use the cm_streamid 283 returned from the cm_open() call. 285 2. Close: When a stream terminates, the application SHOULD invoke 286 cm_close(cm_streamid) to inform the CM about the termination 287 of the stream. 289 3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of 290 the path between sender and receiver. Internally, this 291 information SHOULD be obtained via path MTU discovery 292 [Mogul90]. It MAY be statically configured in the absence of 293 such a mechanism. 295 4.2 Data transmission 297 The CM accommodates two types of adaptive senders, enabling 298 applications to dynamically adapt their content based on 299 prevailing network conditions, and supporting ALF-based 300 applications. 302 1. Callback-based transmission. The callback-based transmission API 303 puts the stream in firm control of deciding what to transmit at 304 each point in time. To achieve this, the CM does not buffer any 305 data; instead, it allows streams the opportunity to adapt to 306 unexpected network changes at the last possible instant. This 307 enables streams to "pull out" and repacketize data upon 308 learning about any rate change, which is hard to do once the data 309 has been buffered. A stream wishing to send data in this style 310 MUST call cm_request(i32 cm_streamid). After some time, depending 311 on the rate, the CM invokes a callback using cmapp_send(), which is 312 a grant for the stream to send up to PMTU bytes. The 313 callback-style API is the recommended choice for ALF-based streams. 314 Note that cm_request() does not take the number of bytes or 315 MTU-sized units as an argument; each call to cm_request() is an 316 implicit request for sending up to PMTU bytes. Section 4.3 317 discusses the time duration for which the transmission grant is 318 valid, while Section 5.2 describes how these requests are scheduled 319 and callbacks made. 321 2. Synchronous-style. The above callback-based API accommodates a 322 class of ALF streams that are "asynchronous." Asynchronous 323 transmitters do not transmit based on a periodic clock, but do so 324 triggered by asynchronous events like file reads or captured 325 frames. On the other hand, there are many streams that are 326 "synchronous" transmitters, which transmit periodically based on 327 their own internal timers (e.g., an audio sender that sends at a 328 constant sampling rate). While CM callbacks could be configured to 329 periodically interrupt such transmitters, the transmit loop of such 330 applications is less affected if they retain their original 331 timer-based loop. In addition, it complicates the CM API to have a 332 stream express the periodicity and granularity of its callbacks.
Thus, the CM exports an API that allows such streams to be informed 334 of changes in rates using the cmapp_update(u64 newrate, u32 srtt, 335 u32 rttdev) callback function, where newrate is the new rate in 336 bits per second for this stream, srtt is the current smoothed round 337 trip time estimate in microseconds, and rttdev is the smoothed 338 linear deviation in the round-trip time estimate. The newrate 339 value reports an instantaneous rate calculated, for example, by 340 taking the ratio of cwnd and srtt, and multiplying by the fraction 341 of that ratio allocated to the stream. In response, the stream MUST 342 adapt its packet size or change its timer interval to conform to 343 (i.e., not exceed) the allowed rate. Of course, it may choose not 344 to use all of this rate. Note that the CM is not on the data path 345 of the actual transmission. 347 To avoid unnecessary cmapp_update() callbacks that the application 348 will only ignore, the stream can use the cm_thresh(float 349 rate_downthresh, float rate_upthresh, float rtt_downthresh, float 350 rtt_upthresh) function at any stage in its execution. In response, 351 the CM will invoke the callback only when the rate decreases to 352 less than (rate_downthresh * lastrate) or increases to more than 353 (rate_upthresh * lastrate), where lastrate is the rate last 354 notified to the stream, or when the round-trip time changes 355 correspondingly by the requisite thresholds. This information is 356 used as a hint by the CM, in the sense that cmapp_update() can be 357 called even if these conditions are not met. 359 An application can query the current CM state by using cm_query(i32 360 cm_streamid, i64* rate, i32* srtt, i32* rttdev). This sets the 361 rate variable to the current rate estimate in bits per second, the 362 srtt variable to the current smoothed round-trip time estimate in 363 microseconds, and rttdev to the mean linear deviation. If the CM 364 does not have valid estimates for the macroflow, it fills in 365 negative values for the rate, srtt, and rttdev.
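As a non-normative illustration, the fragment below shows how a synchronous-style stream might install hysteresis thresholds and then consult the current estimates; the 10% thresholds and the function name are arbitrary example values, not part of the API.

    /* Sketch: ask for cmapp_update() callbacks only on rate or RTT
       changes of roughly 10% or more, then poll current estimates. */
    void setup_rate_feedback(i32 cm_streamid)
    {
        i64 rate;
        i32 srtt, rttdev;

        cm_thresh(0.9, 1.1, 0.9, 1.1); /* down/up rate, down/up rtt */

        cm_query(cm_streamid, &rate, &srtt, &rttdev);
        if (rate < 0)
            return;        /* no valid estimate for the macroflow yet */
        /* ... size packets or set timer intervals from rate (bits/s)
           and srtt (microseconds) ... */
    }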
367 Note that a stream can use more than one of the above transmission 368 APIs at the same time. In particular, the knowledge of sustainable 369 rate is useful for asynchronous streams as well as synchronous 370 ones; e.g., an asynchronous Web server disseminating images using 371 TCP may use cmapp_send() to schedule its transmissions and 372 cmapp_update() to decide whether to send a low-resolution or 373 high-resolution image. A TCP implementation using the CM is 374 described in Section 6.1.1, where the benefit of the cm_request() 375 callback API for TCP will become apparent. 377 The reader will notice that the basic CM API does not provide an 378 interface for buffered congestion-controlled transmissions. This 379 is intentional, since this transmission mode can be implemented 380 using the callback-based primitive. Section 6.1.2 describes how 381 congestion-controlled UDP sockets may be implemented using the CM 382 API. 384 4.3 Application notification 386 When a stream receives feedback from receivers, it MUST use 387 cm_update(i32 cm_streamid, u32 nrecd, u32 nlost, u8 lossmode, i32 388 rtt) to inform the CM about events such as congestion losses, 389 successful receptions, type of loss (timeout event, Explicit 390 Congestion Notification [Ramakrishnan98], etc.) and round-trip time 391 samples. The nrecd parameter indicates how many bytes were 392 successfully received by the receiver since the last cm_update 393 call, while the nlost parameter indicates how many bytes were 394 lost during the same time period. The rtt value 395 indicates the round-trip time measured during the transmission of 396 these bytes. The rtt value MUST be set to -1 if no valid 397 round-trip sample was obtained by the application. The lossmode 398 parameter provides an indicator of how a loss was detected. A 399 value of CM_PERSISTENT indicates that the application believes 400 congestion to be severe, e.g., a TCP that has experienced a 401 timeout. A value of CM_TRANSIENT indicates that the application 402 believes that the congestion is not severe, e.g., a TCP loss 403 detected using duplicate (selective) acknowledgements or other 404 data-driven techniques. A value of CM_ECN indicates that the 405 receiver echoed an explicit congestion notification message. 406 Finally, a value of CM_NOLOSS indicates that no congestion-related 407 loss has occurred. The lossmode parameter MUST be reported as a 408 bit-vector where the bits correspond to CM_PERSISTENT, 409 CM_TRANSIENT, and CM_ECN. 411 cm_notify(stream_info, u32 nsent) MUST be called when data is 412 transmitted from the host (e.g., in the IP output routine) to 413 inform the CM that nsent bytes were just transmitted on a given 414 stream. This allows the CM to update its estimate of the number of 415 outstanding bytes for the macroflow and for the stream. 417 A cmapp_send() grant from the CM to an application is valid only 418 for an expiration time, equal to the larger of the round-trip time 419 and an implementation-dependent threshold communicated as an 420 argument to the cmapp_send() callback function. The application 421 MUST NOT send data based on this callback after this time has 422 expired. Furthermore, if the application decides not to send data 423 after receiving this callback, it SHOULD call 424 cm_notify(stream_info, 0) to allow the CM to permit other streams 425 in the macroflow to transmit data. The CM congestion controller 426 MUST be robust to applications forgetting to invoke 427 cm_notify(stream_info, 0) correctly, or applications that crash or 428 disappear after having made a cm_request() call. 430 4.4 Querying 432 If applications wish to learn about per-stream available bandwidth 433 and round-trip time, they can use the CM's cm_query(i32 434 cm_streamid, i64* rate, i32* srtt, i32* rttdev) call, which fills 435 in the desired quantities. If the CM does not have valid estimates 436 for the macroflow, it fills in negative values for the rate, srtt, 437 and rttdev. 439 4.5 Sharing granularity 441 One of the decisions the CM needs to make is the granularity at 442 which a macroflow is constructed, by deciding which streams belong 443 to the same macroflow and share congestion information. The API 444 provides two functions that allow applications to decide which of 445 their streams ought to belong to the same macroflow. 447 cm_getmacroflow(i32 cm_streamid) returns a unique i32 macroflow 448 identifier. cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid) 449 sets the macroflow of the stream cm_streamid to cm_macroflowid. If the 450 cm_macroflowid that is passed to cm_setmacroflow() is -1, then a 451 new macroflow is constructed and this is returned to the caller. 452 Each call to cm_setmacroflow() overrides the previous macroflow 453 association for the stream, should one exist.
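The fragment below sketches one plausible use of these calls, placing two streams to the same receiver on a common macroflow; the stream_info variables info1 and info2 are hypothetical, and their setup and error handling are elided.

    /* Sketch: make two streams share congestion state. */
    i32 s1 = cm_open(info1);       /* info1, info2: stream_info for */
    i32 s2 = cm_open(info2);       /* streams to the same receiver  */

    i32 mf = cm_getmacroflow(s1);  /* macroflow s1 was assigned to  */
    cm_setmacroflow(mf, s2);       /* s2 now shares that macroflow  */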
455 The default suggested aggregation method is to aggregate by 456 destination IP address; i.e., all streams to the same destination 457 address are aggregated to a single macroflow by default. The 458 cm_getmacroflow() and cm_setmacroflow() calls can then be used to 459 change this as needed. 461 The objective of this interface is to set up sharing groups, not 462 the sharing policy (relative weights) of streams in a macroflow. 463 The latter requires the scheduler to provide an interface to set 464 sharing policy. However, because we want to support many different 465 schedulers (each of which may need different information to set 466 policy), we do not specify a complete API to the scheduler (but see 467 Section 5.2). A later guideline document intends to describe a few 468 simple schedulers (e.g., weighted round-robin, hierarchical 469 scheduling) and the API they export to provide relative 470 prioritization. 472 5. CM internals 474 This section describes the internal components of the CM. It 475 includes a Congestion Controller and a Scheduler, with 476 well-defined, abstract interfaces exported by them. 478 5.1 Congestion controller 480 Associated with each macroflow is a congestion control algorithm; 481 the collection of all these algorithms comprises the congestion 482 controller of the CM. The control algorithm decides when and how 483 much data can be transmitted by a macroflow. It uses application 484 notifications (Section 4.3) from concurrent streams on the same 485 macroflow to build up information about the congestion state of the 486 network path used by the macroflow. 488 The congestion controller MUST implement a "TCP-friendly" 489 [Mahdavi98] congestion control algorithm. Several macroflows MAY 490 (and indeed, often will) use the same congestion control algorithm 491 but each macroflow maintains state about the network used by its 492 streams. 494 The congestion control module MUST implement the following abstract 495 interfaces. We emphasize that these are not directly visible to 496 applications; they are within the context of a macroflow, and are 497 different from the CM API functions of Section 4. 499 - void query(u64 *rate, u32 *srtt, u32 *rttdev): This function 500 returns the estimated rate (in bits per second), the smoothed 501 round trip time (in microseconds), and its smoothed linear 502 deviation for the macroflow. 503 - void notify(u32 nsent): This function MUST be used to notify the 504 congestion control module whenever data is sent by an 505 application. The nsent parameter indicates the number of bytes 506 just sent by the application. 508 - void update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode): This 509 function is called whenever any of the CM streams associated with 510 a macroflow identifies that data has reached the receiver or has 511 been lost en route. The nrecd parameter indicates the number of 512 bytes that have just arrived at the receiver. The nsent 513 parameter is the sum of the number of bytes just received and the 514 number of bytes identified as lost en route. The rtt parameter is 515 the estimated round trip time in microseconds during the 516 transfer. The lossmode parameter provides an indicator of how a 517 loss was detected (section 4.3). 519 Although these interfaces are not visible to applications, the 520 congestion controller MUST implement these abstract interfaces to 521 provide for modular inter-operability with different 522 separately-developed schedulers.
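One way a C implementation might package these entry points, so that a macroflow can be bound to any separately-developed congestion control module at run time, is a per-module operations structure; the structure and its name below are illustrative only, not mandated by this document.

    /* Illustrative only: a congestion control module's abstract
       entry points gathered so the CM can invoke any module
       uniformly, one instance of state per macroflow. */
    struct cm_cc_module {
        void (*query)(u64 *rate, u32 *srtt, u32 *rttdev);
        void (*notify)(u32 nsent);
        void (*update)(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode);
    };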
524 The congestion control module MUST also call the associated 525 scheduler's schedule function (section 5.2) when it believes that 526 the current congestion state allows an MTU-sized packet to be sent. 528 5.2 Scheduler 530 While it is the responsibility of the congestion control module to 531 determine when and how much data can be transmitted, it is the 532 responsibility of a macroflow's scheduler module to determine which 533 of the streams should get the opportunity to transmit data. 535 The Scheduler MUST implement the following interfaces: 537 - void schedule(u32 num_bytes): When the congestion control module 538 determines that data can be sent, the schedule() routine MUST be 539 called with no more than the number of bytes that can be sent. 540 In turn, the scheduler MAY call the cmapp_send() function that CM 541 applications must provide. 543 - float query_share(i32 cm_streamid): This call returns the 544 specified stream's share of the total bandwidth available to the 545 macroflow. This call, combined with the query call of the 546 congestion controller, provides the information needed to satisfy 547 an application's cm_query() request. 549 - void notify(i32 cm_streamid, u32 nsent): This interface is used 550 to notify the scheduler module whenever data is sent by a CM 551 application. The nsent parameter indicates the number of bytes 552 just sent by the application. 554 6. Examples 556 6.1 Example applications 558 The following describes the possible use of the CM API by an 559 asynchronous application (an implementation of a TCP sender) and a 560 synchronous application (an audio server). More details of these 561 applications and CM implementation optimizations for efficient 562 operation are described in [Andersen00]. We emphasize that the 563 protocols in this section are examples and suggestions for 564 implementation, rather than requirements of any conformant 565 implementation. 567 6.1.1 TCP 569 A TCP sender MUST use the cmapp_send() callback API. TCP identifies 570 which data it should send only upon the arrival of an acknowledgement 571 or expiration of a timer. As a result, it requires tight control over 572 when and if new data or retransmissions are sent. 574 When TCP either connects to or accepts a connection from another 575 host, it performs a cm_open() call to associate the TCP connection 576 with a cm_streamid. 578 Once a connection is established, the CM is used to control the 579 transmission of outgoing data. The CM eliminates the need for 580 tracking and reacting to congestion in TCP, because the CM and its 581 transmission API ensure proper congestion behavior. Loss recovery 582 is still performed by TCP based on fast retransmissions and 583 recovery as well as timeouts. In addition, TCP is modified to 584 maintain its own outstanding window (tcp_ownd) estimate. Whenever data 585 segments are sent from its cmapp_send() callback, TCP updates its 586 tcp_ownd value. The tcp_ownd variable is also updated after each 587 cm_update() call. TCP also maintains a count of the number of 588 outstanding segments (pkt_cnt). At any time, TCP can calculate the 589 average packet size (avg_pkt_size) as tcp_ownd/pkt_cnt. The 590 avg_pkt_size is used by TCP to help estimate the amount of 591 outstanding data. Note that this is not needed if the SACK option 592 is used on the connection, since this information is explicitly 593 available. 595 The TCP output routines are modified as follows: 597 1. All congestion window (cwnd) checks are removed. 599 2.
When application data is available, the TCP output routines 600 perform all non-congestion checks (Nagle algorithm, 601 receiver-advertised window check, etc). If these checks pass, 602 the output routine queues the data and calls cm_request() for the 603 stream. 605 3. If incoming data or timers result in a loss being detected, 606 the retransmission is also placed in a queue and cm_request() is 607 called for the stream. 609 4. The cmapp_send() callback for TCP is set to an output 610 routine. If any retransmission is enqueued, the routine outputs 611 the retransmission. Otherwise, the routine outputs as much new 612 data as the TCP connection state allows. However, 613 cmapp_send() never sends more than a single segment per call. 614 This routine arranges for the other output computations to be 615 done, such as header and options computations. 617 The IP output routine on the host calls cm_notify() when the 618 packets are actually sent out. Because it does not know which 619 cm_streamid is responsible for the packet, cm_notify() takes the 620 stream_info as argument (see Section 4 for what the stream_info 621 should contain). Because cm_notify() reports the IP payload size, 622 TCP keeps track of the total header size and incorporates these 623 updates. 625 The TCP input routines are modified as follows: 627 1. RTT estimation is done as normal using either timestamps or 628 Karn's algorithm. Any rtt estimate that is generated is passed 629 to the CM via the cm_update() call. 631 2. All cwnd and slow start threshold (ssthresh) updates are 632 removed. 634 3. Upon the arrival of an ack for new data, TCP computes the 635 value of in_flight (the amount of data in flight) as 636 snd_max-ack-1 (i.e., MAX Sequence Sent - Current Ack - 1). TCP 637 then calls cm_update(streamid, tcp_ownd - in_flight, 0, 638 CM_NOLOSS, rtt). 640 4. Upon the arrival of a duplicate acknowledgement, TCP must 641 check its dupack count (dup_acks) to determine its action. If 642 dup_acks < 3, TCP does nothing. If dup_acks == 3, TCP 643 assumes that a packet was lost and that at least 3 packets 644 arrived to generate these duplicate acks. Therefore, it calls 645 cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size, 646 CM_TRANSIENT, rtt). The average packet size is used since the 647 acknowledgements do not indicate exactly how much data has 648 reached the other end. Most TCP implementations interpret a 649 duplicate ACK as an indication that a full MSS has reached its 650 destination. Once a new ACK is received, these TCP sender 651 implementations may resynchronize with the TCP receiver. The CM API 652 does not provide a mechanism for TCP to pass information from 653 this resynchronization. Therefore, TCP can only infer the 654 arrival of an avg_pkt_size amount of data from each duplicate 655 ack. TCP also enqueues a retransmission of the lost segment and 656 calls cm_request(). If dup_acks > 3, TCP assumes that a packet 657 has reached the other end and caused this ack to be sent. As a 658 result, it calls cm_update(streamid, avg_pkt_size, avg_pkt_size, 659 CM_NOLOSS, rtt). 661 5. Upon the arrival of a partial acknowledgment (one that does 662 not exceed the highest segment transmitted at the time the loss 663 occurred, as defined in [Floyd99b]), TCP assumes that a packet 664 was lost and that the retransmitted packet has reached the 665 recipient. Therefore, it calls cm_update(streamid, 2 * 666 avg_pkt_size, avg_pkt_size, CM_NOLOSS, rtt). CM_NOLOSS is used 667 since the loss period has already been reported. TCP also 668 enqueues a retransmission of the lost segment and calls 669 cm_request(). 671 When the TCP retransmission timer expires, the sender identifies 672 that a segment has been lost and calls cm_update(streamid, 673 avg_pkt_size, 0, CM_PERSISTENT, 0) to signify the occurrence of 674 persistent congestion to the CM. TCP also enqueues a 675 retransmission of the lost segment and calls cm_request().
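A skeletal cmapp_send() callback following the output-routine modifications above might look as follows; the tcpcb lookup and the tcp_* helper routines are hypothetical names used for illustration, not part of the CM API or any particular TCP implementation.

    /* Sketch: retransmissions take priority over new data, and at
       most one segment is sent per grant (helpers hypothetical). */
    void tcp_cmapp_send(i32 cm_streamid)
    {
        struct tcpcb *tp = tcp_lookup_by_stream(cm_streamid);

        if (!queue_empty(&tp->rexmt_queue)) {
            tcp_output_segment(tp, dequeue(&tp->rexmt_queue));
        } else if (tcp_new_data_ok(tp)) {
            /* Nagle, receiver window, etc. were already checked
               when the data was queued and cm_request() was made. */
            tcp_output_segment(tp, tcp_next_segment(tp));
        }
        /* tcp_ownd and pkt_cnt are updated as the segment is sent;
           cm_notify() fires later from the IP output routine. */
    }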
677 6.1.2 Congestion-controlled UDP 679 Congestion-controlled UDP sockets are a useful CM application, which 680 we describe in the context of Berkeley sockets [Stevens94]. They 681 provide the same functionality as standard Berkeley UDP sockets, 682 but instead of immediately sending the data from the kernel packet 683 queue to lower layers for transmission, the buffered socket 684 implementation makes calls to the API exported by the CM inside the 685 kernel and gets callbacks from the CM. When a CM UDP socket is 686 created, it is bound to a particular stream. Later, when data is 687 added to the packet queue, cm_request() is called on the stream 688 associated with the socket. When the CM schedules this stream for 689 transmission, it calls udp_ccappsend() in the UDP module. This 690 function transmits one MTU from the packet queue, and schedules the 691 transmission of any remaining packets. The in-kernel 692 implementation of the CM UDP API SHOULD NOT require any additional 693 data copies and SHOULD support all standard UDP options. Modifying 694 existing applications to use congestion-controlled UDP requires the 695 implementation of a new socket option on the socket. To work 696 correctly, the sender MUST obtain feedback about congestion. This 697 can be done in at least two ways: (i) the UDP receiver application 698 can provide feedback to the sender application, which will inform 699 the CM of network conditions using cm_update(); (ii) the UDP 700 receiver implementation can provide feedback to the sending UDP. 701 Note that this latter alternative requires changes to the 702 receiver's network stack, and the sending UDP cannot assume that all 703 receivers support this option without explicit negotiation. 705 6.1.3 Audio server 707 A typical audio application often has access to audio samples in a 708 multitude of data rates and qualities. The objective of the 709 application is then to deliver the highest possible quality of 710 audio (typically the highest data rate) to its clients. The selection 711 of which version of audio to transmit should be based on the 712 current congestion state of the network. In addition, the source 713 will want audio delivered to its users at a consistent sampling 714 rate. As a result, it must send data at a regular rate, minimizing 715 transmission delays and reducing buffering before playback. To 716 meet these requirements, this application can use the synchronous 717 sender API (Section 4.2). 719 When the source first starts, it uses the cm_query() call to get an 720 initial estimate of network bandwidth and delay. If some other 721 streams on that macroflow have already been active, then it gets an 722 initial estimate that is valid; otherwise, it gets negative values, 723 which it ignores. It then chooses an encoding that does not exceed 724 these estimates (or, in the case of an invalid estimate, uses 725 application-specific initial values) and begins transmitting 726 data. The application also implements the cmapp_update() callback. 727 When the CM determines that network characteristics have changed, 728 it calls the application's cmapp_update() function and passes it a 729 new rate and round-trip time estimate. The application MUST change 730 its choice of audio encoding to ensure that it does not exceed 731 these new estimates. 733 To use the CM, the application MUST incorporate feedback from the 734 receiver. In this example, it must periodically (typically once or 735 twice per round trip time) determine how many of its packets 736 arrived at the receiver. When the source gets this feedback, it 737 MUST use cm_update() to inform the CM of this new information. 738 This results in the CM updating ownd and may result in the CM 739 changing its estimates and calling cmapp_update() for the streams 740 of the macroflow.
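To make the adaptation loop concrete, the sketch below shows one possible shape for the audio server's cmapp_update() handler; the encoding table, its length n_encodings, and the select_encoding() helper are hypothetical, not prescribed by the CM API.

    /* Sketch: pick the highest-rate encoding whose bit rate does not
       exceed the CM's new estimate.  encodings[] is assumed sorted
       by increasing bit rate (all names here are hypothetical). */
    struct encoding { u64 bitrate; /* bits per second */ };
    extern struct encoding encodings[];
    extern int n_encodings;

    void cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
    {
        int i, best = 0;

        for (i = 0; i < n_encodings; i++)
            if (encodings[i].bitrate <= newrate)
                best = i;
        select_encoding(best);   /* hypothetical: switch the stream */
    }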
742 6.2 Example congestion control module 744 To illustrate the responsibilities of a congestion control module, 745 the following describes some of the actions of a simple TCP-like 746 congestion control module that implements Additive Increase 747 Multiplicative Decrease congestion control (AIMD_CC): 749 - query(): AIMD_CC returns the current congestion window (cwnd) 750 divided by the smoothed rtt (srtt) as its bandwidth estimate. It 751 returns the smoothed rtt estimate as srtt. 753 - notify(): AIMD_CC adds the number of bytes sent to its 754 outstanding data window (ownd). 756 - update(): AIMD_CC subtracts nsent from ownd. If the value of rtt 757 is non-zero, AIMD_CC updates srtt using the TCP srtt calculation. 758 If the update indicates that data has been lost, AIMD_CC sets 759 cwnd to 1 MTU if the loss_mode is CM_PERSISTENT and to cwnd/2 760 (with a minimum of 1 MTU) if the loss_mode is CM_TRANSIENT or 761 CM_ECN. AIMD_CC also sets its internal ssthresh variable to 762 cwnd/2. If no loss has occurred, AIMD_CC mimics TCP slow start 763 and linear growth modes. It increments cwnd by nsent when cwnd < 764 ssthresh (bounded by a maximum of ssthresh-cwnd) and by nsent * 765 MTU/cwnd when cwnd >= ssthresh. 767 - When cwnd or ownd are updated and indicate that at least one MTU 768 may be transmitted, AIMD_CC calls the CM to schedule a 769 transmission.
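The following fragment sketches how AIMD_CC's update() might realize the rules above; the variables are per-macroflow state, and the srtt smoothing constant is the usual TCP choice, shown only for illustration.

    /* Sketch of AIMD_CC's update(); cwnd, ownd, ssthresh (bytes),
       srtt (microseconds) and MTU are per-macroflow state. */
    static u32 cwnd, ownd, ssthresh, srtt;
    static const u32 MTU = 1460;   /* example PMTU payload size */

    void aimd_update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode)
    {
        ownd -= nsent;
        if (rtt != 0)
            srtt = (7 * srtt + rtt) / 8;     /* TCP-style smoothing */

        if (lossmode & (CM_PERSISTENT | CM_TRANSIENT | CM_ECN)) {
            ssthresh = cwnd / 2;
            if (lossmode & CM_PERSISTENT)
                cwnd = MTU;                  /* severe congestion   */
            else
                cwnd = (cwnd / 2 > MTU) ? cwnd / 2 : MTU;
        } else {                             /* CM_NOLOSS           */
            if (cwnd < ssthresh)             /* slow start, capped  */
                cwnd += (nsent < ssthresh - cwnd) ? nsent
                                                  : ssthresh - cwnd;
            else                             /* linear growth       */
                cwnd += nsent * MTU / cwnd;
        }

        if (cwnd > ownd && cwnd - ownd >= MTU)
            schedule(cwnd - ownd);           /* let scheduler send  */
    }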
771 6.3 Example Scheduler Module 773 To clarify the responsibilities of a scheduler module, the 774 following describes some of the actions of a simple round robin 775 scheduler module (RR_sched): 777 - schedule(): RR_sched schedules as many streams as possible in round 778 robin fashion. 780 - query_share(): RR_sched returns 1/(number of streams in macroflow). 782 - notify(): RR_sched does nothing. Round robin scheduling is not 783 affected by the amount of data sent. 785 7. Security considerations 787 The CM provides many of the same services that the congestion 788 control in TCP provides. As such, it is vulnerable to many of the 789 same security problems. For example, incorrect reports of losses 790 and transmissions will give the CM an inaccurate picture of the 791 network's congestion state. By giving the CM a high estimate of 792 congestion, an attacker can degrade the performance observed by 793 applications. The more dangerous form of attack is giving the CM a 794 low estimate of congestion. This would cause the CM to be overly 795 aggressive and allow data to be sent much more quickly than sound 796 congestion control policies would allow. [Touch97] describes the 797 security problems that arise with congestion information sharing in 798 more detail. 800 8. References 802 [Allman99] Allman, M. and Paxson, V., "TCP Congestion Control," 803 RFC-2581, April 1999. 805 [Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S., and 806 Balakrishnan, H., "System Support for Bandwidth Management and 807 Content Adaptation in Internet Applications," Proc. 4th Symp. on 808 Operating Systems Design and Implementation, San Diego, CA, 809 October 2000. 811 [Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S., 812 Stemm, M., and Katz, R., "TCP Behavior of a Busy Web Server: 813 Analysis and Improvements," Proc. IEEE INFOCOM, San Francisco, 814 CA, March 1998. 816 [Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An 817 Integrated Congestion Management Architecture for Internet 818 Hosts," Proc. ACM SIGCOMM, Cambridge, MA, September 1999. 820 [Bradner96] Bradner, S., "The Internet Standards Process --- 821 Revision 3", BCP 9, RFC-2026, October 1996. 823 [Bradner97] Bradner, S., "Key words for use in RFCs to Indicate 824 Requirement Levels", BCP 14, RFC-2119, March 1997. 826 [Clark90] Clark, D. and Tennenhouse, D., "Architectural 827 Consideration for a New Generation of Protocols", Proc. ACM 828 SIGCOMM, Philadelphia, PA, September 1990. 830 [Eggert00] Eggert, L., Heidemann, J., and Touch, J., "Effects of 831 Ensemble TCP," ACM Computer Comm. Review, January 2000. 833 [Floyd99a] Floyd, S. and Fall, K., "Promoting the Use of End-to-End 834 Congestion Control in the Internet," IEEE/ACM Trans. on 835 Networking, 7(4), August 1999, pp. 458-472. 837 [Floyd99b] Floyd, S. and Henderson, T., "The NewReno Modification 838 to TCP's Fast Recovery Algorithm," RFC-2582, April 839 1999. (Experimental.) 841 [Jacobson88] Jacobson, V., "Congestion Avoidance and Control," 842 Proc. ACM SIGCOMM, Stanford, CA, August 1988. 844 [Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly Website," 845 http://www.psc.edu/networking/tcp_friendly.html 847 [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery," 848 RFC-1191, November 1990. 850 [Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of Web 851 Data Transport," PhD thesis, Univ. of California, Berkeley, 852 December 1998. 854 [Postel81] Postel, J. (ed.), "Transmission Control Protocol", 855 RFC-793, September 1981. 857 [Ramakrishnan98] Ramakrishnan, K. and Floyd, S., "A Proposal to Add 858 Explicit Congestion Notification (ECN) to IP," RFC-2481, 859 January 1999. (Experimental.) 861 [Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1. 862 Addison-Wesley, Reading, MA, 1994. 864 [Touch97] Touch, J., "TCP Control Block Interdependence," RFC-2140, 865 April 1997. (Informational.) 867 9. Acknowledgments 869 We thank David Andersen, Deepak Bansal, and Dorothy Curtis for 870 their work on the CM design and implementation. We thank Vern 871 Paxson for his detailed comments and patience, and Sally Floyd, 872 Mark Handley, and Steven McCanne for useful feedback on the CM 873 architecture. 875 10. Authors' addresses 877 Hari Balakrishnan 878 Laboratory for Computer Science 879 545 Technology Square 880 Massachusetts Institute of Technology 881 Cambridge, MA 02139 882 Email: hari@lcs.mit.edu 883 Web: http://nms.lcs.mit.edu/~hari/ 885 Srinivasan Seshan 886 School of Computer Science 887 Carnegie Mellon University 888 5000 Forbes Ave. 889 Pittsburgh, PA 15213 890 Email: srini@seshan.org 891 Web: http://www.seshan.org/ 893 Full Copyright Statement 895 Copyright (C) The Internet Society (2000). All Rights Reserved.
896 This document and translations of it may be copied and furnished to 897 others, and derivative works that comment on or otherwise explain 898 it or assist in its implementation may be prepared, copied, 899 published and distributed, in whole or in part, without restriction 900 of any kind, provided that the above copyright notice and this 901 paragraph are included on all such copies and derivative works. 902 However, this document itself may not be modified in any way, such 903 as by removing the copyright notice or references to the Internet 904 Society or other Internet organizations, except as needed for the 905 purpose of developing Internet standards in which case the 906 procedures for copyrights defined in the Internet Standards process 907 must be followed, or as required to translate it into languages 908 other than English.