Internet Engineering Task Force                        Hari Balakrishnan
INTERNET DRAFT                                                   MIT LCS
Document: draft-ietf-ecm-cm-03.txt                     Srinivasan Seshan
                                                                     CMU
                                                          November, 2000
Expires: May 2001

                         The Congestion Manager

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC-2026 [Bradner96].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted
   by other documents at any time.  It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than
   as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document describes the Congestion Manager (CM), an end-system
   module that:

   (i) Enables an ensemble of multiple concurrent streams from a
   sender destined to the same receiver and sharing the same
   congestion properties to perform proper congestion avoidance and
   control, and

   (ii) Allows applications to easily adapt to network congestion.

   The framework described in this document integrates congestion
   management across all applications and transport protocols.  The
   CM maintains congestion parameters (available aggregate and
   per-stream bandwidth, per-receiver round-trip times, etc.) and
   exports an API that enables applications to learn about network
   characteristics, pass information to the CM, share congestion
   information with each other, and schedule data transmissions.
   This document focuses on applications and transport protocols with
   their own independent per-byte or per-packet sequence number
   information, and does not require modifications to the receiver
   protocol stack.  However, the receiving application must provide
   feedback to the sending application about received packets and
   losses, and the latter is expected to use the CM API to update CM
   state.  This document does not address networks with reservations
   or service differentiation.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in RFC-2119
   [Bradner97].

   STREAM
      A group of packets that all share the same source and
      destination IP address, IP type-of-service, transport protocol,
      and source and destination transport-layer port numbers.
   MACROFLOW
      A group of streams that all use the same congestion management
      and scheduling algorithms, and share congestion state
      information.  Currently, streams destined to different
      receivers belong to different macroflows.  Streams destined to
      the same receiver MAY belong to different macroflows.  Streams
      that experience identical congestion behavior in the Internet
      and use the same congestion control algorithm SHOULD belong to
      the same macroflow.

   APPLICATION
      Any software module that uses the CM.  This includes user-level
      applications such as Web servers or audio/video servers, as
      well as in-kernel protocols such as TCP [Postel81] that use the
      CM for congestion control.

   WELL-BEHAVED APPLICATION
      An application that only transmits when allowed by the CM and
      accurately accounts for all data that it has sent to the
      receiver by informing the CM using the CM API.

   PATH MAXIMUM TRANSMISSION UNIT (PMTU)
      The size of the largest packet that the sender can transmit
      without it being fragmented en route to the receiver.  It
      includes the sizes of all headers and data except the IP
      header.

   CONGESTION WINDOW (cwnd)
      A CM state variable that modulates the amount of outstanding
      data between sender and receiver.

   OUTSTANDING WINDOW (ownd)
      The number of bytes that have been transmitted by the source,
      but are not known to have been either received by the
      destination or lost in the network.

   INITIAL WINDOW (IW)
      The size of the sender's congestion window at the beginning of
      a macroflow.

   DATA TYPE SYNTAX
      We use "u64" for unsigned 64-bit, "u32" for unsigned 32-bit,
      "u16" for unsigned 16-bit, "u8" for unsigned 8-bit, "i64" for
      signed 64-bit, "i32" for signed 32-bit, and "i16" for signed
      16-bit quantities, and "float" for IEEE floating point values.
      The type "void" indicates that no return value is expected from
      a call.  Pointers are referred to using "*" syntax, following C
      language convention.

   We emphasize that all the API functions described in this
   document are "abstract" calls and that conformant CM
   implementations may differ in specific implementation details.
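   As an illustration, these abstract types might map onto C as
   follows.  This mapping is not part of the specification; the
   <stdint.h> names are simply one convenient, assumed binding.

      #include <stdint.h>

      typedef uint64_t u64;   /* unsigned 64-bit quantity */
      typedef uint32_t u32;   /* unsigned 32-bit quantity */
      typedef uint16_t u16;   /* unsigned 16-bit quantity */
      typedef uint8_t  u8;    /* unsigned 8-bit quantity  */
      typedef int64_t  i64;   /* signed 64-bit quantity   */
      typedef int32_t  i32;   /* signed 32-bit quantity   */
      typedef int16_t  i16;   /* signed 16-bit quantity   */
      /* "float" is the C single-precision IEEE type. */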
3. Introduction

   The CM is an end-system module that enables an ensemble of
   multiple concurrent streams to perform stable congestion avoidance
   and control, and allows applications to easily adapt their
   transmissions to prevailing network conditions.  It integrates
   congestion management across all applications and transport
   protocols.  It maintains congestion parameters (available
   aggregate and per-stream bandwidth, per-receiver round-trip times,
   etc.) and exports an API that enables applications to learn about
   network characteristics, pass information to the CM, share
   congestion information with each other, and schedule data
   transmissions.  All data transmissions MUST be done with the
   explicit consent of the CM via this API to ensure proper
   congestion behavior.

   This document focuses on applications and networks where the
   following conditions hold:

   1. Applications are well-behaved with their own independent
      per-byte or per-packet sequence number information, and use the
      CM API to update internal state in the CM.

   2. Networks are best-effort without service discrimination or
      reservations.  In particular, this document does not address
      situations where different streams between the same pair of
      hosts traverse paths with differing characteristics.

   The Congestion Manager framework can be extended to support
   applications that do not provide their own feedback and to
   differentially-served networks.  These extensions will be
   addressed in later documents.

   The CM is motivated by two main goals:

   (i) Enable efficient multiplexing.  Increasingly, the trend on the
   Internet is for unicast data senders (e.g., Web servers) to
   transmit heterogeneous types of data to receivers, ranging from
   unreliable real-time streaming content to reliable Web pages and
   applets.  As a result, many logically different streams share the
   same path between sender and receiver.  For the Internet to remain
   stable, each of these streams must incorporate control protocols
   that safely probe for spare bandwidth and react to congestion.
   Unfortunately, these concurrent streams typically compete with
   each other for network resources, rather than share them
   effectively.  Furthermore, they do not learn from each other about
   the state of the network.  Even if they each independently
   implement congestion control (e.g., a group of TCP connections
   each implementing the algorithms in [Jacobson88, Allman99]), the
   ensemble of streams tends to be more aggressive in the face of
   congestion than a single TCP connection implementing standard TCP
   congestion control and avoidance [Balakrishnan98].

   (ii) Enable application adaptation to congestion.  Increasingly
   popular real-time streaming applications run over UDP using their
   own user-level transport protocols for good application
   performance, but in most cases today do not adapt or react
   properly to network congestion.  By implementing a stable control
   algorithm and exposing an adaptation API, the CM enables easy
   application adaptation to congestion.  Applications adapt the data
   they transmit to the current network conditions.

   The CM framework builds on recent work on TCP control block
   sharing [Touch97], integrated TCP congestion control (TCP-Int)
   [Balakrishnan98], and TCP sessions [Padmanabhan98].  [Touch97]
   advocates the sharing of some of the state in the TCP control
   block to improve transient transport performance and describes
   sharing across an ensemble of TCP connections.  [Balakrishnan98],
   [Padmanabhan98], and [Eggert00] describe several experiments that
   quantify the benefits of sharing congestion state, including
   improved stability in the face of congestion and better loss
   recovery.  Integrating loss recovery across concurrent connections
   significantly improves performance because losses on one
   connection can be detected by noticing that later data sent on
   another connection has been received and acknowledged.  The CM
   framework extends these ideas in two significant ways: (i) it
   extends congestion management to non-TCP streams, which are
   becoming increasingly common and often do not implement proper
   congestion management, and (ii) it provides an API for
   applications to adapt their transmissions to current network
   conditions.  For an extended discussion of the motivation for the
   CM, its architecture, API, and algorithms, see [Balakrishnan99];
   for a description of an implementation and performance results,
   see [Andersen00].

   The resulting end-host protocol architecture at the sender is
   shown in Figure 1.
   The CM helps achieve network stability by implementing stable
   congestion avoidance and control algorithms that are
   "TCP-friendly" [Mahdavi98], based on algorithms described in
   [Allman99].  However, it does not attempt to enforce proper
   congestion behavior for all applications (though it does not
   preclude a policer on the host that performs this task).  Note
   that while the policer at the end-host can use the CM, the network
   has to be protected against compromises to the CM and the policer
   at the end hosts, a task that requires router machinery
   [Floyd99a].  We do not address this issue further in this
   document.

   |--------| |--------| |--------| |--------|      |--------------|
   |  HTTP  | |  FTP   | | RTP 1  | | RTP 2  |      |              |
   |--------| |--------| |--------| |--------|      |              |
       |          |         ^  |       ^  |         |              |
       |          |         |  |       |  |         |  Scheduler   |
       |          |         |  |       |  |  |---|  |              |
       |          |         |  |-------|--+->|   |->|              |
       |          |         |          |     |   |<-|              |
       v          v         v          v     |   |  |--------------|
   |--------| |--------| |-------------|     |   |         ^
   | TCP 1  | | TCP 2  | |    UDP 1    |     | A |         |
   |--------| |--------| |-------------|     |   |         |
      ^  |       ^  |          |             |   |  |--------------|
      |  |       |  |          |             | P |->|              |
      |  |       |  |          |             |   |  |              |
      |--|-------+--|----------|------------>|   |  |  Congestion  |
         |          |          |             | I |  |              |
         v          v          v             |   |  |  Controller  |
   |-----------------------------------|     |   |  |              |
   |                IP                 |---->|   |  |              |
   |-----------------------------------|     |   |  |--------------|
                                             |---|

                               Figure 1

   The key components of the CM framework are (i) the API, (ii) the
   congestion controller, and (iii) the scheduler.  The API is (in
   part) motivated by the requirements of application-level framing
   (ALF) [Clark90], and is described in Section 4.  The CM internals
   (Section 5) include a congestion controller (Section 5.1) and a
   scheduler to orchestrate data transmissions between concurrent
   streams in a macroflow (Section 5.2).  The congestion controller
   adjusts the aggregate transmission rate between sender and
   receiver based on its estimate of congestion in the network.  It
   obtains feedback about its past transmissions from applications
   themselves via the API.  The scheduler apportions available
   bandwidth amongst the different streams within each macroflow and
   notifies applications when they are permitted to send data.  This
   document focuses on well-behaved applications; a future one will
   describe the sender-receiver protocol and header formats that will
   handle applications that do not incorporate their own feedback to
   the CM.

4. CM API

   Using the CM API, streams can determine their share of the
   available bandwidth, request and have their data transmissions
   scheduled, inform the CM about successful transmissions, and be
   informed when the CM's estimate of path bandwidth changes.  Thus,
   the CM frees applications from having to maintain information
   about the state of congestion and available bandwidth along any
   path.

   The function prototypes below follow standard C language
   convention.  We emphasize that these API functions are abstract
   calls and conformant CM implementations may differ in specific
   details, as long as equivalent functionality is provided.

   When a new stream is created by an application, it passes some
   information to the CM via the cm_open(stream_info) API call.
   Currently, stream_info consists of the following information: (i)
   the source IP address, (ii) the source port, (iii) the destination
   IP address, (iv) the destination port, and (v) the IP protocol
   number.
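   As a concrete (but purely illustrative) C rendering of this call,
   a CM implementation might declare something like the following.
   The struct layout and field names are assumptions of this sketch,
   not part of the specification:

      /* Illustrative only: field names and layout are not
         mandated by this document. */
      struct cm_stream_info {
          u32 src_addr;     /* source IP address */
          u16 src_port;     /* source transport-layer port */
          u32 dst_addr;     /* destination IP address */
          u16 dst_port;     /* destination transport-layer port */
          u8  protocol;     /* IP protocol number */
      };

      struct cm_stream_info si;
      /* ... fill in the five fields for the new stream ... */
      i32 sid = cm_open(si);
      if (sid == -1) {
          /* open failed: this stream cannot use the CM */
      }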
4.1 State maintenance

   1. Open: All applications MUST call cm_open(stream_info) before
      using the CM API.  This returns a handle, cm_streamid, for the
      application to use for all further CM API invocations for that
      stream.  If the returned cm_streamid is -1, then the cm_open()
      failed and that stream cannot use the CM.

      All other calls to the CM for a stream use the cm_streamid
      returned from the cm_open() call.

   2. Close: When a stream terminates, the application SHOULD invoke
      cm_close(cm_streamid) to inform the CM about the termination of
      the stream.

   3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of
      the path between sender and receiver.  Internally, this
      information SHOULD be obtained via path MTU discovery
      [Mogul90].  It MAY be statically configured in the absence of
      such a mechanism.

4.2 Data transmission

   The CM accommodates two types of adaptive senders, enabling
   applications to dynamically adapt their content based on
   prevailing network conditions, and supporting ALF-based
   applications.

   1. Callback-based transmission.  The callback-based transmission
      API puts the stream in firm control of deciding what to
      transmit at each point in time.  To achieve this, the CM does
      not buffer any data; instead, it gives streams the opportunity
      to adapt to unexpected network changes at the last possible
      instant.  Thus, streams can "pull out" and repacketize data
      upon learning about any rate change, which is hard to do once
      the data has been buffered.  The CM MUST implement a
      cm_request(i32 cm_streamid) call for streams wishing to send
      data in this style.  After some time, depending on the rate,
      the CM MUST invoke a callback using cmapp_send(), which is a
      grant for the stream to send up to PMTU bytes.  The
      callback-style API is the recommended choice for ALF-based
      streams.  Note that cm_request() does not take the number of
      bytes or MTU-sized units as an argument; each call to
      cm_request() is an implicit request for sending up to PMTU
      bytes.  The CM MAY provide an alternate interface,
      cm_request(int k), whose cmapp_send() callback grants the right
      to send up to k PMTU-sized segments.  Section 4.3 discusses the
      time duration for which the transmission grant is valid, while
      Section 5.2 describes how these requests are scheduled and
      callbacks made.  (A sketch of this style appears after this
      list.)

   2. Synchronous-style.  The above callback-based API accommodates a
      class of ALF streams that are "asynchronous."  Asynchronous
      transmitters do not transmit based on a periodic clock, but do
      so triggered by asynchronous events like file reads or captured
      frames.  On the other hand, there are many streams that are
      "synchronous" transmitters, which transmit periodically based
      on their own internal timers (e.g., an audio sender that sends
      at a constant sampling rate).  While CM callbacks could be
      configured to periodically interrupt such transmitters, the
      transmit loop of such applications is less affected if they
      retain their original timer-based loop.  In addition, it
      complicates the CM API to have a stream express the periodicity
      and granularity of its callbacks.  Thus, the CM MUST export an
      API that allows such streams to be informed of changes in rates
      using the cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
      callback function, where newrate is the new rate in bits per
      second for this stream, srtt is the current smoothed round-trip
      time estimate in microseconds, and rttdev is the smoothed
      linear deviation in the round-trip time estimate, calculated
      using the same algorithm as in TCP [Paxson00].  The newrate
      value reports an instantaneous rate calculated, for example, by
      taking the ratio of cwnd and srtt, and dividing by the fraction
      of that ratio allocated to the stream.  In response, the stream
      MUST adapt its packet size or change its timer interval to
      conform to (i.e., not exceed) the allowed rate.  Of course, it
      may choose not to use all of this rate.  Note that the CM is
      not on the data path of the actual transmission.  (A sketch of
      a synchronous sender appears at the end of this section.)
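   To make the callback-based style concrete, the following sketch
   shows an asynchronous sender.  Only cm_open() (earlier),
   cm_request(), cm_mtu(), and the cmapp_send() callback come from
   the API; the queueing and transmit helpers are hypothetical, and
   the exact cmapp_send() signature (e.g., a grant-expiration
   argument, see Section 4.3) is implementation-dependent.

      static i32 sid;                 /* handle from cm_open() */

      void app_have_data(void)
      {
          /* Buffer the data internally, then ask the CM for a
             grant; each cm_request() implicitly requests permission
             to send up to PMTU bytes. */
          app_enqueue_data();         /* hypothetical */
          cm_request(sid);
      }

      /* Called by the CM when this stream may send up to PMTU
         bytes.  The stream decides only now what to packetize, so
         it can adapt to the latest rate information (ALF). */
      void cmapp_send(void)
      {
          u32 pmtu = cm_mtu(sid);
          app_packetize_and_send(pmtu);   /* hypothetical */
          if (app_more_data_pending())    /* hypothetical */
              cm_request(sid);
      }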
   To avoid unnecessary cmapp_update() callbacks that the application
   will only ignore, the CM MUST provide a cm_thresh(float
   rate_downthresh, float rate_upthresh, float rtt_downthresh, float
   rtt_upthresh) function that a stream can use at any stage in its
   execution.  In response, the CM SHOULD invoke the callback only
   when the rate decreases to less than (rate_downthresh * lastrate)
   or increases to more than (rate_upthresh * lastrate), where
   lastrate is the rate last notified to the stream, or when the
   round-trip time changes correspondingly by the requisite
   thresholds.  This information is used as a hint by the CM, in the
   sense that cmapp_update() can be called even if these conditions
   are not met.

   The CM MUST implement a cm_query(i32 cm_streamid, i64* rate, i32*
   srtt, i32* rttdev) call to allow an application to query the
   current CM state.  This sets the rate variable to the current rate
   estimate in bits per second, the srtt variable to the current
   smoothed round-trip time estimate in microseconds, and rttdev to
   the mean linear deviation.  If the CM does not have valid
   estimates for the macroflow, it fills in negative values for the
   rate, srtt, and rttdev.  (Signed types are used so that these
   negative "invalid" values can be represented.)

   Note that a stream can use more than one of the above transmission
   APIs at the same time.  In particular, the knowledge of
   sustainable rate is useful for asynchronous streams as well as
   synchronous ones; e.g., an asynchronous Web server disseminating
   images using TCP may use cmapp_send() to schedule its
   transmissions and cmapp_update() to decide whether to send a
   low-resolution or high-resolution image.  A TCP implementation
   using the CM is described in Section 6.1.1, where the benefit of
   the cm_request() callback API for TCP will become apparent.

   The reader will notice that the basic CM API does not provide an
   interface for buffered congestion-controlled transmissions.  This
   is intentional, since this transmission mode can be implemented
   using the callback-based primitive.  Section 6.1.2 describes how
   congestion-controlled UDP sockets may be implemented using the CM
   API.
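   A minimal sketch of a synchronous (timer-driven) sender using
   these calls is shown below.  The encoding-selection helper and the
   threshold values are assumptions of the example, not part of the
   API.

      static i32 sid;               /* handle from cm_open() */
      static u64 cur_rate;          /* bits/s last notified to us */

      void app_start(void)
      {
          i64 rate; i32 srtt, rttdev;

          /* Only wake us up for roughly factor-of-two changes. */
          cm_thresh(0.5, 2.0, 0.5, 2.0);

          cm_query(sid, &rate, &srtt, &rttdev);
          if (rate < 0) {
              /* no valid estimate yet: use application defaults */
          } else {
              cur_rate = (u64)rate;
              app_pick_encoding(cur_rate);   /* hypothetical */
          }
          /* ... start the application's own periodic send timer ... */
      }

      /* CM callback: conform to the new rate by adapting packet
         size, timer interval, or content encoding. */
      void cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
      {
          cur_rate = newrate;
          app_pick_encoding(cur_rate);       /* hypothetical */
      }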
4.3 Application notification

   When a stream receives feedback from receivers, it MUST use
   cm_update(i32 cm_streamid, u32 nrecd, u32 nlost, u8 lossmode, i32
   rtt) to inform the CM about events such as congestion losses,
   successful receptions, type of loss (timeout event, Explicit
   Congestion Notification [Ramakrishnan98], etc.), and round-trip
   time samples.  The nrecd parameter indicates how many bytes were
   successfully received by the receiver since the last cm_update()
   call, while the nlost parameter indicates how many bytes were lost
   during the same time period.  The rtt value indicates the
   round-trip time measured during the transmission of these bytes.
   The rtt value MUST be set to -1 if no valid round-trip sample was
   obtained by the application.  The lossmode parameter provides an
   indicator of how a loss was detected.  A value of CM_NO_FEEDBACK
   indicates that the application has received no feedback for all
   its outstanding data, and is reporting this to the CM.  For
   example, a TCP that has experienced a timeout would use this
   parameter to inform the CM of this.  A value of CM_LOSS_FEEDBACK
   indicates that the application has experienced some loss, which it
   believes to be due to congestion, but not all outstanding data has
   been lost.  For example, a TCP segment loss detected using
   duplicate (selective) acknowledgements or other data-driven
   techniques fits this category.  A value of CM_EXPLICIT_CONGESTION
   indicates that the receiver echoed an explicit congestion
   notification message.  Finally, a value of CM_NO_CONGESTION
   indicates that no congestion-related loss has occurred.  The
   lossmode parameter MUST be reported as a bit-vector where the bits
   correspond to CM_NO_FEEDBACK, CM_LOSS_FEEDBACK,
   CM_EXPLICIT_CONGESTION, and CM_NO_CONGESTION.  Note that over
   links (paths) that experience losses for reasons other than
   congestion, an application SHOULD inform the CM of losses, with
   the CM_NO_CONGESTION bit set.

   cm_notify(stream_info, u32 nsent) MUST be called when data is
   transmitted from the host (e.g., in the IP output routine) to
   inform the CM that nsent bytes were just transmitted on a given
   stream.  It takes a stream_info rather than a cm_streamid because
   the layer that transmits the packet (e.g., IP) may not know the
   latter (see Section 6.1.1).  This call allows the CM to update its
   estimate of the number of outstanding bytes for the macroflow and
   for the stream.

   A cmapp_send() grant from the CM to an application is valid only
   for an expiration time, equal to the larger of the round-trip time
   and an implementation-dependent threshold communicated as an
   argument to the cmapp_send() callback function.  The application
   MUST NOT send data based on this callback after this time has
   expired.  Furthermore, if the application decides not to send data
   after receiving this callback, it SHOULD call
   cm_notify(stream_info, 0) to allow the CM to permit other streams
   in the macroflow to transmit data.  The CM congestion controller
   MUST be robust to applications failing to invoke
   cm_notify(stream_info, 0) correctly, or to applications that crash
   or disappear after having made a cm_request() call.

4.4 Querying

   If applications wish to learn about per-stream available bandwidth
   and round-trip time, they can use the CM's cm_query(i32
   cm_streamid, i64* rate, i32* srtt, i32* rttdev) call, which fills
   in the desired quantities.  If the CM does not have valid
   estimates for the macroflow, it fills in negative values for the
   rate, srtt, and rttdev.
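   The following sketch shows one way a sender might translate
   application-level receiver reports into cm_update() calls.  The
   report structure is hypothetical; only cm_update() and the
   lossmode constants come from the API.

      /* Hypothetical receiver report, sent by the receiving
         application back to the sender. */
      struct app_report {
          u32 bytes_ok;     /* bytes received since last report */
          u32 bytes_lost;   /* bytes inferred lost in that period */
          i32 rtt_us;       /* RTT sample in microseconds, or -1 */
          u8  saw_ecn;      /* receiver echoed explicit congestion */
      };

      void app_process_report(i32 sid, const struct app_report *r)
      {
          u8 mode;

          if (r->saw_ecn)
              mode = CM_EXPLICIT_CONGESTION;
          else if (r->bytes_lost > 0)
              mode = CM_LOSS_FEEDBACK;
          else
              mode = CM_NO_CONGESTION;

          cm_update(sid, r->bytes_ok, r->bytes_lost, mode,
                    r->rtt_us);
      }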
4.5 Sharing granularity

   One of the decisions the CM needs to make is the granularity at
   which a macroflow is constructed, by deciding which streams belong
   to the same macroflow and share congestion information.  The API
   provides two functions that allow applications to decide which of
   their streams ought to belong to the same macroflow.

   cm_getmacroflow(i32 cm_streamid) returns a unique i32 macroflow
   identifier.  cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
   sets the macroflow of the stream cm_streamid to cm_macroflowid.
   If the cm_macroflowid that is passed to cm_setmacroflow() is -1,
   then a new macroflow is constructed and its identifier is returned
   to the caller.  Each call to cm_setmacroflow() overrides the
   previous macroflow association for the stream, should one exist.

   The default suggested aggregation method is to aggregate by
   destination IP address; i.e., all streams to the same destination
   address are aggregated to a single macroflow by default.  The
   cm_getmacroflow() and cm_setmacroflow() calls can then be used to
   change this as needed.  We do note that there are some cases where
   this default may not be optimal, even over best-effort networks.
   For example, when a group of receivers is behind a NAT device, the
   sender will see them all as one address.  If the hosts behind the
   NAT are in fact connected over different bottleneck links, some of
   those hosts could see worse performance than before.  It is
   possible to detect such hosts using delay and loss estimates,
   although the specific mechanisms for doing so are beyond the scope
   of this document.

   The objective of this interface is to set up the sharing of
   groups, not the sharing policy (e.g., the relative weights of
   streams in a macroflow).  The latter requires the scheduler to
   provide an interface to set sharing policy.  However, because we
   want to support many different schedulers (each of which may need
   different information to set policy), we do not specify a complete
   API to the scheduler (but see Section 5.2).  A later guideline
   document is expected to describe a few simple schedulers (e.g.,
   weighted round-robin, hierarchical scheduling) and the API they
   export to provide relative prioritization.
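   For example, a sender that knows two of its streams to the same
   receiver share a bottleneck might group them explicitly; the
   stream identifiers are assumed to come from earlier cm_open()
   calls:

      /* Create a fresh macroflow for stream sid_a, then place
         sid_b in the same macroflow so the two streams share
         congestion state. */
      i32 mf = cm_setmacroflow(-1, sid_a);  /* -1 => new macroflow */
      cm_setmacroflow(mf, sid_b);

      /* cm_getmacroflow(sid_b) would now return mf. */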
5. CM internals

   This section describes the internal components of the CM.  It
   includes a congestion controller and a scheduler, with
   well-defined, abstract interfaces exported by them.

5.1 Congestion controller

   Associated with each macroflow is a congestion control algorithm;
   the collection of all these algorithms comprises the congestion
   controller of the CM.  The control algorithm decides when and how
   much data can be transmitted by a macroflow.  It uses application
   notifications (Section 4.3) from concurrent streams on the same
   macroflow to build up information about the congestion state of
   the network path used by the macroflow.

   The congestion controller MUST implement a "TCP-friendly"
   [Mahdavi98] congestion control algorithm.  Several macroflows MAY
   (and indeed, often will) use the same congestion control
   algorithm, but each macroflow maintains state about the network
   used by its streams.

   The congestion control module MUST implement the following
   abstract interfaces.  We emphasize that these are not directly
   visible to applications; they are within the context of a
   macroflow, and are different from the CM API functions of
   Section 4.

   - void query(u64 *rate, u32 *srtt, u32 *rttdev): This function
     returns the estimated rate (in bits per second) and smoothed
     round-trip time (in microseconds) for the macroflow.

   - void notify(u32 nsent): This function MUST be used to notify
     the congestion control module whenever data is sent by an
     application.  The nsent parameter indicates the number of bytes
     just sent by the application.

   - void update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode): This
     function is called whenever any of the CM streams associated
     with a macroflow identifies that data has reached the receiver
     or has been lost en route.  The nrecd parameter indicates the
     number of bytes that have just arrived at the receiver.  The
     nsent parameter is the sum of the number of bytes just received
     and the number of bytes identified as lost en route.  The rtt
     parameter is the estimated round-trip time in microseconds
     during the transfer.  The lossmode parameter provides an
     indicator of how a loss was detected (Section 4.3).

   Although these interfaces are not visible to applications, the
   congestion controller MUST implement these abstract interfaces to
   provide for modular interoperability with different
   separately-developed schedulers.

   The congestion control module MUST also call the associated
   scheduler's schedule() function (Section 5.2) when it believes
   that the current congestion state allows an MTU-sized packet to
   be sent.

5.2 Scheduler

   While it is the responsibility of the congestion control module to
   determine when and how much data can be transmitted, it is the
   responsibility of a macroflow's scheduler module to determine
   which of the streams should get the opportunity to transmit data.

   The scheduler MUST implement the following interfaces:

   - void schedule(u32 num_bytes): When the congestion control
     module determines that data can be sent, the schedule() routine
     MUST be called with no more than the number of bytes that can
     be sent.  In turn, the scheduler MAY call the cmapp_send()
     function that CM applications must provide.

   - float query_share(i32 cm_streamid): This call returns the
     specified stream's share of the total bandwidth available to
     the macroflow.  This call, combined with the query() call of
     the congestion controller, provides the information needed to
     satisfy an application's cm_query() request.

   - void notify(i32 cm_streamid, u32 nsent): This interface is used
     to notify the scheduler module whenever data is sent by a CM
     application.  The nsent parameter indicates the number of bytes
     just sent by the application.

   The scheduler MAY implement many additional interfaces.  As
   experience with CM schedulers increases, future documents may
   make additions and/or changes to some parts of the scheduler API.
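   One way to realize this modular separation is for each macroflow
   to hold function-pointer tables for its two modules.  This
   internal plumbing is purely illustrative; the struct names are
   assumptions of the sketch.

      /* Abstract interfaces of Sections 5.1 and 5.2 as C tables. */
      struct cm_cc_ops {                  /* congestion controller */
          void (*query)(u64 *rate, u32 *srtt, u32 *rttdev);
          void (*notify)(u32 nsent);
          void (*update)(u32 nsent, u32 nrecd, u32 rtt,
                         u32 lossmode);
      };

      struct cm_sched_ops {               /* scheduler */
          void  (*schedule)(u32 num_bytes);
          float (*query_share)(i32 cm_streamid);
          void  (*notify)(i32 cm_streamid, u32 nsent);
      };

      struct cm_macroflow {
          const struct cm_cc_ops    *cc;
          const struct cm_sched_ops *sched;
          /* ... per-macroflow congestion and scheduling state ... */
      };

   A macroflow can thus mix and match separately-developed
   controllers and schedulers, as long as both sides honor these
   interfaces.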
6. Examples

6.1 Example applications

   The following describes the possible use of the CM API by an
   asynchronous application (an implementation of a TCP sender) and a
   synchronous application (an audio server).  More details of these
   applications and CM implementation optimizations for efficient
   operation are described in [Andersen00].  We emphasize that the
   protocols in this section are examples and suggestions for
   implementation, rather than requirements of any conformant
   implementation.

6.1.1 TCP

   A TCP implementation using the CM MUST use the cmapp_send()
   callback API.  TCP only identifies which data it should send upon
   the arrival of an acknowledgement or expiration of a timer.  As a
   result, it requires tight control over when and if new data or
   retransmissions are sent.

   When TCP either connects to or accepts a connection from another
   host, it performs a cm_open() call to associate the TCP connection
   with a cm_streamid.

   Once a connection is established, the CM is used to control the
   transmission of outgoing data.  The CM eliminates the need for
   tracking and reacting to congestion in TCP, because the CM and its
   transmission API ensure proper congestion behavior.  Loss recovery
   is still performed by TCP based on fast retransmissions and
   recovery as well as timeouts.  In addition, TCP is modified to
   maintain its own outstanding window (tcp_ownd) estimate.  Whenever
   data segments are sent from its cmapp_send() callback, TCP updates
   its tcp_ownd value.  The tcp_ownd variable is also updated after
   each cm_update() call.  TCP also maintains a count of the number
   of outstanding segments (pkt_cnt).  At any time, TCP can calculate
   the average packet size (avg_pkt_size) as tcp_ownd/pkt_cnt.  The
   avg_pkt_size is used by TCP to help estimate the amount of
   outstanding data.  Note that this is not needed if the SACK option
   is used on the connection, since this information is explicitly
   available.

   The TCP output routines are modified as follows:

   1. All congestion window (cwnd) checks are removed.

   2. When application data is available, the TCP output routines
      perform all non-congestion checks (Nagle algorithm,
      receiver-advertised window check, etc.).  If these checks pass,
      the output routine queues the data and calls cm_request() for
      the stream.

   3. If incoming data or timers result in a loss being detected,
      the retransmission is also placed in a queue and cm_request()
      is called for the stream.

   4. The cmapp_send() callback for TCP is set to an output routine.
      If any retransmission is enqueued, the routine outputs the
      retransmission.  Otherwise, the routine outputs as much new
      data as the TCP connection state allows.  However, cmapp_send()
      never sends more than a single segment per call.  This routine
      also arranges for the other output computations to be done,
      such as header and options computations.

   The IP output routine on the host calls cm_notify() when the
   packets are actually sent out.  Because it does not know which
   cm_streamid is responsible for the packet, cm_notify() takes the
   stream_info as argument (see Section 4 for what the stream_info
   should contain).  Because cm_notify() reports the IP payload size,
   TCP keeps track of the total header size and incorporates these
   updates.

   The TCP input routines are modified as follows:

   1. RTT estimation is done as normal using either timestamps or
      Karn's algorithm.  Any rtt estimate that is generated is passed
      to the CM via the cm_update() call.

   2. All cwnd and slow start threshold (ssthresh) updates are
      removed.

   3. Upon the arrival of an ack for new data, TCP computes the
      value of in_flight (the amount of data in flight) as
      snd_max - ack - 1 (i.e., the maximum sequence number sent minus
      the current ack, minus 1).  TCP then calls cm_update(streamid,
      tcp_ownd - in_flight, 0, CM_NO_CONGESTION, rtt).

   4. Upon the arrival of a duplicate acknowledgement, TCP must
      check its dupack count (dup_acks) to determine its action.  If
      dup_acks < 3, TCP does nothing.  If dup_acks == 3, TCP assumes
      that a packet was lost and that at least 3 packets arrived to
      generate these duplicate acks.  Therefore, it calls
      cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size,
      CM_LOSS_FEEDBACK, rtt).  The average packet size is used since
      the acknowledgements do not indicate exactly how much data has
      reached the other end.  Most TCP implementations interpret a
      duplicate ACK as an indication that a full MSS has reached its
      destination.  Once a new ACK is received, these TCP sender
      implementations may resynchronize with the TCP receiver.  The
      CM API does not provide a mechanism for TCP to pass information
      from this resynchronization.  Therefore, TCP can only infer the
      arrival of an avg_pkt_size amount of data from each duplicate
      ack.  TCP also enqueues a retransmission of the lost segment
      and calls cm_request().  If dup_acks > 3, TCP assumes that a
      packet has reached the other end and caused this ack to be
      sent.  As a result, it calls cm_update(streamid, avg_pkt_size,
      avg_pkt_size, CM_NO_CONGESTION, rtt).  (A sketch of this step
      appears after this list.)

   5. Upon the arrival of a partial acknowledgment (one that does
      not exceed the highest segment transmitted at the time the loss
      occurred, as defined in [Floyd99b]), TCP assumes that a packet
      was lost and that the retransmitted packet has reached the
      recipient.  Therefore, it calls cm_update(streamid, 2 *
      avg_pkt_size, avg_pkt_size, CM_NO_CONGESTION, rtt).
      CM_NO_CONGESTION is used since the loss period has already been
      reported.  TCP also enqueues a retransmission of the lost
      segment and calls cm_request().

   When the TCP retransmission timer expires, the sender identifies
   that a segment has been lost and calls cm_update(streamid,
   avg_pkt_size, 0, CM_NO_FEEDBACK, 0) to signify that no feedback
   has been received from the receiver and that one segment is sure
   to have "left the pipe."  TCP also enqueues a retransmission of
   the lost segment and calls cm_request().
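   The duplicate-acknowledgement rules of step 4 can be written as
   the following sketch.  The tcp_enqueue_rexmit() helper stands in
   for the implementation's retransmission queueing; the cm_update()
   arguments are exactly those prescribed above.

      /* Step 4: duplicate ACK processing in a CM-based TCP. */
      void tcp_dupack(i32 sid, u32 dup_acks, u32 avg_pkt_size,
                      i32 rtt)
      {
          if (dup_acks < 3)
              return;                 /* not yet a loss indication */

          if (dup_acks == 3) {
              /* Assume one packet lost and at least 3 arrived. */
              cm_update(sid, 4 * avg_pkt_size, 3 * avg_pkt_size,
                        CM_LOSS_FEEDBACK, rtt);
              tcp_enqueue_rexmit();   /* hypothetical */
              cm_request(sid);
          } else {
              /* Each further dup ack: one more packet arrived. */
              cm_update(sid, avg_pkt_size, avg_pkt_size,
                        CM_NO_CONGESTION, rtt);
          }
      }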
6.1.2 Congestion-controlled UDP

   Congestion-controlled UDP sockets are a useful CM application,
   which we describe in the context of Berkeley sockets [Stevens94].
   They provide the same functionality as standard Berkeley UDP
   sockets, but instead of immediately sending the data from the
   kernel packet queue to lower layers for transmission, the buffered
   socket implementation makes calls to the API exported by the CM
   inside the kernel and gets callbacks from the CM.  When a CM UDP
   socket is created, it is bound to a particular stream.  Later,
   when data is added to the packet queue, cm_request() is called on
   the stream associated with the socket.  When the CM schedules this
   stream for transmission, it calls udp_ccappsend() in the UDP
   module.  This function transmits one MTU's worth of data from the
   packet queue and schedules the transmission of any remaining
   packets.  The in-kernel implementation of the CM UDP API SHOULD
   NOT require any additional data copies and SHOULD support all
   standard UDP options.  Modifying existing applications to use
   congestion-controlled UDP requires the implementation of a new
   socket option on the socket.  To work correctly, the sender MUST
   obtain feedback about congestion.  This can be done in at least
   two ways: (i) the UDP receiver application can provide feedback to
   the sender application, which will inform the CM of network
   conditions using cm_update(); or (ii) the UDP receiver
   implementation can provide feedback to the sending UDP.  Note that
   this latter alternative requires changes to the receiver's network
   stack, and the sender UDP cannot assume that all receivers support
   this option without explicit negotiation.
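   A minimal sketch of the kernel-side flow is shown below.  The
   socket structure, packet queue helpers, and ip_output() glue are
   hypothetical; only cm_request() and the udp_ccappsend() name come
   from the description above.

      /* Hypothetical kernel state for a CM UDP socket. */
      struct cm_udp_sock {
          i32 sid;                /* stream bound at socket creation */
          struct pkt_queue q;     /* buffered, not yet granted */
      };

      /* Called when the application writes to the socket. */
      void cm_udp_output(struct cm_udp_sock *so, struct pkt *p)
      {
          pkt_enqueue(&so->q, p); /* buffer instead of sending */
          cm_request(so->sid);    /* ask the CM for a grant */
      }

      /* The socket's cmapp_send(): transmit one MTU's worth and,
         if more data remains, request another grant. */
      void udp_ccappsend(struct cm_udp_sock *so)
      {
          struct pkt *p = pkt_dequeue(&so->q);
          if (p != NULL)
              ip_output(p);       /* IP output calls cm_notify() */
          if (!pkt_queue_empty(&so->q))
              cm_request(so->sid);
      }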
6.1.3 Audio server

   A typical audio application often has access to the audio content
   in a multitude of data rates and qualities.  The objective of the
   application is then to deliver the highest possible quality of
   audio (typically the highest data rate) to its clients.  The
   selection of which version of audio to transmit should be based on
   the current congestion state of the network.  In addition, the
   source will want audio delivered to its users at a consistent
   sampling rate.  As a result, it must send data at a regular rate,
   minimizing transmission delays and reducing buffering before
   playback.  To meet these requirements, this application can use
   the synchronous sender API (Section 4.2).

   When the source first starts, it uses the cm_query() call to get
   an initial estimate of network bandwidth and delay.  If some other
   streams on that macroflow have already been active, then it gets a
   valid initial estimate; otherwise, it gets negative values, which
   it ignores.  It then chooses an encoding that does not exceed
   these estimates (or, in the case of an invalid estimate, uses
   application-specific initial values) and begins transmitting data.
   The application also implements the cmapp_update() callback.  When
   the CM determines that network characteristics have changed, it
   calls the application's cmapp_update() function and passes it a
   new rate and round-trip time estimate.  The application MUST
   change its choice of audio encoding to ensure that it does not
   exceed these new estimates.

   To use the CM, the application MUST incorporate feedback from the
   receiver.  In this example, it must periodically (typically once
   or twice per round-trip time) determine how many of its packets
   arrived at the receiver.  When the source gets this feedback, it
   MUST use cm_update() to inform the CM of this new information.
   This results in the CM updating ownd, and may result in the CM
   changing its estimates and calling cmapp_update() for the streams
   of the macroflow.

6.2 Example congestion control module

   To illustrate the responsibilities of a congestion control module,
   the following describes some of the actions of a simple TCP-like
   congestion control module that implements Additive Increase
   Multiplicative Decrease congestion control (AIMD_CC):

   - query(): AIMD_CC returns the current congestion window (cwnd)
     divided by the smoothed rtt (srtt) as its bandwidth estimate.
     It returns the smoothed rtt estimate as srtt.

   - notify(): AIMD_CC adds the number of bytes sent to its
     outstanding data window (ownd).

   - update(): AIMD_CC subtracts nsent from ownd.  If the value of
     rtt is non-zero, AIMD_CC updates srtt using the TCP srtt
     calculation.  If the update indicates that data has been lost,
     AIMD_CC sets cwnd to 1 MTU if the loss_mode is CM_NO_FEEDBACK,
     and to cwnd/2 (with a minimum of 1 MTU) if the loss_mode is
     CM_LOSS_FEEDBACK or CM_EXPLICIT_CONGESTION.  AIMD_CC also sets
     its internal ssthresh variable to cwnd/2.  If no loss has
     occurred, AIMD_CC mimics TCP slow start and linear growth
     modes.  It increments cwnd by nsent when cwnd < ssthresh
     (bounded by a maximum of ssthresh - cwnd) and by
     nsent * MTU/cwnd when cwnd >= ssthresh.  (See the sketch after
     this list.)

   - When cwnd or ownd are updated and indicate that at least one
     MTU may be transmitted, AIMD_CC calls the CM to schedule a
     transmission.
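   A sketch of AIMD_CC's update() under the rules above follows.
   The variable declarations and the update_srtt() helper are
   assumptions of the example, and MTU stands for the macroflow's
   PMTU.

      static u32 cwnd, ownd, ssthresh;   /* module state, in bytes */

      void aimd_update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode)
      {
          ownd -= nsent;
          if (rtt != 0)
              update_srtt(rtt);       /* TCP-style smoothing */

          if (lossmode & CM_NO_FEEDBACK) {
              ssthresh = cwnd / 2;
              cwnd = MTU;             /* as after a TCP timeout */
          } else if (lossmode & (CM_LOSS_FEEDBACK |
                                 CM_EXPLICIT_CONGESTION)) {
              ssthresh = cwnd / 2;
              cwnd = cwnd / 2;        /* multiplicative decrease */
              if (cwnd < MTU)
                  cwnd = MTU;
          } else {                    /* CM_NO_CONGESTION */
              if (cwnd < ssthresh) {  /* slow start */
                  u32 inc = nsent;
                  if (inc > ssthresh - cwnd)
                      inc = ssthresh - cwnd;
                  cwnd += inc;
              } else {                /* linear growth */
                  cwnd += nsent * MTU / cwnd;
              }
          }

          if (cwnd >= ownd + MTU)
              /* ask the scheduler to grant up to cwnd - ownd */
              schedule(cwnd - ownd);
      }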
6.3 Example scheduler module

   To clarify the responsibilities of a scheduler module, the
   following describes some of the actions of a simple round-robin
   scheduler module (RR_sched):

   - schedule(): RR_sched schedules as many streams as possible in
     round-robin fashion (see the sketch after this list).

   - query_share(): RR_sched returns 1/(number of streams in the
     macroflow).

   - notify(): RR_sched does nothing.  Round-robin scheduling is not
     affected by the amount of data sent.
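   A possible schedule() for RR_sched is sketched below; the stream
   list traversal and the helper that invokes a stream's
   cmapp_send() callback are hypothetical internals.

      /* Walk the macroflow's streams in round-robin order,
         granting one PMTU-sized transmission at a time. */
      void rr_schedule(u32 num_bytes)
      {
          while (num_bytes >= PMTU) {
              struct stream *s = next_pending_stream();  /* hypothetical */
              if (s == NULL)
                  break;             /* no stream has a request */
              invoke_cmapp_send(s);  /* grant: stream's callback */
              num_bytes -= PMTU;
          }
      }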
7. Security considerations

   The CM provides many of the same services that the congestion
   control in TCP provides.  As such, it is vulnerable to many of the
   same security problems.  For example, incorrect reports of losses
   and transmissions will give the CM an inaccurate picture of the
   network's congestion state.  By giving the CM a high estimate of
   congestion, an attacker can degrade the performance observed by
   applications.  The more dangerous form of attack is giving the CM
   a low estimate of congestion.  This would cause the CM to be
   overly aggressive and allow data to be sent much more quickly than
   sound congestion control policies would allow.  [Touch97]
   describes the security problems that arise with congestion
   information sharing in more detail.

8. References

   [Allman99] Allman, M. and Paxson, V., "TCP Congestion Control,"
      RFC-2581, April 1999.

   [Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S.,
      and Balakrishnan, H., "System Support for Bandwidth Management
      and Content Adaptation in Internet Applications," Proc. 4th
      Symp. on Operating Systems Design and Implementation, San
      Diego, CA, October 2000.  Available from
      http://nms.lcs.mit.edu/papers/cm-osdi2000.html

   [Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S.,
      Stemm, M., and Katz, R., "TCP Behavior of a Busy Web Server:
      Analysis and Improvements," Proc. IEEE INFOCOM, San Francisco,
      CA, March 1998.

   [Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An
      Integrated Congestion Management Architecture for Internet
      Hosts," Proc. ACM SIGCOMM, Cambridge, MA, September 1999.

   [Bradner96] Bradner, S., "The Internet Standards Process ---
      Revision 3," BCP 9, RFC-2026, October 1996.

   [Bradner97] Bradner, S., "Key words for use in RFCs to Indicate
      Requirement Levels," BCP 14, RFC-2119, March 1997.

   [Clark90] Clark, D. and Tennenhouse, D., "Architectural
      Considerations for a New Generation of Protocols," Proc. ACM
      SIGCOMM, Philadelphia, PA, September 1990.

   [Eggert00] Eggert, L., Heidemann, J., and Touch, J., "Effects of
      Ensemble TCP," ACM Computer Comm. Review, January 2000.

   [Floyd99a] Floyd, S. and Fall, K., "Promoting the Use of
      End-to-End Congestion Control in the Internet," IEEE/ACM
      Trans. on Networking, 7(4), August 1999, pp. 458-472.

   [Floyd99b] Floyd, S. and Henderson, T., "The NewReno Modification
      to TCP's Fast Recovery Algorithm," RFC-2582, April 1999.
      (Experimental.)

   [Jacobson88] Jacobson, V., "Congestion Avoidance and Control,"
      Proc. ACM SIGCOMM, Stanford, CA, August 1988.

   [Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly
      Website," http://www.psc.edu/networking/tcp_friendly.html

   [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery,"
      RFC-1191, November 1990.

   [Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of
      Web Data Transport," PhD thesis, Univ. of California,
      Berkeley, December 1998.

   [Paxson00] Paxson, V. and Allman, M., "Computing TCP's
      Retransmission Timer," Internet Draft
      draft-paxson-tcp-rto-01.txt, April 2000.  (Expires October
      2000.)

   [Postel81] Postel, J. (ed.), "Transmission Control Protocol,"
      RFC-793, September 1981.

   [Ramakrishnan98] Ramakrishnan, K. and Floyd, S., "A Proposal to
      Add Explicit Congestion Notification (ECN) to IP," RFC-2481,
      January 1999.  (Experimental.)

   [Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1.
      Addison-Wesley, Reading, MA, 1994.

   [Touch97] Touch, J., "TCP Control Block Interdependence,"
      RFC-2140, April 1997.  (Informational.)

9. Acknowledgments

   We thank David Andersen, Deepak Bansal, and Dorothy Curtis for
   their work on the CM design and implementation.  We thank Vern
   Paxson for his detailed comments and patience, and Sally Floyd,
   Mark Handley, and Steven McCanne for useful feedback on the CM
   architecture.

10. Authors' addresses

   Hari Balakrishnan
   Laboratory for Computer Science
   200 Technology Square
   Massachusetts Institute of Technology
   Cambridge, MA 02139
   Email: hari@lcs.mit.edu
   Web: http://nms.lcs.mit.edu/~hari/

   Srinivasan Seshan
   School of Computer Science
   Carnegie Mellon University
   5000 Forbes Ave.
   Pittsburgh, PA 15213
   Email: srini@cmu.edu
   Web: http://www.cs.cmu.edu/~srini/

Full Copyright Statement

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared,
   copied, published and distributed, in whole or in part, without
   restriction of any kind, provided that the above copyright notice
   and this paragraph are included on all such copies and derivative
   works.  However, this document itself may not be modified in any
   way, such as by removing the copyright notice or references to
   the Internet Society or other Internet organizations, except as
   needed for the purpose of developing Internet standards in which
   case the procedures for copyrights defined in the Internet
   Standards process must be followed, or as required to translate
   it into languages other than English.