idnits 2.17.1 draft-falk-xcp-spec-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 17. -- Found old boilerplate from RFC 3978, Section 5.5 on line 1479. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1456. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1463. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1469. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: The X, Delta_Throughput, and RTT fields are unused and SHOULD be set to zero. A router MUST not perform any processing on a minimal format header. This format is intended for use in empty ACK packets, to return congestion information from receiver to sender. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 13, 2005) is 6763 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'TBD' is mentioned on line 338, but not defined == Missing Reference: 'IP2' is mentioned on line 493, but not defined == Missing Reference: 'XCP' is mentioned on line 496, but not defined == Missing Reference: 'IP1' is mentioned on line 495, but not defined == Missing Reference: 'TCP' is mentioned on line 497, but not defined == Unused Reference: 'Padhye98' is defined on line 1351, but no explicit reference was found in the text == Outdated reference: A later version (-13) exists of draft-ietf-dccp-spec-05 -- Possible downref: Non-RFC (?) normative reference: ref. 'Jacobson88' -- Possible downref: Non-RFC (?) normative reference: ref. 'KHR02' -- Possible downref: Non-RFC (?) normative reference: ref. 'Kapoor05' -- Possible downref: Non-RFC (?) normative reference: ref. 'Katabi03' -- Possible downref: Non-RFC (?) normative reference: ref. 'Katabi04' -- Possible downref: Non-RFC (?) normative reference: ref. 'Padhye98' ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 813 (Obsoleted by RFC 7805) ** Obsolete normative reference: RFC 2309 (Obsoleted by RFC 7567) ** Obsolete normative reference: RFC 2401 (Obsoleted by RFC 4301) ** Obsolete normative reference: RFC 2402 (Obsoleted by RFC 4302, RFC 4305) ** Obsolete normative reference: RFC 2406 (Obsoleted by RFC 4303, RFC 4305) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Obsolete normative reference: RFC 2960 (Obsoleted by RFC 4960) ** Downref: Normative reference to an Informational RFC: RFC 3135 -- Possible downref: Non-RFC (?) normative reference: ref. 'Zhang05' Summary: 13 errors (**), 0 flaws (~~), 10 warnings (==), 14 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Falk 3 Internet-Draft Y. Pryadkin 4 Expires: April 16, 2006 ISI 5 D. Katabi 6 MIT 7 October 13, 2005 9 Specification for the Explicit Control Protocol (XCP) 10 draft-falk-xcp-spec-01.txt 12 Status of this Memo 14 By submitting this Internet-Draft, each author represents that any 15 applicable patent or other IPR claims of which he or she is aware 16 have been or will be disclosed, and any of which he or she becomes 17 aware will be disclosed, in accordance with Section 6 of BCP 79. 19 Internet-Drafts are working documents of the Internet Engineering 20 Task Force (IETF), its areas, and its working groups. Note that 21 other groups may also distribute working documents as Internet- 22 Drafts. 24 Internet-Drafts are draft documents valid for a maximum of six months 25 and may be updated, replaced, or obsoleted by other documents at any 26 time. It is inappropriate to use Internet-Drafts as reference 27 material or to cite them other than as "work in progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt. 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 This Internet-Draft will expire on April 16, 2006. 37 Copyright Notice 39 Copyright (C) The Internet Society (2005). 41 Abstract 43 This document contains an initial specification for the Explicit 44 Control Protocol (XCP), an experimental congestion control protocol. 45 XCP is designed to deliver the highest possible end-to-end throughput 46 over a broad range of network infrastructure, including links with 47 very large bandwidth-delay products, which are not well served by the 48 current control algorithms. XCP is potentially applicable to any 49 transport protocol, although initial testing has applied it to TCP in 50 particular. XCP routers are required to perform a small calculation 51 on congestion state carried in each data packet. XCP routers also 52 periodically recalculate the local parameters required to provide 53 fairness. On the other hand, there is no per-flow congestion state 54 in XCP routers. 56 This version specification (-01) includes protocol changes that move 57 per-packet divisions from the router to the sender. 59 Table of Contents 61 1. Changes Since Last Version . . . . . . . . . . . . . . . . . . 3 62 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 63 2.1. XCP Protocol Overview . . . . . . . . . . . . . . . . . . 5 64 3. The Congestion Header . . . . . . . . . . . . . . . . . . . . 9 65 3.1. Header placement . . . . . . . . . . . . . . . . . . . . . 9 66 3.2. Congestion Header Formats . . . . . . . . . . . . . . . . 9 67 3.3. IPsec issues . . . . . . . . . . . . . . . . . . . . . . . 12 68 3.4. NAT, middlebox issues . . . . . . . . . . . . . . . . . . 12 69 3.5. MPLS/Tunneling Issues . . . . . . . . . . . . . . . . . . 13 70 4. XCP Functions . . . . . . . . . . . . . . . . . . . . . . . . 14 71 4.1. End-System Functions . . . . . . . . . . . . . . . . . . . 14 72 4.1.1. Sending Packets . . . . . . . . . . . . . . . . . . . 14 73 4.1.2. Processing Feedback at the Receiver . . . . . . . . . 16 74 4.1.3. Processing Feedback at the Sender . . . . . . . . . . 16 75 4.2. Router functions . . . . . . . . . . . . . . . . . . . . . 18 76 4.2.1. Calculations Upon Packet Arrival . . . . . . . . . . . 19 77 4.2.2. Calculations Upon Control Interval Timeout . . . . . . 20 78 4.2.3. Calculations Upon Packet Departure . . . . . . . . . . 22 79 4.2.4. The Control Interval . . . . . . . . . . . . . . . . . 25 80 4.2.5. Obtaining the Persistent Queue . . . . . . . . . . . . 25 81 5. Unresolved Issues . . . . . . . . . . . . . . . . . . . . . . 28 82 5.1. XCP With Non-XCP Routers . . . . . . . . . . . . . . . . . 28 83 5.2. Variable Rate Links . . . . . . . . . . . . . . . . . . . 29 84 5.3. XCP as a TCP PEP . . . . . . . . . . . . . . . . . . . . . 29 85 5.4. Sharing resources between XCP and TCP . . . . . . . . . . 30 86 5.5. A Generalized Router Model . . . . . . . . . . . . . . . . 30 87 5.6. Host back-to-back operation . . . . . . . . . . . . . . . 30 88 6. Security Considerations . . . . . . . . . . . . . . . . . . . 32 89 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 90 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 35 91 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 35 92 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38 93 Intellectual Property and Copyright Statements . . . . . . . . . . 39 95 1. Changes Since Last Version 97 Changes between version -00 and -01 99 o Updated protocol to reflect movement of the per-packet division 100 from the router to the end-system. 102 o Incremented version number (to 0x02) to reflect change in packet 103 header format. 105 o Reordered the Protocol, Length, Version, and Format fields in the 106 congestion header (in anticipation of future support of IPv6 107 extension headers). 109 o Routers now MUST (from SHOULD) ignore fields other than 110 reverse_feedback when minimal header is used. 112 o No longer ignore packets with RTT set to zero. Senders with 113 coarse-grained timer may generate these if the RTT is less than 114 the timer precision. 116 o Added 'Open Issues' section on variable-rate links. 118 2. Introduction 120 The Van Jacobson congestion control algorithms [Jacobson88] [RFC2581] 121 are used by the Internet transport protocols TCP [RFC0793] and SCTP 122 [RFC2960]. The Jacobson algorithms are fundamental to stable and 123 efficient Internet operation, and they have been highly successful 124 over many orders of magnitude of Internet bandwidth and delay. 126 However, the Jacobson congestion control algorithms have begun to 127 reach their limits. Gigabit-per-second file transfers, lossy 128 wireless links, and high latency connections are all driving current 129 TCP congestion control outside of its natural operating regime. The 130 resulting performance problems are of great concern for important 131 network applications. 133 The original Jacobson algorithm was a purely end-to-end solution, 134 requiring no congestion-related state in routers. More recent 135 modifications have backed off from this purity. Active queue 136 management (AQM) in routers (e.g., RED) [RFC2309] improves 137 performance by keeping queues small, while Explicit Congestion 138 Notification (ECN) [RFC3168] passes one bit of congestion information 139 back to senders. These measures do improve performance, but there is 140 a limit to how much can be accomplished without more information from 141 routers. The requirement of extreme scalability together with 142 robustness has been a difficult hurdle to accelerating information 143 flow. 145 This document concerns the Explicit Control Protocol (XCP) developed 146 by Dina Katabi of MIT [KHR02]. XCP represents a significant advance 147 in Internet congestion control: it extracts congestion information 148 directly from routers, without any per-flow state. XCP should be 149 able to deliver the highest possible application performance over a 150 broad range of network infrastructure, including extremely high speed 151 and very high delay links that are not well served by the current 152 control algorithms. XCP achieves fairness, maximum link utilization, 153 and efficient use of bandwidth. XCP is novel in separating the 154 efficiency and fairness policies of congestion control, enabling 155 routers to put available capacity to work quickly while 156 conservatively managing the allocation of capacity to flows. XCP is 157 potentially applicable to any transport protocol, although initial 158 testing has applied it to TCP in particular. 160 XCP's scalability is built upon the principle of carrying per-flow 161 congestion state in packets. XCP packets carry a congestion header 162 through which the sender requests a desired throughput. Routers make 163 a fair per-flow bandwidth allocation without maintaining any per-flow 164 state. This enables the sender to learn the bottleneck router's 165 allocation to a particular flow in a single round trip. 167 The gains of XCP come with some pain. XCP is more difficult to 168 deploy than other proposed Internet congestion control improvements, 169 since it requires changes in the routers as well as in end systems. 170 It will be necessary to develop and test XCP with real user traffic 171 and in real environments, to gain experience with real router and 172 host implementations and to collect data on performance. Providing 173 specifications is an important step towards enabling experimentation 174 which, in turn, will lead to deployment. XCP deployment issues will 175 be addressed in more detail in a subsequent version of this document. 177 This document contains an initial specification of the protocol and 178 algorithms used by XCP, as an experimental protocol. The XCP 179 algorithms defined here are based upon Katabi's SIGCOMM paper 180 [KHR02], her MIT thesis [Katabi03], and her ns simulation. However, 181 this document includes algorithmic modifications and clarifications 182 that have arisen from early experience with implementing and testing 183 XCP at the USC Information Sciences Institute. (See 184 http://www.isi.edu/isi-xcp for our project page.) This document is 185 intended to provide a baseline for further engineering and testing of 186 XCP. 188 This document is organized as follows. The remainder of Section 2 189 provides an overview of the XCP protocol, Section 3 discusses the 190 format of the congestion header, Section 4 describes the functions 191 occurring in the end-systems and routers, and Section 5 lists some 192 unresolved issues. 194 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 195 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 196 document are to be interpreted as described in [RFC2119]. 198 2.1. XCP Protocol Overview 200 The participants in the XCP protocol include sender hosts, receiver 201 hosts, and intermediate nodes in which queuing occurs along the path 202 from the sender to the receiver. The intermediate nodes are 203 generally routers, but link-layer switches may also contain packet 204 queues. 206 XCP supplies feedback from the network to the sender on the maximum 207 rate (throughput) for injecting data into the network. XCP feedback 208 is acquired through the use of a congestion header on each packet 209 that is sent. Routers along the path may update the congestion 210 header as it moves from the sender to the receiver. The receiver 211 copies the network feedback into outbound packets of the same flow. 212 An end-system may function as both a sender and a receiver in the 213 case of a bidirectional flow. 215 The figure below illustrates four entities participating in XCP. The 216 sender initializes the congestion header, two routers along the way 217 may update it, and the receiver copies the feedback from the network 218 into a returning packet in the same flow. 220 +----------+ +--------+ +--------+ +----------+ 221 | |------->| Router |---->| Router |------->| | 222 | Sender | +--------+ +--------+ | Receiver | 223 | |<----------------------------------------| | 224 +----------+ +----------+ 226 The congestion header contains four pieces of data: 228 o RTT: Set by the sender to its current estimate of the round-trip 229 time. 231 o X: Set by the sender to its current estimate of inter-packet time 232 gap. This quantity is used in place of Throughput in earlier 233 drafts of these specifications to avoid per-packet division in the 234 router. See Section 4.1.1 for more about X. 236 o Delta_Throughput: Initialized to the amount which the sender would 237 like to change (increase or decrease) its throughput, and updated 238 by the routers along the path to be the network's allocated change 239 in throughput. This value will be a negative number if an XCP- 240 capable queue along the path wants the sender to slow down. 242 o Reverse_Feedback: When a data packet reaches the receiver, its 243 Delta_Throughput value is returned to the sender in the 244 Reverse_Feedback field of a congestion header of a returning 245 packet (e.g., in an ACK packet). 247 An XCP-capable router calculates a fair capacity re-allocation for 248 each packet. A flow only receives this re-allocation from a 249 particular router if that router is the bottleneck for that flow. 250 For XCP, a bottleneck router is defined to be a router that has 251 insufficient capacity to accept a flow's current or desired 252 throughput. 254 Based on current conditions, an XCP-capable router generates positive 255 or negative feedback each time a packet arrives and compares it the 256 packet's Delta_Throughput field. Delta_Throughput is reduced if the 257 current value exceeds this calculated feedback allocation. Each XCP- 258 capable router along the path from sender to receiver performs this 259 processing. A packet reaching the receiver therefore contains the 260 minimal feedback allocation from the network, i.e., the capacity 261 reallocation from the bottleneck router. 263 The receiver copies this value into the Reverse_Feedback field of a 264 returning packet in the same flow (e.g., an ACK or DATA-ACK for TCP) 265 and, in one round-trip, the sender learns the flow's per-packet 266 throughput allocation. 268 The sender uses the reverse feedback information to adjust its 269 allowed sending rate. For the transport protocol TCP [RFC0793], for 270 example, this may be accomplished by adjusting the congestion window, 271 or cwnd, that limits the amount of unacknowledged data in the 272 network. (Cwnd is defined for Van Jacobson congestion control in 273 [RFC2581].) 275 Additionally, it is possible to use XCP's explicit notification of 276 the bottleneck capacity allocation for other types of applications. 277 For example, XCP may be implemented to support multimedia streams 278 over DCCP [I-D.ietf-dccp-spec] or other transport protocols. 280 An XCP-capable router maintains two control algorithms on each output 281 port: a congestion controller and a fairness controller. The 282 congestion controller is responsible for making maximal use of the 283 outbound link while at the same time draining any standing queues. 284 The fairness controller is responsible for fairly allocating 285 bandwidth to flows sharing the link. These two algorithms are 286 executed only periodically, at an interval known as the "control 287 interval". The algorithms defined below set this interval to the RTT 288 averaged across all flows. Further work on choosing an appropriate 289 value for the control interval may be required. 291 Each port-specific instance of XCP is independent of every other, and 292 references to an "XCP router" should be considered an instance of XCP 293 running on a particular output port. 295 Actually, it is an oversimplification to say that congestion in 296 routers only appears at output ports. Routers are complex devices 297 which may experience resource contention in many forms and 298 locations. Correctly expressing congestion which doesn't occur at 299 the router output port is a topic for further study. Even so, it 300 is important to correctly identify where the queue will build up 301 in a router. The XCP algorithm will drain a standing queue; 302 however it is necessary to measure that queue in order for correct 303 operation. For more discussion of this issue see Section 304 Section 5.5. 306 More context, analysis, and background can be found in [KHR02] and 307 [Katabi03]. 309 3. The Congestion Header 311 The congestion control data required for XCP are placed in a new 312 header which is called the Congestion Header. 314 3.1. Header placement 316 The Congestion Header is located between the IP and transport 317 headers. This is consistent with the fact that XCP is neither hop- 318 by-hop communication -- as in IP -- nor end-to-end communication -- 319 as in TCP or UDP -- but is rather end-system-to-network 320 communication. It should allow a router to "easily" locate the 321 congestion header on a packet with no IP options. 323 Other choices were considered for header location. For example, 324 making the Congestion Header a TCP option was suggested. This 325 made sense as the congestion information is related to the 326 transport protocol. However, it requires that routers be aware of 327 the header format for every new transport protocol that might ever 328 use XCP. This seemed like an unreasonable burden to place on the 329 routers and would impede deployment of new transport protocols 330 and/or XCP. 332 It has also been suggested that the Congestion Header be an IPv4- 333 style option. While this proposal is transport protocol 334 independent, it would generally force XCP packets to take the slow 335 path on non-XCP routers along a path. This could severely impact 336 performance. 338 The XCP protocol uses protocol number [TBD], assigned by IANA. IP 339 packets containing XCP headers will use this protocol number in the 340 IP header's Protocol field [RFC0791] to indicate to routers and end- 341 systems that an XCP congestion header follows the IP header. 343 3.2. Congestion Header Formats 345 This section defines the XCP Congestion Header formats. This holds 346 for IPv4; the corresponding header for IPv6 is a subject for further 347 study. 349 XCP-capable IPv4 packets carry the following Congestion Header: 351 0 1 2 3 352 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 353 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 354 | Protocol | Length |Version|Format | unused | 355 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 356 | RTT | 357 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 358 | X | 359 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 360 | Delta_Throughput | 361 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 362 | Reverse_Feedback | 363 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 365 Protocol: 8 bits 367 This field indicates the next-level protocol used in the data 368 portion of the packet. The values for various protocols are 369 specified by IANA. 371 Length: 8 bits 373 This field indicates the length of the congestion header, measured 374 in bytes. Length in this version of XCP will always be 40 bytes, 375 or 0x28. 377 Version: 4 bits 379 This field indicates the version of XCP that is in use. The 380 version of XCP described in this document corresponds to a value 381 of 0x02. Future values will be assigned by IANA 382 (http://www.iana.org). See note in the IANA Considerations 383 section. 385 Format: 4 bits 387 This field contains a code to indicate the congestion header 388 format. The current format codes are defined below. 390 +-----------------+------+ 391 | Format | Code | 392 +-----------------+------+ 393 | Standard Format | 0x1 | 394 | | | 395 | Minimal Format | 0x2 | 396 +-----------------+------+ 398 Table 1 400 Standard Format 402 The standard format includes the X, Delta_Throughput, and RTT 403 fields shown above. This format is used by XCP in data packets 404 flowing from sender to receiver. 406 Minimal Format 408 The X, Delta_Throughput, and RTT fields are unused and SHOULD be 409 set to zero. A router MUST not perform any processing on a 410 minimal format header. This format is intended for use in empty 411 ACK packets, to return congestion information from receiver to 412 sender. 414 Other formats may be defined in the future, to define different 415 representation formats for the X, Delta_Throughput, and/or RTT 416 fields, for example. The corresponding format values format values 417 will be assigned by IANA. See IANA Considerations section below. 419 unused: 8 bits 421 This field is unused and MUST be set to zero in this version of 422 XCP. 424 RTT: 32bits 426 This field indicates the round-trip time measured by the sender, 427 in fixed point format with 28 bits after the binary point, in 428 seconds. Thus, the value of 1 corresponds to 2^(-28) of a second. 429 This field is an unsigned integer. 431 The minimum value expressible in this field is 0s. A value of 432 zero in the RTT field is legal and indicates that the sender 433 either does not yet know the round-trip time, or operates at a 434 coarse-grained timer granularity. The maximum value expressible 435 in this field is 15.9999999963 seconds, in steps of 3.7ns. 437 X: 32 bits 439 This field indicates the inter-packet time of the flow as 440 calculated by the sender, in fixed point format with 28 bits after 441 the binary point, in seconds. This is the same format as used by 442 the RTT field. 444 Delta_Throughput: 32 bits 446 This field indicates the desired or allocated change in 447 throughput. It is set by the sender to indicate the amount by 448 which the sender would like to adjust its throughput, and it may 449 be subsequently reduced by routers along the path (See 450 Section 4.2). It is measured in bytes per second and is a signed, 451 2's complement value. 453 The minimum throughput change expressible in this field is -17 454 Gbps. The maximum value expressible in this field is 17 Gbps, in 455 steps of 8 bits per second. 457 Reverse_Feedback: 32bits 459 This field indicates the value of Delta_Throughput received by the 460 data receiver. The receiver copies the field Delta_Throughput 461 into the Reverse_Feedback field of the next outgoing packet in the 462 same flow. See Section 4.1.2. 464 3.3. IPsec issues 466 IPsec [RFC2401] must be slightly modified to accommodate use of XCP. 467 The specifications for the IP Authenticated Header (AH) [RFC2402] and 468 IP Encapsulating Security Payload (ESP) [RFC2406] state that the 469 IPsec headers immediately follow the IP header. This would be a 470 problem for XCP in that a) it would make the XCP headers harder to 471 find by the routers, b) ESP encryption would make it impossible for 472 routers along the path to read and write congestion header 473 information and c) AH authentication would fail if any router along 474 the path had modified a congestion header. Therefore, the XCP 475 congestion header should immediately follow the IP header and precede 476 any AH. 478 3.4. NAT, middlebox issues 480 Middleboxes that attempt to perform actions invisibly on flows must 481 preserve the congestion header. Middleboxes that terminate the TCP 482 connection should terminate the XCP connection. Middleboxes that 483 insert queues into the forwarding path should participate in XCP. 485 3.5. MPLS/Tunneling Issues 487 When a flow enters an IP tunnel [RFC2003], IPsec ESP tunnel 488 [RFC2406], or MPLS [RFC3031], network ingress point, the congestion 489 header should be replicated on the "front" of the outer IP header. 490 For example, when a packet enters an IP tunnel, the following 491 transformation should occur: 493 [IP2] \_ outer header 494 __--> [XCP] / 495 [IP1]/ [IP1] \ 496 [XCP] ---> [XCP] |_ inner header 497 [TCP] [TCP] | 498 ... ... / 500 Here the XCP header appended to the front of the outer header is 501 copied from the inner header, with the appropriate change to the 502 Protocol field to indicate that the next protocol is IP. 504 When the packet exits the tunnel, the congestion header, which may 505 have been modified by routers along the tunneled path, is copied from 506 the outer header into the inner header. 508 4. XCP Functions 510 XCP is concerned with the sender and receiver end-systems and the 511 routers along the packet's path. This section describes the XCP- 512 related algorithms to be implemented in each of these entities. The 513 specific case of TCP as transport protocol is also described. The 514 emphasis in this section is on explicit and detailed definition of 515 the XCP algorithms. The theoretical derivation and analysis of the 516 algorithms can found in [Katabi03]. 518 4.1. End-System Functions 520 4.1.1. Sending Packets 522 The sender is responsible for maintaining five parameters, or their 523 equivalent: (1) a desired throughput value, (2) a current estimate of 524 the actual throughput, (3) the maximum throughput allowed by XCP, (4) 525 a current estimate of inter-packet time, denoted X, and (5) a current 526 estimate of the round-trip time 528 A sender may choose to use any reasonable value, i.e., any achievable 529 value, for desired throughput. An application may supply this value 530 via an API, or it might be the speed of the local interface. 532 When sending a packet, the sender fills in the fields of the 533 congestion header as follows: 535 o The sender sets the RTT field to a scaled smoothed round-trip time 536 estimate, or to zero if the round-trip time is not yet known. 538 o The sender sets the X field to the current inter-packet time 539 estimate, or to zero if an estimate is not yet available. Packets 540 carrying zero X field can receive negative feedback, but not 541 positive. Since XCP requires only a single round trip for a flow 542 to gain an estimate of RTT, this is expected to have negligible 543 effect. The value of X may be estimated as the smoothed round- 544 trip time estimate divided by the number of outstanding packets 545 (or congestion window size in packets, for window-based 546 protocols). Alternatively, it may be derived from the ratio of 547 the packet size to the current throughput estimate. Using 548 instantaneous RTT estimates in the calculation of X may yield 549 better results than using the smoothed RTT, especially for senders 550 with coarse-grained timers, i.e., timers with precision less than 551 the RTT. This is a subject for further study. Also,see Section 552 Section 4.1.3.3 for a discussion of RTT estimation. 554 o The sender calculates a desired change (typically, an increase) in 555 throughput. This is normally the difference between the current 556 estimated throughput and the desired throughput. However, if the 557 sender does not have sufficient data to send at the current 558 allowed throughput, the desired change in throughput SHOULD be 559 zero. 561 o The sender then divides the desired throughput change by the 562 number of packets in one round-trip time, and puts the result in 563 the Delta_Throughput field of the Congestion Header. This per- 564 packet distribution of the throughput change is necessary because 565 an XCP router does not maintain per-flow congestion state; it must 566 treat each packet independently of others in the same flow. The 567 number of packets in an RTT may be estimated by the product of the 568 current throughput and the RTT, divided by the Maximum Segment 569 Size (MSS). 571 The Delta_Throughput (in bytes/second) can be calculated as: 573 desired_throughput - Throughput 574 Delta_Throughput = ----------------------------------- 575 Throughput * ( RTT/MSS ) 577 where: 578 desired_throughput is measured in bytes/second 579 Throughput is measured in bytes/second 580 RTT is measured in seconds 581 MSS is measured in bytes 583 However, Delta_Throughput should be set to zero if, for any 584 reason, no additional capacity is needed, e.g., there is 585 insufficient data to maintain Throughput for the next RTT, as 586 discussed above. 588 o An issue for future consideration is how to treat the case when 589 Delta_Throughput is calculated to be < 1 Bps. Since an integer 590 representation is passed in the Congestion Header, the result will 591 appear as zero. It would be possible to send a fraction of the 592 packets in a round trip time with non-zero Delta_Throughput 593 values. 595 o For TCP, the throughput estimate can be obtained by dividing the 596 congestion window cwnd (in bytes) by RTT (in seconds), for 597 example. Alternatively, it could be measured. The ratio cwnd/RTT 598 differs from true throughput in two respects. First, cwnd doesn't 599 account for header size. This may become significant should XCP 600 be applied to real-time flows that send large numbers of small 601 packets, but it is probably not much worry for TCP flows that tend 602 to use the largest possible packet size. Second, cwnd represents 603 permission for the sender to transmit data. If the application 604 doesn't use all of the available cwnd, the advertised throughput 605 will be larger than the true throughput. This may result in an 606 XCP router creating an unfair allocation of negative feedback to a 607 flow. 609 4.1.2. Processing Feedback at the Receiver 611 An XCP receiver is responsible for copying the Delta_Throughput data 612 it sees on arriving packets into the Reverse_Feedback field of 613 outgoing packets. In TCP, outgoing packets would normally be ACK- 614 only segments. 616 In some cases returning packets are sent less frequently than 617 arriving packets, e.g., with delayed acknowledgments [RFC1122]. The 618 receiver is responsible for calculating the sum of the arriving 619 Delta_Throughput fields for placement in outgoing Reverse_Feedback 620 fields. 622 4.1.2.1. Feedback from the Receiver 624 The receiver end-system returns XCP congestion feedback from the 625 network to the sender, by copying the Delta_Throughput information 626 from arriving packets into the Reverse_Feedback field of Congestion 627 Headers in outgoing packets (possibly aggregating data as described 628 in Section 4.1.2). 630 It is possible that even empty ACK packets may create or encounter 631 congestion in the reverse direction. Although TCP implementations 632 generally do not perform congestion-based pacing of empty ACK 633 segments, some transport protocols (e.g., DCCP) may be. Such a 634 transport protocol may choose to use XCP congestion control on the 635 returning ACKs as well as on the data. 637 In the normal case of a unidirectional data flow with XCP applied 638 only to that data flow, the feedback can be sent in a Minimal format 639 Congestion Header, in which the RTT, X, and Delta_Throughput fields 640 are set to zero. 642 4.1.3. Processing Feedback at the Sender 644 When packets arrive back to the sender carrying reverse feedback, the 645 sender must adjust its sending rate accordingly. 647 As noted earlier, this throughput adjustment may be made in TCP by 648 updating the sender's congestion window, cwnd. This should use the 649 formula: 651 cwnd = max(cwnd + feedback * RTT, MSS) 653 where: 654 cwnd = current congestion window (bytes) 655 feedback = Reverse_Feedback field from received packet, 656 (bytes/sec, may be +/-) 657 RTT = Sender's current round-trip time estimate 658 (seconds) 659 MSS = maximum segment size (bytes) 661 The value of cwnd has a minimum of MSS to avoid the "Silly Window 662 Syndrome" [RFC0813]. 664 4.1.3.1. Aging the Allowed Throughput 666 When a sending application does not send data fast enough to fully 667 utilize the allowed throughput, XCP should reduce the allowed 668 throughput as time passes, to avoid sudden bursts of data into the 669 network if the application starts to send data later. 671 We present a slight modification of the algorithm for aging the 672 allowed throughput below. It is based on Section 4.5 of [Katabi03]. 673 Each RTT in which the sender sends with actual throughput which is 674 less than the allowed throughput, the allowed throughput MUST be 675 reduced by the following exponential averaging formula: 677 Allowed_Throughput = Allowed_Throughput*(1-p) + 678 Actual_Throughput * p 680 where: p is a parameter controlling the speed of aging, 681 ranged between 0 and 1. 683 Using p = 0.5 is suggested. Consideration of values of p or other 684 algorithms is a research topic. 686 4.1.3.2. Response to Packet Loss 688 When the transport protocol is TCP, a packet drop or detection of an 689 ECN notification [RFC3168] should trigger a transition to standard 690 TCP congestion control behavior[RFC2581]. In other words, cwnd 691 should be halved and Jacobson's fast retransmission/fast recovery, 692 slow start, and congestion avoidance algorithms should be applied for 693 the remainder of the connection or until the congestion event is 694 known to have passed (see Section 5.1 for discussion of alternative 695 approaches to this issue. The assumption is that the packet drop 696 reveals the presence of a congested non-XCP router in the path. 697 Transitioning to standard TCP behavior is a conservative response. 699 Note also the following: 701 o The change in congestion control algorithm should be delayed until 702 the three DUPACKs have arrived, according to the Fast 703 Retransmission/Fast Recovery algorithm[RFC2581]. 705 o Once the change to standard TCP congestion control has occurred, 706 cwnd should be managed using the RFC2581 algorithm. 708 o The X field in outgoing packets should continue to reflect the 709 current inter-packet time. This allows the XCP processes in the 710 routers along the path to continue to monitor the flow's 711 utilization. 713 o Further study is needed to determine whether it will be possible 714 to return a connection to XCP congestion control, once it has 715 transitioned to Van Jacobson mode. 717 o For transport protocols other than TCP, the response to a packet 718 loss or ECN notification is a subject for further study. 720 4.1.3.3. RTT Estimates 722 Having a good estimate of the round trip time is more important in 723 XCP than in Van Jacobson congestion control. There is evidence that 724 small errors in the RTT estimate can result in larger errors in the 725 throughput and X estimates. The current cwnd divided by SRTT is only 726 an approximation of the actual throughput. Likewise, SRTT divided by 727 cwnd in packets is only an approximation of the highly variable 728 inter-packet time, X. The RTT used in the ns-2 code in [KHR02] used a 729 smoothed floating-point RTT estimator, rather than instantaneous 730 measurements. Additional research is needed to develop 731 recommendations for RTT estimation. 733 4.2. Router functions 735 The router calculations for XCP are divided into those that occur 736 upon packet arrival, those that occur upon control interval timeout, 737 those that occur upon packet departure, and the assessment of the 738 persistent queue, which uses a separate timer. The calculations are 739 presented in the following sections as annotated pseudo-code. 741 4.2.1. Calculations Upon Packet Arrival 743 When a packet arrives at a router, several parameters used by XCP 744 need to be updated. The steps are described in the following pseudo- 745 code. 747 ======================================================== 748 On packet arrival do: 750 1. input_traffic += Pkt_size 752 2. sum_x += X 754 3. if (Rtt < MAX_INTERVAL) then 756 4. sum_xrtt += X * Rtt 758 5. else 760 6. sum_xrtt += X * MAX_INTERVAL 761 ======================================================== 763 Line 1: The variable input_traffic accumulates the volume of data 764 that have arrived during a control interval. When a packet 765 arrives, the packet size is taken from the IP header and is added 766 to the ongoing count. 768 Line 2: The variable sum_x is used in the control interval 769 calculation (see equation 4.2 of [Katabi03]) and in capacity 770 allocation. For each packet, values of X from the XCP header is 771 accumulated. It is recommended that sum_x is stored in a 64-bit 772 unsigned integer variable. 774 Lines 3 and 5: A test is performed to check whether the round trip 775 time of the flow exceeds the maximum allowable control interval. 776 If so, MAX_INTERVAL, the maximum allowable control interval, is 777 used in the subsequent calculations. Too large a control interval 778 will delay new flows from acquiring their fair allocation of 779 capacity. See Section 4.2.4 for a discussion of the recommended 780 value for MAX_INTERVAL. 782 Lines 4 and 6: As in Line 2, the variable sum_xrtt is used in the 783 control interval calculation. It is recommended that it is stored 784 in a 96-bit unsigned variable. 786 4.2.2. Calculations Upon Control Interval Timeout 788 When the control timer expires, several variables need to be updated 789 as shown below. 791 Note that several calculations show divisions. These divisions 792 should either be accomplished using floating-point arithmetic or 793 integer arithmetic and appropriate scaling to avoid over- or under- 794 flow. 796 ======================================================== On 797 estimation-control timeout do: 799 7. avg_rtt = sum_xrtt / sum_x 801 8. input_bw = input_traffic / ctl_interval 803 9. F = a * (capacity - input_bw) - b * queue / avg_rtt 805 10. shuffled_traffic = shuffle_function(...) 807 11. residue_pos_fbk = shuffled_traffic + max(F,0) 809 12. residue_neg_fbk = shuffled_traffic + max(-F,0) 811 13. Cp = residue_pos_fbk / sum_x 813 14. Cn = residue_neg_fbk / input_traffic 815 15. input_traffic = 0 817 16. sum_x = 0 819 17. sum_xrtt = 0 821 18. ctl_interval = max(avg_rtt, MIN_INTERVAL) 823 19. timer.reschedule(ctl_interval) 824 ======================================================== 825 Line 7: Update avg_rtt by taking the ratio of the two sums 826 accumulated in the previous section. This value is used to 827 determine the control interval (line 17). 829 Line 8: The average bandwidth of arriving traffic is calculated by 830 dividing the bytes received in the previous control interval by 831 the duration of the previous control interval. 833 Line 9: The aggregate feedback, F, is calculated. The variable 834 'capacity' is the ability of the outbound link to carry IP 835 packets, in bytes/second. The variable 'avg_rtt' was calculated 836 in line 7. The variable 'queue' is the persistent queue and is 837 defined in section Section 4.2.5. The values a and b are constant 838 parameters. According to [Katabi03], the constant a may be any 839 positive number such that a < (pi/4*sqrt(2)). A nominal value of 840 0.4 is recommended. The constant b is defined to be b = 841 a^2*sqrt(2). (If the nominal value of a is used, the value for b 842 would be 0.226.) Note that F may be positive or negative. 844 Line 10: This line establishes the amount of capacity that will be 845 shuffled in the next control interval through the use of the 846 shuffle_function. Shuffling takes a small amount of the available 847 capacity and redistributes it by adding it to both the positive 848 and negative feedback pools. This allows new flows to acquire 849 capacity in a full loaded system. 851 The recommended shuffle_function is as follows: 853 shuffled_traffic = max(0, 0.1 * input_bw - |F|) 855 The variable 'input_bw' is defined above in Line 8. Implementers 856 may choose other functions. It is important to consider that more 857 shuffled traffic decreases the time for new flows to acquire 858 capacity and converge to fairness. However, too much shuffling 859 may impede flows from acquiring their fair share of available 860 capacity. (For example, consider a setup of N flows bottlenecked 861 downstream from the given router and another flow, not limited as 862 those, trying to acquire its fair share. In this case shuffling 863 leads to under-utilization of the available bandwidth and impedes 864 the unlimited flow.) Shuffled_traffic is always a positive value. 866 The objective of the feedback calculations is to obtain a per-packet 867 feedback allocation from the router. Lines 13 and 14 obtain factors 868 in this calculation that, unfortunately, have no physical meaning. 869 One might view them as per-flow capacity allocations that have some 870 additional processing to prepare them for per-packet allocation. 871 Note that, with the use of shuffled_traffic, a non-idle router will 872 always start a control interval with non-zero values for both Cn and 873 Cp. 875 Line 11: The variable 'residue_pos_fbk' keeps track of the pool of 876 available positive capacity a router has to allocate. It is 877 initialized to the positive aggregate feedback. 879 Line 12: The variable 'residue_neg_fbk' keeps track of the pool of 880 available negative capacity a router has to allocate. It is 881 initialized to the negative aggregate feedback. This variable is 882 always positive. 884 Line 13: This line calculates the positive feedback scale factor, Cp. 885 The variables residue_pos_fbk, and sum_x are defined above. 887 Line 14: This line calculates the negative feedback scale factor, Cn. 888 This is a positive value. The definitions for residue_neg_fbk, 889 and input_traffic are given above. 891 Line 15-17: Reset various counters for the next control interval. 893 Line 18: Set the next control interval. The use of MIN_INTERVAL is 894 important to establish a reasonable control interval when the 895 router is idle. 897 Line 19: Set timer. 899 4.2.3. Calculations Upon Packet Departure 901 An XCP router processes each packet using the feedback parameters 902 calculated above. As stated earlier, each packet indicates the 903 current inter-packet time (X) and a throughput adjustment, 904 Delta_Throughput. The router calculates a per-packet capacity change 905 which will be compared to the Delta_Throughput field in the packet 906 header. Using the AIMD rule, positive feedback is applied equally 907 per-flow, while negative feedback is made proportional to each flow's 908 capacity. 910 To accommodate high-speed routers, XCP uses a fixed-point numeric 911 representation for the Congestion Header fields. This means that the 912 per-packet calculations defined below result in residual error that 913 is less than 1 Bps per packet. These errors accumulate across all 914 the packets in a control interval, resulting in an inaccuracy in 915 XCP's allocation of available bandwidth to flows. Further work is 916 needed to understand whether this will be a significant problem and, 917 if so, whether there is any solution short of using 64 bit precision 918 or floating point. 920 Processing should be done according to the pseudo-code below. 922 ======================================================== 923 On packet departure: 925 20. pos_fbk = Cp * X 927 21. neg_fbk = Cn * Pkt_size 929 22. feedback = pos_fbk - neg_fbk 931 23. if(Delta_Throughput > feedback) then 933 24. Delta_Throughput = feedback 935 25. else 937 26. neg_fbk = min(residue_neg_fbk, neg_fbk + 938 (feedback - Delta_Throughput)) 940 27. pos_fbk = Delta_Throughput + neg_fbk 942 28. residue_pos_fbk = max(0, residue_pos_fbk - pos_fbk) 944 29. residue_neg_fbk = max(0, residue_neg_fbk - neg_fbk) 946 30. if (residue_pos_fbk <= 0) then Cp = 0 948 31. if (residue_neg_fbk <= 0) then Cn = 0 950 ======================================================== 952 Line 20: The contribution of positive feedback for the current packet 953 is calculated using Cp, defined in line 13, and X (the flow's 954 advertised inter-packet time) from the Congestion Header. Note 955 that if Cp (and Cn in Line 21) is implemented as a floating point 956 number, this calculation would be implemented by multiplying the 957 Cp-mantissa by the value of X, then shifting the result by the 958 amount of the Cp-exponent. 960 Line 21: The contribution of negative feedback for the current packet 961 is calculated using Cn, defined in line 14, and Pkt_size from the 962 IP header. This value of neg_fbk is positive. 964 Line 22: The router's allocated feedback for the packet is the 965 positive per-packet feedback minus the negative per-packet 966 feedback. This value may be positive or negative. 968 Line 23-24: Line 23 tests whether the packet is requesting greater 969 capacity increase (via the packet's Delta_Throughput field) than 970 the router has allocated. If so, this means the the sender's 971 desired throughput needs to be reduced to be the router's 972 allocation. In line 24 the Delta_Throughput field in the packet 973 header updated with the router feedback allocation. 975 Line 25: This branch is executed when the packet is requesting a 976 smaller throughput increase than the router's allocation. In this 977 branch, and the rest of this pseudo-code, the packet header is not 978 updated and the remaining code is to correctly update the feedback 979 pool variables. 981 Line 26: In this line, the packet's negative feedback contribution, 982 neg_fbk, is set to be the smaller of two terms. The first term, 983 residue_neg_fbk, is the pool of negative feedback, i.e., this 984 drains the remaining negative feedback in the pool. The second 985 term increases the nominal negative feedback from the router by 986 the amount which the Delta_Throughput is less than net router 987 allocation. This allows the router to capture feedback which is 988 allocated by an upstream bottleneck. 990 Line 27: The positive allocation, pos_fbk, is adjusted to be the sum 991 of Delta_Throughput and neg_fbk, from Line 26. This is required 992 for the sum of pos_fbk and neg_fbk to equal Delta_Throughput. 994 Line 28-29: In these two lines, the feedback pools, residue_pos_fbk 995 and residue_neg_fbk, are reduced by the values of pos_fbk and 996 neg_fbk accordingly, but prevented from going negative. 998 Line 30-31: When a feedback pool becomes empty, set the scale factor 999 to zero, i.e., stop handing out associated feedback. 1001 4.2.4. The Control Interval 1003 The capacity allocation algorithm in XCP router updates several 1004 parameters every Control Interval. The Control Interval is currently 1005 defined to be the average RTT of the flows passing through the 1006 router, i.e., avg_rtt calculated in Line 7 above. Other possible 1007 choices for the control interval are under study. 1009 Notes on avg_rtt: 1011 o In this document, the quantity 'avg_rtt' refers to the last 1012 calculated value. In other words, the avg_rtt calculated based on 1013 packets arriving in the previous control interval. 1015 o The avg_rtt calculation should ignore packets with an RTT of zero 1016 in the header. 1018 o avg_rtt MUST have a minimum value. This is to allow flows to 1019 acquire bandwidth from a previously idle router. The default 1020 minimum value, MIN_INTERVAL, should be max(5-10ms, propagation 1021 delay on attached link). 1023 o avg_rtt MUST have a maximum value. The default maximum value, 1024 MAX_INTERVAL, should be max(0.5-1 sec, propagation delay on 1025 attached link). 1027 4.2.5. Obtaining the Persistent Queue 1029 In Section 4.2.2 the variable 'queue' contains the persistent queue 1030 over the control interval. This is intended to be the minimum 1031 standing queue over the queue estimation interval. 1033 The following pseudo-code describes how to obtain the minimum 1034 persistent queue: 1036 ======================================================== 1037 On packet departure do: 1039 32. min_queue = min(min_queue, inst_queue) 1041 ======================================================== 1043 When the queue-computation timer expires do: 1045 33. queue = min_queue 1047 34. min_queue = inst_queue 1049 35. Tq = max(ALLOWED_QUEUE, (avg_rtt - inst_queue/capacity)/2) 1051 36. queue_timer.reschedule(Tq) 1053 ======================================================== 1055 Line 32: The current instantaneous queue length is checked each time 1056 a packet departs compute the minimum queue size. 1058 If avg_rtt is being used as the Control Interval, it MUST NOT be used 1059 as the interval for measuring the minimum persistent queue. Doing so 1060 can result in a feed-forward loop. For example, if a queue develops 1061 the average RTT will increase. If the avg_rtt increases, it takes 1062 longer to react to the growing queue and the queue gets larger, 1063 leading to instability. 1065 Line 33: Upon expiration of the queue estimation timer, Tq, the 1066 variable queue, the persistent queue, is set to be the minimum 1067 queue occupancy over the last Tq. 1069 Line 34: Upon expiration of the queue estimation timer, reset the 1070 running estimate of the minimum queue to be the current queue 1071 occupancy. 1073 Line 35: The first term in the max function, ALLOWED_QUEUE, is the 1074 time to drain a standing queue that you are willing to tolerate. 1075 (A nominal value of 2ms worth of queuing is recommended but this 1076 may be tuned by implementers.) The second term is an estimate of 1077 the propagation delay. In other words the persistent queue is a 1078 queue that does not drain in a propagation delay. the division by 1079 2 is a conservative factor to avoid overestimating the propagation 1080 delay. 1082 Line 36: The queue computation timer is set. 1084 5. Unresolved Issues 1086 XCP is a work-in-progress. This section describes some known issues 1087 that need to be resolved. 1089 5.1. XCP With Non-XCP Routers 1091 Obviously, non-XCP routers will exist in networks before XCP becomes 1092 ubiquitously deployed and we expect other non-XCP systems to continue 1093 to be in the network indefinitely. Long term non-XCP network 1094 elements include any sort of link-level switches with queuing, e.g. 1095 ATM switches and sophisticated Ethernet switches. Even simple 1096 multiplexers are non-XCP queues with very little buffering. 1098 Sources and the network care about these non-XCP elements because any 1099 one of them can be a site of network congestion, and if an XCP 1100 endpoint is bottlenecked at one of these non-XCP elements, no router 1101 feedback will inform the endpoint to slow down. If nothing is done, 1102 such an element will probably collapse under congestion. 1104 Although exactly how XCP sources will operate in this environment is 1105 an open issue, a current promising direction is for endpoints to run 1106 a traditional end-to-end congestion detection algorithm in parallel 1107 with the XCP algorithm and switch over to using that algorithm for 1108 control when congestion is detected that XCP is not controlling. For 1109 example, an XCP source that detects 3 duplicate acknowledgments would 1110 fall back to TCP Reno behavior. 1112 An endpoint that is limited by its end-to-end congestion algorithm 1113 would indicate so to XCP routers by setting a bit in the packet 1114 header. A router may process such packets differently than packets 1115 from endpoints that are being controlled by XCP. For example, the 1116 router might allocate end-to-end controlled packets less feedback or 1117 not reduce its feedback pools by the full amount when assigning 1118 feedback to those packets. 1120 Though using its end-to-end algorithm to control its sending rate, an 1121 endpoint will also monitor the XCP feedback and if the source 1122 discovers that the XCP feedback would be more restrictive than the 1123 end-to-end control over a round trip time, the endpoint will revert 1124 to following XCP feedback. XCP feedback that is more restrictive 1125 over a round trip time is an indication that the endpoint's 1126 bottleneck is once again at an XCP router and the endpoint should 1127 take advantage of the more precise XCP information. 1129 Evaluation of these algorithms is ongoing work 1131 5.2. Variable Rate Links 1133 As discussed in [Zhang05], XCP may perform poorly over shared links. 1134 When a link is shared, such as in CSMA ethernet or 802.11 wireless 1135 networks, a single queue's drain rate is often a function of the load 1136 in the shared medium. So, using a constant value for the variable 1137 'capacity' in the routing control algorithm may not work well. For 1138 correct operation, the XCP router's notion of capacity needs to 1139 reflect how the link capacity is shared. 1141 5.3. XCP as a TCP PEP 1143 In the Internet today TCP performance-enhancing proxies (PEPs) are 1144 sometimes used to improve application performance over certain 1145 networks. TCP PEPs, and the issues surrounding their use are 1146 described in [RFC3135]. A common mechanism used in TCP PEPs is to 1147 split a TCP connection into three parts where the first and last run 1148 TCP and a more aggressive transport protocol is run in the middle 1149 (across the path which generates poor TCP performance). This 1150 improves performance over the "problematic" portion of the path and 1151 does not require changing the protocol stacks on the end systems. 1152 For example, if a high-speed satellite link was used to connect a LAN 1153 to the Internet, a TCP PEP may be placed on either side of the 1154 satellite link. 1156 It is not unusual today to find TCP PEPs which, to get high data 1157 rates, do not use congestion control at all. Of course, this limits 1158 the environments in which they can be used. However, XCP may be used 1159 in between two TCP PEPs to get high transfer rates and still respond 1160 to congestion in a correct and scalable way. 1162 Work on using XCP as a TCP PEP is just beginning [Kapoor05]. 1163 Objectives for such a mechanism would be: 1165 o preserve end-to-end TCP semantics as much as possible 1167 o enable some form of discovery for one PEP to determine that 1168 another PEP was in the path 1170 o allow for recovery should one half of a PEP pair fail or the route 1171 change so that one or both PEPs are not on the path 1173 o enable aggregation of multiple flows between two PEPs. 1175 A system which met the above objectives could also be used for 1176 incremental deployment of XCP. A network operator could deploy XCP- 1177 capable routers and use PEPs at the periphery of the network to 1178 convert from traditional TCP to XCP congestion control. This may 1179 result in smaller queues and improved link utilization within the XCP 1180 network. (This may be of more interest to wireless network providers 1181 than to over-provisioned fiber backbones.) 1183 5.4. Sharing resources between XCP and TCP 1185 Katabi describes a system for sharing router bandwidth between XCP 1186 and TCP flows that is based on sharing the output capacity based on 1187 the average throughputs of XCP and TCP sources using the 1188 router.[Katabi03] Two queues are installed at the router and they are 1189 served with different weights by the forwarding engine. The 1190 algorithm is work-conserving; the forwarding engine is never idle 1191 when either queue has packets in it. 1193 A TFRC-like algorithm estimates the average congestion window of TCP 1194 sources and the XCP algorithm estimates the average throughput of XCP 1195 sources. These averages are used to dynamically weight the time the 1196 processor spends on each queue. Initial experiments indicate that 1197 this system can provide fairness between the two flow classes (XCP/ 1198 TCP).[Katabi03] 1200 Open issues remain, however; for example there are questions about 1201 how well the TFRC-like algorithm can estimate TCP throughput with 1202 only access to local drop rates, convergence time of the weighting 1203 algorithm has never been explored, and no system for buffer 1204 allocation to complement the link capacity allocation has been put 1205 forward. These open issues are under study. 1207 5.5. A Generalized Router Model 1209 The XCP algorithm described here and in [Katabi03] manages congestion 1210 at a single point in a router, most likely an output queue. However, 1211 resource contention can occur at many points in a router. Input 1212 queues, backplanes, computational resources can 'congest' in addition 1213 to output buffers. There is a need to develop a general model and a 1214 variety of mechanisms to identify and manage resource contention 1215 throughout the router. 1217 5.6. Host back-to-back operation 1219 XCP hosts should be capable of back-to-back operation, i.e., with no 1220 router in the path. Nominally, this should not be a problem. A 1221 sender initializes delta_throughput to the desired value, no router 1222 modifies it and, thus, it is automatically granted. However, it has 1223 not yet been decided whether an XCP receiver should be capable of (or 1224 require) adjusting the delta_throughput to request flow control from 1225 the receiver to the sender. 1227 At this point, XCP offers no mechanism for flow control. (Open 1228 question: Should it?) It is believed that running XCP on the output 1229 queue of a host would solve this problem. However, it isn't clear 1230 that the complexity is justified by the need to solve this situation. 1232 6. Security Considerations 1234 The presence of a header which may be read and written by entities 1235 not participating in the end-to-end communication opens some 1236 potential security vulnerabilities. This section describes them and 1237 tries to give enough context so that users can understand the risks. 1239 Man-in-the-Middle Attacks 1241 There is a man-in-the-middle attack where a malicious user can 1242 force a sender to stop sending by inserting negative feedback into 1243 flow. This is little different from a malicious user discarding 1244 packets belonging to a flow using VJ congestion control or setting 1245 ECN bits. One question worth investigating further is whether the 1246 XCP attack is harder to diagnose. 1248 Covert Data Channels 1250 IPsec needs to be modified, as discussed in Section 3.3, to allow 1251 routers to read the entire congestion header and write the 1252 delta_feedback field. This could become a covert data-channel, 1253 i.e., a way in which an end-system can make data viewable to 1254 observers in the network, on a compromised end-system. 1256 Malicious Sources 1258 The XCP algorithms rely on senders to advertise information about 1259 their current RTT and X and correctly respond to feedback 1260 delivered from the network. Naturally, the possibility occurs 1261 that a sender won't perform these functions correctly. Chapter 7 1262 of [Katabi03] and [Katabi04] examine these issues. 1264 A source which lies about its values of X and hence throughput 1265 cannot affect the link utilization and, in the worst case, can 1266 unfairly acquire capacity. However, this is equivalent to a 1267 sender opening up multiple TCP flows. So, there is an incentive 1268 to lie about X. However, because X is explicitly stated in each 1269 packet header, it is a simpler matter to police it at the edge of 1270 the network than, say, for TCP. 1272 A source which lies about its RTT can disrupt the router control 1273 algorithm, particularly when a large number of sources lie about 1274 their RTT and the router control interval is adaptive and uses the 1275 average RTT. However, there is little incentive to lie as it will 1276 not affect the fair allocation of capacity and the liar will 1277 experience the same degradation as the non-lying flows. Lying 1278 about RTT should be considered a weak denial-of-service attack. 1280 A flow may also ignore negative feedback from the router. Such a 1281 flow can obtain unfair throughput in a congested router. However, 1282 as with lying, the explicit nature of XCP makes it possible to 1283 verify that flows are responding to feedback. For example, a 1284 policing function in the path (presumeably near the edge so that 1285 the load is manageable and it can be expected to see packet flow 1286 in both directions) may inspect congestion headers for a flow in 1287 both directions. If the policer sees negative feedback heading 1288 towards a source and no reduction in throughput it may, e.g., 1289 punish the flow by severely restricting the throughput. Note that 1290 this can be applied on a probabilistic basis, sampling flows only 1291 occasionally. 1293 7. IANA Considerations 1295 XCP requires the assignment of an IP protocol number. Once this 1296 value has been assigned, the number may be inserted (by the RFC 1297 Editor) into Section 3.1 and this paragraph may be removed prior to 1298 publication. 1300 8. Acknowledgements 1302 The authors would like to acknowledge the many contributors who have 1303 assisted in this work. Bob Braden applied his usual sage guidance to 1304 the project and to the spec, in particular. Ted Faber wrote the 1305 initial implementation framework and provided much wisdom on kernel 1306 development and congestion control. Yuri Pryadkin performed 1307 significant analysis on the algorithm and developed much of the code. 1308 John Wroclawski advised on project priorities and strategy. Eric Coe 1309 developed the initial implementation and testbed. Aman Kapoor 1310 performed supporting simulations and debugged kernel code. Padma 1311 Haldar ported ns-2 simulation code to ns-2 distribution. Jasmeet 1312 Bagga and Anuraag Mittal conducted simulations on various aspects of 1313 XCP performance. On the XCP mailing list, Tim Shepherd, Tom 1314 Henderson, and Matt Mathis made valuable contributions to the effort. 1315 To all the above go our sincere thanks. 1317 9. References 1319 [I-D.ietf-dccp-spec] 1320 Kohler, E., "Datagram Congestion Control Protocol (DCCP)", 1321 draft-ietf-dccp-spec-05 (work in progress), October 2003. 1323 [Jacobson88] 1324 Jacobson, V., "Congestion Avoidance and Control", ACM 1325 Computer Communication Review Proceedings of the Sigcomm 1326 '88 Symposium, August 1988. 1328 [KHR02] Katabi, D., Handley, M., and C. Rohr, "Internet Congestion 1329 Control for Future High Bandwidth-Delay Product 1330 Environments", ACM Computer Communication 1331 Review Proceedings of the Sigcomm '02 Symposium, 1332 August 2002. 1334 [Kapoor05] 1335 Kapoor, A., Falk, A., Faber, T., and Y. Pryadkin, 1336 "Achieving Faster Access to Satellite Link Bandwidth", 1337 IEEE 8th IEEE Global Internet Symposium, Miami, FL, March 1338 2005, 2005. 1340 [Katabi03] 1341 Katabi, D., "Decoupling Congestion Control and Bandwidth 1342 Allocation Policy With Application to High Bandwidth-Delay 1343 Product Networks", MIT PhD. Thesis, March 2003. 1345 [Katabi04] 1346 Katabi, D., "XCP's Performance in the Presence of 1347 Malicious Flows", Second International Workshop on 1348 Protocols for Fast Long-Distance Networks, Presentation, 1349 February 2004. 1351 [Padhye98] 1352 Padhye, J., Firoiu, V., Towsley, D., and J. Krusoe, 1353 "Modeling TCP throughput: A simple model and its empirical 1354 validation", ACM SIGCOMM '98 conference on 1355 Applications,technologies, architectures, and protocols 1356 for computer communication, 1998. 1358 [RFC0791] Postel, J., "Internet Protocol", STD 5, RFC 791, 1359 September 1981. 1361 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1362 RFC 793, September 1981. 1364 [RFC0813] Clark, D., "Window and Acknowledgement Strategy in TCP", 1365 RFC 813, July 1982. 1367 [RFC1122] Braden, R., "Requirements for Internet Hosts - 1368 Communication Layers", STD 3, RFC 1122, October 1989. 1370 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 1371 October 1996. 1373 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1374 Requirement Levels", BCP 14, RFC 2119, March 1997. 1376 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 1377 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 1378 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 1379 S., Wroclawski, J., and L. Zhang, "Recommendations on 1380 Queue Management and Congestion Avoidance in the 1381 Internet", RFC 2309, April 1998. 1383 [RFC2401] Kent, S. and R. Atkinson, "Security Architecture for the 1384 Internet Protocol", RFC 2401, November 1998. 1386 [RFC2402] Kent, S. and R. Atkinson, "IP Authentication Header", 1387 RFC 2402, November 1998. 1389 [RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security 1390 Payload (ESP)", RFC 2406, November 1998. 1392 [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion 1393 Control", RFC 2581, April 1999. 1395 [RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., 1396 Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., 1397 Zhang, L., and V. Paxson, "Stream Control Transmission 1398 Protocol", RFC 2960, October 2000. 1400 [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol 1401 Label Switching Architecture", RFC 3031, January 2001. 1403 [RFC3135] Border, J., Kojo, M., Griner, J., Montenegro, G., and Z. 1404 Shelby, "Performance Enhancing Proxies Intended to 1405 Mitigate Link-Related Degradations", RFC 3135, June 2001. 1407 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1408 of Explicit Congestion Notification (ECN) to IP", 1409 RFC 3168, September 2001. 1411 [Zhang05] Zhang, Y. and T. Henderson, "An Implementation and 1412 Experimental Study of the eXplicit Control Protocol 1413 (XCP)", IEEE Proceedings of the 24th IEEE International 1414 Conference on Computer Communications (INFOCOM 2005), pp. 1415 1037-1048, Miami, Florida, Mar 2005., 2005. 1417 Authors' Addresses 1419 Aaron Falk 1420 USC Information Sciences Institute 1421 4676 Admiralty Way 1422 Suite 1001 1423 Marina Del Rey, CA 90292 1425 Phone: 310-448-9327 1426 Email: falk@isi.edu 1427 URI: http://www.isi.edu/~falk 1429 Yuri Pryadkin 1430 USC Information Sciences Institute 1431 4676 Admiralty Way 1432 Suite 1001 1433 Marina Del Rey, CA 90292 1435 Phone: 310-448-8412 1436 Email: yuri@isi.edu 1438 Dina Katabi 1439 Massachusetts Institute of Technology 1440 200 Technology Square 1441 Cambridge, MA 02139 1443 Phone: 617-324-6027 1444 Email: dk@mit.edu 1445 URI: http://www.ana.lcs.mit.edu/dina/ 1447 Intellectual Property Statement 1449 The IETF takes no position regarding the validity or scope of any 1450 Intellectual Property Rights or other rights that might be claimed to 1451 pertain to the implementation or use of the technology described in 1452 this document or the extent to which any license under such rights 1453 might or might not be available; nor does it represent that it has 1454 made any independent effort to identify any such rights. Information 1455 on the procedures with respect to rights in RFC documents can be 1456 found in BCP 78 and BCP 79. 1458 Copies of IPR disclosures made to the IETF Secretariat and any 1459 assurances of licenses to be made available, or the result of an 1460 attempt made to obtain a general license or permission for the use of 1461 such proprietary rights by implementers or users of this 1462 specification can be obtained from the IETF on-line IPR repository at 1463 http://www.ietf.org/ipr. 1465 The IETF invites any interested party to bring to its attention any 1466 copyrights, patents or patent applications, or other proprietary 1467 rights that may cover technology that may be required to implement 1468 this standard. Please address the information to the IETF at 1469 ietf-ipr@ietf.org. 1471 Disclaimer of Validity 1473 This document and the information contained herein are provided on an 1474 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 1475 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 1476 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 1477 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 1478 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 1479 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1481 Copyright Statement 1483 Copyright (C) The Internet Society (2005). This document is subject 1484 to the rights, licenses and restrictions contained in BCP 78, and 1485 except as set forth therein, the authors retain all their rights. 1487 Acknowledgment 1489 Funding for the RFC Editor function is currently provided by the 1490 Internet Society.