-------------------------------------------------------------------------------- 2 Network Working Group A. Falk 3 Internet-Draft Y. Pryadkin 4 Intended status: Experimental ISI 5 Expires: January 6, 2008 D. Katabi 6 MIT 7 July 5, 2007 9 Specification for the Explicit Control Protocol (XCP) 10 draft-falk-xcp-spec-03.txt 12 Status of this Memo 14 By submitting this Internet-Draft, each author represents that any 15 applicable patent or other IPR claims of which he or she is aware 16 have been or will be disclosed, and any of which he or she becomes 17 aware will be disclosed, in accordance with Section 6 of BCP 79. 19 Internet-Drafts are working documents of the Internet Engineering 20 Task Force (IETF), its areas, and its working groups. Note that 21 other groups may also distribute working documents as Internet- 22 Drafts. 24 Internet-Drafts are draft documents valid for a maximum of six months 25 and may be updated, replaced, or obsoleted by other documents at any 26 time. It is inappropriate to use Internet-Drafts as reference 27 material or to cite them other than as "work in progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/ietf/1id-abstracts.txt. 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html. 35 This Internet-Draft will expire on January 6, 2008. 37 Copyright Notice 39 Copyright (C) The IETF Trust (2007). 41 Abstract 43 This document contains an initial specification for the Explicit 44 Control Protocol (XCP), an experimental congestion control protocol. 45 XCP is designed to deliver the highest possible end-to-end throughput 46 over a broad range of network infrastructure, including links with 47 very large bandwidth-delay products, which are not well served by the 48 current control algorithms. XCP is potentially applicable to any 49 transport protocol, although initial testing has applied it to TCP in 50 particular. XCP routers are required to perform a small calculation 51 on congestion state carried in each data packet. XCP routers also 52 periodically recalculate the local parameters required to provide 53 fairness. On the other hand, there is no per-flow congestion state 54 in XCP routers. 56 XCP is currently not ready for wide-scale deployment on the public 57 Internet and the intent of this document is to provide a starting 58 point for future experimentation and development as well as to record 59 the authors' implementation experiences and caveats. 61 Table of Contents 63 1. Changes Since Last Version . . . . . . . . . . . . . . . . . . 4 64 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 65 2.1. XCP Protocol Overview . . . . . . . . . . . . . . . . . . 6 66 3. The Congestion Header . . . . . . . . . . . . . . . . . . . . 10 67 3.1. Header placement . . . . . . . . . . . . . . . . . . . . . 10 68 3.2. Congestion Header Formats . . . . . . . . . . . . . . . . 10 69 3.3. IPsec issues . . . . . . . . . . . . . . . . . . . . . . . 13 70 3.4. NAT, middlebox issues . . . . . . . . . . . . . . . . . . 13 71 3.5. MPLS/Tunneling Issues . . . . . . . . . . . . . . . . . . 14 72 4. XCP Functions . . . . . . . . . . . . . . . . . . . . . . . . 15 73 4.1. End-System Functions . . . . . . . . . . . . . . . . . . . 15 74 4.1.1. Sending Packets . . . . . . . . . . . . . . . . . . . 15 75 4.1.2. Processing Feedback at the Receiver . . . . . . . . . 17 76 4.1.3. Processing Feedback at the Sender . . . . . . . . . . 17 77 4.2. Router functions . . . . . . . . . . . . . . . . 
. . . . . 19 78 4.2.1. Calculations Upon Packet Arrival . . . . . . . . . . . 20 79 4.2.2. Calculations Upon Control Interval Timeout . . . . . . 21 80 4.2.3. Calculations Upon Packet Departure . . . . . . . . . . 23 81 4.2.4. The Control Interval . . . . . . . . . . . . . . . . . 26 82 4.2.5. Obtaining the Persistent Queue . . . . . . . . . . . . 26 83 5. Unresolved Issues . . . . . . . . . . . . . . . . . . . . . . 29 84 5.1. XCP With Non-XCP Routers . . . . . . . . . . . . . . . . . 29 85 5.2. Variable Rate Links . . . . . . . . . . . . . . . . . . . 30 86 5.3. XCP as a TCP PEP . . . . . . . . . . . . . . . . . . . . . 30 87 5.4. Sharing resources between XCP and TCP . . . . . . . . . . 31 88 5.5. A Generalized Router Model . . . . . . . . . . . . . . . . 31 89 5.6. Host back-to-back operation . . . . . . . . . . . . . . . 31 90 6. Security Considerations . . . . . . . . . . . . . . . . . . . 33 91 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35 92 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 36 93 9. Informative References . . . . . . . . . . . . . . . . . . . . 37 94 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 40 95 Intellectual Property and Copyright Statements . . . . . . . . . . 41 97 1. Changes Since Last Version 99 Changes between version -02 and -03 101 o Added text in the abstract and introduction clarifying 102 experimental nature of the protocol in preparation for ICCRG 103 review. 105 o Re-ordered XCP header fields to allow more cycles for XCP 106 calculations in lines 20-22 when packets are streamed 32-bits per 107 cycle. This change improved performance of an experimental FPGA 108 implementation. 110 o Incremented version number (to 0x03) to reflect the change in the 111 XCP header 113 o Added a subsection on malicious receivers. 115 o Moved per-packet divisions from the router to the sender. 117 Changes between version -01 and -02 119 o Minor edits and typo fixes. 121 Changes between version -00 and -01 123 o Updated protocol to reflect movement of the per-packet division 124 from the router to the end-system. 126 o Incremented version number (to 0x02) to reflect change in packet 127 header format. 129 o Reordered the Protocol, Length, Version, and Format fields in the 130 congestion header (in anticipation of future support of IPv6 131 extension headers). 133 o Routers now MUST (from SHOULD) ignore fields other than 134 reverse_feedback when minimal header is used. 136 o No longer ignore packets with RTT set to zero. Senders with 137 coarse-grained timer may generate these if the RTT is less than 138 the timer precision. 140 o Added 'Open Issues' section on variable-rate links. 142 2. Introduction 144 The Van Jacobson congestion control algorithms [Jacobson88] [RFC2581] 145 are used by the Internet transport protocols TCP [RFC0793] and SCTP 146 [RFC2960]. The Jacobson algorithms are fundamental to stable and 147 efficient Internet operation, and they have been highly successful 148 over many orders of magnitude of Internet bandwidth and delay. 150 However, the Jacobson congestion control algorithms have begun to 151 reach their limits. Gigabit-per-second file transfers, lossy 152 wireless links, and high latency connections are all driving current 153 TCP congestion control outside of its natural operating regime. The 154 resulting performance problems are of great concern for important 155 network applications. 
The original Jacobson algorithm was a purely end-to-end solution, requiring no congestion-related state in routers. More recent modifications have backed off from this purity. Active queue management (AQM) in routers (e.g., RED) [RFC2309] improves performance by keeping queues small, while Explicit Congestion Notification (ECN) [RFC3168] passes one bit of congestion information back to senders. These measures do improve performance, but there is a limit to how much can be accomplished without more information from routers. The requirement of extreme scalability together with robustness has been a difficult hurdle to accelerating information flow.

This document concerns the Explicit Control Protocol (XCP) developed by Dina Katabi of MIT [KHR02]. XCP represents a significant advance in Internet congestion control: it extracts congestion information directly from routers, without any per-flow state. XCP should be able to deliver the highest possible application performance over a broad range of network infrastructure, including extremely high speed and very high delay links that are not well served by the current control algorithms. XCP achieves fairness, maximum link utilization, and efficient use of bandwidth. XCP is novel in separating the efficiency and fairness policies of congestion control, enabling routers to put available capacity to work quickly while conservatively managing the allocation of capacity to flows. XCP is potentially applicable to any transport protocol, although initial testing has applied it to TCP in particular.

XCP's scalability is built upon the principle of carrying per-flow congestion state in packets. XCP packets carry a congestion header through which the sender requests a desired throughput. Routers make a fair per-flow bandwidth allocation without maintaining any per-flow state. This enables the sender to learn the bottleneck router's allocation to a particular flow in a single round trip.

The gains of XCP come with some pain. XCP is more difficult to deploy than other proposed Internet congestion control improvements, since it requires changes in the routers as well as in end systems. It will be necessary to develop and test XCP with real user traffic and in real environments, to gain experience with real router and host implementations and to collect data on performance. Providing specifications is an important step towards enabling experimentation which, in turn, will lead to deployment. XCP deployment issues will be addressed in more detail in a subsequent version of this document.

XCP, in its current form, is not intended for deployment on the public Internet. The protocol requires participation by every queue in the path that could become a bottleneck, an unlikely circumstance today. Additionally, interoperability with other congestion control mechanisms, such as those implemented in TCP, remains an open issue and may overshadow the potential gains that XCP could provide. This document is therefore intended only to facilitate controlled experimentation, to record the authors' implementation experience, and to provide a starting point for future development.

This document contains an initial specification of the protocol and algorithms used by XCP, as an experimental protocol.
The XCP 213 algorithms defined here are based upon Katabi's SIGCOMM paper 214 [KHR02], her MIT thesis [Katabi03], and her ns simulation. However, 215 this document includes algorithmic modifications and clarifications 216 that have arisen from early experience with implementing and testing 217 XCP at the USC Information Sciences Institute. (See 218 http://www.isi.edu/isi-xcp for our project page.) This document is 219 intended to provide a baseline for further engineering and testing of 220 XCP. 222 This document is organized as follows. The remainder of Section 2 223 provides an overview of the XCP protocol, Section 3 discusses the 224 format of the congestion header, Section 4 describes the functions 225 occurring in the end-systems and routers, and Section 5 lists some 226 unresolved issues. 228 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 229 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 230 document are to be interpreted as described in [RFC2119]. 232 2.1. XCP Protocol Overview 234 The participants in the XCP protocol include sender hosts, receiver 235 hosts, and intermediate nodes in which queuing occurs along the path 236 from the sender to the receiver. The intermediate nodes are 237 generally routers, but link-layer switches may also contain packet 238 queues. 240 XCP supplies feedback from the network to the sender on the maximum 241 rate (throughput) for injecting data into the network. XCP feedback 242 is acquired through the use of a congestion header on each packet 243 that is sent. Routers along the path may update the congestion 244 header as it moves from the sender to the receiver. The receiver 245 copies the network feedback into outbound packets of the same flow. 246 An end-system may function as both a sender and a receiver in the 247 case of a bidirectional flow. 249 The figure below illustrates four entities participating in XCP. The 250 sender initializes the congestion header, two routers along the way 251 may update it, and the receiver copies the feedback from the network 252 into a returning packet in the same flow. 254 +----------+ +--------+ +--------+ +----------+ 255 | |------->| Router |---->| Router |------->| | 256 | Sender | +--------+ +--------+ | Receiver | 257 | |<----------------------------------------| | 258 +----------+ +----------+ 260 The congestion header contains four pieces of data: 262 o RTT: Set by the sender to its current estimate of the round-trip 263 time. 265 o X: Set by the sender to its current estimate of inter-packet time 266 gap. This quantity is used in place of Throughput in earlier 267 drafts of these specifications to avoid per-packet division in the 268 router. See Section 4.1.1 for more about X. 270 o Delta_Throughput: Initialized to the amount which the sender would 271 like to change (increase or decrease) its throughput, and updated 272 by the routers along the path to be the network's allocated change 273 in throughput. This value will be a negative number if an XCP- 274 capable queue along the path wants the sender to slow down. 276 o Reverse_Feedback: When a data packet reaches the receiver, its 277 Delta_Throughput value is returned to the sender in the 278 Reverse_Feedback field of a congestion header of a returning 279 packet (e.g., in an ACK packet). 281 An XCP-capable router calculates a fair capacity re-allocation for 282 each packet. A flow only receives this re-allocation from a 283 particular router if that router is the bottleneck for that flow. 
For XCP, a bottleneck router is defined to be a router that has insufficient capacity to accept a flow's current or desired throughput.

Based on current conditions, an XCP-capable router generates positive or negative feedback each time a packet arrives and compares it to the packet's Delta_Throughput field. Delta_Throughput is reduced if the current value exceeds this calculated feedback allocation. Each XCP-capable router along the path from sender to receiver performs this processing. A packet reaching the receiver therefore contains the minimal feedback allocation from the network, i.e., the capacity reallocation from the bottleneck router.

The receiver copies this value into the Reverse_Feedback field of a returning packet in the same flow (e.g., an ACK or DATA-ACK for TCP) and, in one round-trip, the sender learns the flow's per-packet throughput allocation.

The sender uses the reverse feedback information to adjust its allowed sending rate. For the transport protocol TCP [RFC0793], for example, this may be accomplished by adjusting the congestion window, or cwnd, that limits the amount of unacknowledged data in the network. (Cwnd is defined for Van Jacobson congestion control in [RFC2581].)

Additionally, it is possible to use XCP's explicit notification of the bottleneck capacity allocation for other types of applications. For example, XCP may be implemented to support multimedia streams over DCCP [RFC4340] or other transport protocols.

An XCP-capable router maintains two control algorithms on each output port: a congestion controller and a fairness controller. The congestion controller is responsible for making maximal use of the outbound link while at the same time draining any standing queues. The fairness controller is responsible for fairly allocating bandwidth to flows sharing the link. These two algorithms are executed only periodically, at an interval known as the "control interval". The algorithms defined below set this interval to the RTT averaged across all flows. Further work on choosing an appropriate value for the control interval may be required.

Each port-specific instance of XCP is independent of every other, and references to an "XCP router" should be considered an instance of XCP running on a particular output port.

   Actually, it is an oversimplification to say that congestion in routers only appears at output ports. Routers are complex devices which may experience resource contention in many forms and locations. Correctly expressing congestion which doesn't occur at the router output port is a topic for further study. Even so, it is important to correctly identify where the queue will build up in a router. The XCP algorithm will drain a standing queue; however, it is necessary to measure that queue for correct operation. For more discussion of this issue see Section 5.5.

More context, analysis, and background can be found in [KHR02] and [Katabi03].

3. The Congestion Header

The congestion control data required for XCP are placed in a new header which is called the Congestion Header.

3.1. Header placement

The Congestion Header is located between the IP and transport headers.
This is consistent with the fact that XCP is neither hop- 352 by-hop communication -- as in IP -- nor end-to-end communication -- 353 as in TCP or UDP -- but is rather end-system-to-network 354 communication. It should allow a router to "easily" locate the 355 congestion header on a packet with no IP options. 357 Other choices were considered for header location. For example, 358 making the Congestion Header a TCP option was suggested. This 359 made sense as the congestion information is related to the 360 transport protocol. However, it requires that routers be aware of 361 the header format for every new transport protocol that might ever 362 use XCP. This seemed like an unreasonable burden to place on the 363 routers and would impede deployment of new transport protocols 364 and/or XCP. 366 It has also been suggested that the Congestion Header be an IPv4- 367 style option. While this proposal is transport protocol 368 independent, it would generally force XCP packets to take the slow 369 path on non-XCP routers along a path. This could severely impact 370 performance. 372 The XCP protocol uses protocol number [TBD], assigned by IANA. IP 373 packets containing XCP headers will use this protocol number in the 374 IP header's Protocol field [RFC0791] to indicate to routers and end- 375 systems that an XCP congestion header follows the IP header. 377 3.2. Congestion Header Formats 379 This section defines the XCP Congestion Header formats. This holds 380 for IPv4; the corresponding header for IPv6 is a subject for further 381 study. 383 XCP-capable IPv4 packets carry the following Congestion Header: 385 0 1 2 3 386 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 387 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 388 | Protocol | Length |Version|Format | unused | 389 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 390 | X | 391 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 392 | RTT | 393 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 394 | Reverse_Feedback | 395 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 396 | Delta_Throughput | 397 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 399 Protocol: 8 bits 401 This field indicates the next-level protocol used in the data 402 portion of the packet. The values for various protocols are 403 specified by IANA. 405 Length: 8 bits 407 This field indicates the length of the congestion header, measured 408 in bytes. Length in this version of XCP will always be 20 bytes, 409 or 0x14. 411 Version: 4 bits 413 This field indicates the version of XCP that is in use. The 414 version of XCP described in this document corresponds to a value 415 of 0x03. Future values will be assigned by IANA 416 (http://www.iana.org). See note in the IANA Considerations 417 section. 419 Format: 4 bits 421 This field contains a code to indicate the congestion header 422 format. The current format codes are defined below. 424 +-----------------+------+ 425 | Format | Code | 426 +-----------------+------+ 427 | Standard Format | 0x1 | 428 | | | 429 | Minimal Format | 0x2 | 430 +-----------------+------+ 432 Table 1 434 Standard Format 436 The standard format includes the X, Delta_Throughput, and RTT 437 fields shown above. This format is used by XCP in data packets 438 flowing from sender to receiver. 440 Minimal Format 442 The X, Delta_Throughput, and RTT fields are unused and SHOULD be 443 set to zero. 
A router MUST NOT perform any processing on a minimal format header. This format is intended for use in empty ACK packets, to return congestion information from receiver to sender.

Other formats may be defined in the future, for example to define different representation formats for the X, Delta_Throughput, and/or RTT fields. The corresponding format values will be assigned by IANA. See the IANA Considerations section below.

unused: 8 bits

   This field is unused and MUST be set to zero in this version of XCP.

RTT: 32 bits

   This field indicates the round-trip time measured by the sender, in fixed point format with 28 bits after the binary point, in seconds. Thus, the value of 1 corresponds to 2^(-28) of a second. This field is an unsigned integer.

   The minimum value expressible in this field is 0s. A value of zero in the RTT field is legal and indicates that the sender either does not yet know the round-trip time, or operates at a coarse-grained timer granularity. The maximum value expressible in this field is 15.9999999963 seconds, in steps of 3.7ns.

X: 32 bits

   This field indicates the inter-packet time of the flow as calculated by the sender, in fixed point format with 28 bits after the binary point, in seconds. This is the same format as used by the RTT field.

Delta_Throughput: 32 bits

   This field indicates the desired or allocated change in throughput. It is set by the sender to indicate the amount by which the sender would like to adjust its throughput, and it may be subsequently reduced by routers along the path (See Section 4.2). It is measured in bytes per second and is a signed, 2's complement value.

   The minimum throughput change expressible in this field is -17 Gbps. The maximum value expressible in this field is 17 Gbps, in steps of 8 bits per second.

Reverse_Feedback: 32 bits

   This field indicates the value of Delta_Throughput received by the data receiver. The receiver copies the field Delta_Throughput into the Reverse_Feedback field of the next outgoing packet in the same flow. See Section 4.1.2.
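As an informal illustration of the layout above, the C sketch below shows one way an implementation might represent the 20-byte Congestion Header and its fixed-point fields. The struct, macro, and helper names (xcp_header, XCP_FIXED_SHIFT, xcp_seconds_to_fixed) are assumptions of this sketch, not identifiers defined by this specification.

   #include <stdint.h>

   /* One possible in-memory view of the 20-byte Congestion Header defined
    * above.  A real implementation must still read and write the 32-bit
    * fields in network byte order when touching the packet itself. */
   #define XCP_VERSION       0x3   /* version described in this document */
   #define XCP_FMT_STANDARD  0x1
   #define XCP_FMT_MINIMAL   0x2
   #define XCP_HDR_LEN       20    /* always 0x14 in this version */
   #define XCP_FIXED_SHIFT   28    /* RTT and X carry seconds * 2^28 */

   struct xcp_header {
       uint8_t  protocol;          /* next-level protocol, as in the IP header */
       uint8_t  length;            /* header length in bytes (20) */
       uint8_t  version_format;    /* high nibble: Version, low nibble: Format */
       uint8_t  unused;            /* MUST be zero */
       uint32_t x;                 /* inter-packet time, unsigned fixed point */
       uint32_t rtt;               /* round-trip time, unsigned fixed point */
       int32_t  reverse_feedback;  /* bytes/second, two's complement */
       int32_t  delta_throughput;  /* bytes/second, two's complement */
   };

   /* Convert a time in seconds to the fixed-point wire representation
    * (values at or above 16 seconds would overflow the 32-bit field). */
   static inline uint32_t xcp_seconds_to_fixed(double seconds)
   {
       return (uint32_t)(seconds * (double)(1u << XCP_FIXED_SHIFT));
   }

A receiving implementation would, in addition, be expected to check the Length, Version, and Format values before acting on the remaining fields.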
3.3. IPsec issues

IPsec [RFC4301] must be slightly modified to accommodate use of XCP. The specifications for the IP Authenticated Header (AH) [RFC4302] and IP Encapsulating Security Payload (ESP) [RFC4303] state that the IPsec headers immediately follow the IP header. This would be a problem for XCP in that a) it would make the XCP headers harder to find by the routers, b) ESP encryption would make it impossible for routers along the path to read and write congestion header information, and c) AH authentication would fail if any router along the path had modified a congestion header. Therefore, the XCP congestion header should immediately follow the IP header and precede any AH.

3.4. NAT, middlebox issues

Middleboxes that attempt to perform actions invisibly on flows must preserve the congestion header. Middleboxes that terminate the TCP connection should terminate the XCP connection. Middleboxes that insert queues into the forwarding path should participate in XCP.

3.5. MPLS/Tunneling Issues

When a flow enters an IP tunnel [RFC2003], IPsec ESP tunnel [RFC4303], or MPLS [RFC3031] network ingress point, the congestion header should be replicated on the "front" of the outer IP header. For example, when a packet enters an IP tunnel, the following transformation should occur:

                          [IP2]  \_ outer header
                  __-->   [XCP]  /
           [IP1]/         [IP1]  \
           [XCP]  --->    [XCP]  |_ inner header
           [TCP]          [TCP]  |
            ...            ...   /

Here the XCP header appended to the front of the outer header is copied from the inner header, with the appropriate change to the Protocol field to indicate that the next protocol is IP.

When the packet exits the tunnel, the congestion header, which may have been modified by routers along the tunneled path, is copied from the outer header into the inner header.

4. XCP Functions

XCP is concerned with the sender and receiver end-systems and the routers along the packet's path. This section describes the XCP-related algorithms to be implemented in each of these entities. The specific case of TCP as the transport protocol is also described. The emphasis in this section is on explicit and detailed definition of the XCP algorithms. The theoretical derivation and analysis of the algorithms can be found in [Katabi03].

4.1. End-System Functions

4.1.1. Sending Packets

The sender is responsible for maintaining five parameters, or their equivalent: (1) a desired throughput value, (2) a current estimate of the actual throughput, (3) the maximum throughput allowed by XCP, (4) a current estimate of the inter-packet time, denoted X, and (5) a current estimate of the round-trip time.

A sender may choose to use any reasonable value, i.e., any achievable value, for desired throughput. An application may supply this value via an API, or it might be the speed of the local interface.

When sending a packet, the sender fills in the fields of the congestion header as follows (a sketch of these steps in code follows this list):

o  The sender sets the RTT field to a scaled smoothed round-trip time estimate, or to zero if the round-trip time is not yet known.

o  The sender sets the X field to the current inter-packet time estimate, or to zero if an estimate is not yet available. Packets carrying a zero X field can receive negative feedback, but not positive. Since XCP requires only a single round trip for a flow to gain an estimate of RTT, this is expected to have negligible effect. The value of X may be estimated as the smoothed round-trip time estimate divided by the number of outstanding packets (or congestion window size in packets, for window-based protocols). Alternatively, it may be derived from the ratio of the packet size to the current throughput estimate. Using instantaneous RTT estimates in the calculation of X may yield better results than using the smoothed RTT, especially for senders with coarse-grained timers, i.e., timers with precision less than the RTT. This is a subject for further study. Also, see Section 4.1.3.3 for a discussion of RTT estimation.

o  The sender calculates a desired change (typically, an increase) in throughput. This is normally the difference between the current estimated throughput and the desired throughput. However, if the sender does not have sufficient data to send at the current allowed throughput, the desired change in throughput SHOULD be zero.

o  The sender then divides the desired throughput change by the number of packets in one round-trip time, and puts the result in the Delta_Throughput field of the Congestion Header. This per-packet distribution of the throughput change is necessary because an XCP router does not maintain per-flow congestion state; it must treat each packet independently of others in the same flow. The number of packets in an RTT may be estimated by the product of the current throughput and the RTT, divided by the Maximum Segment Size (MSS).

      The Delta_Throughput (in bytes/second) can be calculated as:

                             desired_throughput - Throughput
         Delta_Throughput = -----------------------------------
                                 Throughput * ( RTT/MSS )

         where:
            desired_throughput is measured in bytes/second
            Throughput is measured in bytes/second
            RTT is measured in seconds
            MSS is measured in bytes

      However, Delta_Throughput should be set to zero if, for any reason, no additional capacity is needed, e.g., there is insufficient data to maintain Throughput for the next RTT, as discussed above.

o  An issue for future consideration is how to treat the case when Delta_Throughput is calculated to be < 1 Bps. Since an integer representation is passed in the Congestion Header, the result will appear as zero. It would be possible to send a fraction of the packets in a round trip time with non-zero Delta_Throughput values.

o  For TCP, the throughput estimate can be obtained by dividing the congestion window cwnd (in bytes) by RTT (in seconds), for example. Alternatively, it could be measured. The ratio cwnd/RTT differs from true throughput in two respects. First, cwnd doesn't account for header size. This may become significant should XCP be applied to real-time flows that send large numbers of small packets, but it is probably not much of a worry for TCP flows that tend to use the largest possible packet size. Second, cwnd represents permission for the sender to transmit data. If the application doesn't use all of the available cwnd, the advertised throughput will be larger than the true throughput. This may result in an XCP router creating an unfair allocation of negative feedback to a flow.
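The following sketch, offered under stated assumptions, pulls the steps above together for a single outgoing packet. It reuses the illustrative struct xcp_header and XCP_* constants from the sketch in Section 3.2; the per-flow fields desired_throughput, current_throughput, srtt, and mss are hypothetical names used only to make the arithmetic concrete, not state defined by this specification.

   #include <stdint.h>

   /* Illustrative per-flow sender state; these names are assumptions of
    * this sketch, not fields defined by the specification. */
   struct xcp_sender_state {
       double   desired_throughput;  /* bytes/second, e.g. supplied via an API */
       double   current_throughput;  /* bytes/second, e.g. cwnd/srtt for TCP */
       double   srtt;                /* smoothed RTT, seconds (0 if unknown) */
       uint32_t mss;                 /* maximum segment size, bytes */
   };

   /* Fill the congestion header of one outgoing data packet (Section 4.1.1).
    * Reuses struct xcp_header and the XCP_* constants from the sketch in
    * Section 3.2; 'next_proto' is the transport protocol that follows. */
   static void xcp_fill_header(struct xcp_header *hdr, uint8_t next_proto,
                               const struct xcp_sender_state *s)
   {
       double x = 0.0, delta = 0.0;

       hdr->protocol       = next_proto;
       hdr->length         = XCP_HDR_LEN;
       hdr->version_format = (XCP_VERSION << 4) | XCP_FMT_STANDARD;
       hdr->unused         = 0;

       /* RTT and X are carried as seconds * 2^28 (both are below 16 s). */
       hdr->rtt = (uint32_t)(s->srtt * (double)(1u << XCP_FIXED_SHIFT));
       if (s->current_throughput > 0.0)
           x = (double)s->mss / s->current_throughput;  /* size / throughput */
       hdr->x = (uint32_t)(x * (double)(1u << XCP_FIXED_SHIFT));

       /* Spread the desired throughput change over the packets expected in
        * one RTT: packets_per_rtt = Throughput * RTT / MSS.  The change is
        * left at zero when the RTT is unknown or no extra capacity is
        * needed, as recommended above. */
       if (s->srtt > 0.0 && s->current_throughput > 0.0 &&
           s->desired_throughput > s->current_throughput) {
           double pkts_per_rtt = s->current_throughput * s->srtt / (double)s->mss;
           delta = (s->desired_throughput - s->current_throughput) / pkts_per_rtt;
       }
       hdr->delta_throughput = (int32_t)delta;
       hdr->reverse_feedback = 0;   /* set on the return path, Section 4.1.2 */
   }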
4.1.2. Processing Feedback at the Receiver

An XCP receiver is responsible for copying the Delta_Throughput data it sees on arriving packets into the Reverse_Feedback field of outgoing packets. In TCP, outgoing packets would normally be ACK-only segments.

In some cases returning packets are sent less frequently than arriving packets, e.g., with delayed acknowledgments [RFC1122]. The receiver is responsible for calculating the sum of the arriving Delta_Throughput fields for placement in outgoing Reverse_Feedback fields.

4.1.2.1. Feedback from the Receiver

The receiver end-system returns XCP congestion feedback from the network to the sender, by copying the Delta_Throughput information from arriving packets into the Reverse_Feedback field of Congestion Headers in outgoing packets (possibly aggregating data as described in Section 4.1.2).

It is possible that even empty ACK packets may create or encounter congestion in the reverse direction. Although TCP implementations generally do not perform congestion-based pacing of empty ACK segments, some transport protocols (e.g., DCCP) may do so. Such a transport protocol may choose to use XCP congestion control on the returning ACKs as well as on the data.
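A minimal sketch of the receiver behavior described in Section 4.1.2 follows, assuming a hypothetical per-flow accumulator named pending_feedback. Summing the arriving Delta_Throughput values and clamping the total to the 32-bit Reverse_Feedback field is one possible realization, not a normative requirement.

   #include <stdint.h>

   /* Hypothetical per-flow receiver state; 'pending_feedback' accumulates
    * Delta_Throughput values that have not yet been returned. */
   struct xcp_receiver_state {
       int64_t pending_feedback;
   };

   /* Called for each arriving data packet with a standard-format header.
    * With delayed acknowledgments, several arrivals may be covered by a
    * single returning packet, so the values are summed (Section 4.1.2). */
   static void xcp_rx_data(struct xcp_receiver_state *r, int32_t delta_throughput)
   {
       r->pending_feedback += delta_throughput;
   }

   /* Called when a returning packet (e.g., an ACK) is being built; the
    * result is copied into its Reverse_Feedback field. */
   static int32_t xcp_rx_take_feedback(struct xcp_receiver_state *r)
   {
       int64_t fb = r->pending_feedback;
       r->pending_feedback = 0;
       if (fb > INT32_MAX) fb = INT32_MAX;   /* clamp to the 32-bit field */
       if (fb < INT32_MIN) fb = INT32_MIN;
       return (int32_t)fb;
   }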
671 In the normal case of a unidirectional data flow with XCP applied 672 only to that data flow, the feedback can be sent in a Minimal format 673 Congestion Header, in which the RTT, X, and Delta_Throughput fields 674 are set to zero. 676 4.1.3. Processing Feedback at the Sender 678 When packets arrive back to the sender carrying reverse feedback, the 679 sender must adjust its sending rate accordingly. 681 As noted earlier, this throughput adjustment may be made in TCP by 682 updating the sender's congestion window, cwnd. This should use the 683 formula: 685 cwnd = max(cwnd + feedback * RTT, MSS) 687 where: 688 cwnd = current congestion window (bytes) 689 feedback = Reverse_Feedback field from received packet, 690 (bytes/sec, may be +/-) 691 RTT = Sender's current round-trip time estimate 692 (seconds) 693 MSS = maximum segment size (bytes) 695 The value of cwnd has a minimum of MSS to avoid the "Silly Window 696 Syndrome" [RFC0813]. 698 4.1.3.1. Aging the Allowed Throughput 700 When a sending application does not send data fast enough to fully 701 utilize the allowed throughput, XCP should reduce the allowed 702 throughput as time passes, to avoid sudden bursts of data into the 703 network if the application starts to send data later. 705 We present a slight modification of the algorithm for aging the 706 allowed throughput below. It is based on Section 4.5 of [Katabi03]. 707 Each RTT in which the sender sends with actual throughput which is 708 less than the allowed throughput, the allowed throughput MUST be 709 reduced by the following exponential averaging formula: 711 Allowed_Throughput = Allowed_Throughput*(1-p) + 712 Actual_Throughput * p 714 where: p is a parameter controlling the speed of aging, 715 ranged between 0 and 1. 717 Using p = 0.5 is suggested. Consideration of values of p or other 718 algorithms is a research topic. 720 4.1.3.2. Response to Packet Loss 722 When the transport protocol is TCP, a packet drop or detection of an 723 ECN notification [RFC3168] should trigger a transition to standard 724 TCP congestion control behavior[RFC2581]. In other words, cwnd 725 should be halved and Jacobson's fast retransmission/fast recovery, 726 slow start, and congestion avoidance algorithms should be applied for 727 the remainder of the connection or until the congestion event is 728 known to have passed (see Section 5.1 for discussion of alternative 729 approaches to this issue. The assumption is that the packet drop 730 reveals the presence of a congested non-XCP router in the path. 731 Transitioning to standard TCP behavior is a conservative response. 733 Note also the following: 735 o The change in congestion control algorithm should be delayed until 736 the three DUPACKs have arrived, according to the Fast 737 Retransmission/Fast Recovery algorithm[RFC2581]. 739 o Once the change to standard TCP congestion control has occurred, 740 cwnd should be managed using the RFC2581 algorithm. 742 o The X field in outgoing packets should continue to reflect the 743 current inter-packet time. This allows the XCP processes in the 744 routers along the path to continue to monitor the flow's 745 utilization. 747 o Further study is needed to determine whether it will be possible 748 to return a connection to XCP congestion control, once it has 749 transitioned to Van Jacobson mode. 751 o For transport protocols other than TCP, the response to a packet 752 loss or ECN notification is a subject for further study. 754 4.1.3.3. 
RTT Estimates 756 Having a good estimate of the round trip time is more important in 757 XCP than in Van Jacobson congestion control. There is evidence that 758 small errors in the RTT estimate can result in larger errors in the 759 throughput and X estimates. The current cwnd divided by SRTT is only 760 an approximation of the actual throughput. Likewise, SRTT divided by 761 cwnd in packets is only an approximation of the highly variable 762 inter-packet time, X. The RTT used in the ns-2 code in [KHR02] used a 763 smoothed floating-point RTT estimator, rather than instantaneous 764 measurements. Additional research is needed to develop 765 recommendations for RTT estimation. 767 4.2. Router functions 769 The router calculations for XCP are divided into those that occur 770 upon packet arrival, those that occur upon control interval timeout, 771 those that occur upon packet departure, and the assessment of the 772 persistent queue, which uses a separate timer. The calculations are 773 presented in the following sections as annotated pseudo-code. 775 4.2.1. Calculations Upon Packet Arrival 777 When a packet arrives at a router, several parameters used by XCP 778 need to be updated. The steps are described in the following pseudo- 779 code. 781 ======================================================== 782 On packet arrival do: 784 1. input_traffic += Pkt_size 786 2. sum_x += X 788 3. if (Rtt < MAX_INTERVAL) then 790 4. sum_xrtt += X * Rtt 792 5. else 794 6. sum_xrtt += X * MAX_INTERVAL 795 ======================================================== 797 Line 1: The variable input_traffic accumulates the volume of data 798 that have arrived during a control interval. When a packet 799 arrives, the packet size is taken from the IP header and is added 800 to the ongoing count. 802 Line 2: The variable sum_x is used in the control interval 803 calculation (see equation 4.2 of [Katabi03]) and in capacity 804 allocation. For each packet, values of X from the XCP header is 805 accumulated. It is recommended that sum_x is stored in a 64-bit 806 unsigned integer variable. 808 Lines 3 and 5: A test is performed to check whether the round trip 809 time of the flow exceeds the maximum allowable control interval. 810 If so, MAX_INTERVAL, the maximum allowable control interval, is 811 used in the subsequent calculations. Too large a control interval 812 will delay new flows from acquiring their fair allocation of 813 capacity. See Section 4.2.4 for a discussion of the recommended 814 value for MAX_INTERVAL. 816 Lines 4 and 6: As in Line 2, the variable sum_xrtt is used in the 817 control interval calculation. It is recommended that it is stored 818 in a 96-bit unsigned variable. 820 4.2.2. Calculations Upon Control Interval Timeout 822 When the control timer expires, several variables need to be updated 823 as shown below. 825 Note that several calculations show divisions. These divisions 826 should either be accomplished using floating-point arithmetic or 827 integer arithmetic and appropriate scaling to avoid over- or under- 828 flow. 830 ======================================================== On 831 estimation-control timeout do: 833 7. avg_rtt = sum_xrtt / sum_x 835 8. input_bw = input_traffic / ctl_interval 837 9. F = a * (capacity - input_bw) - b * queue / avg_rtt 839 10. shuffled_traffic = shuffle_function(...) 841 11. residue_pos_fbk = shuffled_traffic + max(F,0) 843 12. residue_neg_fbk = shuffled_traffic + max(-F,0) 845 13. Cp = residue_pos_fbk / sum_x 847 14. 
Cn = residue_neg_fbk / input_traffic

15. input_traffic = 0

16. sum_x = 0

17. sum_xrtt = 0

18. ctl_interval = max(avg_rtt, MIN_INTERVAL)

19. timer.reschedule(ctl_interval)
========================================================

Line 7: Update avg_rtt by taking the ratio of the two sums accumulated in the previous section. This value is used to determine the control interval (line 18).

Line 8: The average bandwidth of arriving traffic is calculated by dividing the bytes received in the previous control interval by the duration of the previous control interval.

Line 9: The aggregate feedback, F, is calculated. The variable 'capacity' is the ability of the outbound link to carry IP packets, in bytes/second. The variable 'avg_rtt' was calculated in line 7. The variable 'queue' is the persistent queue and is defined in Section 4.2.5. The values a and b are constant parameters. According to [Katabi03], the constant a may be any positive number such that a < pi/(4*sqrt(2)). A nominal value of 0.4 is recommended. The constant b is defined to be b = a^2*sqrt(2). (If the nominal value of a is used, the value for b would be 0.226.) Note that F may be positive or negative.

Line 10: This line establishes the amount of capacity that will be shuffled in the next control interval through the use of the shuffle_function. Shuffling takes a small amount of the available capacity and redistributes it by adding it to both the positive and negative feedback pools. This allows new flows to acquire capacity in a fully loaded system.

   The recommended shuffle_function is as follows:

      shuffled_traffic = max(0, 0.1 * input_bw - |F|)

   The variable 'input_bw' is defined above in Line 8. Implementers may choose other functions. It is important to consider that more shuffled traffic decreases the time for new flows to acquire capacity and converge to fairness. However, too much shuffling may impede flows from acquiring their fair share of available capacity. (For example, consider a setup of N flows bottlenecked downstream from the given router and another flow, not limited in this way, trying to acquire its fair share. In this case shuffling leads to under-utilization of the available bandwidth and impedes the unlimited flow.) Shuffled_traffic is always a positive value.

The objective of the feedback calculations is to obtain a per-packet feedback allocation from the router. Lines 13 and 14 obtain factors in this calculation that, unfortunately, have no physical meaning. One might view them as per-flow capacity allocations that have some additional processing to prepare them for per-packet allocation. Note that, with the use of shuffled_traffic, a non-idle router will always start a control interval with non-zero values for both Cn and Cp.

Line 11: The variable 'residue_pos_fbk' keeps track of the pool of available positive capacity a router has to allocate. It is initialized to the positive aggregate feedback.

Line 12: The variable 'residue_neg_fbk' keeps track of the pool of available negative capacity a router has to allocate. It is initialized to the negative aggregate feedback. This variable is always positive.

Line 13: This line calculates the positive feedback scale factor, Cp. The variables residue_pos_fbk and sum_x are defined above.

Line 14: This line calculates the negative feedback scale factor, Cn. This is a positive value. The definitions for residue_neg_fbk and input_traffic are given above.

Line 15-17: Reset various counters for the next control interval.

Line 18: Set the next control interval. The use of MIN_INTERVAL is important to establish a reasonable control interval when the router is idle.

Line 19: Set timer.
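The annotated pseudo-code above can be rendered almost directly in C. The sketch below uses floating-point arithmetic (one of the two options noted at the start of this section) and a hypothetical per-port structure, struct xcp_port, whose field names mirror the variables of lines 7-19; the guards against an idle interval are an assumption of this sketch rather than part of the pseudo-code.

   #include <math.h>

   /* Hypothetical per-output-port XCP state.  A scaled-integer version is
    * equally valid, as noted at the start of Section 4.2.2. */
   #define XCP_ALPHA 0.4            /* recommended value of 'a' */
   #define XCP_BETA  0.226          /* b = a^2 * sqrt(2) for a = 0.4 */

   struct xcp_port {
       double capacity;             /* outbound link capacity, bytes/second */
       double queue;                /* persistent queue, bytes (Section 4.2.5) */
       double input_traffic;        /* bytes received this control interval */
       double sum_x;                /* sum of X values seen this interval */
       double sum_xrtt;             /* sum of X * RTT values seen this interval */
       double ctl_interval;         /* length of the interval just ended, s */
       double Cp, Cn;               /* per-packet feedback scale factors */
       double residue_pos_fbk;      /* positive feedback pool */
       double residue_neg_fbk;      /* negative feedback pool */
       double min_interval;         /* MIN_INTERVAL, seconds */
       double max_interval;         /* MAX_INTERVAL, seconds */
   };

   /* Lines 7-19: run when the estimation-control timer expires.  Returns
    * the next control interval so the caller can reschedule the timer. */
   static double xcp_on_ctl_timeout(struct xcp_port *p)
   {
       double avg_rtt  = (p->sum_x > 0.0) ? p->sum_xrtt / p->sum_x
                                          : p->min_interval;
       double input_bw = p->input_traffic / p->ctl_interval;
       double F = XCP_ALPHA * (p->capacity - input_bw)
                  - XCP_BETA * p->queue / avg_rtt;

       /* Recommended shuffle function: max(0, 0.1 * input_bw - |F|). */
       double shuffled = fmax(0.0, 0.1 * input_bw - fabs(F));

       p->residue_pos_fbk = shuffled + fmax(F, 0.0);
       p->residue_neg_fbk = shuffled + fmax(-F, 0.0);
       p->Cp = (p->sum_x > 0.0) ? p->residue_pos_fbk / p->sum_x : 0.0;
       p->Cn = (p->input_traffic > 0.0)
                   ? p->residue_neg_fbk / p->input_traffic : 0.0;

       p->input_traffic = p->sum_x = p->sum_xrtt = 0.0;

       /* avg_rtt is bounded by MIN_INTERVAL and MAX_INTERVAL (Section 4.2.4). */
       p->ctl_interval = fmin(fmax(avg_rtt, p->min_interval), p->max_interval);
       return p->ctl_interval;
   }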
4.2.3. Calculations Upon Packet Departure

An XCP router processes each packet using the feedback parameters calculated above. As stated earlier, each packet indicates the current inter-packet time (X) and a throughput adjustment, Delta_Throughput. The router calculates a per-packet capacity change which will be compared to the Delta_Throughput field in the packet header. Using the AIMD rule, positive feedback is applied equally per-flow, while negative feedback is made proportional to each flow's capacity.

To accommodate high-speed routers, XCP uses a fixed-point numeric representation for the Congestion Header fields. This means that the per-packet calculations defined below result in residual error that is less than 1 Bps per packet. These errors accumulate across all the packets in a control interval, resulting in an inaccuracy in XCP's allocation of available bandwidth to flows. Further work is needed to understand whether this will be a significant problem and, if so, whether there is any solution short of using 64 bit precision or floating point.

Processing should be done according to the pseudo-code below.

========================================================
On packet departure:

20. pos_fbk = Cp * X

21. neg_fbk = Cn * Pkt_size

22. feedback = pos_fbk - neg_fbk

23. if (Delta_Throughput > feedback) then

24.    Delta_Throughput = feedback

25. else

26.    neg_fbk = min(residue_neg_fbk, neg_fbk +
                     (feedback - Delta_Throughput))

27.    pos_fbk = Delta_Throughput + neg_fbk

28. residue_pos_fbk = max(0, residue_pos_fbk - pos_fbk)

29. residue_neg_fbk = max(0, residue_neg_fbk - neg_fbk)

30. if (residue_pos_fbk <= 0) then Cp = 0

31. if (residue_neg_fbk <= 0) then Cn = 0

========================================================

Line 20: The contribution of positive feedback for the current packet is calculated using Cp, defined in line 13, and X (the flow's advertised inter-packet time) from the Congestion Header. Note that if Cp (and Cn in Line 21) is implemented as a floating point number, this calculation would be implemented by multiplying the Cp-mantissa by the value of X, then shifting the result by the amount of the Cp-exponent.

Line 21: The contribution of negative feedback for the current packet is calculated using Cn, defined in line 14, and Pkt_size from the IP header. This value of neg_fbk is positive.

Line 22: The router's allocated feedback for the packet is the positive per-packet feedback minus the negative per-packet feedback. This value may be positive or negative.

Line 23-24: Line 23 tests whether the packet is requesting a greater capacity increase (via the packet's Delta_Throughput field) than the router has allocated. If so, the sender's desired throughput needs to be reduced to the router's allocation. In line 24 the Delta_Throughput field in the packet header is updated with the router's feedback allocation.

Line 25: This branch is executed when the packet is requesting a smaller throughput increase than the router's allocation. In this branch, and in the rest of this pseudo-code, the packet header is not updated; the remaining code correctly updates the feedback pool variables.

Line 26: In this line, the packet's negative feedback contribution, neg_fbk, is set to be the smaller of two terms. The first term, residue_neg_fbk, is the pool of negative feedback, i.e., this drains the remaining negative feedback in the pool. The second term increases the nominal negative feedback from the router by the amount by which Delta_Throughput is less than the net router allocation. This allows the router to capture feedback which is allocated by an upstream bottleneck.

Line 27: The positive allocation, pos_fbk, is adjusted to be the sum of Delta_Throughput and neg_fbk, from Line 26. This is required for the difference between pos_fbk and neg_fbk to equal Delta_Throughput.

Line 28-29: In these two lines, the feedback pools, residue_pos_fbk and residue_neg_fbk, are reduced by the values of pos_fbk and neg_fbk respectively, but prevented from going negative.

Line 30-31: When a feedback pool becomes empty, set the scale factor to zero, i.e., stop handing out the associated feedback.
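For completeness, the per-packet processing of lines 20-31 is sketched below in the same floating-point style, reusing the hypothetical struct xcp_port from the previous sketch; the conversion of the header fields to seconds and bytes/second is assumed to happen in the caller.

   #include <math.h>

   /* Lines 20-31: run for each departing packet.  'x' and 'delta' are the
    * packet's X and Delta_Throughput header fields converted to seconds
    * and bytes/second; 'pkt_size' comes from the IP header.  The return
    * value is the (possibly reduced) Delta_Throughput that the caller
    * writes back into the congestion header. */
   static double xcp_on_departure(struct xcp_port *p, double x,
                                  double delta, double pkt_size)
   {
       double pos_fbk  = p->Cp * x;                 /* line 20 */
       double neg_fbk  = p->Cn * pkt_size;          /* line 21 */
       double feedback = pos_fbk - neg_fbk;         /* line 22 */

       if (delta > feedback) {                      /* lines 23-24 */
           delta = feedback;
       } else {                                     /* lines 25-27 */
           neg_fbk = fmin(p->residue_neg_fbk,
                          neg_fbk + (feedback - delta));
           pos_fbk = delta + neg_fbk;
       }

       /* Lines 28-31: drain the feedback pools, never letting them go
        * negative, and stop handing out feedback from an empty pool. */
       p->residue_pos_fbk = fmax(0.0, p->residue_pos_fbk - pos_fbk);
       p->residue_neg_fbk = fmax(0.0, p->residue_neg_fbk - neg_fbk);
       if (p->residue_pos_fbk <= 0.0) p->Cp = 0.0;
       if (p->residue_neg_fbk <= 0.0) p->Cn = 0.0;

       return delta;
   }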
4.2.4. The Control Interval

The capacity allocation algorithm in an XCP router updates several parameters every Control Interval. The Control Interval is currently defined to be the average RTT of the flows passing through the router, i.e., avg_rtt calculated in Line 7 above. Other possible choices for the control interval are under study.

Notes on avg_rtt:

o  In this document, the quantity 'avg_rtt' refers to the last calculated value, i.e., the avg_rtt calculated from packets arriving in the previous control interval.

o  The avg_rtt calculation should ignore packets with an RTT of zero in the header.

o  avg_rtt MUST have a minimum value. This is to allow flows to acquire bandwidth from a previously idle router. The default minimum value, MIN_INTERVAL, should be max(5-10ms, propagation delay on the attached link).

o  avg_rtt MUST have a maximum value. The default maximum value, MAX_INTERVAL, should be max(0.5-1 sec, propagation delay on the attached link).

4.2.5. Obtaining the Persistent Queue

In Section 4.2.2 the variable 'queue' contains the persistent queue over the control interval. This is intended to be the minimum standing queue over the queue estimation interval.

The following pseudo-code describes how to obtain the minimum persistent queue:

========================================================
On packet departure do:

32. min_queue = min(min_queue, inst_queue)

========================================================
When the queue-computation timer expires do:

33. queue = min_queue

34. min_queue = inst_queue

35. Tq = max(ALLOWED_QUEUE, (avg_rtt - inst_queue/capacity)/2)

36. queue_timer.reschedule(Tq)

========================================================

Line 32: The current instantaneous queue length is checked each time a packet departs to compute the minimum queue size.

If avg_rtt is being used as the Control Interval, it MUST NOT be used as the interval for measuring the minimum persistent queue.
Doing so 1094 can result in a feed-forward loop. For example, if a queue develops 1095 the average RTT will increase. If the avg_rtt increases, it takes 1096 longer to react to the growing queue and the queue gets larger, 1097 leading to instability. 1099 Line 33: Upon expiration of the queue estimation timer, Tq, the 1100 variable queue, the persistent queue, is set to be the minimum 1101 queue occupancy over the last Tq. 1103 Line 34: Upon expiration of the queue estimation timer, reset the 1104 running estimate of the minimum queue to be the current queue 1105 occupancy. 1107 Line 35: The first term in the max function, ALLOWED_QUEUE, is the 1108 time to drain a standing queue that you are willing to tolerate. 1109 (A nominal value of 2ms worth of queuing is recommended but this 1110 may be tuned by implementers.) The second term is an estimate of 1111 the propagation delay. In other words the persistent queue is a 1112 queue that does not drain in a propagation delay. the division by 1113 2 is a conservative factor to avoid overestimating the propagation 1114 delay. 1116 Line 36: The queue computation timer is set. 1118 5. Unresolved Issues 1120 XCP is a work-in-progress. This section describes some known issues 1121 that need to be resolved. 1123 5.1. XCP With Non-XCP Routers 1125 Obviously, non-XCP routers will exist in networks before XCP becomes 1126 ubiquitously deployed and we expect other non-XCP systems to continue 1127 to be in the network indefinitely. Long term non-XCP network 1128 elements include any sort of link-level switches with queuing, e.g. 1129 ATM switches and sophisticated Ethernet switches. Even simple 1130 multiplexers are non-XCP queues with very little buffering. 1132 Sources and the network care about these non-XCP elements because any 1133 one of them can be a site of network congestion, and if an XCP 1134 endpoint is bottlenecked at one of these non-XCP elements, no router 1135 feedback will inform the endpoint to slow down. If nothing is done, 1136 such an element will probably collapse under congestion. 1138 Although exactly how XCP sources will operate in this environment is 1139 an open issue, a current promising direction is for endpoints to run 1140 a traditional end-to-end congestion detection algorithm in parallel 1141 with the XCP algorithm and switch over to using that algorithm for 1142 control when congestion is detected that XCP is not controlling. For 1143 example, an XCP source that detects 3 duplicate acknowledgments would 1144 fall back to TCP Reno behavior. 1146 An endpoint that is limited by its end-to-end congestion algorithm 1147 would indicate so to XCP routers by setting a bit in the packet 1148 header. A router may process such packets differently than packets 1149 from endpoints that are being controlled by XCP. For example, the 1150 router might allocate end-to-end controlled packets less feedback or 1151 not reduce its feedback pools by the full amount when assigning 1152 feedback to those packets. 1154 Though using its end-to-end algorithm to control its sending rate, an 1155 endpoint will also monitor the XCP feedback and if the source 1156 discovers that the XCP feedback would be more restrictive than the 1157 end-to-end control over a round trip time, the endpoint will revert 1158 to following XCP feedback. 
XCP feedback that is more restrictive 1159 over a round trip time is an indication that the endpoint's 1160 bottleneck is once again at an XCP router and the endpoint should 1161 take advantage of the more precise XCP information. 1163 Evaluation of these algorithms is ongoing work 1165 5.2. Variable Rate Links 1167 As discussed in [Zhang05], XCP may perform poorly over shared links. 1168 When a link is shared, such as in CSMA ethernet or 802.11 wireless 1169 networks, a single queue's drain rate is often a function of the load 1170 in the shared medium. So, using a constant value for the variable 1171 'capacity' in the routing control algorithm may not work well. For 1172 correct operation, the XCP router's notion of capacity needs to 1173 reflect how the link capacity is shared. 1175 5.3. XCP as a TCP PEP 1177 In the Internet today TCP performance-enhancing proxies (PEPs) are 1178 sometimes used to improve application performance over certain 1179 networks. TCP PEPs, and the issues surrounding their use are 1180 described in [RFC3135]. A common mechanism used in TCP PEPs is to 1181 split a TCP connection into three parts where the first and last run 1182 TCP and a more aggressive transport protocol is run in the middle 1183 (across the path which generates poor TCP performance). This 1184 improves performance over the "problematic" portion of the path and 1185 does not require changing the protocol stacks on the end systems. 1186 For example, if a high-speed satellite link was used to connect a LAN 1187 to the Internet, a TCP PEP may be placed on either side of the 1188 satellite link. 1190 It is not unusual today to find TCP PEPs which, to get high data 1191 rates, do not use congestion control at all. Of course, this limits 1192 the environments in which they can be used. However, XCP may be used 1193 in between two TCP PEPs to get high transfer rates and still respond 1194 to congestion in a correct and scalable way. 1196 Work on using XCP as a TCP PEP is just beginning [Kapoor05]. 1197 Objectives for such a mechanism would be: 1199 o preserve end-to-end TCP semantics as much as possible 1201 o enable some form of discovery for one PEP to determine that 1202 another PEP was in the path 1204 o allow for recovery should one half of a PEP pair fail or the route 1205 change so that one or both PEPs are not on the path 1207 o enable aggregation of multiple flows between two PEPs. 1209 A system which met the above objectives could also be used for 1210 incremental deployment of XCP. A network operator could deploy XCP- 1211 capable routers and use PEPs at the periphery of the network to 1212 convert from traditional TCP to XCP congestion control. This may 1213 result in smaller queues and improved link utilization within the XCP 1214 network. (This may be of more interest to wireless network providers 1215 than to over-provisioned fiber backbones.) 1217 5.4. Sharing resources between XCP and TCP 1219 Katabi describes a system for sharing router bandwidth between XCP 1220 and TCP flows that is based on sharing the output capacity based on 1221 the average throughputs of XCP and TCP sources using the 1222 router.[Katabi03] Two queues are installed at the router and they are 1223 served with different weights by the forwarding engine. The 1224 algorithm is work-conserving; the forwarding engine is never idle 1225 when either queue has packets in it. 
5.2.  Variable Rate Links

As discussed in [Zhang05], XCP may perform poorly over shared links.  When a link is shared, such as in CSMA Ethernet or 802.11 wireless networks, a single queue's drain rate is often a function of the load in the shared medium.  Consequently, using a constant value for the variable 'capacity' in the router control algorithm may not work well.  For correct operation, the XCP router's notion of capacity needs to reflect how the link capacity is shared.

5.3.  XCP as a TCP PEP

In the Internet today, TCP performance-enhancing proxies (PEPs) are sometimes used to improve application performance over certain networks.  TCP PEPs, and the issues surrounding their use, are described in [RFC3135].  A common mechanism used in TCP PEPs is to split a TCP connection into three parts, where the first and last parts run TCP and a more aggressive transport protocol runs in the middle (across the portion of the path that produces poor TCP performance).  This improves performance over the "problematic" portion of the path and does not require changing the protocol stacks on the end systems.  For example, if a high-speed satellite link were used to connect a LAN to the Internet, a TCP PEP might be placed on either side of the satellite link.

It is not unusual today to find TCP PEPs which, to achieve high data rates, do not use congestion control at all.  Of course, this limits the environments in which they can be used.  However, XCP may be used between two TCP PEPs to obtain high transfer rates and still respond to congestion in a correct and scalable way.

Work on using XCP as a TCP PEP is just beginning [Kapoor05].  Objectives for such a mechanism would be:

   o  preserve end-to-end TCP semantics as much as possible

   o  enable some form of discovery for one PEP to determine that another PEP is in the path

   o  allow for recovery should one half of a PEP pair fail or the route change so that one or both PEPs are no longer on the path

   o  enable aggregation of multiple flows between two PEPs.

A system that met the above objectives could also be used for incremental deployment of XCP.  A network operator could deploy XCP-capable routers and use PEPs at the periphery of the network to convert from traditional TCP to XCP congestion control.  This may result in smaller queues and improved link utilization within the XCP network.  (This may be of more interest to wireless network providers than to over-provisioned fiber backbones.)

5.4.  Sharing resources between XCP and TCP

Katabi describes a system for sharing router bandwidth between XCP and TCP flows in which the output capacity is divided according to the average throughputs of the XCP and TCP sources using the router [Katabi03].  Two queues are installed at the router and are served with different weights by the forwarding engine.  The algorithm is work-conserving; the forwarding engine is never idle while either queue has packets in it.

A TFRC-like algorithm estimates the average congestion window of the TCP sources, and the XCP algorithm estimates the average throughput of the XCP sources.  These averages are used to dynamically weight the time the processor spends on each queue.  Initial experiments indicate that this system can provide fairness between the two flow classes (XCP/TCP) [Katabi03].

Open issues remain, however.  For example, it is unclear how well the TFRC-like algorithm can estimate TCP throughput with access only to local drop rates, the convergence time of the weighting algorithm has not been explored, and no scheme for buffer allocation to complement the link-capacity allocation has been put forward.  These open issues are under study.
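The sketch below is a purely illustrative rendering of the two-queue, work-conserving weighted scheduler described above, using a deficit-round-robin-style discipline.  [Katabi03] specifies the weighting principle but not a particular scheduler, so the discipline, the quantum, the weight floor, and all names here are assumptions.

      #include <math.h>
      #include <stddef.h>

      /*
       * Illustrative two-queue, work-conserving weighted scheduler for
       * the XCP/TCP sharing scheme in Section 5.4.  The deficit-round-
       * robin discipline and all names are assumptions for this sketch.
       */
      struct pkt {
          struct pkt *next;
          size_t      len;          /* packet length in bytes */
      };

      struct class_q {
          struct pkt *head;         /* FIFO of packets for this class */
          double      weight;       /* share derived from avg throughput */
          double      deficit;      /* bytes this class may still send */
      };

      #define QUANTUM 1500.0        /* bytes credited per round, scaled */

      /* Recompute the weights from the estimated average throughputs: a
       * TFRC-like estimate for the TCP class, XCP's own estimate for the
       * XCP class.  A small floor keeps either class from starving. */
      void update_weights(struct class_q *xcp, struct class_q *tcp,
                          double xcp_avg_tput, double tcp_avg_tput)
      {
          double total = xcp_avg_tput + tcp_avg_tput;

          if (total <= 0.0) {
              xcp->weight = tcp->weight = 0.5;   /* no estimates yet */
              return;
          }
          xcp->weight = fmax(0.05, xcp_avg_tput / total);
          tcp->weight = fmax(0.05, tcp_avg_tput / total);
      }

      /* Return the next packet to transmit.  The loop never idles while
       * either queue holds packets, so the scheduler is work-conserving. */
      struct pkt *dequeue_next(struct class_q *q[2])
      {
          for (;;) {
              int busy = 0;

              for (int i = 0; i < 2; i++) {
                  if (q[i]->head == NULL) {
                      q[i]->deficit = 0.0;   /* idle class keeps no credit */
                      continue;
                  }
                  busy = 1;
                  if (q[i]->deficit >= (double)q[i]->head->len) {
                      struct pkt *p = q[i]->head;

                      q[i]->head = p->next;
                      q[i]->deficit -= (double)p->len;
                      return p;
                  }
              }
              if (!busy)
                  return NULL;               /* both queues empty */
              for (int i = 0; i < 2; i++)    /* start a new round */
                  if (q[i]->head != NULL)
                      q[i]->deficit += q[i]->weight * QUANTUM;
          }
      }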
5.5.  A Generalized Router Model

The XCP algorithm described here and in [Katabi03] manages congestion at a single point in a router, most likely an output queue.  However, resource contention can occur at many points in a router.  Input queues, backplanes, and computational resources can 'congest' in addition to output buffers.  There is a need to develop a general model, and a variety of mechanisms, to identify and manage resource contention throughout the router.

5.6.  Host back-to-back operation

XCP hosts should be capable of back-to-back operation, i.e., operation with no router in the path.  Nominally, this should not be a problem: a sender initializes delta_throughput to the desired value, no router modifies it, and thus it is automatically granted.  However, it has not yet been decided whether an XCP receiver should be capable of (or be required to be capable of) adjusting delta_throughput to provide flow control from the receiver to the sender.

At this point, XCP offers no mechanism for flow control.  (Open question: should it?)  It is believed that running XCP on the output queue of a host would solve this problem.  However, it is not clear that the complexity is justified by the need to address this situation.

6.  Security Considerations

The presence of a header that may be read and written by entities not participating in the end-to-end communication opens some potential security vulnerabilities.  This section describes them and tries to give enough context for users to understand the risks.

Man-in-the-Middle Attacks

   There is a man-in-the-middle attack in which a malicious user can force a sender to stop sending by inserting negative feedback into a flow.  This is little different from a malicious user discarding packets belonging to a flow using VJ congestion control, or setting ECN bits.  One question worth investigating further is whether the XCP attack is harder to diagnose.

Covert Data Channels

   IPsec needs to be modified, as discussed in Section 3.3, to allow routers to read the entire congestion header and write the delta_feedback field.  On a compromised end-system, this could become a covert data channel, i.e., a way in which the end-system can make data viewable to observers in the network.

Malicious Sources

   The XCP algorithms rely on senders to advertise information about their current RTT and X and to respond correctly to feedback delivered from the network.  Naturally, the possibility exists that a sender will not perform these functions correctly.  Chapter 7 of [Katabi03] and [Katabi04] examine these issues.

   A source that lies about its value of X, and hence its throughput, cannot affect the link utilization but, in the worst case, can unfairly acquire capacity.  This is equivalent to a sender opening multiple TCP flows, so there is an incentive to lie about X.  However, because X is explicitly stated in each packet header, it is simpler to police at the edge of the network than, say, a TCP sender's rate.

   A source that lies about its RTT can disrupt the router control algorithm, particularly when a large number of sources lie about their RTT and the router control interval is adaptive and uses the average RTT.  However, there is little incentive to lie, as doing so will not affect the fair allocation of capacity, and the liar will experience the same degradation as the non-lying flows.  Lying about RTT should be considered a weak denial-of-service attack.

   A flow may also ignore negative feedback from the router.  Such a flow can obtain unfair throughput at a congested router.  However, as with lying, the explicit nature of XCP makes it possible to verify that flows are responding to feedback.  For example, a policing function in the path (presumably near the edge, so that the load is manageable and it can be expected to see the packet flow in both directions) may inspect congestion headers for a flow in both directions.  If the policer sees negative feedback heading toward a source and no reduction in throughput, it may, e.g., punish the flow by severely restricting its throughput.  Note that this can be applied on a probabilistic basis, sampling flows only occasionally.

Malicious Receivers

   An XCP receiver is required to return XCP congestion feedback unmodified to the sender.  It is possible for a receiver to lie about the feedback by returning a greater value than XCP actually allocated, thus causing the flow to acquire an unfair share of bandwidth.  A similar problem exists in TCP [Sherwood05], whereby TCP receivers can ACK data before it arrives (optimistic ACKing); however, the attack is technically much easier to mount in XCP.

   The explicit nature of XCP, though, makes it possible to verify that the receiver is truthful.  For example, a policer at the edge can monitor the feedback going in both directions and may punish flows whose feedback values have been tampered with; another possibility is for a bottleneck router to punish packets that report an unfairly large throughput.
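As a non-normative illustration of the policing idea sketched above, the following fragment checks whether a sampled flow's advertised throughput honors the negative feedback that the policer has seen returned toward the source.  The per-flow state, field names, tolerance, and the assumption that accumulated delta_throughput feedback maps directly onto the advertised X are all illustrative; this document does not specify a policer.

      #include <stdbool.h>

      /*
       * Illustrative edge-policer check for XCP feedback (Section 6).
       * All structures, names, and the tolerance are assumptions made
       * for this sketch; no policer is specified by this document.
       */
      struct policed_flow {
          bool   sampled;        /* flow is currently being spot-checked */
          double x_start;        /* X advertised at the start of the interval */
          double x_now;          /* X advertised in the most recent packet */
          double fb_to_source;   /* sum of delta_throughput feedback seen
                                    heading back toward the source */
      };

      #define TOLERANCE 0.05     /* 5% slack for measurement noise */

      /* Record feedback observed in packets traveling toward the source. */
      void policer_observe_reverse(struct policed_flow *f,
                                   double delta_throughput)
      {
          if (f->sampled)
              f->fb_to_source += delta_throughput;
      }

      /* At the end of an observation interval, flag flows whose advertised
       * throughput did not come down in response to negative feedback; the
       * policer may then restrict or drop the flow's traffic. */
      bool policer_flow_is_suspect(const struct policed_flow *f)
      {
          double expected = f->x_start + f->fb_to_source;
          double slack    = TOLERANCE * f->x_start;

          return f->sampled &&
                 f->fb_to_source < 0.0 &&
                 f->x_now > expected + slack;
      }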
7.  IANA Considerations

XCP requires the assignment of an IP protocol number.  Once this value has been assigned, the number may be inserted (by the RFC Editor) into Section 3.1 and this paragraph may be removed prior to publication.

8.  Acknowledgements

The authors would like to acknowledge the many contributors who have assisted in this work.  Bob Braden applied his usual sage guidance to the project and, in particular, to the spec.  Ted Faber wrote the initial implementation framework and provided much wisdom on kernel development and congestion control.  John Wroclawski advised on project priorities and strategy.  Eric Coe developed the initial implementation and testbed.  Aman Kapoor performed supporting simulations and debugged kernel code.  Padma Haldar ported the simulation code to the ns-2 distribution.  Jasmeet Bagga and Anuraag Mittal conducted simulations on various aspects of XCP performance.  On the XCP mailing list, Tim Shepherd, Tom Henderson, and Matt Mathis made valuable contributions to the effort.  To all the above go our sincere thanks.

This material is based upon work supported by the National Science Foundation under Grant No. 0230738.

9.  Informative References

[Jacobson88]  Jacobson, V., "Congestion Avoidance and Control", ACM Computer Communication Review, Proceedings of the Sigcomm '88 Symposium, August 1988.

[KHR02]  Katabi, D., Handley, M., and C. Rohrs, "Internet Congestion Control for Future High Bandwidth-Delay Product Environments", ACM Computer Communication Review, Proceedings of the Sigcomm '02 Symposium, August 2002.

[Kapoor05]  Kapoor, A., Falk, A., Faber, T., and Y. Pryadkin, "Achieving Faster Access to Satellite Link Bandwidth", 8th IEEE Global Internet Symposium, Miami, FL, March 2005.

[Katabi03]  Katabi, D., "Decoupling Congestion Control and Bandwidth Allocation Policy With Application to High Bandwidth-Delay Product Networks", MIT Ph.D. Thesis, March 2003.

[Katabi04]  Katabi, D., "XCP's Performance in the Presence of Malicious Flows", presentation at the Second International Workshop on Protocols for Fast Long-Distance Networks, February 2004.

[RFC0791]  Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[RFC0813]  Clark, D., "Window and Acknowledgement Strategy in TCP", RFC 813, July 1982.

[RFC1122]  Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC2003]  Perkins, C., "IP Encapsulation within IP", RFC 2003, October 1996.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

[RFC3135]  Border, J., Kojo, M., Griner, J., Montenegro, G., and Z. Shelby, "Performance Enhancing Proxies Intended to Mitigate Link-Related Degradations", RFC 3135, June 2001.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC4301]  Kent, S. and K. Seo, "Security Architecture for the Internet Protocol", RFC 4301, December 2005.

[RFC4302]  Kent, S., "IP Authentication Header", RFC 4302, December 2005.

[RFC4303]  Kent, S., "IP Encapsulating Security Payload (ESP)", RFC 4303, December 2005.

[RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram Congestion Control Protocol (DCCP)", RFC 4340, March 2006.
[Sherwood05]  Sherwood, R., Bhattacharjee, B., and R. Braud, "Misbehaving TCP Receivers Can Cause Internet-Wide Congestion Collapse", Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS 2005), pp. 383-392, Alexandria, VA, 2005.

[Zhang05]  Zhang, Y. and T. Henderson, "An Implementation and Experimental Study of the eXplicit Control Protocol (XCP)", Proceedings of the 24th IEEE International Conference on Computer Communications (INFOCOM 2005), pp. 1037-1048, Miami, Florida, March 2005.

Authors' Addresses

   Aaron Falk
   USC Information Sciences Institute
   4676 Admiralty Way
   Suite 1001
   Marina Del Rey, CA  90292

   Phone: 310-448-9327
   Email: falk@isi.edu
   URI:   http://www.isi.edu/~falk

   Yuri Pryadkin
   USC Information Sciences Institute
   4676 Admiralty Way
   Suite 1001
   Marina Del Rey, CA  90292

   Phone: 310-448-8417
   Email: yuri@isi.edu

   Dina Katabi
   Massachusetts Institute of Technology
   200 Technology Square
   Cambridge, MA  02139

   Phone: 617-324-6027
   Email: dk@mit.edu
   URI:   http://www.ana.lcs.mit.edu/dina/

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights.  Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard.  Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).