Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: October 13, 2011                      University College London
                                                          April 11, 2011


      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-03

Abstract

   Often endpoints are connected by multiple paths, but communications
   are usually restricted to a single path per connection.  Resource
   usage within the network would be more efficient were it possible
   for these multiple paths to be used concurrently.  Multipath TCP is
   a proposal to achieve multipath transport in TCP.

   New congestion control algorithms are needed for multipath transport
   protocols such as Multipath TCP, as single path algorithms have a
   series of issues in the multipath context.  One of the prominent
   problems is that running existing algorithms such as TCP New Reno
   independently on each path would give the multipath flow more than
   its fair share at a bottleneck link traversed by more than one of
   its subflows.  Further, it is desirable that a source with multiple
   paths available will transfer more traffic using the least congested
   of the paths, hence achieving resource pooling.  This would increase
   the overall efficiency of the network and also its robustness to
   failure.

   This document presents a congestion control algorithm which couples
   the congestion control algorithms running on different subflows by
   linking their increase functions, and dynamically controls the
   overall aggressiveness of the multipath flow.  The result is a
   practical algorithm that is fair to TCP at bottlenecks while moving
   traffic away from congested links.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 13, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Implementation Considerations when CWND is Expressed in
           Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Introduction

   Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of
   extensions to regular TCP [RFC0793] that allows one TCP connection
   to be spread across multiple paths.  MPTCP distributes load through
   the creation of separate "subflows" across potentially disjoint
   paths.

   How should congestion control be performed for multipath TCP?
   First, each subflow must have its own congestion control state
   (i.e., cwnd) so that capacity on that path is matched by offered
   load.  The simplest way to achieve this is to run TCP New Reno
   congestion control [RFC5681] on each subflow.  However, this
   solution is unsatisfactory, as it gives the multipath flow an unfair
   share when the paths taken by its different subflows share a common
   bottleneck.

   Bottleneck fairness is just one requirement multipath congestion
   control should meet.  The following three goals capture the
   desirable properties of a practical multipath congestion control
   algorithm:

   o  Goal 1 (Improve Throughput) A multipath flow should perform at
      least as well as a single path flow would on the best of the
      paths available to it.

   o  Goal 2 (Do no harm) A multipath flow should not take up more
      capacity from any of the resources shared by its different paths
      than if it were a single flow using only one of these paths.
      This guarantees it will not unduly harm other flows.

   o  Goal 3 (Balance congestion) A multipath flow should move as much
      traffic as possible off its most congested paths, subject to
      meeting the first two goals.

   Goals 1 and 2 together ensure fairness at the bottleneck.  Goal 3
   captures the concept of resource pooling [WISCHIK]: if each
   multipath flow sends more data through its least congested path,
   the traffic in the network will move away from congested areas.
   This improves robustness and overall throughput, among other
   things.  The way to achieve resource pooling is to effectively
   "couple" the congestion control loops for the different subflows.

   We propose an algorithm that couples only the additive increase
   function of the subflows, and uses unmodified TCP New Reno behavior
   in case of a drop.  The algorithm relies on the traditional TCP
   mechanisms to detect drops, to retransmit data, etc.

   Detecting shared bottlenecks reliably is quite difficult, but it is
   just one part of a bigger question.  This bigger question is how
   much bandwidth a multipath user should use in total, even if there
   is no shared bottleneck.

   The congestion controller aims to set the multipath flow's
   aggregate bandwidth to be the same as a regular TCP flow would get
   on the best path available to the multipath flow.  To estimate the
   bandwidth of a regular TCP flow, the multipath flow estimates loss
   rates and round trip times and computes the target rate.  Then it
   adjusts the overall aggressiveness (parameter alpha) to achieve the
   desired rate.

   While the mechanism above always applies, its effect depends on
   whether or not the multipath TCP flow influences the link loss
   rates (high vs. low statistical multiplexing).  If MPTCP does not
   influence link loss rates, MPTCP will get the same throughput as
   TCP on the best path.  In cases with low statistical multiplexing,
   where the multipath flow influences the loss rates on the path, the
   multipath throughput will be strictly higher than a single TCP
   would get on any of the paths.  In particular, if using two idle
   paths, the multipath throughput will be the sum of the two paths'
   throughputs.

   This algorithm ensures bottleneck fairness and fairness in the
   broader, network sense.  We acknowledge that current TCP fairness
   criteria are far from ideal, but a multipath TCP needs to be
   deployable in the current Internet.  If needed, new fairness
   criteria can be implemented by the same algorithm we propose by
   appropriately scaling the overall aggressiveness.

   It is intended that the algorithm presented here can be applied to
   other multipath transport protocols, such as alternative multipath
   extensions to TCP, or indeed any other congestion-aware transport
   protocols.
   However, for the purposes of example, this document will, where
   appropriate, refer to the MPTCP protocol.

   The design decisions and evaluation of the congestion control
   algorithm are published in [NSDI].

   The algorithm presented here only extends TCP New Reno congestion
   control for multipath operation.  It is foreseeable that other
   congestion controllers will be implemented for multipath transport
   to achieve the bandwidth-scaling properties of the newer congestion
   control algorithms for regular TCP (such as Compound TCP and
   Cubic).

3.  Coupled Congestion Control Algorithm

   The algorithm we present only applies to the increase phase of the
   congestion avoidance state, which specifies how the window inflates
   upon receiving an ack.  The slow start, fast retransmit, and fast
   recovery algorithms, as well as the multiplicative decrease of the
   congestion avoidance state, are the same as in TCP [RFC5681].

   Let cwnd_i be the congestion window on subflow i.  Let tot_cwnd be
   the sum of the congestion windows of all subflows in the
   connection.  Let p_i, rtt_i and mss_i be the loss rate, round trip
   time (i.e., the smoothed round trip time estimate) and maximum
   segment size on subflow i.

   We assume throughout this document that the congestion window is
   maintained in bytes, unless otherwise specified.  We briefly
   describe the algorithm for packet-based implementations of cwnd in
   Section 4.1.

   Our proposed "Linked Increases" algorithm MUST:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

   The increase formula takes the minimum between the computed
   increase for the multipath subflow (the first argument to min) and
   the increase TCP would get in the same scenario (the second
   argument).  In this way, we ensure that any multipath subflow
   cannot be more aggressive than a TCP flow in the same
   circumstances, hence achieving Goal 2 (do no harm).

   "alpha" is a parameter of the algorithm that describes the
   aggressiveness of the multipath flow.  To meet Goal 1 (improve
   throughput), the value of alpha is chosen such that the aggregate
   throughput of the multipath flow is equal to the throughput a TCP
   flow would get if it ran on the best path.

   To get an intuition of what the algorithm is trying to do, let's
   take the case where all the subflows have the same round trip time
   and MSS.  In this case the algorithm will grow the total window by
   approximately alpha*MSS per RTT.  This increase is distributed to
   the individual flows according to their instantaneous window size.
   Subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

   Note that, as in standard TCP, when tot_cwnd is large the increase
   may be 0.  In this case the increase MUST be set to 1.  We discuss
   how to implement this formula in practice in the next section.

   We assume appropriate byte counting (ABC, [RFC3465]) is used, hence
   the bytes_acked variable records the number of bytes newly
   acknowledged.  If ABC is not used, bytes_acked SHOULD be set to
   mss_i.

   When computing tot_cwnd, it is an easy mistake to simply sum up
   cwnd_i across all subflows: when a flow is in fast retransmit, its
   cwnd is typically inflated and no longer represents the real
   congestion window.  The correct behavior is to use the ssthresh
   value for flows in fast retransmit when computing tot_cwnd.  To
   cater for connections that are application limited, the computation
   should consider the minimum between flight_size_i and cwnd_i, and
   flight_size_i and ssthresh_i where appropriate.
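
   As a non-normative illustration, the sketch below shows how a
   byte-based implementation might compute tot_cwnd and apply the
   coupled increase on each ack.  The structure and function names
   (struct mptcp_subflow, compute_tot_cwnd, linked_increase) are
   hypothetical and not taken from any existing stack; alpha is shown
   here as a floating point parameter, with the integer-arithmetic
   version discussed in Section 4.

      #include <stdint.h>

      struct mptcp_subflow {
          uint64_t cwnd;         /* congestion window, in bytes      */
          uint64_t ssthresh;     /* slow start threshold, in bytes   */
          uint64_t flight_size;  /* outstanding data, in bytes       */
          uint32_t mss;          /* maximum segment size, in bytes   */
          uint32_t rtt_ms;       /* smoothed round trip time, in ms  */
          int      in_fast_retransmit;
      };

      /* Sum the subflow windows, substituting ssthresh for the
       * inflated cwnd of subflows in fast retransmit and capping each
       * term by the flight size for application-limited connections. */
      uint64_t compute_tot_cwnd(const struct mptcp_subflow *sf, int n)
      {
          uint64_t tot = 0;
          for (int i = 0; i < n; i++) {
              uint64_t w = sf[i].in_fast_retransmit ? sf[i].ssthresh
                                                    : sf[i].cwnd;
              if (sf[i].flight_size < w)
                  w = sf[i].flight_size;
              tot += w;
          }
          return tot;
      }

      /* Per-ack increase of cwnd_i during congestion avoidance:
       * min(alpha*bytes_acked*mss_i/tot_cwnd,
       *     bytes_acked*mss_i/cwnd_i),
       * never allowed to round down to zero.                         */
      void linked_increase(struct mptcp_subflow *sf_i,
                           uint64_t bytes_acked,
                           double alpha, uint64_t tot_cwnd)
      {
          uint64_t coupled = (uint64_t)(alpha * bytes_acked *
                                        sf_i->mss / tot_cwnd);
          uint64_t reno    = bytes_acked * sf_i->mss / sf_i->cwnd;
          uint64_t inc     = coupled < reno ? coupled : reno;

          sf_i->cwnd += inc > 0 ? inc : 1;
      }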

   The total throughput of a multipath flow depends on the value of
   alpha and the loss rates, maximum segment sizes and round trip
   times of its paths.  Since we require that the total throughput is
   no worse than the throughput a single TCP would get on the best
   path, it is impossible to choose a priori a single value of alpha
   that achieves the desired throughput on every occasion.  Hence,
   alpha must be computed based on the observed properties of the
   paths.

   The formula to compute alpha is:

                               cwnd_i
                         max --------
                          i        2
                              rtt_i
      alpha = tot_cwnd * ------------------
                         /      cwnd_i \ 2
                         | sum -------- |
                         \  i   rtt_i  /

   The formula is derived by equalizing the rate of the multipath flow
   with the rate of a TCP running on the best path, and solving for
   alpha.

4.  Implementation Considerations

   The formula for alpha above implies that alpha is a floating point
   value.  This would require performing costly floating point
   operations whenever an ack is received.  Further, in many kernels
   floating point operations are disabled.  There is an easy way to
   approximate the above calculations using integer arithmetic.

   Let alpha_scale be an integer.  When computing alpha, use
   alpha_scale * tot_cwnd instead of tot_cwnd, and do all the
   operations in integer arithmetic.

   Then, scale down the increase per ack by alpha_scale.  The
   algorithm is:

   o  For each ack received on subflow i, increase cwnd_i by
      min((alpha*bytes_acked*mss_i/tot_cwnd)/alpha_scale,
      bytes_acked*mss_i/cwnd_i).

   alpha_scale denotes the precision we want for computing alpha.
   Observe that the errors in computing the numerator or the
   denominator in the formula for alpha are quite small, as the cwnd
   in bytes is typically much larger than the RTT (measured in ms).

   With these changes, all the operations can be done using integer
   arithmetic.  We propose alpha_scale be a small power of two, to
   allow using faster shift operations instead of multiplication and
   division.  Our experiments show that using alpha_scale=512 works
   well in a wide range of scenarios.  Increasing alpha_scale
   increases precision, but also increases the risk of overflow when
   computing alpha.  Using 64-bit operations would solve this issue.
   Another option is to dynamically adjust alpha_scale when computing
   alpha; in this way we avoid overflow and obtain maximum precision.

   It is possible to implement the algorithm by calculating tot_cwnd
   on each ack; however, this would be costly, especially when the
   number of subflows is large.  To avoid this overhead the
   implementation MAY choose to maintain a new per-connection state
   variable called tot_cwnd.  If it does so, the implementation will
   update the tot_cwnd value whenever the individual subflows' windows
   are updated.  Updating only requires one more addition or
   subtraction operation compared to the regular, per-subflow
   congestion control code, so its performance impact should be
   minimal.

   Computing alpha per ack is also costly.  We propose alpha be a
   per-connection variable, computed whenever there is a drop and once
   per RTT otherwise.  More specifically, let cwnd_new be the new
   value of the congestion window after it is inflated or after a
   drop.  Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
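
   As a non-normative illustration, the sketch below computes the
   scaled alpha using only 64-bit integer arithmetic, reusing the
   hypothetical struct mptcp_subflow from the sketch in Section 3.
   RTTs are assumed to be expressed in milliseconds, in line with the
   precision argument above; the guards against zero divisors and the
   fallback value are implementation choices, not part of the
   algorithm.

      /* Returns alpha scaled by alpha_scale:
       *
       *                                max_i (cwnd_i / rtt_i^2)
       *   alpha = alpha_scale * tot_cwnd * ---------------------
       *                                (sum_i cwnd_i / rtt_i)^2
       *
       * With alpha_scale = 512, cwnd in bytes and rtt in ms, the
       * intermediate products fit comfortably in 64 bits for
       * realistic window sizes (see the overflow discussion above). */
      uint64_t compute_alpha(const struct mptcp_subflow *sf, int n,
                             uint64_t tot_cwnd, uint64_t alpha_scale)
      {
          uint64_t max_num = 0;   /* max_i cwnd_i / rtt_i^2 */
          uint64_t sum_den = 0;   /* sum_i cwnd_i / rtt_i   */

          for (int i = 0; i < n; i++) {
              uint64_t rtt = sf[i].rtt_ms ? sf[i].rtt_ms : 1;
              uint64_t num = sf[i].cwnd / (rtt * rtt);

              if (num > max_num)
                  max_num = num;
              sum_den += sf[i].cwnd / rtt;
          }

          if (sum_den == 0)        /* degenerate case: no usable data; */
              return alpha_scale;  /* avoid dividing by zero           */

          return (alpha_scale * tot_cwnd * max_num) /
                 (sum_den * sum_den);
      }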

   In certain cases with small RTTs, computing alpha can still be
   expensive.  We observe that if RTTs were constant, it would be
   sufficient to compute alpha once per drop, as alpha does not change
   between drops (the insight here is that cwnd_i/cwnd_j = constant as
   long as both windows increase).  Experimental results show that
   even if round trip times are not constant, using the average round
   trip time instead of the instantaneous round trip time gives good
   precision for computing alpha.  Hence, it is possible to compute
   alpha only once per drop according to the formula above, by
   replacing rtt_i with rtt_avg_i.

   If using the average round trip time, rtt_avg_i will be computed by
   sampling rtt_i whenever the window can accommodate one more packet,
   i.e., when cwnd / mss < (cwnd+increase)/mss.  The samples are
   averaged once per sawtooth into rtt_avg_i.  This sampling ensures
   that there is no sampling bias for larger windows.

   Given tot_cwnd and alpha, the congestion control algorithm is run
   for each subflow independently, with similar complexity to the
   standard TCP increase code [RFC5681].

4.1.  Implementation Considerations when CWND is Expressed in Packets

   When the congestion control algorithm maintains cwnd in packets
   rather than bytes, the algorithms above must change to take into
   account path mss.

   To compute the increase when an ack is received, the implementation
   for multipath congestion control is a simple extension of the TCP
   New Reno code.  In TCP New Reno, cwnd_cnt is an additional state
   variable that tracks the number of segments acked since the last
   cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and
   then cwnd_cnt is set to 0.

   In the multipath case, cwnd_cnt_i is maintained for each subflow as
   above, and cwnd_i is increased by 1 when cwnd_cnt_i >
   max(alpha_scale * tot_cwnd / alpha, cwnd_i).  An illustrative
   sketch of this rule is given at the end of this section.

   When computing alpha for packet-based stacks, the errors in
   computing the terms in the denominator are larger (this is because
   cwnd is much smaller and rtt may be comparatively large).  Let max
   be the index of the subflow used in the numerator.  To reduce
   errors, it is easiest to move rtt_max (once calculated) from the
   numerator to the denominator, obtaining the equivalent formula
   below.

                                                cwnd_max
      alpha = alpha_scale * tot_cwnd * -----------------------
                                       / rtt_max * cwnd_i \ 2
                                       | sum ------------- |
                                       \  i      rtt_i     /

   Note that the formula for computing alpha does not take into
   account path mss, and is the same for stacks that keep cwnd in
   bytes or packets.  With this formula, the algorithm for computing
   alpha will match the rate of TCP on the best path in B/s for
   byte-oriented stacks, and in packets/s for packet-based stacks.  In
   practice, mss rarely changes between paths so this shouldn't be a
   problem.

   However, it is simple to derive formulae allowing packet-based
   stacks to achieve byte-rate fairness (and vice versa) if needed.
   In particular, for packet-based stacks wanting byte-rate fairness,
   the formula above changes as follows: cwnd_max is replaced by
   cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i
   * mss_i.
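
   As a non-normative illustration, the sketch below applies the
   packet-based increase rule on each ack.  Here cwnd_i, cwnd_cnt_i
   and tot_cwnd are maintained in packets, alpha is kept scaled by
   alpha_scale as computed above, and the function and variable names
   are hypothetical.

      /* Congestion avoidance increase for subflow i in a packet-based
       * stack.  cwnd_i grows by one packet once cwnd_cnt_i exceeds
       * max(alpha_scale * tot_cwnd / alpha, cwnd_i), i.e. once enough
       * acks have accumulated for both the coupled bound and the New
       * Reno bound to allow a full-packet increase.                  */
      void packet_increase_on_ack(uint32_t *cwnd_i,
                                  uint32_t *cwnd_cnt_i,
                                  uint32_t tot_cwnd,
                                  uint64_t alpha, uint64_t alpha_scale)
      {
          uint64_t coupled = (alpha_scale * tot_cwnd) /
                             (alpha ? alpha : 1);
          uint64_t thresh  = coupled > *cwnd_i ? coupled : *cwnd_i;

          (*cwnd_cnt_i)++;
          if (*cwnd_cnt_i > thresh) {
              (*cwnd_i)++;
              *cwnd_cnt_i = 0;
          }
      }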

5.  Discussion

   To achieve perfect resource pooling, one must couple both the
   increase and decrease of congestion windows across subflows, as in
   [KELLY].  Yet this tends to exhibit "flappiness": when the paths
   have similar levels of congestion, the congestion controller will
   tend to allocate all the window to one random subflow, and allocate
   zero window to the other subflows.  The controller will perform
   random flips between these stable points.  This doesn't seem
   desirable in general, and is particularly bad when the achieved
   rates depend on the RTT (as in the current Internet): in such a
   case, the resulting rate will fluctuate unpredictably depending on
   which state the controller is in, hence violating Goal 1.

   By coupling only the increases, our proposal removes flappiness but
   also reduces the extent of resource pooling the protocol achieves.
   The algorithm will allocate window to the subflows such that
   p_i * cwnd_i = constant, for all i.  Thus, when the loss rates of
   the subflows are equal, each subflow will get an equal window,
   removing flappiness.  When the loss rates differ, progressively
   more window will be allocated to the flow with the lower loss rate.
   In contrast, perfect resource pooling requires that all the window
   should be allocated to the path with the lowest loss rate.

6.  Security Considerations

   None.

   Detailed security analysis for the Multipath TCP protocol itself is
   included in [I-D.ford-mptcp-multiaddressed] and [REF].

7.  Acknowledgements

   We thank Christoph Paasch for his suggestions for computing alpha
   in packet-based stacks.  The authors are supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.  The views expressed here are those of the
   author(s) only.  The European Commission is not liable for any use
   that may be made of the information in this document.

8.  IANA Considerations

   None.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., Handley, M., and S. Barre, "TCP
              Extensions for Multipath Operation with Multiple
              Addresses", draft-ford-mptcp-multiaddressed-01 (work in
              progress), July 2009.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate
              Byte Counting (ABC)", RFC 3465, February 2003.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38
              num. 5, pp. 47-52, October 2008.

Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk


   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk


   Damon Wischik
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk