Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: December 18, 2011                    University College London
                                                           June 16, 2011


      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-05

Abstract

Often endpoints are connected by multiple paths, but communications are usually restricted to a single path per connection. Resource usage within the network would be more efficient were it possible for these multiple paths to be used concurrently. Multipath TCP is a proposal to achieve multipath transport in TCP.

New congestion control algorithms are needed for multipath transport protocols such as Multipath TCP, as single path algorithms have a series of issues in the multipath context. One of the prominent problems is that running existing algorithms such as standard TCP independently on each path would give the multipath flow more than its fair share at a bottleneck link traversed by more than one of its subflows. Further, it is desirable that a source with multiple paths available will transfer more traffic using the least congested of the paths, hence achieving resource pooling. This would increase the overall efficiency of the network and also its robustness to failure.

This document presents a congestion control algorithm which couples the congestion control algorithms running on different subflows by linking their increase functions, and dynamically controls the overall aggressiveness of the multipath flow. The result is a practical algorithm that is fair to TCP at bottlenecks while moving traffic away from congested links.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 18, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Computing alpha in Practice
     4.2.  Implementation Considerations when CWND is Expressed in Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Introduction

Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of extensions to regular TCP [RFC0793] that allows one TCP connection to be spread across multiple paths. MPTCP distributes load through the creation of separate "subflows" across potentially disjoint paths.

How should congestion control be performed for multipath TCP? First, each subflow must have its own congestion control state (i.e., cwnd) so that capacity on that path is matched by offered load. The simplest way to achieve this goal is to run standard TCP congestion control on each subflow. However, this solution is unsatisfactory, as it gives the multipath flow an unfair share when the paths taken by its different subflows share a common bottleneck.

Bottleneck fairness is just one requirement multipath congestion control should meet. The following three goals capture the desirable properties of a practical multipath congestion control algorithm:

o  Goal 1 (Improve Throughput): A multipath flow should perform at least as well as a single-path flow would on the best of the paths available to it.
o  Goal 2 (Do no harm): A multipath flow should not take up more capacity from any of the resources shared by its different paths than if it were a single flow using only one of these paths. This guarantees it will not unduly harm other flows.

o  Goal 3 (Balance congestion): A multipath flow should move as much traffic as possible off its most congested paths, subject to meeting the first two goals.

Goals 1 and 2 together ensure fairness at the bottleneck. Goal 3 captures the concept of resource pooling [WISCHIK]: if each multipath flow sends more data through its least congested path, the traffic in the network will move away from congested areas. This improves robustness and overall throughput, among other things. The way to achieve resource pooling is to effectively "couple" the congestion control loops for the different subflows.

We propose an algorithm that couples only the additive increase function of the subflows, and uses unmodified TCP behavior in case of a drop. The algorithm relies on the traditional TCP mechanisms to detect drops, to retransmit data, etc.

Detecting shared bottlenecks reliably is quite difficult, but it is just one part of a bigger question: how much bandwidth should a multipath user use in total, even if there is no shared bottleneck?

The congestion controller aims to set the multipath flow's aggregate bandwidth to be the same as a regular TCP flow would get on the best path available to the multipath flow. To estimate the bandwidth of a regular TCP flow, the multipath flow estimates loss rates and round-trip times and computes the target rate. It then adjusts the overall aggressiveness (parameter alpha) to achieve the desired rate.

While the mechanism above always applies, its effect depends on whether or not the multipath TCP flow influences the link loss rates (low vs. high statistical multiplexing, respectively). If MPTCP does not influence link loss rates, MPTCP will get the same throughput as TCP on the best path. In cases with low statistical multiplexing, where the multipath flow influences the loss rates on the path, the multipath throughput will be strictly higher than a single TCP flow would get on any of the paths. In particular, if using two idle paths, the multipath throughput will be the sum of the two paths' throughputs.

This algorithm ensures bottleneck fairness and fairness in the broader, network sense. We acknowledge that current TCP fairness criteria are far from ideal, but a multipath TCP needs to be deployable in the current Internet. If needed, new fairness criteria can be implemented by the same algorithm we propose by appropriately scaling the overall aggressiveness.

It is intended that the algorithm presented here can be applied to other multipath transport protocols, such as alternative multipath extensions to TCP, or indeed any other congestion-aware transport protocols. However, for the purposes of example, this document will, where appropriate, refer to the MPTCP protocol.

The design decisions and evaluation of the congestion control algorithm are published in [NSDI].

The algorithm presented here only extends standard TCP congestion control for multipath operation.
It is foreseeable that other congestion controllers will be implemented for multipath transport to achieve the bandwidth-scaling properties of the newer congestion control algorithms for regular TCP (such as Compound TCP and Cubic).

3. Coupled Congestion Control Algorithm

The algorithm we present applies only to the increase phase of congestion avoidance, which specifies how the window inflates upon receiving an ACK. The slow start, fast retransmit, and fast recovery algorithms, as well as the multiplicative decrease of congestion avoidance, are the same as in standard TCP [RFC5681].

Let cwnd_i be the congestion window on subflow i. Let tot_cwnd be the sum of the congestion windows of all subflows in the connection. Let p_i, rtt_i, and mss_i be the loss rate, round-trip time (i.e., the smoothed round-trip time estimate used by TCP), and maximum segment size on subflow i.

We assume throughout this document that the congestion window is maintained in bytes, unless otherwise specified. We briefly describe the algorithm for packet-based implementations of cwnd in Section 4.2.

Our proposed "Linked Increases" algorithm MUST:

o  For each ACK received on subflow i, increase cwnd_i by min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

The increase formula takes the minimum between the computed increase for the multipath subflow (the first argument to min) and the increase TCP would get in the same scenario (the second argument). In this way, we ensure that any multipath subflow cannot be more aggressive than a TCP flow in the same circumstances, hence achieving Goal 2 (do no harm).

"alpha" is a parameter of the algorithm that describes the aggressiveness of the multipath flow. To meet Goal 1 (improve throughput), the value of alpha is chosen such that the aggregate throughput of the multipath flow is equal to the throughput a TCP flow would get if it ran on the best path.

To get an intuition of what the algorithm is trying to do, consider the case where all the subflows have the same round-trip time and MSS. In this case, the algorithm will grow the total window by approximately alpha*MSS per RTT. This increase is distributed to the individual flows according to their instantaneous window sizes: subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

Note that, as in standard TCP, when tot_cwnd is large the increase may be 0. In this case, the increase MUST be set to 1. We discuss how to implement this formula in practice in the next section.

We assume implementations use an approach similar to appropriate byte counting (ABC, [RFC3465]), where the bytes_acked variable records the number of bytes newly acknowledged. If this is not true, bytes_acked SHOULD be set to mss_i.

When computing tot_cwnd, it is an easy mistake simply to sum up cwnd_i across all subflows: when a flow is in fast retransmit, its cwnd is typically inflated and no longer represents the real congestion window. The correct behavior is to use the ssthresh value for flows in fast retransmit when computing tot_cwnd. To cater for connections that are application limited, the computation should use the minimum of flight_size_i and cwnd_i (or of flight_size_i and ssthresh_i, where appropriate).
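The per-ACK update and the tot_cwnd computation described above can be sketched in C as follows. This is only an illustration, not code from any existing stack: the struct, field, and function names are invented for the example, alpha is shown as a floating-point parameter purely for readability (Section 4.1 describes the integer-arithmetic version), and an ABC-style stack that reports bytes_acked is assumed.

   #include <stdint.h>

   /* Illustrative per-subflow state; all names are invented. */
   struct subflow {
       uint32_t cwnd;          /* congestion window, in bytes        */
       uint32_t ssthresh;      /* slow-start threshold, in bytes     */
       uint32_t flight_size;   /* bytes currently outstanding        */
       uint32_t mss;           /* maximum segment size, in bytes     */
       int      in_fast_retransmit;
   };

   /* tot_cwnd: sum the subflow windows, substituting ssthresh for an
    * inflated cwnd during fast retransmit and capping each term by
    * flight_size for application-limited connections. */
   static uint64_t compute_tot_cwnd(const struct subflow *sf, int n)
   {
       uint64_t tot = 0;
       for (int i = 0; i < n; i++) {
           uint64_t w = sf[i].in_fast_retransmit ? sf[i].ssthresh
                                                 : sf[i].cwnd;
           if (sf[i].flight_size < w)
               w = sf[i].flight_size;
           tot += w;
       }
       return tot;
   }

   /* Linked-increases step for one ACK on subflow i; alpha is the
    * per-connection aggressiveness parameter. */
   static void coupled_increase(struct subflow *sf_i, double alpha,
                                uint64_t tot_cwnd, uint32_t bytes_acked)
   {
       uint64_t coupled  = (uint64_t)(alpha * bytes_acked * sf_i->mss
                                      / tot_cwnd);
       uint64_t tcp_like = (uint64_t)bytes_acked * sf_i->mss / sf_i->cwnd;
       uint64_t inc = coupled < tcp_like ? coupled : tcp_like;

       if (inc == 0)   /* the text requires an increase of at least 1 */
           inc = 1;
       sf_i->cwnd += (uint32_t)inc;
   }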
The total throughput of a multipath flow depends on the value of alpha and the loss rates, maximum segment sizes, and round-trip times of its paths. Since we require that the total throughput is no worse than the throughput a single TCP flow would get on the best path, it is impossible to choose a priori a single value of alpha that achieves the desired throughput on every occasion. Hence, alpha must be computed based on the observed properties of the paths.

The formula to compute alpha is:

                              cwnd_i
                        max --------
                         i        2
                             rtt_i
      alpha = tot_cwnd * ----------------
                         /      cwnd_i \ 2
                         | sum ------- |
                         \  i   rtt_i  /

The formula is derived by equalizing the rate of the multipath flow with the rate of a TCP flow running on the best path, and solving for alpha.

4. Implementation Considerations

The formula for alpha above implies that alpha is a floating-point value. This would require performing costly floating-point operations whenever an ACK is received. Further, in many kernels floating-point operations are disabled. There is an easy way to approximate the above calculations using integer arithmetic.

4.1. Computing alpha in Practice

Let alpha_scale be an integer. When computing alpha, use alpha_scale * tot_cwnd instead of tot_cwnd, and do all the operations in integer arithmetic. Then, scale down the increase per ACK by alpha_scale. The algorithm is:

o  For each ACK received on subflow i, increase cwnd_i by min((alpha*bytes_acked*mss_i/tot_cwnd)/alpha_scale, bytes_acked*mss_i/cwnd_i).

alpha_scale denotes the precision we want for computing alpha. Observe that the errors in computing the numerator or the denominator in the formula for alpha are quite small, as the cwnd in bytes is typically much larger than the RTT (measured in ms).

With these changes, all the operations can be done using integer arithmetic. We propose that alpha_scale be a small power of two, to allow using faster shift operations instead of multiplication and division. Our experiments show that using alpha_scale=512 works well in a wide range of scenarios. Increasing alpha_scale increases precision but also increases the risk of overflow when computing alpha. Using 64-bit operations would solve this issue. Another option is to dynamically adjust alpha_scale when computing alpha; in this way we avoid overflow and obtain maximum precision.

It is possible to implement the algorithm by calculating tot_cwnd on each ACK; however, this would be costly, especially when the number of subflows is large. To avoid this overhead, the implementation MAY choose to maintain a new per-connection state variable called tot_cwnd. If it does so, the implementation will update the tot_cwnd value whenever the individual subflows' windows are updated. Updating only requires one more addition or subtraction operation compared to the regular, per-subflow congestion control code, so its performance impact should be minimal.

Computing alpha per ACK is also costly. We propose that alpha be a per-connection variable, computed whenever there is a drop and once per RTT otherwise. More specifically, let cwnd_new_i be the new value of the congestion window on subflow i after it is inflated or after a drop. Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
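As a concrete illustration of the integer-arithmetic computation, the sketch below computes a scaled alpha once per RTT or per drop. It is a minimal example under the assumptions stated above (cwnd in bytes, RTT in milliseconds, alpha_scale a small power of two); the names are invented rather than taken from any particular implementation, and a production stack would additionally guard against overflow or adjust alpha_scale dynamically as discussed.

   #include <stdint.h>

   #define ALPHA_SCALE 512    /* small power of two, as suggested above */

   /* Minimal per-subflow state needed for the alpha computation. */
   struct subflow_state {
       uint64_t cwnd;         /* congestion window, in bytes            */
       uint64_t rtt_ms;       /* smoothed or per-sawtooth average RTT, ms */
   };

   /* Integer-arithmetic alpha, following Section 4.1:
    *
    *           alpha_scale * tot_cwnd * max_i (cwnd_i / rtt_i^2)
    *  alpha = ---------------------------------------------------
    *                     ( sum_i cwnd_i / rtt_i )^2
    *
    * With cwnd in bytes and rtt in milliseconds, the integer divisions
    * keep enough precision, per the observation in the text. */
   static uint64_t compute_alpha(const struct subflow_state *sf, int n,
                                 uint64_t tot_cwnd)
   {
       uint64_t best  = 0;    /* max_i cwnd_i / rtt_i^2 */
       uint64_t denom = 0;    /* sum_i cwnd_i / rtt_i   */

       for (int i = 0; i < n; i++) {
           uint64_t rtt  = sf[i].rtt_ms ? sf[i].rtt_ms : 1;
           uint64_t term = sf[i].cwnd / (rtt * rtt);

           if (term > best)
               best = term;
           denom += sf[i].cwnd / rtt;
       }
       if (denom == 0)
           return ALPHA_SCALE;   /* degenerate case: behave like plain TCP */

       /* For very large windows this product can overflow 64 bits; a real
        * implementation would clamp it or shrink alpha_scale dynamically,
        * as discussed above. */
       return (ALPHA_SCALE * tot_cwnd * best) / (denom * denom);
   }

The value returned is already multiplied by alpha_scale, so the per-ACK increase divides by alpha_scale exactly as in the bullet above.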
In certain cases with small RTTs, computing alpha can still be expensive. We observe that if RTTs were constant, it would be sufficient to compute alpha once per drop, as alpha does not change between drops (the insight here is that cwnd_i/cwnd_j is constant as long as both windows increase). Experimental results show that even if round-trip times are not constant, using the average round-trip time per sawtooth instead of the instantaneous round-trip time (i.e., TCP's smoothed RTT estimate) gives good precision for computing alpha. Hence, it is possible to compute alpha only once per drop according to the formula above, by replacing rtt_i with rtt_avg_i.

If using the average round-trip time, rtt_avg_i will be computed by sampling rtt_i whenever the window can accommodate one more packet, i.e., when cwnd/mss < (cwnd + increase)/mss. The samples are averaged once per sawtooth into rtt_avg_i. This sampling ensures that there is no sampling bias for larger windows.

Given tot_cwnd and alpha, the congestion control algorithm is run for each subflow independently, with similar complexity to the standard TCP increase code [RFC5681].

4.2. Implementation Considerations when CWND is Expressed in Packets

When the congestion control algorithm maintains cwnd in packets rather than bytes, the algorithms above must change to take into account the path mss.

To compute the increase when an ACK is received, the implementation for multipath congestion control is a simple extension of the standard TCP code. In standard TCP, cwnd_cnt is an additional state variable that tracks the number of segments acked since the last cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and cwnd_cnt is then set to 0.

In the multipath case, cwnd_cnt_i is maintained for each subflow as above, and cwnd_i is increased by 1 when cwnd_cnt_i > max(alpha_scale * tot_cwnd / alpha, cwnd_i); see the sketch at the end of this section.

When computing alpha for packet-based stacks, the errors in computing the terms in the denominator are larger (this is because cwnd is much smaller and rtt may be comparatively large). Let max be the index of the subflow used in the numerator. To reduce errors, it is easiest to move rtt_max (once calculated) from the numerator to the denominator, obtaining the equivalent formula below:

                                                cwnd_max
      alpha = alpha_scale * tot_cwnd * ---------------------------
                                       /      rtt_max * cwnd_i \ 2
                                       | sum ------------------ |
                                       \  i        rtt_i        /

Note that the formula for computing alpha does not take into account the path mss and is the same for stacks that keep cwnd in bytes or packets. With this formula, the algorithm for computing alpha will match the rate of TCP on the best path in bytes per second for byte-oriented stacks, and in packets per second for packet-based stacks. In practice, mss rarely differs between paths, so this should not be a problem.

However, it is simple to derive formulae allowing packet-based stacks to achieve byte-rate fairness (and vice versa) if needed. In particular, for packet-based stacks wanting byte-rate fairness, the formula above changes as follows: cwnd_max is replaced by cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i * mss_i.
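A sketch of the packet-based increase, mirroring the cwnd_cnt mechanism described earlier in this section, is given below. Again, the type and function names are invented and the code is only illustrative; alpha is assumed to be the per-connection value already scaled by alpha_scale, as computed in Section 4.1, and tot_cwnd is assumed to be in packets.

   #include <stdint.h>

   #define ALPHA_SCALE 512            /* same scaling as in Section 4.1   */

   struct pkt_subflow {
       uint32_t cwnd;                 /* congestion window, in packets    */
       uint32_t cwnd_cnt;             /* packets acked since last growth  */
   };

   /* Called for each newly acked packet on this subflow during
    * congestion avoidance. */
   static void coupled_increase_pkts(struct pkt_subflow *sf,
                                     uint64_t tot_cwnd, uint64_t alpha)
   {
       /* cwnd grows by one packet only after
        * max(alpha_scale * tot_cwnd / alpha, cwnd) packets are acked;
        * the max() term keeps the subflow no more aggressive than TCP. */
       uint64_t threshold = (ALPHA_SCALE * tot_cwnd) / (alpha ? alpha : 1);

       if (threshold < sf->cwnd)
           threshold = sf->cwnd;

       sf->cwnd_cnt++;
       if (sf->cwnd_cnt > threshold) {
           sf->cwnd++;
           sf->cwnd_cnt = 0;
       }
   }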
5. Discussion

To achieve perfect resource pooling, one must couple both the increase and decrease of congestion windows across subflows, as in [KELLY]. Yet this tends to exhibit "flappiness": when the paths have similar levels of congestion, the congestion controller will tend to allocate all the window to one random subflow and zero window to the other subflows. The controller will perform random flips between these stable points. This does not seem desirable in general, and it is particularly bad when the achieved rates depend on the RTT (as in the current Internet): in such a case, the resulting rate will fluctuate unpredictably depending on which state the controller is in, hence violating Goal 1.

By coupling only the increases, our proposal removes flappiness but also reduces the extent of resource pooling the protocol achieves. The algorithm will allocate window to the subflows such that p_i * cwnd_i = constant, for all i. Thus, when the loss rates of the subflows are equal, each subflow will get an equal window, removing flappiness. When the loss rates differ, progressively more window will be allocated to the flow with the lower loss rate; for example, a subflow seeing twice the loss rate of another will be allocated half as much window. In contrast, perfect resource pooling requires that all the window be allocated on the path with the lowest loss rate.

6. Security Considerations

None.

Detailed security analysis for the Multipath TCP protocol itself is included in [I-D.ford-mptcp-multiaddressed] and RFC 6181 [RFC6181].

7. Acknowledgements

We thank Christoph Paasch for his suggestions for computing alpha in packet-based stacks. The authors are supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only. The European Commission is not liable for any use that may be made of the information in this document.

8. IANA Considerations

This document does not require any action from IANA.

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

9.2. Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for
              Multipath Operation with Multiple Addresses",
              draft-ford-mptcp-multiaddressed-03 (work in progress),
              March 2010.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate Byte
              Counting (ABC)", RFC 3465, February 2003.

   [RFC6181]  Bagnulo, M., "Threat Analysis for TCP Extensions for
              Multipath Operation with Multiple Addresses", RFC 6181,
              March 2011.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num.
              5, pp. 47-52, October 2008.
Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk

   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Damon Wischik
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk