Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                University Politehnica of
Intended status: Experimental                                  Bucharest
Expires: January 30, 2012                                     M. Handley
                                                              D. Wischik
                                               University College London
                                                           July 29, 2011


      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-07

Abstract

   Often endpoints are connected by multiple paths, but communications
   are usually restricted to a single path per connection.  Resource
   usage within the network would be more efficient were it possible
   for these multiple paths to be used concurrently.  Multipath TCP is
   a proposal to achieve multipath transport in TCP.

   New congestion control algorithms are needed for multipath transport
   protocols such as Multipath TCP, as single-path algorithms have a
   series of issues in the multipath context.  One of the prominent
   problems is that running existing algorithms such as standard TCP
   independently on each path would give the multipath flow more than
   its fair share at a bottleneck link traversed by more than one of
   its subflows.  Further, it is desirable that a source with multiple
   paths available will transfer more traffic using the least congested
   of the paths, achieving a property called "resource pooling", where
   a bundle of links effectively behaves like one shared link with
   bigger capacity.  This would increase the overall efficiency of the
   network and also its robustness to failure.

   This document presents a congestion control algorithm which couples
   the congestion control algorithms running on different subflows by
   linking their increase functions, and dynamically controls the
   overall aggressiveness of the multipath flow.  The result is a
   practical algorithm that is fair to TCP at bottlenecks while moving
   traffic away from congested links.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 30, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Computing alpha in Practice
     4.2.  Implementation Considerations when CWND is Expressed in
           Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Introduction

   Multipath TCP (MPTCP, [I-D.ietf-mptcp-multiaddressed]) is a set of
   extensions to regular TCP [RFC0793] that allows one TCP connection
   to be spread across multiple paths.  MPTCP distributes load through
   the creation of separate "subflows" across potentially disjoint
   paths.

   How should congestion control be performed for multipath TCP?
   First, each subflow must have its own congestion control state
   (i.e., cwnd) so that capacity on that path is matched by offered
   load.  The simplest way to achieve this goal is to run standard TCP
   congestion control on each subflow.  However, this solution is
   unsatisfactory, as it gives the multipath flow an unfair share when
   the paths taken by its different subflows share a common bottleneck.

   Bottleneck fairness is just one requirement multipath congestion
   control should meet.
   The following three goals capture the desirable
   properties of a practical multipath congestion control algorithm:

   o  Goal 1 (Improve Throughput) A multipath flow should perform at
      least as well as a single path flow would on the best of the
      paths available to it.

   o  Goal 2 (Do no harm) A multipath flow should not take up more
      capacity from any of the resources shared by its different paths
      than if it were a single flow using only one of these paths.
      This guarantees it will not unduly harm other flows.

   o  Goal 3 (Balance congestion) A multipath flow should move as much
      traffic as possible off its most congested paths, subject to
      meeting the first two goals.

   Goals 1 and 2 together ensure fairness at the bottleneck.  Goal 3
   captures the concept of resource pooling [WISCHIK]: if each
   multipath flow sends more data through its least congested path,
   the traffic in the network will move away from congested areas.
   This improves robustness and overall throughput, among other
   things.  The way to achieve resource pooling is to effectively
   "couple" the congestion control loops for the different subflows.

   We propose an algorithm that couples the additive increase functions
   of the subflows and uses unmodified TCP behavior in case of a drop.
   The algorithm relies on the traditional TCP mechanisms to detect
   drops, to retransmit data, etc.

   Detecting shared bottlenecks reliably is quite difficult, but it is
   just one part of a bigger question: how much bandwidth should a
   multipath user use in total, even if there is no shared bottleneck?

   The congestion controller aims to set the multipath flow's aggregate
   bandwidth to be the same as a regular TCP flow would get on the best
   path available to the multipath flow.  To estimate the bandwidth of
   a regular TCP flow, the multipath flow estimates loss rates and
   round-trip times and computes the target rate.  Then it adjusts the
   overall aggressiveness (parameter alpha) to achieve the desired
   rate.

   While the mechanism above always applies, its effect depends on
   whether or not the multipath TCP flow influences the link loss rates
   (low vs. high statistical multiplexing).  If MPTCP does not
   influence link loss rates, MPTCP will get the same throughput as TCP
   on the best path.  In cases with low statistical multiplexing, where
   the multipath flow influences the loss rates on the path, the
   multipath throughput will be strictly higher than what a single TCP
   would get on any of the paths.  In particular, if using two idle
   paths, the multipath throughput will be the sum of the two paths'
   throughputs.

   This algorithm ensures bottleneck fairness and fairness in the
   broader, network sense.  We acknowledge that current TCP fairness
   criteria are far from ideal, but a multipath TCP needs to be
   deployable in the current Internet.  If needed, new fairness
   criteria can be implemented by the same algorithm we propose, by
   appropriately scaling the overall aggressiveness.

   It is intended that the algorithm presented here can be applied to
   other multipath transport protocols, such as alternative multipath
   extensions to TCP, or indeed any other congestion-aware transport
   protocols.  However, for the purposes of example, this document
   will, where appropriate, refer to the MPTCP protocol.

   The design decisions and evaluation of the congestion control
   algorithm are published in [NSDI].

   The algorithm presented here only extends standard TCP congestion
   control for multipath operation.  It is foreseeable that other
   congestion controllers will be implemented for multipath transport
   to achieve the bandwidth-scaling properties of the newer congestion
   control algorithms for regular TCP (such as Compound TCP and Cubic).

3.  Coupled Congestion Control Algorithm

   The algorithm we present only applies to the increase phase of
   congestion avoidance, which specifies how the window inflates upon
   receiving an ack.  The slow start, fast retransmit, and fast
   recovery algorithms, as well as the multiplicative decrease of the
   congestion avoidance state, are the same as in standard TCP
   [RFC5681].

   Let cwnd_i be the congestion window on subflow i.  Let tot_cwnd be
   the sum of the congestion windows of all subflows in the connection.
   Let p_i, rtt_i, and mss_i be the loss rate, round-trip time (i.e.,
   the smoothed round-trip time estimate used by TCP), and maximum
   segment size on subflow i.

   We assume throughout this document that the congestion window is
   maintained in bytes, unless otherwise specified.  We briefly
   describe the algorithm for packet-based implementations of cwnd in
   Section 4.2.

   Our proposed "Linked Increases" algorithm MUST:

   o  For each ack received on subflow i, increase cwnd_i by

            alpha * bytes_acked * mss_i   bytes_acked * mss_i
      min ( --------------------------- , ------------------- )    (1)
                     tot_cwnd                    cwnd_i

   The increase formula (1) takes the minimum between the computed
   increase for the multipath subflow (the first argument to min) and
   the increase TCP would get in the same scenario (the second
   argument).  In this way, we ensure that any multipath subflow cannot
   be more aggressive than a TCP flow in the same circumstances, hence
   achieving Goal 2 (do no harm).

   "alpha" is a parameter of the algorithm that describes the
   aggressiveness of the multipath flow.  To meet Goal 1 (improve
   throughput), the value of alpha is chosen such that the aggregate
   throughput of the multipath flow is equal to the throughput a TCP
   flow would get if it ran on the best path.

   To get an intuition of what the algorithm is trying to do, consider
   the case where all the subflows have the same round-trip time and
   MSS.  In this case, the algorithm will grow the total window by
   approximately alpha*MSS per RTT.  This increase is distributed to
   the individual flows according to their instantaneous window size.
   Subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

   Note that, as in standard TCP, when tot_cwnd is large the increase
   may be 0.  In this case, the increase MUST be set to 1.  We discuss
   how to implement this formula in practice in the next section.

   We assume implementations use an approach similar to appropriate
   byte counting (ABC, [RFC3465]), where the bytes_acked variable
   records the number of bytes newly acknowledged.  If this is not the
   case, bytes_acked SHOULD be set to mss_i.

   To compute tot_cwnd, it is an easy mistake to simply sum up cwnd_i
   across all subflows: when a flow is in fast retransmit, its cwnd is
   typically inflated and no longer represents the real congestion
   window.  The correct behavior is to use the ssthresh value for flows
   in fast retransmit when computing tot_cwnd.  To cater for
   connections that are application limited, the computation should
   consider the minimum between flight_size_i and cwnd_i, and
   flight_size_i and ssthresh_i, where appropriate.
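
   As an illustration only, the following C sketch shows one way the
   per-ack processing of formula (1) might look.  The structure and
   field names are hypothetical (they are not defined by this
   document), and alpha is kept as a floating-point value here purely
   for readability; Section 4.1 describes the integer-arithmetic form
   an implementation would more likely use.

   #include <stdint.h>

   /* Hypothetical per-subflow state; names are illustrative only. */
   struct subflow {
       uint64_t cwnd;     /* congestion window of subflow i, in bytes */
       uint32_t mss;      /* maximum segment size of subflow i        */
   };

   /* Apply formula (1) when an ack arrives on subflow i.  "alpha" is
    * assumed to have been computed per formula (2); tot_cwnd is the
    * sum of the subflows' windows, using ssthresh for subflows in
    * fast retransmit as discussed above.                             */
   static void linked_increase(struct subflow *sf, uint64_t bytes_acked,
                               double alpha, uint64_t tot_cwnd)
   {
       uint64_t mp_inc  = (uint64_t)(alpha * bytes_acked * sf->mss) /
                          tot_cwnd;
       uint64_t tcp_inc = bytes_acked * sf->mss / sf->cwnd;
       uint64_t inc     = mp_inc < tcp_inc ? mp_inc : tcp_inc;

       if (inc == 0)      /* the increase rounded down to 0 */
           inc = 1;
       sf->cwnd += inc;
   }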

   The total throughput of a multipath flow depends on the value of
   alpha and the loss rates, maximum segment sizes, and round-trip
   times of its paths.  Since we require that the total throughput is
   no worse than the throughput a single TCP would get on the best
   path, it is impossible to choose a priori a single value of alpha
   that achieves the desired throughput on every occasion.  Hence,
   alpha must be computed based on the observed properties of the
   paths.

   The formula to compute alpha is:

                               cwnd_i
                         max  --------
                          i          2
                                rtt_i
      alpha = tot_cwnd * -------------------                       (2)
                          /      cwnd_i \ 2
                          |  sum ------ |
                          \   i  rtt_i  /

   Formula (2) is derived by equalizing the rate of the multipath flow
   with the rate of a TCP flow running on the best path, and solving
   for alpha.

4.  Implementation Considerations

   Equation (2) implies that alpha is a floating-point value.  This
   would require performing costly floating-point operations whenever
   an ACK is received.  Further, in many kernels floating-point
   operations are disabled.  There is an easy way to approximate the
   above calculations using integer arithmetic.

4.1.  Computing alpha in Practice

   Let alpha_scale be an integer.  When computing alpha, use
   alpha_scale * tot_cwnd instead of tot_cwnd, and do all the
   operations in integer arithmetic.

   Then, scale down the increase per ack by alpha_scale.  The resulting
   algorithm is a simple change from Equation (1):

   o  For each ack received on subflow i, increase cwnd_i by

            alpha * bytes_acked * mss_i   bytes_acked * mss_i
      min ( --------------------------- , ------------------- )    (3)
               alpha_scale * tot_cwnd           cwnd_i

   The alpha_scale parameter denotes the precision we want for
   computing alpha.  Observe that the errors in computing the numerator
   or the denominator in the formula for alpha are quite small, as the
   cwnd in bytes is typically much larger than the RTT (measured in
   ms).

   With these changes, all the operations can be done using integer
   arithmetic.  We propose that alpha_scale be a small power of two, to
   allow using faster shift operations instead of multiplication and
   division.  Our experiments show that using alpha_scale=512 works
   well in a wide range of scenarios.  Increasing alpha_scale increases
   precision but also increases the risk of overflow when computing
   alpha.  Using 64-bit operations would solve this issue.  Another
   option is to dynamically adjust alpha_scale when computing alpha; in
   this way, we avoid overflow and obtain maximum precision.

   It is possible to implement the algorithm by calculating tot_cwnd on
   each ack; however, this would be costly, especially when the number
   of subflows is large.  To avoid this overhead, the implementation
   MAY choose to maintain a new per-connection state variable called
   tot_cwnd.  If it does so, the implementation will update the
   tot_cwnd value whenever the individual subflows' windows are
   updated.  Updating only requires one more addition or subtraction
   operation compared to the regular, per-subflow congestion control
   code, so its performance impact should be minimal.
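
   As a hedged illustration of the integer-arithmetic computation just
   described (the names, types, and units below are assumptions of this
   sketch, not requirements), alpha could be computed roughly as
   follows, with congestion windows kept in bytes and smoothed RTTs in
   milliseconds.  The returned value is the alpha_scale-scaled alpha
   intended for use with formula (3).

   #include <stdint.h>

   #define ALPHA_SCALE 512    /* a small power of two, as suggested above */

   /* Hypothetical per-subflow state; names are illustrative only. */
   struct subflow {
       uint64_t cwnd;         /* congestion window, in bytes     */
       uint32_t rtt;          /* smoothed round-trip time, in ms */
   };

   /* Compute alpha_scale * alpha per formula (2), using only integer
    * arithmetic.  Overflow protection (wider math or a dynamically
    * chosen alpha_scale) is omitted for brevity.                     */
   static uint64_t compute_alpha_scaled(const struct subflow *sf, int n,
                                        uint64_t tot_cwnd)
   {
       uint64_t max_num = 0;  /* max_i cwnd_i / rtt_i^2 */
       uint64_t denom   = 0;  /* sum_i cwnd_i / rtt_i   */

       for (int i = 0; i < n; i++) {
           uint64_t num = sf[i].cwnd / ((uint64_t)sf[i].rtt * sf[i].rtt);
           if (num > max_num)
               max_num = num;
           denom += sf[i].cwnd / sf[i].rtt;
       }
       if (denom == 0)        /* degenerate case: behave like one TCP flow */
           return ALPHA_SCALE;
       return ALPHA_SCALE * tot_cwnd * max_num / (denom * denom);
   }

   Ignoring integer truncation, with a single subflow the expression
   reduces to ALPHA_SCALE, so the scaled increase of formula (3)
   degenerates to the standard TCP increase, as expected.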

   Computing alpha per ack is also costly.  We propose that alpha be a
   per-connection variable, computed whenever there is a drop and once
   per RTT otherwise.  More specifically, let cwnd_new_i be the new
   value of subflow i's congestion window after it is inflated or after
   a drop.  Update alpha only if the quotient of cwnd_i/mss_i differs
   from the quotient of cwnd_new_i/mss_i.

   In certain cases with small RTTs, computing alpha can still be
   expensive.  We observe that if RTTs were constant, it would be
   sufficient to compute alpha once per drop, as alpha does not change
   between drops (the insight here is that cwnd_i/cwnd_j = constant as
   long as both windows increase).  Experimental results show that even
   if round-trip times are not constant, using the average round-trip
   time per sawtooth instead of the instantaneous round-trip time
   (i.e., TCP's smoothed RTT estimator) gives good precision for
   computing alpha.  Hence, it is possible to compute alpha only once
   per drop using a modified version of equation (2), where rtt_i is
   replaced with rtt_avg_i.

   If using the average round-trip time, rtt_avg_i will be computed by
   sampling rtt_i whenever the window can accommodate one more packet,
   i.e., when cwnd / mss < (cwnd+increase)/mss.  The samples are
   averaged once per sawtooth into rtt_avg_i.  This sampling ensures
   that there is no sampling bias for larger windows.

   Given tot_cwnd and alpha, the congestion control algorithm is run
   for each subflow independently, with similar complexity to the
   standard TCP increase code [RFC5681].

4.2.  Implementation Considerations when CWND is Expressed in Packets

   When the congestion control algorithm maintains cwnd in packets
   rather than bytes, the algorithms above must change to take into
   account the path mss.

   To compute the increase when an ack is received, the implementation
   for multipath congestion control is a simple extension of the
   standard TCP code.  In standard TCP, cwnd_cnt is an additional state
   variable that tracks the number of segments acked since the last
   cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and
   then cwnd_cnt is set to 0.

   In the multipath case, cwnd_cnt_i is maintained for each subflow as
   above, and cwnd_i is increased by 1 when cwnd_cnt_i >
   max(alpha_scale * tot_cwnd / alpha, cwnd_i), as shown in the sketch
   at the end of this section.

   When computing alpha for packet-based stacks, the errors in
   computing the terms in the denominator are larger (this is because
   cwnd is much smaller and rtt may be comparatively large).  Let max
   be the index of the subflow used in the numerator.  To reduce
   errors, it is easiest to move rtt_max (once calculated) from the
   numerator to the denominator, changing equation (2) to obtain the
   equivalent formula below.

                                                 cwnd_max
      alpha = alpha_scale * tot_cwnd * --------------------------   (4)
                                       /     rtt_max * cwnd_i \ 2
                                       |  sum ---------------- |
                                       \   i        rtt_i      /

   Note that the calculation of alpha does not take into account the
   path mss and is the same for stacks that keep cwnd in bytes or
   packets.  With this formula, the algorithm for computing alpha will
   match the rate of TCP on the best path in B/s for byte-oriented
   stacks, and in packets/s for packet-based stacks.  In practice, mss
   rarely changes between paths, so this should not be a problem.
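
   As an illustration only (the state and names below are hypothetical,
   not part of this specification), the cwnd_cnt_i rule described above
   might be realized along these lines in a packet-based stack, using
   the alpha_scale-scaled alpha of formula (4).

   #include <stdint.h>

   /* Hypothetical packet-based subflow state; names are illustrative. */
   struct subflow_pkts {
       uint32_t cwnd;       /* congestion window, in segments          */
       uint32_t cwnd_cnt;   /* segments acked since the last increment */
   };

   /* Coupled increase for packet-based stacks, called once per newly
    * acked segment during congestion avoidance.  "alpha" is the
    * alpha_scale-scaled value from formula (4); tot_cwnd is the
    * connection's total window, in segments.                          */
   static void linked_increase_pkts(struct subflow_pkts *sf,
                                    uint64_t alpha, uint64_t tot_cwnd,
                                    uint64_t alpha_scale)
   {
       uint64_t limit;

       if (alpha == 0)              /* guard against division by zero */
           alpha = 1;
       limit = alpha_scale * tot_cwnd / alpha;
       if (limit < sf->cwnd)        /* never grow faster than standard TCP */
           limit = sf->cwnd;

       sf->cwnd_cnt++;
       if (sf->cwnd_cnt > limit) {
           sf->cwnd++;
           sf->cwnd_cnt = 0;
       }
   }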

   However, it is simple to derive formulae allowing packet-based
   stacks to achieve byte-rate fairness (and vice versa) if needed.  In
   particular, for packet-based stacks wanting byte-rate fairness,
   equation (4) above changes as follows: cwnd_max is replaced by
   cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i *
   mss_i.

5.  Discussion

   The algorithm we have presented fully achieves Goals 1 and 2, but it
   does not achieve full resource pooling (Goal 3).  Resource pooling
   requires that no traffic should be transferred on links with higher
   loss rates.  To achieve perfect resource pooling, one must couple
   both the increase and decrease of congestion windows across
   subflows, as in [KELLY].

   There are a few problems with such a fully coupled controller.
   First, it will insufficiently probe paths with high loss rates and
   will fail to detect free capacity when it becomes available.
   Second, such controllers tend to exhibit "flappiness": when the
   paths have similar levels of congestion, the congestion controller
   will tend to allocate all the window to one random subflow and
   allocate zero window to the other subflows.  The controller will
   perform random flips between these stable points.  This does not
   seem desirable in general, and it is particularly bad when the
   achieved rates depend on the RTT (as in the current Internet): in
   such a case, the resulting rate will fluctuate unpredictably
   depending on which state the controller is in, hence violating
   Goal 1.

   By coupling only the increases, our proposal probes high-loss paths,
   detecting free capacity more quickly.  Our proposal does not suffer
   from flappiness, but it also achieves less resource pooling.  The
   algorithm will allocate window to the subflows such that p_i *
   cwnd_i = constant, for all i.  Thus, when the loss rates of the
   subflows are equal, each subflow will get an equal window, removing
   flappiness.  When the loss rates differ, progressively more window
   will be allocated to the flow with the lower loss rate (for example,
   a subflow seeing twice the loss rate of another will be allocated
   half as much window).  In contrast, perfect resource pooling
   requires that all the window should be allocated on the path with
   the lowest loss rate.  Further details can be found in [NSDI].

6.  Security Considerations

   One security concern relates to what we call the traffic-shifting
   attack: on-path attackers can drop packets belonging to a multipath
   subflow, which in turn makes the path seem congested and will force
   the sender's congestion controller to avoid that path and push more
   data over alternate subflows.

   The attacker's goal is to create congestion on the corresponding
   alternative paths.  This behavior is entirely feasible but will have
   only minor effects: by design, the coupled congestion controller is
   less (or similarly) aggressive on any of its paths than a single TCP
   flow would be.  Thus, the biggest effect this attack can have is to
   make a multipath subflow as aggressive as a single TCP flow.

   Another effect of the traffic-shifting attack is that an observer on
   the new path can monitor all the traffic, whereas before it could
   only see a subset of the traffic.  We believe that if privacy is
   needed, splitting traffic across multiple paths with MPTCP is not
   the right solution in the first place; end-to-end encryption should
   be used instead.

   Besides the traffic-shifting attack mentioned above, the coupled
   congestion control algorithm defined in this draft adds no other
   security considerations to those found in
   [I-D.ietf-mptcp-multiaddressed] and [RFC6181].
   Detailed security analysis for the Multipath TCP protocol itself is
   included in [I-D.ietf-mptcp-multiaddressed] and [RFC6181].

7.  Acknowledgements

   We thank Christoph Paasch for his suggestions for computing alpha in
   packet-based stacks.  The authors are supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.  The views expressed here are those of the
   author(s) only.  The European Commission is not liable for any use
   that may be made of the information in this document.

8.  IANA Considerations

   This document does not require any action from IANA.

9.  References

9.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

9.2.  Informative References

   [I-D.ietf-mptcp-multiaddressed]
              Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", draft-ietf-mptcp-multiaddressed-04 (work in
              progress), July 2011.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate Byte
              Counting (ABC)", RFC 3465, February 2003.

   [RFC6181]  Bagnulo, M., "Threat Analysis for TCP Extensions for
              Multipath Operation with Multiple Addresses", RFC 6181,
              March 2011.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num.
              5, pp. 47-52, October 2008.

Authors' Addresses

   Costin Raiciu
   University Politehnica of Bucharest
   Splaiul Independentei 313
   Bucharest
   Romania

   Email: costin.raiciu@cs.pub.ro

   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Damon Wischik
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk