Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: September 15, 2011                    University College London
                                                          March 14, 2011

      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-02

Abstract

   Often endpoints are connected by multiple paths, but communications
   are usually restricted to a single path per connection.  Resource
   usage within the network would be more efficient were it possible
   for these multiple paths to be used concurrently.  Multipath TCP is
   a proposal to achieve multipath transport in TCP.

   New congestion control algorithms are needed for multipath transport
   protocols such as Multipath TCP, as single path algorithms have a
   series of issues in the multipath context.  One of the prominent
   problems is that running existing algorithms such as TCP New Reno
   independently on each path would give the multipath flow more than
   its fair share at a bottleneck link traversed by more than one of
   its subflows.  Further, it is desirable that a source with multiple
   paths available will transfer more traffic using the least congested
   of the paths, hence achieving resource pooling.  This would increase
   the overall efficiency of the network and also its robustness to
   failure.

   This document presents a congestion control algorithm which couples
   the congestion control algorithms running on different subflows by
   linking their increase functions, and dynamically controls the
   overall aggressiveness of the multipath flow.  The result is a
   practical algorithm that is fair to TCP at bottlenecks while moving
   traffic away from congested links.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 15, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Implementation Considerations when CWND is Expressed in
           Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

2.  Introduction

   Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of
   extensions to regular TCP [RFC0793] that allows one TCP connection
   to be spread across multiple paths.  MPTCP distributes load through
   the creation of separate "subflows" across potentially disjoint
   paths.

   How should congestion control be performed for multipath TCP?
   First, each subflow must have its own congestion control state
   (i.e., cwnd) so that capacity on that path is matched by offered
   load.  The simplest way to achieve this goal is simply to run TCP
   New Reno congestion control [RFC5681] on each subflow.  However,
   this solution is unsatisfactory, as it gives the multipath flow an
   unfair share when the paths taken by its different subflows share a
   common bottleneck.
   Bottleneck fairness is just one requirement multipath congestion
   control should meet.  The following three goals capture the
   desirable properties of a practical multipath congestion control
   algorithm:

   o  Goal 1 (Improve Throughput) A multipath flow should perform at
      least as well as a single path flow would on the best of the
      paths available to it.

   o  Goal 2 (Do no harm) A multipath flow should not take up more
      capacity on any one of its paths than if it were a single path
      flow using only that route.  This guarantees it will not unduly
      harm other flows.

   o  Goal 3 (Balance congestion) A multipath flow should move as much
      traffic as possible off its most congested paths, subject to
      meeting the first two goals.

   Goals 1 and 2 together ensure fairness at the bottleneck.  Goal 3
   captures the concept of resource pooling [WISCHIK]: if each
   multipath flow sends more data through its least congested path,
   the traffic in the network will move away from congested areas.
   This improves robustness and overall throughput, among other
   things.  The way to achieve resource pooling is to effectively
   "couple" the congestion control loops for the different subflows.

   We propose an algorithm that couples only the additive increase
   function of the subflows, and uses unmodified TCP New Reno behavior
   in case of a drop.  The algorithm relies on the traditional TCP
   mechanisms to detect drops, to retransmit data, etc.

   Detecting shared bottlenecks reliably is quite difficult, but it is
   just one part of a bigger question: how much bandwidth should a
   multipath user use in total, even if there is no shared bottleneck?

   Our solution sets the multipath flow's aggregate bandwidth to be
   the same bandwidth a regular TCP flow would get on the best path
   available to the multipath flow.  To estimate the bandwidth of a
   regular TCP flow, the multipath flow estimates loss rates and round
   trip times and computes the target rate.  It then adjusts its
   overall aggressiveness (parameter alpha) to achieve the desired
   rate.

   We note that in cases with low statistical multiplexing (where the
   multipath flow influences the loss rates on its paths), the
   multipath throughput will be strictly higher than a single TCP flow
   would get on any of the paths.  In particular, when using two idle
   paths, the multipath throughput will be the sum of the throughputs
   of the two paths.

   This algorithm ensures bottleneck fairness and fairness in the
   broader, network sense.  We acknowledge that current TCP fairness
   criteria are far from ideal, but a multipath TCP needs to be
   deployable in the current Internet.  If needed, new fairness
   criteria can be implemented by the same algorithm we propose by
   appropriately scaling the overall aggressiveness.

   It is intended that the algorithm presented here can be applied to
   other multipath transport protocols, such as alternative multipath
   extensions to TCP, or indeed any other congestion-aware transport
   protocols.  However, for the purposes of example, this document
   will, where appropriate, refer to the MPTCP protocol.

   The design decisions and evaluation of the congestion control
   algorithm are published in [NSDI].

   The algorithm presented here only extends TCP New Reno congestion
   control for multipath operation.  It is foreseeable that other
   congestion controllers will be implemented for multipath transport
   to achieve the bandwidth-scaling properties of the newer congestion
   control algorithms for regular TCP (such as Compound TCP and
   Cubic).
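   As a rough illustration of what "the rate a TCP flow would get on
   the best path" means, the sketch below estimates per-path New Reno
   rates using the classic throughput approximation rate ~= (mss/rtt)
   * sqrt(3/(2*p)).  This estimator and the example numbers are
   assumptions of this sketch, not part of the draft's algorithm,
   which derives alpha directly as shown in Section 3.

      #include <math.h>
      #include <stdio.h>

      /* Classic New Reno throughput approximation, in bytes/second:
       * rate ~= (mss / rtt) * sqrt(3 / (2 * p)). */
      static double reno_rate_estimate(double mss, double rtt, double p)
      {
          return (mss / rtt) * sqrt(3.0 / (2.0 * p));
      }

      int main(void)
      {
          /* Hypothetical paths: mss in bytes, rtt in seconds, p is
           * the loss rate.  Path B is "best" despite its higher loss
           * rate, thanks to its much shorter rtt. */
          double path_a = reno_rate_estimate(1460.0, 0.100, 0.01);
          double path_b = reno_rate_estimate(1460.0, 0.020, 0.05);
          double target = path_a > path_b ? path_a : path_b;

          printf("A: %.0f B/s, B: %.0f B/s, target: %.0f B/s\n",
                 path_a, path_b, target);
          return 0;
      }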
3.  Coupled Congestion Control Algorithm

   The algorithm we present applies only to the increase phase of
   congestion avoidance, i.e., it specifies how the window inflates
   upon receiving an ack.  The slow start, fast retransmit, and fast
   recovery algorithms, as well as the multiplicative decrease of the
   congestion avoidance state, are the same as in TCP [RFC5681].

   Let cwnd_i be the congestion window on subflow i.  Let tot_cwnd be
   the sum of the congestion windows of all subflows in the
   connection.  Let p_i, rtt_i and mss_i be the loss rate, round trip
   time (i.e., smoothed round trip time estimate) and maximum segment
   size on subflow i.

   We assume throughout this document that the congestion window is
   maintained in bytes, unless otherwise specified.  We briefly
   describe the algorithm for packet-based implementations of cwnd in
   Section 4.1.

   Our proposed "Linked Increases" algorithm MUST:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

   The increase formula takes the minimum between the computed
   increase for the multipath subflow (the first argument to min) and
   the increase TCP would get in the same scenario (the second
   argument).  In this way, we ensure that no multipath subflow can be
   more aggressive than a TCP flow in the same circumstances, hence
   achieving Goal 2 (do no harm).

   "alpha" is a parameter of the algorithm that describes the
   aggressiveness of the multipath flow.  To meet Goal 1 (improve
   throughput), the value of alpha is chosen such that the aggregate
   throughput of the multipath flow equals the rate a TCP flow would
   get if it ran on the best path.

   To get an intuition of what the algorithm is trying to do, consider
   the case where all the subflows have the same round trip time and
   MSS.  In this case the algorithm will grow the total window by
   approximately alpha*MSS per RTT.  This increase is distributed to
   the individual flows according to their instantaneous window sizes:
   subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

   Note that, as in standard TCP, when tot_cwnd is large the increase
   may be 0.  In this case the increase MUST be set to 1.  We discuss
   how to implement this formula in practice in the next section.

   We assume appropriate byte counting (ABC, [RFC3465]) is used, hence
   the bytes_acked variable records the number of bytes newly
   acknowledged.  If ABC is not used, bytes_acked SHOULD be set to
   mss_i.

   To compute tot_cwnd, it is an easy mistake to simply sum up cwnd_i
   across all subflows: when a flow is in fast retransmit, its cwnd is
   typically inflated and no longer represents the real congestion
   window.  The correct behavior is to use the ssthresh value for
   flows in fast retransmit when computing tot_cwnd.  To cater for
   connections that are application limited, the computation should
   use the minimum of flight_size_i and cwnd_i (or of flight_size_i
   and ssthresh_i, where appropriate).

   The total throughput of a multipath flow depends on the value of
   alpha and on the loss rates, maximum segment sizes and round trip
   times of its paths.  Since we require that the total throughput be
   no worse than the throughput a single TCP would get on the best
   path, it is impossible to choose a priori a single value of alpha
   that achieves the desired throughput on every occasion.  Hence,
   alpha must be computed for each multipath flow, based on the
   observed properties of the paths.
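   Taking tot_cwnd and alpha as given (the computation of alpha is
   described next), the per-ack increase can be sketched as below.
   This is a minimal illustration, not normative code: cwnd is kept in
   bytes, and alpha appears as a floating point parameter for
   readability, whereas Section 4 replaces it with integer arithmetic.

      #include <stdint.h>

      /* Called once per ack received on subflow i during congestion
       * avoidance.  bytes_acked comes from ABC [RFC3465], or equals
       * mss_i if ABC is not used. */
      static void linked_increase(uint32_t *cwnd_i, uint32_t tot_cwnd,
                                  uint32_t mss_i, uint32_t bytes_acked,
                                  double alpha)
      {
          /* Increase attributed to the multipath flow as a whole... */
          uint32_t mp_inc =
              (uint32_t)(alpha * bytes_acked * mss_i / tot_cwnd);

          /* ...capped by what New Reno would do on this path alone,
           * so no subflow is more aggressive than plain TCP (Goal 2) */
          uint32_t reno_inc =
              (uint32_t)(((uint64_t)bytes_acked * mss_i) / *cwnd_i);

          uint32_t inc = mp_inc < reno_inc ? mp_inc : reno_inc;

          if (inc == 0)  /* large windows: the increase MUST be >= 1 */
              inc = 1;

          *cwnd_i += inc;
      }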
   The formula to compute alpha is:

                                 cwnd_i
                        max     --------
                         i            2
                                 rtt_i
      alpha = tot_cwnd * ------------------
                         /       cwnd_i \ 2
                         |  sum  ------ |
                         \   i   rtt_i  /

   The formula is derived by equalizing the rate of the multipath flow
   with the rate of a TCP running on the best path, and solving for
   alpha.

4.  Implementation Considerations

   The formula for alpha above implies that alpha is a floating point
   value.  This would require performing costly floating point
   operations whenever an ack is received.  Further, in many kernels
   floating point operations are disabled.  There is an easy way to
   approximate the above calculations using integer arithmetic.

   Let alpha_scale be an integer.  When computing alpha, use
   alpha_scale * tot_cwnd instead of tot_cwnd, and do all the
   operations in integer arithmetic.

   Then, scale down the increase per ack by alpha_scale.  The
   algorithm is:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd/alpha_scale,
      bytes_acked*mss_i/cwnd_i).

   alpha_scale denotes the precision we want for computing alpha.
   Observe that the errors in computing the numerator or the
   denominator in the formula for alpha are quite small, as the cwnd
   in bytes is typically much larger than the RTT (measured in ms).

   With these changes, all the operations can be done using integer
   arithmetic.  We propose that alpha_scale be a small power of two,
   to allow using faster shift operations instead of multiplication
   and division.  Our experiments show that using alpha_scale=512
   works well in a wide range of scenarios.  Increasing alpha_scale
   increases precision, but also increases the risk of overflow when
   computing alpha.  Using 64-bit operations would solve this issue.
   Another option is to dynamically adjust alpha_scale when computing
   alpha; in this way we avoid overflow and obtain maximum precision.

   It is possible to implement the algorithm by calculating tot_cwnd
   on each ack; however, this would be costly, especially when the
   number of subflows is large.  To avoid this overhead the
   implementation MAY choose to maintain a new per-connection state
   variable called tot_cwnd.  If it does so, the implementation will
   update the tot_cwnd value whenever the individual subflows' windows
   are updated.  Updating only requires one more addition or
   subtraction operation compared to the regular, per-subflow
   congestion control code, so its performance impact should be
   minimal.

   Computing alpha per ack is also costly.  We propose that alpha be a
   per-connection variable, computed whenever there is a drop and once
   per RTT otherwise.  More specifically, let cwnd_new be the new
   value of the congestion window after it is inflated or after a
   drop.  Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
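   Under those constraints, the alpha computation itself might look as
   sketched below.  This is an illustrative sketch only: the choice of
   units (cwnd in bytes, rtt in milliseconds), the degenerate-input
   guard, and the function shape are assumptions of this example, not
   requirements of this document.

      #include <stdint.h>

      #define ALPHA_SHIFT 9           /* alpha_scale = 1 << 9 = 512 */

      /* Returns alpha pre-multiplied by alpha_scale, so that the
       * per-ack increase divides the result by alpha_scale (a cheap
       * right shift).  64-bit intermediates reduce the overflow risk
       * discussed above. */
      static uint64_t compute_alpha(const uint32_t *cwnd,
                                    const uint32_t *rtt, int n)
      {
          uint64_t tot_cwnd = 0, max_term = 0, denom = 0;

          for (int i = 0; i < n; i++) {
              /* candidate for max_i cwnd_i/rtt_i^2, kept scaled by
               * alpha_scale so integer division retains precision */
              uint64_t term = ((uint64_t)cwnd[i] << ALPHA_SHIFT) /
                              ((uint64_t)rtt[i] * rtt[i]);
              if (term > max_term)
                  max_term = term;

              denom    += cwnd[i] / rtt[i]; /* sum_i cwnd_i/rtt_i */
              tot_cwnd += cwnd[i];
          }
          if (denom == 0)               /* degenerate input guard */
              return 1ULL << ALPHA_SHIFT;

          /* alpha_scale * tot_cwnd * max_i(cwnd_i/rtt_i^2)
           *             / (sum_i cwnd_i/rtt_i)^2            */
          return (tot_cwnd * max_term) / (denom * denom);
      }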
   In certain cases with small RTTs, computing alpha can still be
   expensive.  We observe that if RTTs were constant, it would be
   sufficient to compute alpha once per drop, as alpha does not change
   between drops (the insight here is that cwnd_i/cwnd_j remains
   constant as long as both windows increase).  Experimental results
   show that even if round trip times are not constant, using the
   average round trip time instead of the instantaneous round trip
   time gives good precision for computing alpha.  Hence, it is
   possible to compute alpha only once per drop according to the
   formula above, by replacing rtt_i with rtt_avg_i.

   If using the average round trip time, rtt_avg_i will be computed by
   sampling rtt_i whenever the window can accommodate one more packet,
   i.e., when cwnd/mss < (cwnd+increase)/mss.  The samples are
   averaged once per sawtooth into rtt_avg_i.  This sampling ensures
   that there is no sampling bias towards larger windows.

   Given tot_cwnd and alpha, the congestion control algorithm is run
   for each subflow independently, with similar complexity to the
   standard TCP increase code [RFC5681].

4.1.  Implementation Considerations when CWND is Expressed in Packets

   When the congestion control algorithm maintains cwnd in packets
   rather than bytes, the algorithms above must be changed to take the
   path mss into account.

   To compute the increase when an ack is received, the implementation
   for multipath congestion control is a simple extension of the TCP
   New Reno code.  In TCP New Reno, cwnd_cnt is an additional state
   variable that tracks the number of segments acked since the last
   cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and
   then cwnd_cnt is set to 0.

   In the multipath case, cwnd_cnt_i is maintained for each subflow as
   above, and cwnd_i is increased by 1 when cwnd_cnt_i > alpha_scale *
   tot_cwnd / alpha.  (A sketch of this rule appears at the end of
   this section.)

   When computing alpha for packet-based stacks, the errors in
   computing the terms in the denominator are larger (this is because
   cwnd is much smaller and rtt may be comparatively large).  Let max
   be the index of the subflow used in the numerator.  To reduce
   errors, it is easiest to move rtt_max (once calculated) from the
   numerator to the denominator, obtaining the equivalent formula
   below.

                                            cwnd_max
      alpha = alpha_scale * tot_cwnd * -----------------------
                                       /  rtt_max * cwnd_i \ 2
                                       | sum -------------- |
                                       \  i      rtt_i     /

   Note that the formula for computing alpha does not take path mss
   into account, and is the same for stacks that keep cwnd in bytes or
   in packets.  With this formula, the algorithm for computing alpha
   will match the rate of TCP on the best path in bytes per second for
   byte-oriented stacks, and in packets per second for packet-based
   stacks.  In practice, mss rarely differs between paths, so this
   should not be a problem.

   However, it is simple to derive formulae allowing packet-based
   stacks to achieve byte-rate fairness (and vice versa) if needed.
   In particular, for packet-based stacks wanting byte-rate fairness,
   the formula above changes as follows: cwnd_max is replaced by
   cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i
   * mss_i.
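   The sketch below illustrates the packet-based increase rule.  The
   max() against cwnd_i is this example's way of folding in the min()
   cap from Section 3 (never increment faster than New Reno, which
   increments once per cwnd acked packets); the structure and names
   are assumptions of the sketch, not prescribed by this document.

      #include <stdint.h>

      struct subflow {
          uint32_t cwnd;      /* congestion window, in packets */
          uint32_t cwnd_cnt;  /* packets acked since last increment */
      };

      /* Called once per newly acked packet on the subflow during
       * congestion avoidance.  alpha is pre-scaled by alpha_scale,
       * as returned by the computation sketched in Section 4. */
      static void pkt_linked_increase(struct subflow *sf,
                                      uint64_t tot_cwnd, uint64_t alpha,
                                      uint64_t alpha_scale)
      {
          if (alpha == 0)     /* guard against division by zero */
              alpha = 1;

          /* increment once per alpha_scale*tot_cwnd/alpha acked
           * packets... */
          uint64_t thresh = alpha_scale * tot_cwnd / alpha;

          /* ...but never more often than New Reno would */
          if (thresh < sf->cwnd)
              thresh = sf->cwnd;

          if (++sf->cwnd_cnt > thresh) {
              sf->cwnd++;
              sf->cwnd_cnt = 0;
          }
      }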
5.  Discussion

   To achieve perfect resource pooling, one must couple both the
   increase and the decrease of the congestion windows across
   subflows, as in [KELLY].  Yet this tends to exhibit "flappiness":
   when the paths have similar levels of congestion, the congestion
   controller will tend to allocate all the window to one random
   subflow, and allocate zero window to the other subflows.  The
   controller will perform random flips between these stable points.
   This does not seem desirable in general, and it is particularly bad
   when the achieved rates depend on the RTT (as in the current
   Internet): in such a case, the resulting rate will fluctuate
   unpredictably depending on which state the controller is in, hence
   violating Goal 1.

   By coupling only the increases, our proposal removes flappiness,
   but it also reduces the extent of resource pooling the protocol
   achieves.  The algorithm will allocate window to the subflows such
   that p_i * cwnd_i = constant, for all i.  Thus, when the loss rates
   of the subflows are equal, each subflow will get an equal window,
   removing flappiness.  When the loss rates differ, progressively
   more window will be allocated to the flow with the lower loss rate.
   In contrast, perfect resource pooling requires that all the window
   be allocated on the path with the lowest loss rate.

6.  Security Considerations

   None.

   Detailed security analysis for the Multipath TCP protocol itself is
   included in [I-D.ford-mptcp-multiaddressed] and [REF].

7.  Acknowledgements

   We thank Christoph Paasch for his suggestions for computing alpha
   in packet-based stacks.  The authors are supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.  The views expressed here are those of the
   author(s) only.  The European Commission is not liable for any use
   that may be made of the information in this document.

8.  IANA Considerations

   None.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., Handley, M., and S. Barre, "TCP
              Extensions for Multipath Operation with Multiple
              Addresses", draft-ford-mptcp-multiaddressed-01 (work in
              progress), July 2009.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate
              Byte Counting (ABC)", RFC 3465, February 2003.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38
              num. 5, pp. 47-52, October 2008.
Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk

   Mark Handley
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Damon Wischik
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk