Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: December 18, 2011                    University College London
                                                           June 16, 2011


      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-05

Abstract

Often endpoints are connected by multiple paths, but communications are usually restricted to a single path per connection. Resource usage within the network would be more efficient were it possible for these multiple paths to be used concurrently. Multipath TCP is a proposal to achieve multipath transport in TCP.

New congestion control algorithms are needed for multipath transport protocols such as Multipath TCP, as single path algorithms have a series of issues in the multipath context. One of the prominent problems is that running existing algorithms such as standard TCP independently on each path would give the multipath flow more than its fair share at a bottleneck link traversed by more than one of its subflows. Further, it is desirable that a source with multiple paths available will transfer more traffic using the least congested of the paths, hence achieving resource pooling. This would increase the overall efficiency of the network and also its robustness to failure.

This document presents a congestion control algorithm which couples the congestion control algorithms running on different subflows by linking their increase functions, and dynamically controls the overall aggressiveness of the multipath flow. The result is a practical algorithm that is fair to TCP at bottlenecks while moving traffic away from congested links.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 18, 2011.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Computing alpha in Practice
     4.2.  Implementation Considerations when CWND is Expressed in Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Introduction

Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of extensions to regular TCP [RFC0793] that allows one TCP connection to be spread across multiple paths. MPTCP distributes load through the creation of separate "subflows" across potentially disjoint paths.

How should congestion control be performed for multipath TCP? First, each subflow must have its own congestion control state (i.e., cwnd) so that capacity on that path is matched by offered load. The simplest way to achieve this goal is to run standard TCP congestion control on each subflow. However, this solution is unsatisfactory, as it gives the multipath flow an unfair share when the paths taken by its different subflows share a common bottleneck.

Bottleneck fairness is just one requirement multipath congestion control should meet. The following three goals capture the desirable properties of a practical multipath congestion control algorithm:

o  Goal 1 (Improve Throughput): A multipath flow should perform at least as well as a single-path flow would on the best of the paths available to it.
o  Goal 2 (Do no harm): A multipath flow should not take up more capacity from any of the resources shared by its different paths than if it were a single flow using only one of these paths. This guarantees it will not unduly harm other flows.

o  Goal 3 (Balance congestion): A multipath flow should move as much traffic as possible off its most congested paths, subject to meeting the first two goals.

Goals 1 and 2 together ensure fairness at the bottleneck. Goal 3 captures the concept of resource pooling [WISCHIK]: if each multipath flow sends more data through its least congested path, the traffic in the network will move away from congested areas. This improves robustness and overall throughput, among other things. The way to achieve resource pooling is to effectively "couple" the congestion control loops for the different subflows.

We propose an algorithm that couples only the additive increase function of the subflows, and uses unmodified TCP behavior in case of a drop. The algorithm relies on the traditional TCP mechanisms to detect drops, to retransmit data, etc.

Detecting shared bottlenecks reliably is quite difficult, but it is just one part of a bigger question: how much bandwidth should a multipath user use in total, even if there is no shared bottleneck?

The congestion controller aims to set the multipath flow's aggregate bandwidth to be the same as a regular TCP flow would get on the best path available to the multipath flow. To estimate the bandwidth of a regular TCP flow, the multipath flow estimates loss rates and round-trip times and computes the target rate. It then adjusts the overall aggressiveness (parameter alpha) to achieve the desired rate.

While the mechanism above always applies, its effect depends on whether or not the multipath TCP flow influences the link loss rates (low vs. high statistical multiplexing, respectively). If MPTCP does not influence link loss rates, MPTCP will get the same throughput as TCP on the best path. In cases with low statistical multiplexing, where the multipath flow influences the loss rates on the path, the multipath throughput will be strictly higher than a single TCP flow would get on any of the paths. In particular, if using two idle paths, the multipath throughput will be the sum of the two paths' throughputs.

This algorithm ensures bottleneck fairness and fairness in the broader, network sense. We acknowledge that current TCP fairness criteria are far from ideal, but a multipath TCP needs to be deployable in the current Internet. If needed, new fairness criteria can be implemented by the same algorithm we propose by appropriately scaling the overall aggressiveness.

It is intended that the algorithm presented here can be applied to other multipath transport protocols, such as alternative multipath extensions to TCP, or indeed any other congestion-aware transport protocols. However, for the purposes of example, this document will, where appropriate, refer to the MPTCP protocol.

The design decisions and evaluation of the congestion control algorithm are published in [NSDI].

The algorithm presented here only extends standard TCP congestion control for multipath operation.
It is foreseeable that other congestion controllers will be implemented for multipath transport to achieve the bandwidth-scaling properties of the newer congestion control algorithms for regular TCP (such as Compound TCP and Cubic).

3. Coupled Congestion Control Algorithm

The algorithm we present applies only to the increase phase of congestion avoidance, which specifies how the window inflates upon receiving an ACK. The slow start, fast retransmit, and fast recovery algorithms, as well as the multiplicative decrease of congestion avoidance, are the same as in standard TCP [RFC5681].

Let cwnd_i be the congestion window on subflow i. Let tot_cwnd be the sum of the congestion windows of all subflows in the connection. Let p_i, rtt_i, and mss_i be the loss rate, round-trip time (i.e., the smoothed round-trip time estimate used by TCP), and maximum segment size on subflow i.

We assume throughout this document that the congestion window is maintained in bytes, unless otherwise specified. We briefly describe the algorithm for packet-based implementations of cwnd in Section 4.2.

Our proposed "Linked Increases" algorithm MUST:

o  For each ACK received on subflow i, increase cwnd_i by min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

The increase formula takes the minimum between the computed increase for the multipath subflow (the first argument to min) and the increase TCP would get in the same scenario (the second argument). In this way, we ensure that any multipath subflow cannot be more aggressive than a TCP flow in the same circumstances, hence achieving Goal 2 (do no harm).

"alpha" is a parameter of the algorithm that describes the aggressiveness of the multipath flow. To meet Goal 1 (improve throughput), the value of alpha is chosen such that the aggregate throughput of the multipath flow is equal to the throughput a TCP flow would get if it ran on the best path.

To get an intuition of what the algorithm is trying to do, consider the case where all the subflows have the same round-trip time and MSS. In this case, the algorithm will grow the total window by approximately alpha*MSS per RTT. This increase is distributed to the individual flows according to their instantaneous window sizes: subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

Note that, as in standard TCP, when tot_cwnd is large the increase may be 0. In this case, the increase MUST be set to 1. We discuss how to implement this formula in practice in the next section.

We assume implementations use an approach similar to appropriate byte counting (ABC, [RFC3465]), where the bytes_acked variable records the number of bytes newly acknowledged. If this is not true, bytes_acked SHOULD be set to mss_i.

When computing tot_cwnd, it is an easy mistake simply to sum up cwnd_i across all subflows: when a flow is in fast retransmit, its cwnd is typically inflated and no longer represents the real congestion window. The correct behavior is to use the ssthresh value for flows in fast retransmit when computing tot_cwnd. To cater for connections that are application limited, the computation should use the minimum of flight_size_i and cwnd_i (or of flight_size_i and ssthresh_i, where appropriate).
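The per-ACK update and the tot_cwnd computation described above can be sketched in C as follows. This is only an illustration, not code from any existing stack: the struct, field, and function names are invented for the example, alpha is shown as a floating-point parameter purely for readability (Section 4.1 describes the integer-arithmetic version), and an ABC-style stack that reports bytes_acked is assumed.

   #include <stdint.h>

   /* Illustrative per-subflow state; all names are invented. */
   struct subflow {
       uint32_t cwnd;          /* congestion window, in bytes        */
       uint32_t ssthresh;      /* slow-start threshold, in bytes     */
       uint32_t flight_size;   /* bytes currently outstanding        */
       uint32_t mss;           /* maximum segment size, in bytes     */
       int      in_fast_retransmit;
   };

   /* tot_cwnd: sum the subflow windows, substituting ssthresh for an
    * inflated cwnd during fast retransmit and capping each term by
    * flight_size for application-limited connections. */
   static uint64_t compute_tot_cwnd(const struct subflow *sf, int n)
   {
       uint64_t tot = 0;
       for (int i = 0; i < n; i++) {
           uint64_t w = sf[i].in_fast_retransmit ? sf[i].ssthresh
                                                 : sf[i].cwnd;
           if (sf[i].flight_size < w)
               w = sf[i].flight_size;
           tot += w;
       }
       return tot;
   }

   /* Linked-increases step for one ACK on subflow i; alpha is the
    * per-connection aggressiveness parameter. */
   static void coupled_increase(struct subflow *sf_i, double alpha,
                                uint64_t tot_cwnd, uint32_t bytes_acked)
   {
       uint64_t coupled  = (uint64_t)(alpha * bytes_acked * sf_i->mss
                                      / tot_cwnd);
       uint64_t tcp_like = (uint64_t)bytes_acked * sf_i->mss / sf_i->cwnd;
       uint64_t inc = coupled < tcp_like ? coupled : tcp_like;

       if (inc == 0)   /* the text requires an increase of at least 1 */
           inc = 1;
       sf_i->cwnd += (uint32_t)inc;
   }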
The total throughput of a multipath flow depends on the value of alpha and the loss rates, maximum segment sizes, and round-trip times of its paths. Since we require that the total throughput is no worse than the throughput a single TCP flow would get on the best path, it is impossible to choose a priori a single value of alpha that achieves the desired throughput on every occasion. Hence, alpha must be computed based on the observed properties of the paths.

The formula to compute alpha is:

                              cwnd_i
                        max --------
                         i        2
                             rtt_i
      alpha = tot_cwnd * ----------------
                         /      cwnd_i \ 2
                         | sum ------- |
                         \  i   rtt_i  /

The formula is derived by equalizing the rate of the multipath flow with the rate of a TCP flow running on the best path, and solving for alpha.

4. Implementation Considerations

The formula for alpha above implies that alpha is a floating-point value. This would require performing costly floating-point operations whenever an ACK is received. Further, in many kernels floating-point operations are disabled. There is an easy way to approximate the above calculations using integer arithmetic.

4.1. Computing alpha in Practice

Let alpha_scale be an integer. When computing alpha, use alpha_scale * tot_cwnd instead of tot_cwnd, and do all the operations in integer arithmetic. Then, scale down the increase per ACK by alpha_scale. The algorithm is:

o  For each ACK received on subflow i, increase cwnd_i by min((alpha*bytes_acked*mss_i/tot_cwnd)/alpha_scale, bytes_acked*mss_i/cwnd_i).

alpha_scale denotes the precision we want for computing alpha. Observe that the errors in computing the numerator or the denominator in the formula for alpha are quite small, as the cwnd in bytes is typically much larger than the RTT (measured in ms).

With these changes, all the operations can be done using integer arithmetic. We propose that alpha_scale be a small power of two, to allow using faster shift operations instead of multiplication and division. Our experiments show that using alpha_scale=512 works well in a wide range of scenarios. Increasing alpha_scale increases precision but also increases the risk of overflow when computing alpha. Using 64-bit operations would solve this issue. Another option is to dynamically adjust alpha_scale when computing alpha; in this way we avoid overflow and obtain maximum precision.

It is possible to implement the algorithm by calculating tot_cwnd on each ACK; however, this would be costly, especially when the number of subflows is large. To avoid this overhead, the implementation MAY choose to maintain a new per-connection state variable called tot_cwnd. If it does so, the implementation will update the tot_cwnd value whenever the individual subflows' windows are updated. Updating only requires one more addition or subtraction operation compared to the regular, per-subflow congestion control code, so its performance impact should be minimal.

Computing alpha per ACK is also costly. We propose that alpha be a per-connection variable, computed whenever there is a drop and once per RTT otherwise. More specifically, let cwnd_new_i be the new value of the congestion window on subflow i after it is inflated or after a drop. Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
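As a concrete illustration of the integer-arithmetic computation, the sketch below computes a scaled alpha once per RTT or per drop. It is a minimal example under the assumptions stated above (cwnd in bytes, RTT in milliseconds, alpha_scale a small power of two); the names are invented rather than taken from any particular implementation, and a production stack would additionally guard against overflow or adjust alpha_scale dynamically as discussed.

   #include <stdint.h>

   #define ALPHA_SCALE 512    /* small power of two, as suggested above */

   /* Minimal per-subflow state needed for the alpha computation. */
   struct subflow_state {
       uint64_t cwnd;         /* congestion window, in bytes            */
       uint64_t rtt_ms;       /* smoothed or per-sawtooth average RTT, ms */
   };

   /* Integer-arithmetic alpha, following Section 4.1:
    *
    *           alpha_scale * tot_cwnd * max_i (cwnd_i / rtt_i^2)
    *  alpha = ---------------------------------------------------
    *                     ( sum_i cwnd_i / rtt_i )^2
    *
    * With cwnd in bytes and rtt in milliseconds, the integer divisions
    * keep enough precision, per the observation in the text. */
   static uint64_t compute_alpha(const struct subflow_state *sf, int n,
                                 uint64_t tot_cwnd)
   {
       uint64_t best  = 0;    /* max_i cwnd_i / rtt_i^2 */
       uint64_t denom = 0;    /* sum_i cwnd_i / rtt_i   */

       for (int i = 0; i < n; i++) {
           uint64_t rtt  = sf[i].rtt_ms ? sf[i].rtt_ms : 1;
           uint64_t term = sf[i].cwnd / (rtt * rtt);

           if (term > best)
               best = term;
           denom += sf[i].cwnd / rtt;
       }
       if (denom == 0)
           return ALPHA_SCALE;   /* degenerate case: behave like plain TCP */

       /* For very large windows this product can overflow 64 bits; a real
        * implementation would clamp it or shrink alpha_scale dynamically,
        * as discussed above. */
       return (ALPHA_SCALE * tot_cwnd * best) / (denom * denom);
   }

The value returned is already multiplied by alpha_scale, so the per-ACK increase divides by alpha_scale exactly as in the bullet above.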
In certain cases with small RTTs, computing alpha can still be expensive. We observe that if RTTs were constant, it would be sufficient to compute alpha once per drop, as alpha does not change between drops (the insight here is that cwnd_i/cwnd_j is constant as long as both windows increase). Experimental results show that even if round-trip times are not constant, using the average round-trip time per sawtooth instead of the instantaneous round-trip time (i.e., TCP's smoothed RTT estimate) gives good precision for computing alpha. Hence, it is possible to compute alpha only once per drop according to the formula above, by replacing rtt_i with rtt_avg_i.

If using the average round-trip time, rtt_avg_i will be computed by sampling rtt_i whenever the window can accommodate one more packet, i.e., when cwnd/mss < (cwnd + increase)/mss. The samples are averaged once per sawtooth into rtt_avg_i. This sampling ensures that there is no sampling bias for larger windows.

Given tot_cwnd and alpha, the congestion control algorithm is run for each subflow independently, with similar complexity to the standard TCP increase code [RFC5681].

4.2. Implementation Considerations when CWND is Expressed in Packets

When the congestion control algorithm maintains cwnd in packets rather than bytes, the algorithms above must change to take into account the path mss.

To compute the increase when an ACK is received, the implementation for multipath congestion control is a simple extension of the standard TCP code. In standard TCP, cwnd_cnt is an additional state variable that tracks the number of segments acked since the last cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and cwnd_cnt is then set to 0.

In the multipath case, cwnd_cnt_i is maintained for each subflow as above, and cwnd_i is increased by 1 when cwnd_cnt_i > max(alpha_scale * tot_cwnd / alpha, cwnd_i); see the sketch at the end of this section.

When computing alpha for packet-based stacks, the errors in computing the terms in the denominator are larger (this is because cwnd is much smaller and rtt may be comparatively large). Let max be the index of the subflow used in the numerator. To reduce errors, it is easiest to move rtt_max (once calculated) from the numerator to the denominator, obtaining the equivalent formula below:

                                                cwnd_max
      alpha = alpha_scale * tot_cwnd * ---------------------------
                                       /      rtt_max * cwnd_i \ 2
                                       | sum ------------------ |
                                       \  i        rtt_i        /

Note that the formula for computing alpha does not take into account the path mss and is the same for stacks that keep cwnd in bytes or packets. With this formula, the algorithm for computing alpha will match the rate of TCP on the best path in bytes per second for byte-oriented stacks, and in packets per second for packet-based stacks. In practice, mss rarely differs between paths, so this should not be a problem.

However, it is simple to derive formulae allowing packet-based stacks to achieve byte-rate fairness (and vice versa) if needed. In particular, for packet-based stacks wanting byte-rate fairness, the formula above changes as follows: cwnd_max is replaced by cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i * mss_i.
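A sketch of the packet-based increase, mirroring the cwnd_cnt mechanism described earlier in this section, is given below. Again, the type and function names are invented and the code is only illustrative; alpha is assumed to be the per-connection value already scaled by alpha_scale, as computed in Section 4.1, and tot_cwnd is assumed to be in packets.

   #include <stdint.h>

   #define ALPHA_SCALE 512            /* same scaling as in Section 4.1   */

   struct pkt_subflow {
       uint32_t cwnd;                 /* congestion window, in packets    */
       uint32_t cwnd_cnt;             /* packets acked since last growth  */
   };

   /* Called for each newly acked packet on this subflow during
    * congestion avoidance. */
   static void coupled_increase_pkts(struct pkt_subflow *sf,
                                     uint64_t tot_cwnd, uint64_t alpha)
   {
       /* cwnd grows by one packet only after
        * max(alpha_scale * tot_cwnd / alpha, cwnd) packets are acked;
        * the max() term keeps the subflow no more aggressive than TCP. */
       uint64_t threshold = (ALPHA_SCALE * tot_cwnd) / (alpha ? alpha : 1);

       if (threshold < sf->cwnd)
           threshold = sf->cwnd;

       sf->cwnd_cnt++;
       if (sf->cwnd_cnt > threshold) {
           sf->cwnd++;
           sf->cwnd_cnt = 0;
       }
   }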
5. Discussion

To achieve perfect resource pooling, one must couple both the increase and decrease of congestion windows across subflows, as in [KELLY]. Yet this tends to exhibit "flappiness": when the paths have similar levels of congestion, the congestion controller will tend to allocate all the window to one random subflow and zero window to the other subflows. The controller will perform random flips between these stable points. This does not seem desirable in general, and it is particularly bad when the achieved rates depend on the RTT (as in the current Internet): in such a case, the resulting rate will fluctuate unpredictably depending on which state the controller is in, hence violating Goal 1.

By coupling only the increases, our proposal removes flappiness but also reduces the extent of resource pooling the protocol achieves. The algorithm will allocate window to the subflows such that p_i * cwnd_i = constant, for all i. Thus, when the loss rates of the subflows are equal, each subflow will get an equal window, removing flappiness. When the loss rates differ, progressively more window will be allocated to the flow with the lower loss rate; for example, a subflow seeing twice the loss rate of another will be allocated half as much window. In contrast, perfect resource pooling requires that all the window be allocated on the path with the lowest loss rate.

6. Security Considerations

None.

Detailed security analysis for the Multipath TCP protocol itself is included in [I-D.ford-mptcp-multiaddressed] and RFC 6181 [RFC6181].

7. Acknowledgements

We thank Christoph Paasch for his suggestions for computing alpha in packet-based stacks. The authors are supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only. The European Commission is not liable for any use that may be made of the information in this document.

8. IANA Considerations

This document does not require any action from IANA.

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

9.2. Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for
              Multipath Operation with Multiple Addresses",
              draft-ford-mptcp-multiaddressed-03 (work in progress),
              March 2010.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate Byte
              Counting (ABC)", RFC 3465, February 2003.

   [RFC6181]  Bagnulo, M., "Threat Analysis for TCP Extensions for
              Multipath Operation with Multiple Addresses", RFC 6181,
              March 2011.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num.
              5, pp. 47-52, October 2008.
Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk

   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Damon Wischik
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk