Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: September 15, 2011                    University College London
                                                          March 14, 2011

      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-02

Abstract

   Often endpoints are connected by multiple paths, but communications
   are usually restricted to a single path per connection.  Resource
   usage within the network would be more efficient were it possible
   for these multiple paths to be used concurrently.  Multipath TCP is
   a proposal to achieve multipath transport in TCP.

   New congestion control algorithms are needed for multipath transport
   protocols such as Multipath TCP, as single path algorithms have a
   series of issues in the multipath context.  One of the prominent
   problems is that running existing algorithms such as TCP New Reno
   independently on each path would give the multipath flow more than
   its fair share at a bottleneck link traversed by more than one of
   its subflows.  Further, it is desirable that a source with multiple
   paths available will transfer more traffic using the least congested
   of the paths, hence achieving resource pooling.  This would increase
   the overall efficiency of the network and also its robustness to
   failure.

   This document presents a congestion control algorithm which couples
   the congestion control algorithms running on different subflows by
   linking their increase functions, and dynamically controls the
   overall aggressiveness of the multipath flow.  The result is a
   practical algorithm that is fair to TCP at bottlenecks while moving
   traffic away from congested links.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 15, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Implementation Considerations when CWND is Expressed in
           Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

2.  Introduction

   Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of
   extensions to regular TCP [RFC0793] that allows one TCP connection
   to be spread across multiple paths.  MPTCP distributes load through
   the creation of separate "subflows" across potentially disjoint
   paths.

   How should congestion control be performed for multipath TCP?
   First, each subflow must have its own congestion control state
   (i.e., cwnd) so that capacity on that path is matched by offered
   load.  The simplest way to achieve this goal is simply to run TCP
   New Reno congestion control [RFC5681] on each subflow.  However,
   this solution is unsatisfactory, as it gives the multipath flow an
   unfair share when the paths taken by its different subflows share a
   common bottleneck.
   Bottleneck fairness is just one requirement multipath congestion
   control should meet.  The following three goals capture the
   desirable properties of a practical multipath congestion control
   algorithm:

   o  Goal 1 (Improve Throughput) A multipath flow should perform at
      least as well as a single path flow would on the best of the
      paths available to it.

   o  Goal 2 (Do no harm) A multipath flow should not take up more
      capacity on any one of its paths than if it were a single path
      flow using only that route.  This guarantees it will not unduly
      harm other flows.

   o  Goal 3 (Balance congestion) A multipath flow should move as much
      traffic as possible off its most congested paths, subject to
      meeting the first two goals.

   Goals 1 and 2 together ensure fairness at the bottleneck.  Goal 3
   captures the concept of resource pooling [WISCHIK]: if each
   multipath flow sends more data through its least congested path,
   the traffic in the network will move away from congested areas.
   This improves robustness and overall throughput, among other
   things.  The way to achieve resource pooling is to effectively
   "couple" the congestion control loops for the different subflows.

   We propose an algorithm that couples only the additive increase
   function of the subflows, and uses unmodified TCP New Reno behavior
   in case of a drop.  The algorithm relies on the traditional TCP
   mechanisms to detect drops, to retransmit data, etc.

   Detecting shared bottlenecks reliably is quite difficult, but it is
   just one part of a bigger question: how much bandwidth should a
   multipath user use in total, even if there is no shared bottleneck?

   Our solution sets the multipath flow's aggregate bandwidth to be
   the same bandwidth a regular TCP flow would get on the best path
   available to the multipath flow.  To estimate the bandwidth of a
   regular TCP flow, the multipath flow estimates loss rates and round
   trip times and computes the target rate.  It then adjusts its
   overall aggressiveness (parameter alpha) to achieve the desired
   rate.

   We note that in cases with low statistical multiplexing (where the
   multipath flow influences the loss rates on its paths), the
   multipath throughput will be strictly higher than a single TCP flow
   would get on any of the paths.  In particular, when using two idle
   paths, the multipath throughput will be the sum of the throughputs
   of the two paths.

   This algorithm ensures bottleneck fairness and fairness in the
   broader, network sense.  We acknowledge that current TCP fairness
   criteria are far from ideal, but a multipath TCP needs to be
   deployable in the current Internet.  If needed, new fairness
   criteria can be implemented by the same algorithm we propose by
   appropriately scaling the overall aggressiveness.

   It is intended that the algorithm presented here can be applied to
   other multipath transport protocols, such as alternative multipath
   extensions to TCP, or indeed any other congestion-aware transport
   protocols.  However, for the purposes of example, this document
   will, where appropriate, refer to the MPTCP protocol.

   The design decisions and evaluation of the congestion control
   algorithm are published in [NSDI].

   The algorithm presented here only extends TCP New Reno congestion
   control for multipath operation.  It is foreseeable that other
   congestion controllers will be implemented for multipath transport
   to achieve the bandwidth-scaling properties of the newer congestion
   control algorithms for regular TCP (such as Compound TCP and
   Cubic).
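   As a rough illustration of what "the rate a TCP flow would get on
   the best path" means, the sketch below estimates per-path New Reno
   rates using the classic throughput approximation rate ~= (mss/rtt)
   * sqrt(3/(2*p)).  This estimator and the example numbers are
   assumptions of this sketch, not part of the draft's algorithm,
   which derives alpha directly as shown in Section 3.

      #include <math.h>
      #include <stdio.h>

      /* Classic New Reno throughput approximation, in bytes/second:
       * rate ~= (mss / rtt) * sqrt(3 / (2 * p)). */
      static double reno_rate_estimate(double mss, double rtt, double p)
      {
          return (mss / rtt) * sqrt(3.0 / (2.0 * p));
      }

      int main(void)
      {
          /* Hypothetical paths: mss in bytes, rtt in seconds, p is
           * the loss rate.  Path B is "best" despite its higher loss
           * rate, thanks to its much shorter rtt. */
          double path_a = reno_rate_estimate(1460.0, 0.100, 0.01);
          double path_b = reno_rate_estimate(1460.0, 0.020, 0.05);
          double target = path_a > path_b ? path_a : path_b;

          printf("A: %.0f B/s, B: %.0f B/s, target: %.0f B/s\n",
                 path_a, path_b, target);
          return 0;
      }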
3.  Coupled Congestion Control Algorithm

   The algorithm we present applies only to the increase phase of
   congestion avoidance, i.e., it specifies how the window inflates
   upon receiving an ack.  The slow start, fast retransmit, and fast
   recovery algorithms, as well as the multiplicative decrease of the
   congestion avoidance state, are the same as in TCP [RFC5681].

   Let cwnd_i be the congestion window on subflow i.  Let tot_cwnd be
   the sum of the congestion windows of all subflows in the
   connection.  Let p_i, rtt_i and mss_i be the loss rate, round trip
   time (i.e., smoothed round trip time estimate) and maximum segment
   size on subflow i.

   We assume throughout this document that the congestion window is
   maintained in bytes, unless otherwise specified.  We briefly
   describe the algorithm for packet-based implementations of cwnd in
   Section 4.1.

   Our proposed "Linked Increases" algorithm MUST:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

   The increase formula takes the minimum between the computed
   increase for the multipath subflow (the first argument to min) and
   the increase TCP would get in the same scenario (the second
   argument).  In this way, we ensure that no multipath subflow can be
   more aggressive than a TCP flow in the same circumstances, hence
   achieving Goal 2 (do no harm).

   "alpha" is a parameter of the algorithm that describes the
   aggressiveness of the multipath flow.  To meet Goal 1 (improve
   throughput), the value of alpha is chosen such that the aggregate
   throughput of the multipath flow equals the rate a TCP flow would
   get if it ran on the best path.

   To get an intuition of what the algorithm is trying to do, consider
   the case where all the subflows have the same round trip time and
   MSS.  In this case the algorithm will grow the total window by
   approximately alpha*MSS per RTT.  This increase is distributed to
   the individual flows according to their instantaneous window sizes:
   subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

   Note that, as in standard TCP, when tot_cwnd is large the increase
   may be 0.  In this case the increase MUST be set to 1.  We discuss
   how to implement this formula in practice in the next section.

   We assume appropriate byte counting (ABC, [RFC3465]) is used, hence
   the bytes_acked variable records the number of bytes newly
   acknowledged.  If ABC is not used, bytes_acked SHOULD be set to
   mss_i.

   To compute tot_cwnd, it is an easy mistake to simply sum up cwnd_i
   across all subflows: when a flow is in fast retransmit, its cwnd is
   typically inflated and no longer represents the real congestion
   window.  The correct behavior is to use the ssthresh value for
   flows in fast retransmit when computing tot_cwnd.  To cater for
   connections that are application limited, the computation should
   use the minimum of flight_size_i and cwnd_i (or of flight_size_i
   and ssthresh_i, where appropriate).

   The total throughput of a multipath flow depends on the value of
   alpha and on the loss rates, maximum segment sizes and round trip
   times of its paths.  Since we require that the total throughput be
   no worse than the throughput a single TCP would get on the best
   path, it is impossible to choose a priori a single value of alpha
   that achieves the desired throughput on every occasion.  Hence,
   alpha must be computed for each multipath flow, based on the
   observed properties of the paths.
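   Taking tot_cwnd and alpha as given (the computation of alpha is
   described next), the per-ack increase can be sketched as below.
   This is a minimal illustration, not normative code: cwnd is kept in
   bytes, and alpha appears as a floating point parameter for
   readability, whereas Section 4 replaces it with integer arithmetic.

      #include <stdint.h>

      /* Called once per ack received on subflow i during congestion
       * avoidance.  bytes_acked comes from ABC [RFC3465], or equals
       * mss_i if ABC is not used. */
      static void linked_increase(uint32_t *cwnd_i, uint32_t tot_cwnd,
                                  uint32_t mss_i, uint32_t bytes_acked,
                                  double alpha)
      {
          /* Increase attributed to the multipath flow as a whole... */
          uint32_t mp_inc =
              (uint32_t)(alpha * bytes_acked * mss_i / tot_cwnd);

          /* ...capped by what New Reno would do on this path alone,
           * so no subflow is more aggressive than plain TCP (Goal 2) */
          uint32_t reno_inc =
              (uint32_t)(((uint64_t)bytes_acked * mss_i) / *cwnd_i);

          uint32_t inc = mp_inc < reno_inc ? mp_inc : reno_inc;

          if (inc == 0)  /* large windows: the increase MUST be >= 1 */
              inc = 1;

          *cwnd_i += inc;
      }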
   The formula to compute alpha is:

                                 cwnd_i
                        max     --------
                         i            2
                                 rtt_i
      alpha = tot_cwnd * ------------------
                         /       cwnd_i \ 2
                         |  sum  ------ |
                         \   i   rtt_i  /

   The formula is derived by equalizing the rate of the multipath flow
   with the rate of a TCP running on the best path, and solving for
   alpha.

4.  Implementation Considerations

   The formula for alpha above implies that alpha is a floating point
   value.  This would require performing costly floating point
   operations whenever an ack is received.  Further, in many kernels
   floating point operations are disabled.  There is an easy way to
   approximate the above calculations using integer arithmetic.

   Let alpha_scale be an integer.  When computing alpha, use
   alpha_scale * tot_cwnd instead of tot_cwnd, and do all the
   operations in integer arithmetic.

   Then, scale down the increase per ack by alpha_scale.  The
   algorithm is:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd/alpha_scale,
      bytes_acked*mss_i/cwnd_i).

   alpha_scale denotes the precision we want for computing alpha.
   Observe that the errors in computing the numerator or the
   denominator in the formula for alpha are quite small, as the cwnd
   in bytes is typically much larger than the RTT (measured in ms).

   With these changes, all the operations can be done using integer
   arithmetic.  We propose that alpha_scale be a small power of two,
   to allow using faster shift operations instead of multiplication
   and division.  Our experiments show that using alpha_scale=512
   works well in a wide range of scenarios.  Increasing alpha_scale
   increases precision, but also increases the risk of overflow when
   computing alpha.  Using 64-bit operations would solve this issue.
   Another option is to dynamically adjust alpha_scale when computing
   alpha; in this way we avoid overflow and obtain maximum precision.

   It is possible to implement the algorithm by calculating tot_cwnd
   on each ack; however, this would be costly, especially when the
   number of subflows is large.  To avoid this overhead the
   implementation MAY choose to maintain a new per-connection state
   variable called tot_cwnd.  If it does so, the implementation will
   update the tot_cwnd value whenever the individual subflows' windows
   are updated.  Updating only requires one more addition or
   subtraction operation compared to the regular, per-subflow
   congestion control code, so its performance impact should be
   minimal.

   Computing alpha per ack is also costly.  We propose that alpha be a
   per-connection variable, computed whenever there is a drop and once
   per RTT otherwise.  More specifically, let cwnd_new be the new
   value of the congestion window after it is inflated or after a
   drop.  Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
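   Under those constraints, the alpha computation itself might look as
   sketched below.  This is an illustrative sketch only: the choice of
   units (cwnd in bytes, rtt in milliseconds), the degenerate-input
   guard, and the function shape are assumptions of this example, not
   requirements of this document.

      #include <stdint.h>

      #define ALPHA_SHIFT 9           /* alpha_scale = 1 << 9 = 512 */

      /* Returns alpha pre-multiplied by alpha_scale, so that the
       * per-ack increase divides the result by alpha_scale (a cheap
       * right shift).  64-bit intermediates reduce the overflow risk
       * discussed above. */
      static uint64_t compute_alpha(const uint32_t *cwnd,
                                    const uint32_t *rtt, int n)
      {
          uint64_t tot_cwnd = 0, max_term = 0, denom = 0;

          for (int i = 0; i < n; i++) {
              /* candidate for max_i cwnd_i/rtt_i^2, kept scaled by
               * alpha_scale so integer division retains precision */
              uint64_t term = ((uint64_t)cwnd[i] << ALPHA_SHIFT) /
                              ((uint64_t)rtt[i] * rtt[i]);
              if (term > max_term)
                  max_term = term;

              denom    += cwnd[i] / rtt[i]; /* sum_i cwnd_i/rtt_i */
              tot_cwnd += cwnd[i];
          }
          if (denom == 0)               /* degenerate input guard */
              return 1ULL << ALPHA_SHIFT;

          /* alpha_scale * tot_cwnd * max_i(cwnd_i/rtt_i^2)
           *             / (sum_i cwnd_i/rtt_i)^2            */
          return (tot_cwnd * max_term) / (denom * denom);
      }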
   In certain cases with small RTTs, computing alpha can still be
   expensive.  We observe that if RTTs were constant, it would be
   sufficient to compute alpha once per drop, as alpha does not change
   between drops (the insight here is that cwnd_i/cwnd_j remains
   constant as long as both windows increase).  Experimental results
   show that even if round trip times are not constant, using the
   average round trip time instead of the instantaneous round trip
   time gives good precision for computing alpha.  Hence, it is
   possible to compute alpha only once per drop according to the
   formula above, by replacing rtt_i with rtt_avg_i.

   If using the average round trip time, rtt_avg_i will be computed by
   sampling rtt_i whenever the window can accommodate one more packet,
   i.e., when cwnd/mss < (cwnd+increase)/mss.  The samples are
   averaged once per sawtooth into rtt_avg_i.  This sampling ensures
   that there is no sampling bias towards larger windows.

   Given tot_cwnd and alpha, the congestion control algorithm is run
   for each subflow independently, with similar complexity to the
   standard TCP increase code [RFC5681].

4.1.  Implementation Considerations when CWND is Expressed in Packets

   When the congestion control algorithm maintains cwnd in packets
   rather than bytes, the algorithms above must be changed to take the
   path mss into account.

   To compute the increase when an ack is received, the implementation
   for multipath congestion control is a simple extension of the TCP
   New Reno code.  In TCP New Reno, cwnd_cnt is an additional state
   variable that tracks the number of segments acked since the last
   cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and
   then cwnd_cnt is set to 0.

   In the multipath case, cwnd_cnt_i is maintained for each subflow as
   above, and cwnd_i is increased by 1 when cwnd_cnt_i > alpha_scale *
   tot_cwnd / alpha.  (A sketch of this rule appears at the end of
   this section.)

   When computing alpha for packet-based stacks, the errors in
   computing the terms in the denominator are larger (this is because
   cwnd is much smaller and rtt may be comparatively large).  Let max
   be the index of the subflow used in the numerator.  To reduce
   errors, it is easiest to move rtt_max (once calculated) from the
   numerator to the denominator, obtaining the equivalent formula
   below.

                                            cwnd_max
      alpha = alpha_scale * tot_cwnd * -----------------------
                                       /  rtt_max * cwnd_i \ 2
                                       | sum -------------- |
                                       \  i      rtt_i     /

   Note that the formula for computing alpha does not take path mss
   into account, and is the same for stacks that keep cwnd in bytes or
   in packets.  With this formula, the algorithm for computing alpha
   will match the rate of TCP on the best path in bytes per second for
   byte-oriented stacks, and in packets per second for packet-based
   stacks.  In practice, mss rarely differs between paths, so this
   should not be a problem.

   However, it is simple to derive formulae allowing packet-based
   stacks to achieve byte-rate fairness (and vice versa) if needed.
   In particular, for packet-based stacks wanting byte-rate fairness,
   the formula above changes as follows: cwnd_max is replaced by
   cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i
   * mss_i.
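   The sketch below illustrates the packet-based increase rule.  The
   max() against cwnd_i is this example's way of folding in the min()
   cap from Section 3 (never increment faster than New Reno, which
   increments once per cwnd acked packets); the structure and names
   are assumptions of the sketch, not prescribed by this document.

      #include <stdint.h>

      struct subflow {
          uint32_t cwnd;      /* congestion window, in packets */
          uint32_t cwnd_cnt;  /* packets acked since last increment */
      };

      /* Called once per newly acked packet on the subflow during
       * congestion avoidance.  alpha is pre-scaled by alpha_scale,
       * as returned by the computation sketched in Section 4. */
      static void pkt_linked_increase(struct subflow *sf,
                                      uint64_t tot_cwnd, uint64_t alpha,
                                      uint64_t alpha_scale)
      {
          if (alpha == 0)     /* guard against division by zero */
              alpha = 1;

          /* increment once per alpha_scale*tot_cwnd/alpha acked
           * packets... */
          uint64_t thresh = alpha_scale * tot_cwnd / alpha;

          /* ...but never more often than New Reno would */
          if (thresh < sf->cwnd)
              thresh = sf->cwnd;

          if (++sf->cwnd_cnt > thresh) {
              sf->cwnd++;
              sf->cwnd_cnt = 0;
          }
      }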
5.  Discussion

   To achieve perfect resource pooling, one must couple both the
   increase and the decrease of the congestion windows across
   subflows, as in [KELLY].  Yet this tends to exhibit "flappiness":
   when the paths have similar levels of congestion, the congestion
   controller will tend to allocate all the window to one random
   subflow, and allocate zero window to the other subflows.  The
   controller will perform random flips between these stable points.
   This does not seem desirable in general, and it is particularly bad
   when the achieved rates depend on the RTT (as in the current
   Internet): in such a case, the resulting rate will fluctuate
   unpredictably depending on which state the controller is in, hence
   violating Goal 1.

   By coupling only the increases, our proposal removes flappiness,
   but it also reduces the extent of resource pooling the protocol
   achieves.  The algorithm will allocate window to the subflows such
   that p_i * cwnd_i = constant, for all i.  Thus, when the loss rates
   of the subflows are equal, each subflow will get an equal window,
   removing flappiness.  When the loss rates differ, progressively
   more window will be allocated to the flow with the lower loss rate.
   In contrast, perfect resource pooling requires that all the window
   be allocated on the path with the lowest loss rate.

6.  Security Considerations

   None.

   Detailed security analysis for the Multipath TCP protocol itself is
   included in [I-D.ford-mptcp-multiaddressed] and [REF].

7.  Acknowledgements

   We thank Christoph Paasch for his suggestions for computing alpha
   in packet-based stacks.  The authors are supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.  The views expressed here are those of the
   author(s) only.  The European Commission is not liable for any use
   that may be made of the information in this document.

8.  IANA Considerations

   None.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., Handley, M., and S. Barre, "TCP
              Extensions for Multipath Operation with Multiple
              Addresses", draft-ford-mptcp-multiaddressed-01 (work in
              progress), July 2009.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate
              Byte Counting (ABC)", RFC 3465, February 2003.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38
              num. 5, pp. 47-52, October 2008.
Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk

   Mark Handley
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Damon Wischik
   University College London
   Gower Street
   London  WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk