2 ConEx B. Briscoe 3 Internet-Draft BT 4 Intended status: Informational February 14, 2014 5 Expires: August 18, 2014 7 Network Performance Isolation using Congestion Policing 8 draft-briscoe-conex-policing-01 10 Abstract 12 This document describes why policing using congestion information can 13 isolate users from network performance degradation due to each 14 other's usage, but without losing the multiplexing benefits of a LAN- 15 style network where anyone can use any amount of any resource. 16 Extensive numerical examples and diagrams are given. The document is 17 agnostic to how the congestion information reaches the policer. The 18 Congestion Exposure (ConEx) protocol is recommended, but other tunnel 19 feedback mechanisms have been proposed. 21 Status of This Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on August 18, 2014. 38 Copyright Notice 40 Copyright (c) 2014 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document.
Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 56 2. Example Bulk Congestion Policer . . . . . . . . . . . . . . . 4 57 3. Network Performance Isolation: Intuition . . . . . . . . . . 8 58 3.1. The Problem . . . . . . . . . . . . . . . . . . . . . . . 8 59 3.2. Approach . . . . . . . . . . . . . . . . . . . . . . . . 10 60 3.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 11 61 3.4. Simple Boundary Model of Congestion Control . . . . . . . 11 62 3.5. Long-Running Flows . . . . . . . . . . . . . . . . . . . 12 63 3.6. On-Off Flows . . . . . . . . . . . . . . . . . . . . . . 14 64 3.6.1. Numerical Examples Without Policing . . . . . . . . . 16 65 3.6.2. Congestion Policing of On-Off Flows . . . . . . . . . 19 66 3.7. Weighted Congestion Controls . . . . . . . . . . . . . . 20 67 3.8. A Network of Links . . . . . . . . . . . . . . . . . . . 22 68 3.8.1. Numerical Example: Isolation from Focused Load . . . 23 69 3.8.2. Isolation in the Short-Term . . . . . . . . . . . . . 24 70 3.8.3. Encouraging Load Balancing . . . . . . . . . . . . . 24 71 3.9. Tenants of Different Sizes . . . . . . . . . . . . . . . 25 72 3.10. Links of Different Sizes . . . . . . . . . . . . . . . . 25 73 3.11. Diverse Congestion Control Algorithms . . . . . . . . . . 26 74 4. Parameter Setting . . . . . . . . . . . . . . . . . . . . . . 29 75 5. Security Considerations . . . . . . . . . . . . . . . . . . . 30 76 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30 77 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 30 78 8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 30 79 9. Informative References . . . . . . . . . . . . . . . . . . . 30 80 Appendix A. Summary of Changes between Drafts . . . . . . . . . 33 82 1. Introduction 84 This document is informative, not normative. It describes why 85 congestion policing isolates the network performance of different 86 parties who are using a network, and why it is more useful than other 87 forms of performance isolation such as peak and committed information 88 rate policing, weighted round-robin or weighted fair-queueing. 90 The main point is that congestion policing isolates performance even 91 when the load applied by everyone is highly variable, where variation 92 may be over time, or by spreading traffic across different paths (the 93 hose model). Unlike other isolation techniques, congestion policing 94 is an edge mechanism that works over a pool of links across whole 95 network, not just at one link. If the load from everyone happens to 96 be constant, and there is a single bottleneck, congestion policing 97 gives a nearly identical outcome to weighted round-robin or weighted 98 fair-queuing at that bottleneck. But otherwise, congestion policing 99 is a far more general solution. 101 The document complements others that describe various network 102 arrangements that could use congestion policing, such as in a 103 broadband access network [I-D.briscoe-conex-initial-deploy], in a 104 mobile access network [I-D.ietf-conex-mobile] or in a data centre 105 network [I-D.briscoe-conex-data-centre]. Nonetheless, all the 106 numerical examples use the data centre scenario. 
108 The key to the solution is the use of congestion-bit-rate rather than 109 bit-rate as the policing metric. _How_ this works is very simple and 110 quick to describe (Section 2). 112 However, it is much more difficult to understand _why_ this approach 113 provides performance isolation; in particular, why it provides 114 performance isolation across a network of links, even though there is 115 apparently no isolation mechanism in each link. Section 3 builds up 116 an intuition for why the approach works, and why other approaches 117 fall down in different ways. The bullets below provide a summary of 118 that explanation, which builds from the simple case of long-running 119 flows through a single link up to a fully meshed network with on-off 120 flows of different sizes and different behaviours: 122 o Starting with the simple case of long-running flows focused on a 123 single bottleneck link, tenants get weighted shares of the link, 124 much like weighted round robin, but with no mechanism in any of 125 the links. This is because losses (or ECN marks) are random, so 126 if one tenant sends twice as much bit-rate it will suffer twice as 127 many lost bits (or ECN-marked bits). So, at least for constant 128 long-running flows, regulating congestion-bits gives the same 129 outcome as regulating bits; 131 o In the more realistic case where flows are not all long-running 132 but a mix of short to very long, it is explained that bit-rate is 133 not a sufficient metric for isolating performance; how _often_ a 134 tenant is sending (or not sending) is the significant factor for 135 performance isolation, not whether bit-rate is shared equally 136 whenever a source happens to be sending; 138 o Although it might seem that data volume would be a good measure of 139 how often a tenant is sending, we then show why it is not. For 140 instance, a tenant can send a large volume of data but hardly 141 affect the performance of others -- by being more responsive to 142 congestion. Using congestion-volume (congestion-bit-rate over 143 time) in a policer encourages large data senders to be more 144 responsive (to yield), giving other tenants much higher 145 performance while hardly affecting their own performance. 146 Whereas, using straight volume as an allocation metric provides no 147 distinction between high volume sources that yield and high volume 148 sources that do not yield (the widespread behaviour today); 150 o We then show that a policer based on the congestion-bit-rate 151 metric works across a network of links treating it as a pool of 152 capacity, whereas other approaches treat each link independently, 153 which is why the proposed approach requires none of the 154 configuration complexity on switches that is involved in other 155 approaches. 157 o We also show that a congestion policer can be arranged to limit 158 bursts of congestion from sources that focus traffic onto a single 159 link, even where one source may consist of a large aggregate of 160 sources. 162 o We show that a congestion policer rewards traffic that shifts to 163 less congested paths (e.g. multipath TCP or virtual machine 164 motion). 166 o We show that congestion policing works on the pool of links, 167 irrespective of whether individual links have significantly 168 different capacities. 170 o We show that a congestion policer allows a wide variety of 171 responses to congestion (e.g.
New Reno TCP [RFC5681], Cubic TCP, 172 Compound TCP, Data Centre TCP [DCTCP] and even unresponsive UDP 173 traffic), while still encouraging and enforcing a sufficient 174 response to congestion from all sources taken together, so that 175 the performance of each application is sufficiently isolated from 176 others. 178 2. Example Bulk Congestion Policer 180 In order to explain why a congestion policer works, we first describe 181 a particular design of congestion policer, called a bulk congestion 182 policer. The aim is to give a concrete example to motivate the 183 protocol changes necessary to implement such a policer. This is 184 merely an informative document, so there is no intention to 185 standardise this or any specific congestion policing algorithm. The 186 aim is for different implementers to be free to develop their own 187 improvements to this design. 189 A number of other documents describe deployment arrangements that 190 include a congestion policer, for instance in a broadband access 191 network [I-D.briscoe-conex-initial-deploy], in a mobile access 192 network [I-D.ietf-conex-mobile] and in a data centre network 194 [I-D.briscoe-conex-data-centre]. The congestion policer described in 195 the present document could be deployed in any of these arrangements. 196 However, to be concrete, the example link capacities, topology and 197 terminology used in the present document assume a multi-tenant data 198 centre scenario [I-D.briscoe-conex-data-centre]. It is believed that 199 similar examples could be drawn up for the other scenarios, just 200 using different numbers for link capacities, etc. and replacing some 201 terminology, e.g. substituting 'end-customer' for 'tenant' 202 throughout. 204 Figure 1 illustrates a bulk congestion policer. It is very similar 205 to a regular token bucket for policing bit-rate except the tokens do 206 not represent bits, rather they represent 'congestion-bits'. Using 207 Congestion Exposure (ConEx) [I-D.ietf-conex-abstract-mech] is 208 probably the easiest way to understand what congestion-bits are. 209 However, there are other valid ways to signal congestion information 210 to a congestion policer without ConEx (e.g. the tunnelled feedback 211 described in [I-D.briscoe-conex-data-centre]). 213 Using ConEx as an example, a ConEx-based congestion token bucket 214 ignores any packets unless they are ConEx-marked. The ConEx protocol 215 includes audit capabilities that force the sender to reveal 216 information it receives about losses anywhere on the downstream path 217 (e.g. at the points marked X in Figure 1). The sender signals this 218 information with a ConEx-marking in the IP header of the packets it 219 sends into the network, using an arrangement called re-inserted 220 feedback or re-feedback. So, every time the sender detects a loss of 221 a 1500B packet, it has to apply a ConEx-marking to a subsequent 1500B 222 packet (or it has to ConEx-mark at least as many bytes made up of 223 smaller packets). In Figure 1 ConEx-marked packets are shown tagged 224 with a '*'. 226 A bulk congestion policer is deployed at every point where traffic 227 enters an operator's network (similar to the Diffserv architecture). 228 At any one point of entry, the operator assigns a congestion token 229 bucket to each tenant that is sending traffic into the network. The 230 operator assigns a contracted token-fill-rate w1 to tenant 1, w2 to 231 tenant 2, and in general wi to the tenant with index i, where i = 1,2,...
232 This contracted token-fill-rate, wi, can be thought of like a weight, 233 where a higher weight allows a tenant to contribute more traffic 234 through a downstream congested link or to contribute traffic into 235 more congested links, as we shall see. 237 Tenants wanting to send more traffic can be assigned a higher weight 238 (probably paying more for it). Typically, tenants might sign-up to a 239 traffic contract in two parts: 241 1. a peak rate, which represents the capacity of their access to the 242 network 244 2. a congestion token-fill rate, which limits how much traffic can 245 be sent that limits what others are sending. 247 In Figure 1, as each packet arrives, the classifier assigns it to the 248 congestion token bucket of the relevant tenant. The token bucket 249 ignores all non-ConEx-marked packets - they do not drain any tokens 250 from the bucket. Otherwise, if a packet is ConEx-marked and ,say, 251 1500B in size it will drain 1500B of tokens from the congestion token 252 bucket. If the tenant is contributing to more congestion than its 253 contracted congestion-fill-rate, the bucket level will decrease, and 254 if it persists, the bucket level will become low enough to cause the 255 policer to start discarding a proportion of all that tenant's traffic 256 (whether ConEx-marked or not). 258 ______________________ 259 | | | | Legend | 260 |w1 |w2 |wi | | 261 | | | | [_] [_]packet stream | 262 V V V | | 263 congestion . . . | [*] marked packet | 264 token bucket| . | | . | __|___| | ___ | 265 __|___| | . | | |:::| | \ / policer | 266 | |:::| __|___| | |:::| | /_\ | 267 | +---+ | +---+ | +---+ | | 268 bucket depth | : | : | : | /\ marking meter | 269 controls the | . | : | . | \/ | 270 policer _V_ . | : | . |______________________| 271 ____\ /__/\___________________________ downstream 272 /[*] /_\ \/ [_] | : [_] | : [_] \ /->network 273 class-/ | . | . \ / /---> 274 ifier/ _V_ . | . \ / / 275 __,--.__________________\ /__/\___________________\______/____/ loss 276 `--' [_] [*] [_] /_\ \/ [_] | . [*] / \ \-X---> 277 \ | . / \--> 278 \ _V_ : / \ loss 279 \__________________________\ /__/\_____/ \-X---> 280 [_] [*] [_] [*] [_] /_\ \/ [_] \ 281 \---> 283 Figure 1: Bulk Congestion Policer Schematic 285 Note that this is called a bulk congestion policer because it polices 286 all the flows of a tenant even if they spread out in a hose pattern 287 across numerous downstream networks and even if only a few of these 288 flows are primarily to blame for contributing heavily to downstream 289 congestion, while others find plenty of capacity downstream. This 290 approach is motivated by simplicity rather than precision (if 291 precision were required other policer designs could be used). 292 Simplicity might be more important than precision if capacity were 293 generally well-provisioned and the policer was intended to intervene 294 only rarely to limit unusually extreme behaviour. Also, a network 295 might take such a simple but somewhat heavy-handed approach in order 296 to motivate tenants to implement more complex flow-specific controls 297 themselves [CongPol]. 299 Various more sophisticated congestion policer designs have been 300 evaluated [CPolTrilogyExp]. In these experiments, it was found that 301 it is better if the policer gradually increases discards as the 302 bucket becomes empty. Also isolation between tenants is better if 303 each tenant is policed based on the combination of two buckets, not 304 one (Figure 2): 306 1. 
A deep bucket (that would take minutes or even hours to fill at 307 the contracted fill-rate) that constrains the tenant's long-term 308 average rate of congestion (wi) 310 2. a very shallow bucket (e.g. only two or three MTU) that is filled 311 considerably faster than the deep bucket (c * wi), where c = ~10, 312 which prevents a tenant storing up a large backlog of tokens then 313 causing congestion in one large burst. 315 In this arrangement each marked packet drains tokens from both 316 buckets, and the probability of policer discard is taken as the worse 317 of the two buckets. 319 | | 320 Legend: |c*wi |wi 321 See previous figure V V 322 . . 323 . | . | deep bucket 324 _ _ _ _ _ _ _ _ _ _ _ _ |___| 325 | . |:::| 326 |_ _ _ _ _ _ _ |___| |:::| 327 | shallow +---+ +---+ 328 worse of the| bucket 329 two buckets| \____ ____/ 330 triggers| \ / both buckets 331 policing V : drained by 332 ___ . marked packets 333 ___________\ /___________________/ \__________________ 334 [_] [_] /_\ [_] [*] [_] \ / [_] [_] [_] 336 Figure 2: Dual Congestion Token Bucket (in place of each single 337 bucket in the previous figure) 339 This design of congestion policer will be sufficient to understand 340 the rest of the document. However, it should be remembered that this 341 is only an example, not an ideal design. Also it should be 342 remembered that other mechanisms such as tunnel feedback can be used 343 instead of ConEx.
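To make the dual bucket arrangement concrete, the sketch below shows one way it could be coded. It is written in Python purely for illustration: the class names, the one-hour deep-bucket depth, the linear discard ramp and the treatment of whole packets (rather than just their ConEx-marked bytes) are all arbitrary choices made for this document, not part of any specification.

   import random

   class CongestionTokenBucket:
       def __init__(self, fill_rate_bps, depth_bits):
           self.fill_rate = fill_rate_bps  # token fill-rate (bits/s)
           self.depth = depth_bits         # maximum tokens held (bits)
           self.level = depth_bits         # start full
           self.last = 0.0                 # time of last update (s)

       def update(self, now):
           # add tokens for the elapsed time, up to the bucket depth
           elapsed = now - self.last
           self.level = min(self.depth, self.level + self.fill_rate * elapsed)
           self.last = now

       def drain(self, bits):
           # called only for ConEx-marked bits
           self.level = max(0.0, self.level - bits)

       def discard_prob(self):
           # discard probability ramps up gradually as the bucket
           # approaches empty (here over the bottom 10%, an arbitrary choice)
           threshold = 0.1 * self.depth
           if self.level >= threshold:
               return 0.0
           return 1.0 - self.level / threshold

   class DualCongestionPolicer:
       def __init__(self, w_bps, c=10, mtu_bits=12000):
           # deep bucket: takes about an hour to fill at the contracted rate wi
           self.deep = CongestionTokenBucket(w_bps, depth_bits=3600 * w_bps)
           # shallow bucket: only a few MTU deep, filled at c * wi
           self.shallow = CongestionTokenBucket(c * w_bps, depth_bits=3 * mtu_bits)

       def packet(self, now, size_bits, conex_marked):
           self.deep.update(now)
           self.shallow.update(now)
           if conex_marked:
               # ConEx-marked bits drain both buckets
               self.deep.drain(size_bits)
               self.shallow.drain(size_bits)
           # police all of the tenant's packets, marked or not, using the
           # worse of the two buckets
           p = max(self.deep.discard_prob(), self.shallow.discard_prob())
           return random.random() >= p     # True = forward, False = discard

A real deployment would place one such policer instance behind the per-tenant classifier of Figure 1, at each point where the tenant's traffic enters the network.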
345 3. Network Performance Isolation: Intuition 347 3.1. The Problem 349 Network performance isolation traditionally meant that each user 350 could be sure of a minimum guaranteed bit-rate. Such assurances are 351 useful if traffic from each tenant follows relatively predictable 352 paths and is fairly constant. If traffic demand is more dynamic and 353 unpredictable (both over time and across paths), minimum bit-rate 354 assurances can still be given, but they have to be very small 355 relative to the available capacity, because a large number of users 356 might all want to simultaneously share any one link, even though 357 they rarely all use it at the same time. 359 This either means the shared capacity has to be greatly overprovided 360 so that the assured level is large enough, or the assured level has 361 to be small. The former is unnecessarily expensive; the latter 362 doesn't really give a sufficiently useful assurance. 364 Round robin or fair queuing are other forms of isolation that 365 guarantee that each user will get 1/N of the capacity of each link, 366 where N is the number of active users at each link. This is fine if 367 the number of active users (N) sharing a link is fairly predictable. 368 However, if large numbers of tenants do not typically share any one 369 link but at any time they all could (as in a data centre), a 1/N 370 assurance is fairly worthless. Again, given N is typically small but 371 could be very large, either the shared capacity has to be expensively 372 overprovided, or the assured bit-rate has to be worthlessly small. 373 The argument is no different for the weighted forms of these 374 algorithms (WRR & WFQ). 376 Both these traditional forms of isolation try to give one tenant 377 assured instantaneous bit-rate by constraining the instantaneous bit- 378 rate of everyone else. This approach is flawed except in the special 379 case when the load from every tenant on every link is continuous and 380 fairly constant. The reality is usually very different: sources are 381 on-off and the route taken varies, so that on any one link a source 382 is more often off than on. 384 In these more realistic (non-constant) scenarios, the capacity 385 available for any one tenant depends much more on _how often_ 386 everyone else uses a link, not just _how much_ bit-rate everyone 387 else would be entitled to if they did use it. 389 For instance, if 100 tenants are using a 1Gb/s link for 1% of the 390 time, there is a good chance each will get the full 1Gb/s link 391 capacity. But if just six of those tenants suddenly start using the 392 link 50% of the time, whenever the other 94 tenants need the link, 393 they will typically find 3 of these heavier tenants using it already. 394 If a 1/N approach like round-robin were used, then the light tenants 395 would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin 396 cannot claim to isolate tenants from each other if they usually get 397 1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in 398 the worst case when all 100 tenants are active). 400 In contrast, congestion policing is the key to network performance 401 isolation because it focuses policing only on those tenants that go 402 fast over congested path(s) excessively and persistently over time. 403 This keeps congestion below a design threshold everywhere so that 404 everyone else can go fast. In this way, congestion policing takes 405 account of highly variable loads (varying in time and varying across 406 routes). And, if everyone's load happens to be constant, congestion 407 policing converges on the same outcome as the traditional forms of 408 isolation. 410 The other flaw in the traditional approaches to isolation, like WRR & 411 WFQ, is that they actually prevent long-running flows from yielding 412 to brief bursts from lighter tenants. A long-running flow can yield 413 to brief flows and still complete nearly as soon as it would have 414 otherwise (the brief flows complete sooner, freeing up the capacity 415 for the longer flow sooner--see Section 3.7). However, WRR & WFQ 416 prevent flows from even seeing the congestion signals that would 417 allow them to co-ordinate between themselves, because they isolate 418 each tenant completely into separate queues. 420 In summary, superficially, traditional approaches with separate 421 queues sound good for isolation, but: 423 1. not when everyone's load is variable, so each tenant has no 424 assurance about how many other queues there will be; 426 2. and not when each tenant can no longer even see the congestion 427 signals from other tenants, so no-one's control algorithms can 428 determine whether they would benefit most by pushing harder or 429 yielding. 431 3.2. Approach 433 It has not been easy to find a way to give the intuition on why 434 congestion policing isolates performance, particularly across a 435 network of links not just on a single link. The approach used in 436 this section is to describe the system as if everyone is using the 437 congestion response they would be forced to use if congestion 438 policing had to intervene. We therefore call this the boundary model 439 of congestion control. It is a very simple congestion response, so 440 it is much easier to understand than if we introduced all the square 441 root terms and other complexity of New Reno TCP's response. And it 442 means we don't have to try to describe a mix of responses.
444 The purpose of congestion policing is not to intervene in everyone's 445 rate control all the time. Rather it is to encourage each tenant to 446 _avoid_ being policed -- to keep the aggregate of all their flows' 447 rates within an overall envelope that is sufficiently responsive to 448 congestion. Nonetheless, congestion policing can and will enforce a 449 congestion response if a particular tenant sends traffic that is 450 completely unresponsive to congestion. This upper bound enforced by 451 everyone else's congestion policers is what ensures that each 452 tenant's minimum performance is isolated from the combined effect of 453 everyone else. 455 We cannot emphasise enough that the intention is not to make 456 individual flows conform to this boundary response to congestion. 457 Indeed the intention is to allow a diverse evolving mix of congestion 458 responses, but constrained in total within a simple overall envelope. 460 After describing and further justifying using the simple boundary 461 model of congestion control, we start by considering long-running 462 flows sharing one link. Then we will consider on-off traffic, before 463 widening the scope from one link to a network of links and to links 464 of different sizes. Then we will depart from the initial simplified 465 model of congestion control and consider diverse congestion control 466 algorithms, including completely unresponsive end-systems. 468 Formal analysis to back up the intuition provided by this section 469 will be made available in a more extensive companion technical report 470 [conex-dc_tr]. 472 3.3. Terminology 474 The term congestion probability will be used as a generalisation of 475 either loss probability or ECN marking probability. 477 It is necessary to carefully distinguish: 479 o congestion bit-rate (or just congestion rate), which is an 480 absolute measure of the rate of congested bits 482 o congestion probability, which is a relative measure of the 483 proportion of congested bits to all bits 485 Sometimes people loosely talk of loss rate when they mean loss 486 probability (measured in %). In this document, we will keep to 487 strict terminology, so loss rate would be measured in lost packets 488 per second, or lost bits per second. 490 3.4. Simple Boundary Model of Congestion Control 492 The boundary model of congestion control ensures a flow's bit-rate is 493 inversely proportional to the congestion level that it detects. For 494 instance, if congestion probability doubles, the flow's bit-rate 495 halves. This is called a scalable congestion control because it 496 maintains the same rate of congestion signals (marked or dropped 497 packets) no matter how fast it goes. Examples from the research 498 community are Relentless TCP [Relentless] and Scalable TCP [STCP]. 500 In production systems, TCP algorithms based on New Reno [RFC5681] 501 have been widely replaced by alternatives closer to this scalable 502 ideal (e.g. Cubic TCP in Linux and Compound TCP in Windows), because 503 at high rates New Reno generated congestion signals too infrequently 504 to track available capacity fast enough [RFC3649]. More recent TCP 505 updates (e.g. Data Centre TCP [DCTCP] in Windows 8 and available in 506 Linux) are becoming closer still to the scalable ideal. 508 For instance, consider a scenario where a flow with scalable 509 congestion control is alone in a 1Gb/s link, then another similar 510 flow from another tenant joins it. Both will push up the congestion 511 probability, which will push down their rates until they together fit 512 into the link. Because the flow's rate has to halve to accommodate 513 the new flow, congestion probability will double (let's say from 514 0.002% to 0.004%), by our initial assumption of two similar 515 scalable congestion controls. When it is alone on the link, the 516 congestion-bit-rate of the flow is 20kb/s (=1Gb/s * 0.002%), and when 517 it shares the link it is still 20kb/s (= 500Mb/s * 0.004%).
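As a quick arithmetic check of the figures just quoted: the boundary model makes a flow's bit-rate equal to w divided by the congestion probability it sees, so the product of the two (the congestion-bit-rate) stays at w however the link is shared. A minimal sketch, using only the numbers above:

   w = 20e3                                  # congestion-bit-rate, 20kb/s
   for bit_rate, p in [(1e9, 0.002e-2),      # the flow alone on the 1Gb/s link
                       (500e6, 0.004e-2)]:   # sharing it with one similar flow
       assert round(bit_rate * p) == w       # both cases give 20kb/s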
519 In summary, a congestion control can be considered scalable if the 520 bit-rate of packets carrying congestion signals (the congestion-bit- 521 rate) always stays the same no matter how much capacity it finds 522 available. This ensures there will always be enough signals in a 523 round trip time to keep the dynamics under control. 525 Reminder: Making individual flows conform to this boundary (scalable) 526 response to congestion is a non-goal. Although we start this 527 explanation with this specific simple end-system congestion response, 528 this is just to aid intuition. 530 3.5. Long-Running Flows 532 Table 1 shows various scenarios where each of five tenants has 533 contracted for 40kb/s of congestion-bit-rate in order to share a 1Gb/ 534 s link. In order to help intuition, we start with the (unlikely) 535 scenario where all their flows are long-running. A number of long- 536 running flows will try to use all the link capacity, so for 537 simplicity we take utilisation as a round figure of 100%. 539 In the case we have just described (scenario A) neither tenant's 540 policer is intervening at all, because both their congestion 541 allowances are 40kb/s and each sends only one flow that contributes 542 20kb/s of congestion -- half the allowance. 544 +--------+-------------+----------+----------+----------+-----------+ 545 | Tenant | contracted | scenario | scenario | scenario | scenario | 546 | | congestion- | A | B | C | D | 547 | | bit-rate | # : Mb/s | # : Mb/s | # : Mb/s | # : Mb/s | 548 | | kb/s | | | | | 549 +--------+-------------+----------+----------+----------+-----------+ 550 | | | | | | | 551 | (a) | 40 | 1 : 500 | 5 : 250 | 5 : 200 | 5 : 250 | 552 | (b) | 40 | 1 : 500 | 3 : 250 | 3 : 200 | 2 : 250 | 553 | (c) | 40 | - : --- | 3 : 250 | 3 : 200 | 2 : 250 | 554 | (d) | 40 | - : --- | 2 : 250 | 2 : 200 | 1 : 125 | 555 | (e) | 40 | - : --- | - : --- | 2 : 200 | 1 : 125 | 556 | | Congestion | 0.004% | 0.016% | 0.02% | 0.016% | 557 | | probability | | | | | 558 +--------+-------------+----------+----------+----------+-----------+ 560 Table 1: Bit-rates that a congestion policer allocates to five 561 tenants sharing a 1Gb/s link with various numbers (#) of long-running 562 flows all using 'scalable congestion control' of weight 20kb/s 564 Scenario B shows a case where four of the tenants all send 2 or more 565 long-running flows. Recall, from the assumption of scalable 566 controls, that each flow always contributes 20kb/s no matter how fast 567 it goes. Therefore the policers of tenants (a-c) limit them to two 568 flows-worth of congestion (2 x 20kb/s = 40kb/s). Tenant (d) is only 569 asking for 2 flows, so it gets them without being policed, and all 570 four get the same quarter share of the link. 572 Scenario C is similar, except the fifth tenant (e) joins in, so they 573 all get equal 1/5 shares of the link.
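The outcomes of scenarios B to D can be reproduced with a few lines of arithmetic: under the boundary model each tenant is effectively limited to (allowance / w) flows-worth of congestion, and the link capacity is shared in proportion to the flows-worth each tenant actually uses. The sketch below (Python, illustrative only; the function name is invented here) re-derives the Table 1 figures:

   LINK = 1e9          # 1Gb/s link
   w = 20e3            # congestion-bit-rate of each flow, 20kb/s
   allowance = 40e3    # contracted congestion-bit-rate per tenant, 40kb/s

   def shares(flows_requested):
       # the policer caps each tenant at allowance/w flows-worth of congestion
       f = [min(n, allowance / w) for n in flows_requested]
       total = sum(f)
       rates = [LINK * fi / total for fi in f]   # per-tenant bit-rate (b/s)
       congestion_prob = total * w / LINK        # congestion probability at the link
       return rates, congestion_prob

   print(shares([5, 3, 3, 2]))        # scenario B: 250Mb/s each, 0.016%
   print(shares([5, 3, 3, 2, 2]))     # scenario C: 200Mb/s each, 0.02%
   print(shares([5, 2, 2, 1, 1]))     # scenario D: 250, 250, 250, 125, 125 Mb/s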
575 In Scenario D, only tenant (a) sends more than two flows, so (a)'s 576 policer limits it to two flows-worth of congestion, and everyone else 577 gets the number of flows-worth that they ask for. This means that 578 tenants (d) & (e) get less than everyone else, which is fine because 579 they asked for less than they would have been allowed. (Similarly, 580 in Scenarios A & B, some of the tenants are inactive, so they get 581 zero, which is also less than they could have had if they had 582 wanted.) 584 With lots of long-running flows, as in scenarios B & C, congestion 585 policing seems to emulate per-tenant round robin scheduling, 586 equalising the bit-rate of each tenant, no matter how many flows they 587 run. By configuring different contracted allowances for each tenant, 588 it can easily be seen that congestion policing could emulate weighted 589 round robin (WRR), with the relative sizes of the allowances acting 590 as the weights. 592 Scenario D departs from round-robin. This is deliberate, the idea 593 being that tenants are free to take less than their share in the 594 short term, which allows them to take more at other times, as we 595 shall see in Section 3.7. In Scenario D, policing focuses only on 596 the tenant (a) that is continually exceeding its contract. This 597 policer focuses discard solely on tenant a's traffic so that it 598 cannot cause any more congestion at the shared link (shown as 0.016% 599 in the last row). 601 To summarise so far, ingress congestion policers control congestion- 602 bit-rate in order to indirectly assure a minimum bit-rate per tenant. 603 With lots of long-running flows, the outcome is somewhat similar to 604 WRR, but without the need for any mechanism in each queue. 606 However, aiming to behave like WRR is only useful when all flows are 607 infinitely long. When load is variable because transfers are of 608 finite size, congestion policing properly isolates one tenant's 609 performance from the behaviour of others, unlike WRR would, as will 610 now be explained. 612 3.6. On-Off Flows 614 Figure 3 compares two example scenarios where tenant 'b' regularly 615 sends small files in the top chart and the same size files but more 616 often in the bottom chart (a higher 'on-off ratio'). This is the 617 typical behaviour of a Web server as more clients request more files 618 at peak time. Meanwhile, in this example, tenant c's behaviour 619 doesn't change between the two scenarios -- it sends a couple of 620 large files, each starting at the same time in both cases. 622 The capacity of the link that 'b' and 'c' share is shown as the full 623 height of the plot. The files sent by 'b' are shown as little 624 rectangles. 'b' can go at the full bit-rate of the link when 'c' is 625 not sending, which is represented by the tall thin rectangles part- 626 way through the trace labelled 'b'. We assume for simplicity that 627 'b' and 'c' divide up the bit-rate equally. So, when both 'b' and 628 'c' are sending, the 'b' rectangles are half the height (bit-rate) 629 and twice the width (duration) relative to when 'b' sends alone. The 630 area of a file to be transferred stays the same, whether tall and 631 thin or short and fat, because the area represents the size of the 632 file (bit-rate x duration = file size). The files from 'c' look like 633 inverted castellations, because 'c' uses half the link rate while 634 each file from 'b' completes, then 'c' can fill the link until 'b' 635 starts the next file. 
The cross-hatched areas represent idle times 636 when no-one is sending. 638 For this simple scenario we ignore start-up dynamics and just focus 639 on the rate and duration of flows that are long enough to stabilise, 640 which is why they can be represented as simple rectangles. We will 641 introduce the effect of flow startups later. 643 In the bottom case, where 'b' sends more often, the gaps between b's 644 transfers are smaller, so 'c' has less opportunity to use the whole 645 line rate. This squeezes out the time it takes for 'c' to complete 646 its file transfers (recall a file will always have the same area 647 which represents its size). Although 'c' finishes later, it still 648 starts the next flow at the same time. In turn, this means 'c' is 649 sending during a greater proportion of b's transfers, which extends 650 b's average completion time too. 652 ^ bit-rate 653 | 654 |---------------------------------------,--.---------,--.----,------- 655 | | |\/\/\/\/\| |/\/\| 656 | c | b|/\/\/\/\/| b|\/\/| c 657 |------. ,-----. ,-----. | |\/\/\/\/\| |/\/\| ,-- 658 | b | | b | | b | | |/\/\/\/\/| |\/\/| | b 659 | | | | | | | |\/\/\/\/\| |/\/\| | 660 +------'------'-----'------'-----'------'--'---------'--'----'----'--> 661 time 662 ^ bit-rate 663 | 664 |---------------------------------------------------.--,--.--,------- 665 | |/\| |\/| 666 | c |\/| b|/\| c 667 |------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,---- 668 | b | | b | | b | | b | | b | | b |\/| |/\| | b 669 | | | | | | | | | | | |/\| |\/| | 670 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'----> 671 time 673 Figure 3: In the lower case, the on-off ratio of 'b' has increased, 674 which extends all the completion times of 'c' and 'b' 676 Round-robin would do little if anything to isolate 'c' from the 677 effect of 'b' sending files more often. Round-robin is designed to 678 force 'b' and 'c' to share the capacity equally when they are both 679 active. But in both scenarios they already share capacity equally 680 when they are both active. The difference is in how often they are 681 active. Round-robin and other traditional fair queuing techniques 682 don't have any memory to sense that 'b' has been active more of the 683 time. 685 In contrast, a congestion policer can tell when one tenant is sending 686 files more frequently, by measuring the rate at which the tenant is 687 contributing to congestion. Our aim is to show that policers will be 688 able to isolate performance properly by using the right metric 689 (congestion bit-rate), rather than using the wrong metric (bit-rate), 690 which doesn't sense whether the load over time is large or small. 692 3.6.1. Numerical Examples Without Policing 694 The usefulness of the congestion bit-rate metric will now be 695 illustrated with the numerical examples in Table 2. The scenarios 696 illustrate what the congestion bit-rate would be without any policing 697 or scheduling action in the network. Then this metric can be 698 monitored and limited by a policer, to prevent one tenant from 699 harming the performance of others. 701 The 2nd & 3rd columns (file-size and inter-arrival time) fully 702 represent the behaviour of each tenant in each scenario. All the 703 other columns merely characterise the outcome in various ways. The 704 inter-arrival time (T) is the average time between starting one file 705 and the next. For instance, in scenario E tenant 'b' sends a 16Mb 706 file every 200ms on average.
The formula in the heading of some 707 columns shows how the column was derived from other columns. 709 Scenario E is contrived so that the three tenants all offer the same 710 load to the network, even though they send files of very different 711 size (S). The files sent by tenant 'a' are 100 times smaller than 712 those of tenant 'b', but 'a' sends them 100 times more often. In 713 turn, b's files are 100 times smaller than c's, but 'b' in turn sends 714 them 100 times more often. Graphically, the scenario would look 715 similar to Figure 3, except with three sizes of file, not just two. 716 Scenarios E-G are designed to roughly represent various distributions 717 of file sizes found in data centres, but still to be simple enough to 718 facilitate intuition, even though each tenant would not normally send 719 just one size file. 721 The average completion time (t) and the maximum were calculated from 722 a fairly simple analytical model (documented in a companion technical 723 report [conex-dc_tr]). Using one data point as an example, it can be 724 seen that a 1600Mb (200MB) file from tenant 'c' completes in 1905ms 725 (about 1.9s). The files that are 100 times smaller complete 100 726 times more quickly on average. In fact, in this scenario with equal 727 loads, each tenant perceives that their files are being transferred 728 at the same rate of 840Mb/s on average (file-size divided by 729 completion time, as shown in the 'apparent bit-rate' column). Thus 730 all three tenants perceive they are getting 84% of the 1Gb 731 /s link on average (due to the benefit of multiplexing and 732 utilisation being low at 240Mb/s / 1Gb/s = 24% in this case). 734 The completion times of the smaller files vary significantly, 735 depending on whether a larger file transfer is proceeding at the same 736 time. We have already seen this effect in Figure 3, where, when 737 tenant b's files share with 'c', they take twice as long to complete 738 as when they don't. This is why the maximum completion time is 739 greater than the average for the small files, whereas there is 740 imperceptible variance for the largest files. 742 The final column shows how congestion bit-rate will be a useful 743 metric to enforce performance isolation (as we said, the table 744 illustrates the situation before any enforcement mechanism is added). 745 In the case of equal loads (scenario E), average congestion bit-rates 746 are all equal. In scenarios F and G average congestion bit-rates are 747 higher, because all tenants are placing much more load on the network 748 over time, even though each still sends at equal rate to others when 749 they are active together. Figure 3 illustrated a similar effect in 750 the difference between the top and bottom scenarios. 752 The maximum instantaneous congestion bit-rate is nearly always 20kb/ 753 s. That is because, by definition, all the tenants are using scalable 754 congestion controls with a constant congestion rate of 20kb/s, and 755 each tenant's flows rarely overlap because the per-tenant load in 756 these scenarios is fairly low. As we saw in Section 3.4, the 757 congestion rate of a particular scalable congestion control is always 758 the same, no matter how many other flows it competes with.
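The derived columns of Table 2 follow directly from the file size S, the inter-arrival time T and the completion time t, using the formulas shown in the column headings. The sketch below (Python, illustrative only) re-computes the scenario E rows:

   w = 20e3                                # congestion-bit-rate per flow (b/s)
   scenario_E = [                          # (tenant, S in bits, T in s, t in s)
       ('a', 0.16e6,   0.002, 0.00019),
       ('b', 16e6,     0.2,   0.019),
       ('c', 1600e6,  20.0,   1.905),
   ]
   for name, S, T, t in scenario_E:
       load = S / T                        # average load: 80Mb/s for each tenant
       apparent = S / t                    # apparent bit-rate: roughly 840Mb/s
       congestion = w * t / T              # average congestion-bit-rate: ~2kb/s
       print(name, load / 1e6, apparent / 1e6, congestion / 1e3)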
760 Once it is understood that the congestion bit-rate of one scalable 761 flow is always 'w' and doesn't change whenever a flow is active, it 762 becomes clear what the congestion bit-rate will be when averaged over 763 time; it will simply be 'w' multiplied by the proportion of time that 764 the tenant's file transfers are active. That is, w*t/T. For 765 instance, in scenario E, on average tenant b's flows start 200ms 766 apart, but they complete in 19ms. So they are active for 19/200 = 767 10% of the time (rounded). A tenant that causes a congestion bit- 768 rate of 20kb/s for 10% of the time will have an average congestion- 769 bit-rate of 2kb/s, as shown. 771 To summarise so far, no matter how many more files other tenants 772 transfer at the same time, each scalable flow still contributes to 773 congestion at the same rate, but it contributes for more of the time, 774 because it squeezes out into the gap before the next flow from that 775 tenant starts. 777 +-----+------+--------+------+-------------+-----------+------------+ 778 | Ten | File | Ave. | Ave. | Completion | Apparent | Congest- | 779 | ant | size | inter- | load | time | bit-rate | ion bit- | 780 | | S | arr- | S/T | ave : max | ave : min | rate | 781 | | Mb | ival | Mb/s | t | S/t | ave : max | 782 | | | T | | ms | Mb/s | w*t/T | 783 | | | ms | | | | kb/s | 784 +-----+------+--------+------+-------------+-----------+------------+ 785 | | | | | Scenario E | | | 786 | a | 0.16 | 2 | 80 | 0.19 : 0.48 | 840 : 333 | 2 : 20 | 787 | b | 16 | 200 | 80 | 19 : 35 | 840 : 460 | 2 : 20 | 788 | c | 1600 | 20000 | 80 | 1905 : 1905 | 840 : 840 | 2 : 20 | 789 | | | | ____ | | | | 790 | | | | 240 | | | | 791 | | | | | Scenario F | | | 792 | a | 0.16 | 0.67 | 240 | 0.31 : 0.48 | 516 : 333 | 9 : 20 | 793 | b | 16 | 50 | 320 | 29 : 42 | 557 : 380 | 11 : 20 | 794 | c | 1600 | 10000 | 160 | 3636 : 3636 | 440 : 440 | 7 : 20 | 795 | | | | ____ | | | | 796 | | | | 720 | | | | 797 | | | | | Scenario G | | | 798 | a | 0.16 | 0.67 | 240 | 0.33 : 0.64 | 481 : 250 | 10 : 20 | 799 | b | 16 | 40 | 400 | 32 : 46 | 505 : 345 | 16 : 40 | 800 | c | 1600 | 10000 | 160 | 4543 : 4543 | 352 : 352 | 9 : 20 | 801 | | | | ____ | | | | 802 | | | | 800 | | | | 803 +-----+------+--------+------+-------------+-----------+------------+ 805 Single link of capacity 1Gb/s. Each tenant uses a scalable congestion 806 control which contributes a congestion-bit-rate for each flow of w = 807 20kb/s. 809 Table 2: How the effect on others of various file-transfer behaviours 810 can be measured by the resulting congestion-bit-rate 812 In scenario F, clients have increased the rate they request files 813 from tenants a, b and c respectively by 3x, 4x and 2x relative to 814 scenario E. The tenants send the same size files but 3x, 4x and 2x 815 more often. For instance tenant 'b' is sending 16Mb files four times 816 as often as before, and they now take longer as well -- nearly 29ms 817 rather than 19ms -- because the other tenants are active more often 818 too, so completion gets squeezed to later. Consequently, tenant 'b' 819 is now sending 57% of the time, so its congestion-bit-rate is 20kb/s 820 * 57% = 11kb/s. This is nearly 6x higher than in scenario E, 821 reflecting both b's own increase by 4x and that this increase 822 coincides with everyone else increasing their load. 824 In scenario G, tenant 'b' increases even more, to 5x the load it 825 offered in scenario E. 
This results in average utilisation of 800Mb/s 826 / 1Gb/s = 80%, compared to 72% in scenario F and only 24% in scenario 827 E. 'b' sends the same files but 5x more often, so its load rises 5x. 829 Completion times rise for everyone due to the overall rise in load, 830 but the congestion rates of 'a' and 'c' don't rise anything like as 831 much as that of 'b', because they still leave large gaps between 832 files. For instance, tenant 'c' completes each large file transfer 833 in 4.5s (compared to 1.9s in scenario E), but it still only sends 834 files every 10s. So 'c' only sends 45% of the time, which is 835 reflected in its congestion bit-rate of 20kb/s * 45% = 9kb/s. 837 In contrast, on average tenant 'b' can only complete each medium- 838 sized file transfer in 32ms (compared to 19ms in scenario E), but on 839 average it starts sending another file after 40ms. So 'b' sends 79% 840 of the time, which is reflected in its congestion bit-rate of 20kb/s 841 * 79% = 16kb/s (rounded). 843 However, during the 45% of the time that 'c' sends a large file, b's 844 completion time is higher than average (as shown in Figure 3). In 845 fact, as shown in the maximum completion time column, 'b' completes 846 in 46ms, but it starts sending a new file after 40ms, which is before 847 the previous one has completed. Therefore, during each of c's large 848 files, 'b' sends 46/40 = 116% of the time on average. 850 This actually means 'b' is overlapping two files for 16% of the time 851 on average and sending one file for the remaining 84%. Whenever two 852 file transfers overlap, 'b' will be causing 2 x 20kb/s = 40kb/s of 853 congestion, which explains why tenant b in scenario G is the only 854 case with a maximum congestion rate of 40kb/s rather than 20kb/s as 855 in every other case. Over the duration of c's large files, 'b' would 856 therefore cause congestion at an average rate of 20kb/s * 84% + 40kb/ 857 s * 16% = 23kb/s (or more simply 20kb/s * 116% = 23kb/s). Of course, 858 when 'c' is not sending a large file, 'b' will contribute less to 859 congestion, which is why its average congestion rate is 16kb/s 860 overall, as discussed earlier. 862 3.6.2. Congestion Policing of On-Off Flows 864 Still referring to the numerical examples in Table 2, we will now 865 discuss the effect of limiting each tenant with a congestion policer. 867 The network operator might have deployed congestion policers to cap 868 each tenant's average congestion rate to 16kb/s. None of the tenants 869 are exceeding this limit in any of the scenarios, but tenant 'b' is 870 just shy of it in scenario G. Therefore all the tenants would be free 871 to behave in all sorts of ways, such as those of scenarios E-G. 873 However, they would be prevented from degrading the performance of 874 the other tenants beyond the point reached by tenant 'b' in scenario 875 G. If tenant 'b' added more load, the policer would prevent the extra 876 load entering the network by focusing drop solely on tenant 'b', 877 preventing the other tenants from experiencing any more congestion 878 due to tenant 'b'. Then tenants 'a' and 'c' would be assured the 879 (average) apparent bit-rates shown, whatever the behaviour of 'b'. 881 If 'a' added more load, 'c' would not suffer. Instead 'b' would go 882 over limit and its rate would be trimmed during congestion peaks, 883 sacrificing some of its lead to 'a'. Similarly, if 'c' added more 884 load, 'b' would be made to sacrifice some of its performance, so that 885 'a' would not suffer. 
Further, if more tenants arrived to share the 886 same link, the policer would force 'b' to sacrifice performance in 887 favour of the additional tenants. 889 There is nothing special about a policer limit of 16kb/s. The example 890 when discussing infinite flows used a limit of 40kb/s per tenant. 891 And some tenants can be given higher limits than others (e.g. at an 892 additional charge). If the operator gives out congestion limits that 893 together add up to a higher amount but it doesn't increase the link 894 capacity, it merely allows the tenants to apply more load (e.g. more 895 files of the same size in the same time), but each with lower bit- 896 rate. 898 3.7. Weighted Congestion Controls 900 At high speed, congestion controls such as Cubic TCP, Data Centre 901 TCP, Compound TCP etc. all contribute to congestion at widely 902 differing rates, which is called their 'aggressiveness' or 'weight'. 903 So far, we have made the simplifying assumption of a scalable 904 congestion control algorithm that contributes to congestion at a 905 constant rate of w = 20kb/s. We now assume tenant 'c' uses a similar 906 congestion control to before, but with different parameters in the 907 algorithm so that its weight is still constant, but much lower, at w 908 = 2.2kb/s. 910 Tenant 'b' is sending smaller files than 'c', so it still uses w = 911 20kb/s. Then, when the two compete for the 1Gb/s link, they will 912 share it in proportion to their weights, 20:2.2 (or 90%:10%). That 913 is, when competing, 'b' and 'c' will respectively get (20/22.2)*1Gb/s 914 = 900Mb/s and (2.2/22.2)*1Gb/s = 100Mb/s of the 1Gb/s link. Figure 4 915 shows the situation before (upper) and after (lower) this change. 917 When the two compete, 'b' transfers each file 9/5 times as fast as before 918 (900Mb/s rather than 500Mb/s), so it completes them in 5/9 of the 919 time. 'b' still contributes congestion at the same rate of 20kb/s, 920 but for only 5/9 of the time it did before. Therefore, relative to before, 921 'b' uses up its allowance only just over half as quickly, and it 922 completes each transfer nearly twice as fast. 924 Even though 'b' completes each small file much faster, 'c' should 925 complete its file transfer hardly any later than before. Although 926 tenant 'b' goes faster, as each 'b' file finishes, it gets out of the 927 way sooner. Irrespective of its lower weight, 'c' can still use the 928 full link when 'b' is not present, because weights only determine 929 shares between active flows. Because 'c' has the whole link sooner 930 it can catch up to the same point it would have reached in its 931 download if 'b' had been slower but longer. Tenant 'c' will probably 932 lose some completion time because it has to accelerate and decelerate 933 more. But, whenever it is sending a file, 'c' gains (20kb/s - 2.2kb/s) 934 -- nearly 18kb of allowance every second -- which it can use for other 935 transfers. 937 ^ bit-rate 938 | 939 |---------------------------------------------------.--,--.--,------- 940 | |/\| |\/| 941 | c |\/| b|/\| c 942 |------. ,-----. ,-----. ,-----. ,-----. ,-----./\| |\/| ,---- 943 | b | | b | | b | | b | | b | | b |\/| |/\| | b 944 | | | | | | | | | | | |/\| |\/| | 945 +------'--'-----'--'-----'--'-----'--'-----'--'-----'--'--'--'--'----> 946 time 948 ^ bit-rate 949 | 950 |---------------------------------------------------.--,--.--,------- 951 |---. ,---. ,---. ,---. ,---. ,---. |/\| |\/| ,---.
952 | | | | | | c | | | | | | |\/| b|/\| c| | 953 | | | | | | | | | | | | |/\| |\/| | | 954 | b | | b | | b | | b | | b | | b | |\/| |/\| | b | 955 | | | | | | | | | | | | |/\| |\/| | | 956 +---'-----'---'----'---'----'---'----'---'----'---'-'--'--'--'--'---'> 957 time 959 Figure 4: Weighted congestion controls with equal weights (upper) and 960 unequal (lower) 962 It seems too good to be true that both tenants gain so much and lose 963 so little by 'c' reducing its aggressiveness. Effectively 'c' allows 964 everyone else's shorter flows to proceed as if the long flows were 965 hardly there, and 'c' hardly loses out at all itself. The gains are 966 unlikely to be as perfect as this simple model predicts, but we 967 believe they will be nearly as substantial. 969 It might seem that everyone can keep gaining by everyone agreeing to 970 reduce their weights, ad infinitum. However, the lower the weight, 971 the less signals the congestion control gets, so it starts to lose 972 its control during dynamics. Nonetheless, congestion policing should 973 encourage congestion control designs to keep reducing their weights, 974 but they will have to stop when they reach the minimum necessary 975 congestion in order to maintain sufficient control signals. 977 3.8. A Network of Links 979 So far we have only considered a single link. Congestion policing at 980 the network edge is designed to work across a network of links, 981 treating them all as a pool of resources, as we shall now explain. 982 We will use the dual-homed data centre topology shown in Figure 5 983 (stretching the bounds of ASCII art) as a simple example of a pool of 984 resources. 986 In this case where there are 48 servers (H1, H2, ... Hn where n=48) 987 on the left, with on average 8 virtual machines (VMs) running on each 988 (e.g. server n is running Vn1, Vn2, ... to Vnm where m = 8). Each 989 server is connected by two 1Gb/s links, one to each top-of-rack 990 switch S1 & S2. To the right of the switches, there are 6 links of 991 10Gb/s each, connecting onwards to customer networks or to the rest 992 of the data centre. There is a total of 48 *2 *1Gb/s = 96Gb/s 993 capacity between the 48 servers and the 2 switches, but there is only 994 6 *10Gb/s = 60Gb/s to the right of the switches. Nonetheless, data 995 centres are often designed with some level of contention like this, 996 because at the ToR switches a proportion of the traffic from certain 997 hosts turns round locally towards other hosts in the same rack. 999 virtual hosts switches 1000 machines 1002 V11 V12 V1m __/ 1003 * * ... * H1 ,-.__________+--+__/ 1004 \___\__ __\____/`-' __-|S1|____,-- 1005 `. _ ,' ,'| |_______ 1006 . H2 ,-._,`. ,' +--+ 1007 . . `-'._ `. 1008 . . `,' `. 1009 . ,' `-. `.+--+_______ 1010 Vn1 Vn2 Vnm / `-_|S2|____ 1011 * * ... * Hn ,-.__________| |__ `-- 1012 \___\__ __\____/`-' +--+ \__ 1013 \ 1015 Figure 5: Dual-Homed Topology -- a Simple Resource Pool 1017 The congestion policer proposed in this document is based on the 1018 'hose' model, where a tenant's congestion allowance can be used for 1019 sending data over any path, including many paths at once. Therefore, 1020 any one of the virtual machines on the left can use its allowance to 1021 contribute to congestion on any or all of the 6 links on the right 1022 (or any other link in the diagram actually, including those from the 1023 server to the switches and those turning back to other hosts). 
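One practical consequence of the hose model is that the only state a congestion policer needs is per tenant, at the ingress; there is no per-link or per-path state. The fragment below (Python, purely illustrative; the names are invented) shows the accounting step applied to every ConEx-marked packet, regardless of which downstream link the congestion occurred on:

   from collections import defaultdict

   marked_bits = defaultdict(int)    # tenant -> ConEx-marked bits seen at ingress

   def account(tenant_id, dst, size_bits, conex_marked):
       # the destination, and hence the downstream path, plays no part in
       # the accounting; only the identity of the sending tenant matters
       if conex_marked:
           marked_bits[tenant_id] += size_bits

In contrast, a per-link scheduler such as WRR would need weight state configured at every link of Figure 5 for every tenant that might use it.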
1025 Nonetheless, if congestion policers are to enforce performance 1026 isolation, they should stop one tenant squeezing the capacity 1027 available to another tenant who needs to use a particular bottleneck 1028 link or links (e.g. to reach a particular receiver). Policing should 1029 work whether the offending tenant is acting deliberately or merely 1030 carelessly. 1032 3.8.1. Numerical Example: Isolation from Focused Load 1034 Consider the following scenario: two tenants have equal congestion 1035 allowances of 40kb/s, and let's imagine they each run servers that 1036 send about 1 flow per second. All their flows use the same scalable 1037 congestion control and all use the same aggressiveness or weight of 1038 20kb/s. 1040 Tenant A has clients all over the network, so it spreads its traffic 1041 over numerous links, and its flows usually complete within 1 second. 1042 Therefore, on average, A has one flow running at any one instant, so 1043 it consumes 20kb/s from its 40kb/s congestion allowance. 1045 Then tenant B decides (carelessly or deliberately) to concentrate all 1046 its flows into one of the links that A sometimes uses for one of its 1047 flows. All the flows through this new bottleneck (A's, B's and those 1048 from other tenants) will still each consume 20kb/s of allowance while 1049 they are active (Section 3.6), but they will all take longer to 1050 complete. Let's say they take 15 times longer to complete on average 1051 (i.e. 15s not 1s). And let's assume 5% of tenant A's flows pass 1052 through this link. Then each tenant will consume congestion 1053 allowance at: 1055 o Tenant A: (95% * 1 * 20kb/s) + (5% * 15 * 20kb/s) = 34kb/s 1057 o Tenant B: (100% * 15 * 20kb/s) = 300kb/s 1059 Tenant B's congestion token bucket will therefore drain 7.5 times 1060 faster than it is filling (300 / 40 = 7.5) so it will rapidly empty 1061 and start dropping some of tenant B's traffic before it enters the 1062 network. In fact, once tenant B's bucket is empty, it will be 1063 limited to no more than 40kb/s of congestion at the bottleneck link, 1064 rather than 300kb/s otherwise. In effect, the policer shifts packet 1065 drops in excess of the 40kb/s allowance out of the shared link and 1066 into itself, to focus excess discards only on tenant B. 1068 Once tenant B's bucket empties, tenant A's completion time for the 1069 flows through this bottleneck would not return to 1s, but it would be 1070 limited to about 2s (not 15s), and it would consume its congestion 1071 allowance at about 21kb/s, well within its limit of 40kb/s. 1073 3.8.2. Isolation in the Short-Term 1075 It seems problematic that tenant B's congestion policer takes time to 1076 empty before it kicks in, during which time other tenants suffer 1077 reduced performance. This is deliberate--to allow some give and take 1078 over time. Nonetheless, it is possible to limit this problem as 1079 well, using the dual token bucket already described in Figure 2. The 1080 second shallow bucket sets an immediate limit on how much tenant B 1081 can harm the performance of others. The fill-rate of the shallow 1082 bucket is c times that of the deeper bucket (let's say c = 8), so 1083 tenant B is immediately limited to 8 * 40kb/s = 320kb/s of 1084 congestion. In the scenario described above, tenant B only reaches 1085 300kb/s but at least this backstop prevents congestion from tenant B 1086 ever exceeding 320kb/s.
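The numbers in the last two subsections can be re-derived in a few lines; every constant below is taken from the text above (Python, illustrative only):

   w = 20e3             # congestion-bit-rate of one active flow, 20kb/s
   allowance = 40e3     # each tenant's contracted fill-rate, 40kb/s

   # Tenant A: 95% of its flows still complete in 1s; the 5% through the
   # hot link now take 15 times longer
   tenant_A = 0.95 * 1 * w + 0.05 * 15 * w     # = 34kb/s
   # Tenant B: all of its flows go through the hot link and take 15x longer
   tenant_B = 1.00 * 15 * w                    # = 300kb/s

   drain_ratio = tenant_B / allowance          # = 7.5, so B's deep bucket empties
   short_term_cap = 8 * allowance              # shallow bucket with c = 8: 320kb/s
   print(tenant_A, tenant_B, drain_ratio, short_term_cap)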
1088 3.8.3. Encouraging Load Balancing

1090 A policer may not even have to intervene directly for tenants to be 1091 protected; it might encourage load balancing to remove the problem 1092 first. Load balancing might either be provided by the network 1093 (usually just random), or some of the tenants might themselves 1094 actively shift traffic off the increasingly congested bottleneck and 1095 onto other paths. Some of them might be using the Multipath TCP 1096 protocol (MPTCP -- see the experimental [RFC6356]), which would achieve 1097 this automatically; or ultimately, if the congestion signals 1098 persisted, automatic VM placement algorithms might shift their 1099 virtual machines to different endpoints to circumvent the congestion 1100 hot-spot completely. Even if some tenants were not using MPTCP or 1101 could not shift easily, others shifting away would achieve the same 1102 outcome. Essentially, the deterrent effect of congestion policers 1103 encourages everyone to even out congestion across the network, 1104 shifting load away from hot spots. Then performance isolation 1105 becomes an emergent property of everyone's behaviour, due to the 1106 deterrent effect of policers, rather than always through explicit 1107 policer intervention.

1109 In such an environment, a policer is needed that shares out the whole 1110 pool of network resources, not one that just controls shares of each 1111 link individually.

1113 In contrast, enforcement mechanisms based on scheduling algorithms 1114 like WRR or WFQ have to be deployed at each link, and each one works 1115 in ignorance of how much of other links the tenant is using. These 1116 algorithms were designed for networks with a single known bottleneck 1117 per customer (e.g. many access networks). However, in data centres 1118 there are many potential bottlenecks and each tenant generally only 1119 uses a share of a small number of them. A mechanism like WRR would 1120 not isolate anyone's performance if it gave every tenant the right to 1121 use the same share of all the links in the network, without regard to 1122 how many they were using.

1124 3.9. Tenants of Different Sizes

1126 The scenario in Section 3.8 assumed tenants A & B had equal 1127 congestion allowances. If one tenant bought a huge allowance (e.g. 1128 500kb/s) while another tenant bought a small allowance (e.g. 10kb/s), 1129 it would seem that the former could focus all its traffic on one link 1130 to swamp the latter.

1132 At this point, it needs to be pointed out that congestion allowances 1133 do not remove the need for decent capacity planning. If we further 1134 assume that tenant A has bought the same access capacity as tenant B, 1135 50 times more congestion allowance implies that tenant A expects to be 1136 provided with 50 times the contention ratio of tenant B, i.e. any 1137 single link they share will be big enough for B still to have enough 1138 if it gets 1 share for every 50 used by A.

1140 However, there is still a problem in that A can store up congestion 1141 allowance and cause much more than 500kb/s of congestion in one 1142 burst. The dual token bucket arrangement (Figure 2) limits tenant A 1143 to c * 500kb/s of congestion, but if c=8 as before, 8 * 500kb/s = 1144 4Mb/s of congestion in a burst would seriously swamp tenant B if all 1145 focused on one link where tenant B was quietly consuming its 10kb/s 1146 allowance.

1148 The answer to this problem is that the burst limit c can be smaller 1149 for larger tenants, tending to 1 for the largest tenants (i.e. a 1150 single shallow bucket would suffice for huge tenants). The reasoning 1151 is that the c factor allows a tenant some give-and-take over time, 1152 but the aggregate of traffic from a larger tenant consists of enough 1153 flows that it has sufficient give-and-take within its own aggregate 1154 without needing give-and-take from others.
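The effect of the burst factor on these two tenants can be seen from a trivial calculation (illustrative only; the figures are those of the example above):

   # Burst exposure in Section 3.9's example.
   small_fill = 10e3    # small tenant's allowance [b/s]
   large_fill = 500e3   # large tenant's allowance [b/s]

   print(8 * large_fill)  # 4000000.0 b/s: a 4Mb/s burst if the large
                          # tenant keeps the c=8 factor of Section 3.8.2
   print(1 * large_fill)  # 500000.0 b/s: bounded by its average rate if
                          # its c factor tends to 1 (single shallow bucket)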
1156 3.10. Links of Different Sizes

1158 Congestion policing treats each Mb/s of capacity in one link as 1159 identical to each Mb/s of capacity in another link, even if the size of 1160 each link is different. For instance, consider the case where one of 1161 the three links to the right of each switch in Figure 5 were upgraded 1162 to 40Gb/s while the other two remained at 10Gb/s (perhaps to 1163 accommodate the extra traffic from a couple of the dual-homed 1Gb/s 1164 servers being upgraded to dual-homed 10Gb/s).

1166 A congestion control algorithm running at the same rate will cause 1167 the same level of congestion probability, whatever size of link it is 1168 sharing. For example:

1170 o If 50 equal flows share a 10Gb/s link (10Gb/s / 50 = 200Mb/s each) 1171 they will cause 0.01% congestion probability;

1173 o If 200 of the same flows share a 40Gb/s link (40Gb/s / 200 = 1174 200Mb/s each) they will still cause 0.01% congestion probability.

1176 This is because the congestion probability is determined by the 1177 congestion control algorithms, not by the link.

1179 Therefore, if an average of 300 flows were spread across the above 1180 links (1 x 40Gb/s and 2 x 10Gb/s), the numbers on each link would tend 1181 towards 200:50:50 respectively, so that each flow would get 200Mb/s 1182 and each link would have 0.01% congestion on it.

1184 Sometimes, there might be more flows on one link, resulting in less 1185 than 200Mb/s per flow and congestion higher than 0.01%. However, 1186 whenever the congestion level was greater on one link than another, 1187 congestion policing would encourage flows to balance out the 1188 congestion level across the links (as long as some flows could use 1189 congestion balancing mechanisms like MPTCP).

1191 In summary, all the outcomes of congestion policing described so far 1192 (emulating WRR, etc.) apply across a pool of diverse link sizes just 1193 as much as they apply to single links, without any need for the links 1194 to signal to each other or to a central controller.

1196 3.11. Diverse Congestion Control Algorithms

1198 Throughout this explanation we have assumed a scalable congestion 1199 control algorithm, which we justified (Section 3.4) as the 'boundary' 1200 case if congestion policing were to intervene, which is all that is 1201 relevant when considering whether the policer can enforce performance 1202 isolation.

1204 The defining difference between the scalable congestion control we have 1205 assumed and the congestion controls in widespread production 1206 operating systems (New Reno, Compound, Cubic, Data Centre TCP, etc.) is 1207 the way congestion probability decreases as flow-rate increases (for 1208 a long-running flow). With a scalable congestion control, if flow- 1209 rate doubles, congestion probability halves. Whereas, with most 1210 production congestion controls, if flow-rate doubles, congestion 1211 probability reduces to less than half. For instance, New Reno TCP 1212 reduces congestion to a quarter. The responses of Cubic and Compound 1213 are closer to the ideal scalable control than New Reno's, but they 1214 do not depart so far from New Reno that they cannot co-exist 1215 happily with it.
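This difference can be checked with a small illustrative calculation, using the steady-state models quoted later in this section -- rate x = v/p for a scalable control of weight v, and x = k/sqrt(p) for New Reno -- with purely hypothetical parameter values:

   # How congestion probability p scales with flow-rate x for a single
   # long-running flow (illustrative; parameter values are hypothetical).
   def p_scalable(x, v):          # scalable control: x = v/p  ->  p = v/x
       return v / x

   def p_new_reno(x, k):          # New Reno: x = k/sqrt(p)  ->  p = (k/x)^2
       return (k / x) ** 2

   v, k, x = 20e3, 1.5e6, 100e6   # weight, Reno constant, flow-rate [b/s]
   print(p_scalable(2*x, v) / p_scalable(x, v))   # 0.5: halves
   print(p_new_reno(2*x, k) / p_new_reno(x, k))   # 0.25: reduces to a quarter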
1217 Congestion policing still works, whether or not the congestion 1218 controls in daily use by tenants fit the scalable model. A bulk 1219 congestion policer constrains the sum of all the congestion controls 1220 being used by a tenant so that they collectively remain below an 1221 aggregate envelope that is itself shaped like the sum of many 1222 scalable algorithms. A bulk congestion policer will constrain the 1223 overall congestion effect (the sum) of any mix of algorithms within 1224 this envelope, including flows that are completely unresponsive to congestion.

1226 This will now be explained using Figure 6 (if the ASCII art is 1227 incomprehensible, see a similar plot at Figure 3 of [CongPol]).

1229 bit-rate[b/s] 1230 / \ 1231 | * / / / / / / / / / / / / / / / / / 1232 | +New Reno * Policed / / / / / / / / / / / / / 1233 | + TCP * / / / / / / / / / / / / / / / / 1234 | + * / / / / / / / / / / / / / / / 1235 | - - - - - -+- - - - - - - - -*- - - - - - - - - - - - - - - 1236 | + * / / / / / unresponsive/ / 1237 | + * / / / / / / / / / / / 1238 | + * / / / / / / / / / / 1239 | +* / / / / / / / 1240 | * / / /+/ / 1241 | * / / 1242 |____________________________________________________________\ 1243 0 ' 0.2%' ' 0.4%' 'congestion/

1245 Figure 6: Schematic Plot of Bulk Policing of Traffic with Different 1246 Congestion Responses

1248 The congestion policer prevents the aggregate of a tenant's traffic 1249 from operating in the hatched region shown in Figure 6 (there is 1250 deliberately no vertical scale, because the plot is schematic only). 1251 The tenant can have either high congestion or high rate, but not both 1252 at the same time. It can be seen that the single New Reno TCP flow 1253 responds to congestion in a similar way, but with a flatter slope. 1254 It can also be seen that, at 0.25% prevailing congestion, this 1255 policer would allow roughly two simultaneous New Reno TCP flows 1256 without intervening. At about 0.45% congestion it would allow only 1257 one, and above 0.45% it would start to police even one flow. But at 1258 the much lower congestion levels typical in a high-capacity data centre 1259 (e.g. 0.01%) the policer would allow numerous New Reno TCP flows 1260 (extrapolating the curve upwards).

1262 The horizontal plot represents an unresponsive (UDP) flow. The 1263 congestion policer would allow one of these flows to run unimpeded up 1264 to roughly 0.3% congestion, when the policer would effectively take 1265 control and impose its own congestion response on this flow, dropping 1266 more UDP packets for higher levels of prevailing congestion. If a 1267 tenant ran two of these UDP flows (or one at twice the rate), the 1268 policer would leave them alone unless prevailing congestion was above 1269 about 0.22%, at which point it would intervene (by extrapolation of the visible 1270 plots).

1272 The New Reno TCP curve follows the characteristic k/sqrt(p) rule, 1273 while the 'Policed' curve follows a simple w/p rule, where k is the 1274 New Reno constant and w is the constant contracted fill-rate of a 1275 particular policer. A scalable TCP (not shown) would follow a v/p 1276 rule where v is the weight of the individual flow. If we imagine a 1277 scenario where w = 8v, the policer would allow 8 of these scalable 1278 TCP flows (averaged over time), and this would be true at any 1279 congestion level, because the curve of 8 scalable TCP flows would 1280 exactly match the 'Policed' curve.
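The envelope can also be expressed as a simple test: the tenant's aggregate congestion-bit-rate (each flow's bit-rate multiplied by the prevailing congestion probability, summed over all flows) must stay within the contracted fill-rate w. The sketch below is illustrative only, with hypothetical figures; it also confirms the w = 8v observation above:

   # Illustrative sketch of the aggregate envelope in Figure 6: a tenant
   # stays unpoliced while the sum of (flow bit-rate x prevailing
   # congestion probability p) stays within the contracted fill-rate w.
   def congestion_bit_rate(flow_rates_bps, p):
       return sum(x * p for x in flow_rates_bps)

   w = 160e3                        # hypothetical contracted fill-rate [b/s]
   v = w / 8                        # weight of each scalable flow if w = 8v
   for p in (0.0001, 0.001, 0.01):  # 0.01%, 0.1% and 1% congestion
       flows = [v / p] * 8          # 8 scalable flows, each at rate x = v/p
       print(p, congestion_bit_rate(flows, p))  # ~160000 b/s = w at every
                                                # congestion level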
1282 So far, the discussion has focused on the congestion avoidance phase 1283 of each congestion response. As each TCP flow starts, it typically 1284 increases exponentially (TCP slow-start) until it overshoots 1285 available capacity, causing a spike of loss due to the feedback delay 1286 over the following round-trip. The ConEx audit mechanism requires 1287 the source to send credit markings in advance to cover the risk of 1288 these spikes of loss. Credit-marked ConEx packets drain tokens from 1289 the congestion policer similarly to regular ConEx-marked packets, 1290 which reduces the available congestion allowance for other flows. If 1291 the tenant starts new flows fast enough, these credit markings alone 1292 will empty the token bucket, so that the policer starts to discard 1293 some packets from the start-up phase before they can enter the 1294 network. (The tunnel feedback method for signalling congestion back 1295 to the policer lacks such a credit mechanism, which is an important 1296 advantage of the ConEx protocol, given Internet traffic generally 1297 consists of a high proportion of short flows.)

1299 A tenant could run a mix of multiple different types of TCP and UDP 1300 flows, as long as, when all stacked up, the sum of them all still 1301 fitted under the 'Policed' curve at the prevailing level of 1302 congestion.

1304 4. Parameter Setting

1306 The primary parameter to set for each tenant is the contracted fill- 1307 rate of their congestion token bucket. For a server sending 1308 predominantly long-running flows (e.g. a video server or an indexing 1309 robot filling search databases), this fill-rate can be determined 1310 from the required number of simultaneous TCP flows and the typical 1311 congestion-bit-rate of one TCP flow.

1313 For a scalable TCP flow, the congestion-bit-rate is constant. But 1314 currently (2013) TCP Cubic is predominant, and the congestion-bit- 1315 rate of one Cubic flow is not constant, but depends somewhat on bit- 1316 rate and RTT. For example, a Cubic flow with 100Mb/s throughput and 1317 50ms RTT has a congestion-bit-rate of ~2kb/s, whereas at 200Mb/s and 1318 5ms RTT, its congestion-bit-rate is ~6kb/s. Therefore, a reasonable 1319 rule of thumb for Cubic's congestion-bit-rate at rates and RTTs of 1320 about this order is 5kb/s.

1322 Therefore, if a tenant wanted to serve no more than 12 1323 simultaneous TCP Cubic flows on average, their contracted congestion-rate should 1324 be roughly 12 * 5kb/s = 60kb/s.

1326 The contracted fill-rate of a congestion policer should not need to 1327 change as flow-rates and link capacities scale, assuming the tenant 1328 still utilises the higher capacity to the same extent.

1330 A congestion policer based on the dual token bucket in Figure 2 also 1331 needs to specify the c factor that determines the maximum congestion- 1332 rate allowed, rather than the average. A theoretical approach to 1333 setting this factor is being worked on but, in the meantime, a 1334 figure of about 10 seems reasonable for a tenant sending only a few 1335 simultaneous flows (see Section 3.8.2), while Section 3.9 explains 1336 that the c factor should tend to reduce to 1 for very large 1337 tenants.
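As a worked illustration of the above rules of thumb (all figures are taken from this section; nothing here is normative):

   # Sizing a small tenant's congestion policer from the rules of thumb.
   CUBIC_CONGESTION_BIT_RATE = 5e3  # ~5kb/s per long-running Cubic flow
   simultaneous_flows = 12          # tenant's average concurrent flows
   c = 10                           # suggested burst factor for a small tenant

   deep_fill_rate = simultaneous_flows * CUBIC_CONGESTION_BIT_RATE
   print(deep_fill_rate)            # 60000.0 b/s: contract roughly 60kb/s
   print(c * deep_fill_rate)        # 600000.0 b/s: the shallow bucket's
                                    # fill-rate, capping the instantaneous
                                    # congestion-rate (Section 3.8.2)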
1339 The dual token bucket makes isolation fairly insensitive to the depth 1340 of the deeper bucket, because the shallower bucket constrains the 1341 congestion burst rate, which makes the congestion burst size less 1342 important. It is believed that the deeper bucket depth could be 1343 minutes or even hours of token fill. The depth of the shallower 1344 bucket should be just a few MTUs' worth--more experimentation is needed to 1345 determine how few.

1347 Indeed, all the above recommendations on parameter setting are best 1348 considered as initial values to be used in experiments to determine 1349 whether other values would be better.

1351 No recommendations can yet be given for parameter settings for a 1352 tenant running a large proportion of short flows. This requires 1353 further experimentation.

1355 5. Security Considerations

1357 This document is about performance isolation; therefore the whole 1358 document concerns security. Where ConEx is used, the specification 1359 of the ConEx Abstract Mechanism [I-D.ietf-conex-abstract-mech] should 1360 be consulted for security matters, particularly the design of audit 1361 to assure the integrity of ConEx signals.

1363 6. IANA Considerations

1365 {Section to be removed by the RFC Editor}. This document does not 1366 require actions by IANA.

1368 7. Conclusions

1370 {ToDo}

1372 8. Acknowledgments

1374 Bob Briscoe is part-funded by the European Community under its 1375 Seventh Framework Programme through the Trilogy 2 project (ICT- 1376 317756). The views expressed here are solely those of the author.

1378 9. Informative References

1380 [CPolTrilogyExp] 1381 Raiciu, C., Ed., "Progress on resource control", Trilogy 1382 EU 7th Framework Project ICT-216372 Deliverable 9, 1383 December 2009, .

1386 [CongPol] Jacquet, A., Briscoe, B., and T. Moncaster, "Policing 1387 Freedom to Use the Internet Resource Pool", Proc ACM 1388 Workshop on Re-Architecting the Internet (ReArch'08), 1389 December 2008, 1390 .

1392 [DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, 1393 P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data 1394 Center TCP (DCTCP)", ACM SIGCOMM CCR 40(4):63--74, October 1395 2010, .

1397 [I-D.briscoe-conex-data-centre] 1398 Briscoe, B. and M. Sridharan, "Network Performance 1399 Isolation in Data Centres using Congestion Policing", 1400 draft-briscoe-conex-data-centre-01 (work in progress), 1401 February 2013.

1403 [I-D.briscoe-conex-initial-deploy] 1404 Briscoe, B., "Initial Congestion Exposure (ConEx) 1405 Deployment Examples", draft-briscoe-conex-initial- 1406 deploy-03 (work in progress), July 2012.

1408 [I-D.ietf-conex-abstract-mech] 1409 Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1410 Concepts and Abstract Mechanism", draft-ietf-conex- 1411 abstract-mech-08 (work in progress), October 2013.

1413 [I-D.ietf-conex-mobile] 1414 Kutscher, D., Mir, F., Winter, R., Krishnan, S., Zhang, 1415 Y., and C. Bernardos, "Mobile Communication Congestion 1416 Exposure Scenario", draft-ietf-conex-mobile-03 (work in 1417 progress), February 2014.

1419 [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", 1420 RFC 3649, December 2003.

1422 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 1423 Control", RFC 5681, September 2009.

1425 [RFC6356] Raiciu, C., Handley, M., and D. Wischik, "Coupled 1426 Congestion Control for Multipath Transport Protocols", RFC 1427 6356, October 2011.

1429 [Relentless] 1430 Mathis, M., "Relentless Congestion Control", Proc. Int'l 1431 Wkshp on Protocols for Future, Large-scale & Diverse 1432 Network Transports (PFLDNeT'09), May 2009, 1433 .

1435 [STCP] Kelly, T., "Scalable TCP: Improving Performance in 1436 Highspeed Wide Area Networks", ACM SIGCOMM Computer 1437 Communication Review 32(2), April 2003, 1438 .
1440 [conex-dc_tr] 1441 Briscoe, B., "Network Performance Isolation in Data Centres 1442 by Congestion Exposure to Edge Policers", BT Technical 1443 Report TR-DES8-2011-004, November 2011. 1445 To appear.

1447 Appendix A. Summary of Changes between Drafts

1449 From briscoe-00 to briscoe-01: No technical differences; merely 1450 clarified throughout & updated references.

1452 From briscoe-conex-data-centre-00 to briscoe-conex-policing-00: This 1453 draft was separated out from the -00 version of 1454 [I-D.briscoe-conex-data-centre], taken mostly from what was 1455 section 4. Changes from that draft are listed below:

1457 From section 4 of draft-briscoe-conex-data-centre-00:

1459 + New Introduction

1461 + Added Section 2. "Example Bulk Congestion Policer"

1463 + Rearranged and clarified the previous introductory text that 1464 preceded what is now Section 3.4 into three subsections: 1466 3.1 "The Problem" 1468 3.2 "Approach" 1470 3.3 "Terminology"

1472 + Corrected numerical examples: 1474 - Section 3.4 "Long-Running Flows": 0.04% to 0.004% and 1475 400kb/s to 40kb/s 1477 - Section 3.6.1 "Numerical Examples Without Policing": 1478 10kb/s to 20kb/s 1480 - Section 3.7 "Weighted Congestion Controls": 22.2 to 20

1482 + Clarified where necessary, especially the descriptions of 1483 the scenarios in 3.7 "Weighted Congestion Controls" and 3.8 1484 "A Network of Links".

1486 + Section 3.8 "A Network of Links": added a numerical example 1487 including congestion burst limiting that was not covered at 1488 all previously

1490 + Added Section 3.9 "Tenants of Different Sizes"

1492 + Filled in absent text in: 1494 - Section 3.11 "Diverse Congestion Control Algorithms" 1495 - Section 4. "Parameter Setting" 1497 - Section 5. "Security Considerations"

1499 + Minor corrections throughout and updated references.

1501 Author's Address

1503 Bob Briscoe 1504 BT 1505 B54/77, Adastral Park 1506 Martlesham Heath 1507 Ipswich IP5 3RE 1508 UK

1510 Phone: +44 1473 645196 1511 EMail: bob.briscoe@bt.com 1512 URI: http://bobbriscoe.net/